2021-07-11

  • cs.CV updates on arXiv.org

    Beyond BatchNorm: Towards a General Understanding of Normalization in Deep Learning. (arXiv:2106.05956v2 [cs.LG] UPDATED)
    (2 min) Inspired by BatchNorm, there has been an explosion of normalization layers for deep neural networks (DNNs). However, these alternative normalization layers have seen minimal use, partially due to a lack of guiding principles that can help identify when these layers can serve as a replacement for BatchNorm. To address this problem, we take a theoretical approach, generalizing the known beneficial mechanisms of BatchNorm to several recently proposed normalization techniques. Our generalized theory leads to the following set of principles: (i) similar to BatchNorm, activations-based normalization layers can prevent exponential growth of activations in ResNets, but parametric layers require explicit remedies; (ii) use of GroupNorm can ensure informative forward propagation, with different samples being assigned dissimilar activations, but increasing group size results in increasingly indistinguishable activations for different samples, explaining slow convergence speed in models with LayerNorm; (iii) small group sizes result in large gradient norm in earlier layers, hence explaining training instability issues in Instance Normalization and illustrating a speed-stability tradeoff in GroupNorm. Overall, our analysis reveals a unified set of mechanisms that underpin the success of normalization methods in deep learning, providing us with a compass to systematically explore the vast design space of DNN normalization layers.
  • cs.LG updates on arXiv.org

    Beyond BatchNorm: Towards a General Understanding of Normalization in Deep Learning. (arXiv:2106.05956v2 [cs.LG] UPDATED)
    (2 min) Inspired by BatchNorm, there has been an explosion of normalization layers for deep neural networks (DNNs). However, these alternative normalization layers have seen minimal use, partially due to a lack of guiding principles that can help identify when these layers can serve as a replacement for BatchNorm. To address this problem, we take a theoretical approach, generalizing the known beneficial mechanisms of BatchNorm to several recently proposed normalization techniques. Our generalized theory leads to the following set of principles: (i) similar to BatchNorm, activations-based normalization layers can prevent exponential growth of activations in ResNets, but parametric layers require explicit remedies; (ii) use of GroupNorm can ensure informative forward propagation, with different samples being assigned dissimilar activations, but increasing group size results in increasingly indistinguishable activations for different samples, explaining slow convergence speed in models with LayerNorm; (iii) small group sizes result in large gradient norm in earlier layers, hence explaining training instability issues in Instance Normalization and illustrating a speed-stability tradeoff in GroupNorm. Overall, our analysis reveals a unified set of mechanisms that underpin the success of normalization methods in deep learning, providing us with a compass to systematically explore the vast design space of DNN normalization layers.
    Synthetic Data -- Anonymisation Groundhog Day. (arXiv:2011.07018v4 [cs.LG] UPDATED)
    (2 min) Synthetic data has been advertised as a silver-bullet solution to privacy-preserving data publishing that addresses the shortcomings of traditional anonymisation techniques. The promise is that synthetic data drawn from generative models preserves the statistical properties of the original dataset but, at the same time, provides perfect protection against privacy attacks. In this work, we present the first quantitative evaluation of the privacy gain of synthetic data publishing and compare it to that of previous anonymisation techniques. Our evaluation of a wide range of state-of-the-art generative models demonstrates that synthetic data either does not prevent inference attacks or does not retain data utility. In other words, we empirically show that synthetic data suffers from the same limitations as traditional anonymisation techniques. Furthermore, we find that, in contrast to traditional anonymisation, the privacy-utility tradeoff of synthetic data publishing is hard to predict. Because it is impossible to predict what signals a synthetic dataset will preserve and what information will be lost, synthetic data leads to a highly variable privacy gain and unpredictable utility loss. In summary, we find that synthetic data is far from the holy grail of privacy-preserving data publishing.

2021-07-09

  • cs.CL updates on arXiv.org

    The value of text for small business default prediction: A deep learning approach. (arXiv:2003.08964v4 [cs.LG] UPDATED)
    (2 min) Compared to consumer lending, Micro, Small and Medium Enterprise (mSME) credit risk modelling is particularly challenging, as, often, the same sources of information are not available. Therefore, it is standard policy for a loan officer to provide a textual loan assessment to mitigate limited data availability. In turn, this statement is analysed by a credit expert alongside any available standard credit data. In our paper, we exploit recent advances from the field of Deep Learning and Natural Language Processing (NLP), including the BERT (Bidirectional Encoder Representations from Transformers) model, to extract information from 60 000 textual assessments provided by a lender. We consider the performance in terms of the AUC (Area Under the receiver operating characteristic Curve) and Brier Score metrics and find that the text alone is surprisingly effective for predicting default. However, when combined with traditional data, it yields no additional predictive capability, with performance dependent on the text's length. Our proposed deep learning model does, however, appear to be robust to the quality of the text and therefore suitable for partly automating the mSME lending process. We also demonstrate how the content of loan assessments influences performance, leading us to a series of recommendations on a new strategy for collecting future mSME loan assessments.
    Language Conditioned Imitation Learning over Unstructured Data. (arXiv:2005.07648v2 [cs.RO] UPDATED)
    (2 min) Natural language is perhaps the most flexible and intuitive way for humans to communicate tasks to a robot. Prior work in imitation learning typically requires each task be specified with a task id or goal image -- something that is often impractical in open-world environments. On the other hand, previous approaches in instruction following allow agent behavior to be guided by language, but typically assume structure in the observations, actuators, or language that limit their applicability to complex settings like robotics. In this work, we present a method for incorporating free-form natural language conditioning into imitation learning. Our approach learns perception from pixels, natural language understanding, and multitask continuous control end-to-end as a single neural network. Unlike prior work in imitation learning, our method is able to incorporate unlabeled and unstructured demonstration data (i.e. no task or language labels). We show this dramatically improves language conditioned performance, while reducing the cost of language annotation to less than 1% of total data. At test time, a single language conditioned visuomotor policy trained with our method can perform a wide variety of robotic manipulation skills in a 3D environment, specified only with natural language descriptions of each task (e.g. "open the drawer...now pick up the block...now press the green button..."). To scale up the number of instructions an agent can follow, we propose combining text conditioned policies with large pretrained neural language models. We find this allows a policy to be robust to many out-of-distribution synonym instructions, without requiring new demonstrations. See videos of a human typing live text commands to our agent at language-play.github.io
    Decoding Methods for Neural Narrative Generation. (arXiv:2010.07375v2 [cs.CL] UPDATED)
    (2 min) Narrative generation is an open-ended NLP task in which a model generates a story given a prompt. The task is similar to neural response generation for chatbots; however, innovations in response generation are often not applied to narrative generation, despite the similarity between these tasks. We aim to bridge this gap by applying and evaluating advances in decoding methods for neural response generation to neural narrative generation. In particular, we employ GPT-2 and perform ablations across nucleus sampling thresholds and diverse decoding hyperparameters -- specifically, maximum mutual information -- analyzing results over multiple criteria with automatic and human evaluation. We find that (1) nucleus sampling is generally best with thresholds between 0.7 and 0.9; (2) a maximum mutual information objective can improve the quality of generated stories; and (3) established automatic metrics do not correlate well with human judgments of narrative quality on any qualitative metric.
    Sparse and Structured Visual Attention. (arXiv:2002.05556v2 [cs.CL] UPDATED)
    (2 min) Visual attention mechanisms are widely used in multimodal tasks, as visual question answering (VQA). One drawback of softmax-based attention mechanisms is that they assign some probability mass to all image regions, regardless of their adjacency structure and of their relevance to the text. In this paper, to better link the image structure with the text, we replace the traditional softmax attention mechanism with two alternative sparsity-promoting transformations: sparsemax, which is able to select only the relevant regions (assigning zero weight to the rest), and a newly proposed Total-Variation Sparse Attention (TVmax), which further encourages the joint selection of adjacent spatial locations. Experiments in VQA show gains in accuracy as well as higher similarity to human attention, which suggests better interpretability.
    Cross-Lingual Transfer in Zero-Shot Cross-Language Entity Linking. (arXiv:2010.09828v2 [cs.CL] UPDATED)
    (2 min) Cross-language entity linking grounds mentions in multiple languages to a single-language knowledge base. We propose a neural ranking architecture for this task that uses multilingual BERT representations of the mention and the context in a neural network. We find that the multilingual ability of BERT leads to robust performance in monolingual and multilingual settings. Furthermore, we explore zero-shot language transfer and find surprisingly robust performance. We investigate the zero-shot degradation and find that it can be partially mitigated by a proposed auxiliary training objective, but that the remaining error can best be attributed to domain shift rather than language transfer.
    Transformer-based Conditional Variational Autoencoder for Controllable Story Generation. (arXiv:2101.00828v2 [cs.CL] UPDATED)
    (2 min) We investigate large-scale latent variable models (LVMs) for neural story generation -- an under-explored application for open-domain long text -- with objectives in two threads: generation effectiveness and controllability. LVMs, especially the variational autoencoder (VAE), have achieved both effective and controllable generation through exploiting flexible distributional latent representations. Recently, Transformers and its variants have achieved remarkable effectiveness without explicit latent representation learning, thus lack satisfying controllability in generation. In this paper, we advocate to revive latent variable modeling, essentially the power of representation learning, in the era of Transformers to enhance controllability without hurting state-of-the-art generation effectiveness. Specifically, we integrate latent representation vectors with a Transformer-based pre-trained architecture to build conditional variational autoencoder (CVAE). Model components such as encoder, decoder and the variational posterior are all built on top of pre-trained language models -- GPT2 specifically in this paper. Experiments demonstrate state-of-the-art conditional generation ability of our model, as well as its excellent representation learning capability and controllability.
    Vector Space Morphology with Linear Discriminative Learning. (arXiv:2107.03950v1 [cs.CL])
    (2 min) This paper presents three case studies of modeling aspects of lexical processing with Linear Discriminative Learning (LDL), the computational engine of the Discriminative Lexicon model (Baayen et al., 2019). With numeric representations of word forms and meanings, LDL learns to map one vector space onto the other, without being informed about any morphological structure or inflectional classes. The modeling results demonstrated that LDL not only performs well for understanding and producing morphologically complex words, but also generates quantitative measures that are predictive for human behavioral data. LDL models are straightforward to implement with the JudiLing package (Luo et al., 2021). Worked examples are provided for three modeling challenges: producing and understanding Korean verb inflection, predicting primed Dutch lexical decision latencies, and predicting the acoustic duration of Mandarin words.
    Meeting the SDGs : Enabling the Goals by Cooperation with Crowd using a Conversational AI Platform. (arXiv:2107.04011v1 [cs.CY])
    (2 min) In this paper, we report about a large-scale online discussion with 1099 citizens on the Afghanistan Sustainable Development Goals.
    Inspiration through Observation: Demonstrating the Influence of Automatically Generated Text on Creative Writing. (arXiv:2107.04007v1 [cs.CL])
    (2 min) Getting machines to generate text perceived as creative is a long-pursued goal. A growing body of research directs this goal towards augmenting the creative writing abilities of human authors. In this paper, we pursue this objective by analyzing how observing examples of automatically generated text influences writing. In particular, we examine a task referred to as sentence infilling, which involves transforming a list of words into a complete sentence. We emphasize "storiability" as a desirable feature of the resulting sentences, where "storiable" sentences are those that suggest a story a reader would be curious to hear about. Both humans and an automated system (based on a neural language model) performed this sentence infilling task. In one setting, people wrote sentences on their own; in a different setting, people observed the sentences produced by the model while writing their own sentences. Readers then assigned storiability preferences to the resulting sentences in a subsequent evaluation. We find that human-authored sentences were judged as more storiable when authors observed the generated examples, and that storiability increased as authors derived more semantic content from the examples. This result gives evidence of an "inspiration through observation" paradigm for human-computer collaborative writing, through which human writing can be enhanced by text generation models without directly copying their output.
    Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling. (arXiv:2107.03451v1 [cs.CL])
    (2 min) Over the last several years, end-to-end neural conversational agents have vastly improved in their ability to carry a chit-chat conversation with humans. However, these models are often trained on large datasets from the internet, and as a result, may learn undesirable behaviors from this data, such as toxic or otherwise harmful language. Researchers must thus wrestle with the issue of how and when to release these models. In this paper, we survey the problem landscape for safety for end-to-end conversational AI and discuss recent and related work. We highlight tensions between values, potential positive impact and potential harms, and provide a framework for making decisions about whether and how to release these models, following the tenets of value-sensitive design. We additionally provide a suite of tools to enable researchers to make better-informed decisions about training and releasing end-to-end conversational AI models.
    CANDLE: Decomposing Conditional and Conjunctive Queries for Task-Oriented Dialogue Systems. (arXiv:2107.03884v1 [cs.CL])
    (2 min) Domain-specific dialogue systems generally determine user intents by relying on sentence-level classifiers which mainly focus on single action sentences. Such classifiers are not designed to effectively handle complex queries composed of conditional and sequential clauses that represent multiple actions. We attempt to decompose such queries into smaller single-action sub-queries that are reasonable for intent classifiers to understand in a dialogue pipeline. We release CANDLE (Conditional & AND type Expressions), a dataset consisting of 3124 utterances manually tagged with conditional and sequential labels and demonstrates this decomposition by training two baseline taggers.
    HinGE: A Dataset for Generation and Evaluation of Code-Mixed Hinglish Text. (arXiv:2107.03760v1 [cs.CL])
    (2 min) Text generation is a highly active area of research in the computational linguistic community. The evaluation of the generated text is a challenging task and multiple theories and metrics have been proposed over the years. Unfortunately, text generation and evaluation are relatively understudied due to the scarcity of high-quality resources in code-mixed languages where the words and phrases from multiple languages are mixed in a single utterance of text and speech. To address this challenge, we present a corpus (HinGE) for a widely popular code-mixed language Hinglish (code-mixing of Hindi and English languages). HinGE has Hinglish sentences generated by humans as well as two rule-based algorithms corresponding to the parallel Hindi-English sentences. In addition, we demonstrate the inefficacy of widely-used evaluation metrics on the code-mixed data. The HinGE dataset will facilitate the progress of natural language generation research in code-mixed languages.
    FarsTail: A Persian Natural Language Inference Dataset. (arXiv:2009.08820v2 [cs.CL] UPDATED)
    (2 min) Natural language inference (NLI) is known as one of the central tasks in natural language processing (NLP) which encapsulates many fundamental aspects of language understanding. With the considerable achievements of data-hungry deep learning methods in NLP tasks, a great amount of e ort has been devoted to develop more diverse datasets for di erent languages. In this paper, we present a new dataset for the NLI task in the Persian language, also known as Farsi, which is one of the dominant languages in the Middle East. This dataset, named FarsTail, includes 10,367 samples which are provided in both the Persian language as well as the indexed format to be useful for non-Persian researchers. The samples are generated from 3,539 multiple-choice questions with the least amount of annotator interventions in a way similar to the SciTail dataset. A carefully designed multi-step process is adopted to ensure the quality of the dataset. We also present the results of traditional and state-of-the-art methods on FarsTail including di erent embedding methods such as word2vec, fastText, ELMo, BERT, and LASER, as well as di erent modeling approaches such as DecompAtt, ESIM, HBMP, and ULMFiT to provide a solid baseline for the future research. The best obtained test accuracy is 83.38% which shows that there is a big room for improving the current methods to be useful for real-world NLP applications in di erent languages. We also investigate the extent to which the models exploit super cial clues, also known as dataset biases, in FarsTail, and partition the test set into easy and hard subsets according to the success of biased models. The dataset is available at https://github.com/dml-qom/ FarsTail.
    Can Transformer Models Measure Coherence In Text? Re-Thinking the Shuffle Test. (arXiv:2107.03448v1 [cs.CL])
    (2 min) The Shuffle Test is the most common task to evaluate whether NLP models can measure coherence in text. Most recent work uses direct supervision on the task; we show that by simply finetuning a RoBERTa model, we can achieve a near perfect accuracy of 97.8%, a state-of-the-art. We argue that this outstanding performance is unlikely to lead to a good model of text coherence, and suggest that the Shuffle Test should be approached in a Zero-Shot setting: models should be evaluated without being trained on the task itself. We evaluate common models in this setting, such as Generative and Bi-directional Transformers, and find that larger architectures achieve high-performance out-of-the-box. Finally, we suggest the k-Block Shuffle Test, a modification of the original by increasing the size of blocks shuffled. Even though human reader performance remains high (around 95% accuracy), model performance drops from 94% to 78% as block size increases, creating a conceptually simple challenge to benchmark NLP models. Code available: https://github.com/tingofurro/shuffle_test/
    A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models. (arXiv:2107.03844v1 [cs.CL])
    (2 min) Bangla -- ranked as the 6th most widely spoken language across the world (https://www.ethnologue.com/guides/ethnologue200), with 230 million native speakers -- is still considered as a low-resource language in the natural language processing (NLP) community. With three decades of research, Bangla NLP (BNLP) is still lagging behind mainly due to the scarcity of resources and the challenges that come with it. There is sparse work in different areas of BNLP; however, a thorough survey reporting previous work and recent advances is yet to be done. In this study, we first provide a review of Bangla NLP tasks, resources, and tools available to the research community; we benchmark datasets collected from various platforms for nine NLP tasks using current state-of-the-art algorithms (i.e., transformer-based models). We provide comparative results for the studied NLP tasks by comparing monolingual vs. multilingual models of varying sizes. We report our results using both individual and consolidated datasets and provide data splits for future research. We reviewed a total of 108 papers and conducted 175 sets of experiments. Our results show promising performance using transformer-based models while highlighting the trade-off with computational costs. We hope that such a comprehensive survey will motivate the community to build on and further advance the research on Bangla NLP.
    Keep it Simple: Unsupervised Simplification of Multi-Paragraph Text. (arXiv:2107.03444v1 [cs.CL])
    (2 min) This work presents Keep it Simple (KiS), a new approach to unsupervised text simplification which learns to balance a reward across three properties: fluency, salience and simplicity. We train the model with a novel algorithm to optimize the reward (k-SCST), in which the model proposes several candidate simplifications, computes each candidate's reward, and encourages candidates that outperform the mean reward. Finally, we propose a realistic text comprehension task as an evaluation method for text simplification. When tested on the English news domain, the KiS model outperforms strong supervised baselines by more than 4 SARI points, and can help people complete a comprehension task an average of 18% faster while retaining accuracy, when compared to the original text. Code available: https://github.com/tingofurro/keep_it_simple
    POSLAN: Disentangling Chat with Positional and Language encoded Post Embeddings. (arXiv:2107.03529v1 [cs.CL])
    (2 min) Most online message threads inherently will be cluttered and any new user or an existing user visiting after a hiatus will have a difficult time understanding whats being discussed in the thread. Similarly cluttered responses in a message thread makes analyzing the messages a difficult problem. The need for disentangling the clutter is much higher when the platform where the discussion is taking place does not provide functions to retrieve reply relations of the messages. This introduces an interesting problem to which \cite{wang2011learning} phrases as a structural learning problem. We create vector embeddings for posts in a thread so that it captures both linguistic and positional features in relation to a context of where a given message is in. Using these embeddings for posts we compute a similarity based connectivity matrix which then converted into a graph. After employing a pruning mechanisms the resultant graph can be used to discover the reply relation for the posts in the thread. The process of discovering or disentangling chat is kept as an unsupervised mechanism. We present our experimental results on a data set obtained from Telegram with limited meta data.
    Using CollGram to Compare Formulaic Language in Human and Neural Machine Translation. (arXiv:2107.03625v1 [cs.CL])
    (2 min) A comparison of formulaic sequences in human and neural machine translation of quality newspaper articles shows that neural machine translations contain less lower-frequency, but strongly-associated formulaic sequences, and more high-frequency formulaic sequences. These differences were statistically significant and the effect sizes were almost always medium or large. These observations can be related to the differences between second language learners of various levels and between translated and untranslated texts. The comparison between the neural machine translation systems indicates that some systems produce more formulaic sequences of both types than other systems.
    Multilingual Speech Evaluation: Case Studies on English, Malay and Tamil. (arXiv:2107.03675v1 [cs.CL])
    (2 min) Speech evaluation is an essential component in computer-assisted language learning (CALL). While speech evaluation on English has been popular, automatic speech scoring on low resource languages remains challenging. Work in this area has focused on monolingual specific designs and handcrafted features stemming from resource-rich languages like English. Such approaches are often difficult to generalize to other languages, especially if we also want to consider suprasegmental qualities such as rhythm. In this work, we examine three different languages that possess distinct rhythm patterns: English (stress-timed), Malay (syllable-timed), and Tamil (mora-timed). We exploit robust feature representations inspired by music processing and vector representation learning. Empirical validations show consistent gains for all three languages when predicting pronunciation, rhythm and intonation performance.
    Privacy Concerns in Chatbot Interactions: When to Trust and When to Worry. (arXiv:2107.03959v1 [cs.CY])
    (2 min) Through advances in their conversational abilities, chatbots have started to request and process an increasing variety of sensitive personal information. The accurate disclosure of sensitive information is essential where it is used to provide advice and support to users in the healthcare and finance sectors. In this study, we explore users' concerns regarding factors associated with the use of sensitive data by chatbot providers. We surveyed a representative sample of 491 British citizens. Our results show that the user concerns focus on deleting personal information and concerns about their data's inappropriate use. We also identified that individuals were concerned about losing control over their data after a conversation with conversational agents. We found no effect from a user's gender or education but did find an effect from the user's age, with those over 45 being more concerned than those under 45. We also considered the factors that engender trust in a chatbot. Our respondents' primary focus was on the chatbot's technical elements, with factors such as the response quality being identified as the most critical factor. We again found no effect from the user's gender or education level; however, when we considered some social factors (e.g. avatars or perceived 'friendliness'), we found those under 45 years old rated these as more important than those over 45. The paper concludes with a discussion of these results within the context of designing inclusive, digital systems that support a wide range of users.
    LanguageRefer: Spatial-Language Model for 3D Visual Grounding. (arXiv:2107.03438v1 [cs.RO])
    (2 min) To realize robots that can understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that can understand referential language to identify common objects in real-world 3D scenes. In this paper, we develop a spatial-language model for a 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of a point cloud with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model identifies the target object from a set of potential candidates. Our spatial-language model uses a transformer-based architecture that combines spatial embedding from bounding-box with a finetuned language embedding from DistilBert and reasons among the objects in the 3D scene to find the target object. We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D. We provide additional analysis of performance in spatial reasoning tasks decoupled from perception noise, the effect of view-dependent utterances in terms of accuracy, and view-point annotations for potential robotics applications.
    COMBO: a new module for EUD parsing. (arXiv:2107.03809v1 [cs.CL])
    (2 min) We introduce the COMBO-based approach for EUD parsing and its implementation, which took part in the IWPT 2021 EUD shared task. The goal of this task is to parse raw texts in 17 languages into Enhanced Universal Dependencies (EUD). The proposed approach uses COMBO to predict UD trees and EUD graphs. These structures are then merged into the final EUD graphs. Some EUD edge labels are extended with case information using a single language-independent expansion rule. In the official evaluation, the solution ranked fourth, achieving an average ELAS of 83.79%. The source code is available at https://gitlab.clarin-pl.eu/syntactic-tools/combo.
    Handling Heavily Abbreviated Manuscripts: HTR engines vs text normalisation approaches. (arXiv:2107.03450v1 [cs.CL])
    (2 min) Although abbreviations are fairly common in handwritten sources, particularly in medieval and modern Western manuscripts, previous research dealing with computational approaches to their expansion is scarce. Yet abbreviations present particular challenges to computational approaches such as handwritten text recognition and natural language processing tasks. Often, pre-processing ultimately aims to lead from a digitised image of the source to a normalised text, which includes expansion of the abbreviations. We explore different setups to obtain such a normalised text, either directly, by training HTR engines on normalised (i.e., expanded, disabbreviated) text, or by decomposing the process into discrete steps, each making use of specialist models for recognition, word segmentation and normalisation. The case studies considered here are drawn from the medieval Latin tradition.
    Worry, coping and resignation -- A repeated-measures study on emotional responses after a year in the pandemic. (arXiv:2107.03466v1 [cs.CL])
    (2 min) The introduction of COVID-19 lockdown measures and an outlook on return to normality are demanding societal changes. Among the most pressing questions is how individuals adjust to the pandemic. This paper examines the emotional responses to the pandemic in a repeated-measures design. Data (n=1698) were collected in April 2020 (during strict lockdown measures) and in April 2021 (when vaccination programmes gained traction). We asked participants to report their emotions and express these in text data. Statistical tests revealed an average trend towards better adjustment to the pandemic. However, clustering analyses suggested a more complex heterogeneous pattern with a well-coping and a resigning subgroup of participants. Linguistic computational analyses uncovered that topics and n-gram frequencies shifted towards attention to the vaccination programme and away from general worrying. Implications for public mental health efforts in identifying people at heightened risk are discussed. The dataset is made publicly available.
  • cs.CV updates on arXiv.org

    No-Reference Quality Assessment for Colored Point Cloud and Mesh Based on Natural Scene Statistics. (arXiv:2107.02041v2 [cs.CV] UPDATED)
    (2 min) To improve the viewer's quality of experience and optimize processing systems in computer graphics applications, the 3D quality assessment (3D-QA) has become an important task in the multimedia area. Point cloud and mesh are the two most widely used electronic representation formats of 3D models, the quality of which is quite sensitive to operations like simplification and compression. Therefore, many studies concerning point cloud quality assessment (PCQA) and mesh quality assessment (MQA) have been carried out to measure the visual quality degradations caused by lossy operations. However, a large part of previous studies utilizes full-reference (FR) metrics, which means they may fail to predict the accurate quality level of 3D models when the reference 3D model is not available. Furthermore, limited numbers of 3D-QA metrics are carried out to take color features into consideration, which significantly restricts the effectiveness and scope of application. In many quality assessment studies, natural scene statistics (NSS) have shown a good ability to quantify the distortion of natural scenes to statistical parameters. Therefore, we propose an NSS-based no-reference quality assessment metric for colored 3D models. In this paper, quality-aware features are extracted from the aspects of color and geometry directly from the 3D models. Then the statistic parameters are estimated using different distribution models to describe the characteristic of the 3D models. Our method is mainly validated on the colored point cloud quality assessment database (SJTU-PCQA) and the colored mesh quality assessment database (CMDM). The experimental results show that the proposed method outperforms all the state-of-art NR 3D-QA metrics and obtains an acceptable gap with the state-of-art FR 3D-QA metrics.
    Active Safety Envelopes using Light Curtains with Probabilistic Guarantees. (arXiv:2107.04000v1 [cs.LG])
    (2 min) To safely navigate unknown environments, robots must accurately perceive dynamic obstacles. Instead of directly measuring the scene depth with a LiDAR sensor, we explore the use of a much cheaper and higher resolution sensor: programmable light curtains. Light curtains are controllable depth sensors that sense only along a surface that a user selects. We use light curtains to estimate the safety envelope of a scene: a hypothetical surface that separates the robot from all obstacles. We show that generating light curtains that sense random locations (from a particular distribution) can quickly discover the safety envelope for scenes with unknown objects. Importantly, we produce theoretical safety guarantees on the probability of detecting an obstacle using random curtains. We combine random curtains with a machine learning based model that forecasts and tracks the motion of the safety envelope efficiently. Our method accurately estimates safety envelopes while providing probabilistic safety guarantees that can be used to certify the efficacy of a robot perception system to detect and avoid dynamic obstacles. We evaluate our approach in a simulated urban driving environment and a real-world environment with moving pedestrians using a light curtain device and show that we can estimate safety envelopes efficiently and effectively. Project website: https://siddancha.github.io/projects/active-safety-envelopes-with-guarantees
    Spectral decoupling allows training transferable neural networks in medical imaging. (arXiv:2103.17171v3 [eess.IV] UPDATED)
    (3 min) Many current neural networks for medical imaging generalise poorly to data unseen during training. Such behaviour can be caused by networks overfitting easy-to-learn, or statistically dominant, features while disregarding other potentially informative features. For example, indistinguishable differences in the sharpness of the images from two different scanners can degrade the performance of the network significantly. All neural networks intended for clinical practice need to be robust to variation in data caused by differences in imaging equipment, sample preparation and patient populations. To address these challenges, we evaluate the utility of spectral decoupling as an implicit bias mitigation method. Spectral decoupling encourages the neural network to learn more features by simply regularising the networks' unnormalised prediction scores with an L2 penalty, thus having no added computational costs. We show that spectral decoupling allows training neural networks on datasets with strong spurious correlations. Networks trained without spectral decoupling do not learn the original task and appear to make false predictions based on the spurious correlations. Spectral decoupling also increases networks' robustness for data distribution shifts. To validate our findings, we train networks with and without spectral decoupling to detect prostate cancer tissue slides and COVID-19 in chest radiographs. Networks trained with spectral decoupling achieve substantially higher performance on all evaluation datasets. Our results show that spectral decoupling helps with generalisation issues associated with neural networks. We recommend using spectral decoupling as an implicit bias mitigation method in any neural network intended for clinical use.
    WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training. (arXiv:2103.06561v6 [cs.CV] UPDATED)
    (3 min) Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project `WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP that adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples in limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
    A Dataset and Method for Hallux Valgus Angle Estimation Based on Deep Learing. (arXiv:2107.03640v1 [cs.CV])
    (2 min) Angular measurements is essential to make a resonable treatment for Hallux valgus (HV), a common forefoot deformity. However, it still depends on manual labeling and measurement, which is time-consuming and sometimes unreliable. Automating this process is a thing of concern. However, it lack of dataset and the keypoints based method which made a great success in pose estimation is not suitable for this field.To solve the problems, we made a dataset and developed an algorithm based on deep learning and linear regression. It shows great fitting ability to the ground truth.
    Image Resolution Susceptibility of Face Recognition Models. (arXiv:2107.03769v1 [cs.CV])
    (2 min) Face recognition approaches often rely on equal image resolution for verification faces on two images. However, in practical applications, those image resolutions are usually not in the same range due to different image capture mechanisms or sources. In this work, we first analyze the impact of image resolutions on the face verification performance with a state-of-the-art face recognition model. For images, synthetically reduced to $5\, \times 5\, \mathrm{px}$ resolution, the verification performance drops from $99.23\%$ increasingly down to almost $55\%$. Especially, for cross-resolution image pairs (one high- and one low-resolution image), the verification accuracy decreases even further. We investigate this behavior more in-depth by looking at the feature distances for every 2-image test pair. To tackle this problem, we propose the following two methods: 1) Train a state-of-the-art face-recognition model straightforward with $50\%$ low-resolution images directly within each batch. \\ 2) Train a siamese-network structure and adding a cosine distance feature loss between high- and low-resolution features. Both methods show an improvement for cross-resolution scenarios and can increase the accuracy at very low resolution to approximately $70\%$. However, a disadvantage is that a specific model needs to be trained for every resolution-pair ...
    Flexibly Regularized Mixture Models and Application to Image Segmentation. (arXiv:1905.10629v2 [cs.CV] UPDATED)
    (2 min) Probabilistic finite mixture models are widely used for unsupervised clustering. These models can often be improved by adapting them to the topology of the data. For instance, in order to classify spatially adjacent data points similarly, it is common to introduce a Laplacian constraint on the posterior probability that each data point belongs to a class. Alternatively, the mixing probabilities can be treated as free parameters, while assuming Gauss-Markov or more complex priors to regularize those mixing probabilities. However, these approaches are constrained by the shape of the prior and often lead to complicated or intractable inference. Here, we propose a new parametrization of the Dirichlet distribution to flexibly regularize the mixing probabilities of over-parametrized mixture distributions. Using the Expectation-Maximization algorithm, we show that our approach allows us to define any linear update rule for the mixing probabilities, including spatial smoothing regularization as a special case. We then show that this flexible design can be extended to share class information between multiple mixture models. We apply our algorithm to artificial and natural image segmentation tasks, and we provide quantitative and qualitative comparison of the performance of Gaussian and Student-t mixtures on the Berkeley Segmentation Dataset. We also demonstrate how to propagate class information across the layers of deep convolutional neural networks in a probabilistically optimal way, suggesting a new interpretation for feedback signals in biological visual systems. Our flexible approach can be easily generalized to adapt probabilistic mixture models to arbitrary data topologies.
    Sparse and Structured Visual Attention. (arXiv:2002.05556v2 [cs.CL] UPDATED)
    (2 min) Visual attention mechanisms are widely used in multimodal tasks, as visual question answering (VQA). One drawback of softmax-based attention mechanisms is that they assign some probability mass to all image regions, regardless of their adjacency structure and of their relevance to the text. In this paper, to better link the image structure with the text, we replace the traditional softmax attention mechanism with two alternative sparsity-promoting transformations: sparsemax, which is able to select only the relevant regions (assigning zero weight to the rest), and a newly proposed Total-Variation Sparse Attention (TVmax), which further encourages the joint selection of adjacent spatial locations. Experiments in VQA show gains in accuracy as well as higher similarity to human attention, which suggests better interpretability.
    MotionRNN: A Flexible Model for Video Prediction with Spacetime-Varying Motions. (arXiv:2103.02243v3 [cs.CV] UPDATED)
    (2 min) This paper tackles video prediction from a new dimension of predicting spacetime-varying motions that are incessantly changing across both space and time. Prior methods mainly capture the temporal state transitions but overlook the complex spatiotemporal variations of the motion itself, making them difficult to adapt to ever-changing motions. We observe that physical world motions can be decomposed into transient variation and motion trend, while the latter can be regarded as the accumulation of previous motions. Thus, simultaneously capturing the transient variation and the motion trend is the key to make spacetime-varying motions more predictable. Based on these observations, we propose the MotionRNN framework, which can capture the complex variations within motions and adapt to spacetime-varying scenarios. MotionRNN has two main contributions. The first is that we design the MotionGRU unit, which can model the transient variation and motion trend in a unified way. The second is that we apply the MotionGRU to RNN-based predictive models and indicate a new flexible video prediction architecture with a Motion Highway that can significantly improve the ability to predict changeable motions and avoid motion vanishing for stacked multiple-layer predictive models. With high flexibility, this framework can adapt to a series of models for deterministic spatiotemporal prediction. Our MotionRNN can yield significant improvements on three challenging benchmarks for video prediction with spacetime-varying motions.
    BANet: Blur-aware Attention Networks for Dynamic Scene Deblurring. (arXiv:2101.07518v2 [cs.CV] UPDATED)
    (2 min) Image motion blur usually results from moving objects or camera shakes. Such blur is generally directional and non-uniform. Previous research efforts attempt to solve non-uniform blur by using self-recurrent multi-scale or multi-patch architectures accompanying with self-attention. However, using self-recurrent frameworks typically leads to a longer inference time, while inter-pixel or inter-channel self-attention may cause excessive memory usage. This paper proposes blur-aware attention networks (BANet) that accomplish accurate and efficient deblurring via a single forward pass. Our BANet utilizes region-based self-attention with multi-kernel strip pooling to disentangle blur patterns of different degrees and with cascaded parallel dilated convolution to aggregate multi-scale content features. Extensive experimental results on the GoPro and HIDE benchmarks demonstrate that the proposed BANet performs favorably against the state-of-the-art in blurred image restoration and can provide deblurred results in real-time.
    Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers. (arXiv:2107.03996v1 [cs.LG])
    (2 min) We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method for quadrupedal locomotion that leverages a Transformer-based model for fusing proprioceptive states and visual observations. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We show that our method obtains significant improvements over policies with only proprioceptive state inputs, and that Transformer-based models further improve generalization across environments. Our project page with videos is at https://RchalYang.github.io/LocoTransformer .
    CamTuner: Reinforcement-Learning based System for Camera Parameter Tuning to enhance Analytics. (arXiv:2107.03964v1 [cs.LG])
    (3 min) Complex sensors like video cameras include tens of configurable parameters, which can be set by end-users to customize the sensors to specific application scenarios. Although parameter settings significantly affect the quality of the sensor output and the accuracy of insights derived from sensor data, most end-users use a fixed parameter setting because they lack the skill or understanding to appropriately configure these parameters. We propose CamTuner, which is a system to automatically, and dynamically adapt the complex sensor to changing environments. CamTuner includes two key components. First, a bespoke analytics quality estimator, which is a deep-learning model to automatically and continuously estimate the quality of insights from an analytics unit as the environment around a sensor change. Second, a reinforcement learning (RL) module, which reacts to the changes in quality, and automatically adjusts the camera parameters to enhance the accuracy of insights. We improve the training time of the RL module by an order of magnitude by designing virtual models to mimic essential behavior of the camera: we design virtual knobs that can be set to different values to mimic the effects of assigning different values to the camera's configurable parameters, and we design a virtual camera model that mimics the output from a video camera at different times of the day. These virtual models significantly accelerate training because (a) frame rates from a real camera are limited to 25-30 fps while the virtual models enable processing at 300 fps, (b) we do not have to wait until the real camera sees different environments, which could take weeks or months, and (c) virtual knobs can be updated instantly, while it can take 200-500 ms to change the camera parameter settings. Our dynamic tuning approach results in up to 12% improvement in the accuracy of insights from several video analytics tasks.
    Weight Reparametrization for Budget-Aware Network Pruning. (arXiv:2107.03909v1 [cs.CV])
    (2 min) Pruning seeks to design lightweight architectures by removing redundant weights in overparameterized networks. Most of the existing techniques first remove structured sub-networks (filters, channels,...) and then fine-tune the resulting networks to maintain a high accuracy. However, removing a whole structure is a strong topological prior and recovering the accuracy, with fine-tuning, is highly cumbersome. In this paper, we introduce an "end-to-end" lightweight network design that achieves training and pruning simultaneously without fine-tuning. The design principle of our method relies on reparametrization that learns not only the weights but also the topological structure of the lightweight sub-network. This reparametrization acts as a prior (or regularizer) that defines pruning masks implicitly from the weights of the underlying network, without increasing the number of training parameters. Sparsity is induced with a budget loss that provides an accurate pruning. Extensive experiments conducted on the CIFAR10 and the TinyImageNet datasets, using standard architectures (namely Conv4, VGG19 and ResNet18), show compelling results without fine-tuning.
    Weakly-Supervised Online Hashing. (arXiv:2009.07436v2 [cs.CV] UPDATED)
    (2 min) With the rapid development of social websites, recent years have witnessed an explosive growth of social images with user-provided tags which continuously arrive in a streaming fashion. Due to the fast query speed and low storage cost, hashing-based methods for image search have attracted increasing attention. However, existing hashing methods for social image retrieval are based on batch mode which violates the nature of social images, i.e., social images are usually generated periodically or collected in a stream fashion. Although there exist many online image hashing methods, they either adopt unsupervised learning which ignore the relevant tags, or are designed in the supervised manner which needs high-quality labels. In this paper, to overcome the above limitations, we propose a new method named Weakly-supervised Online Hashing (WOH). In order to learn high-quality hash codes, WOH exploits the weak supervision by considering the semantics of tags and removing the noise. Besides, We develop a discrete online optimization algorithm for WOH, which is efficient and scalable. Extensive experiments conducted on two real-world datasets demonstrate the superiority of WOH compared with several state-of-the-art hashing baselines.
    Malware Classification Using Deep Boosted Learning. (arXiv:2107.04008v1 [cs.CR])
    (2 min) Malicious activities in cyberspace have gone further than simply hacking machines and spreading viruses. It has become a challenge for a nations survival and hence has evolved to cyber warfare. Malware is a key component of cyber-crime, and its analysis is the first line of defence against attack. This work proposes a novel deep boosted hybrid learning-based malware classification framework and named as Deep boosted Feature Space-based Malware classification (DFS-MC). In the proposed framework, the discrimination power is enhanced by fusing the feature spaces of the best performing customized CNN architectures models and its discrimination by an SVM for classification. The discrimination capacity of the proposed classification framework is assessed by comparing it against the standard customized CNNs. The customized CNN models are implemented in two ways: softmax classifier and deep hybrid learning-based malware classification. In the hybrid learning, Deep features are extracted from customized CNN architectures and fed into the conventional machine learning classifier to improve the classification performance. We also introduced the concept of transfer learning in a customized CNN architecture based malware classification framework through fine-tuning. The performance of the proposed malware classification approaches are validated on the MalImg malware dataset using the hold-out cross-validation technique. Experimental comparisons were conducted by employing innovative, customized CNN, trained from scratch and fine-tuning the customized CNN using transfer learning. The proposed classification framework DFS-MC showed improved results, Accuracy: 98.61%, F-score: 0.96, Precision: 0.96, and Recall: 0.96.
    An Embedded Iris Recognition System Optimization using Dynamically ReconfigurableDecoder with LDPC Codes. (arXiv:2107.03688v1 [cs.CV])
    (2 min) Extracting and analyzing iris textures for biometric recognition has been extensively studied. As the transition of iris recognition from lab technology to nation-scale applications, most systems are facing high complexity in either time or space, leading to unfitness for embedded devices. In this paper, the proposed design includes a minimal set of computer vision modules and multi-mode QC-LDPC decoder which can alleviate variability and noise caused by iris acquisition and follow-up process. Several classes of QC-LDPC code from IEEE 802.16 are tested for the validity of accuracy improvement. Some of the codes mentioned above are used for further QC-LDPC decoder quantization, validation and comparison to each other. We show that we can apply Dynamic Partial Reconfiguration technology to implement the multi-mode QC-LDPC decoder for the iris recognition system. The results show that the implementation is power-efficient and good for edge applications.
    GPNAS: A Neural Network Architecture Search Framework Based on Graphical Predictor. (arXiv:2103.11820v6 [cs.LG] UPDATED)
    (2 min) In practice, the problems encountered in Neural Architecture Search (NAS) training are not simple problems, but often a series of difficult combinations (wrong compensation estimation, curse of dimension, overfitting, high complexity, etc.). In this paper, we propose a framework to decouple network structure from operator search space, and use two BOHBs to search alternatively. Considering that activation function and initialization are also important parts of neural network, the generalization ability of the model will be affected. We introduce an activation function and an initialization method domain, and add them into the operator search space to form a generalized search space, so as to improve the generalization ability of the child model. We then trained a GCN-based predictor using feedback from the child model. This can not only improve the search efficiency, but also solve the problem of dimension curse. Next, unlike other NAS studies, we used predictors to analyze the stability of different network structures. Finally, we applied our framework to neural structure search and achieved significant improvements on multiple datasets.
    ACAE-REMIND for Online Continual Learning with Compressed Feature Replay. (arXiv:2105.08595v2 [cs.CV] UPDATED)
    (2 min) Online continual learning aims to learn from a non-IID stream of data from a number of different tasks, where the learner is only allowed to consider data once. Methods are typically allowed to use a limited buffer to store some of the images in the stream. Recently, it was found that feature replay, where an intermediate layer representation of the image is stored (or generated) leads to superior results than image replay, while requiring less memory. Quantized exemplars can further reduce the memory usage. However, a drawback of these methods is that they use a fixed (or very intransigent) backbone network. This significantly limits the learning of representations that can discriminate between all tasks. To address this problem, we propose an auxiliary classifier auto-encoder (ACAE) module for feature replay at intermediate layers with high compression rates. The reduced memory footprint per image allows us to save more exemplars for replay. In our experiments, we conduct task-agnostic evaluation under online continual learning setting and get state-of-the-art performance on ImageNet-Subset, CIFAR100 and CIFAR10 dataset.
    Task Fingerprinting for Meta Learning in Biomedical Image Analysis. (arXiv:2107.03949v1 [cs.CV])
    (2 min) Shortage of annotated data is one of the greatest bottlenecks in biomedical image analysis. Meta learning studies how learning systems can increase in efficiency through experience and could thus evolve as an important concept to overcome data sparsity. However, the core capability of meta learning-based approaches is the identification of similar previous tasks given a new task - a challenge largely unexplored in the biomedical imaging domain. In this paper, we address the problem of quantifying task similarity with a concept that we refer to as task fingerprinting. The concept involves converting a given task, represented by imaging data and corresponding labels, to a fixed-length vector representation. In fingerprint space, different tasks can be directly compared irrespective of their data set sizes, types of labels or specific resolutions. An initial feasibility study in the field of surgical data science (SDS) with 26 classification tasks from various medical and non-medical domains suggests that task fingerprinting could be leveraged for both (1) selecting appropriate data sets for pretraining and (2) selecting appropriate architectures for a new task. Task fingerprinting could thus become an important tool for meta learning in SDS and other fields of biomedical image analysis.
    Elastic deformation of optical coherence tomography images of diabetic macular edema for deep-learning models training: how far to go?. (arXiv:2107.03651v1 [eess.IV])
    (2 min) To explore the clinical validity of elastic deformation of optical coherence tomography (OCT) images for data augmentation in the development of deep-learning model for detection of diabetic macular edema (DME).
    Video 3D Sampling for Self-supervised Representation Learning. (arXiv:2107.03578v1 [cs.CV])
    (2 min) Most of the existing video self-supervised methods mainly leverage temporal signals of videos, ignoring that the semantics of moving objects and environmental information are all critical for video-related tasks. In this paper, we propose a novel self-supervised method for video representation learning, referred to as Video 3D Sampling (V3S). In order to sufficiently utilize the information (spatial and temporal) provided in videos, we pre-process a video from three dimensions (width, height, time). As a result, we can leverage the spatial information (the size of objects), temporal information (the direction and magnitude of motions) as our learning target. In our implementation, we combine the sampling of the three dimensions and propose the scale and projection transformations in space and time respectively. The experimental results show that, when applied to action recognition, video retrieval and action similarity labeling, our approach improves the state-of-the-arts with significant margins.
    Real-Time Ellipse Detection for Robotics Applications. (arXiv:2102.12670v2 [cs.RO] UPDATED)
    (2 min) We propose a new algorithm for real-time detection and tracking of elliptic patterns suitable for real-world robotics applications. The method fits ellipses to each contour in the image frame and rejects ellipses that do not yield a good fit. The resulting detection and tracking method is lightweight enough to be used on robots' resource-limited onboard computers, can deal with lighting variations and detect the pattern even when the view is partial. The method is tested on an example application of an autonomous UAV landing on a fast-moving vehicle to show its performance indoors, outdoors, and in simulation on a real-world robotics task. The comparison with other well-known ellipse detection methods shows that our proposed algorithm outperforms other methods with the F1 score of 0.981 on a dataset with over 1500 frames. The videos of experiments, the source codes, and the collected dataset are provided with the paper at https://theairlab.org/landing-on-vehicle .
    Grid Partitioned Attention: Efficient TransformerApproximation with Inductive Bias for High Resolution Detail Generation. (arXiv:2107.03742v1 [cs.CV])
    (2 min) Attention is a general reasoning mechanism than can flexibly deal with image information, but its memory requirements had made it so far impractical for high resolution image generation. We present Grid Partitioned Attention (GPA), a new approximate attention algorithm that leverages a sparse inductive bias for higher computational and memory efficiency in image domains: queries attend only to few keys, spatially close queries attend to close keys due to correlations. Our paper introduces the new attention layer, analyzes its complexity and how the trade-off between memory usage and model power can be tuned by the hyper-parameters.We will show how such attention enables novel deep learning architectures with copying modules that are especially useful for conditional image generation tasks like pose morphing. Our contributions are (i) algorithm and code1of the novel GPA layer, (ii) a novel deep attention-copying architecture, and (iii) new state-of-the art experimental results in human pose morphing generation benchmarks.
    Comparing ML based Segmentation Models on Jet Fire Radiation Zone. (arXiv:2107.03461v1 [cs.CV])
    (2 min) Risk assessment is relevant in any workplace, however there is a degree of unpredictability when dealing with flammable or hazardous materials so that detection of fire accidents by itself may not be enough. An example of this is the impingement of jet fires, where the heat fluxes of the flame could reach nearby equipment and dramatically increase the probability of a domino effect with catastrophic results. Because of this, the characterization of such fire accidents is important from a risk management point of view. One such characterization would be the segmentation of different radiation zones within the flame, so this paper presents an exploratory research regarding several traditional computer vision and Deep Learning segmentation approaches to solve this specific problem. A data set of propane jet fires is used to train and evaluate the different approaches and given the difference in the distribution of the zones and background of the images, different loss functions, that seek to alleviate data imbalance, are also explored. Additionally, different metrics are correlated to a manual ranking performed by experts to make an evaluation that closely resembles the expert's criteria. The Hausdorff Distance and Adjsted Random Index were the metrics with the highest correlation and the best results were obtained from the UNet architecture with a Weighted Cross-Entropy Loss. These results can be used in future research to extract more geometric information from the segmentation masks or could even be implemented on other types of fire accidents.
    Exploiting the relationship between visual and textual features in social networks for image classification with zero-shot deep learning. (arXiv:2107.03751v1 [cs.CV])
    (2 min) One of the main issues related to unsupervised machine learning is the cost of processing and extracting useful information from large datasets. In this work, we propose a classifier ensemble based on the transferable learning capabilities of the CLIP neural network architecture in multimodal environments (image and text) from social media. For this purpose, we used the InstaNY100K dataset and proposed a validation approach based on sampling techniques. Our experiments, based on image classification tasks according to the labels of the Places dataset, are performed by first considering only the visual part, and then adding the associated texts as support. The results obtained demonstrated that trained neural networks such as CLIP can be successfully applied to image classification with little fine-tuning, and considering the associated texts to the images can help to improve the accuracy depending on the goal. The results demonstrated what seems to be a promising research direction.
    Defending Against Multiple and Unforeseen Adversarial Videos. (arXiv:2009.05244v2 [cs.LG] UPDATED)
    (2 min) Adversarial robustness of deep neural networks has been actively investigated. However, most existing defense approaches are limited to a specific type of adversarial perturbations. Specifically, they often fail to offer resistance to multiple attack types simultaneously, i.e., they lack multi-perturbation robustness. Furthermore, compared to image recognition problems, the adversarial robustness of video recognition models is relatively unexplored. While several studies have proposed how to generate adversarial videos, only a handful of approaches about the defense strategies have been published in the literature. In this paper, we propose one of the first defense strategies against multiple types of adversarial videos for video recognition. The proposed method, referred to as MultiBN, performs adversarial training on multiple adversarial video types using multiple independent batch normalization (BN) layers with a learning-based BN selection module. With a multiple BN structure, each BN brach is responsible for learning the distribution of a single perturbation type and thus provides more precise distribution estimations. This mechanism benefits dealing with multiple perturbation types. The BN selection module detects the attack type of an input video and sends it to the corresponding BN branch, making MultiBN fully automatic and allow end-to-end training. Compared to present adversarial training approaches, the proposed MultiBN exhibits stronger multi-perturbation robustness against different and even unforeseen adversarial video types, ranging from Lp-bounded attacks and physically realizable attacks. This holds true on different datasets and target models. Moreover, we conduct an extensive analysis to study the properties of the multiple BN structure.
    4D Attention: Comprehensive Framework for Spatio-Temporal Gaze Mapping. (arXiv:2107.03606v1 [cs.RO])
    (2 min) This study presents a framework for capturing human attention in the spatio-temporal domain using eye-tracking glasses. Attention mapping is a key technology for human perceptual activity analysis or Human-Robot Interaction (HRI) to support human visual cognition; however, measuring human attention in dynamic environments is challenging owing to the difficulty in localizing the subject and dealing with moving objects. To address this, we present a comprehensive framework, 4D Attention, for unified gaze mapping onto static and dynamic objects. Specifically, we estimate the glasses pose by leveraging a loose coupling of direct visual localization and Inertial Measurement Unit (IMU) values. Further, by installing reconstruction components into our framework, dynamic objects not captured in the 3D environment map are instantiated based on the input images. Finally, a scene rendering component synthesizes a first-person view with identification (ID) textures and performs direct 2D-3D gaze association. Quantitative evaluations showed the effectiveness of our framework. Additionally, we demonstrated the applications of 4D Attention through experiments in real situations.
    SCSS-Net: Superpoint Constrained Semi-supervised Segmentation Network for 3D Indoor Scenes. (arXiv:2107.03601v1 [cs.CV])
    (2 min) Many existing deep neural networks (DNNs) for 3D point cloud semantic segmentation require a large amount of fully labeled training data. However, manually assigning point-level labels on the complex scenes is time-consuming. While unlabeled point clouds can be easily obtained from sensors or reconstruction, we propose a superpoint constrained semi-supervised segmentation network for 3D point clouds, named as SCSS-Net. Specifically, we use the pseudo labels predicted from unlabeled point clouds for self-training, and the superpoints produced by geometry-based and color-based Region Growing algorithms are combined to modify and delete pseudo labels with low confidence. Additionally, we propose an edge prediction module to constrain the features from edge points of geometry and color. A superpoint feature aggregation module and superpoint feature consistency loss functions are introduced to smooth the point features in each superpoint. Extensive experimental results on two 3D public indoor datasets demonstrate that our method can achieve better performance than some state-of-the-art point cloud segmentation networks and some popular semi-supervised segmentation methods with few labeled scenes.
    Regional Differential Information Entropy for Super-Resolution Image Quality Assessment. (arXiv:2107.03642v1 [eess.IV])
    (2 min) PSNR and SSIM are the most widely used metrics in super-resolution problems, because they are easy to use and can evaluate the similarities between generated images and reference images. However, single image super-resolution is an ill-posed problem, there are multiple corresponding high-resolution images for the same low-resolution image. The similarities can't totally reflect the restoration effect. The perceptual quality of generated images is also important, but PSNR and SSIM do not reflect perceptual quality well. To solve the problem, we proposed a method called regional differential information entropy to measure both of the similarities and perceptual quality. To overcome the problem that traditional image information entropy can't reflect the structure information, we proposed to measure every region's information entropy with sliding window. Considering that the human visual system is more sensitive to the brightness difference at low brightness, we take $\gamma$ quantization rather than linear quantization. To accelerate the method, we reorganized the calculation procedure of information entropy with a neural network. Through experiments on our IQA dataset and PIPAL, this paper proves that RDIE can better quantify perceptual quality of images especially GAN-based images.
    Atlas-Based Segmentation of Intracochlear Anatomy in Metal Artifact Affected CT Images of the Ear with Co-trained Deep Neural Networks. (arXiv:2107.03987v1 [eess.IV])
    (2 min) We propose an atlas-based method to segment the intracochlear anatomy (ICA) in the post-implantation CT (Post-CT) images of cochlear implant (CI) recipients that preserves the point-to-point correspondence between the meshes in the atlas and the segmented volumes. To solve this problem, which is challenging because of the strong artifacts produced by the implant, we use a pair of co-trained deep networks that generate dense deformation fields (DDFs) in opposite directions. One network is tasked with registering an atlas image to the Post-CT images and the other network is tasked with registering the Post-CT images to the atlas image. The networks are trained using loss functions based on voxel-wise labels, image content, fiducial registration error, and cycle-consistency constraint. The segmentation of the ICA in the Post-CT images is subsequently obtained by transferring the predefined segmentation meshes of the ICA in the atlas image to the Post-CT images using the corresponding DDFs generated by the trained registration networks. Our model can learn the underlying geometric features of the ICA even though they are obscured by the metal artifacts. We show that our end-to-end network produces results that are comparable to the current state of the art (SOTA) that relies on a two-steps approach that first uses conditional generative adversarial networks to synthesize artifact-free images from the Post-CT images and then uses an active shape model-based method to segment the ICA in the synthetic images. Our method requires a fraction of the time needed by the SOTA, which is important for end-user acceptance.
    Consensus Clustering With Unsupervised Representation Learning. (arXiv:2010.01245v2 [cs.CV] UPDATED)
    (2 min) Recent advances in deep clustering and unsupervised representation learning are based on the idea that different views of an input image (generated through data augmentation techniques) must either be closer in the representation space, or have a similar cluster assignment. Bootstrap Your Own Latent (BYOL) is one such representation learning algorithm that has achieved state-of-the-art results in self-supervised image classification on ImageNet under the linear evaluation protocol. However, the utility of the learnt features of BYOL to perform clustering is not explored. In this work, we study the clustering ability of BYOL and observe that features learnt using BYOL may not be optimal for clustering. We propose a novel consensus clustering based loss function, and train BYOL with the proposed loss in an end-to-end way that improves the clustering ability and outperforms similar clustering based methods on some popular computer vision datasets.
    Dual-Consistency Semi-Supervised Learning with Uncertainty Quantification for COVID-19 Lesion Segmentation from CT Images. (arXiv:2104.03225v2 [eess.IV] UPDATED)
    (2 min) The novel coronavirus disease 2019 (COVID-19) characterized by atypical pneumonia has caused millions of deaths worldwide. Automatically segmenting lesions from chest Computed Tomography (CT) is a promising way to assist doctors in COVID-19 screening, treatment planning, and follow-up monitoring. However, voxel-wise annotations are extremely expert-demanding and scarce, especially when it comes to novel diseases, while an abundance of unlabeled data could be available. To tackle the challenge of limited annotations, in this paper, we propose an uncertainty-guided dual-consistency learning network (UDC-Net) for semi-supervised COVID-19 lesion segmentation from CT images. Specifically, we present a dual-consistency learning scheme that simultaneously imposes image transformation equivalence and feature perturbation invariance to effectively harness the knowledge from unlabeled data. We then quantify the segmentation uncertainty in two forms and employ them together to guide the consistency regularization for more reliable unsupervised learning. Extensive experiments showed that our proposed UDC-Net improves the fully supervised method by 6.3% in Dice and outperforms other competitive semi-supervised approaches by significant margins, demonstrating high potential in real-world clinical practice.
    Feature Pyramid Network for Multi-task Affective Analysis. (arXiv:2107.03670v1 [cs.CV])
    (2 min) Affective Analysis is not a single task, and the valence-arousal value, expression class and action unit can be predicted at the same time. Previous researches failed to take them as a whole task or ignore the entanglement and hierarchical relation of this three facial attributes. We propose a novel model named feature pyramid networks for multi-task affect analysis. The hierarchical features are extracted to predict three labels and we apply teacher-student training strategy to learn from pretrained single-task models. Extensive experiment results demonstrate the proposed model outperform other models. The code and model are available for research purposes at $\href{https://github.com/ryanhe312/ABAW2-FPNMAA}{\text{this link}}$.
    TGHop: An Explainable, Efficient and Lightweight Method for Texture Generation. (arXiv:2107.04020v1 [cs.CV])
    (2 min) An explainable, efficient and lightweight method for texture generation, called TGHop (an acronym of Texture Generation PixelHop), is proposed in this work. Although synthesis of visually pleasant texture can be achieved by deep neural networks, the associated models are large in size, difficult to explain in theory, and computationally expensive in training. In contrast, TGHop is small in its model size, mathematically transparent, efficient in training and inference, and able to generate high quality texture. Given an exemplary texture, TGHop first crops many sample patches out of it to form a collection of sample patches called the source. Then, it analyzes pixel statistics of samples from the source and obtains a sequence of fine-to-coarse subspaces for these patches by using the PixelHop++ framework. To generate texture patches with TGHop, we begin with the coarsest subspace, which is called the core, and attempt to generate samples in each subspace by following the distribution of real samples. Finally, texture patches are stitched to form texture images of a large size. It is demonstrated by experimental results that TGHop can generate texture images of superior quality with a small model size and at a fast speed.
    Language Conditioned Imitation Learning over Unstructured Data. (arXiv:2005.07648v2 [cs.RO] UPDATED)
    (2 min) Natural language is perhaps the most flexible and intuitive way for humans to communicate tasks to a robot. Prior work in imitation learning typically requires each task be specified with a task id or goal image -- something that is often impractical in open-world environments. On the other hand, previous approaches in instruction following allow agent behavior to be guided by language, but typically assume structure in the observations, actuators, or language that limit their applicability to complex settings like robotics. In this work, we present a method for incorporating free-form natural language conditioning into imitation learning. Our approach learns perception from pixels, natural language understanding, and multitask continuous control end-to-end as a single neural network. Unlike prior work in imitation learning, our method is able to incorporate unlabeled and unstructured demonstration data (i.e. no task or language labels). We show this dramatically improves language conditioned performance, while reducing the cost of language annotation to less than 1% of total data. At test time, a single language conditioned visuomotor policy trained with our method can perform a wide variety of robotic manipulation skills in a 3D environment, specified only with natural language descriptions of each task (e.g. "open the drawer...now pick up the block...now press the green button..."). To scale up the number of instructions an agent can follow, we propose combining text conditioned policies with large pretrained neural language models. We find this allows a policy to be robust to many out-of-distribution synonym instructions, without requiring new demonstrations. See videos of a human typing live text commands to our agent at language-play.github.io
    Label-set Loss Functions for Partial Supervision: Application to Fetal Brain 3D MRI Parcellation. (arXiv:2107.03846v1 [eess.IV])
    (2 min) Deep neural networks have increased the accuracy of automatic segmentation, however, their accuracy depends on the availability of a large number of fully segmented images. Methods to train deep neural networks using images for which some, but not all, regions of interest are segmented are necessary to make better use of partially annotated datasets. In this paper, we propose the first axiomatic definition of label-set loss functions that are the loss functions that can handle partially segmented images. We prove that there is one and only one method to convert a classical loss function for fully segmented images into a proper label-set loss function. Our theory also allows us to define the leaf-Dice loss, a label-set generalization of the Dice loss particularly suited for partial supervision with only missing labels. Using the leaf-Dice loss, we set a new state of the art in partially supervised learning for fetal brain 3D MRI segmentation. We achieve a deep neural network able to segment white matter, ventricles, cerebellum, extra-ventricular CSF, cortical gray matter, deep gray matter, brainstem, and corpus callosum based on fetal brain 3D MRI of anatomically normal fetuses or with open spina bifida. Our implementation of the proposed label-set loss functions is available at https://github.com/LucasFidon/label-set-loss-functions
    Instance-Level Relative Saliency Ranking with Graph Reasoning. (arXiv:2107.03824v1 [cs.CV])
    (2 min) Conventional salient object detection models cannot differentiate the importance of different salient objects. Recently, two works have been proposed to detect saliency ranking by assigning different degrees of saliency to different objects. However, one of these models cannot differentiate object instances and the other focuses more on sequential attention shift order inference. In this paper, we investigate a practical problem setting that requires simultaneously segment salient instances and infer their relative saliency rank order. We present a novel unified model as the first end-to-end solution, where an improved Mask R-CNN is first used to segment salient instances and a saliency ranking branch is then added to infer the relative saliency. For relative saliency ranking, we build a new graph reasoning module by combining four graphs to incorporate the instance interaction relation, local contrast, global contrast, and a high-level semantic prior, respectively. A novel loss function is also proposed to effectively train the saliency ranking branch. Besides, a new dataset and an evaluation metric are proposed for this task, aiming at pushing forward this field of research. Finally, experimental results demonstrate that our proposed model is more effective than previous methods. We also show an example of its practical usage on adaptive image retargeting.
    Complete Scanning Application Using OpenCv. (arXiv:2107.03700v1 [cs.CV])
    (2 min) In the following paper, we have combined the various basic functionalities provided by the NumPy library and OpenCv library, which is an open source for Computer Vision applications, like conversion of colored images to grayscale, calculating threshold, finding contours and using those contour points to take perspective transform of the image inputted by the user, using Python version 3.7. Additional features include cropping, rotating and saving as well. All these functions and features, when implemented step by step, results in a complete scanning application. The applied procedure involves the following steps: Finding contours, applying Perspective transform and brightening the image, Adaptive Thresholding and applying filters for noise cancellation, and Rotation features and perspective transform for a special cropping algorithm. The described technique is implemented on various samples.
    Classification of Urban Morphology with Deep Learning: Application on Urban Vitality. (arXiv:2105.09908v2 [cs.CV] UPDATED)
    (2 min) There is a prevailing trend to study urban morphology quantitatively thanks to the growing accessibility to various forms of spatial big data, increasing computing power, and use cases benefiting from such information. The methods developed up to now measure urban morphology with numerical indices describing density, proportion, and mixture, but they do not directly represent morphological features from the human's visual and intuitive perspective. We take the first step to bridge the gap by proposing a deep learning-based technique to automatically classify road networks into four classes on a visual basis. The method is implemented by generating an image of the street network (Colored Road Hierarchy Diagram), which we introduce in this paper, and classifying it using a deep convolutional neural network (ResNet-34). The model achieves an overall classification accuracy of 0.875. Nine cities around the world are selected as the study areas with their road networks acquired from OpenStreetMap. Latent subgroups among the cities are uncovered through clustering on the percentage of each road network category. In the subsequent part of the paper, we focus on the usability of such classification: we apply our method in a case study of urban vitality prediction. An advanced tree-based regression model (LightGBM) is for the first time designated to establish the relationship between morphological indices and vitality indicators. The effect of road network classification is found to be small but positively associated with urban vitality. This work expands the toolkit of quantitative urban morphology study with new techniques, supporting further studies in the future.
    Uncertainty-Aware Camera Pose Estimation from Points and Lines. (arXiv:2107.03890v1 [cs.CV])
    (2 min) Perspective-n-Point-and-Line (P$n$PL) algorithms aim at fast, accurate, and robust camera localization with respect to a 3D model from 2D-3D feature correspondences, being a major part of modern robotic and AR/VR systems. Current point-based pose estimation methods use only 2D feature detection uncertainties, and the line-based methods do not take uncertainties into account. In our setup, both 3D coordinates and 2D projections of the features are considered uncertain. We propose PnP(L) solvers based on EPnP and DLS for the uncertainty-aware pose estimation. We also modify motion-only bundle adjustment to take 3D uncertainties into account. We perform exhaustive synthetic and real experiments on two different visual odometry datasets. The new PnP(L) methods outperform the state-of-the-art on real data in isolation, showing an increase in mean translation accuracy by 18% on a representative subset of KITTI, while the new uncertain refinement improves pose accuracy for most of the solvers, e.g. decreasing mean translation error for the EPnP by 16% compared to the standard refinement on the same dataset. The code is available at https://alexandervakhitov.github.io/uncertain-pnp/.
    Spectroscopic Approach to Correction and Visualisation of Bright-Field Light Transmission Microscopy Biological Data. (arXiv:1903.06519v3 [eess.IV] UPDATED)
    (2 min) The most realistic information about the transparent sample such as a live cell can be obtained only using bright-field light microscopy. At high-intensity pulsing LED illumination, we captured a primary 12-bit-per-channel (bpc) response from an observed sample using a bright-field microscope equipped with a high-resolution (4872x3248) image sensor. In order to suppress data distortions originating from the light interactions with elements in the optical path, poor sensor reproduction (geometrical defects of the camera sensor and some peculiarities of sensor sensitivity), we propose a spectroscopic approach for the correction of this uncompressed 12-bpc data by simultaneous calibration of all parts of the experimental arrangement. Moreover, the final intensities of the corrected images are proportional to the photon fluxes detected by a camera sensor. It can be visualized in 8-bpc intensity depth after the Least Information Loss compression [Lect. Notes Bioinform. 9656, 527 (2016)].
    Modality Completion via Gaussian Process Prior Variational Autoencoders for Multi-Modal Glioma Segmentation. (arXiv:2107.03442v1 [eess.IV])
    (2 min) In large studies involving multi protocol Magnetic Resonance Imaging (MRI), it can occur to miss one or more sub-modalities for a given patient owing to poor quality (e.g. imaging artifacts), failed acquisitions, or hallway interrupted imaging examinations. In some cases, certain protocols are unavailable due to limited scan time or to retrospectively harmonise the imaging protocols of two independent studies. Missing image modalities pose a challenge to segmentation frameworks as complementary information contributed by the missing scans is then lost. In this paper, we propose a novel model, Multi-modal Gaussian Process Prior Variational Autoencoder (MGP-VAE), to impute one or more missing sub-modalities for a patient scan. MGP-VAE can leverage the Gaussian Process (GP) prior on the Variational Autoencoder (VAE) to utilize the subjects/patients and sub-modalities correlations. Instead of designing one network for each possible subset of present sub-modalities or using frameworks to mix feature maps, missing data can be generated from a single model based on all the available samples. We show the applicability of MGP-VAE on brain tumor segmentation where either, two, or three of four sub-modalities may be missing. Our experiments against competitive segmentation baselines with missing sub-modality on BraTS'19 dataset indicate the effectiveness of the MGP-VAE model for segmentation tasks.
    Improving the Effectiveness and Efficiency of Stochastic Neighbour Embedding with Isolation Kernel. (arXiv:1906.09744v3 [cs.LG] UPDATED)
    (2 min) This paper presents a new insight into improving the performance of Stochastic Neighbour Embedding (t-SNE) by using Isolation kernel instead of Gaussian kernel. Isolation kernel outperforms Gaussian kernel in two aspects. First, the use of Isolation kernel in t-SNE overcomes the drawback of misrepresenting some structures in the data, which often occurs when Gaussian kernel is applied in t-SNE. This is because Gaussian kernel determines each local bandwidth based on one local point only, while Isolation kernel is derived directly from the data based on space partitioning. Second, the use of Isolation kernel yields a more efficient similarity computation because data-dependent Isolation kernel has only one parameter that needs to be tuned. In contrast, the use of data-independent Gaussian kernel increases the computational cost by determining n bandwidths for a dataset of n points. As the root cause of these deficiencies in t-SNE is Gaussian kernel, we show that simply replacing Gaussian kernel with Isolation kernel in t-SNE significantly improves the quality of the final visualisation output (without creating misrepresented structures) and removes one key obstacle that prevents t-SNE from processing large datasets. Moreover, Isolation kernel enables t-SNE to deal with large-scale datasets in less runtime without trading off accuracy, unlike existing methods in speeding up t-SNE.
    Deep Learning Based Image Retrieval in the JPEG Compressed Domain. (arXiv:2107.03648v1 [eess.IV])
    (2 min) Content-based image retrieval (CBIR) systems on pixel domain use low-level features, such as colour, texture and shape, to retrieve images. In this context, two types of image representations i.e. local and global image features have been studied in the literature. Extracting these features from pixel images and comparing them with images from the database is very time-consuming. Therefore, in recent years, there has been some effort to accomplish image analysis directly in the compressed domain with lesser computations. Furthermore, most of the images in our daily transactions are stored in the JPEG compressed format. Therefore, it would be ideal if we could retrieve features directly from the partially decoded or compressed data and use them for retrieval. Here, we propose a unified model for image retrieval which takes DCT coefficients as input and efficiently extracts global and local features directly in the JPEG compressed domain for accurate image retrieval. The experimental findings indicate that our proposed model performed similarly to the current DELG model which takes RGB features as an input with reference to mean average precision while having a faster training and retrieval speed.
    Rethinking of Pedestrian Attribute Recognition: A Reliable Evaluation under Zero-Shot Pedestrian Identity Setting. (arXiv:2107.03576v1 [cs.CV])
    (2 min) Pedestrian attribute recognition aims to assign multiple attributes to one pedestrian image captured by a video surveillance camera. Although numerous methods are proposed and make tremendous progress, we argue that it is time to step back and analyze the status quo of the area. We review and rethink the recent progress from three perspectives. First, given that there is no explicit and complete definition of pedestrian attribute recognition, we formally define and distinguish pedestrian attribute recognition from other similar tasks. Second, based on the proposed definition, we expose the limitations of the existing datasets, which violate the academic norm and are inconsistent with the essential requirement of practical industry application. Thus, we propose two datasets, PETA\textsubscript{$ZS$} and RAP\textsubscript{$ZS$}, constructed following the zero-shot settings on pedestrian identity. In addition, we also introduce several realistic criteria for future pedestrian attribute dataset construction. Finally, we reimplement existing state-of-the-art methods and introduce a strong baseline method to give reliable evaluations and fair comparisons. Experiments are conducted on four existing datasets and two proposed datasets to measure progress on pedestrian attribute recognition.
    Combining Transformer Generators with Convolutional Discriminators. (arXiv:2105.10189v2 [cs.CV] UPDATED)
    (2 min) Transformer models have recently attracted much interest from computer vision researchers and have since been successfully employed for several problems traditionally addressed with convolutional neural networks. At the same time, image synthesis using generative adversarial networks (GANs) has drastically improved over the last few years. The recently proposed TransGAN is the first GAN using only transformer-based architectures and achieves competitive results when compared to convolutional GANs. However, since transformers are data-hungry architectures, TransGAN requires data augmentation, an auxiliary super-resolution task during training, and a masking prior to guide the self-attention mechanism. In this paper, we study the combination of a transformer-based generator and convolutional discriminator and successfully remove the need of the aforementioned required design choices. We evaluate our approach by conducting a benchmark of well-known CNN discriminators, ablate the size of the transformer-based generator, and show that combining both architectural elements into a hybrid model leads to better results. Furthermore, we investigate the frequency spectrum properties of generated images and observe that our model retains the benefits of an attention based generator.
    Use of Affective Visual Information for Summarization of Human-Centric Videos. (arXiv:2107.03783v1 [cs.CV])
    (2 min) Increasing volume of user-generated human-centric video content and their applications, such as video retrieval and browsing, require compact representations that are addressed by the video summarization literature. Current supervised studies formulate video summarization as a sequence-to-sequence learning problem and the existing solutions often neglect the surge of human-centric view, which inherently contains affective content. In this study, we investigate the affective-information enriched supervised video summarization task for human-centric videos. First, we train a visual input-driven state-of-the-art continuous emotion recognition model (CER-NET) on the RECOLA dataset to estimate emotional attributes. Then, we integrate the estimated emotional attributes and the high-level representations from the CER-NET with the visual information to define the proposed affective video summarization architectures (AVSUM). In addition, we investigate the use of attention to improve the AVSUM architectures and propose two new architectures based on temporal attention (TA-AVSUM) and spatial attention (SA-AVSUM). We conduct video summarization experiments on the TvSum database. The proposed AVSUM-GRU architecture with an early fusion of high level GRU embeddings and the temporal attention based TA-AVSUM architecture attain competitive video summarization performances by bringing strong performance improvements for the human-centric videos compared to the state-of-the-art in terms of F-score and self-defined face recall metrics.
    A hybrid deep learning framework for Covid-19 detection via 3D Chest CT Images. (arXiv:2107.03904v1 [eess.IV])
    (2 min) In this paper, we present a hybrid deep learning framework named CTNet which combines convolutional neural network and transformer together for the detection of COVID-19 via 3D chest CT images. It consists of a CNN feature extractor module with SE attention to extract sufficient features from CT scans, together with a transformer model to model the discriminative features of the 3D CT scans. Compared to previous works, CTNet provides an effective and efficient method to perform COVID-19 diagnosis via 3D CT scans with data resampling strategy. Advanced results on a large and public benchmarks, COV19-CT-DB database was achieved by the proposed CTNet, over the state-of-the-art baseline approachproposed together with the dataset.
    Adversarial Semantic Hallucination for Domain Generalized Semantic Segmentation. (arXiv:2106.04144v2 [cs.CV] UPDATED)
    (2 min) Convolutional neural networks may perform poorly when the test and train data are from different domains. While this problem can be mitigated by using the target domain data to align the source and target domain feature representations, the target domain data may be unavailable due to privacy concerns. Consequently, there is a need for methods that generalize well without access to target domain data during training. In this work, we propose an adversarial hallucination approach, which combines a class-wise hallucination module and a semantic segmentation module. Since the segmentation performance varies across different classes, we design a semantic-conditioned style hallucination layer to adaptively stylize each class. The classwise stylization parameters are generated from the semantic knowledge in the segmentation probability maps of the source domain image. Both modules compete adversarially, with the hallucination module generating increasingly 'difficult' style images to challenge the segmentation module. In response, the segmentation module improves its performance as it is trained with generated samples at an appropriate class-wise difficulty level. Experiments on state of the art domain adaptation work demonstrate the efficacy of our proposed method when no target domain data are available for training.
    OXnet: Omni-supervised Thoracic Disease Detection from Chest X-rays. (arXiv:2104.03218v2 [cs.CV] UPDATED)
    (2 min) Chest X-ray (CXR) is the most typical diagnostic X-ray examination for screening various thoracic diseases. Automatically localizing lesions from CXR is promising for alleviating radiologists' reading burden. However, CXR datasets are often with massive image-level annotations and scarce lesion-level annotations, and more often, without annotations. Thus far, unifying different supervision granularities to develop thoracic disease detection algorithms has not been comprehensively addressed. In this paper, we present OXnet, the first deep omni-supervised thoracic disease detection network to our best knowledge that uses as much available supervision as possible for CXR diagnosis. We first introduce supervised learning via a one-stage detection model. Then, we inject a global classification head to the detection model and propose dual attention alignment to guide the global gradient to the local detection branch, which enables learning lesion detection from image-level annotations. We also impose intra-class compactness and inter-class separability with global prototype alignment to further enhance the global information learning. Moreover, we leverage a soft focal loss to distill the soft pseudo-labels of unlabeled data generated by a teacher model. Extensive experiments on a large-scale chest X-ray dataset show the proposed OXnet outperforms competitive methods with significant margins. Further, we investigate omni-supervision under various annotation granularities and corroborate OXnet is a promising choice to mitigate the plight of annotation shortage for medical image diagnosis.
    CHASE: Robust Visual Tracking via Cell-Level Differentiable Neural Architecture Search. (arXiv:2107.03463v1 [cs.CV])
    (2 min) A strong visual object tracker nowadays relies on its well-crafted modules, which typically consist of manually-designed network architectures to deliver high-quality tracking results. Not surprisingly, the manual design process becomes a particularly challenging barrier, as it demands sufficient prior experience, enormous effort, intuition and perhaps some good luck. Meanwhile, neural architecture search has gaining grounds in practical applications such as image segmentation, as a promising method in tackling the issue of automated search of feasible network structures. In this work, we propose a novel cell-level differentiable architecture search mechanism to automate the network design of the tracking module, aiming to adapt backbone features to the objective of a tracking network during offline training. The proposed approach is simple, efficient, and with no need to stack a series of modules to construct a network. Our approach is easy to be incorporated into existing trackers, which is empirically validated using different differentiable architecture search-based methods and tracking objectives. Extensive experimental evaluations demonstrate the superior performance of our approach over five commonly-used benchmarks. Meanwhile, our automated searching process takes 41 (18) hours for the second (first) order DARTS method on the TrackingNet dataset.
    Case-based similar image retrieval for weakly annotated large histopathological images of malignant lymphoma using deep metric learning. (arXiv:2107.03602v1 [cs.CV])
    (2 min) In the present study, we propose a novel case-based similar image retrieval (SIR) method for hematoxylin and eosin (H&E)-stained histopathological images of malignant lymphoma. When a whole slide image (WSI) is used as an input query, it is desirable to be able to retrieve similar cases by focusing on image patches in pathologically important regions such as tumor cells. To address this problem, we employ attention-based multiple instance learning, which enables us to focus on tumor-specific regions when the similarity between cases is computed. Moreover, we employ contrastive distance metric learning to incorporate immunohistochemical (IHC) staining patterns as useful supervised information for defining appropriate similarity between heterogeneous malignant lymphoma cases. In the experiment with 249 malignant lymphoma patients, we confirmed that the proposed method exhibited higher evaluation measures than the baseline case-based SIR methods. Furthermore, the subjective evaluation by pathologists revealed that our similarity measure using IHC staining patterns is appropriate for representing the similarity of H&E-stained tissue images for malignant lymphoma.
    Technical Report for Valence-Arousal Estimation in ABAW2 Challenge. (arXiv:2107.03891v1 [cs.CV])
    (2 min) In this work, we describe our method for tackling the valence-arousal estimation challenge from ABAW2 ICCV-2021 Competition. The competition organizers provide an in-the-wild Aff-Wild2 dataset for participants to analyze affective behavior in real-life settings. We use a two stream model to learn emotion features from appearance and action respectively. To solve data imbalanced problem, we apply label distribution smoothing (LDS) to re-weight labels. Our proposed method achieves Concordance Correlation Coefficient (CCC) of 0.591 and 0.617 for valence and arousal on the validation set of Aff-wild2 dataset.
    Rethinking annotation granularity for overcoming deep shortcut learning: A retrospective study on chest radiographs. (arXiv:2104.10553v2 [eess.IV] UPDATED)
    (2 min) Deep learning has demonstrated radiograph screening performances that are comparable or superior to radiologists. However, recent studies show that deep models for thoracic disease classification usually show degraded performance when applied to external data. Such phenomena can be categorized into shortcut learning, where the deep models learn unintended decision rules that can fit the identically distributed training and test set but fail to generalize to other distributions. A natural way to alleviate this defect is explicitly indicating the lesions and focusing the model on learning the intended features. In this paper, we conduct extensive retrospective experiments to compare a popular thoracic disease classification model, CheXNet, and a thoracic lesion detection model, CheXDet. We first showed that the two models achieved similar image-level classification performance on the internal test set with no significant differences under many scenarios. Meanwhile, we found incorporating external training data even led to performance degradation for CheXNet. Then, we compared the models' internal performance on the lesion localization task and showed that CheXDet achieved significantly better performance than CheXNet even when given 80% less training data. By further visualizing the models' decision-making regions, we revealed that CheXNet learned patterns other than the target lesions, demonstrating its shortcut learning defect. Moreover, CheXDet achieved significantly better external performance than CheXNet on both the image-level classification task and the lesion localization task. Our findings suggest improving annotation granularity for training deep learning systems as a promising way to elevate future deep learning-based diagnosis systems for clinical usage.
    Tensor Methods in Computer Vision and Deep Learning. (arXiv:2107.03436v1 [cs.CV])
    (2 min) Tensors, or multidimensional arrays, are data structures that can naturally represent visual data of multiple dimensions. Inherently able to efficiently capture structured, latent semantic spaces and high-order interactions, tensors have a long history of applications in a wide span of computer vision problems. With the advent of the deep learning paradigm shift in computer vision, tensors have become even more fundamental. Indeed, essential ingredients in modern deep learning architectures, such as convolutions and attention mechanisms, can readily be considered as tensor mappings. In effect, tensor methods are increasingly finding significant applications in deep learning, including the design of memory and compute efficient network architectures, improving robustness to random noise and adversarial attacks, and aiding the theoretical understanding of deep networks. This article provides an in-depth and practical review of tensors and tensor methods in the context of representation learning and deep learning, with a particular focus on visual data analysis and computer vision applications. Concretely, besides fundamental work in tensor-based visual data analysis methods, we focus on recent developments that have brought on a gradual increase of tensor methods, especially in deep learning architectures, and their implications in computer vision applications. To further enable the newcomer to grasp such concepts quickly, we provide companion Python notebooks, covering key aspects of the paper and implementing them, step-by-step with TensorLy.
    RMA: Rapid Motor Adaptation for Legged Robots. (arXiv:2107.04034v1 [cs.LG])
    (2 min) Successful real-world deployment of legged robots would require them to adapt in real-time to unseen scenarios like changing terrains, changing payloads, wear and tear. This paper presents Rapid Motor Adaptation (RMA) algorithm to solve this problem of real-time online adaptation in quadruped robots. RMA consists of two components: a base policy and an adaptation module. The combination of these components enables the robot to adapt to novel situations in fractions of a second. RMA is trained completely in simulation without using any domain knowledge like reference trajectories or predefined foot trajectory generators and is deployed on the A1 robot without any fine-tuning. We train RMA on a varied terrain generator using bioenergetics-inspired rewards and deploy it on a variety of difficult terrains including rocky, slippery, deformable surfaces in environments with grass, long vegetation, concrete, pebbles, stairs, sand, etc. RMA shows state-of-the-art performance across diverse real-world as well as simulation experiments. Video results at https://ashish-kmr.github.io/rma-legged-robots/
    AC-VRNN: Attentive Conditional-VRNN for Multi-Future Trajectory Prediction. (arXiv:2005.08307v2 [cs.CV] UPDATED)
    (2 min) Anticipating human motion in crowded scenarios is essential for developing intelligent transportation systems, social-aware robots and advanced video surveillance applications. A key component of this task is represented by the inherently multi-modal nature of human paths which makes socially acceptable multiple futures when human interactions are involved. To this end, we propose a generative architecture for multi-future trajectory predictions based on Conditional Variational Recurrent Neural Networks (C-VRNNs). Conditioning mainly relies on prior belief maps, representing most likely moving directions and forcing the model to consider past observed dynamics in generating future positions. Human interactions are modeled with a graph-based attention mechanism enabling an online attentive hidden state refinement of the recurrent estimation. To corroborate our model, we perform extensive experiments on publicly-available datasets (e.g., ETH/UCY, Stanford Drone Dataset, STATS SportVU NBA, Intersection Drone Dataset and TrajNet++) and demonstrate its effectiveness in crowded scenes compared to several state-of-the-art methods.
    Uncertainty-aware Human Motion Prediction. (arXiv:2107.03575v1 [cs.CV])
    (2 min) Human motion prediction is essential for tasks such as human motion analysis and human-robot interactions. Most existing approaches have been proposed to realize motion prediction. However, they ignore an important task, the evaluation of the quality of the predicted result. It is far more enough for current approaches in actual scenarios because people can't know how to interact with the machine without the evaluation of prediction, and unreliable predictions may mislead the machine to harm the human. Hence, we propose an uncertainty-aware framework for human motion prediction (UA-HMP). Concretely, we first design an uncertainty-aware predictor through Gaussian modeling to achieve the value and the uncertainty of predicted motion. Then, an uncertainty-guided learning scheme is proposed to quantitate the uncertainty and reduce the negative effect of the noisy samples during optimization for better performance. Our proposed framework is easily combined with current SOTA baselines to overcome their weakness in uncertainty modeling with slight parameters increment. Extensive experiments also show that they can achieve better performance in both short and long-term predictions in H3.6M, CMU-Mocap.
    Worsening Perception: Real-time Degradation of Autonomous Vehicle Perception Performance for Simulation of Adverse Weather Conditions. (arXiv:2103.02760v4 [cs.RO] UPDATED)
    (3 min) Autonomous vehicles rely heavily upon their perception subsystems to see the environment in which they operate. Unfortunately, the effect of variable weather conditions presents a significant challenge to object detection algorithms, and thus it is imperative to test the vehicle extensively in all conditions which it may experience. However, development of robust autonomous vehicle subsystems requires repeatable, controlled testing - while real weather is unpredictable and cannot be scheduled. Real-world testing in adverse conditions is an expensive and time-consuming task, often requiring access to specialist facilities. Simulation is commonly relied upon as a substitute, with increasingly visually realistic representations of the real-world being developed. In the context of the complete autonomous vehicle control pipeline, subsystems downstream of perception need to be tested with accurate recreations of the perception system output, rather than focusing on subjective visual realism of the input - whether in simulation or the real world. This study develops the untapped potential of a lightweight weather augmentation method in an autonomous racing vehicle - focusing not on visual accuracy, but rather the effect upon perception subsystem performance in real time. With minimal adjustment, the prototype developed in this study can replicate the effects of water droplets on the camera lens, and fading light conditions. This approach introduces a latency of less than 8 ms using compute hardware well suited to being carried in the vehicle - rendering it ideal for real-time implementation that can be run during experiments in simulation, and augmented reality testing in the real world.
    Causal affect prediction model using a facial image sequence. (arXiv:2107.03886v1 [cs.CV])
    (2 min) Among human affective behavior research, facial expression recognition research is improving in performance along with the development of deep learning. However, for improved performance, not only past images but also future images should be used along with corresponding facial images, but there are obstacles to the application of this technique to real-time environments. In this paper, we propose the causal affect prediction network (CAPNet), which uses only past facial images to predict corresponding affective valence and arousal. We train CAPNet to learn causal inference between past images and corresponding affective valence and arousal through supervised learning by pairing the sequence of past images with the current label using the Aff-Wild2 dataset. We show through experiments that the well-trained CAPNet outperforms the baseline of the second challenge of the Affective Behavior Analysis in-the-wild (ABAW2) Competition by predicting affective valence and arousal only with past facial images one-third of a second earlier. Therefore, in real-time application, CAPNet can reliably predict affective valence and arousal only with past data.
    $S^3$: Sign-Sparse-Shift Reparametrization for Effective Training of Low-bit Shift Networks. (arXiv:2107.03453v1 [cs.CV])
    (2 min) Shift neural networks reduce computation complexity by removing expensive multiplication operations and quantizing continuous weights into low-bit discrete values, which are fast and energy efficient compared to conventional neural networks. However, existing shift networks are sensitive to the weight initialization, and also yield a degraded performance caused by vanishing gradient and weight sign freezing problem. To address these issues, we propose S low-bit re-parameterization, a novel technique for training low-bit shift networks. Our method decomposes a discrete parameter in a sign-sparse-shift 3-fold manner. In this way, it efficiently learns a low-bit network with a weight dynamics similar to full-precision networks and insensitive to weight initialization. Our proposed training method pushes the boundaries of shift neural networks and shows 3-bit shift networks out-performs their full-precision counterparts in terms of top-1 accuracy on ImageNet.
    NccFlow: Unsupervised Learning of Optical Flow With Non-occlusion from Geometry. (arXiv:2107.03610v1 [cs.CV])
    (2 min) Optical flow estimation is a fundamental problem of computer vision and has many applications in the fields of robot learning and autonomous driving. This paper reveals novel geometric laws of optical flow based on the insight and detailed definition of non-occlusion. Then, two novel loss functions are proposed for the unsupervised learning of optical flow based on the geometric laws of non-occlusion. Specifically, after the occlusion part of the images are masked, the flowing process of pixels is carefully considered and geometric constraints are conducted based on the geometric laws of optical flow. First, neighboring pixels in the first frame will not intersect during the pixel displacement to the second frame. Secondly, when the cluster containing adjacent four pixels in the first frame moves to the second frame, no other pixels will flow into the quadrilateral formed by them. According to the two geometrical constraints, the optical flow non-intersection loss and the optical flow non-blocking loss in the non-occlusion regions are proposed. Two loss functions punish the irregular and inexact optical flows in the non-occlusion regions. The experiments on datasets demonstrated that the proposed unsupervised losses of optical flow based on the geometric laws in non-occlusion regions make the estimated optical flow more refined in detail, and improve the performance of unsupervised learning of optical flow. In addition, the experiments training on synthetic data and evaluating on real data show that the generalization ability of optical flow network is improved by our proposed unsupervised approach.
    An audiovisual and contextual approach for categorical and continuous emotion recognition in-the-wild. (arXiv:2107.03465v1 [cs.CV])
    (2 min) In this work we tackle the task of video-based audio-visual emotion recognition, within the premises of the 2nd Workshop and Competition on Affective Behavior Analysis in-the-wild (ABAW). Standard methodologies that rely solely on the extraction of facial features often fall short of accurate emotion prediction in cases where the aforementioned source of affective information is inaccessible due to head/body orientation, low resolution and poor illumination. We aspire to alleviate this problem by leveraging bodily as well as contextual features, as part of a broader emotion recognition framework. A standard CNN-RNN cascade constitutes the backbone of our proposed model for sequence-to-sequence (seq2seq) learning. Apart from learning through the \textit{RGB} input modality, we construct an aural stream which operates on sequences of extracted mel-spectrograms. Our extensive experiments on the challenging and newly assembled Affect-in-the-wild-2 (Aff-Wild2) dataset verify the superiority of our methods over existing approaches, while by properly incorporating all of the aforementioned modules in a network ensemble, we manage to surpass the previous best published recognition scores, in the official validation set. All the code was implemented using PyTorch\footnote{\url{https://pytorch.org/}} and is publicly available\footnote{\url{https://github.com/PanosAntoniadis/NTUA-ABAW2021}}.
    EEG-ConvTransformer for Single-Trial EEG based Visual Stimuli Classification. (arXiv:2107.03983v1 [cs.CV])
    (2 min) Different categories of visual stimuli activate different responses in the human brain. These signals can be captured with EEG for utilization in applications such as Brain-Computer Interface (BCI). However, accurate classification of single-trial data is challenging due to low signal-to-noise ratio of EEG. This work introduces an EEG-ConvTranformer network that is based on multi-headed self-attention. Unlike other transformers, the model incorporates self-attention to capture inter-region interactions. It further extends to adjunct convolutional filters with multi-head attention as a single module to learn temporal patterns. Experimental results demonstrate that EEG-ConvTransformer achieves improved classification accuracy over the state-of-the-art techniques across five different visual stimuli classification tasks. Finally, quantitative analysis of inter-head diversity also shows low similarity in representational subspaces, emphasizing the implicit diversity of multi-head attention.
    Multi-Modality Task Cascade for 3D Object Detection. (arXiv:2107.04013v1 [cs.CV])
    (2 min) Point clouds and RGB images are naturally complementary modalities for 3D visual understanding - the former provides sparse but accurate locations of points on objects, while the latter contains dense color and texture information. Despite this potential for close sensor fusion, many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data. This separated training scheme results in potentially sub-optimal performance and prevents 3D tasks from being used to benefit 2D tasks that are often useful on their own. To provide a more integrated approach, we propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions, which are then used to further refine the 3D boxes. We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance. Moreover, to prevent the 3D module from over-relying on the overfitted 2D predictions, we propose a dual-head 2D segmentation training and inference scheme, allowing the 2nd 3D module to learn to interpret imperfect 2D segmentation predictions. Evaluating our model on the challenging SUN RGB-D dataset, we improve upon state-of-the-art results of both single modality and fusion networks by a large margin ($\textbf{+3.8}$ mAP@0.5). Code will be released $\href{https://github.com/Divadi/MTC_RCNN}{\text{here.}}$
    Collaboration of Experts: Achieving 80% Top-1 Accuracy on ImageNet with 100M FLOPs. (arXiv:2107.03815v1 [cs.CV])
    (2 min) In this paper, we propose a Collaboration of Experts (CoE) framework to pool together the expertise of multiple networks towards a common aim. Each expert is an individual network with expertise on a unique portion of the dataset, which enhances the collective capacity. Given a sample, an expert is selected by the delegator, which simultaneously outputs a rough prediction to support early termination. To fulfill this framework, we propose three modules to impel each model to play its role, namely weight generation module (WGM), label generation module (LGM) and variance calculation module (VCM). Our method achieves the state-of-the-art performance on ImageNet, 80.7% top-1 accuracy with 194M FLOPs. Combined with PWLU activation function and CondConv, CoE further achieves the accuracy of 80.0% with only 100M FLOPs for the first time. More importantly, our method is hardware friendly and achieves a 3-6x speedup compared with some existing conditional computation approaches.
    Crowd Counting via Perspective-Guided Fractional-Dilation Convolution. (arXiv:2107.03665v1 [cs.CV])
    (2 min) Crowd counting is critical for numerous video surveillance scenarios. One of the main issues in this task is how to handle the dramatic scale variations of pedestrians caused by the perspective effect. To address this issue, this paper proposes a novel convolution neural network-based crowd counting method, termed Perspective-guided Fractional-Dilation Network (PFDNet). By modeling the continuous scale variations, the proposed PFDNet is able to select the proper fractional dilation kernels for adapting to different spatial locations. It significantly improves the flexibility of the state-of-the-arts that only consider the discrete representative scales. In addition, by avoiding the multi-scale or multi-column architecture that used in other methods, it is computationally more efficient. In practice, the proposed PFDNet is constructed by stacking multiple Perspective-guided Fractional-Dilation Convolutions (PFC) on a VGG16-BN backbone. By introducing a novel generalized dilation convolution operation, the PFC can handle fractional dilation ratios in the spatial domain under the guidance of perspective annotations, achieving continuous scales modeling of pedestrians. To deal with the problem of unavailable perspective information in some cases, we further introduce an effective perspective estimation branch to the proposed PFDNet, which can be trained in either supervised or weakly-supervised setting once the branch has been pre-trained. Extensive experiments show that the proposed PFDNet outperforms state-of-the-art methods on ShanghaiTech A, ShanghaiTech B, WorldExpo'10, UCF-QNRF, UCF_CC_50 and TRANCOS dataset, achieving MAE 53.8, 6.5, 6.8, 84.3, 205.8, and 3.06 respectively.
    Relation-Based Associative Joint Location for Human Pose Estimation in Videos. (arXiv:2107.03591v1 [cs.CV])
    (2 min) Video-based human pose estimation (HPE) is a vital yet challenging task. While deep learning methods have made significant progress for the HPE, most approaches to this task detect each joint independently, damaging the pose structural information. In this paper, unlike the prior methods, we propose a Relation-based Pose Semantics Transfer Network (RPSTN) to locate joints associatively. Specifically, we design a lightweight joint relation extractor (JRE) to model the pose structural features and associatively generate heatmaps for joints by modeling the relation between any two joints heuristically instead of building each joint heatmap independently. Actually, the proposed JRE module models the spatial configuration of human poses through the relationship between any two joints. Moreover, considering the temporal semantic continuity of videos, the pose semantic information in the current frame is beneficial for guiding the location of joints in the next frame. Therefore, we use the idea of knowledge reuse to propagate the pose semantic information between consecutive frames. In this way, the proposed RPSTN captures temporal dynamics of poses. On the one hand, the JRE module can infer invisible joints according to the relationship between the invisible joints and other visible joints in space. On the other hand, in the time, the propose model can transfer the pose semantic features from the non-occluded frame to the occluded frame to locate occluded joints. Therefore, our method is robust to the occlusion and achieves state-of-the-art results on the two challenging datasets, which demonstrates its effectiveness for video-based human pose estimation. We will release the code and models publicly.
    Joint Motion Correction and Super Resolution for Cardiac Segmentation via Latent Optimisation. (arXiv:2107.03887v1 [eess.IV])
    (2 min) In cardiac magnetic resonance (CMR) imaging, a 3D high-resolution segmentation of the heart is essential for detailed description of its anatomical structures. However, due to the limit of acquisition duration and respiratory/cardiac motion, stacks of multi-slice 2D images are acquired in clinical routine. The segmentation of these images provides a low-resolution representation of cardiac anatomy, which may contain artefacts caused by motion. Here we propose a novel latent optimisation framework that jointly performs motion correction and super resolution for cardiac image segmentations. Given a low-resolution segmentation as input, the framework accounts for inter-slice motion in cardiac MR imaging and super-resolves the input into a high-resolution segmentation consistent with input. A multi-view loss is incorporated to leverage information from both short-axis view and long-axis view of cardiac imaging. To solve the inverse problem, iterative optimisation is performed in a latent space, which ensures the anatomical plausibility. This alleviates the need of paired low-resolution and high-resolution images for supervised learning. Experiments on two cardiac MR datasets show that the proposed framework achieves high performance, comparable to state-of-the-art super-resolution approaches and with better cross-domain generalisability and anatomical plausibility.
    Multi-frame Collaboration for Effective Endoscopic Video Polyp Detection via Spatial-Temporal Feature Transformation. (arXiv:2107.03609v1 [cs.CV])
    (2 min) Precise localization of polyp is crucial for early cancer screening in gastrointestinal endoscopy. Videos given by endoscopy bring both richer contextual information as well as more challenges than still images. The camera-moving situation, instead of the common camera-fixed-object-moving one, leads to significant background variation between frames. Severe internal artifacts (e.g. water flow in the human body, specular reflection by tissues) can make the quality of adjacent frames vary considerately. These factors hinder a video-based model to effectively aggregate features from neighborhood frames and give better predictions. In this paper, we present Spatial-Temporal Feature Transformation (STFT), a multi-frame collaborative framework to address these issues. Spatially, STFT mitigates inter-frame variations in the camera-moving situation with feature alignment by proposal-guided deformable convolutions. Temporally, STFT proposes a channel-aware attention module to simultaneously estimate the quality and correlation of adjacent frames for adaptive feature aggregation. Empirical studies and superior results demonstrate the effectiveness and stability of our method. For example, STFT improves the still image baseline FCOS by 10.6% and 20.6% on the comprehensive F1-score of the polyp localization task in CVC-Clinic and ASUMayo datasets, respectively, and outperforms the state-of-the-art video-based method by 3.6% and 8.0%, respectively. Code is available at \url{https://github.com/lingyunwu14/STFT}.
    Adiabatic Quantum Graph Matching with Permutation Matrix Constraints. (arXiv:2107.04032v1 [cs.CV])
    (2 min) Matching problems on 3D shapes and images are challenging as they are frequently formulated as combinatorial quadratic assignment problems (QAPs) with permutation matrix constraints, which are NP-hard. In this work, we address such problems with emerging quantum computing technology and propose several reformulations of QAPs as unconstrained problems suitable for efficient execution on quantum hardware. We investigate several ways to inject permutation matrix constraints in a quadratic unconstrained binary optimization problem which can be mapped to quantum hardware. We focus on obtaining a sufficient spectral gap, which further increases the probability to measure optimal solutions and valid permutation matrices in a single run. We perform our experiments on the quantum computer D-Wave 2000Q (2^11 qubits, adiabatic). Despite the observed discrepancy between simulated adiabatic quantum computing and execution on real quantum hardware, our reformulation of permutation matrix constraints increases the robustness of the numerical computations over other penalty approaches in our experiments. The proposed algorithm has the potential to scale to higher dimensions on future quantum computing architectures, which opens up multiple new directions for solving matching problems in 3D computer vision and graphics.
    3D Neural Scene Representations for Visuomotor Control. (arXiv:2107.04004v1 [cs.RO])
    (2 min) Humans have a strong intuitive understanding of the 3D environment around us. The mental model of the physics in our brain applies to objects of different materials and enables us to perform a wide range of manipulation tasks that are far beyond the reach of current robots. In this work, we desire to learn models for dynamic 3D scenes purely from 2D visual observations. Our model combines Neural Radiance Fields (NeRF) and time contrastive learning with an autoencoding framework, which learns viewpoint-invariant 3D-aware scene representations. We show that a dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks involving both rigid bodies and fluids, where the target is specified in a viewpoint different from what the robot operates on. When coupled with an auto-decoding framework, it can even support goal specification from camera viewpoints that are outside the training distribution. We further demonstrate the richness of the learned 3D dynamics model by performing future prediction and novel view synthesis. Finally, we provide detailed ablation studies regarding different system designs and qualitative analysis of the learned representations.
    LanguageRefer: Spatial-Language Model for 3D Visual Grounding. (arXiv:2107.03438v1 [cs.RO])
    (2 min) To realize robots that can understand human instructions and perform meaningful tasks in the near future, it is important to develop learned models that can understand referential language to identify common objects in real-world 3D scenes. In this paper, we develop a spatial-language model for a 3D visual grounding problem. Specifically, given a reconstructed 3D scene in the form of a point cloud with 3D bounding boxes of potential object candidates, and a language utterance referring to a target object in the scene, our model identifies the target object from a set of potential candidates. Our spatial-language model uses a transformer-based architecture that combines spatial embedding from bounding-box with a finetuned language embedding from DistilBert and reasons among the objects in the 3D scene to find the target object. We show that our model performs competitively on visio-linguistic datasets proposed by ReferIt3D. We provide additional analysis of performance in spatial reasoning tasks decoupled from perception noise, the effect of view-dependent utterances in terms of accuracy, and view-point annotations for potential robotics applications.
    Optimizing Data Processing in Space for Object Detection in Satellite Imagery. (arXiv:2107.03774v1 [cs.CV])
    (3 min) There is a proliferation in the number of satellites launched each year, resulting in downlinking of terabytes of data each day. The data received by ground stations is often unprocessed, making this an expensive process considering the large data sizes and that not all of the data is useful. This, coupled with the increasing demand for real-time data processing, has led to a growing need for on-orbit processing solutions. In this work, we investigate the performance of CNN-based object detectors on constrained devices by applying different image compression techniques to satellite data. We examine the capabilities of the NVIDIA Jetson Nano and NVIDIA Jetson AGX Xavier; low-power, high-performance computers, with integrated GPUs, small enough to fit on-board a nanosatellite. We take a closer look at object detection networks, including the Single Shot MultiBox Detector (SSD) and Region-based Fully Convolutional Network (R-FCN) models that are pre-trained on DOTA - a Large Scale Dataset for Object Detection in Aerial Images. The performance is measured in terms of execution time, memory consumption, and accuracy, and are compared against a baseline containing a server with two powerful GPUs. The results show that by applying image compression techniques, we are able to improve the execution time and memory consumption, achieving a fully runnable dataset. A lossless compression technique achieves roughly a 10% reduction in execution time and about a 3% reduction in memory consumption, with no impact on the accuracy. While a lossy compression technique improves the execution time by up to 144% and the memory consumption is reduced by as much as 97%. However, it has a significant impact on accuracy, varying depending on the compression ratio. Thus the application and ratio of these compression techniques may differ depending on the required level of accuracy for a particular task.
    Investigate the Essence of Long-Tailed Recognition from a Unified Perspective. (arXiv:2107.03758v1 [cs.CV])
    (2 min) As the data scale grows, deep recognition models often suffer from long-tailed data distributions due to the heavy imbalanced sample number across categories. Indeed, real-world data usually exhibit some similarity relation among different categories (e.g., pigeons and sparrows), called category similarity in this work. It is doubly difficult when the imbalance occurs between such categories with similar appearances. However, existing solutions mainly focus on the sample number to re-balance data distribution. In this work, we systematically investigate the essence of the long-tailed problem from a unified perspective. Specifically, we demonstrate that long-tailed recognition suffers from both sample number and category similarity. Intuitively, using a toy example, we first show that sample number is not the unique influence factor for performance dropping of long-tailed recognition. Theoretically, we demonstrate that (1) category similarity, as an inevitable factor, would also influence the model learning under long-tailed distribution via similar samples, (2) using more discriminative representation methods (e.g., self-supervised learning) for similarity reduction, the classifier bias can be further alleviated with greatly improved performance. Extensive experiments on several long-tailed datasets verify the rationality of our theoretical analysis, and show that based on existing state-of-the-arts (SOTAs), the performance could be further improved by similarity reduction. Our investigations highlight the essence behind the long-tailed problem, and claim several feasible directions for future work.
    Staying in Shape: Learning Invariant Shape Representations using Contrastive Learning. (arXiv:2107.03552v1 [cs.CV])
    (2 min) Creating representations of shapes that are invari-ant to isometric or almost-isometric transforma-tions has long been an area of interest in shape anal-ysis, since enforcing invariance allows the learningof more effective and robust shape representations.Most existing invariant shape representations arehandcrafted, and previous work on learning shaperepresentations do not focus on producing invariantrepresentations. To solve the problem of learningunsupervised invariant shape representations, weuse contrastive learning, which produces discrimi-native representations through learning invarianceto user-specified data augmentations. To producerepresentations that are specifically isometry andalmost-isometry invariant, we propose new dataaugmentations that randomly sample these transfor-mations. We show experimentally that our methodoutperforms previous unsupervised learning ap-proaches in both effectiveness and robustness.
    Prior Aided Streaming Network for Multi-task Affective Recognitionat the 2nd ABAW2 Competition. (arXiv:2107.03708v1 [cs.CV])
    (2 min) Automatic affective recognition has been an important research topic in human computer interaction (HCI) area. With recent development of deep learning techniques and large scale in-the-wild annotated datasets, the facial emotion analysis is now aimed at challenges in the real world settings. In this paper, we introduce our submission to the 2nd Affective Behavior Analysis in-the-wild (ABAW2) Competition. In dealing with different emotion representations, including Categorical Emotions (CE), Action Units (AU), and Valence Arousal (VA), we propose a multi-task streaming network by a heuristic that the three representations are intrinsically associated with each other. Besides, we leverage an advanced facial expression embedding as prior knowledge, which is capable of capturing identity-invariant expression features while preserving the expression similarities, to aid the down-streaming recognition tasks. The extensive quantitative evaluations as well as ablation studies on the Aff-Wild2 dataset prove the effectiveness of our proposed prior aided streaming network approach.
    Automated Object Behavioral Feature Extraction for Potential Risk Analysis based on Video Sensor. (arXiv:2107.03554v1 [cs.CV])
    (2 min) Pedestrians are exposed to risk of death or serious injuries on roads, especially unsignalized crosswalks, for a variety of reasons. To date, an extensive variety of studies have reported on vision based traffic safety system. However, many studies required manual inspection of the volumes of traffic video to reliably obtain traffic related objects behavioral factors. In this paper, we propose an automated and simpler system for effectively extracting object behavioral features from video sensors deployed on the road. We conduct basic statistical analysis on these features, and show how they can be useful for monitoring the traffic behavior on the road. We confirm the feasibility of the proposed system by applying our prototype to two unsignalized crosswalks in Osan city, South Korea. To conclude, we compare behaviors of vehicles and pedestrians in those two areas by simple statistical analysis. This study demonstrates the potential for a network of connected video sensors to provide actionable data for smart cities to improve pedestrian safety in dangerous road environments.
  • cs.IR updates on arXiv.org

    Fast Multi-Step Critiquing for VAE-based Recommender Systems. (arXiv:2105.00774v2 [cs.IR] UPDATED)
    (2 min) Recent studies have shown that providing personalized explanations alongside recommendations increases trust and perceived quality. Furthermore, it gives users an opportunity to refine the recommendations by critiquing parts of the explanations. On one hand, current recommender systems model the recommendation, explanation, and critiquing objectives jointly, but this creates an inherent trade-off between their respective performance. On the other hand, although recent latent linear critiquing approaches are built upon an existing recommender system, they suffer from computational inefficiency at inference due to the objective optimized at each conversation's turn. We address these deficiencies with M&Ms-VAE, a novel variational autoencoder for recommendation and explanation that is based on multimodal modeling assumptions. We train the model under a weak supervision scheme to simulate both fully and partially observed variables. Then, we leverage the generalization ability of a trained M&Ms-VAE model to embed the user preference and the critique separately. Our work's most important innovation is our critiquing module, which is built upon and trained in a self-supervised manner with a simple ranking objective. Experiments on four real-world datasets demonstrate that among state-of-the-art models, our system is the first to dominate or match the performance in terms of recommendation, explanation, and multi-step critiquing. Moreover, M&Ms-VAE processes the critiques up to 25.6x faster than the best baselines. Finally, we show that our model infers coherent joint and cross generation, even under weak supervision, thanks to our multimodal-based modeling and training scheme.
    WenLan: Bridging Vision and Language by Large-Scale Multi-Modal Pre-Training. (arXiv:2103.06561v6 [cs.CV] UPDATED)
    (3 min) Multi-modal pre-training models have been intensively explored to bridge vision and language in recent years. However, most of them explicitly model the cross-modal interaction between image-text pairs, by assuming that there exists strong semantic correlation between the text and image modalities. Since this strong assumption is often invalid in real-world scenarios, we choose to implicitly model the cross-modal correlation for large-scale multi-modal pre-training, which is the focus of the Chinese project `WenLan' led by our team. Specifically, with the weak correlation assumption over image-text pairs, we propose a two-tower pre-training model called BriVL within the cross-modal contrastive learning framework. Unlike OpenAI CLIP that adopts a simple contrastive learning method, we devise a more advanced algorithm by adapting the latest method MoCo into the cross-modal scenario. By building a large queue-based dictionary, our BriVL can incorporate more negative samples in limited GPU resources. We further construct a large Chinese multi-source image-text dataset called RUC-CAS-WenLan for pre-training our BriVL model. Extensive experiments demonstrate that the pre-trained BriVL model outperforms both UNITER and OpenAI CLIP on various downstream tasks.
    Evaluation of Field-Aware Neural Ranking Models for Recipe Search. (arXiv:2105.05710v2 [cs.IR] UPDATED)
    (2 min) Explicitly modelling field interactions and correlations in complex document structures has recently gained popularity in neural document embedding and retrieval tasks. Although this requires the specification of bespoke task-dependent models, encouraging empirical results are beginning to emerge. We present the first in-depth analyses of non-linear multi-field interaction (NL-MFI) ranking in the cooking domain in this work. Our results show that field-weighted factorisation machines models provide a statistically significant improvement over baselines in recipe retrieval tasks. Additionally, we show that sparsely capturing subsets of field interactions based on domain knowledge and feature selection heuristics offers significant advantages over baselines and exhaustive alternatives. Although field-interaction aware models are more elaborate from an architectural basis, they are often more data-efficient in optimisation and are better suited for explainability due to mirrored document and model factorisation.
    A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models. (arXiv:2107.03844v1 [cs.CL])
    (2 min) Bangla -- ranked as the 6th most widely spoken language across the world (https://www.ethnologue.com/guides/ethnologue200), with 230 million native speakers -- is still considered as a low-resource language in the natural language processing (NLP) community. With three decades of research, Bangla NLP (BNLP) is still lagging behind mainly due to the scarcity of resources and the challenges that come with it. There is sparse work in different areas of BNLP; however, a thorough survey reporting previous work and recent advances is yet to be done. In this study, we first provide a review of Bangla NLP tasks, resources, and tools available to the research community; we benchmark datasets collected from various platforms for nine NLP tasks using current state-of-the-art algorithms (i.e., transformer-based models). We provide comparative results for the studied NLP tasks by comparing monolingual vs. multilingual models of varying sizes. We report our results using both individual and consolidated datasets and provide data splits for future research. We reviewed a total of 108 papers and conducted 175 sets of experiments. Our results show promising performance using transformer-based models while highlighting the trade-off with computational costs. We hope that such a comprehensive survey will motivate the community to build on and further advance the research on Bangla NLP.
    Heterogeneous Global Graph Neural Networks for Personalized Session-based Recommendation. (arXiv:2107.03813v1 [cs.IR])
    (2 min) Predicting the next interaction of a short-term interaction session is a challenging task in session-based recommendation. Almost all existing works rely on item transition patterns, and neglect the impact of user historical sessions while modeling user preference, which often leads to non-personalized recommendation. Additionally, existing personalized session-based recommenders capture user preference only based on the sessions of the current user, but ignore the useful item-transition patterns from other user's historical sessions. To address these issues, we propose a novel Heterogeneous Global Graph Neural Networks (HG-GNN) to exploit the item transitions over all sessions in a subtle manner for better inferring user preference from the current and historical sessions. To effectively exploit the item transitions over all sessions from users, we propose a novel heterogeneous global graph that contains item transitions of sessions, user-item interactions and global co-occurrence items. Moreover, to capture user preference from sessions comprehensively, we propose to learn two levels of user representations from the global graph via two graph augmented preference encoders. Specifically, we design a novel heterogeneous graph neural network (HGNN) on the heterogeneous global graph to learn the long-term user preference and item representations with rich semantics. Based on the HGNN, we propose the Current Preference Encoder and the Historical Preference Encoder to capture the different levels of user preference from the current and historical sessions, respectively. To achieve personalized recommendation, we integrate the representations of the user current preference and historical interests to generate the final user preference representation. Extensive experimental results on three real-world datasets show that our model outperforms other state-of-the-art methods.
    Graph Neural Pre-training for Enhancing Recommendations using Side Information. (arXiv:2107.03936v1 [cs.IR])
    (2 min) Leveraging the side information associated with entities (i.e.\ users and items) to enhance the performance of recommendation systems has been widely recognized as an important modelling dimension. While many existing approaches focus on the \emph{integration scheme} to incorporate entity side information -- by combining the recommendation loss function with an extra side information-aware loss -- in this paper, we propose instead a novel \emph{pre-training scheme} for leveraging the side information. In particular, we first pre-train a representation model using the side information of the entities, and then fine-tune it using an existing general representation-based recommendation model. Specifically, we propose two pre-training models, named \gcn{} and \com{}, by considering the entities and their relations constructed from side information as two different types of graphs respectively, to pre-train entity embeddings. For the \gcn{} model, two single-relational graphs are constructed from all the users' and items' side information respectively, to pre-train entity representations by using the Graph Convolutional Networks. For the \com{} model, two multi-relational graphs are constructed to pre-train the entity representations by using the Composition-based Graph Convolutional Networks. An extensive evaluation of our pre-training models fine-tuned under four general representation-based recommender models, i.e.\ MF, NCF, NGCF and LightGCN, shows that effectively pre-training embeddings with both the user's and item's side information can significantly improve these original models in terms of both effectiveness and stability.
    Rating and aspect-based opinion graph embeddings for explainable recommendations. (arXiv:2107.03385v1 [cs.IR])
    (2 min) The success of neural network embeddings has entailed a renewed interest in using knowledge graphs for a wide variety of machine learning and information retrieval tasks. In particular, recent recommendation methods based on graph embeddings have shown state-of-the-art performance. In general, these methods encode latent rating patterns and content features. Differently from previous work, in this paper, we propose to exploit embeddings extracted from graphs that combine information from ratings and aspect-based opinions expressed in textual reviews. We then adapt and evaluate state-of-the-art graph embedding techniques over graphs generated from Amazon and Yelp reviews on six domains, outperforming baseline recommenders. Additionally, our method has the advantage of providing explanations that involve the coverage of aspect-based opinions given by users about recommended items.
    A Graph-based Approach for Mitigating Multi-sided Exposure Bias in Recommender Systems. (arXiv:2107.03415v1 [cs.IR])
    (2 min) Fairness is a critical system-level objective in recommender systems that has been the subject of extensive recent research. A specific form of fairness is supplier exposure fairness where the objective is to ensure equitable coverage of items across all suppliers in recommendations provided to users. This is especially important in multistakeholder recommendation scenarios where it may be important to optimize utilities not just for the end-user, but also for other stakeholders such as item sellers or producers who desire a fair representation of their items. This type of supplier fairness is sometimes accomplished by attempting to increasing aggregate diversity in order to mitigate popularity bias and to improve the coverage of long-tail items in recommendations. In this paper, we introduce FairMatch, a general graph-based algorithm that works as a post processing approach after recommendation generation to improve exposure fairness for items and suppliers. The algorithm iteratively adds high quality items that have low visibility or items from suppliers with low exposure to the users' final recommendation lists. A comprehensive set of experiments on two datasets and comparison with state-of-the-art baselines show that FairMatch, while significantly improves exposure fairness and aggregate diversity, maintains an acceptable level of relevance of the recommendations.
    Unsupervised Proxy Selection for Session-based Recommender Systems. (arXiv:2107.03564v1 [cs.IR])
    (2 min) Session-based Recommender Systems (SRSs) have been actively developed to recommend the next item of an anonymous short item sequence (i.e., session). Unlike sequence-aware recommender systems where the whole interaction sequence of each user can be used to model both the short-term interest and the general interest of the user, the absence of user-dependent information in SRSs makes it difficult to directly derive the user's general interest from data. Therefore, existing SRSs have focused on how to effectively model the information about short-term interest within the sessions, but they are insufficient to capture the general interest of users. To this end, we propose a novel framework to overcome the limitation of SRSs, named ProxySR, which imitates the missing information in SRSs (i.e., general interest of users) by modeling proxies of sessions. ProxySR selects a proxy for the input session in an unsupervised manner, and combines it with the encoded short-term interest of the session. As a proxy is jointly learned with the short-term interest and selected by multiple sessions, a proxy learns to play the role of the general interest of a user and ProxySR learns how to select a suitable proxy for an input session. Moreover, we propose another real-world situation of SRSs where a few users are logged-in and leave their identifiers in sessions, and a revision of ProxySR for the situation. Our experiments on real-world datasets show that ProxySR considerably outperforms the state-of-the-art competitors, and the proxies successfully imitate the general interest of the users without any user-dependent information.
  • cs.LG updates on arXiv.org

    The Price of Diversity. (arXiv:2107.03900v1 [cs.CY])
    (2 min) Systemic bias with respect to gender, race and ethnicity, often unconscious, is prevalent in datasets involving choices among individuals. Consequently, society has found it challenging to alleviate bias and achieve diversity in a way that maintains meritocracy in such settings. We propose (a) a novel optimization approach based on optimally flipping outcome labels and training classification models simultaneously to discover changes to be made in the selection process so as to achieve diversity without significantly affecting meritocracy, and (b) a novel implementation tool employing optimal classification trees to provide insights on which attributes of individuals lead to flipping of their labels, and to help make changes in the current selection processes in a manner understandable by human decision makers. We present case studies on three real-world datasets consisting of parole, admissions to the bar and lending decisions, and demonstrate that the price of diversity is low and sometimes negative, that is we can modify our selection processes in a way that enhances diversity without affecting meritocracy significantly, and sometimes improving it.
    Adaptive Stress Testing for Adversarial Learning in a Financial Environment. (arXiv:2107.03577v1 [cs.AI])
    (2 min) We demonstrate the use of Adaptive Stress Testing to detect and address potential vulnerabilities in a financial environment. We develop a simplified model for credit card fraud detection that utilizes a linear regression classifier based on historical payment transaction data coupled with business rules. We then apply the reinforcement learning model known as Adaptive Stress Testing to train an agent, that can be thought of as a potential fraudster, to find the most likely path to system failure -- successfully defrauding the system. We show the connection between this most likely failure path and the limits of the classifier and discuss how the fraud detection system's business rules can be further augmented to mitigate these failure modes.
    Improving the Effectiveness and Efficiency of Stochastic Neighbour Embedding with Isolation Kernel. (arXiv:1906.09744v3 [cs.LG] UPDATED)
    (2 min) This paper presents a new insight into improving the performance of Stochastic Neighbour Embedding (t-SNE) by using Isolation kernel instead of Gaussian kernel. Isolation kernel outperforms Gaussian kernel in two aspects. First, the use of Isolation kernel in t-SNE overcomes the drawback of misrepresenting some structures in the data, which often occurs when Gaussian kernel is applied in t-SNE. This is because Gaussian kernel determines each local bandwidth based on one local point only, while Isolation kernel is derived directly from the data based on space partitioning. Second, the use of Isolation kernel yields a more efficient similarity computation because data-dependent Isolation kernel has only one parameter that needs to be tuned. In contrast, the use of data-independent Gaussian kernel increases the computational cost by determining n bandwidths for a dataset of n points. As the root cause of these deficiencies in t-SNE is Gaussian kernel, we show that simply replacing Gaussian kernel with Isolation kernel in t-SNE significantly improves the quality of the final visualisation output (without creating misrepresented structures) and removes one key obstacle that prevents t-SNE from processing large datasets. Moreover, Isolation kernel enables t-SNE to deal with large-scale datasets in less runtime without trading off accuracy, unlike existing methods in speeding up t-SNE.
    Fast Multi-Step Critiquing for VAE-based Recommender Systems. (arXiv:2105.00774v2 [cs.IR] UPDATED)
    (2 min) Recent studies have shown that providing personalized explanations alongside recommendations increases trust and perceived quality. Furthermore, it gives users an opportunity to refine the recommendations by critiquing parts of the explanations. On one hand, current recommender systems model the recommendation, explanation, and critiquing objectives jointly, but this creates an inherent trade-off between their respective performance. On the other hand, although recent latent linear critiquing approaches are built upon an existing recommender system, they suffer from computational inefficiency at inference due to the objective optimized at each conversation's turn. We address these deficiencies with M&Ms-VAE, a novel variational autoencoder for recommendation and explanation that is based on multimodal modeling assumptions. We train the model under a weak supervision scheme to simulate both fully and partially observed variables. Then, we leverage the generalization ability of a trained M&Ms-VAE model to embed the user preference and the critique separately. Our work's most important innovation is our critiquing module, which is built upon and trained in a self-supervised manner with a simple ranking objective. Experiments on four real-world datasets demonstrate that among state-of-the-art models, our system is the first to dominate or match the performance in terms of recommendation, explanation, and multi-step critiquing. Moreover, M&Ms-VAE processes the critiques up to 25.6x faster than the best baselines. Finally, we show that our model infers coherent joint and cross generation, even under weak supervision, thanks to our multimodal-based modeling and training scheme.
    Spectral decoupling allows training transferable neural networks in medical imaging. (arXiv:2103.17171v3 [eess.IV] UPDATED)
    (3 min) Many current neural networks for medical imaging generalise poorly to data unseen during training. Such behaviour can be caused by networks overfitting easy-to-learn, or statistically dominant, features while disregarding other potentially informative features. For example, indistinguishable differences in the sharpness of the images from two different scanners can degrade the performance of the network significantly. All neural networks intended for clinical practice need to be robust to variation in data caused by differences in imaging equipment, sample preparation and patient populations. To address these challenges, we evaluate the utility of spectral decoupling as an implicit bias mitigation method. Spectral decoupling encourages the neural network to learn more features by simply regularising the networks' unnormalised prediction scores with an L2 penalty, thus having no added computational costs. We show that spectral decoupling allows training neural networks on datasets with strong spurious correlations. Networks trained without spectral decoupling do not learn the original task and appear to make false predictions based on the spurious correlations. Spectral decoupling also increases networks' robustness for data distribution shifts. To validate our findings, we train networks with and without spectral decoupling to detect prostate cancer tissue slides and COVID-19 in chest radiographs. Networks trained with spectral decoupling achieve substantially higher performance on all evaluation datasets. Our results show that spectral decoupling helps with generalisation issues associated with neural networks. We recommend using spectral decoupling as an implicit bias mitigation method in any neural network intended for clinical use.
    Optimal Network Compression. (arXiv:2008.08733v3 [q-fin.RM] UPDATED)
    (2 min) This paper introduces a formulation of the optimal network compression problem for financial systems. This general formulation is presented for different levels of network compression or rerouting allowed from the initial interbank network. We prove that this problem is, generically, NP-hard. We focus on objective functions generated by systemic risk measures under systematic shocks to the financial network. We conclude by studying the optimal compression problem for specific networks; this permits us to study the so-called robust fragility of certain network topologies more generally as well as the potential benefits and costs of network compression. In particular, under systematic shocks and heterogeneous financial networks the typical heuristics of robust fragility no longer hold generally.
    Extending Lagrangian and Hamiltonian Neural Networks with Differentiable Contact Models. (arXiv:2102.06794v2 [cs.RO] UPDATED)
    (2 min) The incorporation of appropriate inductive bias plays a critical role in learning dynamics from data. A growing body of work has been exploring ways to enforce energy conservation in the learned dynamics by encoding Lagrangian or Hamiltonian dynamics into the neural network architecture. These existing approaches are based on differential equations, which do not allow discontinuity in the states and thereby limit the class of systems one can learn. However, in reality, most physical systems, such as legged robots and robotic manipulators, involve contacts and collisions, which introduce discontinuities in the states. In this paper, we introduce a differentiable contact model, which can capture contact mechanics: frictionless/frictional, as well as elastic/inelastic. This model can also accommodate inequality constraints, such as limits on the joint angles. The proposed contact model extends the scope of Lagrangian and Hamiltonian neural networks by allowing simultaneous learning of contact and system properties. We demonstrate this framework on a series of challenging 2D and 3D physical systems with different coefficients of restitution and friction. The learned dynamics can be used as a differentiable physics simulator for downstream gradient-based optimization tasks, such as planning and control.
    A Machine Learning Approach to Safer Airplane Landings: Predicting Runway Conditions using Weather and Flight Data. (arXiv:2107.04010v1 [cs.CY])
    (2 min) The presence of snow and ice on runway surfaces reduces the available tire-pavement friction needed for retardation and directional control and causes potential economic and safety threats for the aviation industry during the winter seasons. To activate appropriate safety procedures, pilots need accurate and timely information on the actual runway surface conditions. In this study, XGBoost is used to create a combined runway assessment system, which includes a classifcation model to predict slippery conditions and a regression model to predict the level of slipperiness. The models are trained on weather data and data from runway reports. The runway surface conditions are represented by the tire-pavement friction coefficient, which is estimated from flight sensor data from landing aircrafts. To evaluate the performance of the models, they are compared to several state-of-the-art runway assessment methods. The XGBoost models identify slippery runway conditions with a ROC AUC of 0.95, predict the friction coefficient with a MAE of 0.0254, and outperforms all the previous methods. The results show the strong abilities of machine learning methods to model complex, physical phenomena with a good accuracy when domain knowledge is used in the variable extraction. The XGBoost models are combined with SHAP (SHapley Additive exPlanations) approximations to provide a comprehensible decision support system for airport operators and pilots, which can contribute to safer and more economic operations of airport runways.
    Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: a Survey. (arXiv:2009.13303v2 [cs.LG] UPDATED)
    (2 min) Deep reinforcement learning has recently seen huge success across multiple areas in the robotics domain. Owing to the limitations of gathering real-world data, i.e., sample inefficiency and the cost of collecting it, simulation environments are utilized for training the different agents. This not only aids in providing a potentially infinite data source, but also alleviates safety concerns with real robots. Nonetheless, the gap between the simulated and real worlds degrades the performance of the policies once the models are transferred into real robots. Multiple research efforts are therefore now being directed towards closing this sim-to-real gap and accomplish more efficient policy transfer. Recent years have seen the emergence of multiple methods applicable to different domains, but there is a lack, to the best of our knowledge, of a comprehensive review summarizing and putting into context the different methods. In this survey paper, we cover the fundamental background behind sim-to-real transfer in deep reinforcement learning and overview the main methods being utilized at the moment: domain randomization, domain adaptation, imitation learning, meta-learning and knowledge distillation. We categorize some of the most relevant recent works, and outline the main application scenarios. Finally, we discuss the main opportunities and challenges of the different approaches and point to the most promising directions.
    Knowledge Transfer by Discriminative Pre-training for Academic Performance Prediction. (arXiv:2107.04009v1 [cs.CY])
    (2 min) The needs for precisely estimating a student's academic performance have been emphasized with an increasing amount of attention paid to Intelligent Tutoring System (ITS). However, since labels for academic performance, such as test scores, are collected from outside of ITS, obtaining the labels is costly, leading to label-scarcity problem which brings challenge in taking machine learning approaches for academic performance prediction. To this end, inspired by the recent advancement of pre-training method in natural language processing community, we propose DPA, a transfer learning framework with Discriminative Pre-training tasks for Academic performance prediction. DPA pre-trains two models, a generator and a discriminator, and fine-tunes the discriminator on academic performance prediction. In DPA's pre-training phase, a sequence of interactions where some tokens are masked is provided to the generator which is trained to reconstruct the original sequence. Then, the discriminator takes an interaction sequence where the masked tokens are replaced by the generator's outputs, and is trained to predict the originalities of all tokens in the sequence. Compared to the previous state-of-the-art generative pre-training method, DPA is more sample efficient, leading to fast convergence to lower academic performance prediction error. We conduct extensive experimental studies on a real-world dataset obtained from a multi-platform ITS application and show that DPA outperforms the previous state-of-the-art generative pre-training method with a reduction of 4.05% in mean absolute error and more robust to increased label-scarcity.
    GPNAS: A Neural Network Architecture Search Framework Based on Graphical Predictor. (arXiv:2103.11820v6 [cs.LG] UPDATED)
    (2 min) In practice, the problems encountered in Neural Architecture Search (NAS) training are not simple problems, but often a series of difficult combinations (wrong compensation estimation, curse of dimension, overfitting, high complexity, etc.). In this paper, we propose a framework to decouple network structure from operator search space, and use two BOHBs to search alternatively. Considering that activation function and initialization are also important parts of neural network, the generalization ability of the model will be affected. We introduce an activation function and an initialization method domain, and add them into the operator search space to form a generalized search space, so as to improve the generalization ability of the child model. We then trained a GCN-based predictor using feedback from the child model. This can not only improve the search efficiency, but also solve the problem of dimension curse. Next, unlike other NAS studies, we used predictors to analyze the stability of different network structures. Finally, we applied our framework to neural structure search and achieved significant improvements on multiple datasets.
    A Long Short-Term Memory for AI Applications in Spike-based Neuromorphic Hardware. (arXiv:2107.03992v1 [cs.NE])
    (2 min) In spite of intensive efforts it has remained an open problem to what extent current Artificial Intelligence (AI) methods that employ Deep Neural Networks (DNNs) can be implemented more energy-efficiently on spike-based neuromorphic hardware. This holds in particular for AI methods that solve sequence processing tasks, a primary application target for spike-based neuromorphic hardware. One difficulty is that DNNs for such tasks typically employ Long Short-Term Memory (LSTM) units. Yet an efficient emulation of these units in spike-based hardware has been missing. We present a biologically inspired solution that solves this problem. This solution enables us to implement a major class of DNNs for sequence processing tasks such as time series classification and question answering with substantial energy savings on neuromorphic hardware. In fact, the Relational Network for reasoning about relations between objects that we use for question answering is the first example of a large DNN that carries out a sequence processing task with substantial energy-saving on neuromorphic hardware.
    Design Smells in Deep Learning Programs: An Empirical Study. (arXiv:2107.02279v2 [cs.SE] CROSS LISTED)
    (2 min) Nowadays, we are witnessing an increasing adoption of Deep Learning (DL) based software systems in many industries. Designing a DL program requires constructing a deep neural network (DNN) and then training it on a dataset. This process requires that developers make multiple architectural (e.g., type, size, number, and order of layers) and configuration (e.g., optimizer, regularization methods, and activation functions) choices that affect the quality of the DL models, and consequently software quality. An under-specified or poorly-designed DL model may train successfully but is likely to perform poorly when deployed in production. Design smells in DL programs are poor design and-or configuration decisions taken during the development of DL components, that are likely to have a negative impact on the performance (i.e., prediction accuracy) and then quality of DL based software systems. In this paper, we present a catalogue of 8 design smells for a popular DL architecture, namely deep Feedforward Neural Networks which is widely employed in industrial applications. The design smells were identified through a review of the existing literature on DL design and a manual inspection of 659 DL programs with performance issues and design inefficiencies. The smells are specified by describing their context, consequences, and recommended refactorings. To provide empirical evidence on the relevance and perceived impact of the proposed design smells, we conducted a survey with 81 DL developers. In general, the developers perceived the proposed design smells as reflective of design or implementation problems, with agreement levels varying between 47\% and 68\%.
    Non-stationary Reinforcement Learning without Prior Knowledge: An Optimal Black-box Approach. (arXiv:2102.05406v2 [cs.LG] UPDATED)
    (2 min) We propose a black-box reduction that turns a certain reinforcement learning algorithm with optimal regret in a (near-)stationary environment into another algorithm with optimal dynamic regret in a non-stationary environment, importantly without any prior knowledge on the degree of non-stationarity. By plugging different algorithms into our black-box, we provide a list of examples showing that our approach not only recovers recent results for (contextual) multi-armed bandits achieved by very specialized algorithms, but also significantly improves the state of the art for (generalized) linear bandits, episodic MDPs, and infinite-horizon MDPs in various ways. Specifically, in most cases our algorithm achieves the optimal dynamic regret $\widetilde{\mathcal{O}}(\min\{\sqrt{LT}, \Delta^{1/3}T^{2/3}\})$ where $T$ is the number of rounds and $L$ and $\Delta$ are the number and amount of changes of the world respectively, while previous works only obtain suboptimal bounds and/or require the knowledge of $L$ and $\Delta$.
    Recommendation system using a deep learning and graph analysis approach. (arXiv:2004.08100v6 [cs.LG] UPDATED)
    (2 min) When a user connects to the Internet to fulfill his needs, he often encounters a huge amount of related information. Recommender systems are the techniques for massively filtering information and offering the items that users find them satisfying and interesting. The advances in machine learning methods, especially deep learning, have led to great achievements in recommender systems, although these systems still suffer from challenges such as cold-start and sparsity problems. To solve these problems, context information such as user communication network is usually used. In this paper, we have proposed a novel recommendation method based on Matrix Factorization and graph analysis methods. In addition, we leverage deep Autoencoders to initialize users and items latent factors, and deep embedding method gathers users' latent factors from the user trust graph. The proposed method is implemented on two standard datasets. The experimental results and comparisons demonstrate that the proposed approach is superior to the existing state-of-the-art recommendation methods. Our approach outperforms other comparative methods and achieves great improvements. This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible
    Active Safety Envelopes using Light Curtains with Probabilistic Guarantees. (arXiv:2107.04000v1 [cs.LG])
    (2 min) To safely navigate unknown environments, robots must accurately perceive dynamic obstacles. Instead of directly measuring the scene depth with a LiDAR sensor, we explore the use of a much cheaper and higher resolution sensor: programmable light curtains. Light curtains are controllable depth sensors that sense only along a surface that a user selects. We use light curtains to estimate the safety envelope of a scene: a hypothetical surface that separates the robot from all obstacles. We show that generating light curtains that sense random locations (from a particular distribution) can quickly discover the safety envelope for scenes with unknown objects. Importantly, we produce theoretical safety guarantees on the probability of detecting an obstacle using random curtains. We combine random curtains with a machine learning based model that forecasts and tracks the motion of the safety envelope efficiently. Our method accurately estimates safety envelopes while providing probabilistic safety guarantees that can be used to certify the efficacy of a robot perception system to detect and avoid dynamic obstacles. We evaluate our approach in a simulated urban driving environment and a real-world environment with moving pedestrians using a light curtain device and show that we can estimate safety envelopes efficiently and effectively. Project website: https://siddancha.github.io/projects/active-safety-envelopes-with-guarantees
    Offline Meta-Reinforcement Learning with Online Self-Supervision. (arXiv:2107.03974v1 [cs.LG])
    (2 min) Meta-reinforcement learning (RL) can be used to train policies that quickly adapt to new tasks with orders of magnitude less data than standard RL, but this fast adaptation often comes at the cost of greatly increasing the amount of reward supervision during meta-training time. Offline meta-RL removes the need to continuously provide reward supervision because rewards must only be provided once when the offline dataset is generated. In addition to the challenges of offline RL, a unique distribution shift is present in meta RL: agents learn exploration strategies that can gather the experience needed to learn a new task, and also learn adaptation strategies that work well when presented with the trajectories in the dataset, but the adaptation strategies are not adapted to the data distribution that the learned exploration strategies collect. Unlike the online setting, the adaptation and exploration strategies cannot effectively adapt to each other, resulting in poor performance. In this paper, we propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any ground truth reward labels, to bridge this distribution shift problem. Our method uses the offline data to learn the distribution of reward functions, which is then sampled to self-supervise reward labels for the additional online data. By removing the need to provide reward labels for the online experience, our approach can be more practical to use in settings where reward supervision would otherwise be provided manually. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional data and self-generated rewards significantly improves an agent's ability to generalize.
    Rethinking Default Values: a Low Cost and Efficient Strategy to Define Hyperparameters. (arXiv:2008.00025v3 [cs.LG] UPDATED)
    (3 min) Machine Learning (ML) algorithms have been increasingly applied to problems from several different areas. Despite their growing popularity, their predictive performance is usually affected by the values assigned to their hyperparameters (HPs). As consequence, researchers and practitioners face the challenge of how to set these values. Many users have limited knowledge about ML algorithms and the effect of their HP values and, therefore, do not take advantage of suitable settings. They usually define the HP values by trial and error, which is very subjective, not guaranteed to find good values and dependent on the user experience. Tuning techniques search for HP values able to maximize the predictive performance of induced models for a given dataset, but have the drawback of a high computational cost. Thus, practitioners use default values suggested by the algorithm developer or by tools implementing the algorithm. Although default values usually result in models with acceptable predictive performance, different implementations of the same algorithm can suggest distinct default values. To maintain a balance between tuning and using default values, we propose a strategy to generate new optimized default values. Our approach is grounded on a small set of optimized values able to obtain predictive performance values better than default settings provided by popular tools. After performing a large experiment and a careful analysis of the results, we concluded that our approach delivers better default values. Besides, it leads to competitive solutions when compared to tuned values, making it easier to use and having a lower cost. We also extracted simple rules to guide practitioners in deciding whether to use our new methodology or a HP tuning approach.
    COVID19-HPSMP: COVID-19 Adopted Hybrid and Parallel Deep Information Fusion Framework for Stock Price Movement Prediction. (arXiv:2101.02287v2 [q-fin.ST] UPDATED)
    (3 min) The novel of coronavirus (COVID-19) has suddenly and abruptly changed the world as we knew at the start of the 3rd decade of the 21st century. Particularly, COVID-19 pandemic has negatively affected financial econometrics and stock markets across the globe. Artificial Intelligence (AI) and Machine Learning (ML)-based prediction models, especially Deep Neural Network (DNN) architectures, have the potential to act as a key enabling factor to reduce the adverse effects of the COVID-19 pandemic and future possible ones on financial markets. In this regard, first, a unique COVID-19 related PRIce MOvement prediction (COVID19 PRIMO) dataset is introduced in this paper, which incorporates effects of social media trends related to COVID-19 on stock market price movements. Afterwards, a novel hybrid and parallel DNN-based framework is proposed that integrates different and diversified learning architectures. Referred to as the COVID-19 adopted Hybrid and Parallel deep fusion framework for Stock price Movement Prediction (COVID19-HPSMP), innovative fusion strategies are used to combine scattered social media news related to COVID-19 with historical mark data. The proposed COVID19-HPSMP consists of two parallel paths (hence hybrid), one based on Convolutional Neural Network (CNN) with Local/Global Attention modules, and one integrated CNN and Bi-directional Long Short term Memory (BLSTM) path. The two parallel paths are followed by a multilayer fusion layer acting as a fusion centre that combines localized features. Performance evaluations are performed based on the introduced COVID19 PRIMO dataset illustrating superior performance of the proposed framework.
    Certifying clusters from sum-of-norms clustering. (arXiv:2006.11355v2 [cs.LG] UPDATED)
    (2 min) Sum-of-norms clustering is a clustering formulation based on convex optimization that automatically induces hierarchy. Multiple algorithms have been proposed to solve the optimization problem: subgradient descent by Hocking et al., ADMM and ADA by Chi and Lange, stochastic incremental algorithm by Panahi et al. and semismooth Newton-CG augmented Lagrangian method by Sun et al. All algorithms yield approximate solutions, even though an exact solution is demanded to determine the correct cluster assignment. The purpose of this paper is to close the gap between the output from existing algorithms and the exact solution to the optimization problem. We present a clustering test that identifies and certifies the correct cluster assignment from an approximate solution yielded by any primal-dual algorithm. Our certification validates clustering for both unit and multiplicative weights. The test may not succeed if the approximation is inaccurate. However, we show the correct cluster assignment is guaranteed to be certified by a primal-dual path following algorithm after sufficiently many iterations, provided that the model parameter $\lambda$ avoids a finite number of bad values. Numerical experiments are conducted on Gaussian mixture and half-moon data, which indicate that carefully chosen multiplicative weights increase the recovery power of sum-of-norms clustering.
    Matrix Sketching for Secure Collaborative Machine Learning. (arXiv:1909.11201v4 [cs.LG] UPDATED)
    (2 min) Collaborative learning allows participants to jointly train a model without data sharing. To update the model parameters, the central server broadcasts model parameters to the clients, and the clients send updating directions such as gradients to the server. While data do not leave a client device, the communicated gradients and parameters will leak a client's privacy. Attacks that infer clients' privacy from gradients and parameters have been developed by prior work. Simple defenses such as dropout and differential privacy either fail to defend the attacks or seriously hurt test accuracy. We propose a practical defense which we call Double-Blind Collaborative Learning (DBCL). The high-level idea is to apply random matrix sketching to the parameters (aka weights) and re-generate random sketching after each iteration. DBCL prevents clients from conducting gradient-based privacy inferences which are the most effective attacks. DBCL works because from the attacker's perspective, sketching is effectively random noise that outweighs the signal. Notably, DBCL does not much increase computation and communication costs and does not hurt test accuracy at all.
    3D Neural Scene Representations for Visuomotor Control. (arXiv:2107.04004v1 [cs.RO])
    (2 min) Humans have a strong intuitive understanding of the 3D environment around us. The mental model of the physics in our brain applies to objects of different materials and enables us to perform a wide range of manipulation tasks that are far beyond the reach of current robots. In this work, we desire to learn models for dynamic 3D scenes purely from 2D visual observations. Our model combines Neural Radiance Fields (NeRF) and time contrastive learning with an autoencoding framework, which learns viewpoint-invariant 3D-aware scene representations. We show that a dynamics model, constructed over the learned representation space, enables visuomotor control for challenging manipulation tasks involving both rigid bodies and fluids, where the target is specified in a viewpoint different from what the robot operates on. When coupled with an auto-decoding framework, it can even support goal specification from camera viewpoints that are outside the training distribution. We further demonstrate the richness of the learned 3D dynamics model by performing future prediction and novel view synthesis. Finally, we provide detailed ablation studies regarding different system designs and qualitative analysis of the learned representations.
    Deep Reinforcement Learning Methods for Structure-Guided Processing Path Optimization. (arXiv:2009.09706v4 [cs.LG] UPDATED)
    (3 min) A major goal of materials design is to find material structures with desired properties and in a second step to find a processing path to reach one of these structures. In this paper, we propose and investigate a deep reinforcement learning approach for the optimization of processing paths. The goal is to find optimal processing paths in the material structure space that lead to target-structures, which have been identified beforehand to result in desired material properties. There exists a target set containing one or multiple different structures. Our proposed methods can find an optimal path from a start structure to a single target structure, or optimize the processing paths to one of the equivalent target-structures in the set. In the latter case, the algorithm learns during processing to simultaneously identify the best reachable target structure and the optimal path to it. The proposed methods belong to the family of model-free deep reinforcement learning algorithms. They are guided by structure representations as features of the process state and by a reward signal, which is formulated based on a distance function in the structure space. Model-free reinforcement learning algorithms learn through trial and error while interacting with the process. Thereby, they are not restricted to information from a priori sampled processing data and are able to adapt to the specific process. The optimization itself is model-free and does not require any prior knowledge about the process itself. We instantiate and evaluate the proposed methods by optimizing paths of a generic metal forming process. We show the ability of both methods to find processing paths leading close to target structures and the ability of the extended method to identify target-structures that can be reached effectively and efficiently and to focus on these targets for sample efficient processing path optimization.
    Classification of Urban Morphology with Deep Learning: Application on Urban Vitality. (arXiv:2105.09908v2 [cs.CV] UPDATED)
    (2 min) There is a prevailing trend to study urban morphology quantitatively thanks to the growing accessibility to various forms of spatial big data, increasing computing power, and use cases benefiting from such information. The methods developed up to now measure urban morphology with numerical indices describing density, proportion, and mixture, but they do not directly represent morphological features from the human's visual and intuitive perspective. We take the first step to bridge the gap by proposing a deep learning-based technique to automatically classify road networks into four classes on a visual basis. The method is implemented by generating an image of the street network (Colored Road Hierarchy Diagram), which we introduce in this paper, and classifying it using a deep convolutional neural network (ResNet-34). The model achieves an overall classification accuracy of 0.875. Nine cities around the world are selected as the study areas with their road networks acquired from OpenStreetMap. Latent subgroups among the cities are uncovered through clustering on the percentage of each road network category. In the subsequent part of the paper, we focus on the usability of such classification: we apply our method in a case study of urban vitality prediction. An advanced tree-based regression model (LightGBM) is for the first time designated to establish the relationship between morphological indices and vitality indicators. The effect of road network classification is found to be small but positively associated with urban vitality. This work expands the toolkit of quantitative urban morphology study with new techniques, supporting further studies in the future.
    Real-world Ride-hailing Vehicle Repositioning using Deep Reinforcement Learning. (arXiv:2103.04555v3 [cs.LG] UPDATED)
    (2 min) We present a new practical framework based on deep reinforcement learning and decision-time planning for real-world vehicle repositioning on ride-hailing (a type of mobility-on-demand, MoD) platforms. Our approach learns the spatiotemporal state-value function using a batch training algorithm with deep value networks. The optimal repositioning action is generated on-demand through value-based policy search, which combines planning and bootstrapping with the value networks. For the large-fleet problems, we develop several algorithmic features that we incorporate into our framework and that we demonstrate to induce coordination among the algorithmically-guided vehicles. We benchmark our algorithm with baselines in a ride-hailing simulation environment to demonstrate its superiority in improving income efficiency meausred by income-per-hour. We have also designed and run a real-world experiment program with regular drivers on a major ride-hailing platform. We have observed significantly positive results on key metrics comparing our method with experienced drivers who performed idle-time repositioning based on their own expertise.
    A Probabilistic Interpretation of Self-Paced Learning with Applications to Reinforcement Learning. (arXiv:2102.13176v2 [cs.LG] UPDATED)
    (2 min) Across machine learning, the use of curricula has shown strong empirical potential to improve learning from data by avoiding local optima of training objectives. For reinforcement learning (RL), curricula are especially interesting, as the underlying optimization has a strong tendency to get stuck in local optima due to the exploration-exploitation trade-off. Recently, a number of approaches for an automatic generation of curricula for RL have been shown to increase performance while requiring less expert knowledge compared to manually designed curricula. However, these approaches are seldomly investigated from a theoretical perspective, preventing a deeper understanding of their mechanics. In this paper, we present an approach for automated curriculum generation in RL with a clear theoretical underpinning. More precisely, we formalize the well-known self-paced learning paradigm as inducing a distribution over training tasks, which trades off between task complexity and the objective to match a desired task distribution. Experiments show that training on this induced distribution helps to avoid poor local optima across RL algorithms in different tasks with uninformative rewards and challenging exploration requirements.
    Transformer-based Conditional Variational Autoencoder for Controllable Story Generation. (arXiv:2101.00828v2 [cs.CL] UPDATED)
    (2 min) We investigate large-scale latent variable models (LVMs) for neural story generation -- an under-explored application for open-domain long text -- with objectives in two threads: generation effectiveness and controllability. LVMs, especially the variational autoencoder (VAE), have achieved both effective and controllable generation through exploiting flexible distributional latent representations. Recently, Transformers and its variants have achieved remarkable effectiveness without explicit latent representation learning, thus lack satisfying controllability in generation. In this paper, we advocate to revive latent variable modeling, essentially the power of representation learning, in the era of Transformers to enhance controllability without hurting state-of-the-art generation effectiveness. Specifically, we integrate latent representation vectors with a Transformer-based pre-trained architecture to build conditional variational autoencoder (CVAE). Model components such as encoder, decoder and the variational posterior are all built on top of pre-trained language models -- GPT2 specifically in this paper. Experiments demonstrate state-of-the-art conditional generation ability of our model, as well as its excellent representation learning capability and controllability.
    Defending Against Multiple and Unforeseen Adversarial Videos. (arXiv:2009.05244v2 [cs.LG] UPDATED)
    (2 min) Adversarial robustness of deep neural networks has been actively investigated. However, most existing defense approaches are limited to a specific type of adversarial perturbations. Specifically, they often fail to offer resistance to multiple attack types simultaneously, i.e., they lack multi-perturbation robustness. Furthermore, compared to image recognition problems, the adversarial robustness of video recognition models is relatively unexplored. While several studies have proposed how to generate adversarial videos, only a handful of approaches about the defense strategies have been published in the literature. In this paper, we propose one of the first defense strategies against multiple types of adversarial videos for video recognition. The proposed method, referred to as MultiBN, performs adversarial training on multiple adversarial video types using multiple independent batch normalization (BN) layers with a learning-based BN selection module. With a multiple BN structure, each BN brach is responsible for learning the distribution of a single perturbation type and thus provides more precise distribution estimations. This mechanism benefits dealing with multiple perturbation types. The BN selection module detects the attack type of an input video and sends it to the corresponding BN branch, making MultiBN fully automatic and allow end-to-end training. Compared to present adversarial training approaches, the proposed MultiBN exhibits stronger multi-perturbation robustness against different and even unforeseen adversarial video types, ranging from Lp-bounded attacks and physically realizable attacks. This holds true on different datasets and target models. Moreover, we conduct an extensive analysis to study the properties of the multiple BN structure.
    Multi-Modality Task Cascade for 3D Object Detection. (arXiv:2107.04013v1 [cs.CV])
    (2 min) Point clouds and RGB images are naturally complementary modalities for 3D visual understanding - the former provides sparse but accurate locations of points on objects, while the latter contains dense color and texture information. Despite this potential for close sensor fusion, many methods train two models in isolation and use simple feature concatenation to represent 3D sensor data. This separated training scheme results in potentially sub-optimal performance and prevents 3D tasks from being used to benefit 2D tasks that are often useful on their own. To provide a more integrated approach, we propose a novel Multi-Modality Task Cascade network (MTC-RCNN) that leverages 3D box proposals to improve 2D segmentation predictions, which are then used to further refine the 3D boxes. We show that including a 2D network between two stages of 3D modules significantly improves both 2D and 3D task performance. Moreover, to prevent the 3D module from over-relying on the overfitted 2D predictions, we propose a dual-head 2D segmentation training and inference scheme, allowing the 2nd 3D module to learn to interpret imperfect 2D segmentation predictions. Evaluating our model on the challenging SUN RGB-D dataset, we improve upon state-of-the-art results of both single modality and fusion networks by a large margin ($\textbf{+3.8}$ mAP@0.5). Code will be released $\href{https://github.com/Divadi/MTC_RCNN}{\text{here.}}$
    RMA: Rapid Motor Adaptation for Legged Robots. (arXiv:2107.04034v1 [cs.LG])
    (2 min) Successful real-world deployment of legged robots would require them to adapt in real-time to unseen scenarios like changing terrains, changing payloads, wear and tear. This paper presents Rapid Motor Adaptation (RMA) algorithm to solve this problem of real-time online adaptation in quadruped robots. RMA consists of two components: a base policy and an adaptation module. The combination of these components enables the robot to adapt to novel situations in fractions of a second. RMA is trained completely in simulation without using any domain knowledge like reference trajectories or predefined foot trajectory generators and is deployed on the A1 robot without any fine-tuning. We train RMA on a varied terrain generator using bioenergetics-inspired rewards and deploy it on a variety of difficult terrains including rocky, slippery, deformable surfaces in environments with grass, long vegetation, concrete, pebbles, stairs, sand, etc. RMA shows state-of-the-art performance across diverse real-world as well as simulation experiments. Video results at https://ashish-kmr.github.io/rma-legged-robots/
    AC-VRNN: Attentive Conditional-VRNN for Multi-Future Trajectory Prediction. (arXiv:2005.08307v2 [cs.CV] UPDATED)
    (2 min) Anticipating human motion in crowded scenarios is essential for developing intelligent transportation systems, social-aware robots and advanced video surveillance applications. A key component of this task is represented by the inherently multi-modal nature of human paths which makes socially acceptable multiple futures when human interactions are involved. To this end, we propose a generative architecture for multi-future trajectory predictions based on Conditional Variational Recurrent Neural Networks (C-VRNNs). Conditioning mainly relies on prior belief maps, representing most likely moving directions and forcing the model to consider past observed dynamics in generating future positions. Human interactions are modeled with a graph-based attention mechanism enabling an online attentive hidden state refinement of the recurrent estimation. To corroborate our model, we perform extensive experiments on publicly-available datasets (e.g., ETH/UCY, Stanford Drone Dataset, STATS SportVU NBA, Intersection Drone Dataset and TrajNet++) and demonstrate its effectiveness in crowded scenes compared to several state-of-the-art methods.
    Fourier Sparse Leverage Scores and Approximate Kernel Learning. (arXiv:2006.07340v3 [cs.DS] UPDATED)
    (2 min) We prove new explicit upper bounds on the leverage scores of Fourier sparse functions under both the Gaussian and Laplace measures. In particular, we study $s$-sparse functions of the form $f(x) = \sum_{j=1}^s a_j e^{i \lambda_j x}$ for coefficients $a_j \in \mathbb{C}$ and frequencies $\lambda_j \in \mathbb{R}$. Bounding Fourier sparse leverage scores under various measures is of pure mathematical interest in approximation theory, and our work extends existing results for the uniform measure [Erd17,CP19a]. Practically, our bounds are motivated by two important applications in machine learning: 1. Kernel Approximation. They yield a new random Fourier features algorithm for approximating Gaussian and Cauchy (rational quadratic) kernel matrices. For low-dimensional data, our method uses a near optimal number of features, and its runtime is polynomial in the $statistical\ dimension$ of the approximated kernel matrix. It is the first "oblivious sketching method" with this property for any kernel besides the polynomial kernel, resolving an open question of [AKM+17,AKK+20b]. 2. Active Learning. They can be used as non-uniform sampling distributions for robust active learning when data follows a Gaussian or Laplace distribution. Using the framework of [AKM+19], we provide essentially optimal results for bandlimited and multiband interpolation, and Gaussian process regression. These results generalize existing work that only applies to uniformly distributed data.
    Community detection in the sparse hypergraph stochastic block model. (arXiv:1904.05981v6 [math.PR] UPDATED)
    (2 min) We consider the community detection problem in sparse random hypergraphs. Angelini et al. (2015) conjectured the existence of a sharp threshold on model parameters for community detection in sparse hypergraphs generated by a hypergraph stochastic block model. We solve the positive part of the conjecture for the case of two blocks: above the threshold, there is a spectral algorithm which asymptotically almost surely constructs a partition of the hypergraph correlated with the true partition. Our method is a generalization to random hypergraphs of the method developed by Massouli\'{e} (2014) for sparse random graphs.
    Malware Classification Using Deep Boosted Learning. (arXiv:2107.04008v1 [cs.CR])
    (2 min) Malicious activities in cyberspace have gone further than simply hacking machines and spreading viruses. It has become a challenge for a nations survival and hence has evolved to cyber warfare. Malware is a key component of cyber-crime, and its analysis is the first line of defence against attack. This work proposes a novel deep boosted hybrid learning-based malware classification framework and named as Deep boosted Feature Space-based Malware classification (DFS-MC). In the proposed framework, the discrimination power is enhanced by fusing the feature spaces of the best performing customized CNN architectures models and its discrimination by an SVM for classification. The discrimination capacity of the proposed classification framework is assessed by comparing it against the standard customized CNNs. The customized CNN models are implemented in two ways: softmax classifier and deep hybrid learning-based malware classification. In the hybrid learning, Deep features are extracted from customized CNN architectures and fed into the conventional machine learning classifier to improve the classification performance. We also introduced the concept of transfer learning in a customized CNN architecture based malware classification framework through fine-tuning. The performance of the proposed malware classification approaches are validated on the MalImg malware dataset using the hold-out cross-validation technique. Experimental comparisons were conducted by employing innovative, customized CNN, trained from scratch and fine-tuning the customized CNN using transfer learning. The proposed classification framework DFS-MC showed improved results, Accuracy: 98.61%, F-score: 0.96, Precision: 0.96, and Recall: 0.96.
    On Margins and Derandomisation in PAC-Bayes. (arXiv:2107.03955v1 [cs.LG])
    (2 min) We develop a framework for derandomising PAC-Bayesian generalisation bounds achieving a margin on training data, relating this process to the concentration-of-measure phenomenon. We apply these tools to linear prediction, single-hidden-layer neural networks with an unusual erf activation function, and deep ReLU networks, obtaining new bounds. The approach is also extended to the idea of "partial-derandomisation" where only some layers are derandomised and the others are stochastic. This allows empirical evaluation of single-hidden-layer networks on more complex datasets, and helps bridge the gap between generalisation bounds for non-stochastic deep networks and those for randomised deep networks as generally examined in PAC-Bayes.
    Locally differentially private estimation of nonlinear functionals of discrete distributions. (arXiv:2107.03940v1 [math.ST])
    (2 min) We study the problem of estimating non-linear functionals of discrete distributions in the context of local differential privacy. The initial data $x_1,\ldots,x_n \in [K]$ are supposed i.i.d. and distributed according to an unknown discrete distribution $p = (p_1,\ldots,p_K)$. Only $\alpha$-locally differentially private (LDP) samples $z_1,...,z_n$ are publicly available, where the term 'local' means that each $z_i$ is produced using one individual attribute $x_i$. We exhibit privacy mechanisms (PM) that are interactive (i.e. they are allowed to use already published confidential data) or non-interactive. We describe the behavior of the quadratic risk for estimating the power sum functional $F_{\gamma} = \sum_{k=1}^K p_k^{\gamma}$, $\gamma >0$ as a function of $K, \, n$ and $\alpha$. In the non-interactive case, we study two plug-in type estimators of $F_{\gamma}$, for all $\gamma >0$, that are similar to the MLE analyzed by Jiao et al. (2017) in the multinomial model. However, due to the privacy constraint the rates we attain are slower and similar to those obtained in the Gaussian model by Collier et al. (2020). In the interactive case, we introduce for all $\gamma >1$ a two-step procedure which attains the faster parametric rate $(n \alpha^2)^{-1/2}$ when $\gamma \geq 2$. We give lower bounds results over all $\alpha$-LDP mechanisms and all estimators using the private samples.
    Measuring Financial Time Series Similarity With a View to Identifying Profitable Stock Market Opportunities. (arXiv:2107.03926v1 [q-fin.ST])
    (2 min) Forecasting stock returns is a challenging problem due to the highly stochastic nature of the market and the vast array of factors and events that can influence trading volume and prices. Nevertheless it has proven to be an attractive target for machine learning research because of the potential for even modest levels of prediction accuracy to deliver significant benefits. In this paper, we describe a case-based reasoning approach to predicting stock market returns using only historical pricing data. We argue that one of the impediments for case-based stock prediction has been the lack of a suitable similarity metric when it comes to identifying similar pricing histories as the basis for a future prediction -- traditional Euclidean and correlation based approaches are not effective for a variety of reasons -- and in this regard, a key contribution of this work is the development of a novel similarity metric for comparing historical pricing data. We demonstrate the benefits of this metric and the case-based approach in a real-world application in comparison to a variety of conventional benchmarks.
    CSDI: Conditional Score-based Diffusion Models for Probabilistic Time Series Imputation. (arXiv:2107.03502v1 [cs.LG])
    (2 min) The imputation of missing values in time series has many applications in healthcare and finance. While autoregressive models are natural candidates for time series imputation, score-based diffusion models have recently outperformed existing counterparts including autoregressive models in many tasks such as image generation and audio synthesis, and would be promising for time series imputation. In this paper, we propose Conditional Score-based Diffusion models for Imputation (CSDI), a novel time series imputation method that utilizes score-based diffusion models conditioned on observed data. Unlike existing score-based approaches, the conditional diffusion model is explicitly trained for imputation and can exploit correlations between observed values. On healthcare and environmental data, CSDI improves by 40-70% over existing probabilistic imputation methods on popular performance metrics. In addition, deterministic imputation by CSDI reduces the error by 5-20% compared to the state-of-the-art deterministic imputation methods. Furthermore, CSDI can also be applied to time series interpolation and probabilistic forecasting, and is competitive with existing baselines.
    Frequentist Consistency of Variational Bayes. (arXiv:1705.03439v3 [stat.ML] UPDATED)
    (2 min) A key challenge for modern Bayesian statistics is how to perform scalable inference of posterior distributions. To address this challenge, variational Bayes (VB) methods have emerged as a popular alternative to the classical Markov chain Monte Carlo (MCMC) methods. VB methods tend to be faster while achieving comparable predictive performance. However, there are few theoretical results around VB. In this paper, we establish frequentist consistency and asymptotic normality of VB methods. Specifically, we connect VB methods to point estimates based on variational approximations, called frequentist variational approximations, and we use the connection to prove a variational Bernstein-von Mises theorem. The theorem leverages the theoretical characterizations of frequentist variational approximations to understand asymptotic properties of VB. In summary, we prove that (1) the VB posterior converges to the Kullback-Leibler (KL) minimizer of a normal distribution, centered at the truth and (2) the corresponding variational expectation of the parameter is consistent and asymptotically normal. As applications of the theorem, we derive asymptotic properties of VB posteriors in Bayesian mixture models, Bayesian generalized linear mixed models, and Bayesian stochastic block models. We conduct a simulation study to illustrate these theoretical results.
    BumbleBee: A Transformer for Music. (arXiv:2107.03443v1 [cs.SD])
    (2 min) We will introduce BumbleBee, a transformer model that will generate MIDI music data . We will tackle the issue of transformers applied to long sequences by implementing a longformer generative model that uses dilating sliding windows to compute the attention layers. We will compare our results to that of the music transformer and Long-Short term memory (LSTM) to benchmark our results. This analysis will be performed using piano MIDI files, in particular , the JSB Chorales dataset that has already been used for other research works (Huang et al., 2018)
    Digitizing Handwriting with a Sensor Pen: A Writer-Independent Recognizer. (arXiv:2107.03704v1 [cs.LG])
    (2 min) Online handwriting recognition has been studied for a long time with only few practicable results when writing on normal paper. Previous approaches using sensor-based devices encountered problems that limited the usage of the developed systems in real-world applications. This paper presents a writer-independent system that recognizes characters written on plain paper with the use of a sensor-equipped pen. This system is applicable in real-world applications and requires no user-specific training for recognition. The pen provides linear acceleration, angular velocity, magnetic field, and force applied by the user, and acts as a digitizer that transforms the analogue signals of the sensors into timeseries data while writing on regular paper. The dataset we collected with this pen consists of Latin lower-case and upper-case alphabets. We present the results of a convolutional neural network model for letter classification and show that this approach is practical and achieves promising results for writer-independent character recognition. This work aims at providing a realtime handwriting recognition system to be used for writing on normal paper.
    Adaptation of Quadruped Robot Locomotion with Meta-Learning. (arXiv:2107.03741v1 [cs.RO])
    (2 min) Animals have remarkable abilities to adapt locomotion to different terrains and tasks. However, robots trained by means of reinforcement learning are typically able to solve only a single task and a transferred policy is usually inferior to that trained from scratch. In this work, we demonstrate that meta-reinforcement learning can be used to successfully train a robot capable to solve a wide range of locomotion tasks. The performance of the meta-trained robot is similar to that of a robot that is trained on a single task.
    End-to-End Optimized Speech Coding with Deep Neural Networks. (arXiv:1710.09064v3 [cs.SD] UPDATED)
    (2 min) Modern compression algorithms are often the result of laborious domain-specific research; industry standards such as MP3, JPEG, and AMR-WB took years to develop and were largely hand-designed. We present a deep neural network model which optimizes all the steps of a wideband speech coding pipeline (compression, quantization, entropy coding, and decompression) end-to-end directly from raw speech data -- no manual feature engineering necessary, and it trains in hours. In testing, our DNN-based coder performs on par with the AMR-WB standard at a variety of bitrates (~9kbps up to ~24kbps). It also runs in realtime on a 3.8GhZ Intel CPU.
    Label-set Loss Functions for Partial Supervision: Application to Fetal Brain 3D MRI Parcellation. (arXiv:2107.03846v1 [eess.IV])
    (2 min) Deep neural networks have increased the accuracy of automatic segmentation, however, their accuracy depends on the availability of a large number of fully segmented images. Methods to train deep neural networks using images for which some, but not all, regions of interest are segmented are necessary to make better use of partially annotated datasets. In this paper, we propose the first axiomatic definition of label-set loss functions that are the loss functions that can handle partially segmented images. We prove that there is one and only one method to convert a classical loss function for fully segmented images into a proper label-set loss function. Our theory also allows us to define the leaf-Dice loss, a label-set generalization of the Dice loss particularly suited for partial supervision with only missing labels. Using the leaf-Dice loss, we set a new state of the art in partially supervised learning for fetal brain 3D MRI segmentation. We achieve a deep neural network able to segment white matter, ventricles, cerebellum, extra-ventricular CSF, cortical gray matter, deep gray matter, brainstem, and corpus callosum based on fetal brain 3D MRI of anatomically normal fetuses or with open spina bifida. Our implementation of the proposed label-set loss functions is available at https://github.com/LucasFidon/label-set-loss-functions
    Optimizing Data Processing in Space for Object Detection in Satellite Imagery. (arXiv:2107.03774v1 [cs.CV])
    (3 min) There is a proliferation in the number of satellites launched each year, resulting in downlinking of terabytes of data each day. The data received by ground stations is often unprocessed, making this an expensive process considering the large data sizes and that not all of the data is useful. This, coupled with the increasing demand for real-time data processing, has led to a growing need for on-orbit processing solutions. In this work, we investigate the performance of CNN-based object detectors on constrained devices by applying different image compression techniques to satellite data. We examine the capabilities of the NVIDIA Jetson Nano and NVIDIA Jetson AGX Xavier; low-power, high-performance computers, with integrated GPUs, small enough to fit on-board a nanosatellite. We take a closer look at object detection networks, including the Single Shot MultiBox Detector (SSD) and Region-based Fully Convolutional Network (R-FCN) models that are pre-trained on DOTA - a Large Scale Dataset for Object Detection in Aerial Images. The performance is measured in terms of execution time, memory consumption, and accuracy, and are compared against a baseline containing a server with two powerful GPUs. The results show that by applying image compression techniques, we are able to improve the execution time and memory consumption, achieving a fully runnable dataset. A lossless compression technique achieves roughly a 10% reduction in execution time and about a 3% reduction in memory consumption, with no impact on the accuracy. While a lossy compression technique improves the execution time by up to 144% and the memory consumption is reduced by as much as 97%. However, it has a significant impact on accuracy, varying depending on the compression ratio. Thus the application and ratio of these compression techniques may differ depending on the required level of accuracy for a particular task.
    Explainable AI (XAI) for PHM of Industrial Asset: A State-of-The-Art, PRISMA-Compliant Systematic Review. (arXiv:2107.03869v1 [cs.AI])
    (2 min) A state-of-the-art systematic review on XAI applied to Prognostic and Health Management (PHM) of industrial asset is presented. The work attempts to provide an overview of the general trend of XAI in PHM, answers the question of accuracy versus explainability, investigates the extent of human role, explainability evaluation and uncertainty management in PHM XAI. Research articles linked to PHM XAI, in English language, from 2015 to 2021 are selected from IEEE Xplore, ScienceDirect, SpringerLink, ACM Digital Library and Scopus databases using PRISMA guidelines. Data was extracted from 35 selected articles and examined using MS. Excel. Several findings were synthesized. Firstly, while the discipline is still young, the analysis indicates the growing acceptance of XAI in PHM domain. Secondly, XAI functions as a double edge sword, where it is assimilated as a tool to execute PHM tasks as well as a mean of explanation, in particular in diagnostic and anomaly detection. There is thus a need for XAI in PHM. Thirdly, the review shows that PHM XAI papers produce either good or excellent results in general, suggesting that PHM performance is unaffected by XAI. Fourthly, human role, explainability metrics and uncertainty management are areas requiring further attention by the PHM community. Adequate explainability metrics to cater for PHM need are urgently needed. Finally, most case study featured on the accepted articles are based on real, indicating that available AI and XAI approaches are equipped to solve complex real-world challenges, increasing the confidence of AI model adoption in the industry. This work is funded by the Universiti Teknologi Petronas Foundation.
    Proceedings of the First Workshop on Weakly Supervised Learning (WeaSuL). (arXiv:2107.03690v1 [cs.LG])
    (2 min) Welcome to WeaSuL 2021, the First Workshop on Weakly Supervised Learning, co-located with ICLR 2021. In this workshop, we want to advance theory, methods and tools for allowing experts to express prior coded knowledge for automatic data annotations that can be used to train arbitrary deep neural networks for prediction. The ICLR 2021 Workshop on Weak Supervision aims at advancing methods that help modern machine-learning methods to generalize from knowledge provided by experts, in interaction with observable (unlabeled) data. In total, 15 papers were accepted. All the accepted contributions are listed in these Proceedings.
    TensorFlow RiemOpt: a library for optimization on Riemannian manifolds. (arXiv:2105.13921v2 [cs.MS] UPDATED)
    (2 min) The adoption of neural networks and deep learning in non-Euclidean domains has been hindered until recently by the lack of scalable and efficient learning frameworks. Existing toolboxes in this space were mainly motivated by research and education use cases, whereas practical aspects, such as deploying and maintaining machine learning models, were often overlooked. We attempt to bridge this gap by proposing TensorFlow RiemOpt, a Python library for optimization on Riemannian manifolds in TensorFlow. The library is designed with the aim for a seamless integration with the TensorFlow ecosystem, targeting not only research, but also streamlining production machine learning pipelines.
    A hybrid virtual sensing approach for approximating non-linear dynamic system behavior using LSTM networks. (arXiv:2107.03645v1 [eess.SP])
    (2 min) Modern Internet of Things solutions are used in a variety of different areas, ranging from connected vehicles and healthcare to industrial applications. They rely on a large amount of interconnected sensors, which can lead to both technical and economical challenges. Virtual sensing techniques aim to reduce the number of physical sensors in a system by using data from available measurements to estimate additional unknown quantities of interest. Successful model-based solutions include Kalman filters or the combination of finite element models and modal analysis, while many data-driven methods rely on machine learning algorithms. The presented hybrid virtual sensing approach combines Long Short-Term Memory networks with frequency response function models in order to estimate the behavior of non-linear dynamic systems with multiple input and output channels. Network training and prediction make use of short signal subsequences, which are later recombined by applying a windowing technique. The frequency response function model acts as a baseline estimate which perfectly captures linear dynamic systems and is augmented by the non-linear Long Short-Term Memory network following two different hybrid modeling strategies. The approach is tested using a non-linear experimental dataset, which results from measurements of a three-component servo-hydraulic fatigue test bench. A variety of metrics in time and frequency domains, as well as fatigue strength under variable amplitudes are used to evaluate the approximation quality of the proposed method. In addition to virtual sensing, the algorithm is also applied to a forward prediction task. Synthetic data are used in a separate study to estimate the prediction quality on datasets of different size.
    Recurrence-Aware Long-Term Cognitive Network for Explainable Pattern Classification. (arXiv:2107.03423v1 [cs.LG])
    (2 min) Machine learning solutions for pattern classification problems are nowadays widely deployed in society and industry. However, the lack of transparency and accountability of most accurate models often hinders their meaningful and safe use. Thus, there is a clear need for developing explainable artificial intelligence mechanisms. There exist model-agnostic methods that summarize feature contributions, but their interpretability is limited to specific predictions made by black-box models. An open challenge is to develop models that have intrinsic interpretability and produce their own explanations, even for classes of models that are traditionally considered black boxes like (recurrent) neural networks. In this paper, we propose an LTCN-based model for interpretable pattern classification of structured data. Our method brings its own mechanism for providing explanations by quantifying the relevance of each feature in the decision process. For supporting the interpretability without affecting the performance, the model incorporates more flexibility through a quasi-nonlinear reasoning rule that allows controlling nonlinearity. Besides, we propose a recurrence-aware decision model that evades the issues posed by unique fixed points while introducing a deterministic learning method to compute the learnable parameters. The simulations show that our interpretable model obtains competitive performance when compared to the state-of-the-art white and black boxes.
    Flexibly Regularized Mixture Models and Application to Image Segmentation. (arXiv:1905.10629v2 [cs.CV] UPDATED)
    (2 min) Probabilistic finite mixture models are widely used for unsupervised clustering. These models can often be improved by adapting them to the topology of the data. For instance, in order to classify spatially adjacent data points similarly, it is common to introduce a Laplacian constraint on the posterior probability that each data point belongs to a class. Alternatively, the mixing probabilities can be treated as free parameters, while assuming Gauss-Markov or more complex priors to regularize those mixing probabilities. However, these approaches are constrained by the shape of the prior and often lead to complicated or intractable inference. Here, we propose a new parametrization of the Dirichlet distribution to flexibly regularize the mixing probabilities of over-parametrized mixture distributions. Using the Expectation-Maximization algorithm, we show that our approach allows us to define any linear update rule for the mixing probabilities, including spatial smoothing regularization as a special case. We then show that this flexible design can be extended to share class information between multiple mixture models. We apply our algorithm to artificial and natural image segmentation tasks, and we provide quantitative and qualitative comparison of the performance of Gaussian and Student-t mixtures on the Berkeley Segmentation Dataset. We also demonstrate how to propagate class information across the layers of deep convolutional neural networks in a probabilistically optimal way, suggesting a new interpretation for feedback signals in biological visual systems. Our flexible approach can be easily generalized to adapt probabilistic mixture models to arbitrary data topologies.
    The value of text for small business default prediction: A deep learning approach. (arXiv:2003.08964v4 [cs.LG] UPDATED)
    (2 min) Compared to consumer lending, Micro, Small and Medium Enterprise (mSME) credit risk modelling is particularly challenging, as, often, the same sources of information are not available. Therefore, it is standard policy for a loan officer to provide a textual loan assessment to mitigate limited data availability. In turn, this statement is analysed by a credit expert alongside any available standard credit data. In our paper, we exploit recent advances from the field of Deep Learning and Natural Language Processing (NLP), including the BERT (Bidirectional Encoder Representations from Transformers) model, to extract information from 60 000 textual assessments provided by a lender. We consider the performance in terms of the AUC (Area Under the receiver operating characteristic Curve) and Brier Score metrics and find that the text alone is surprisingly effective for predicting default. However, when combined with traditional data, it yields no additional predictive capability, with performance dependent on the text's length. Our proposed deep learning model does, however, appear to be robust to the quality of the text and therefore suitable for partly automating the mSME lending process. We also demonstrate how the content of loan assessments influences performance, leading us to a series of recommendations on a new strategy for collecting future mSME loan assessments.
    Convergence of Langevin Monte Carlo in Chi-Squared and Renyi Divergence. (arXiv:2007.11612v4 [stat.ML] UPDATED)
    (2 min) We study sampling from a target distribution $\nu_* = e^{-f}$ using the unadjusted Langevin Monte Carlo (LMC) algorithm when the potential $f$ satisfies a strong dissipativity condition and it is first-order smooth with a Lipschitz gradient. We prove that, initialized with a Gaussian random vector that has sufficiently small variance, iterating the LMC algorithm for $\widetilde{\mathcal{O}}(\lambda^2 d\epsilon^{-1})$ steps is sufficient to reach $\epsilon$-neighborhood of the target in both Chi-squared and Renyi divergence, where $\lambda$ is the logarithmic Sobolev constant of $\nu_*$. Our results do not require warm-start to deal with the exponential dimension dependency in Chi-squared divergence at initialization. In particular, for strongly convex and first-order smooth potentials, we show that the LMC algorithm achieves the rate estimate $\widetilde{\mathcal{O}}(d\epsilon^{-1})$ which improves the previously known rates in both of these metrics, under the same assumptions. Translating this rate to other metrics, our results also recover the state-of-the-art rate estimates in KL divergence, total variation and $2$-Wasserstein distance in the same setup. Finally, as we rely on the logarithmic Sobolev inequality, our framework covers a range of non-convex potentials that are first-order smooth and exhibit strong convexity outside of a compact region.
    Consensus Clustering With Unsupervised Representation Learning. (arXiv:2010.01245v2 [cs.CV] UPDATED)
    (2 min) Recent advances in deep clustering and unsupervised representation learning are based on the idea that different views of an input image (generated through data augmentation techniques) must either be closer in the representation space, or have a similar cluster assignment. Bootstrap Your Own Latent (BYOL) is one such representation learning algorithm that has achieved state-of-the-art results in self-supervised image classification on ImageNet under the linear evaluation protocol. However, the utility of the learnt features of BYOL to perform clustering is not explored. In this work, we study the clustering ability of BYOL and observe that features learnt using BYOL may not be optimal for clustering. We propose a novel consensus clustering based loss function, and train BYOL with the proposed loss in an end-to-end way that improves the clustering ability and outperforms similar clustering based methods on some popular computer vision datasets.
    Comparing Supervised Models And Learned Speech Representations For Classifying Intelligibility Of Disordered Speech On Selected Phrases. (arXiv:2107.03985v1 [eess.AS])
    (2 min) Automatic classification of disordered speech can provide an objective tool for identifying the presence and severity of speech impairment. Classification approaches can also help identify hard-to-recognize speech samples to teach ASR systems about the variable manifestations of impaired speech. Here, we develop and compare different deep learning techniques to classify the intelligibility of disordered speech on selected phrases. We collected samples from a diverse set of 661 speakers with a variety of self-reported disorders speaking 29 words or phrases, which were rated by speech-language pathologists for their overall intelligibility using a five-point Likert scale. We then evaluated classifiers developed using 3 approaches: (1) a convolutional neural network (CNN) trained for the task, (2) classifiers trained on non-semantic speech representations from CNNs that used an unsupervised objective [1], and (3) classifiers trained on the acoustic (encoder) embeddings from an ASR system trained on typical speech [2]. We found that the ASR encoder's embeddings considerably outperform the other two on detecting and classifying disordered speech. Further analysis shows that the ASR embeddings cluster speech by the spoken phrase, while the non-semantic embeddings cluster speech by speaker. Also, longer phrases are more indicative of intelligibility deficits than single words.
    Federated Learning for Multi-Center Imaging Diagnostics: A Study in Cardiovascular Disease. (arXiv:2107.03901v1 [eess.IV])
    (2 min) Deep learning models can enable accurate and efficient disease diagnosis, but have thus far been hampered by the data scarcity present in the medical world. Automated diagnosis studies have been constrained by underpowered single-center datasets, and although some results have shown promise, their generalizability to other institutions remains questionable as the data heterogeneity between institutions is not taken into account. By allowing models to be trained in a distributed manner that preserves patients' privacy, federated learning promises to alleviate these issues, by enabling diligent multi-center studies. We present the first federated learning study on the modality of cardiovascular magnetic resonance (CMR) and use four centers derived from subsets of the M\&M and ACDC datasets, focusing on the diagnosis of hypertrophic cardiomyopathy (HCM). We adapt a 3D-CNN network pretrained on action recognition and explore two different ways of incorporating shape prior information to the model, and four different data augmentation set-ups, systematically analyzing their impact on the different collaborative learning choices. We show that despite the small size of data (180 subjects derived from four centers), the privacy preserving federated learning achieves promising results that are competitive with traditional centralized learning. We further find that federatively trained models exhibit increased robustness and are more sensitive to domain shift effects.
    Grid Partitioned Attention: Efficient TransformerApproximation with Inductive Bias for High Resolution Detail Generation. (arXiv:2107.03742v1 [cs.CV])
    (2 min) Attention is a general reasoning mechanism than can flexibly deal with image information, but its memory requirements had made it so far impractical for high resolution image generation. We present Grid Partitioned Attention (GPA), a new approximate attention algorithm that leverages a sparse inductive bias for higher computational and memory efficiency in image domains: queries attend only to few keys, spatially close queries attend to close keys due to correlations. Our paper introduces the new attention layer, analyzes its complexity and how the trade-off between memory usage and model power can be tuned by the hyper-parameters.We will show how such attention enables novel deep learning architectures with copying modules that are especially useful for conditional image generation tasks like pose morphing. Our contributions are (i) algorithm and code1of the novel GPA layer, (ii) a novel deep attention-copying architecture, and (iii) new state-of-the art experimental results in human pose morphing generation benchmarks.
    Federated Learning as a Mean-Field Game. (arXiv:2107.03770v1 [stat.ML])
    (2 min) We establish a connection between federated learning, a concept from machine learning, and mean-field games, a concept from game theory and control theory. In this analogy, the local federated learners are considered as the players and the aggregation of the gradients in a central server is the mean-field effect. We present federated learning as a differential game and discuss the properties of the equilibrium of this game. We hope this novel view to federated learning brings together researchers from these two distinct areas to work on fundamental problems of large-scale distributed and privacy-preserving learning algorithms.
    Bootstrapping Generalization of Process Models Discovered From Event Data. (arXiv:2107.03876v1 [cs.AI])
    (2 min) Process mining studies ways to derive value from process executions recorded in event logs of IT-systems, with process discovery the task of inferring a process model for an event log emitted by some unknown system. One quality criterion for discovered process models is generalization. Generalization seeks to quantify how well the discovered model describes future executions of the system, and is perhaps the least understood quality criterion in process mining. The lack of understanding is primarily a consequence of generalization seeking to measure properties over the entire future behavior of the system, when the only available sample of behavior is that provided by the event log itself. In this paper, we draw inspiration from computational statistics, and employ a bootstrap approach to estimate properties of a population based on a sample. Specifically, we define an estimator of the model's generalization based on the event log it was discovered from, and then use bootstrapping to measure the generalization of the model with respect to the system, and its statistical significance. Experiments demonstrate the feasibility of the approach in industrial settings.
    Differentiable Random Access Memory using Lattices. (arXiv:2107.03474v1 [cs.LG])
    (2 min) We introduce a differentiable random access memory module with $O(1)$ performance regardless of size, scaling to billions of entries. The design stores entries on points of a chosen lattice to calculate nearest neighbours of arbitrary points efficiently by exploiting symmetries. Augmenting a standard neural network architecture with a single memory layer based on this, we can scale the parameter count up to memory limits with negligible computational overhead, giving better accuracy at similar cost. On large language modelling tasks, these enhanced models with larger capacity significantly outperform the unmodified transformer baseline. We found continued scaling with memory size up to the limits tested.
    Encoding Domain Information with Sparse Priors for Inferring Explainable Latent Variables. (arXiv:2107.03730v1 [stat.ML])
    (2 min) Latent variable models are powerful statistical tools that can uncover relevant variation between patients or cells, by inferring unobserved hidden states from observable high-dimensional data. A major shortcoming of current methods, however, is their inability to learn sparse and interpretable hidden states. Additionally, in settings where partial knowledge on the latent structure of the data is readily available, a statistically sound integration of prior information into current methods is challenging. To address these issues, we propose spex-LVM, a factorial latent variable model with sparse priors to encourage the inference of explainable factors driven by domain-relevant information. spex-LVM utilizes existing knowledge of curated biomedical pathways to automatically assign annotated attributes to latent factors, yielding interpretable results tailored to the corresponding domain of interest. Evaluations on simulated and real single-cell RNA-seq datasets demonstrate that our model robustly identifies relevant structure in an inherently explainable manner, distinguishes technical noise from sources of biomedical variation, and provides dataset-specific adaptations of existing pathway annotations. Implementation is available at https://github.com/MLO-lab/spexlvm.
    Output Randomization: A Novel Defense for both White-box and Black-box Adversarial Models. (arXiv:2107.03806v1 [cs.LG])
    (2 min) Adversarial examples pose a threat to deep neural network models in a variety of scenarios, from settings where the adversary has complete knowledge of the model in a "white box" setting and to the opposite in a "black box" setting. In this paper, we explore the use of output randomization as a defense against attacks in both the black box and white box models and propose two defenses. In the first defense, we propose output randomization at test time to thwart finite difference attacks in black box settings. Since this type of attack relies on repeated queries to the model to estimate gradients, we investigate the use of randomization to thwart such adversaries from successfully creating adversarial examples. We empirically show that this defense can limit the success rate of a black box adversary using the Zeroth Order Optimization attack to 0%. Secondly, we propose output randomization training as a defense against white box adversaries. Unlike prior approaches that use randomization, our defense does not require its use at test time, eliminating the Backward Pass Differentiable Approximation attack, which was shown to be effective against other randomization defenses. Additionally, this defense has low overhead and is easily implemented, allowing it to be used together with other defenses across various model architectures. We evaluate output randomization training against the Projected Gradient Descent attacker and show that the defense can reduce the PGD attack's success rate down to 12% when using cross-entropy loss.
    The Three Ensemble Clustering (3EC) Algorithm for Pattern Discovery in Unsupervised Learning. (arXiv:2107.03729v1 [cs.LG])
    (2 min) This paper presents a multiple learner algorithm called the 'Three Ensemble Clustering 3EC' algorithm that classifies unlabeled data into quality clusters as a part of unsupervised learning. It offers the flexibility to explore the context of new clusters formed by an ensemble of algorithms based on internal validation indices. It is worth mentioning that the input data set is considered to be a cluster of clusters. An anomaly can possibly manifest as a cluster as well. Each partitioned cluster is considered to be a new data set and is a candidate to explore the most optimal algorithm and its number of partition splits until a predefined stopping criteria is met. The algorithms independently partition the data set into clusters and the quality of the partitioning is assessed by an ensemble of internal cluster validation indices. The 3EC algorithm presents the validation index scores from a choice of algorithms and its configuration of partitions and it is called the Tau Grid. 3EC chooses the most optimal score. The 3EC algorithm owes its name to the two input ensembles of algorithms and internal validation indices and an output ensemble of final clusters. Quality plays an important role in this clustering approach and it also acts as a stopping criteria from further partitioning. Quality is determined based on the quality of the clusters provided by an algorithm and its optimal number of splits. The 3EC algorithm determines this from the score of the ensemble of validation indices. The user can configure the stopping criteria by providing quality thresholds for the score range of each of the validation indices and the optimal size of the output cluster. The users can experiment with different sets of stopping criteria and choose the most 'sensible group' of quality clusters
    Collaboration of Experts: Achieving 80% Top-1 Accuracy on ImageNet with 100M FLOPs. (arXiv:2107.03815v1 [cs.CV])
    (2 min) In this paper, we propose a Collaboration of Experts (CoE) framework to pool together the expertise of multiple networks towards a common aim. Each expert is an individual network with expertise on a unique portion of the dataset, which enhances the collective capacity. Given a sample, an expert is selected by the delegator, which simultaneously outputs a rough prediction to support early termination. To fulfill this framework, we propose three modules to impel each model to play its role, namely weight generation module (WGM), label generation module (LGM) and variance calculation module (VCM). Our method achieves the state-of-the-art performance on ImageNet, 80.7% top-1 accuracy with 194M FLOPs. Combined with PWLU activation function and CondConv, CoE further achieves the accuracy of 80.0% with only 100M FLOPs for the first time. More importantly, our method is hardware friendly and achieves a 3-6x speedup compared with some existing conditional computation approaches.
    Image Resolution Susceptibility of Face Recognition Models. (arXiv:2107.03769v1 [cs.CV])
    (2 min) Face recognition approaches often rely on equal image resolution for verification faces on two images. However, in practical applications, those image resolutions are usually not in the same range due to different image capture mechanisms or sources. In this work, we first analyze the impact of image resolutions on the face verification performance with a state-of-the-art face recognition model. For images, synthetically reduced to $5\, \times 5\, \mathrm{px}$ resolution, the verification performance drops from $99.23\%$ increasingly down to almost $55\%$. Especially, for cross-resolution image pairs (one high- and one low-resolution image), the verification accuracy decreases even further. We investigate this behavior more in-depth by looking at the feature distances for every 2-image test pair. To tackle this problem, we propose the following two methods: 1) Train a state-of-the-art face-recognition model straightforward with $50\%$ low-resolution images directly within each batch. \\ 2) Train a siamese-network structure and adding a cosine distance feature loss between high- and low-resolution features. Both methods show an improvement for cross-resolution scenarios and can increase the accuracy at very low resolution to approximately $70\%$. However, a disadvantage is that a specific model needs to be trained for every resolution-pair ...
    Energy Efficient Federated Learning in Integrated Fog-Cloud Computing Enabled Internet-of-Things Networks. (arXiv:2107.03520v1 [eess.SP])
    (2 min) We investigate resource allocation scheme to reduce the energy consumption of federated learning (FL) in the integrated fog-cloud computing enabled Internet-of-things (IoT) networks. In the envisioned system, IoT devices are connected with the centralized cloud server (CS) via multiple fog access points (F-APs). We consider two different scenarios for training the local models. In the first scenario, local models are trained at the IoT devices and the F-APs upload the local model parameters to the CS. In the second scenario, local models are trained at the F-APs based on the collected data from the IoT devices and the F-APs collaborate with the CS for updating the model parameters. Our objective is to minimize the overall energy-consumption of both scenarios subject to FL time constraint. Towards this goal, we devise a joint optimization of scheduling of IoT devices with the F-APs, transmit power allocation, computation frequency allocation at the devices and F-APs and decouple it into two subproblems. In the first subproblem, we optimize the IoT device scheduling and power allocation, while in the second subproblem, we optimize the computation frequency allocation. For each scenario, we develop a conflict graph based solution to iteratively solve the two subproblems. Simulation results show that the proposed two schemes achieve a considerable performance gain in terms of the energy consumption minimization. The presented simulation results interestingly reveal that for a large number of IoT devices and large data sizes, it is more energy efficient to train the local models at the IoT devices instead of the F-APs.
    Assessing putative bias in prediction of anti-microbial resistance from real-world genotyping data under explicit causal assumptions. (arXiv:2107.03383v1 [q-bio.GN])
    (2 min) Whole genome sequencing (WGS) is quickly becoming the customary means for identification of antimicrobial resistance (AMR) due to its ability to obtain high resolution information about the genes and mechanisms that are causing resistance and driving pathogen mobility. By contrast, traditional phenotypic (antibiogram) testing cannot easily elucidate such information. Yet development of AMR prediction tools from genotype-phenotype data can be biased, since sampling is non-randomized. Sample provenience, period of collection, and species representation can confound the association of genetic traits with AMR. Thus, prediction models can perform poorly on new data with sampling distribution shifts. In this work -- under an explicit set of causal assumptions -- we evaluate the effectiveness of propensity-based rebalancing and confounding adjustment on AMR prediction using genotype-phenotype AMR data from the Pathosystems Resource Integration Center (PATRIC). We select bacterial genotypes (encoded as k-mer signatures, i.e. DNA fragments of length k), country, year, species, and AMR phenotypes for the tetracycline drug class, preparing test data with recent genomes coming from a single country. We test boosted logistic regression (BLR) and random forests (RF) with/without bias-handling. On 10,936 instances, we find evidence of species, location and year imbalance with respect to the AMR phenotype. The crude versus bias-adjusted change in effect of genetic signatures on AMR varies but only moderately (selecting the top 20,000 out of 40+ million k-mers). The area under the receiver operating characteristic (AUROC) of the RF (0.95) is comparable to that of BLR (0.94) on both out-of-bag samples from bootstrap and the external test (n=1,085), where AUROCs do not decrease. We observe a 1%-5% gain in AUROC with bias-handling compared to the sole use of genetic signatures. ...
    Understanding the Limits of Unsupervised Domain Adaptation via Data Poisoning. (arXiv:2107.03919v1 [cs.LG])
    (2 min) Unsupervised domain adaptation (UDA) enables cross-domain learning without target domain labels by transferring knowledge from a labeled source domain whose distribution differs from the target. However, UDA is not always successful and several accounts of "negative transfer" have been reported in the literature. In this work, we prove a simple lower bound on the target domain error that complements the existing upper bound. Our bound shows the insufficiency of minimizing source domain error and marginal distribution mismatch for a guaranteed reduction in the target domain error, due to the possible increase of induced labeling function mismatch. This insufficiency is further illustrated through simple distributions for which the same UDA approach succeeds, fails, and may succeed or fail with an equal chance. Motivated from this, we propose novel data poisoning attacks to fool UDA methods into learning representations that produce large target domain errors. We evaluate the effect of these attacks on popular UDA methods using benchmark datasets where they have been previously shown to be successful. Our results show that poisoning can significantly decrease the target domain accuracy, dropping it to almost 0\% in some cases, with the addition of only 10\% poisoned data in the source domain. The failure of UDA methods demonstrates the limitations of UDA at guaranteeing cross-domain generalization consistent with the lower bound. Thus, evaluation of UDA methods in adversarial settings such as data poisoning can provide a better sense of their robustness in scenarios unfavorable for UDA.
    MOD-Net: A Machine Learning Approach via Model-Operator-Data Network for Solving PDEs. (arXiv:2107.03673v1 [math.NA])
    (2 min) In this paper, we propose a model-operator-data network (MOD-Net) for solving PDEs. A MOD-Net is driven by a model to solve PDEs based on operator representation with regularization from data. In this work, we use a deep neural network to parameterize the Green's function. The empirical risk consists of the mean square of the governing equation, boundary conditions, and a few labels, which are numerically computed by traditional schemes on coarse grid points with cheap computation cost. With only the labeled dataset or only the model constraints, it is insufficient to accurately train a MOD-Net for complicate problems. Intuitively, the labeled dataset works as a regularization in addition to the model constraints. The MOD-Net is much efficient than original neural operator because the MOD-Net also uses the information of governing equation and the boundary conditions of the PDE rather than purely the expensive labels. Since the MOD-Net learns the Green's function of a PDE, it solves a type of PDEs but not a specific case. We numerically show MOD-Net is very efficient in solving Poisson equation and one-dimensional Boltzmann equation. For non-linear PDEs, where the concept of the Green's function does not apply, the non-linear MOD-Net can be similarly used as an ansatz for solving non-linear PDEs.
    Sublinear Regret for Learning POMDPs. (arXiv:2107.03635v1 [cs.LG])
    (2 min) We study the model-based undiscounted reinforcement learning for partially observable Markov decision processes (POMDPs). The oracle we consider is the optimal policy of the POMDP with a known environment in terms of the average reward over an infinite horizon. We propose a learning algorithm for this problem, building on spectral method-of-moments estimations for hidden Markov models, the belief error control in POMDPs and upper-confidence-bound methods for online learning. We establish a regret bound of $O(T^{2/3}\sqrt{\log T})$ for the proposed learning algorithm where $T$ is the learning horizon. This is, to the best of our knowledge, the first algorithm achieving sublinear regret with respect to our oracle for learning general POMDPs.
    Analytically Tractable Hidden-States Inference in Bayesian Neural Networks. (arXiv:2107.03759v1 [cs.LG])
    (2 min) With few exceptions, neural networks have been relying on backpropagation and gradient descent as the inference engine in order to learn the model parameters, because the closed-form Bayesian inference for neural networks has been considered to be intractable. In this paper, we show how we can leverage the tractable approximate Gaussian inference's (TAGI) capabilities to infer hidden states, rather than only using it for inferring the network's parameters. One novel aspect it allows is to infer hidden states through the imposition of constraints designed to achieve specific objectives, as illustrated through three examples: (1) the generation of adversarial-attack examples, (2) the usage of a neural network as a black-box optimization method, and (3) the application of inference on continuous-action reinforcement learning. These applications showcase how tasks that were previously reserved to gradient-based optimization approaches can now be approached with analytically tractable inference
    Model Selection for Generic Contextual Bandits. (arXiv:2107.03455v1 [stat.ML])
    (2 min) We consider the problem of model selection for the general stochastic contextual bandits under the realizability assumption. We propose a successive refinement based algorithm called Adaptive Contextual Bandit ({\ttfamily ACB}), that works in phases and successively eliminates model classes that are too simple to fit the given instance. We prove that this algorithm is adaptive, i.e., the regret rate order-wise matches that of {\ttfamily FALCON}, the state-of-art contextual bandit algorithm of Levi et. al '20, that needs knowledge of the true model class. The price of not knowing the correct model class is only an additive term contributing to the second order term in the regret bound. This cost possess the intuitive property that it becomes smaller as the model class becomes easier to identify, and vice-versa. We then show that a much simpler explore-then-commit (ETC) style algorithm also obtains a regret rate of matching that of {\ttfamily FALCON}, despite not knowing the true model class. However, the cost of model selection is higher in ETC as opposed to in {\ttfamily ACB}, as expected. Furthermore, {\ttfamily ACB} applied to the linear bandit setting with unknown sparsity, order-wise recovers the model selection guarantees previously established by algorithms tailored to the linear setting.
    Management of Resource at the Network Edge for Federated Learning. (arXiv:2107.03428v1 [cs.NI])
    (2 min) Federated learning has been explored as a promising solution for training at the edge, where end devices collaborate to train models without sharing data with other entities. Since the execution of these learning models occurs at the edge, where resources are limited, new solutions must be developed. In this paper, we describe the recent work on resource management at the edge, and explore the challenges and future directions to allow the execution of federated learning at the edge. Some of the problems of this management, such as discovery of resources, deployment, load balancing, migration, and energy efficiency will be discussed in the paper.
    Elastic deformation of optical coherence tomography images of diabetic macular edema for deep-learning models training: how far to go?. (arXiv:2107.03651v1 [eess.IV])
    (2 min) To explore the clinical validity of elastic deformation of optical coherence tomography (OCT) images for data augmentation in the development of deep-learning model for detection of diabetic macular edema (DME).
    Short-term Renewable Energy Forecasting in Greece using Prophet Decomposition and Tree-based Ensembles. (arXiv:2107.03825v1 [cs.LG])
    (2 min) Energy production using renewable sources exhibits inherent uncertainties due to their intermittent nature. Nevertheless, the unified European energy market promotes the increasing penetration of renewable energy sources (RES) by the regional energy system operators. Consequently, RES forecasting can assist in the integration of these volatile energy sources, since it leads to higher reliability and reduced ancillary operational costs for power systems. This paper presents a new dataset for solar and wind energy generation forecast in Greece and introduces a feature engineering pipeline that enriches the dimensional space of the dataset. In addition, we propose a novel method that utilizes the innovative Prophet model, an end-to-end forecasting tool that considers several kinds of nonlinear trends in decomposing the energy time series before a tree-based ensemble provides short-term predictions. The performance of the system is measured through representative evaluation metrics, and by estimating the model's generalization under an industryprovided scheme of absolute error thresholds. The proposed hybrid model competes with baseline persistence models, tree-based regression ensembles, and the Prophet model, managing to outperform them, presenting both lower error rates and more favorable error distribution.
    Probabilistic Time Series Forecasting with Implicit Quantile Networks. (arXiv:2107.03743v1 [cs.LG])
    (2 min) Here, we propose a general method for probabilistic time series forecasting. We combine an autoregressive recurrent neural network to model temporal dynamics with Implicit Quantile Networks to learn a large class of distributions over a time-series target. When compared to other probabilistic neural forecasting models on real- and simulated data, our approach is favorable in terms of point-wise prediction accuracy as well as on estimating the underlying temporal distribution.
    Self-organized criticality in neural networks. (arXiv:2107.03402v1 [cond-mat.stat-mech])
    (2 min) We demonstrate, both analytically and numerically, that learning dynamics of neural networks is generically attracted towards a self-organized critical state. The effect can be modeled with quartic interactions between non-trainable variables (e.g. states of neurons) and trainable variables (e.g. weight matrix). Non-trainable variables are rapidly driven towards stochastic equilibrium and trainable variables are slowly driven towards learning equilibrium described by a scale-invariant distribution on a wide range of scales. Our results suggest that the scale invariance observed in many physical and biological systems might be due to some kind of learning dynamics and support the claim that the universe might be a neural network.
    In-Network Learning: Distributed Training and Inference in Networks. (arXiv:2107.03433v1 [cs.LG])
    (2 min) It is widely perceived that leveraging the success of modern machine learning techniques to mobile devices and wireless networks has the potential of enabling important new services. This, however, poses significant challenges, essentially due to that both data and processing power are highly distributed in a wireless network. In this paper, we develop a learning algorithm and an architecture that make use of multiple data streams and processing units, not only during the training phase but also during the inference phase. In particular, the analysis reveals how inference propagates and fuses across a network. We study the design criterion of our proposed method and its bandwidth requirements. Also, we discuss implementation aspects using neural networks in typical wireless radio access; and provide experiments that illustrate benefits over state-of-the-art techniques.
    Bag of Tricks for Neural Architecture Search. (arXiv:2107.03719v1 [cs.LG])
    (2 min) While neural architecture search methods have been successful in previous years and led to new state-of-the-art performance on various problems, they have also been criticized for being unstable, being highly sensitive with respect to their hyperparameters, and often not performing better than random search. To shed some light on this issue, we discuss some practical considerations that help improve the stability, efficiency and overall performance.
    Learning Vision-Guided Quadrupedal Locomotion End-to-End with Cross-Modal Transformers. (arXiv:2107.03996v1 [cs.LG])
    (2 min) We propose to address quadrupedal locomotion tasks using Reinforcement Learning (RL) with a Transformer-based model that learns to combine proprioceptive information and high-dimensional depth sensor inputs. While learning-based locomotion has made great advances using RL, most methods still rely on domain randomization for training blind agents that generalize to challenging terrains. Our key insight is that proprioceptive states only offer contact measurements for immediate reaction, whereas an agent equipped with visual sensory observations can learn to proactively maneuver environments with obstacles and uneven terrain by anticipating changes in the environment many steps ahead. In this paper, we introduce LocoTransformer, an end-to-end RL method for quadrupedal locomotion that leverages a Transformer-based model for fusing proprioceptive states and visual observations. We evaluate our method in challenging simulated environments with different obstacles and uneven terrain. We show that our method obtains significant improvements over policies with only proprioceptive state inputs, and that Transformer-based models further improve generalization across environments. Our project page with videos is at https://RchalYang.github.io/LocoTransformer .
    Manifold Hypothesis in Data Analysis: Double Geometrically-Probabilistic Approach to Manifold Dimension Estimation. (arXiv:2107.03903v1 [cs.LG])
    (2 min) Manifold hypothesis states that data points in high-dimensional space actually lie in close vicinity of a manifold of much lower dimension. In many cases this hypothesis was empirically verified and used to enhance unsupervised and semi-supervised learning. Here we present new approach to manifold hypothesis checking and underlying manifold dimension estimation. In order to do it we use two very different methods simultaneously - one geometric, another probabilistic - and check whether they give the same result. Our geometrical method is a modification for sparse data of a well-known box-counting algorithm for Minkowski dimension calculation. The probabilistic method is new. Although it exploits standard nearest neighborhood distance, it is different from methods which were previously used in such situations. This method is robust, fast and includes special preliminary data transformation. Experiments on real datasets show that the suggested approach based on two methods combination is powerful and effective.
    Likelihood-Free Frequentist Inference: Bridging Classical Statistics and Machine Learning in Simulation and Uncertainty Quantification. (arXiv:2107.03920v1 [stat.ML])
    (2 min) Many areas of science make extensive use of computer simulators that implicitly encode likelihood functions for complex systems. Classical statistical methods are poorly suited for these so-called likelihood-free inference (LFI) settings, outside the asymptotic and low-dimensional regimes. Although new machine learning methods, such as normalizing flows, have revolutionized the sample efficiency and capacity of LFI methods, it remains an open question whether they produce reliable measures of uncertainty. In this paper, we present a statistical framework for LFI that unifies classical statistics with modern machine learning to: (1) construct frequentist confidence sets and hypothesis tests with finite-sample guarantees of nominal coverage (type I error control) and power, and (2) provide rigorous diagnostics for assessing empirical coverage over the entire parameter space. We refer to our framework as likelihood-free frequentist inference (LF2I). Any method that estimates a test statistic, such as the likelihood ratio, can be plugged into our framework to create powerful tests and confidence sets with correct coverage. In this work, we specifically study two test statistics (ACORE and BFF), which, respectively, maximize versus integrate an odds function over the parameter space. Our theoretical and empirical results offer multifaceted perspectives on error sources and challenges in likelihood-free frequentist inference.
    Consistency of the Maximal Information Coefficient Estimator. (arXiv:2107.03836v1 [stat.ME])
    (2 min) The Maximal Information Coefficient (MIC) of Reshef et al. (Science, 2011) is a statistic for measuring dependence between variable pairs in large datasets. In this note, we prove that MIC is a consistent estimator of the corresponding population statistic MIC$_*$. This corrects an error in an argument of Reshef et al. (JMLR, 2016), which we describe.
    Impossibility results for fair representations. (arXiv:2107.03483v1 [cs.LG])
    (2 min) With the growing awareness to fairness in machine learning and the realization of the central role that data representation has in data processing tasks, there is an obvious interest in notions of fair data representations. The goal of such representations is that a model trained on data under the representation (e.g., a classifier) will be guaranteed to respect some fairness constraints. Such representations are useful when they can be fixed for training models on various different tasks and also when they serve as data filtering between the raw data (known to the representation designer) and potentially malicious agents that use the data under the representation to learn predictive models and make decisions. A long list of recent research papers strive to provide tools for achieving these goals. However, we prove that this is basically a futile effort. Roughly stated, we prove that no representation can guarantee the fairness of classifiers for different tasks trained using it; even the basic goal of achieving label-independent Demographic Parity fairness fails once the marginal data distribution shifts. More refined notions of fairness, like Odds Equality, cannot be guaranteed by a representation that does not take into account the task specific labeling rule with respect to which such fairness will be evaluated (even if the marginal data distribution is known a priory). Furthermore, except for trivial cases, no representation can guarantee Odds Equality fairness for any two different tasks, while allowing accurate label predictions for both. While some of our conclusions are intuitive, we formulate (and prove) crisp statements of such impossibilities, often contrasting impressions conveyed by many recent works on fair representations.
    Benchpress: a scalable and platform-independent workflow for benchmarking structure learning algorithms for graphical models. (arXiv:2107.03863v1 [stat.ML])
    (2 min) Describing the relationship between the variables in a study domain and modelling the data generating mechanism is a fundamental problem in many empirical sciences. Probabilistic graphical models are one common approach to tackle the problem. Learning the graphical structure is computationally challenging and a fervent area of current research with a plethora of algorithms being developed. To facilitate the benchmarking of different methods, we present a novel automated workflow, called benchpress for producing scalable, reproducible, and platform-independent benchmarks of structure learning algorithms for probabilistic graphical models. Benchpress is interfaced via a simple JSON-file, which makes it accessible for all users, while the code is designed in a fully modular fashion to enable researchers to contribute additional methodologies. Benchpress currently provides an interface to a large number of state-of-the-art algorithms from libraries such as BiDAG, bnlearn, GOBNILP, pcalg, r.blip, scikit-learn, TETRAD, and trilearn as well as a variety of methods for data generating models and performance evaluation. Alongside user-defined models and randomly generated datasets, the software tool also includes a number of standard datasets and graphical models from the literature, which may be included in a benchmarking workflow. We demonstrate the applicability of this workflow for learning Bayesian networks in four typical data scenarios. The source code and documentation is publicly available from this http URL
    Imitation by Predicting Observations. (arXiv:2107.03851v1 [cs.LG])
    (2 min) Imitation learning enables agents to reuse and adapt the hard-won expertise of others, offering a solution to several key challenges in learning behavior. Although it is easy to observe behavior in the real-world, the underlying actions may not be accessible. We present a new method for imitation solely from observations that achieves comparable performance to experts on challenging continuous control tasks while also exhibiting robustness in the presence of observations unrelated to the task. Our method, which we call FORM (for "Future Observation Reward Model") is derived from an inverse RL objective and imitates using a model of expert behavior learned by generative modelling of the expert's observations, without needing ground truth actions. We show that FORM performs comparably to a strong baseline IRL method (GAIL) on the DeepMind Control Suite benchmark, while outperforming GAIL in the presence of task-irrelevant features.
    CamTuner: Reinforcement-Learning based System for Camera Parameter Tuning to enhance Analytics. (arXiv:2107.03964v1 [cs.LG])
    (3 min) Complex sensors like video cameras include tens of configurable parameters, which can be set by end-users to customize the sensors to specific application scenarios. Although parameter settings significantly affect the quality of the sensor output and the accuracy of insights derived from sensor data, most end-users use a fixed parameter setting because they lack the skill or understanding to appropriately configure these parameters. We propose CamTuner, which is a system to automatically, and dynamically adapt the complex sensor to changing environments. CamTuner includes two key components. First, a bespoke analytics quality estimator, which is a deep-learning model to automatically and continuously estimate the quality of insights from an analytics unit as the environment around a sensor change. Second, a reinforcement learning (RL) module, which reacts to the changes in quality, and automatically adjusts the camera parameters to enhance the accuracy of insights. We improve the training time of the RL module by an order of magnitude by designing virtual models to mimic essential behavior of the camera: we design virtual knobs that can be set to different values to mimic the effects of assigning different values to the camera's configurable parameters, and we design a virtual camera model that mimics the output from a video camera at different times of the day. These virtual models significantly accelerate training because (a) frame rates from a real camera are limited to 25-30 fps while the virtual models enable processing at 300 fps, (b) we do not have to wait until the real camera sees different environments, which could take weeks or months, and (c) virtual knobs can be updated instantly, while it can take 200-500 ms to change the camera parameter settings. Our dynamic tuning approach results in up to 12% improvement in the accuracy of insights from several video analytics tasks.
    Identification and Adaptation with Binary-Valued Observations under Non-Persistent Excitation Condition. (arXiv:2107.03588v1 [eess.SY])
    (2 min) Dynamical systems with binary-valued observations are widely used in information industry, technology of biological pharmacy and other fields. Though there have been much efforts devoted to the identification of such systems, most of the previous investigations are based on first-order gradient algorithm which usually has much slower convergence rate than the Quasi-Newton algorithm. Moreover, persistence of excitation(PE) conditions are usually required to guarantee consistent parameter estimates in the existing literature, which are hard to be verified or guaranteed for feedback control systems. In this paper, we propose an online projected Quasi-Newton type algorithm for parameter estimation of stochastic regression models with binary-valued observations and varying thresholds. By using both the stochastic Lyapunov function and martingale estimation methods, we establish the strong consistency of the estimation algorithm and provide the convergence rate, under a signal condition which is considerably weaker than the traditional PE condition and coincides with the weakest possible excitation known for the classical least square algorithm of stochastic regression models. Convergence of adaptive predictors and their applications in adaptive control are also discussed.
    Deep Learning for Two-Sided Matching. (arXiv:2107.03427v1 [cs.GT])
    (2 min) We initiate the use of a multi-layer neural network to model two-sided matching and to explore the design space between strategy-proofness and stability. It is well known that both properties cannot be achieved simultaneously but the efficient frontier in this design space is not understood. We show empirically that it is possible to achieve a good compromise between stability and strategy-proofness-substantially better than that achievable through a convex combination of deferred acceptance (stable and strategy-proof for only one side of the market) and randomized serial dictatorship (strategy-proof but not stable).
    IowaRain: A Statewide Rain Event Dataset Based on Weather Radars and Quantitative Precipitation Estimation. (arXiv:2107.03432v1 [cs.LG])
    (2 min) Effective environmental planning and management to address climate change could be achieved through extensive environmental modeling with machine learning and conventional physical models. In order to develop and improve these models, practitioners and researchers need comprehensive benchmark datasets that are prepared and processed with environmental expertise that they can rely on. This study presents an extensive dataset of rainfall events for the state of Iowa (2016-2019) acquired from the National Weather Service Next Generation Weather Radar (NEXRAD) system and processed by a quantitative precipitation estimation system. The dataset presented in this study could be used for better disaster monitoring, response and recovery by paving the way for both predictive and prescriptive modeling.
    A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models. (arXiv:2107.03844v1 [cs.CL])
    (2 min) Bangla -- ranked as the 6th most widely spoken language across the world (https://www.ethnologue.com/guides/ethnologue200), with 230 million native speakers -- is still considered as a low-resource language in the natural language processing (NLP) community. With three decades of research, Bangla NLP (BNLP) is still lagging behind mainly due to the scarcity of resources and the challenges that come with it. There is sparse work in different areas of BNLP; however, a thorough survey reporting previous work and recent advances is yet to be done. In this study, we first provide a review of Bangla NLP tasks, resources, and tools available to the research community; we benchmark datasets collected from various platforms for nine NLP tasks using current state-of-the-art algorithms (i.e., transformer-based models). We provide comparative results for the studied NLP tasks by comparing monolingual vs. multilingual models of varying sizes. We report our results using both individual and consolidated datasets and provide data splits for future research. We reviewed a total of 108 papers and conducted 175 sets of experiments. Our results show promising performance using transformer-based models while highlighting the trade-off with computational costs. We hope that such a comprehensive survey will motivate the community to build on and further advance the research on Bangla NLP.
    MAFIA: Machine Learning Acceleration on FPGAs for IoT Applications. (arXiv:2107.03653v1 [cs.AR])
    (2 min) Recent breakthroughs in ML have produced new classes of models that allow ML inference to run directly on milliwatt-powered IoT devices. On one hand, existing ML-to-FPGA compilers are designed for deep neural-networks on large FPGAs. On the other hand, general-purpose HLS tools fail to exploit properties specific to ML inference, thereby resulting in suboptimal performance. We propose MAFIA, a tool to compile ML inference on small form-factor FPGAs for IoT applications. MAFIA provides native support for linear algebra operations and can express a variety of ML algorithms, including state-of-the-art models. We show that MAFIA-generated programs outperform best-performing variant of a commercial HLS compiler by 2.5x on average.
    Direct detection of plasticity onset through total-strain profile evolution. (arXiv:2107.03738v1 [cond-mat.mtrl-sci])
    (2 min) Plastic yielding in solids strongly depends on various conditions, such as temperature and loading rate and indeed, sample-dependent knowledge of yield points in structural materials promotes reliability in mechanical behavior. Commonly, yielding is measured through controlled mechanical testing at small or large scales, in ways that either distinguish elastic (stress) from total deformation measurements, or by identifying plastic slip contributions. In this paper we argue that instead of separate elastic/plastic measurements, yielding can be unraveled through statistical analysis of total strain fluctuations during the evolution sequence of profiles measured in-situ, through digital image correlation. We demonstrate two distinct ways of precisely quantifying yield locations in widely applicable crystal plasticity models, that apply in polycrystalline solids, either by using principal component analysis or discrete wavelet transforms. We test and compare these approaches in synthetic data of polycrystal simulations and a variety of yielding responses, through changes of the applied loading rates and the strain-rate sensitivity exponents.
    SSSE: Efficiently Erasing Samples from Trained Machine Learning Models. (arXiv:2107.03860v1 [cs.LG])
    (2 min) The availability of large amounts of user-provided data has been key to the success of machine learning for many real-world tasks. Recently, an increasing awareness has emerged that users should be given more control about how their data is used. In particular, users should have the right to prohibit the use of their data for training machine learning systems, and to have it erased from already trained systems. While several sample erasure methods have been proposed, all of them have drawbacks which have prevented them from gaining widespread adoption. Most methods are either only applicable to very specific families of models, sacrifice too much of the original model's accuracy, or they have prohibitive memory or computational requirements. In this paper, we propose an efficient and effective algorithm, SSSE, for samples erasure, that is applicable to a wide class of machine learning models. From a second-order analysis of the model's loss landscape we derive a closed-form update step of the model parameters that only requires access to the data to be erased, not to the original training set. Experiments on three datasets, CelebFaces attributes (CelebA), Animals with Attributes 2 (AwA2) and CIFAR10, show that in certain cases SSSE can erase samples almost as well as the optimal, yet impractical, gold standard of training a new model from scratch with only the permitted data.
    Quadruplet Deep Metric Learning Model for Imbalanced Time-series Fault Diagnosis. (arXiv:2107.03786v1 [cs.LG])
    (2 min) Intelligent diagnosis method based on data-driven and deep learning is an attractive and meaningful field in recent years. However, in practical application scenarios, the imbalance of time-series fault is an urgent problem to be solved. From the perspective of Bayesian probability, this paper analyzes how to improve the performance of imbalanced classification by adjusting the distance between classes and the distribution within a class and proposes a time-series fault diagnosis model based on deep metric learning. As a core of deep metric learning, a novel quadruplet data pair design considering imbalance class is proposed with reference to traditional deep metric learning. Based on such data pair, this paper proposes a quadruplet loss function which takes into account the inter-class distance and the intra-class data distribution, and pays special attention to imbalanced sample pairs. The reasonable combination of quadruplet loss and softmax loss function can reduce the impact of imbalance. Experiments on two open datasets are carried out to verify the effectiveness and robustness of the model. Experimental results show that the proposed method can effectively improve the performance of imbalanced classification.
    Patient Embeddings in Healthcare and Insurance Applications. (arXiv:2107.03913v1 [cs.CY])
    (2 min) The paper researches the problem of concept and patient representations in the medical domain. We present the patient histories from Electronic Health Records (EHRs) as temporal sequences of ICD concepts for which embeddings are learned in an unsupervised setup with a transformer-based neural network model. The model training was performed on the collection of one million patients' histories in 6 years. The predictive power of such a model is assessed in comparison with several baseline methods. A series of experiments on the MIMIC-III data show the advantage of the presented model compared to a similar system. Further, we analyze the obtained embedding space with regards to concept relations and show how knowledge from the medical domain can be successfully transferred to the practical task of insurance scoring in the form of patient embeddings.
    Rating and aspect-based opinion graph embeddings for explainable recommendations. (arXiv:2107.03385v1 [cs.IR])
    (2 min) The success of neural network embeddings has entailed a renewed interest in using knowledge graphs for a wide variety of machine learning and information retrieval tasks. In particular, recent recommendation methods based on graph embeddings have shown state-of-the-art performance. In general, these methods encode latent rating patterns and content features. Differently from previous work, in this paper, we propose to exploit embeddings extracted from graphs that combine information from ratings and aspect-based opinions expressed in textual reviews. We then adapt and evaluate state-of-the-art graph embedding techniques over graphs generated from Amazon and Yelp reviews on six domains, outperforming baseline recommenders. Additionally, our method has the advantage of providing explanations that involve the coverage of aspect-based opinions given by users about recommended items.
    Augmented Data as an Auxiliary Plug-in Towards Categorization of Crowdsourced Heritage Data. (arXiv:2107.03852v1 [cs.LG])
    (2 min) In this paper, we propose a strategy to mitigate the problem of inefficient clustering performance by introducing data augmentation as an auxiliary plug-in. Classical clustering techniques such as K-means, Gaussian mixture model and spectral clustering are central to many data-driven applications. However, recently unsupervised simultaneous feature learning and clustering using neural networks also known as Deep Embedded Clustering (DEC) has gained prominence. Pioneering works on deep feature clustering focus on defining relevant clustering loss function and choosing the right neural network for extracting features. A central problem in all these cases is data sparsity accompanied by high intra-class and low inter-class variance, which subsequently leads to poor clustering performance and erroneous candidate assignments. Towards this, we employ data augmentation techniques to improve the density of the clusters, thus improving the overall performance. We train a variant of Convolutional Autoencoder (CAE) with augmented data to construct the initial feature space as a novel model for deep clustering. We demonstrate the results of proposed strategy on crowdsourced Indian Heritage dataset. Extensive experiments show consistent improvements over existing works.
    SpecGrav -- Detection of Gravitational Waves using Deep Learning. (arXiv:2107.03607v1 [astro-ph.IM])
    (2 min) Gravitational waves are ripples in the fabric of space-time that travel at the speed of light. The detection of gravitational waves by LIGO is a major breakthrough in the field of astronomy. Deep Learning has revolutionized many industries including health care, finance and education. Deep Learning techniques have also been explored for detection of gravitational waves to overcome the drawbacks of traditional matched filtering method. However, in several researches, the training phase of neural network is very time consuming and hardware devices with large memory are required for the task. In order to reduce the extensive amount of hardware resources and time required in training a neural network for detecting gravitational waves, we made SpecGrav. We use 2D Convolutional Neural Network and spectrograms of gravitational waves embedded in noise to detect gravitational waves from binary black hole merger and binary neutron star merger. The training phase of our neural network was of about just 19 minutes on a 2GB GPU.
    Predicting Disease Progress with Imprecise Lab Test Results. (arXiv:2107.03620v1 [cs.LG])
    (2 min) In existing deep learning methods, almost all loss functions assume that sample data values used to be predicted are the only correct ones. This assumption does not hold for laboratory test data. Test results are often within tolerable or imprecision ranges, with all values in the ranges acceptable. By considering imprecision samples, we propose an imprecision range loss (IR loss) method and incorporate it into Long Short Term Memory (LSTM) model for disease progress prediction. In this method, each sample in imprecision range space has a certain probability to be the real value, participating in the loss calculation. The loss is defined as the integral of the error of each point in the impression range space. A sampling method for imprecision space is formulated. The continuous imprecision space is discretized, and a sequence of imprecise data sets are obtained, which is convenient for gradient descent learning. A heuristic learning algorithm is developed to learn the model parameters based on the imprecise data sets. Experimental results on real data show that the prediction method based on IR loss can provide more stable and consistent prediction result when test samples are generated from imprecision range.
    Modality Completion via Gaussian Process Prior Variational Autoencoders for Multi-Modal Glioma Segmentation. (arXiv:2107.03442v1 [eess.IV])
    (2 min) In large studies involving multi protocol Magnetic Resonance Imaging (MRI), it can occur to miss one or more sub-modalities for a given patient owing to poor quality (e.g. imaging artifacts), failed acquisitions, or hallway interrupted imaging examinations. In some cases, certain protocols are unavailable due to limited scan time or to retrospectively harmonise the imaging protocols of two independent studies. Missing image modalities pose a challenge to segmentation frameworks as complementary information contributed by the missing scans is then lost. In this paper, we propose a novel model, Multi-modal Gaussian Process Prior Variational Autoencoder (MGP-VAE), to impute one or more missing sub-modalities for a patient scan. MGP-VAE can leverage the Gaussian Process (GP) prior on the Variational Autoencoder (VAE) to utilize the subjects/patients and sub-modalities correlations. Instead of designing one network for each possible subset of present sub-modalities or using frameworks to mix feature maps, missing data can be generated from a single model based on all the available samples. We show the applicability of MGP-VAE on brain tumor segmentation where either, two, or three of four sub-modalities may be missing. Our experiments against competitive segmentation baselines with missing sub-modality on BraTS'19 dataset indicate the effectiveness of the MGP-VAE model for segmentation tasks.
    CHASE: Robust Visual Tracking via Cell-Level Differentiable Neural Architecture Search. (arXiv:2107.03463v1 [cs.CV])
    (2 min) A strong visual object tracker nowadays relies on its well-crafted modules, which typically consist of manually-designed network architectures to deliver high-quality tracking results. Not surprisingly, the manual design process becomes a particularly challenging barrier, as it demands sufficient prior experience, enormous effort, intuition and perhaps some good luck. Meanwhile, neural architecture search has gaining grounds in practical applications such as image segmentation, as a promising method in tackling the issue of automated search of feasible network structures. In this work, we propose a novel cell-level differentiable architecture search mechanism to automate the network design of the tracking module, aiming to adapt backbone features to the objective of a tracking network during offline training. The proposed approach is simple, efficient, and with no need to stack a series of modules to construct a network. Our approach is easy to be incorporated into existing trackers, which is empirically validated using different differentiable architecture search-based methods and tracking objectives. Extensive experimental evaluations demonstrate the superior performance of our approach over five commonly-used benchmarks. Meanwhile, our automated searching process takes 41 (18) hours for the second (first) order DARTS method on the TrackingNet dataset.
    Generalization Error of GAN from the Discriminator's Perspective. (arXiv:2107.03633v1 [cs.LG])
    (2 min) The generative adversarial network (GAN) is a well-known model for learning high-dimensional distributions, but the mechanism for its generalization ability is not understood. In particular, GAN is vulnerable to the memorization phenomenon, the eventual convergence to the empirical distribution. We consider a simplified GAN model with the generator replaced by a density, and analyze how the discriminator contributes to generalization. We show that with early stopping, the generalization error measured by Wasserstein metric escapes from the curse of dimensionality, despite that in the long term, memorization is inevitable. In addition, we present a hardness of learning result for WGAN.
    $S^3$: Sign-Sparse-Shift Reparametrization for Effective Training of Low-bit Shift Networks. (arXiv:2107.03453v1 [cs.CV])
    (2 min) Shift neural networks reduce computation complexity by removing expensive multiplication operations and quantizing continuous weights into low-bit discrete values, which are fast and energy efficient compared to conventional neural networks. However, existing shift networks are sensitive to the weight initialization, and also yield a degraded performance caused by vanishing gradient and weight sign freezing problem. To address these issues, we propose S low-bit re-parameterization, a novel technique for training low-bit shift networks. Our method decomposes a discrete parameter in a sign-sparse-shift 3-fold manner. In this way, it efficiently learns a low-bit network with a weight dynamics similar to full-precision networks and insensitive to weight initialization. Our proposed training method pushes the boundaries of shift neural networks and shows 3-bit shift networks out-performs their full-precision counterparts in terms of top-1 accuracy on ImageNet.
    Sleep syndromes onset detection based on automatic sleep staging algorithm. (arXiv:2107.03387v1 [q-bio.NC])
    (2 min) In this paper, we propose a novel method and a practical approach to predicting early onsets of sleep syndromes, including restless leg syndrome, insomnia, based on an algorithm that is comprised of two modules. A Fast Fourier Transform is applied to 30 seconds long epochs of EEG recordings to provide localized time-frequency information, and a deep convolutional LSTM neural network is trained for sleep stage classification. Automating sleep stages detection from EEG data offers great potential to tackling sleep irregularities on a daily basis. Thereby, a novel approach for sleep stage classification is proposed which combines the best of signal processing and statistics. In this study, we used the PhysioNet Sleep European Data Format (EDF) Database. The code evaluation showed impressive results, reaching an accuracy of 86.43, precision of 77.76, recall of 93,32, F1-score of 89.12 with the final mean false error loss of 0.09.

2021-07-08

  • cs.CL updates on arXiv.org

    A Comparative Study of Modular and Joint Approaches for Speaker-Attributed ASR on Monaural Long-Form Audio. (arXiv:2107.02852v1 [eess.AS])
    (2 min) Speaker-attributed automatic speech recognition (SA-ASR) is a task to recognize "who spoke what" from multi-talker recordings. An SA-ASR system usually consists of multiple modules such as speech separation, speaker diarization and ASR. On the other hand, considering the joint optimization, an end-to-end (E2E) SA-ASR model has recently been proposed with promising results on simulation data. In this paper, we present our recent study on the comparison of such modular and joint approaches towards SA-ASR on real monaural recordings. We develop state-of-the-art SA-ASR systems for both modular and joint approaches by leveraging large-scale training data, including 75 thousand hours of ASR training data and the VoxCeleb corpus for speaker representation learning. We also propose a new pipeline that performs the E2E SA-ASR model after speaker clustering. Our evaluation on the AMI meeting corpus reveals that after fine-tuning with a small real data, the joint system performs 9.2--29.4% better in accuracy compared to the best modular system while the modular system performs better before such fine-tuning. We also conduct various error analyses to show the remaining issues for the monaural SA-ASR.
    Advancing CTC-CRF Based End-to-End Speech Recognition with Wordpieces and Conformers. (arXiv:2107.03007v1 [eess.AS])
    (2 min) Automatic speech recognition systems have been largely improved in the past few decades and current systems are mainly hybrid-based and end-to-end-based. The recently proposed CTC-CRF framework inherits the data-efficiency of the hybrid approach and the simplicity of the end-to-end approach. In this paper, we further advance CTC-CRF based ASR technique with explorations on modeling units and neural architectures. Specifically, we investigate techniques to enable the recently developed wordpiece modeling units and Conformer neural networks to be succesfully applied in CTC-CRFs. Experiments are conducted on two English datasets (Switchboard, Librispeech) and a German dataset from CommonVoice. Experimental results suggest that (i) Conformer can improve the recognition performance significantly; (ii) Wordpiece-based systems perform slightly worse compared with phone-based systems for the target language with a low degree of grapheme-phoneme correspondence (e.g. English), while the two systems can perform equally strong when such degree of correspondence is high for the target language (e.g. German).
    Question Answering over Knowledge Graphs with Neural Machine Translation and Entity Linking. (arXiv:2107.02865v1 [cs.AI])
    (2 min) The goal of Question Answering over Knowledge Graphs (KGQA) is to find answers for natural language questions over a knowledge graph. Recent KGQA approaches adopt a neural machine translation (NMT) approach, where the natural language question is translated into a structured query language. However, NMT suffers from the out-of-vocabulary problem, where terms in a question may not have been seen during training, impeding their translation. This issue is particularly problematic for the millions of entities that large knowledge graphs describe. We rather propose a KGQA approach that delegates the processing of entities to entity linking (EL) systems. NMT is then used to create a query template with placeholders that are filled by entities identified in an EL phase. Slot filling is used to decide which entity fills which placeholder. Experiments for QA over Wikidata show that our approach outperforms pure NMT: while there remains a strong dependence on having seen similar query templates during training, errors relating to entities are greatly reduced.
    Trans4E: Link Prediction on Scholarly Knowledge Graphs. (arXiv:2107.03297v1 [cs.AI])
    (2 min) The incompleteness of Knowledge Graphs (KGs) is a crucial issue affecting the quality of AI-based services. In the scholarly domain, KGs describing research publications typically lack important information, hindering our ability to analyse and predict research dynamics. In recent years, link prediction approaches based on Knowledge Graph Embedding models became the first aid for this issue. In this work, we present Trans4E, a novel embedding model that is particularly fit for KGs which include N to M relations with N$\gg$M. This is typical for KGs that categorize a large number of entities (e.g., research articles, patents, persons) according to a relatively small set of categories. Trans4E was applied on two large-scale knowledge graphs, the Academia/Industry DynAmics (AIDA) and Microsoft Academic Graph (MAG), for completing the information about Fields of Study (e.g., 'neural networks', 'machine learning', 'artificial intelligence'), and affiliation types (e.g., 'education', 'company', 'government'), improving the scope and accuracy of the resulting data. We evaluated our approach against alternative solutions on AIDA, MAG, and four other benchmarks (FB15k, FB15k-237, WN18, and WN18RR). Trans4E outperforms the other models when using low embedding dimensions and obtains competitive results in high dimensions.
    Plot and Rework: Modeling Storylines for Visual Storytelling. (arXiv:2105.06950v3 [cs.CL] UPDATED)
    (2 min) Writing a coherent and engaging story is not easy. Creative writers use their knowledge and worldview to put disjointed elements together to form a coherent storyline, and work and rework iteratively toward perfection. Automated visual storytelling (VIST) models, however, make poor use of external knowledge and iterative generation when attempting to create stories. This paper introduces PR-VIST, a framework that represents the input image sequence as a story graph in which it finds the best path to form a storyline. PR-VIST then takes this path and learns to generate the final story via an iterative training process. This framework produces stories that are superior in terms of diversity, coherence, and humanness, per both automatic and human evaluations. An ablation study shows that both plotting and reworking contribute to the model's superiority.
    Neural Natural Language Processing for Unstructured Data in Electronic Health Records: a Review. (arXiv:2107.02975v1 [cs.CL])
    (2 min) Electronic health records (EHRs), digital collections of patient healthcare events and observations, are ubiquitous in medicine and critical to healthcare delivery, operations, and research. Despite this central role, EHRs are notoriously difficult to process automatically. Well over half of the information stored within EHRs is in the form of unstructured text (e.g. provider notes, operation reports) and remains largely untapped for secondary use. Recently, however, newer neural network and deep learning approaches to Natural Language Processing (NLP) have made considerable advances, outperforming traditional statistical and rule-based systems on a variety of tasks. In this survey paper, we summarize current neural NLP methods for EHR applications. We focus on a broad scope of tasks, namely, classification and prediction, word embeddings, extraction, generation, and other topics such as question answering, phenotyping, knowledge graphs, medical dialogue, multilinguality, interpretability, etc.
    On Training Instance Selection for Few-Shot Neural Text Generation. (arXiv:2107.03176v1 [cs.CL])
    (2 min) Large-scale pretrained language models have led to dramatic improvements in text generation. Impressive performance can be achieved by finetuning only on a small number of instances (few-shot setting). Nonetheless, almost all previous work simply applies random sampling to select the few-shot training instances. Little to no attention has been paid to the selection strategies and how they would affect model performance. In this work, we present a study on training instance selection in few-shot neural text generation. The selection decision is made based only on the unlabeled data so as to identify the most worthwhile data points that should be annotated under some budget of labeling cost. Based on the intuition that the few-shot training instances should be diverse and representative of the entire data distribution, we propose a simple selection strategy with K-means clustering. We show that even with the naive clustering-based approach, the generation models consistently outperform random sampling on three text generation tasks: data-to-text generation, document summarization and question generation. We hope that this work will call for more attention on this largely unexplored area.
    Text2App: A Framework for Creating Android Apps from Text Descriptions. (arXiv:2104.08301v2 [cs.CL] UPDATED)
    (2 min) We present Text2App -- a framework that allows users to create functional Android applications from natural language specifications. The conventional method of source code generation tries to generate source code directly, which is impractical for creating complex software. We overcome this limitation by transforming natural language into an abstract intermediate formal language representing an application with a substantially smaller number of tokens. The intermediate formal representation is then compiled into target source codes. This abstraction of programming details allows seq2seq networks to learn complex application structures with less overhead. In order to train sequence models, we introduce a data synthesis method grounded in a human survey. We demonstrate that Text2App generalizes well to unseen combination of app components and it is capable of handling noisy natural language instructions. We explore the possibility of creating applications from highly abstract instructions by coupling our system with GPT-3 -- a large pretrained language model. We perform an extensive human evaluation and identify the capabilities and limitations of our system. The source code, a ready-to-run demo notebook, and a demo video are publicly available at \url{https://github.com/text2app/Text2App}.
    Interventional Video Grounding with Dual Contrastive Learning. (arXiv:2106.11013v2 [cs.CV] UPDATED)
    (2 min) Video grounding aims to localize a moment from an untrimmed video for a given textual query. Existing approaches focus more on the alignment of visual and language stimuli with various likelihood-based matching or regression strategies, i.e., P(Y|X). Consequently, these models may suffer from spurious correlations between the language and video features due to the selection bias of the dataset. 1) To uncover the causality behind the model and data, we first propose a novel paradigm from the perspective of the causal inference, i.e., interventional video grounding (IVG) that leverages backdoor adjustment to deconfound the selection bias based on structured causal model (SCM) and do-calculus P(Y|do(X)). Then, we present a simple yet effective method to approximate the unobserved confounder as it cannot be directly sampled from the dataset. 2) Meanwhile, we introduce a dual contrastive learning approach (DCL) to better align the text and video by maximizing the mutual information (MI) between query and video clips, and the MI between start/end frames of a target moment and the others within a video to learn more informative visual representations. Experiments on three standard benchmarks show the effectiveness of our approaches. Our code is available on GitHub: https://github.com/nanguoshun/IVG.
    Automatic Assignment of Radiology Examination Protocols Using Pre-trained Language Models with Knowledge Distillation. (arXiv:2009.00694v3 [cs.CL] UPDATED)
    (2 min) Selecting radiology examination protocol is a repetitive, and time-consuming process. In this paper, we present a deep learning approach to automatically assign protocols to computer tomography examinations, by pre-training a domain-specific BERT model ($BERT_{rad}$). To handle the high data imbalance across exam protocols, we used a knowledge distillation approach that up-sampled the minority classes through data augmentation. We compared classification performance of the described approach with the statistical n-gram models using Support Vector Machine (SVM), Gradient Boosting Machine (GBM), and Random Forest (RF) classifiers, as well as the Google's $BERT_{base}$ model. SVM, GBM and RF achieved macro-averaged F1 scores of 0.45, 0.45, and 0.6 while $BERT_{base}$ and $BERT_{rad}$ achieved 0.61 and 0.63. Knowledge distillation improved overall performance on the minority classes, achieving a F1 score of 0.66.
    Time-Aware Ancient Chinese Text Translation and Inference. (arXiv:2107.03179v1 [cs.CL])
    (2 min) In this paper, we aim to address the challenges surrounding the translation of ancient Chinese text: (1) The linguistic gap due to the difference in eras results in translations that are poor in quality, and (2) most translations are missing the contextual information that is often very crucial to understanding the text. To this end, we improve upon past translation techniques by proposing the following: We reframe the task as a multi-label prediction task where the model predicts both the translation and its particular era. We observe that this helps to bridge the linguistic gap as chronological context is also used as auxiliary information. % As a natural step of generalization, we pivot on the modern Chinese translations to generate multilingual outputs. %We show experimentally the efficacy of our framework in producing quality translation outputs and also validate our framework on a collected task-specific parallel corpus. We validate our framework on a parallel corpus annotated with chronology information and show experimentally its efficacy in producing quality translation outputs. We release both the code and the data https://github.com/orina1123/time-aware-ancient-text-translation for future research.
    Lemmatization of Historical Old Literary Finnish Texts in Modern Orthography. (arXiv:2107.03266v1 [cs.CL])
    (2 min) Texts written in Old Literary Finnish represent the first literary work ever written in Finnish starting from the 16th century. There have been several projects in Finland that have digitized old publications and made them available for research use. However, using modern NLP methods in such data poses great challenges. In this paper we propose an approach for simultaneously normalizing and lemmatizing Old Literary Finnish into modern spelling. Our best model reaches to 96.3\% accuracy in texts written by Agricola and 87.7\% accuracy in other contemporary out-of-domain text. Our method has been made freely available on Zenodo and Github.
    Android Security using NLP Techniques: A Review. (arXiv:2107.03072v1 [cs.CR])
    (2 min) Android is among the most targeted platform by attackers. While attackers are improving their techniques, traditional solutions based on static and dynamic analysis have been also evolving. In addition to the application code, Android applications have some metadata that could be useful for security analysis of applications. Unlike traditional application distribution mechanisms, Android applications are distributed centrally in mobile markets. Therefore, beside application packages, such markets contain app information provided by app developers and app users. The availability of such useful textual data together with the advancement in Natural Language Processing (NLP) that is used to process and understand textual data has encouraged researchers to investigate the use of NLP techniques in Android security. Especially, security solutions based on NLP have accelerated in the last 5 years and proven to be useful. This study reviews these proposals and aim to explore possible research directions for future studies by presenting state-of-the-art in this domain. We mainly focus on NLP-based solutions under four categories: description-to-behaviour fidelity, description generation, privacy and malware detection.
    Explainable Prediction of Text Complexity: The Missing Preliminaries for Text Simplification. (arXiv:2007.15823v2 [cs.CL] UPDATED)
    (2 min) Text simplification reduces the language complexity of professional content for accessibility purposes. End-to-end neural network models have been widely adopted to directly generate the simplified version of input text, usually functioning as a blackbox. We show that text simplification can be decomposed into a compact pipeline of tasks to ensure the transparency and explainability of the process. The first two steps in this pipeline are often neglected: 1) to predict whether a given piece of text needs to be simplified, and 2) if yes, to identify complex parts of the text. The two tasks can be solved separately using either lexical or deep learning methods, or solved jointly. Simply applying explainable complexity prediction as a preliminary step, the out-of-sample text simplification performance of the state-of-the-art, black-box simplification models can be improved by a large margin.
    Kosp2e: Korean Speech to English Translation Corpus. (arXiv:2107.02875v1 [cs.CL])
    (2 min) Most speech-to-text (S2T) translation studies use English speech as a source, which makes it difficult for non-English speakers to take advantage of the S2T technologies. For some languages, this problem was tackled through corpus construction, but the farther linguistically from English or the more under-resourced, this deficiency and underrepresentedness becomes more significant. In this paper, we introduce kosp2e (read as `kospi'), a corpus that allows Korean speech to be translated into English text in an end-to-end manner. We adopt open license speech recognition corpus, translation corpus, and spoken language corpora to make our dataset freely available to the public, and check the performance through the pipeline and training-based approaches. Using pipeline and various end-to-end schemes, we obtain the highest BLEU of 21.3 and 18.0 for each based on the English hypothesis, validating the feasibility of our data. We plan to supplement annotations for other target languages through community contributions in the future.
    Fibrational Initial Algebra-Final Coalgebra Coincidence over Initial Algebras: Turning Verification Witnesses Upside Down. (arXiv:2105.04817v2 [cs.LO] UPDATED)
    (2 min) The coincidence between initial algebras (IAs) and final coalgebras (FCs) is a phenomenon that underpins various important results in theoretical computer science. In this paper, we identify a general fibrational condition for the IA-FC coincidence, namely in the fiber over an initial algebra in the base category. Identifying (co)algebras in a fiber as (co)inductive predicates, our fibrational IA-FC coincidence allows one to use coinductive witnesses (such as invariants) for verifying inductive properties (such as liveness). Our general fibrational theory features the technical condition of stability of chain colimits; we extend the framework to the presence of a monadic effect, too, restricting to fibrations of complete lattice-valued predicates. Practical benefits of our categorical theory are exemplified by new "upside-down" witness notions for three verification problems: probabilistic liveness, and acceptance and model-checking with respect to bottom-up tree automata.
    Over a Decade of Social Opinion Mining: A Systematic Review. (arXiv:2012.03091v2 [cs.CL] UPDATED)
    (2 min) Social media popularity and importance is on the increase due to people using it for various types of social interaction across multiple channels. This systematic review focuses on the evolving research area of Social Opinion Mining, tasked with the identification of multiple opinion dimensions, such as subjectivity, sentiment polarity, emotion, affect, sarcasm and irony, from user-generated content represented across multiple social media platforms and in various media formats, like text, image, video and audio. Through Social Opinion Mining, natural language can be understood in terms of the different opinion dimensions, as expressed by humans. This contributes towards the evolution of Artificial Intelligence which in turn helps the advancement of several real-world use cases, such as customer service and decision making. A thorough systematic review was carried out on Social Opinion Mining research which totals 485 published studies and spans a period of twelve years between 2007 and 2018. The in-depth analysis focuses on the social media platforms, techniques, social datasets, language, modality, tools and technologies, and other aspects derived. Social Opinion Mining can be utilised in many application areas, ranging from marketing, advertising and sales for product/service management, and in multiple domains and industries, such as politics, technology, finance, healthcare, sports and government. The latest developments in Social Opinion Mining beyond 2018 are also presented together with future research directions, with the aim of leaving a wider academic and societal impact in several real-world applications.
    Structured Denoising Diffusion Models in Discrete State-Spaces. (arXiv:2107.03006v1 [cs.LG])
    (2 min) Denoising diffusion probabilistic models (DDPMs) (Ho et al. 2020) have shown impressive results on image and waveform generation in continuous state spaces. Here, we introduce Discrete Denoising Diffusion Probabilistic Models (D3PMs), diffusion-like generative models for discrete data that generalize the multinomial diffusion model of Hoogeboom et al. 2021, by going beyond corruption processes with uniform transition probabilities. This includes corruption with transition matrices that mimic Gaussian kernels in continuous space, matrices based on nearest neighbors in embedding space, and matrices that introduce absorbing states. The third allows us to draw a connection between diffusion models and autoregressive and mask-based generative models. We show that the choice of transition matrix is an important design decision that leads to improved results in image and text domains. We also introduce a new loss function that combines the variational lower bound with an auxiliary cross entropy loss. For text, this model class achieves strong results on character-level text generation while scaling to large vocabularies on LM1B. On the image dataset CIFAR-10, our models approach the sample quality and exceed the log-likelihood of the continuous-space DDPM model.
    A Survey on Dialogue Summarization: Recent Advances and New Frontiers. (arXiv:2107.03175v1 [cs.CL])
    (2 min) With the development of dialogue systems and natural language generation techniques, the resurgence of dialogue summarization has attracted significant research attentions, which aims to condense the original dialogue into a shorter version covering salient information. However, there remains a lack of comprehensive survey for this task. To this end, we take the first step and present a thorough review of this research field. In detail, we provide an overview of publicly available research datasets, summarize existing works according to the domain of input dialogue as well as organize leaderboards under unified metrics. Furthermore, we discuss some future directions and give our thoughts. We hope that this first survey of dialogue summarization can provide the community with a quick access and a general picture to this task and motivate future researches.
    SinSpell: A Comprehensive Spelling Checker for Sinhala. (arXiv:2107.02983v1 [cs.CL])
    (2 min) We have built SinSpell, a comprehensive spelling checker for the Sinhala language which is spoken by over 16 million people, mainly in Sri Lanka. However, until recently, Sinhala had no spelling checker with acceptable coverage. Sinspell is still the only open source Sinhala spelling checker. SinSpell identifies possible spelling errors and suggests corrections. It also contains a module which auto-corrects evident errors. To maintain accuracy, SinSpell was designed as a rule-based system based on Hunspell. A set of words was compiled from several sources and verified. These were divided into morphological classes, and the valid roots, suffixes and prefixes for each class were identified, together with lists of irregular words and exceptions. The errors in a corpus of Sinhala documents were analysed and commonly misspelled words and types of common errors were identified. We found that the most common errors were in vowel length and similar sounding letters. Errors due to incorrect typing and encoding were also found. This analysis was used to develop the suggestion generator and auto-corrector.
    DORA: Toward Policy Optimization for Task-oriented Dialogue System with Efficient Context. (arXiv:2107.03286v1 [cs.CL])
    (2 min) Recently, reinforcement learning (RL) has been applied to task-oriented dialogue systems by using latent actions to solve shortcomings of supervised learning (SL). In this paper, we propose a multi-domain task-oriented dialogue system, called Dialogue System with Optimizing a Recurrent Action Policy using Efficient Context (DORA), that uses SL, with subsequently applied RL to optimize dialogue systems using a recurrent dialogue policy. This dialogue policy recurrently generates explicit system actions as a both word-level and high-level policy. As a result, DORA is clearly optimized during both SL and RL steps by using an explicit system action policy that considers an efficient context instead of the entire dialogue history. The system actions are both interpretable and controllable, whereas the latent actions are not. DORA improved the success rate by 6.6 points on MultiWOZ 2.0 and by 10.9 points on MultiWOZ 2.1.
    MACCIF-TDNN: Multi aspect aggregation of channel and context interdependence features in TDNN-based speaker verification. (arXiv:2107.03104v1 [cs.SD])
    (2 min) Most of the recent state-of-the-art results for speaker verification are achieved by X-vector and its subsequent variants. In this paper, we propose a new network architecture which aggregates the channel and context interdependence features from multi aspect based on Time Delay Neural Network (TDNN). Firstly, we use the SE-Res2Blocks as in ECAPA-TDNN to explicitly model the channel interdependence to realize adaptive calibration of channel features, and process local context features in a multi-scale way at a more granular level compared with conventional TDNN-based methods. Secondly, we explore to use the encoder structure of Transformer to model the global context interdependence features at an utterance level which can capture better long term temporal characteristics. Before the pooling layer, we aggregate the outputs of SE-Res2Blocks and Transformer encoder to leverage the complementary channel and context interdependence features learned by themself respectively. Finally, instead of performing a single attentive statistics pooling, we also find it beneficial to extend the pooling method in a multi-head way which can discriminate features from multiple aspect. The proposed MACCIF-TDNN architecture can outperform most of the state-of-the-art TDNN-based systems on VoxCeleb1 test sets.
    Robustifying Multi-hop QA through Pseudo-Evidentiality Training. (arXiv:2107.03242v1 [cs.CL])
    (2 min) This paper studies the bias problem of multi-hop question answering models, of answering correctly without correct reasoning. One way to robustify these models is by supervising to not only answer right, but also with right reasoning chains. An existing direction is to annotate reasoning chains to train models, requiring expensive additional annotations. In contrast, we propose a new approach to learn evidentiality, deciding whether the answer prediction is supported by correct evidences, without such annotations. Instead, we compare counterfactual changes in answer confidence with and without evidence sentences, to generate "pseudo-evidentiality" annotations. We validate our proposed model on an original set and challenge set in HotpotQA, showing that our method is accurate and robust in multi-hop reasoning.
    Linear-time calculation of the expected sum of edge lengths in random projective linearizations of trees. (arXiv:2107.03277v1 [cs.CL])
    (2 min) The syntactic structure of a sentence is often represented using syntactic dependency trees. The sum of the distances between syntactically related words has been in the limelight for the past decades. Research on dependency distances led to the formulation of the principle of dependency distance minimization whereby words in sentences are ordered so as to minimize that sum. Numerous random baselines have been defined to carry out related quantitative studies on languages. The simplest random baseline is the expected value of the sum in unconstrained random permutations of the words in the sentence, namely when all the shufflings of the words of a sentence are allowed and equally likely. Here we focus on a popular baseline: random projective permutations of the words of the sentence, that is, permutations where the syntactic dependency structure is projective, a formal constraint that sentences satisfy often in languages. Thus far, the expectation of the sum of dependency distances in random projective shufflings of a sentence has been estimated approximately with a Monte Carlo procedure whose cost is of the order of $Zn$, where $n$ is the number of words of the sentence and $Z$ is the number of samples; the larger $Z$, the lower the error of the estimation but the larger the time cost. Here we present formulae to compute that expectation without error in time of the order of $n$. Furthermore, we show that star trees maximize it, and devise a dynamic programming algorithm to retrieve the trees that minimize it.
    MedGPT: Medical Concept Prediction from Clinical Narratives. (arXiv:2107.03134v1 [cs.CL])
    (2 min) The data available in Electronic Health Records (EHRs) provides the opportunity to transform care, and the best way to provide better care for one patient is through learning from the data available on all other patients. Temporal modelling of a patient's medical history, which takes into account the sequence of past events, can be used to predict future events such as a diagnosis of a new disorder or complication of a previous or existing disorder. While most prediction approaches use mostly the structured data in EHRs or a subset of single-domain predictions and outcomes, we present MedGPT a novel transformer-based pipeline that uses Named Entity Recognition and Linking tools (i.e. MedCAT) to structure and organize the free text portion of EHRs and anticipate a range of future medical events (initially disorders). Since a large portion of EHR data is in text form, such an approach benefits from a granular and detailed view of a patient while introducing modest additional noise. MedGPT effectively deals with the noise and the added granularity, and achieves a precision of 0.344, 0.552 and 0.640 (vs LSTM 0.329, 0.538 and 0.633) when predicting the top 1, 3 and 5 candidate future disorders on real world hospital data from King's College Hospital, London, UK (\textasciitilde600k patients). We also show that our model captures medical knowledge by testing it on an experimental medical multiple choice question answering task, and by examining the attentional focus of the model using gradient-based saliency methods.
    Hierarchical Text Classification of Urdu News using Deep Neural Network. (arXiv:2107.03141v1 [cs.CL])
    (2 min) Digital text is increasing day by day on the internet. It is very challenging to classify a large and heterogeneous collection of data, which require improved information processing methods to organize text. To classify large size of corpus, one common approach is to use hierarchical text classification, which aims to classify textual data in a hierarchical structure. Several approaches have been proposed to tackle classification of text but most of the research has been done on English language. This paper proposes a deep learning model for hierarchical text classification of news in Urdu language - consisting of 51,325 sentences from 8 online news websites belonging to the following genres: Sports; Technology; and Entertainment. The objectives of this paper are twofold: (1) to develop a large human-annotated dataset of news in Urdu language for hierarchical text classification; and (2) to classify Urdu news hierarchically using our proposed model based on LSTM mechanism named as Hierarchical Multi-layer LSTMs (HMLSTM). Our model consists of two modules: Text Representing Layer, for obtaining text representation in which we use Word2vec embedding to transform the words to vector and Urdu Hierarchical LSTM Layer (UHLSTML) an end-to-end fully connected deep LSTMs network to perform automatic feature learning, we train one LSTM layer for each level of the class hierarchy. We have performed extensive experiments on our self created dataset named as Urdu News Dataset for Hierarchical Text Classification (UNDHTC). The result shows that our proposed method is very effective for hierarchical text classification and it outperforms baseline methods significantly and also achieved good results as compare to deep neural model.
    A Survey on Data Augmentation for Text Classification. (arXiv:2107.03158v1 [cs.CL])
    (2 min) Data augmentation, the artificial creation of training data for machine learning by transformations, is a widely studied research field across machine learning disciplines. While it is useful for increasing the generalization capabilities of a model, it can also address many other challenges and problems, from overcoming a limited amount of training data over regularizing the objective to limiting the amount data used to protect privacy. Based on a precise description of the goals and applications of data augmentation (C1) and a taxonomy for existing works (C2), this survey is concerned with data augmentation methods for textual classification and aims to achieve a concise and comprehensive overview for researchers and practitioners (C3). Derived from the taxonomy, we divided more than 100 methods into 12 different groupings and provide state-of-the-art references expounding which methods are highly promising (C4). Finally, research perspectives that may constitute a building block for future work are given (C5).
    Answering Chinese Elementary School Social Study Multiple Choice Questions. (arXiv:2107.02893v1 [cs.CL])
    (2 min) We present a novel approach to answer the Chinese elementary school Social Study Multiple Choice questions. Although BERT has demonstrated excellent performance on Reading Comprehension tasks, it is found not good at handling some specific types of questions, such as Negation, All-of-the-above, and None-of-the-above. We thus propose a novel framework to cascade BERT with a Pre-Processor and an Answer-Selector modules to tackle the above challenges. Experimental results show the proposed approach effectively improves the performance of BERT, and thus demonstrate the feasibility of supplementing BERT with additional modules.
    Deep Extrapolation for Attribute-Enhanced Generation. (arXiv:2107.02968v1 [cs.LG])
    (2 min) Attribute extrapolation in sample generation is challenging for deep neural networks operating beyond the training distribution. We formulate a new task for extrapolation in sequence generation, focusing on natural language and proteins, and propose GENhance, a generative framework that enhances attributes through a learned latent space. Trained on movie reviews and a computed protein stability dataset, GENhance can generate strongly-positive text reviews and highly stable protein sequences without being exposed to similar data during training. We release our benchmark tasks and models to contribute to the study of generative modeling extrapolation and data-driven design in biology and chemistry.
    EchoEA: Echo Information between Entities and Relations for Entity Alignment. (arXiv:2107.03054v1 [cs.CL])
    (2 min) Entity alignment (EA) is to discover entities referring to the same object in the real world from different knowledge graphs (KGs). It plays an important role in automatically integrating KGs from multiple sources. Existing knowledge graph embedding (KGE) methods based on Graph Neural Networks (GNNs) have achieved promising results, which enhance entity representation with relation information unidirectionally. Besides, more and more methods introduce semi-supervision to ask for more labeled training data. However, two challenges still exist in these methods: (1) Insufficient interaction: The interaction between entities and relations is insufficiently utilized. (2) Low-quality bootstrapping: The generated semi-supervised data is of low quality. In this paper, we propose a novel framework, Echo Entity Alignment (EchoEA), which leverages self-attention mechanism to spread entity information to relations and echo back to entities. The relation representation is dynamically computed from entity representation. Symmetrically, the next entity representation is dynamically calculated from relation representation, which shows sufficient interaction. Furthermore, we propose attribute-combined bi-directional global-filtered strategy (ABGS) to improve bootstrapping, reduce false samples and generate high-quality training data. The experimental results on three real-world cross-lingual datasets are stable at around 96\% at hits@1 on average, showing that our approach not only significantly outperforms the state-of-the-art methods, but also is universal and transferable for existing KGE methods.
    Topic Modeling in the Voynich Manuscript. (arXiv:2107.02858v1 [cs.CL])
    (2 min) This article presents the results of investigations using topic modeling of the Voynich Manuscript (Beinecke MS408). Topic modeling is a set of computational methods which are used to identify clusters of subjects within text. We use latent dirichlet allocation, latent semantic analysis, and nonnegative matrix factorization to cluster Voynich pages into `topics'. We then compare the topics derived from the computational models to clusters derived from the Voynich illustrations and from paleographic analysis. We find that computationally derived clusters match closely to a conjunction of scribe and subject matter (as per the illustrations), providing further evidence that the Voynich Manuscript contains meaningful text.
    Efficient Transformer for Direct Speech Translation. (arXiv:2107.03069v1 [cs.CL])
    (2 min) The advent of Transformer-based models has surpassed the barriers of text. When working with speech, we must face a problem: the sequence length of an audio input is not suitable for the Transformer. To bypass this problem, a usual approach is adding strided convolutional layers, to reduce the sequence length before using the Transformer. In this paper, we propose a new approach for direct Speech Translation, where thanks to an efficient Transformer we can work with a spectrogram without having to use convolutional layers before the Transformer. This allows the encoder to learn directly from the spectrogram and no information is lost. We have created an encoder-decoder model, where the encoder is an efficient Transformer -- the Longformer -- and the decoder is a traditional Transformer decoder. Our results, which are close to the ones obtained with the standard approach, show that this is a promising research direction.
  • cs.CV updates on arXiv.org

    FasterPose: A Faster Simple Baseline for Human Pose Estimation. (arXiv:2107.03215v1 [cs.CV])
    (2 min) The performance of human pose estimation depends on the spatial accuracy of keypoint localization. Most existing methods pursue the spatial accuracy through learning the high-resolution (HR) representation from input images. By the experimental analysis, we find that the HR representation leads to a sharp increase of computational cost, while the accuracy improvement remains marginal compared with the low-resolution (LR) representation. In this paper, we propose a design paradigm for cost-effective network with LR representation for efficient pose estimation, named FasterPose. Whereas the LR design largely shrinks the model complexity, yet how to effectively train the network with respect to the spatial accuracy is a concomitant challenge. We study the training behavior of FasterPose, and formulate a novel regressive cross-entropy (RCE) loss function for accelerating the convergence and promoting the accuracy. The RCE loss generalizes the ordinary cross-entropy loss from the binary supervision to a continuous range, thus the training of pose estimation network is able to benefit from the sigmoid function. By doing so, the output heatmap can be inferred from the LR features without loss of spatial accuracy, while the computational cost and model size has been significantly reduced. Compared with the previously dominant network of pose estimation, our method reduces 58% of the FLOPs and simultaneously gains 1.3% improvement of accuracy. Extensive experiments show that FasterPose yields promising results on the common benchmarks, i.e., COCO and MPII, consistently validating the effectiveness and efficiency for practical utilization, especially the low-latency and low-energy-budget applications in the non-GPU scenarios.
    Few-Shot Learning by Integrating Spatial and Frequency Representation. (arXiv:2105.05348v2 [cs.CV] UPDATED)
    (2 min) Human beings can recognize new objects with only a few labeled examples, however, few-shot learning remains a challenging problem for machine learning systems. Most previous algorithms in few-shot learning only utilize spatial information of the images. In this paper, we propose to integrate the frequency information into the learning model to boost the discrimination ability of the system. We employ Discrete Cosine Transformation (DCT) to generate the frequency representation, then, integrate the features from both the spatial domain and frequency domain for classification. The proposed strategy and its effectiveness are validated with different backbones, datasets, and algorithms. Extensive experiments demonstrate that the frequency information is complementary to the spatial representations in few-shot classification. The classification accuracy is boosted significantly by integrating features from both the spatial and frequency domains in different few-shot learning tasks.
    Novel Visual Category Discovery with Dual Ranking Statistics and Mutual Knowledge Distillation. (arXiv:2107.03358v1 [cs.CV])
    (2 min) In this paper, we tackle the problem of novel visual category discovery, i.e., grouping unlabelled images from new classes into different semantic partitions by leveraging a labelled dataset that contains images from other different but relevant categories. This is a more realistic and challenging setting than conventional semi-supervised learning. We propose a two-branch learning framework for this problem, with one branch focusing on local part-level information and the other branch focusing on overall characteristics. To transfer knowledge from the labelled data to the unlabelled, we propose using dual ranking statistics on both branches to generate pseudo labels for training on the unlabelled data. We further introduce a mutual knowledge distillation method to allow information exchange and encourage agreement between the two branches for discovering new categories, allowing our model to enjoy the benefits of global and local features. We comprehensively evaluate our method on public benchmarks for generic object classification, as well as the more challenging datasets for fine-grained visual recognition, achieving state-of-the-art performance.
    Latent Space Regularization for Unsupervised Domain Adaptation in Semantic Segmentation. (arXiv:2104.02633v3 [cs.CV] UPDATED)
    (2 min) Deep convolutional neural networks for semantic segmentation achieve outstanding accuracy, however they also have a couple of major drawbacks: first, they do not generalize well to distributions slightly different from the one of the training data; second, they require a huge amount of labeled data for their optimization. In this paper, we introduce feature-level space-shaping regularization strategies to reduce the domain discrepancy in semantic segmentation. In particular, for this purpose we jointly enforce a clustering objective, a perpendicularity constraint and a norm alignment goal on the feature vectors corresponding to source and target samples. Additionally, we propose a novel measure able to capture the relative efficacy of an adaptation strategy compared to supervised training. We verify the effectiveness of such methods in the autonomous driving setting achieving state-of-the-art results in multiple synthetic-to-real road scenes benchmarks.
    Long Short-Term Transformer for Online Action Detection. (arXiv:2107.03377v1 [cs.CV])
    (2 min) In this paper, we present Long Short-term TRansformer (LSTR), a new temporal modeling algorithm for online action detection, by employing a long- and short-term memories mechanism that is able to model prolonged sequence data. It consists of an LSTR encoder that is capable of dynamically exploiting coarse-scale historical information from an extensively long time window (e.g., 2048 long-range frames of up to 8 minutes), together with an LSTR decoder that focuses on a short time window (e.g., 32 short-range frames of 8 seconds) to model the fine-scale characterization of the ongoing event. Compared to prior work, LSTR provides an effective and efficient method to model long videos with less heuristic algorithm design. LSTR achieves significantly improved results on standard online action detection benchmarks, THUMOS'14, TVSeries, and HACS Segment, over the existing state-of-the-art approaches. Extensive empirical analysis validates the setup of the long- and short-term memories and the design choices of LSTR.
    Cross-View Exocentric to Egocentric Video Synthesis. (arXiv:2107.03120v1 [cs.CV])
    (2 min) Cross-view video synthesis task seeks to generate video sequences of one view from another dramatically different view. In this paper, we investigate the exocentric (third-person) view to egocentric (first-person) view video generation task. This is challenging because egocentric view sometimes is remarkably different from the exocentric view. Thus, transforming the appearances across the two different views is a non-trivial task. Particularly, we propose a novel Bi-directional Spatial Temporal Attention Fusion Generative Adversarial Network (STA-GAN) to learn both spatial and temporal information to generate egocentric video sequences from the exocentric view. The proposed STA-GAN consists of three parts: temporal branch, spatial branch, and attention fusion. First, the temporal and spatial branches generate a sequence of fake frames and their corresponding features. The fake frames are generated in both downstream and upstream directions for both temporal and spatial branches. Next, the generated four different fake frames and their corresponding features (spatial and temporal branches in two directions) are fed into a novel multi-generation attention fusion module to produce the final video sequence. Meanwhile, we also propose a novel temporal and spatial dual-discriminator for more robust network optimization. Extensive experiments on the Side2Ego and Top2Ego datasets show that the proposed STA-GAN significantly outperforms the existing methods.
    A Deep Residual Star Generative Adversarial Network for multi-domain Image Super-Resolution. (arXiv:2107.03145v1 [eess.IV])
    (2 min) Recently, most of state-of-the-art single image super-resolution (SISR) methods have attained impressive performance by using deep convolutional neural networks (DCNNs). The existing SR methods have limited performance due to a fixed degradation settings, i.e. usually a bicubic downscaling of low-resolution (LR) image. However, in real-world settings, the LR degradation process is unknown which can be bicubic LR, bilinear LR, nearest-neighbor LR, or real LR. Therefore, most SR methods are ineffective and inefficient in handling more than one degradation settings within a single network. To handle the multiple degradation, i.e. refers to multi-domain image super-resolution, we propose a deep Super-Resolution Residual StarGAN (SR2*GAN), a novel and scalable approach that super-resolves the LR images for the multiple LR domains using only a single model. The proposed scheme is trained in a StarGAN like network topology with a single generator and discriminator networks. We demonstrate the effectiveness of our proposed approach in quantitative and qualitative experiments compared to other state-of-the-art methods.
    PoseRN: A 2D pose refinement network for bias-free multi-view 3D human pose estimation. (arXiv:2107.03000v1 [cs.CV])
    (2 min) We propose a new 2D pose refinement network that learns to predict the human bias in the estimated 2D pose. There are biases in 2D pose estimations that are due to differences between annotations of 2D joint locations based on annotators' perception and those defined by motion capture (MoCap) systems. These biases are crafted into publicly available 2D pose datasets and cannot be removed with existing error reduction approaches. Our proposed pose refinement network allows us to efficiently remove the human bias in the estimated 2D poses and achieve highly accurate multi-view 3D human pose estimation.
    Scalable Data Balancing for Unlabeled Satellite Imagery. (arXiv:2107.03227v1 [cs.CV])
    (2 min) Data imbalance is a ubiquitous problem in machine learning. In large scale collected and annotated datasets, data imbalance is either mitigated manually by undersampling frequent classes and oversampling rare classes, or planned for with imputation and augmentation techniques. In both cases balancing data requires labels. In other words, only annotated data can be balanced. Collecting fully annotated datasets is challenging, especially for large scale satellite systems such as the unlabeled NASA's 35 PB Earth Imagery dataset. Although the NASA Earth Imagery dataset is unlabeled, there are implicit properties of the data source that we can rely on to hypothesize about its imbalance, such as distribution of land and water in the case of the Earth's imagery. We present a new iterative method to balance unlabeled data. Our method utilizes image embeddings as a proxy for image labels that can be used to balance data, and ultimately when trained increases overall accuracy.
    GA-NET: Global Attention Network for Point Cloud Semantic Segmentation. (arXiv:2107.03101v1 [cs.CV])
    (2 min) How to learn long-range dependencies from 3D point clouds is a challenging problem in 3D point cloud analysis. Addressing this problem, we propose a global attention network for point cloud semantic segmentation, named as GA-Net, consisting of a point-independent global attention module and a point-dependent global attention module for obtaining contextual information of 3D point clouds in this paper. The point-independent global attention module simply shares a global attention map for all 3D points. In the point-dependent global attention module, for each point, a novel random cross attention block using only two randomly sampled subsets is exploited to learn the contextual information of all the points. Additionally, we design a novel point-adaptive aggregation block to replace linear skip connection for aggregating more discriminate features. Extensive experimental results on three 3D public datasets demonstrate that our method outperforms state-of-the-art methods in most cases.
    A convolutional neural network for teeth margin detection on 3-dimensional dental meshes. (arXiv:2107.03030v1 [cs.CV])
    (2 min) We proposed a convolutional neural network for vertex classification on 3-dimensional dental meshes, and used it to detect teeth margins. An expanding layer was constructed to collect statistic values of neighbor vertex features and compute new features for each vertex with convolutional neural networks. An end-to-end neural network was proposed to take vertex features, including coordinates, curvatures and distance, as input and output each vertex classification label. Several network structures with different parameters of expanding layers and a base line network without expanding layers were designed and trained by 1156 dental meshes. The accuracy, recall and precision were validated on 145 dental meshes to rate the best network structures, which were finally tested on another 144 dental meshes. All networks with our expanding layers performed better than baseline, and the best one achieved an accuracy of 0.877 both on validation dataset and test dataset.
    WeClick: Weakly-Supervised Video Semantic Segmentation with Click Annotations. (arXiv:2107.03088v1 [cs.CV])
    (2 min) Compared with tedious per-pixel mask annotating, it is much easier to annotate data by clicks, which costs only several seconds for an image. However, applying clicks to learn video semantic segmentation model has not been explored before. In this work, we propose an effective weakly-supervised video semantic segmentation pipeline with click annotations, called WeClick, for saving laborious annotating effort by segmenting an instance of the semantic class with only a single click. Since detailed semantic information is not captured by clicks, directly training with click labels leads to poor segmentation predictions. To mitigate this problem, we design a novel memory flow knowledge distillation strategy to exploit temporal information (named memory flow) in abundant unlabeled video frames, by distilling the neighboring predictions to the target frame via estimated motion. Moreover, we adopt vanilla knowledge distillation for model compression. In this case, WeClick learns compact video semantic segmentation models with the low-cost click annotations during the training phase yet achieves real-time and accurate models during the inference period. Experimental results on Cityscapes and Camvid show that WeClick outperforms the state-of-the-art methods, increases performance by 10.24% mIoU than baseline, and achieves real-time execution.
    Semi-supervised Left Atrium Segmentation with Mutual Consistency Training. (arXiv:2103.02911v2 [cs.CV] UPDATED)
    (2 min) Semi-supervised learning has attracted great attention in the field of machine learning, especially for medical image segmentation tasks, since it alleviates the heavy burden of collecting abundant densely annotated data for training. However, most of existing methods underestimate the importance of challenging regions (e.g. small branches or blurred edges) during training. We believe that these unlabeled regions may contain more crucial information to minimize the uncertainty prediction for the model and should be emphasized in the training process. Therefore, in this paper, we propose a novel Mutual Consistency Network (MC-Net) for semi-supervised left atrium segmentation from 3D MR images. Particularly, our MC-Net consists of one encoder and two slightly different decoders, and the prediction discrepancies of two decoders are transformed as an unsupervised loss by our designed cycled pseudo label scheme to encourage mutual consistency. Such mutual consistency encourages the two decoders to have consistent and low-entropy predictions and enables the model to gradually capture generalized features from these unlabeled challenging regions. We evaluate our MC-Net on the public Left Atrium (LA) database and it obtains impressive performance gains by exploiting the unlabeled data effectively. Our MC-Net outperforms six recent semi-supervised methods for left atrium segmentation, and sets the new state-of-the-art performance on the LA database.
    Deep Learning based Micro-expression Recognition: A Survey. (arXiv:2107.02823v1 [cs.CV])
    (2 min) Micro-expressions (MEs) are involuntary facial movements revealing people's hidden feelings in high-stake situations and have practical importance in medical treatment, national security, interrogations and many human-computer interaction systems. Early methods for MER mainly based on traditional appearance and geometry features. Recently, with the success of deep learning (DL) in various fields, neural networks have received increasing interests in MER. Different from macro-expressions, MEs are spontaneous, subtle, and rapid facial movements, leading to difficult data collection, thus have small-scale datasets. DL based MER becomes challenging due to above ME characters. To data, various DL approaches have been proposed to solve the ME issues and improve MER performance. In this survey, we provide a comprehensive review of deep micro-expression recognition (MER), including datasets, deep MER pipeline, and the bench-marking of most influential methods. This survey defines a new taxonomy for the field, encompassing all aspects of MER based on DL. For each aspect, the basic approaches and advanced developments are summarized and discussed. In addition, we conclude the remaining challenges and and potential directions for the design of robust deep MER systems. To the best of our knowledge, this is the first survey of deep MER methods, and this survey can serve as a reference point for future MER research.
    Speaker embeddings by modeling channel-wise correlations. (arXiv:2104.02571v2 [eess.AS] UPDATED)
    (2 min) Speaker embeddings extracted with deep 2D convolutional neural networks are typically modeled as projections of first and second order statistics of channel-frequency pairs onto a linear layer, using either average or attentive pooling along the time axis. In this paper we examine an alternative pooling method, where pairwise correlations between channels for given frequencies are used as statistics. The method is inspired by style-transfer methods in computer vision, where the style of an image, modeled by the matrix of channel-wise correlations, is transferred to another image, in order to produce a new image having the style of the first and the content of the second. By drawing analogies between image style and speaker characteristics, and between image content and phonetic sequence, we explore the use of such channel-wise correlations features to train a ResNet architecture in an end-to-end fashion. Our experiments on VoxCeleb demonstrate the effectiveness of the proposed pooling method in speaker recognition.
    MuVAM: A Multi-View Attention-based Model for Medical Visual Question Answering. (arXiv:2107.03216v1 [cs.CV])
    (2 min) Medical Visual Question Answering (VQA) is a multi-modal challenging task widely considered by research communities of the computer vision and natural language processing. Since most current medical VQA models focus on visual content, ignoring the importance of text, this paper proposes a multi-view attention-based model(MuVAM) for medical visual question answering which integrates the high-level semantics of medical images on the basis of text description. Firstly, different methods are utilized to extract the features of the image and the question for the two modalities of vision and text. Secondly, this paper proposes a multi-view attention mechanism that include Image-to-Question (I2Q) attention and Word-to-Text (W2T) attention. Multi-view attention can correlate the question with image and word in order to better analyze the question and get an accurate answer. Thirdly, a composite loss is presented to predict the answer accurately after multi-modal feature fusion and improve the similarity between visual and textual cross-modal features. It consists of classification loss and image-question complementary (IQC) loss. Finally, for data errors and missing labels in the VQA-RAD dataset, we collaborate with medical experts to correct and complete this dataset and then construct an enhanced dataset, VQA-RADPh. The experiments on these two datasets show that the effectiveness of MuVAM surpasses the state-of-the-art method.
    Differentiable Architecture Pruning for Transfer Learning. (arXiv:2107.03375v1 [cs.LG])
    (2 min) We propose a new gradient-based approach for extracting sub-architectures from a given large model. Contrarily to existing pruning methods, which are unable to disentangle the network architecture and the corresponding weights, our architecture-pruning scheme produces transferable new structures that can be successfully retrained to solve different tasks. We focus on a transfer-learning setup where architectures can be trained on a large data set but very few data points are available for fine-tuning them on new tasks. We define a new gradient-based algorithm that trains architectures of arbitrarily low complexity independently from the attached weights. Given a search space defined by an existing large neural model, we reformulate the architecture search task as a complexity-penalized subset-selection problem and solve it through a two-temperature relaxation scheme. We provide theoretical convergence guarantees and validate the proposed transfer-learning strategy on real data.
    Introducing the structural bases of typicality effects in deep learning. (arXiv:2107.03279v1 [cs.CV])
    (2 min) In this paper, we hypothesize that the effects of the degree of typicality in natural semantic categories can be generated based on the structure of artificial categories learned with deep learning models. Motivated by the human approach to representing natural semantic categories and based on the Prototype Theory foundations, we propose a novel Computational Prototype Model (CPM) to represent the internal structure of semantic categories. Unlike other prototype learning approaches, our mathematical framework proposes a first approach to provide deep neural networks with the ability to model abstract semantic concepts such as category central semantic meaning, typicality degree of an object's image, and family resemblance relationship. We proposed several methodologies based on the typicality's concept to evaluate our CPM-model in image semantic processing tasks such as image classification, a global semantic description, and transfer learning. Our experiments on different image datasets, such as ImageNet and Coco, showed that our approach might be an admissible proposition in the effort to endow machines with greater power of abstraction for the semantic representation of objects' categories.
    Decentralised Learning from Independent Multi-Domain Labels for Person Re-Identification. (arXiv:2006.04150v5 [cs.CV] UPDATED)
    (2 min) Deep learning has been successful for many computer vision tasks due to the availability of shared and centralised large-scale training data. However, increasing awareness of privacy concerns poses new challenges to deep learning, especially for human subject related recognition such as person re-identification (Re-ID). In this work, we solve the Re-ID problem by decentralised learning from non-shared private training data distributed at multiple user sites of independent multi-domain label spaces. We propose a novel paradigm called Federated Person Re-Identification (FedReID) to construct a generalisable global model (a central server) by simultaneously learning with multiple privacy-preserved local models (local clients). Specifically, each local client receives global model updates from the server and trains a local model using its local data independent from all the other clients. Then, the central server aggregates transferrable local model updates to construct a generalisable global feature embedding model without accessing local data so to preserve local privacy. This client-server collaborative learning process is iteratively performed under privacy control, enabling FedReID to realise decentralised learning without sharing distributed data nor collecting any centralised data. Extensive experiments on ten Re-ID benchmarks show that FedReID achieves compelling generalisation performance beyond any locally trained models without using shared training data, whilst inherently protects the privacy of each local client. This is uniquely advantageous over contemporary Re-ID methods.
    Self-supervised Outdoor Scene Relighting. (arXiv:2107.03106v1 [cs.CV])
    (2 min) Outdoor scene relighting is a challenging problem that requires good understanding of the scene geometry, illumination and albedo. Current techniques are completely supervised, requiring high quality synthetic renderings to train a solution. Such renderings are synthesized using priors learned from limited data. In contrast, we propose a self-supervised approach for relighting. Our approach is trained only on corpora of images collected from the internet without any user-supervision. This virtually endless source of training data allows training a general relighting solution. Our approach first decomposes an image into its albedo, geometry and illumination. A novel relighting is then produced by modifying the illumination parameters. Our solution capture shadow using a dedicated shadow prediction map, and does not rely on accurate geometry estimation. We evaluate our technique subjectively and objectively using a new dataset with ground-truth relighting. Results show the ability of our technique to produce photo-realistic and physically plausible results, that generalizes to unseen scenes.
    Rotation Transformation Network: Learning View-Invariant Point Cloud for Classification and Segmentation. (arXiv:2107.03105v1 [cs.CV])
    (2 min) Many recent works show that a spatial manipulation module could boost the performances of deep neural networks (DNNs) for 3D point cloud analysis. In this paper, we aim to provide an insight into spatial manipulation modules. Firstly, we find that the smaller the rotational degree of freedom (RDF) of objects is, the more easily these objects are handled by these DNNs. Then, we investigate the effect of the popular T-Net module and find that it could not reduce the RDF of objects. Motivated by the above two issues, we propose a rotation transformation network for point cloud analysis, called RTN, which could reduce the RDF of input 3D objects to 0. The RTN could be seamlessly inserted into many existing DNNs for point cloud analysis. Extensive experimental results on 3D point cloud classification and segmentation tasks demonstrate that the proposed RTN could improve the performances of several state-of-the-art methods significantly.
    Greedy Offset-Guided Keypoint Grouping for Human Pose Estimation. (arXiv:2107.03098v1 [cs.CV])
    (2 min) We propose a simple yet reliable bottom-up approach with a good trade-off between accuracy and efficiency for the problem of multi-person pose estimation. Given an image, we employ an Hourglass Network to infer all the keypoints from different persons indiscriminately as well as the guiding offsets connecting the adjacent keypoints belonging to the same persons. Then, we greedily group the candidate keypoints into multiple human poses (if any), utilizing the predicted guiding offsets. And we refer to this process as greedy offset-guided keypoint grouping (GOG). Moreover, we revisit the encoding-decoding method for the multi-person keypoint coordinates and reveal some important facts affecting accuracy. Experiments have demonstrated the obvious performance improvements brought by the introduced components. Our approach is comparable to the state of the art on the challenging COCO dataset under fair conditions. The source code and our pre-trained model are publicly available online.
    Video-Based Camera Localization Using Anchor View Detection and Recursive 3D Reconstruction. (arXiv:2107.03068v1 [cs.CV])
    (2 min) In this paper we introduce a new camera localization strategy designed for image sequences captured in challenging industrial situations such as industrial parts inspection. To deal with peculiar appearances that hurt standard 3D reconstruction pipeline, we exploit pre-knowledge of the scene by selecting key frames in the sequence (called as anchors) which are roughly connected to a certain location. Our method then seek the location of each frame in time-order, while recursively updating an augmented 3D model which can provide current camera location and surrounding 3D structure. In an experiment on a practical industrial situation, our method can localize over 99% frames in the input sequence, whereas standard localization methods fail to reconstruct a complete camera trajectory.
    Visual Odometry with an Event Camera Using Continuous Ray Warping and Volumetric Contrast Maximization. (arXiv:2107.03011v1 [cs.CV])
    (2 min) We present a new solution to tracking and mapping with an event camera. The motion of the camera contains both rotation and translation, and the displacements happen in an arbitrarily structured environment. As a result, the image matching may no longer be represented by a low-dimensional homographic warping, thus complicating an application of the commonly used Image of Warped Events (IWE). We introduce a new solution to this problem by performing contrast maximization in 3D. The 3D location of the rays cast for each event is smoothly varied as a function of a continuous-time motion parametrization, and the optimal parameters are found by maximizing the contrast in a volumetric ray density field. Our method thus performs joint optimization over motion and structure. The practical validity of our approach is supported by an application to AGV motion estimation and 3D reconstruction with a single vehicle-mounted event camera. The method approaches the performance obtained with regular cameras, and eventually outperforms in challenging visual conditions.
    Maintaining a Reliable World Model using Action-aware Perceptual Anchoring. (arXiv:2107.03038v1 [cs.RO])
    (2 min) Reliable perception is essential for robots that interact with the world. But sensors alone are often insufficient to provide this capability, and they are prone to errors due to various conditions in the environment. Furthermore, there is a need for robots to maintain a model of its surroundings even when objects go out of view and are no longer visible. This requires anchoring perceptual information onto symbols that represent the objects in the environment. In this paper, we present a model for action-aware perceptual anchoring that enables robots to track objects in a persistent manner. Our rule-based approach considers inductive biases to perform high-level reasoning over the results from low-level object detection, and it improves the robot's perceptual capability for complex tasks. We evaluate our model against existing baseline models for object permanence and show that it outperforms these on a snitch localisation task using a dataset of 1,371 videos. We also integrate our action-aware perceptual anchoring in the context of a cognitive architecture and demonstrate its benefits in a realistic gearbox assembly task on a Universal Robot.
    Is 2D Heatmap Representation Even Necessary for Human Pose Estimation?. (arXiv:2107.03332v1 [cs.CV])
    (2 min) The 2D heatmap representation has dominated human pose estimation for years due to its high performance. However, heatmap-based approaches have some drawbacks: 1) The performance drops dramatically in the low-resolution images, which are frequently encountered in real-world scenarios. 2) To improve the localization precision, multiple upsample layers may be needed to recover the feature map resolution from low to high, which are computationally expensive. 3) Extra coordinate refinement is usually necessary to reduce the quantization error of downscaled heatmaps. To address these issues, we propose a \textbf{Sim}ple yet promising \textbf{D}isentangled \textbf{R}epresentation for keypoint coordinate (\emph{SimDR}), reformulating human keypoint localization as a task of classification. In detail, we propose to disentangle the representation of horizontal and vertical coordinates for keypoint location, leading to a more efficient scheme without extra upsampling and refinement. Comprehensive experiments conducted over COCO dataset show that the proposed \emph{heatmap-free} methods outperform \emph{heatmap-based} counterparts in all tested input resolutions, especially in lower resolutions by a large margin. Code will be made publicly available at \url{https://github.com/leeyegy/SimDR}.
    Interventional Video Grounding with Dual Contrastive Learning. (arXiv:2106.11013v2 [cs.CV] UPDATED)
    (2 min) Video grounding aims to localize a moment from an untrimmed video for a given textual query. Existing approaches focus more on the alignment of visual and language stimuli with various likelihood-based matching or regression strategies, i.e., P(Y|X). Consequently, these models may suffer from spurious correlations between the language and video features due to the selection bias of the dataset. 1) To uncover the causality behind the model and data, we first propose a novel paradigm from the perspective of the causal inference, i.e., interventional video grounding (IVG) that leverages backdoor adjustment to deconfound the selection bias based on structured causal model (SCM) and do-calculus P(Y|do(X)). Then, we present a simple yet effective method to approximate the unobserved confounder as it cannot be directly sampled from the dataset. 2) Meanwhile, we introduce a dual contrastive learning approach (DCL) to better align the text and video by maximizing the mutual information (MI) between query and video clips, and the MI between start/end frames of a target moment and the others within a video to learn more informative visual representations. Experiments on three standard benchmarks show the effectiveness of our approaches. Our code is available on GitHub: https://github.com/nanguoshun/IVG.
    Face Hallucination via Split-Attention in Split-Attention Network. (arXiv:2010.11575v3 [cs.CV] UPDATED)
    (2 min) Recently, convolutional neural networks (CNNs) have been widely employed to promote the face hallucination due to the ability to predict high-frequency details from a large number of samples. However, most of them fail to take into account the overall facial profile and fine texture details simultaneously, resulting in reduced naturalness and fidelity of the reconstructed face, and further impairing the performance of downstream tasks (e.g., face detection, facial recognition). To tackle this issue, we propose a novel external-internal split attention group (ESAG), which encompasses two paths responsible for facial structure information and facial texture details, respectively. By fusing the features from these two paths, the consistency of facial structure and the fidelity of facial details are strengthened at the same time. Then, we propose a split-attention in split-attention network (SISN) to reconstruct photorealistic high-resolution facial images by cascading several ESAGs. Experimental results on face hallucination and face recognition unveil that the proposed method not only significantly improves the clarity of hallucinated faces, but also encourages the subsequent face recognition performance substantially. Codes have been released at https://github.com/mdswyz/SISN-Face-Hallucination.
    An automatic multi-tissue human fetal brain segmentation benchmark using the Fetal Tissue Annotation Dataset. (arXiv:2010.15526v4 [eess.IV] UPDATED)
    (2 min) It is critical to quantitatively analyse the developing human fetal brain in order to fully understand neurodevelopment in both normal fetuses and those with congenital disorders. To facilitate this analysis, automatic multi-tissue fetal brain segmentation algorithms are needed, which in turn requires open databases of segmented fetal brains. Here we introduce a publicly available database of 50 manually segmented pathological and non-pathological fetal magnetic resonance brain volume reconstructions across a range of gestational ages (20 to 33 weeks) into 7 different tissue categories (external cerebrospinal fluid, grey matter, white matter, ventricles, cerebellum, deep grey matter, brainstem/spinal cord). In addition, we quantitatively evaluate the accuracy of several automatic multi-tissue segmentation algorithms of the developing human fetal brain. Four research groups participated, submitting a total of 10 algorithms, demonstrating the benefits the database for the development of automatic algorithms.
    Understanding the Security of Deepfake Detection. (arXiv:2107.02045v2 [cs.CR] UPDATED)
    (2 min) Deepfakes pose growing challenges to the trust of information on the Internet. Thus, detecting deepfakes has attracted increasing attentions from both academia and industry. State-of-the-art deepfake detection methods consist of two key components, i.e., face extractor and face classifier, which extract the face region in an image and classify it to be real/fake, respectively. Existing studies mainly focused on improving the detection performance in non-adversarial settings, leaving security of deepfake detection in adversarial settings largely unexplored. In this work, we aim to bridge the gap. In particular, we perform a systematic measurement study to understand the security of the state-of-the-art deepfake detection methods in adversarial settings. We use two large-scale public deepfakes data sources including FaceForensics++ and Facebook Deepfake Detection Challenge, where the deepfakes are fake face images; and we train state-of-the-art deepfake detection methods. These detection methods can achieve 0.94--0.99 accuracies in non-adversarial settings on these datasets. However, our measurement results uncover multiple security limitations of the deepfake detection methods in adversarial settings. First, we find that an attacker can evade a face extractor, i.e., the face extractor fails to extract the correct face regions, via adding small Gaussian noise to its deepfake images. Second, we find that a face classifier trained using deepfakes generated by one method cannot detect deepfakes generated by another method, i.e., an attacker can evade detection via generating deepfakes using a new method. Third, we find that an attacker can leverage backdoor attacks developed by the adversarial machine learning community to evade a face classifier. Our results highlight that deepfake detection should consider the adversarial nature of the problem.
    Detection and Mitigation of Rare Subclasses in Deep Neural Network Classifiers. (arXiv:1911.12780v2 [cs.LG] UPDATED)
    (2 min) Regions of high-dimensional input spaces that are underrepresented in training datasets reduce machine-learnt classifier performance, and may lead to corner cases and unwanted bias for classifiers used in decision making systems. When these regions belong to otherwise well-represented classes, their presence and negative impact are very hard to identify. We propose an approach for the detection and mitigation of such rare subclasses in deep neural network classifiers. The new approach is underpinned by an easy-to-compute commonality metric that supports the detection of rare subclasses, and comprises methods for reducing the impact of these subclasses during both model training and model exploitation. We demonstrate our approach using two well-known datasets, MNIST's handwritten digits and Kaggle's cats/dogs, identifying rare subclasses and producing models which compensate for subclass rarity. In addition we demonstrate how our run-time approach increases the ability of users to identify samples likely to be misclassified at run-time.
    Medical Transformer: Gated Axial-Attention for Medical Image Segmentation. (arXiv:2102.10662v2 [cs.CV] UPDATED)
    (2 min) Over the past decade, Deep Convolutional Neural Networks have been widely adopted for medical image segmentation and shown to achieve adequate performance. However, due to the inherent inductive biases present in the convolutional architectures, they lack understanding of long-range dependencies in the image. Recently proposed Transformer-based architectures that leverage self-attention mechanism encode long-range dependencies and learn representations that are highly expressive. This motivates us to explore Transformer-based solutions and study the feasibility of using Transformer-based network architectures for medical image segmentation tasks. Majority of existing Transformer-based network architectures proposed for vision applications require large-scale datasets to train properly. However, compared to the datasets for vision applications, for medical imaging the number of data samples is relatively low, making it difficult to efficiently train transformers for medical applications. To this end, we propose a Gated Axial-Attention model which extends the existing architectures by introducing an additional control mechanism in the self-attention module. Furthermore, to train the model effectively on medical images, we propose a Local-Global training strategy (LoGo) which further improves the performance. Specifically, we operate on the whole image and patches to learn global and local features, respectively. The proposed Medical Transformer (MedT) is evaluated on three different medical image segmentation datasets and it is shown that it achieves better performance than the convolutional and other related transformer-based architectures. Code: https://github.com/jeya-maria-jose/Medical-Transformer
    Predicting with Confidence on Unseen Distributions. (arXiv:2107.03315v1 [cs.LG])
    (2 min) Recent work has shown that the performance of machine learning models can vary substantially when models are evaluated on data drawn from a distribution that is close to but different from the training distribution. As a result, predicting model performance on unseen distributions is an important challenge. Our work connects techniques from domain adaptation and predictive uncertainty literature, and allows us to predict model accuracy on challenging unseen distributions without access to labeled data. In the context of distribution shift, distributional distances are often used to adapt models and improve their performance on new domains, however accuracy estimation, or other forms of predictive uncertainty, are often neglected in these investigations. Through investigating a wide range of established distributional distances, such as Frechet distance or Maximum Mean Discrepancy, we determine that they fail to induce reliable estimates of performance under distribution shift. On the other hand, we find that the difference of confidences (DoC) of a classifier's predictions successfully estimates the classifier's performance change over a variety of shifts. We specifically investigate the distinction between synthetic and natural distribution shifts and observe that despite its simplicity DoC consistently outperforms other quantifications of distributional difference. $DoC$ reduces predictive error by almost half ($46\%$) on several realistic and challenging distribution shifts, e.g., on the ImageNet-Vid-Robust and ImageNet-Rendition datasets.
    An Alternative Practice of Tropical Convolution to Traditional Convolutional Neural Networks. (arXiv:2103.02096v2 [cs.CV] UPDATED)
    (2 min) Convolutional neural networks (CNNs) have been used in many machine learning fields. In practical applications, the computational cost of convolutional neural networks is often high with the deepening of the network and the growth of data volume, mostly due to a large amount of multiplication operations of floating-point numbers in convolution operations. To reduce the amount of multiplications, we propose a new type of CNNs called Tropical Convolutional Neural Networks (TCNNs) which are built on tropical convolutions in which the multiplications and additions in conventional convolutional layers are replaced by additions and min/max operations respectively. In addition, since tropical convolution operators are essentially nonlinear operators, we expect TCNNs to have higher nonlinear fitting ability than conventional CNNs. In the experiments, we test and analyze several different architectures of TCNNs for image classification tasks in comparison with similar-sized conventional CNNs. The results show that TCNN can achieve higher expressive power than ordinary convolutional layers on the MNIST and CIFAR10 image data set. In different noise environments, there are wins and losses in the robustness of TCNN and ordinary CNNs.
    Multi-VAE: Learning Disentangled View-common and View-peculiar Visual Representations for Multi-view Clustering. (arXiv:2106.11232v2 [cs.CV] UPDATED)
    (2 min) Multi-view clustering, a long-standing and important research problem, focuses on mining complementary information from diverse views. However, existing works often fuse multiple views' representations or handle clustering in a common feature space, which may result in their entanglement especially for visual representations. To address this issue, we present a novel VAE-based multi-view clustering framework (Multi-VAE) by learning disentangled visual representations. Concretely, we define a view-common variable and multiple view-peculiar variables in the generative model. The prior of view-common variable obeys approximately discrete Gumbel Softmax distribution, which is introduced to extract the common cluster factor of multiple views. Meanwhile, the prior of view-peculiar variable follows continuous Gaussian distribution, which is used to represent each view's peculiar visual factors. By controlling the mutual information capacity to disentangle the view-common and view-peculiar representations, continuous visual information of multiple views can be separated so that their common discrete cluster information can be effectively mined. Experimental results demonstrate that Multi-VAE enjoys the disentangled and explainable visual representations, while obtaining superior clustering performance compared with state-of-the-art methods.
    Partial 3D Object Retrieval using Local Binary QUICCI Descriptors and Dissimilarity Tree Indexing. (arXiv:2107.03368v1 [cs.CV])
    (2 min) A complete pipeline is presented for accurate and efficient partial 3D object retrieval based on Quick Intersection Count Change Image (QUICCI) binary local descriptors and a novel indexing tree. It is shown how a modification to the QUICCI query descriptor makes it ideal for partial retrieval. An indexing structure called Dissimilarity Tree is proposed which can significantly accelerate searching the large space of local descriptors; this is applicable to QUICCI and other binary descriptors. The index exploits the distribution of bits within descriptors for efficient retrieval. The retrieval pipeline is tested on the artificial part of SHREC'16 dataset with near-ideal retrieval results.
    Reborn Mechanism: Rethinking the Negative Phase Information Flow in Convolutional Neural Network. (arXiv:2106.07026v2 [cs.CV] UPDATED)
    (2 min) This paper proposes a novel nonlinear activation mechanism typically for convolutional neural network (CNN), named as reborn mechanism. In sharp contrast to ReLU which cuts off the negative phase value, the reborn mechanism enjoys the capacity to reborn and reconstruct dead neurons. Compared to other improved ReLU functions, reborn mechanism introduces a more proper way to utilize the negative phase information. Extensive experiments validate that this activation mechanism is able to enhance the model representation ability more significantly and make the better use of the input data information while maintaining the advantages of the original ReLU function. Moreover, reborn mechanism enables a non-symmetry that is hardly achieved by traditional CNNs and can act as a channel compensation method, offering competitive or even better performance but with fewer learned parameters than traditional methods. Reborn mechanism was tested on various benchmark datasets, all obtaining better performance than previous nonlinear activation functions.
    Monocular Depth Estimation via Listwise Ranking using the Plackett-Luce Model. (arXiv:2010.13118v4 [cs.CV] UPDATED)
    (2 min) In many real-world applications, the relative depth of objects in an image is crucial for scene understanding. Recent approaches mainly tackle the problem of depth prediction in monocular images by treating the problem as a regression task. Yet, being interested in an order relation in the first place, ranking methods suggest themselves as a natural alternative to regression, and indeed, ranking approaches leveraging pairwise comparisons as training information ("object A is closer to the camera than B") have shown promising performance on this problem. In this paper, we elaborate on the use of so-called listwise ranking as a generalization of the pairwise approach. Our method is based on the Plackett-Luce (PL) model, a probability distribution on rankings, which we combine with a state-of-the-art neural network architecture and a simple sampling strategy to reduce training complexity. Moreover, taking advantage of the representation of PL as a random utility model, the proposed predictor offers a natural way to recover (shift-invariant) metric depth information from ranking-only data provided at training time. An empirical evaluation on several benchmark datasets in a "zero-shot" setting demonstrates the effectiveness of our approach compared to existing ranking and regression methods.
    DSANet: Dynamic Segment Aggregation Network for Video-Level Representation Learning. (arXiv:2105.12085v2 [cs.CV] UPDATED)
    (2 min) Long-range and short-range temporal modeling are two complementary and crucial aspects of video recognition. Most of the state-of-the-arts focus on short-range spatio-temporal modeling and then average multiple snippet-level predictions to yield the final video-level prediction. Thus, their video-level prediction does not consider spatio-temporal features of how video evolves along the temporal dimension. In this paper, we introduce a novel Dynamic Segment Aggregation (DSA) module to capture relationship among snippets. To be more specific, we attempt to generate a dynamic kernel for a convolutional operation to aggregate long-range temporal information among adjacent snippets adaptively. The DSA module is an efficient plug-and-play module and can be combined with the off-the-shelf clip-based models (i.e., TSM, I3D) to perform powerful long-range modeling with minimal overhead. The final video architecture, coined as DSANet. We conduct extensive experiments on several video recognition benchmarks (i.e., Mini-Kinetics-200, Kinetics-400, Something-Something V1 and ActivityNet) to show its superiority. Our proposed DSA module is shown to benefit various video recognition models significantly. For example, equipped with DSA modules, the top-1 accuracy of I3D ResNet-50 is improved from 74.9% to 78.2% on Kinetics-400. Codes are available at https://github.com/whwu95/DSANet.
    Anomaly detection and automatic labeling for solar cell quality inspection based on Generative Adversarial Network. (arXiv:2103.03518v2 [cs.CV] UPDATED)
    (3 min) Quality inspection applications in industry are required to move towards a zero-defect manufacturing scenario, withnon-destructive inspection and traceability of 100 % of produced parts. Developing robust fault detection and classification modelsfrom the start-up of the lines is challenging due to the difficulty in getting enough representative samples of the faulty patternsand the need to manually label them. This work presents a methodology to develop a robust inspection system, targeting thesepeculiarities, in the context of solar cell manufacturing. The methodology is divided into two phases: In the first phase, an anomalydetection model based on a Generative Adversarial Network (GAN) is employed. This model enables the detection and localizationof anomalous patterns within the solar cells from the beginning, using only non-defective samples for training and without anymanual labeling involved. In a second stage, as defective samples arise, the detected anomalies will be used as automaticallygenerated annotations for the supervised training of a Fully Convolutional Network that is capable of detecting multiple types offaults. The experimental results using 1873 EL images of monocrystalline cells show that (a) the anomaly detection scheme can beused to start detecting features with very little available data, (b) the anomaly detection may serve as automatic labeling in order totrain a supervised model, and (c) segmentation and classification results of supervised models trained with automatic labels arecomparable to the ones obtained from the models trained with manual labels.
    XPDNet for MRI Reconstruction: an application to the 2020 fastMRI challenge. (arXiv:2010.07290v2 [eess.IV] UPDATED)
    (2 min) We present a new neural network, the XPDNet, for MRI reconstruction from periodically under-sampled multi-coil data. We inform the design of this network by taking best practices from MRI reconstruction and computer vision. We show that this network can achieve state-of-the-art reconstruction results, as shown by its ranking of second in the fastMRI 2020 challenge.
    AGD-Autoencoder: Attention Gated Deep Convolutional Autoencoder for Brain Tumor Segmentation. (arXiv:2107.03323v1 [eess.IV])
    (2 min) Brain tumor segmentation is a challenging problem in medical image analysis. The endpoint is to generate the salient masks that accurately identify brain tumor regions in an fMRI screening. In this paper, we propose a novel attention gate (AG model) for brain tumor segmentation that utilizes both the edge detecting unit and the attention gated network to highlight and segment the salient regions from fMRI images. This feature enables us to eliminate the necessity of having to explicitly point towards the damaged area(external tissue localization) and classify(classification) as per classical computer vision techniques. AGs can easily be integrated within the deep convolutional neural networks(CNNs). Minimal computional overhead is required while the AGs increase the sensitivity scores significantly. We show that the edge detector along with an attention gated mechanism provide a sufficient enough method for brain segmentation reaching an IOU of 0.78
    Bone Surface Reconstruction and Clinical Features Estimation from Sparse Landmarks and Statistical Shape Models: A feasibility study on the femur. (arXiv:2107.03292v1 [eess.IV])
    (2 min) In this study, we investigated a method allowing the determination of the femur bone surface as well as its mechanical axis from some easy-to-identify bony landmarks. The reconstruction of the whole femur is therefore performed from these landmarks using a Statistical Shape Model (SSM). The aim of this research is therefore to assess the impact of the number, the position, and the accuracy of the landmarks for the reconstruction of the femur and the determination of its related mechanical axis, an important clinical parameter to consider for the lower limb analysis. Two statistical femur models were created from our in-house dataset and a publicly available dataset. Both were evaluated in terms of average point-to-point surface distance error and through the mechanical axis of the femur. Furthermore, the clinical impact of using landmarks on the skin in replacement of bony landmarks is investigated. The predicted proximal femurs from bony landmarks were more accurate compared to on-skin landmarks while both had less than 3.5 degrees mechanical axis angle deviation error. The results regarding the non-invasive determination of the mechanical axis are very encouraging and could open very interesting clinical perspectives for the analysis of the lower limb either for orthopedics or functional rehabilitation.
    Hierarchical Semantic Segmentation using Psychometric Learning. (arXiv:2107.03212v1 [cs.CV])
    (2 min) Assigning meaning to parts of image data is the goal of semantic image segmentation. Machine learning methods, specifically supervised learning is commonly used in a variety of tasks formulated as semantic segmentation. One of the major challenges in the supervised learning approaches is expressing and collecting the rich knowledge that experts have with respect to the meaning present in the image data. Towards this, typically a fixed set of labels is specified and experts are tasked with annotating the pixels, patches or segments in the images with the given labels. In general, however, the set of classes does not fully capture the rich semantic information present in the images. For example, in medical imaging such as histology images, the different parts of cells could be grouped and sub-grouped based on the expertise of the pathologist. To achieve such a precise semantic representation of the concepts in the image, we need access to the full depth of knowledge of the annotator. In this work, we develop a novel approach to collect segmentation annotations from experts based on psychometric testing. Our method consists of the psychometric testing procedure, active query selection, query enhancement, and a deep metric learning model to achieve a patch-level image embedding that allows for semantic segmentation of images. We show the merits of our method with evaluation on the synthetically generated image, aerial image and histology image.
    KaFiStO: A Kalman Filtering Framework for Stochastic Optimization. (arXiv:2107.03331v1 [cs.LG])
    (2 min) Optimization is often cast as a deterministic problem, where the solution is found through some iterative procedure such as gradient descent. However, when training neural networks the loss function changes over (iteration) time due to the randomized selection of a subset of the samples. This randomization turns the optimization problem into a stochastic one. We propose to consider the loss as a noisy observation with respect to some reference optimum. This interpretation of the loss allows us to adopt Kalman filtering as an optimizer, as its recursive formulation is designed to estimate unknown parameters from noisy measurements. Moreover, we show that the Kalman Filter dynamical model for the evolution of the unknown parameters can be used to capture the gradient dynamics of advanced methods such as Momentum and Adam. We call this stochastic optimization method KaFiStO. KaFiStO is an easy to implement, scalable, and efficient method to train neural networks. We show that it also yields parameter estimates that are on par with or better than existing optimization algorithms across several neural network architectures and machine learning tasks, such as computer vision and language modeling.
    Trans4Trans: Efficient Transformer for Transparent Object Segmentation to Help Visually Impaired People Navigate in the Real World. (arXiv:2107.03172v1 [cs.CV])
    (2 min) Common fully glazed facades and transparent objects present architectural barriers and impede the mobility of people with low vision or blindness, for instance, a path detected behind a glass door is inaccessible unless it is correctly perceived and reacted. However, segmenting these safety-critical objects is rarely covered by conventional assistive technologies. To tackle this issue, we construct a wearable system with a novel dual-head Transformer for Transparency (Trans4Trans) model, which is capable of segmenting general and transparent objects and performing real-time wayfinding to assist people walking alone more safely. Especially, both decoders created by our proposed Transformer Parsing Module (TPM) enable effective joint learning from different datasets. Besides, the efficient Trans4Trans model composed of symmetric transformer-based encoder and decoder, requires little computational expenses and is readily deployed on portable GPUs. Our Trans4Trans model outperforms state-of-the-art methods on the test sets of Stanford2D3D and Trans10K-v2 datasets and obtains mIoU of 45.13% and 75.14%, respectively. Through various pre-tests and a user study conducted in indoor and outdoor scenarios, the usability and reliability of our assistive system have been extensively verified.
    IntraLoss: Further Margin via Gradient-Enhancing Term for Deep Face Recognition. (arXiv:2107.03352v1 [cs.CV])
    (2 min) Existing classification-based face recognition methods have achieved remarkable progress, introducing large margin into hypersphere manifold to learn discriminative facial representations. However, the feature distribution is ignored. Poor feature distribution will wipe out the performance improvement brought about by margin scheme. Recent studies focus on the unbalanced inter-class distribution and form a equidistributed feature representations by penalizing the angle between identity and its nearest neighbor. But the problem is more than that, we also found the anisotropy of intra-class distribution. In this paper, we propose the `gradient-enhancing term' that concentrates on the distribution characteristics within the class. This method, named IntraLoss, explicitly performs gradient enhancement in the anisotropic region so that the intra-class distribution continues to shrink, resulting in isotropic and more compact intra-class distribution and further margin between identities. The experimental results on LFW, YTF and CFP-FP show that our outperforms state-of-the-art methods by gradient enhancement, demonstrating the superiority of our method. In addition, our method has intuitive geometric interpretation and can be easily combined with existing methods to solve the previously ignored problems.
    Samplets: A new paradigm for data compression. (arXiv:2107.03337v1 [math.NA])
    (2 min) In this article, we introduce the novel concept of samplets by transferring the construction of Tausch-White wavelets to the realm of data. This way we obtain a multilevel representation of discrete data which directly enables data compression, detection of singularities and adaptivity. Applying samplets to represent kernel matrices, as they arise in kernel based learning or Gaussian process regression, we end up with quasi-sparse matrices. By thresholding small entries, these matrices are compressible to O(N log N) relevant entries, where N is the number of data points. This feature allows for the use of fill-in reducing reorderings to obtain a sparse factorization of the compressed matrices. Besides the comprehensive introduction to samplets and their properties, we present extensive numerical studies to benchmark the approach. Our results demonstrate that samplets mark a considerable step in the direction of making large data sets accessible for analysis.
    FBC-GAN: Diverse and Flexible Image Synthesis via Foreground-Background Composition. (arXiv:2107.03166v1 [cs.CV])
    (2 min) Generative Adversarial Networks (GANs) have become the de-facto standard in image synthesis. However, without considering the foreground-background decomposition, existing GANs tend to capture excessive content correlation between foreground and background, thus constraining the diversity in image generation. This paper presents a novel Foreground-Background Composition GAN (FBC-GAN) that performs image generation by generating foreground objects and background scenes concurrently and independently, followed by composing them with style and geometrical consistency. With this explicit design, FBC-GAN can generate images with foregrounds and backgrounds that are mutually independent in contents, thus lifting the undesirably learned content correlation constraint and achieving superior diversity. It also provides excellent flexibility by allowing the same foreground object with different background scenes, the same background scene with varying foreground objects, or the same foreground object and background scene with different object positions, sizes and poses. It can compose foreground objects and background scenes sampled from different datasets as well. Extensive experiments over multiple datasets show that FBC-GAN achieves competitive visual realism and superior diversity as compared with state-of-the-art methods.
    Categorical Relation-Preserving Contrastive Knowledge Distillation for Medical Image Classification. (arXiv:2107.03225v1 [cs.CV])
    (2 min) The amount of medical images for training deep classification models is typically very scarce, making these deep models prone to overfit the training data. Studies showed that knowledge distillation (KD), especially the mean-teacher framework which is more robust to perturbations, can help mitigate the over-fitting effect. However, directly transferring KD from computer vision to medical image classification yields inferior performance as medical images suffer from higher intra-class variance and class imbalance. To address these issues, we propose a novel Categorical Relation-preserving Contrastive Knowledge Distillation (CRCKD) algorithm, which takes the commonly used mean-teacher model as the supervisor. Specifically, we propose a novel Class-guided Contrastive Distillation (CCD) module to pull closer positive image pairs from the same class in the teacher and student models, while pushing apart negative image pairs from different classes. With this regularization, the feature distribution of the student model shows higher intra-class similarity and inter-class variance. Besides, we propose a Categorical Relation Preserving (CRP) loss to distill the teacher's relational knowledge in a robust and class-balanced manner. With the contribution of the CCD and CRP, our CRCKD algorithm can distill the relational knowledge more comprehensively. Extensive experiments on the HAM10000 and APTOS datasets demonstrate the superiority of the proposed CRCKD method.
    Learning Vision Transformer with Squeeze and Excitation for Facial Expression Recognition. (arXiv:2107.03107v1 [cs.CV])
    (2 min) As various databases of facial expressions have been made accessible over the last few decades, the Facial Expression Recognition (FER) task has gotten a lot of interest. The multiple sources of the available databases raised several challenges for facial recognition task. These challenges are usually addressed by Convolution Neural Network (CNN) architectures. Different from CNN models, a Transformer model based on attention mechanism has been presented recently to address vision tasks. One of the major issue with Transformers is the need of a large data for training, while most FER databases are limited compared to other vision applications. Therefore, we propose in this paper to learn a vision Transformer jointly with a Squeeze and Excitation (SE) block for FER task. The proposed method is evaluated on different publicly available FER databases including CK+, JAFFE,RAF-DB and SFEW. Experiments demonstrate that our model outperforms state-of-the-art methods on CK+ and SFEW and achieves competitive results on JAFFE and RAF-DB.
    SpectralFormer: Rethinking Hyperspectral Image Classification with Transformers. (arXiv:2107.02988v1 [cs.CV])
    (2 min) Hyperspectral (HS) images are characterized by approximately contiguous spectral information, enabling the fine identification of materials by capturing subtle spectral discrepancies. Owing to their excellent locally contextual modeling ability, convolutional neural networks (CNNs) have been proven to be a powerful feature extractor in HS image classification. However, CNNs fail to mine and represent the sequence attributes of spectral signatures well due to the limitations of their inherent network backbone. To solve this issue, we rethink HS image classification from a sequential perspective with transformers, and propose a novel backbone network called \ul{SpectralFormer}. Beyond band-wise representations in classic transformers, SpectralFormer is capable of learning spectrally local sequence information from neighboring bands of HS images, yielding group-wise spectral embeddings. More significantly, to reduce the possibility of losing valuable information in the layer-wise propagation process, we devise a cross-layer skip connection to convey memory-like components from shallow to deep layers by adaptively learning to fuse "soft" residuals across layers. It is worth noting that the proposed SpectralFormer is a highly flexible backbone network, which can be applicable to both pixel- and patch-wise inputs. We evaluate the classification performance of the proposed SpectralFormer on three HS datasets by conducting extensive experiments, showing the superiority over classic transformers and achieving a significant improvement in comparison with state-of-the-art backbone networks. The codes of this work will be available at \url{https://sites.google.com/view/danfeng-hong} for the sake of reproducibility.
    Urban Tree Species Classification Using Aerial Imagery. (arXiv:2107.03182v1 [cs.CV])
    (2 min) Urban trees help regulate temperature, reduce energy consumption, improve urban air quality, reduce wind speeds, and mitigating the urban heat island effect. Urban trees also play a key role in climate change mitigation and global warming by capturing and storing atmospheric carbon-dioxide which is the largest contributor to greenhouse gases. Automated tree detection and species classification using aerial imagery can be a powerful tool for sustainable forest and urban tree management. Hence, This study first offers a pipeline for generating labelled dataset of urban trees using Google Map's aerial images and then investigates how state of the art deep Convolutional Neural Network models such as VGG and ResNet handle the classification problem of urban tree aerial images under different parameters. Experimental results show our best model achieves an average accuracy of 60% over 6 tree species.
    Egocentric Videoconferencing. (arXiv:2107.03109v1 [cs.GR])
    (2 min) We introduce a method for egocentric videoconferencing that enables hands-free video calls, for instance by people wearing smart glasses or other mixed-reality devices. Videoconferencing portrays valuable non-verbal communication and face expression cues, but usually requires a front-facing camera. Using a frontal camera in a hands-free setting when a person is on the move is impractical. Even holding a mobile phone camera in the front of the face while sitting for a long duration is not convenient. To overcome these issues, we propose a low-cost wearable egocentric camera setup that can be integrated into smart glasses. Our goal is to mimic a classical video call, and therefore, we transform the egocentric perspective of this camera into a front facing video. To this end, we employ a conditional generative adversarial neural network that learns a transition from the highly distorted egocentric views to frontal views common in videoconferencing. Our approach learns to transfer expression details directly from the egocentric view without using a complex intermediate parametric expressions model, as it is used by related face reenactment methods. We successfully handle subtle expressions, not easily captured by parametric blendshape-based solutions, e.g., tongue movement, eye movements, eye blinking, strong expressions and depth varying movements. To get control over the rigid head movements in the target view, we condition the generator on synthetic renderings of a moving neutral face. This allows us to synthesis results at different head poses. Our technique produces temporally smooth video-realistic renderings in real-time using a video-to-video translation network in conjunction with a temporal discriminator. We demonstrate the improved capabilities of our technique by comparing against related state-of-the art approaches.
    HIDA: Towards Holistic Indoor Understanding for the Visually Impaired via Semantic Instance Segmentation with a Wearable Solid-State LiDAR Sensor. (arXiv:2107.03180v1 [cs.CV])
    (2 min) Independently exploring unknown spaces or finding objects in an indoor environment is a daily but challenging task for visually impaired people. However, common 2D assistive systems lack depth relationships between various objects, resulting in difficulty to obtain accurate spatial layout and relative positions of objects. To tackle these issues, we propose HIDA, a lightweight assistive system based on 3D point cloud instance segmentation with a solid-state LiDAR sensor, for holistic indoor detection and avoidance. Our entire system consists of three hardware components, two interactive functions~(obstacle avoidance and object finding) and a voice user interface. Based on voice guidance, the point cloud from the most recent state of the changing indoor environment is captured through an on-site scanning performed by the user. In addition, we design a point cloud segmentation model with dual lightweight decoders for semantic and offset predictions, which satisfies the efficiency of the whole system. After the 3D instance segmentation, we post-process the segmented point cloud by removing outliers and projecting all points onto a top-view 2D map representation. The system integrates the information above and interacts with users intuitively by acoustic feedback. The proposed 3D instance segmentation model has achieved state-of-the-art performance on ScanNet v2 dataset. Comprehensive field tests with various tasks in a user study verify the usability and effectiveness of our system for assisting visually impaired people in holistic indoor understanding, obstacle avoidance and object search.
    Action Units Recognition Using Improved Pairwise Deep Architecture. (arXiv:2107.03143v1 [cs.CV])
    (2 min) Facial Action Units (AUs) represent a set of facial muscular activities and various combinations of AUs can represent a wide range of emotions. AU recognition is often used in many applications, including marketing, healthcare, education, and so forth. Although a lot of studies have developed various methods to improve recognition accuracy, it still remains a major challenge for AU recognition. In the Affective Behavior Analysis in-the-wild (ABAW) 2020 competition, we proposed a new automatic Action Units (AUs) recognition method using a pairwise deep architecture to derive the Pseudo-Intensities of each AU and then convert them into predicted intensities. This year, we introduced a new technique to last year's framework to further reduce AU recognition errors due to temporary face occlusion such as temporary face occlusion such as face hiding or large face orientation. We obtained a score of 0.65 in the validation data set for this year's competition.
    Mitigating Generation Shifts for Generalized Zero-Shot Learning. (arXiv:2107.03163v1 [cs.CV])
    (2 min) Generalized Zero-Shot Learning (GZSL) is the task of leveraging semantic information (e.g., attributes) to recognize the seen and unseen samples, where unseen classes are not observable during training. It is natural to derive generative models and hallucinate training samples for unseen classes based on the knowledge learned from the seen samples. However, most of these models suffer from the `generation shifts', where the synthesized samples may drift from the real distribution of unseen data. In this paper, we conduct an in-depth analysis on this issue and propose a novel Generation Shifts Mitigating Flow (GSMFlow) framework, which is comprised of multiple conditional affine coupling layers for learning unseen data synthesis efficiently and effectively. In particular, we identify three potential problems that trigger the generation shifts, i.e., semantic inconsistency, variance decay, and structural permutation and address them respectively. First, to reinforce the correlations between the generated samples and the respective attributes, we explicitly embed the semantic information into the transformations in each of the coupling layers. Second, to recover the intrinsic variance of the synthesized unseen features, we introduce a visual perturbation strategy to diversify the intra-class variance of generated data and hereby help adjust the decision boundary of the classifier. Third, to avoid structural permutation in the semantic space, we propose a relative positioning strategy to manipulate the attribute embeddings, guiding which to fully preserve the inter-class geometric structure. Experimental results demonstrate that GSMFlow achieves state-of-the-art recognition performance in both conventional and generalized zero-shot settings. Our code is available at: https://github.com/uqzhichen/GSMFlow
    End-to-End Simultaneous Learning of Single-particle Orientation and 3D Map Reconstruction from Cryo-electron Microscopy Data. (arXiv:2107.02958v1 [eess.IV])
    (2 min) Cryogenic electron microscopy (cryo-EM) provides images from different copies of the same biomolecule in arbitrary orientations. Here, we present an end-to-end unsupervised approach that learns individual particle orientations from cryo-EM data while reconstructing the average 3D map of the biomolecule, starting from a random initialization. The approach relies on an auto-encoder architecture where the latent space is explicitly interpreted as orientations used by the decoder to form an image according to the linear projection model. We evaluate our method on simulated data and show that it is able to reconstruct 3D particle maps from noisy- and CTF-corrupted 2D projection images of unknown particle orientations.
    Blind Image Super-Resolution: A Survey and Beyond. (arXiv:2107.03055v1 [cs.CV])
    (2 min) Blind image super-resolution (SR), aiming to super-resolve low-resolution images with unknown degradation, has attracted increasing attention due to its significance in promoting real-world applications. Many novel and effective solutions have been proposed recently, especially with the powerful deep learning techniques. Despite years of efforts, it still remains as a challenging research problem. This paper serves as a systematic review on recent progress in blind image SR, and proposes a taxonomy to categorize existing methods into three different classes according to their ways of degradation modelling and the data used for solving the SR model. This taxonomy helps summarize and distinguish among existing methods. We hope to provide insights into current research states, as well as to reveal novel research directions worth exploring. In addition, we make a summary on commonly used datasets and previous competitions related to blind image SR. Last but not least, a comparison among different methods is provided with detailed analysis on their merits and demerits using both synthetic and real testing images.
    Plot2Spectra: an Automatic Spectra Extraction Tool. (arXiv:2107.02827v1 [cs.CV])
    (2 min) Different types of spectroscopies, such as X-ray absorption near edge structure (XANES) and Raman spectroscopy, play a very important role in analyzing the characteristics of different materials. In scientific literature, XANES/Raman data are usually plotted in line graphs which is a visually appropriate way to represent the information when the end-user is a human reader. However, such graphs are not conducive to direct programmatic analysis due to the lack of automatic tools. In this paper, we develop a plot digitizer, named Plot2Spectra, to extract data points from spectroscopy graph images in an automatic fashion, which makes it possible for large scale data acquisition and analysis. Specifically, the plot digitizer is a two-stage framework. In the first axis alignment stage, we adopt an anchor-free detector to detect the plot region and then refine the detected bounding boxes with an edge-based constraint to locate the position of two axes. We also apply scene text detector to extract and interpret all tick information below the x-axis. In the second plot data extraction stage, we first employ semantic segmentation to separate pixels belonging to plot lines from the background, and from there, incorporate optical flow constraints to the plot line pixels to assign them to the appropriate line (data instance) they encode. Extensive experiments are conducted to validate the effectiveness of the proposed plot digitizer, which shows that such a tool could help accelerate the discovery and machine learning of materials properties.
    RAM-VO: Less is more in Visual Odometry. (arXiv:2107.02974v1 [cs.RO])
    (2 min) Building vehicles capable of operating without human supervision requires the determination of the agent's pose. Visual Odometry (VO) algorithms estimate the egomotion using only visual changes from the input images. The most recent VO methods implement deep-learning techniques using convolutional neural networks (CNN) extensively, which add a substantial cost when dealing with high-resolution images. Furthermore, in VO tasks, more input data does not mean a better prediction; on the contrary, the architecture may filter out useless information. Therefore, the implementation of computationally efficient and lightweight architectures is essential. In this work, we propose the RAM-VO, an extension of the Recurrent Attention Model (RAM) for visual odometry tasks. RAM-VO improves the visual and temporal representation of information and implements the Proximal Policy Optimization (PPO) algorithm to learn robust policies. The results indicate that RAM-VO can perform regressions with six degrees of freedom from monocular input images using approximately 3 million parameters. In addition, experiments on the KITTI dataset demonstrate that RAM-VO achieves competitive results using only 5.7% of the available visual information.
    Learn to Learn Metric Space for Few-Shot Segmentation of 3D Shapes. (arXiv:2107.02972v1 [cs.CV])
    (2 min) Recent research has seen numerous supervised learning-based methods for 3D shape segmentation and remarkable performance has been achieved on various benchmark datasets. These supervised methods require a large amount of annotated data to train deep neural networks to ensure the generalization ability on the unseen test set. In this paper, we introduce a meta-learning-based method for few-shot 3D shape segmentation where only a few labeled samples are provided for the unseen classes. To achieve this, we treat the shape segmentation as a point labeling problem in the metric space. Specifically, we first design a meta-metric learner to transform input shapes into embedding space and our model learns to learn a proper metric space for each object class based on point embeddings. Then, for each class, we design a metric learner to extract part-specific prototype representations from a few support shapes and our model performs per-point segmentation over the query shapes by matching each point to its nearest prototype in the learned metric space. A metric-based loss function is used to dynamically modify distances between point embeddings thus maximizes in-part similarity while minimizing inter-part similarity. A dual segmentation branch is adopted to make full use of the support information and implicitly encourages consistency between the support and query prototypes. We demonstrate the superior performance of our proposed on the ShapeNet part dataset under the few-shot scenario, compared with well-established baseline and state-of-the-art semi-supervised methods.
    Controlled Caption Generation for Images Through Adversarial Attacks. (arXiv:2107.03050v1 [cs.CV])
    (2 min) Deep learning is found to be vulnerable to adversarial examples. However, its adversarial susceptibility in image caption generation is under-explored. We study adversarial examples for vision and language models, which typically adopt an encoder-decoder framework consisting of two major components: a Convolutional Neural Network (i.e., CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation. In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network. The existing methods either attack the classification layer of the visual encoder or they back-propagate the gradients from the language model. In contrast, we propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN such that the resulting deep features of the input image enable a controlled incorrect caption generation through the recurrent network. Our contribution provides new insights for understanding adversarial attacks on vision systems with language component. The proposed method employs two strategies for a comprehensive evaluation. The first examines if a neural image captioning system can be misled to output targeted image captions. The second analyzes the possibility of keywords into the predicted captions. Experiments show that our algorithm can craft effective adversarial images based on the CNN hidden layers to fool captioning framework. Moreover, we discover the proposed attack to be highly transferable. Our work leads to new robustness implications for neural image captioning.
    GAN-based Data Augmentation for Chest X-ray Classification. (arXiv:2107.02970v1 [eess.IV])
    (2 min) A common problem in computer vision -- particularly in medical applications -- is a lack of sufficiently diverse, large sets of training data. These datasets often suffer from severe class imbalance. As a result, networks often overfit and are unable to generalize to novel examples. Generative Adversarial Networks (GANs) offer a novel method of synthetic data augmentation. In this work, we evaluate the use of GAN- based data augmentation to artificially expand the CheXpert dataset of chest radiographs. We compare performance to traditional augmentation and find that GAN-based augmentation leads to higher downstream performance for underrepresented classes. Furthermore, we see that this result is pronounced in low data regimens. This suggests that GAN-based augmentation a promising area of research to improve network performance when data collection is prohibitively expensive.
    Learning Stixel-based Instance Segmentation. (arXiv:2107.03070v1 [cs.CV])
    (2 min) Stixels have been successfully applied to a wide range of vision tasks in autonomous driving, recently including instance segmentation. However, due to their sparse occurrence in the image, until now Stixels seldomly served as input for Deep Learning algorithms, restricting their utility for such approaches. In this work we present StixelPointNet, a novel method to perform fast instance segmentation directly on Stixels. By regarding the Stixel representation as unstructured data similar to point clouds, architectures like PointNet are able to learn features from Stixels. We use a bounding box detector to propose candidate instances, for which the relevant Stixels are extracted from the input image. On these Stixels, a PointNet models learns binary segmentations, which we then unify throughout the whole image in a final selection step. StixelPointNet achieves state-of-the-art performance on Stixel-level, is considerably faster than pixel-based segmentation methods, and shows that with our approach the Stixel domain can be introduced to many new 3D Deep Learning tasks.
    Edge-aware Bidirectional Diffusion for Dense Depth Estimation from Light Fields. (arXiv:2107.02967v1 [cs.CV])
    (2 min) We present an algorithm to estimate fast and accurate depth maps from light fields via a sparse set of depth edges and gradients. Our proposed approach is based around the idea that true depth edges are more sensitive than texture edges to local constraints, and so they can be reliably disambiguated through a bidirectional diffusion process. First, we use epipolar-plane images to estimate sub-pixel disparity at a sparse set of pixels. To find sparse points efficiently, we propose an entropy-based refinement approach to a line estimate from a limited set of oriented filter banks. Next, to estimate the diffusion direction away from sparse points, we optimize constraints at these points via our bidirectional diffusion method. This resolves the ambiguity of which surface the edge belongs to and reliably separates depth from texture edges, allowing us to diffuse the sparse set in a depth-edge and occlusion-aware manner to obtain accurate dense depth maps.
    Transformer Network for Significant Stenosis Detection in CCTA of Coronary Arteries. (arXiv:2107.03035v1 [eess.IV])
    (2 min) Coronary artery disease (CAD) has posed a leading threat to the lives of cardiovascular disease patients worldwide for a long time. Therefore, automated diagnosis of CAD has indispensable significance in clinical medicine. However, the complexity of coronary artery plaques that cause CAD makes the automatic detection of coronary artery stenosis in Coronary CT angiography (CCTA) a difficult task. In this paper, we propose a Transformer network (TR-Net) for the automatic detection of significant stenosis (i.e. luminal narrowing > 50%) while practically completing the computer-assisted diagnosis of CAD. The proposed TR-Net introduces a novel Transformer, and tightly combines convolutional layers and Transformer encoders, allowing their advantages to be demonstrated in the task. By analyzing semantic information sequences, TR-Net can fully understand the relationship between image information in each position of a multiplanar reformatted (MPR) image, and accurately detect significant stenosis based on both local and global information. We evaluate our TR-Net on a dataset of 76 patients from different patients annotated by experienced radiologists. Experimental results illustrate that our TR-Net has achieved better results in ACC (0.92), Spec (0.96), PPV (0.84), F1 (0.79) and MCC (0.74) indicators compared with the state-of-the-art methods. The source code is publicly available from the link (https://github.com/XinghuaMa/TR-Net).
    Deep Convolutional Correlation Iterative Particle Filter for Visual Tracking. (arXiv:2107.02984v1 [cs.CV])
    (2 min) This work proposes a novel framework for visual tracking based on the integration of an iterative particle filter, a deep convolutional neural network, and a correlation filter. The iterative particle filter enables the particles to correct themselves and converge to the correct target position. We employ a novel strategy to assess the likelihood of the particles after the iterations by applying K-means clustering. Our approach ensures a consistent support for the posterior distribution. Thus, we do not need to perform resampling at every video frame, improving the utilization of prior distribution information. Experimental results on two different benchmark datasets show that our tracker performs favorably against state-of-the-art methods.
    Bi-level Feature Alignment for Versatile Image Translation and Manipulation. (arXiv:2107.03021v1 [cs.CV])
    (2 min) Generative adversarial networks (GANs) have achieved great success in image translation and manipulation. However, high-fidelity image generation with faithful style control remains a grand challenge in computer vision. This paper presents a versatile image translation and manipulation framework that achieves accurate semantic and style guidance in image generation by explicitly building a correspondence. To handle the quadratic complexity incurred by building the dense correspondences, we introduce a bi-level feature alignment strategy that adopts a top-$k$ operation to rank block-wise features followed by dense attention between block features which reduces memory cost substantially. As the top-$k$ operation involves index swapping which precludes the gradient propagation, we propose to approximate the non-differentiable top-$k$ operation with a regularized earth mover's problem so that its gradient can be effectively back-propagated. In addition, we design a novel semantic position encoding mechanism that builds up coordinate for each individual semantic region to preserve texture structures while building correspondences. Further, we design a novel confidence feature injection module which mitigates mismatch problem by fusing features adaptively according to the reliability of built correspondences. Extensive experiments show that our method achieves superior performance qualitatively and quantitatively as compared with the state-of-the-art. The code is available at \href{https://github.com/fnzhan/RABIT}{https://github.com/fnzhan/RABIT}.
    Poly-NL: Linear Complexity Non-local Layers with Polynomials. (arXiv:2107.02859v1 [cs.CV])
    (2 min) Spatial self-attention layers, in the form of Non-Local blocks, introduce long-range dependencies in Convolutional Neural Networks by computing pairwise similarities among all possible positions. Such pairwise functions underpin the effectiveness of non-local layers, but also determine a complexity that scales quadratically with respect to the input size both in space and time. This is a severely limiting factor that practically hinders the applicability of non-local blocks to even moderately sized inputs. Previous works focused on reducing the complexity by modifying the underlying matrix operations, however in this work we aim to retain full expressiveness of non-local layers while keeping complexity linear. We overcome the efficiency limitation of non-local blocks by framing them as special cases of 3rd order polynomial functions. This fact enables us to formulate novel fast Non-Local blocks, capable of reducing the complexity from quadratic to linear with no loss in performance, by replacing any direct computation of pairwise similarities with element-wise multiplications. The proposed method, which we dub as "Poly-NL", is competitive with state-of-the-art performance across image recognition, instance segmentation, and face detection tasks, while having considerably less computational overhead.
    Learning Invariant Representation with Consistency and Diversity for Semi-supervised Source Hypothesis Transfer. (arXiv:2107.03008v1 [cs.CV])
    (2 min) Semi-supervised domain adaptation (SSDA) aims to solve tasks in target domain by utilizing transferable information learned from the available source domain and a few labeled target data. However, source data is not always accessible in practical scenarios, which restricts the application of SSDA in real world circumstances. In this paper, we propose a novel task named Semi-supervised Source Hypothesis Transfer (SSHT), which performs domain adaptation based on source trained model, to generalize well in target domain with a few supervisions. In SSHT, we are facing two challenges: (1) The insufficient labeled target data may result in target features near the decision boundary, with the increased risk of mis-classification; (2) The data are usually imbalanced in source domain, so the model trained with these data is biased. The biased model is prone to categorize samples of minority categories into majority ones, resulting in low prediction diversity. To tackle the above issues, we propose Consistency and Diversity Learning (CDL), a simple but effective framework for SSHT by facilitating prediction consistency between two randomly augmented unlabeled data and maintaining the prediction diversity when adapting model to target domain. Encouraging consistency regularization brings difficulty to memorize the few labeled target data and thus enhances the generalization ability of the learned model. We further integrate Batch Nuclear-norm Maximization into our method to enhance the discriminability and diversity. Experimental results show that our method outperforms existing SSDA methods and unsupervised model adaptation methods on DomainNet, Office-Home and Office-31 datasets. The code is available at https://github.com/Wang-xd1899/SSHT.
    VIN: Voxel-based Implicit Network for Joint 3D Object Detection and Segmentation for Lidars. (arXiv:2107.02980v1 [cs.CV])
    (2 min) A unified neural network structure is presented for joint 3D object detection and point cloud segmentation in this paper. We leverage rich supervision from both detection and segmentation labels rather than using just one of them. In addition, an extension based on single-stage object detectors is proposed based on the implicit function widely used in 3D scene and object understanding. The extension branch takes the final feature map from the object detection module as input, and produces an implicit function that generates semantic distribution for each point for its corresponding voxel center. We demonstrated the performance of our structure on nuScenes-lidarseg, a large-scale outdoor dataset. Our solution achieves competitive results against state-of-the-art methods in both 3D object detection and point cloud segmentation with little additional computation load compared with object detection solutions. The capability of efficient weakly supervision semantic segmentation of the proposed method is also validated by experiments.
    GLiT: Neural Architecture Search for Global and Local Image Transformer. (arXiv:2107.02960v1 [cs.CV])
    (2 min) We introduce the first Neural Architecture Search (NAS) method to find a better transformer architecture for image recognition. Recently, transformers without CNN-based backbones are found to achieve impressive performance for image recognition. However, the transformer is designed for NLP tasks and thus could be sub-optimal when directly used for image recognition. In order to improve the visual representation ability for transformers, we propose a new search space and searching algorithm. Specifically, we introduce a locality module that models the local correlations in images explicitly with fewer computational cost. With the locality module, our search space is defined to let the search algorithm freely trade off between global and local information as well as optimizing the low-level design choice in each module. To tackle the problem caused by huge search space, a hierarchical neural architecture search method is proposed to search the optimal vision transformer from two levels separately with the evolutionary algorithm. Extensive experiments on the ImageNet dataset demonstrate that our method can find more discriminative and efficient transformer variants than the ResNet family (e.g., ResNet101) and the baseline ViT for image classification.
    Structured Denoising Diffusion Models in Discrete State-Spaces. (arXiv:2107.03006v1 [cs.LG])
    (2 min) Denoising diffusion probabilistic models (DDPMs) (Ho et al. 2020) have shown impressive results on image and waveform generation in continuous state spaces. Here, we introduce Discrete Denoising Diffusion Probabilistic Models (D3PMs), diffusion-like generative models for discrete data that generalize the multinomial diffusion model of Hoogeboom et al. 2021, by going beyond corruption processes with uniform transition probabilities. This includes corruption with transition matrices that mimic Gaussian kernels in continuous space, matrices based on nearest neighbors in embedding space, and matrices that introduce absorbing states. The third allows us to draw a connection between diffusion models and autoregressive and mask-based generative models. We show that the choice of transition matrix is an important design decision that leads to improved results in image and text domains. We also introduce a new loss function that combines the variational lower bound with an auxiliary cross entropy loss. For text, this model class achieves strong results on character-level text generation while scaling to large vocabularies on LM1B. On the image dataset CIFAR-10, our models approach the sample quality and exceed the log-likelihood of the continuous-space DDPM model.
    E-PixelHop: An Enhanced PixelHop Method for Object Classification. (arXiv:2107.02966v1 [cs.CV])
    (2 min) Based on PixelHop and PixelHop++, which are recently developed using the successive subspace learning (SSL) framework, we propose an enhanced solution for object classification, called E-PixelHop, in this work. E-PixelHop consists of the following steps. First, to decouple the color channels for a color image, we apply principle component analysis and project RGB three color channels onto two principle subspaces which are processed separately for classification. Second, to address the importance of multi-scale features, we conduct pixel-level classification at each hop with various receptive fields. Third, to further improve pixel-level classification accuracy, we develop a supervised label smoothing (SLS) scheme to ensure prediction consistency. Forth, pixel-level decisions from each hop and from each color subspace are fused together for image-level decision. Fifth, to resolve confusing classes for further performance boosting, we formulate E-PixelHop as a two-stage pipeline. In the first stage, multi-class classification is performed to get a soft decision for each class, where the top 2 classes with the highest probabilities are called confusing classes. Then,we conduct a binary classification in the second stage. The main contributions lie in Steps 1, 3 and 5.We use the classification of the CIFAR-10 dataset as an example to demonstrate the effectiveness of the above-mentioned key components of E-PixelHop.
    Multi-modal Affect Analysis using standardized data within subjects in the Wild. (arXiv:2107.03009v1 [cs.CV])
    (2 min) Human affective recognition is an important factor in human-computer interaction. However, the method development with in-the-wild data is not yet accurate enough for practical usage. In this paper, we introduce the affective recognition method focusing on facial expression (EXP) and valence-arousal calculation that was submitted to the Affective Behavior Analysis in-the-wild (ABAW) 2021 Contest. When annotating facial expressions from a video, we thought that it would be judged not only from the features common to all people, but also from the relative changes in the time series of individuals. Therefore, after learning the common features for each frame, we constructed a facial expression estimation model and valence-arousal model using time-series data after combining the common features and the standardized features for each video. Furthermore, the above features were learned using multi-modal data such as image features, AU, Head pose, and Gaze. In the validation set, our model achieved a facial expression score of 0.546. These verification results reveal that our proposed framework can improve estimation accuracy and robustness effectively.
    Disentangle Your Dense Object Detector. (arXiv:2107.02963v1 [cs.CV])
    (2 min) Deep learning-based dense object detectors have achieved great success in the past few years and have been applied to numerous multimedia applications such as video understanding. However, the current training pipeline for dense detectors is compromised to lots of conjunctions that may not hold. In this paper, we investigate three such important conjunctions: 1) only samples assigned as positive in classification head are used to train the regression head; 2) classification and regression share the same input feature and computational fields defined by the parallel head architecture; and 3) samples distributed in different feature pyramid layers are treated equally when computing the loss. We first carry out a series of pilot experiments to show disentangling such conjunctions can lead to persistent performance improvement. Then, based on these findings, we propose Disentangled Dense Object Detector (DDOD), in which simple and effective disentanglement mechanisms are designed and integrated into the current state-of-the-art dense object detectors. Extensive experiments on MS COCO benchmark show that our approach can lead to 2.0 mAP, 2.4 mAP and 2.2 mAP absolute improvements on RetinaNet, FCOS, and ATSS baselines with negligible extra overhead. Notably, our best model reaches 55.0 mAP on the COCO test-dev set and 93.5 AP on the hard subset of WIDER FACE, achieving new state-of-the-art performance on these two competitive benchmarks. Code is available at https://github.com/zehuichen123/DDOD.
    Group Sampling for Unsupervised Person Re-identification. (arXiv:2107.03024v1 [cs.CV])
    (2 min) Unsupervised person re-identification (re-ID) remains a challenging task, where the classifier and feature representation could be easily misled by the noisy pseudo labels towards deteriorated over-fitting. In this paper, we propose a simple yet effective approach, termed Group Sampling, to alleviate the negative impact of noisy pseudo labels within unsupervised person re-ID models. The idea behind Group Sampling is that it can gather a group of samples from the same class in the same mini-batch, such that the model is trained upon group normalized samples while alleviating the effect of a single sample. Group sampling updates the pipeline of pseudo label generation by guaranteeing the samples to be better divided into the correct classes. Group Sampling regularizes classifier training and representation learning, leading to the statistical stability of feature representation in a progressive fashion. Qualitative and quantitative experiments on Market-1501, DukeMTMC-reID, and MSMT17 show that Grouping Sampling improves the state-of-the-arts by up to 2.2%~6.1%. Code is available at https://github.com/wavinflaghxm/GroupSampling.
    Deep Mesh Prior: Unsupervised Mesh Restoration using Graph Convolutional Networks. (arXiv:2107.02909v1 [cs.CV])
    (2 min) This paper addresses mesh restoration problems, i.e., denoising and completion, by learning self-similarity in an unsupervised manner. For this purpose, the proposed method, which we refer to as Deep Mesh Prior, uses a graph convolutional network on meshes to learn the self-similarity. The network takes a single incomplete mesh as input data and directly outputs the reconstructed mesh without being trained using large-scale datasets. Our method does not use any intermediate representations such as an implicit field because the whole process works on a mesh. We demonstrate that our unsupervised method performs equally well or even better than the state-of-the-art methods using large-scale datasets.
    Image Complexity Guided Network Compression for Biomedical Image Segmentation. (arXiv:2107.02927v1 [eess.IV])
    (2 min) Compression is a standard procedure for making convolutional neural networks (CNNs) adhere to some specific computing resource constraints. However, searching for a compressed architecture typically involves a series of time-consuming training/validation experiments to determine a good compromise between network size and performance accuracy. To address this, we propose an image complexity-guided network compression technique for biomedical image segmentation. Given any resource constraints, our framework utilizes data complexity and network architecture to quickly estimate a compressed model which does not require network training. Specifically, we map the dataset complexity to the target network accuracy degradation caused by compression. Such mapping enables us to predict the final accuracy for different network sizes, based on the computed dataset complexity. Thus, one may choose a solution that meets both the network size and segmentation accuracy requirements. Finally, the mapping is used to determine the convolutional layer-wise multiplicative factor for generating a compressed network. We conduct experiments using 5 datasets, employing 3 commonly-used CNN architectures for biomedical image segmentation as representative networks. Our proposed framework is shown to be effective for generating compressed segmentation networks, retaining up to $\approx 95\%$ of the full-sized network segmentation accuracy, and at the same time, utilizing $\approx 32x$ fewer network trainable weights (average reduction) of the full-sized networks.
  • cs.IR updates on arXiv.org

    AutoDebias: Learning to Debias for Recommendation. (arXiv:2105.04170v2 [cs.LG] UPDATED)
    (2 min) Recommender systems rely on user behavior data like ratings and clicks to build personalization model. However, the collected data is observational rather than experimental, causing various biases in the data which significantly affect the learned model. Most existing work for recommendation debiasing, such as the inverse propensity scoring and imputation approaches, focuses on one or two specific biases, lacking the universal capacity that can account for mixed or even unknown biases in the data. Towards this research gap, we first analyze the origin of biases from the perspective of \textit{risk discrepancy} that represents the difference between the expectation empirical risk and the true risk. Remarkably, we derive a general learning framework that well summarizes most existing debiasing strategies by specifying some parameters of the general framework. This provides a valuable opportunity to develop a universal solution for debiasing, e.g., by learning the debiasing parameters from data. However, the training data lacks important signal of how the data is biased and what the unbiased data looks like. To move this idea forward, we propose \textit{AotoDebias} that leverages another (small) set of uniform data to optimize the debiasing parameters by solving the bi-level optimization problem with meta-learning. Through theoretical analyses, we derive the generalization bound for AutoDebias and prove its ability to acquire the appropriate debiasing strategy. Extensive experiments on two real datasets and a simulated dataset demonstrated effectiveness of AutoDebias. The code is available at \url{https://github.com/DongHande/AutoDebias}.
    How Big Are Peoples' Computer Files? File Size Distributions Among User-managed Collections. (arXiv:2107.03272v1 [cs.HC])
    (2 min) Improving file management interfaces and optimising system performance requires current data about users' digital collections and particularly about the file size distributions of such collections. However, prior works have examined only the sizes of system files and users' work files in varied contexts, and there has been no such study since 2013; it therefore remains unclear how today's file sizes are distributed, particularly personal files, and further if distributions differ among the major operating systems or common occupations. Here we examine such differences among 49 million files in 348 user collections. We find that the average file size has grown more than ten-fold since the mid-2000s, though most files are still under 8 MB, and that there are demographic and technological influences in the size distributions. We discuss the implications for user interfaces, system optimisation, and PIM research.
    Graphing else matters: exploiting aspect opinions and ratings in explainable graph-based recommendations. (arXiv:2107.03226v1 [cs.IR])
    (2 min) The success of neural network embeddings has entailed a renewed interest in using knowledge graphs for a wide variety of machine learning and information retrieval tasks. In particular, current recommendation methods based on graph embeddings have shown state-of-the-art performance. These methods commonly encode latent rating patterns and content features. Different from previous work, in this paper, we propose to exploit embeddings extracted from graphs that combine information from ratings and aspect-based opinions expressed in textual reviews. We then adapt and evaluate state-of-the-art graph embedding techniques over graphs generated from Amazon and Yelp reviews on six domains, outperforming baseline recommenders. Our approach has the advantage of providing explanations which leverage aspect-based opinions given by users about recommended items. Furthermore, we also provide examples of the applicability of recommendations utilizing aspect opinions as explanations in a visualization dashboard, which allows obtaining information about the most and least liked aspects of similar users obtained from the embeddings of an input graph.
    Computationally Efficient Optimization of Plackett-Luce Ranking Models for Relevance and Fairness. (arXiv:2105.00855v2 [cs.IR] UPDATED)
    (2 min) Recent work has proposed stochastic Plackett-Luce (PL) ranking models as a robust choice for optimizing relevance and fairness metrics. Unlike their deterministic counterparts that require heuristic optimization algorithms, PL models are fully differentiable. Theoretically, they can be used to optimize ranking metrics via stochastic gradient descent. However, in practice, the computation of the gradient is infeasible because it requires one to iterate over all possible permutations of items. Consequently, actual applications rely on approximating the gradient via sampling techniques. In this paper, we introduce a novel algorithm: PL-Rank, that estimates the gradient of a PL ranking model w.r.t. both relevance and fairness metrics. Unlike existing approaches that are based on policy gradients, PL-Rank makes use of the specific structure of PL models and ranking metrics. Our experimental analysis shows that PL-Rank has a greater sample-efficiency and is computationally less costly than existing policy gradients, resulting in faster convergence at higher performance. PL-Rank further enables the industry to apply PL models for more relevant and fairer real-world ranking systems.
    "Are you sure?": Preliminary Insights from Scaling Product Comparisons to Multiple Shops. (arXiv:2107.03256v1 [cs.IR])
    (2 min) Large eCommerce players introduced comparison tables as a new type of recommendations. However, building comparisons at scale without pre-existing training/taxonomy data remains an open challenge, especially within the operational constraints of shops in the long tail. We present preliminary results from building a comparison pipeline designed to scale in a multi-shop scenario: we describe our design choices and run extensive benchmarks on multiple shops to stress-test it. Finally, we run a small user study on property selection and conclude by discussing potential improvements and highlighting the questions that remain to be addressed.
    SelfCF: A Simple Framework for Self-supervised Collaborative Filtering. (arXiv:2107.03019v1 [cs.IR])
    (2 min) Collaborative filtering (CF) is widely used to learn an informative latent representation of a user or item from observed interactions. Existing CF-based methods commonly adopt negative sampling to discriminate different items. That is, observed user-item pairs are treated as positive instances; unobserved pairs are considered as negative instances and are sampled under a defined distribution for training. Training with negative sampling on large datasets is computationally expensive. Further, negative items should be carefully sampled under the defined distribution, in order to avoid selecting an observed positive item in the training dataset. Unavoidably, some negative items sampled from the training dataset could be positive in the test set. Recently, self-supervised learning (SSL) has emerged as a powerful tool to learn a model without negative samples. In this paper, we propose a self-supervised collaborative filtering framework (SelfCF), that is specially designed for recommender scenario with implicit feedback. The main idea of SelfCF is to augment the output embeddings generated by backbone networks, because it is infeasible to augment raw input of user/item ids. We propose and study three output perturbation techniques that can be applied to different types of backbone networks including both traditional CF models and graph-based models. By encapsulating two popular recommendation models into the framework, our experiments on three datasets show that the best performance of our framework is comparable or better than the supervised counterpart. We also show that SelfCF can boost up the performance by up to 8.93\% on average, compared with another self-supervised framework as the baseline. Source codes are available at: https://github.com/enoche/SelfCF.
  • cs.LG updates on arXiv.org

    ICDAR 2021 Competition on Components Segmentation Task of Document Photos. (arXiv:2106.08499v1 [cs.CV] CROSS LISTED)
    (2 min) This paper describes the short-term competition on Components Segmentation Task of Document Photos that was prepared in the context of the 16th International Conference on Document Analysis and Recognition (ICDAR 2021). This competition aims to bring together researchers working on the filed of identification document image processing and provides them a suitable benchmark to compare their techniques on the component segmentation task of document images. Three challenge tasks were proposed entailing different segmentation assignments to be performed on a provided dataset. The collected data are from several types of Brazilian ID documents, whose personal information was conveniently replaced. There were 16 participants whose results obtained for some or all the three tasks show different rates for the adopted metrics, like Dice Similarity Coefficient ranging from 0.06 to 0.99. Different Deep Learning models were applied by the entrants with diverse strategies to achieve the best results in each of the tasks. Obtained results show that the current applied methods for solving one of the proposed tasks (document boundary detection) are already well stablished. However, for the other two challenge tasks (text zone and handwritten sign detection) research and development of more robust approaches are still required to achieve acceptable results.
    Simple Agent, Complex Environment: Efficient Reinforcement Learning with Agent States. (arXiv:2102.05261v6 [cs.LG] UPDATED)
    (2 min) We design a simple reinforcement learning (RL) agent that implements an optimistic version of $Q$-learning and establish through regret analysis that this agent can operate with some level of competence in any environment. While we leverage concepts from the literature on provably efficient RL, we consider a general agent-environment interface and provide a novel agent design and analysis. This level of generality positions our results to inform the design of future agents for operation in complex real environments. We establish that, as time progresses, our agent performs competitively relative to policies that require longer times to evaluate. The time it takes to approach asymptotic performance is polynomial in the complexity of the agent's state representation and the time required to evaluate the best policy that the agent can represent. Notably, there is no dependence on the complexity of the environment. The ultimate per-period performance loss of the agent is bounded by a constant multiple of a measure of distortion introduced by the agent's state representation. This work is the first to establish that an algorithm approaches this asymptotic condition within a tractable time frame.
    Approximate Multi-Agent Fitted Q Iteration. (arXiv:2104.09343v2 [cs.LG] UPDATED)
    (2 min) We formulate an efficient approximation for multi-agent batch reinforcement learning, the approximate multi-agent fitted Q iteration (AMAFQI). We present a detailed derivation of our approach. We propose an iterative policy search and show that it yields a greedy policy with respect to multiple approximations of the centralized, standard Q-function. In each iteration and policy evaluation, AMAFQI requires a number of computations that scales linearly with the number of agents whereas the analogous number of computations increase exponentially for the fitted Q iteration (FQI), one of the most commonly used approaches in batch reinforcement learning. This property of AMAFQI is fundamental for the design of a tractable multi-agent approach. We evaluate the performance of AMAFQI and compare it to FQI in numerical simulations. Numerical examples illustrate the significant computation time reduction when using AMAFQI instead of FQI in multi-agent problems and corroborate the similar decision-making performance of both approaches.
    Big Data Information and Nowcasting: Consumption and Investment from Bank Transactions in Turkey. (arXiv:2107.03299v1 [econ.EM])
    (2 min) We use the aggregate information from individual-to-firm and firm-to-firm in Garanti BBVA Bank transactions to mimic domestic private demand. Particularly, we replicate the quarterly national accounts aggregate consumption and investment (gross fixed capital formation) and its bigger components (Machinery and Equipment and Construction) in real time for the case of Turkey. In order to validate the usefulness of the information derived from these indicators we test the nowcasting ability of both indicators to nowcast the Turkish GDP using different nowcasting models. The results are successful and confirm the usefulness of Consumption and Investment Banking transactions for nowcasting purposes. The value of the Big data information is more relevant at the beginning of the nowcasting process, when the traditional hard data information is scarce. This makes this information specially relevant for those countries where statistical release lags are longer like the Emerging Markets.
    Urban Tree Species Classification Using Aerial Imagery. (arXiv:2107.03182v1 [cs.CV])
    (2 min) Urban trees help regulate temperature, reduce energy consumption, improve urban air quality, reduce wind speeds, and mitigating the urban heat island effect. Urban trees also play a key role in climate change mitigation and global warming by capturing and storing atmospheric carbon-dioxide which is the largest contributor to greenhouse gases. Automated tree detection and species classification using aerial imagery can be a powerful tool for sustainable forest and urban tree management. Hence, This study first offers a pipeline for generating labelled dataset of urban trees using Google Map's aerial images and then investigates how state of the art deep Convolutional Neural Network models such as VGG and ResNet handle the classification problem of urban tree aerial images under different parameters. Experimental results show our best model achieves an average accuracy of 60% over 6 tree species.
    Multi-VAE: Learning Disentangled View-common and View-peculiar Visual Representations for Multi-view Clustering. (arXiv:2106.11232v2 [cs.CV] UPDATED)
    (2 min) Multi-view clustering, a long-standing and important research problem, focuses on mining complementary information from diverse views. However, existing works often fuse multiple views' representations or handle clustering in a common feature space, which may result in their entanglement especially for visual representations. To address this issue, we present a novel VAE-based multi-view clustering framework (Multi-VAE) by learning disentangled visual representations. Concretely, we define a view-common variable and multiple view-peculiar variables in the generative model. The prior of view-common variable obeys approximately discrete Gumbel Softmax distribution, which is introduced to extract the common cluster factor of multiple views. Meanwhile, the prior of view-peculiar variable follows continuous Gaussian distribution, which is used to represent each view's peculiar visual factors. By controlling the mutual information capacity to disentangle the view-common and view-peculiar representations, continuous visual information of multiple views can be separated so that their common discrete cluster information can be effectively mined. Experimental results demonstrate that Multi-VAE enjoys the disentangled and explainable visual representations, while obtaining superior clustering performance compared with state-of-the-art methods.
    GRIMGEP: Learning Progress for Robust Goal Sampling in Visual Deep Reinforcement Learning. (arXiv:2008.04388v2 [cs.LG] UPDATED)
    (2 min) Designing agents, capable of learning autonomously a wide range of skills is critical in order to increase the scope of reinforcement learning. It will both increase the diversity of learned skills and reduce the burden of manually designing reward functions for each skill. Self-supervised agents, setting their own goals, and trying to maximize the diversity of those goals have shown great promise towards this end. However, a currently known limitation of agents trying to maximize the diversity of sampled goals is that they tend to get attracted to noise or more generally to parts of the environments that cannot be controlled (distractors). When agents have access to predefined goal features or expert knowledge, absolute Learning Progress (ALP) provides a way to distinguish between regions that can be controlled and those that cannot. However, those methods often fall short when the agents are only provided with raw sensory inputs such as images. In this work we extend those concepts to unsupervised image-based goal exploration. We propose a framework that allows agents to autonomously identify and ignore noisy distracting regions while searching for novelty in the learnable regions to both improve overall performance and avoid catastrophic forgetting. Our framework can be combined with any state-of-the-art novelty seeking goal exploration approaches. We construct a rich 3D image based environment with distractors. Experiments on this environment show that agents using our framework successfully identify interesting regions of the environment, resulting in drastically improved performances. The source code is available at https://sites.google.com/view/grimgep.
    Modeling Atmospheric Data and Identifying Dynamics: Temporal Data-Driven Modeling of Air Pollutants. (arXiv:2010.06538v2 [stat.AP] UPDATED)
    (2 min) Atmospheric modeling has recently experienced a surge with the advent of deep learning. Most of these models, however, predict concentrations of pollutants following a data-driven approach in which the physical laws that govern their behaviors and relationships remain hidden. With the aid of real-world air quality data collected hourly in different stations throughout Madrid, we present an empirical approach using data-driven techniques with the following goals: (1) Find parsimonious systems of ordinary differential equations via sparse identification of nonlinear dynamics (SINDy) that model the concentration of pollutants and their changes over time; (2) assess the performance and limitations of our models using stability analysis; (3) reconstruct the time series of chemical pollutants not measured in certain stations using delay coordinate embedding results. Our results show that Akaike's Information Criterion can work well in conjunction with best subset regression as to find an equilibrium between sparsity and goodness of fit. We also find that, due to the complexity of the chemical system under study, identifying the dynamics of this system over longer periods of time require higher levels of data filtering and smoothing. Stability analysis for the reconstructed ordinary differential equations (ODEs) reveals that more than half of the physically relevant critical points are saddle points, suggesting that the system is unstable even under the idealized assumption that all environmental conditions are constant over time.
    Principles for Evaluation of AI/ML Model Performance and Robustness. (arXiv:2107.02868v1 [cs.LG])
    (2 min) The Department of Defense (DoD) has significantly increased its investment in the design, evaluation, and deployment of Artificial Intelligence and Machine Learning (AI/ML) capabilities to address national security needs. While there are numerous AI/ML successes in the academic and commercial sectors, many of these systems have also been shown to be brittle and nonrobust. In a complex and ever-changing national security environment, it is vital that the DoD establish a sound and methodical process to evaluate the performance and robustness of AI/ML models before these new capabilities are deployed to the field. This paper reviews the AI/ML development process, highlights common best practices for AI/ML model evaluation, and makes recommendations to DoD evaluators to ensure the deployment of robust AI/ML capabilities for national security needs.
    RISAN: Robust Instance Specific Abstention Network. (arXiv:2107.03090v1 [cs.LG])
    (2 min) In this paper, we propose deep architectures for learning instance specific abstain (reject option) binary classifiers. The proposed approach uses double sigmoid loss function as described by Kulin Shah and Naresh Manwani in ("Online Active Learning of Reject Option Classifiers", AAAI, 2020), as a performance measure. We show that the double sigmoid loss is classification calibrated. We also show that the excess risk of 0-d-1 loss is upper bounded by the excess risk of double sigmoid loss. We derive the generalization error bounds for the proposed architecture for reject option classifiers. To show the effectiveness of the proposed approach, we experiment with several real world datasets. We observe that the proposed approach not only performs comparable to the state-of-the-art approaches, it is also robust against label noise. We also provide visualizations to observe the important features learned by the network corresponding to the abstaining decision.
    Regularization-based Continual Learning for Fault Prediction in Lithium-Ion Batteries. (arXiv:2107.03336v1 [cs.LG])
    (2 min) In recent years, the use of lithium-ion batteries has greatly expanded into products from many industrial sectors, e.g. cars, power tools or medical devices. An early prediction and robust understanding of battery faults could therefore greatly increase product quality in those fields. While current approaches for data-driven fault prediction provide good results on the exact processes they were trained on, they often lack the ability to flexibly adapt to changes, e.g. in operational or environmental parameters. Continual learning promises such flexibility, allowing for an automatic adaption of previously learnt knowledge to new tasks. Therefore, this article discusses different continual learning approaches from the group of regularization strategies, which are implemented, evaluated and compared based on a real battery wear dataset. Online elastic weight consolidation delivers the best results, but, as with all examined approaches, its performance appears to be strongly dependent on task characteristics and task sequence.
    Differentiable Architecture Pruning for Transfer Learning. (arXiv:2107.03375v1 [cs.LG])
    (2 min) We propose a new gradient-based approach for extracting sub-architectures from a given large model. Contrarily to existing pruning methods, which are unable to disentangle the network architecture and the corresponding weights, our architecture-pruning scheme produces transferable new structures that can be successfully retrained to solve different tasks. We focus on a transfer-learning setup where architectures can be trained on a large data set but very few data points are available for fine-tuning them on new tasks. We define a new gradient-based algorithm that trains architectures of arbitrarily low complexity independently from the attached weights. Given a search space defined by an existing large neural model, we reformulate the architecture search task as a complexity-penalized subset-selection problem and solve it through a two-temperature relaxation scheme. We provide theoretical convergence guarantees and validate the proposed transfer-learning strategy on real data.
    Predicting with Confidence on Unseen Distributions. (arXiv:2107.03315v1 [cs.LG])
    (2 min) Recent work has shown that the performance of machine learning models can vary substantially when models are evaluated on data drawn from a distribution that is close to but different from the training distribution. As a result, predicting model performance on unseen distributions is an important challenge. Our work connects techniques from domain adaptation and predictive uncertainty literature, and allows us to predict model accuracy on challenging unseen distributions without access to labeled data. In the context of distribution shift, distributional distances are often used to adapt models and improve their performance on new domains, however accuracy estimation, or other forms of predictive uncertainty, are often neglected in these investigations. Through investigating a wide range of established distributional distances, such as Frechet distance or Maximum Mean Discrepancy, we determine that they fail to induce reliable estimates of performance under distribution shift. On the other hand, we find that the difference of confidences (DoC) of a classifier's predictions successfully estimates the classifier's performance change over a variety of shifts. We specifically investigate the distinction between synthetic and natural distribution shifts and observe that despite its simplicity DoC consistently outperforms other quantifications of distributional difference. $DoC$ reduces predictive error by almost half ($46\%$) on several realistic and challenging distribution shifts, e.g., on the ImageNet-Vid-Robust and ImageNet-Rendition datasets.
    An algorithmic view of $\ell_2$ regularization and some path-following algorithms. (arXiv:2107.03322v1 [stat.ML])
    (2 min) We establish an equivalence between the $\ell_2$-regularized solution path for a convex loss function, and the solution of an ordinary differentiable equation (ODE). Importantly, this equivalence reveals that the solution path can be viewed as the flow of a hybrid of gradient descent and Newton method applying to the empirical loss, which is similar to a widely used optimization technique called trust region method. This provides an interesting algorithmic view of $\ell_2$ regularization, and is in contrast to the conventional view that the $\ell_2$ regularization solution path is similar to the gradient flow of the empirical loss.New path-following algorithms based on homotopy methods and numerical ODE solvers are proposed to numerically approximate the solution path. In particular, we consider respectively Newton method and gradient descent method as the basis algorithm for the homotopy method, and establish their approximation error rates over the solution path. Importantly, our theory suggests novel schemes to choose grid points that guarantee an arbitrarily small suboptimality for the solution path. In terms of computational cost, we prove that in order to achieve an $\epsilon$-suboptimality for the entire solution path, the number of Newton steps required for the Newton method is $\mathcal O(\epsilon^{-1/2})$, while the number of gradient steps required for the gradient descent method is $\mathcal O\left(\epsilon^{-1} \ln(\epsilon^{-1})\right)$. Finally, we use $\ell_2$-regularized logistic regression as an illustrating example to demonstrate the effectiveness of the proposed path-following algorithms.
    Impulse data models for the inverse problem of electrocardiography. (arXiv:2102.00570v2 [q-bio.QM] UPDATED)
    (2 min) The proposed method re-frames traditional inverse problems of electrocardiography into regression problems, constraining the solution space by decomposing signals with multidimensional Gaussian impulse basis functions. Impulse HSPs were generated with single Gaussian basis functions at discrete heart surface locations and projected to corresponding BSPs using a volume conductor torso model. Both BSP (inputs) and HSP (outputs) were mapped to regular 2D surface meshes and used to train a neural network. Predictive capabilities of the network were tested with unseen synthetic and experimental data. A dense full connected single hidden layer neural network was trained to map body surface impulses to heart surface Gaussian basis functions for reconstructing HSP. Synthetic pulses moving across the heart surface were predicted from the neural network with root mean squared error of $9.1\pm1.4$%. Predicted signals were robust to noise up to 20 dB and errors due to displacement and rotation of the heart within the torso were bounded and predictable. A shift of the heart 40 mm toward the spine resulted in a 4\% increase in signal feature localization error. The set of training impulse function data could be reduced and prediction error remained bounded. Recorded HSPs from in-vitro pig hearts were reliably decomposed using space-time Gaussian basis functions. Predicted HSPs for left-ventricular pacing had a mean absolute error of $10.4\pm11.4$ ms. Other pacing scenarios were analyzed with similar success. Conclusion: Impulses from Gaussian basis functions are potentially an effective and robust way to train simple neural network data models for reconstructing HSPs from decomposed BSPs. The HSPs predicted by the neural network can be used to generate activation maps that non-invasively identify features of cardiac electrical dysfunction and can guide subsequent treatment options.
    Diametrical Risk Minimization: Theory and Computations. (arXiv:1910.10844v3 [math.OC] UPDATED)
    (2 min) The theoretical and empirical performance of Empirical Risk Minimization (ERM) often suffers when loss functions are poorly behaved with large Lipschitz moduli and spurious sharp minimizers. We propose and analyze a counterpart to ERM called Diametrical Risk Minimization (DRM), which accounts for worst-case empirical risks within neighborhoods in parameter space. DRM has generalization bounds that are independent of Lipschitz moduli for convex as well as nonconvex problems and it can be implemented using a practical algorithm based on stochastic gradient descent. Numerical results illustrate the ability of DRM to find quality solutions with low generalization error in sharp empirical risk landscapes from benchmark neural network classification problems with corrupted labels.
    KaFiStO: A Kalman Filtering Framework for Stochastic Optimization. (arXiv:2107.03331v1 [cs.LG])
    (2 min) Optimization is often cast as a deterministic problem, where the solution is found through some iterative procedure such as gradient descent. However, when training neural networks the loss function changes over (iteration) time due to the randomized selection of a subset of the samples. This randomization turns the optimization problem into a stochastic one. We propose to consider the loss as a noisy observation with respect to some reference optimum. This interpretation of the loss allows us to adopt Kalman filtering as an optimizer, as its recursive formulation is designed to estimate unknown parameters from noisy measurements. Moreover, we show that the Kalman Filter dynamical model for the evolution of the unknown parameters can be used to capture the gradient dynamics of advanced methods such as Momentum and Adam. We call this stochastic optimization method KaFiStO. KaFiStO is an easy to implement, scalable, and efficient method to train neural networks. We show that it also yields parameter estimates that are on par with or better than existing optimization algorithms across several neural network architectures and machine learning tasks, such as computer vision and language modeling.
    A comparative study of various Deep Learning techniques for spatio-temporal Super-Resolution reconstruction of Forced Isotropic Turbulent flows. (arXiv:2107.03361v1 [physics.flu-dyn])
    (2 min) Super-resolution is an innovative technique that upscales the resolution of an image or a video and thus enables us to reconstruct high-fidelity images from low-resolution data. This study performs super-resolution analysis on turbulent flow fields spatially and temporally using various state-of-the-art machine learning techniques like ESPCN, ESRGAN and TecoGAN to reconstruct high-resolution flow fields from low-resolution flow field data, especially keeping in mind the need for low resource consumption and rapid results production/verification. The dataset used for this study is extracted from the 'isotropic 1024 coarse' dataset which is a part of Johns Hopkins Turbulence Databases (JHTDB). We have utilized pre-trained models and fine tuned them to our needs, so as to minimize the computational resources and the time required for the implementation of the super-resolution models. The advantages presented by this method far exceed the expectations and the outcomes of regular single structure models. The results obtained through these models are then compared using MSE, PSNR, SAM, VIF and SCC metrics in order to evaluate the upscaled results, find the balance between computational power and output quality, and then identify the most accurate and efficient model for spatial and temporal super-resolution of turbulent flow fields.
    Discriminative Mutual Information Estimators for Channel Capacity Learning. (arXiv:2107.03084v1 [cs.IT])
    (2 min) Channel capacity plays a crucial role in the development of modern communication systems as it represents the maximum rate at which information can be reliably transmitted over a communication channel. Nevertheless, for the majority of channels, finding a closed-form capacity expression remains an open challenge. This is because it requires to carry out two formidable tasks a) the computation of the mutual information between the channel input and output, and b) its maximization with respect to the signal distribution at the channel input. In this paper, we address both tasks. Inspired by implicit generative models, we propose a novel cooperative framework to automatically learn the channel capacity, for any type of memory-less channel. In particular, we firstly develop a new methodology to estimate the mutual information directly from a discriminator typically deployed to train adversarial networks, referred to as discriminative mutual information estimator (DIME). Secondly, we include the discriminator in a cooperative channel capacity learning framework, referred to as CORTICAL, where a discriminator learns to distinguish between dependent and independent channel input-output samples while a generator learns to produce the optimal channel input distribution for which the discriminator exhibits the best performance. Lastly, we prove that a particular choice of the cooperative value function solves the channel capacity estimation problem. Simulation results demonstrate that the proposed method offers high accuracy.
    XPDNet for MRI Reconstruction: an application to the 2020 fastMRI challenge. (arXiv:2010.07290v2 [eess.IV] UPDATED)
    (2 min) We present a new neural network, the XPDNet, for MRI reconstruction from periodically under-sampled multi-coil data. We inform the design of this network by taking best practices from MRI reconstruction and computer vision. We show that this network can achieve state-of-the-art reconstruction results, as shown by its ranking of second in the fastMRI 2020 challenge.
    A Comparative Study of Using Spatial-Temporal Graph Convolutional Networks for Predicting Availability in Bike Sharing Schemes. (arXiv:2104.10644v2 [cs.LG] UPDATED)
    (2 min) Accurately forecasting transportation demand is crucial for efficient urban traffic guidance, control and management. One solution to enhance the level of prediction accuracy is to leverage graph convolutional networks (GCN), a neural network based modelling approach with the ability to process data contained in graph based structures. As a powerful extension of GCN, a spatial-temporal graph convolutional network (ST-GCN) aims to capture the relationship of data contained in the graphical nodes across both spatial and temporal dimensions, which presents a novel deep learning paradigm for the analysis of complex time-series data that also involves spatial information as present in transportation use cases. In this paper, we present an Attention-based ST-GCN (AST-GCN) for predicting the number of available bikes in bike-sharing systems in cities, where the attention-based mechanism is introduced to further improve the performance of an ST-GCN. Furthermore, we also discuss the impacts of different modelling methods of adjacency matrices on the proposed architecture. Our experimental results are presented using two real-world datasets, Dublinbikes and NYC-Citi Bike, to illustrate the efficacy of our proposed model which outperforms the majority of existing approaches.
    Evaluating Large Language Models Trained on Code. (arXiv:2107.03374v1 [cs.LG])
    (2 min) We introduce Codex, a GPT language model fine-tuned on publicly available code from GitHub, and study its Python code-writing capabilities. A distinct production version of Codex powers GitHub Copilot. On HumanEval, a new evaluation set we release to measure functional correctness for synthesizing programs from docstrings, our model solves 28.8% of the problems, while GPT-3 solves 0% and GPT-J solves 11.4%. Furthermore, we find that repeated sampling from the model is a surprisingly effective strategy for producing working solutions to difficult prompts. Using this method, we solve 70.2% of our problems with 100 samples per problem. Careful investigation of our model reveals its limitations, including difficulty with docstrings describing long chains of operations and with binding operations to variables. Finally, we discuss the potential broader impacts of deploying powerful code generation technologies, covering safety, security, and economics.
    Identifying Training Stop Point with Noisy Labeled Data. (arXiv:2012.13435v2 [cs.LG] UPDATED)
    (2 min) Training deep neural networks (DNNs) with noisy labels is a challenging problem due to over-parameterization. DNNs tend to essentially fit on clean samples at a higher rate in the initial stages, and later fit on the noisy samples at a relatively lower rate. Thus, with a noisy dataset, the test accuracy increases initially and drops in the later stages. To find an early stopping point at the maximum obtainable test accuracy (MOTA), recent studies assume either that i) a clean validation set is available or ii) the noise ratio is known, or, both. However, often a clean validation set is unavailable, and the noise estimation can be inaccurate. To overcome these issues, we provide a novel training solution, free of these conditions. We analyze the rate of change of the training accuracy for different noise ratios under different conditions to identify a training stop region. We further develop a heuristic algorithm based on a small-learning assumption to find a training stop point (TSP) at or close to MOTA. To the best of our knowledge, our method is the first to rely solely on the \textit{training behavior}, while utilizing the entire training set, to automatically find a TSP. We validated the robustness of our algorithm (AutoTSP) through several experiments on CIFAR-10, CIFAR-100, and a real-world noisy dataset for different noise ratios, noise types, and architectures.
    Last-iterate Convergence of Decentralized Optimistic Gradient Descent/Ascent in Infinite-horizon Competitive Markov Games. (arXiv:2102.04540v2 [cs.LG] UPDATED)
    (2 min) We study infinite-horizon discounted two-player zero-sum Markov games, and develop a decentralized algorithm that provably converges to the set of Nash equilibria under self-play. Our algorithm is based on running an Optimistic Gradient Descent Ascent algorithm on each state to learn the policies, with a critic that slowly learns the value of each state. To the best of our knowledge, this is the first algorithm in this setting that is simultaneously rational (converging to the opponent's best response when it uses a stationary policy), convergent (converging to the set of Nash equilibria under self-play), agnostic (no need to know the actions played by the opponent), symmetric (players taking symmetric roles in the algorithm), and enjoying a finite-time last-iterate convergence guarantee, all of which are desirable properties of decentralized algorithms.
    Decoupling Exploration and Exploitation for Meta-Reinforcement Learning without Sacrifices. (arXiv:2008.02790v3 [cs.LG] UPDATED)
    (2 min) The goal of meta-reinforcement learning (meta-RL) is to build agents that can quickly learn new tasks by leveraging prior experience on related tasks. Learning a new task often requires both exploring to gather task-relevant information and exploiting this information to solve the task. In principle, optimal exploration and exploitation can be learned end-to-end by simply maximizing task performance. However, such meta-RL approaches struggle with local optima due to a chicken-and-egg problem: learning to explore requires good exploitation to gauge the exploration's utility, but learning to exploit requires information gathered via exploration. Optimizing separate objectives for exploration and exploitation can avoid this problem, but prior meta-RL exploration objectives yield suboptimal policies that gather information irrelevant to the task. We alleviate both concerns by constructing an exploitation objective that automatically identifies task-relevant information and an exploration objective to recover only this information. This avoids local optima in end-to-end training, without sacrificing optimal exploration. Empirically, DREAM substantially outperforms existing approaches on complex meta-RL problems, such as sparse-reward 3D visual navigation. Videos of DREAM: https://ezliu.github.io/dream/
    Robust Dynamic Multi-Modal Data Fusion: A Model Uncertainty Perspective. (arXiv:2105.06018v2 [cs.LG] UPDATED)
    (2 min) This paper is concerned with multi-modal data fusion (MMDF) under unexpected modality failures in nonlinear non-Gaussian dynamic processes. An efficient framework to tackle this problem is proposed. In particular, a notion termed modality "\emph{usefulness}", which takes a value of 1 or 0, is used for indicating whether the observation of this modality is useful or not. For $n$ modalities involved, $2^n$ combinations of their "\emph{usefulness}" values exist. Each combination defines one hypothetical model of the true data generative process. Then the problem of concern is formalized as a task of nonlinear non-Gaussian state filtering under model uncertainty, which is addressed by a dynamic model averaging (DMA) based particle filter (PF) algorithm. This DMA algorithm employs $2^n$ models, while all models share the same state-transition function and a unique set of particle values. That makes the computational complexity of this algorithm only slightly larger than a single model based PF algorithm, especially for scenarios in which $n$ is small. Experimental results show that the proposed solution outperforms remarkably state-of-the-art methods. Code and data are available at https://github.com/robinlau1981/fusion.
    Distributed stochastic optimization with large delays. (arXiv:2107.02919v1 [math.OC])
    (2 min) One of the most widely used methods for solving large-scale stochastic optimization problems is distributed asynchronous stochastic gradient descent (DASGD), a family of algorithms that result from parallelizing stochastic gradient descent on distributed computing architectures (possibly) asychronously. However, a key obstacle in the efficient implementation of DASGD is the issue of delays: when a computing node contributes a gradient update, the global model parameter may have already been updated by other nodes several times over, thereby rendering this gradient information stale. These delays can quickly add up if the computational throughput of a node is saturated, so the convergence of DASGD may be compromised in the presence of large delays. Our first contribution is that, by carefully tuning the algorithm's step-size, convergence to the critical set is still achieved in mean square, even if the delays grow unbounded at a polynomial rate. We also establish finer results in a broad class of structured optimization problems (called variationally coherent), where we show that DASGD converges to a global optimum with probability $1$ under the same delay assumptions. Together, these results contribute to the broad landscape of large-scale non-convex stochastic optimization by offering state-of-the-art theoretical guarantees and providing insights for algorithm design.
    Detection and Mitigation of Rare Subclasses in Deep Neural Network Classifiers. (arXiv:1911.12780v2 [cs.LG] UPDATED)
    (2 min) Regions of high-dimensional input spaces that are underrepresented in training datasets reduce machine-learnt classifier performance, and may lead to corner cases and unwanted bias for classifiers used in decision making systems. When these regions belong to otherwise well-represented classes, their presence and negative impact are very hard to identify. We propose an approach for the detection and mitigation of such rare subclasses in deep neural network classifiers. The new approach is underpinned by an easy-to-compute commonality metric that supports the detection of rare subclasses, and comprises methods for reducing the impact of these subclasses during both model training and model exploitation. We demonstrate our approach using two well-known datasets, MNIST's handwritten digits and Kaggle's cats/dogs, identifying rare subclasses and producing models which compensate for subclass rarity. In addition we demonstrate how our run-time approach increases the ability of users to identify samples likely to be misclassified at run-time.
    Differentiable Dynamic Quantization with Mixed Precision and Adaptive Resolution. (arXiv:2106.02295v2 [cs.LG] UPDATED)
    (2 min) Model quantization is challenging due to many tedious hyper-parameters such as precision (bitwidth), dynamic range (minimum and maximum discrete values) and stepsize (interval between discrete values). Unlike prior arts that carefully tune these values, we present a fully differentiable approach to learn all of them, named Differentiable Dynamic Quantization (DDQ), which has several benefits. (1) DDQ is able to quantize challenging lightweight architectures like MobileNets, where different layers prefer different quantization parameters. (2) DDQ is hardware-friendly and can be easily implemented using low-precision matrix-vector multiplication, making it capable in many hardware such as ARM. (3) Extensive experiments show that DDQ outperforms prior arts on many networks and benchmarks, especially when models are already efficient and compact. e.g., DDQ is the first approach that achieves lossless 4-bit quantization for MobileNetV2 on ImageNet.
    Predicting Opioid Use Disorder from Longitudinal Healthcare Data using Multi-stream Transformer. (arXiv:2103.08800v2 [cs.LG] UPDATED)
    (2 min) Opioid Use Disorder (OUD) is a public health crisis costing the US billions of dollars annually in healthcare, lost workplace productivity, and crime. Analyzing longitudinal healthcare data is critical in addressing many real-world problems in healthcare. Leveraging the real-world longitudinal healthcare data, we propose a novel multi-stream transformer model called MUPOD for OUD identification. MUPOD is designed to simultaneously analyze multiple types of healthcare data streams, such as medications and diagnoses, by attending to segments within and across these data streams. Our model tested on the data from 392,492 patients with long-term back pain problems showed significantly better performance than the traditional models and recently developed deep learning models.
    On Training Instance Selection for Few-Shot Neural Text Generation. (arXiv:2107.03176v1 [cs.CL])
    (2 min) Large-scale pretrained language models have led to dramatic improvements in text generation. Impressive performance can be achieved by finetuning only on a small number of instances (few-shot setting). Nonetheless, almost all previous work simply applies random sampling to select the few-shot training instances. Little to no attention has been paid to the selection strategies and how they would affect model performance. In this work, we present a study on training instance selection in few-shot neural text generation. The selection decision is made based only on the unlabeled data so as to identify the most worthwhile data points that should be annotated under some budget of labeling cost. Based on the intuition that the few-shot training instances should be diverse and representative of the entire data distribution, we propose a simple selection strategy with K-means clustering. We show that even with the naive clustering-based approach, the generation models consistently outperform random sampling on three text generation tasks: data-to-text generation, document summarization and question generation. We hope that this work will call for more attention on this largely unexplored area.
    Scaling Up Exact Neural Network Compression by ReLU Stability. (arXiv:2102.07804v2 [cs.LG] UPDATED)
    (2 min) We can compress a neural network while exactly preserving its underlying functionality with respect to a given input domain if some of its neurons are stable. However, current approaches to determine the stability of neurons with Rectified Linear Unit (ReLU) activations require solving or finding a good approximation to multiple discrete optimization problems. In this work, we introduce an algorithm based on solving a single optimization problem to identify all stable neurons. Our approach is on median 100 times faster than the state-of-art method, which allows us to explore exact compression on deeper (5 x 100) and wider (2 x 800) networks within minutes. For classifiers trained under an amount of L1 regularization that does not worsen accuracy, we can remove up to 40% of the connections
    Cognitive Learning-Aided Multi-Antenna Communications. (arXiv:2010.03131v2 [eess.SP] UPDATED)
    (2 min) Cognitive communications have emerged as a promising solution to enhance, adapt, and invent new tools and capabilities that transcend conventional wireless networks. Deep learning (DL) is critical in enabling essential features of cognitive systems because of its fast prediction performance, adaptive behavior, and model-free structure. These features are especially significant for multi-antenna wireless communications systems, which generate and handle massive data. Multiple antennas may provide multiplexing, diversity, or antenna gains that, respectively, improve the capacity, bit error rate, or the signal-to-interference-plus-noise ratio. In practice, multi-antenna cognitive communications encounter challenges in terms of data complexity and diversity, hardware complexity, and wireless channel dynamics. The DL-based solutions tackle these problems at the various stages of communications processing such as channel estimation, hybrid beamforming, user localization, and sparse array design. There are research opportunities to address significant design challenges arising from insufficient data coverage, learning model complexity, and data transmission overheads. This article provides synopses of various DL-based methods to impart cognitive behavior to multi-antenna wireless communications.
    The Promises and Pitfalls of Deep Kernel Learning. (arXiv:2102.12108v2 [stat.ML] UPDATED)
    (2 min) Deep kernel learning (DKL) and related techniques aim to combine the representational power of neural networks with the reliable uncertainty estimates of Gaussian processes. One crucial aspect of these models is an expectation that, because they are treated as Gaussian process models optimized using the marginal likelihood, they are protected from overfitting. However, we identify situations where this is not the case. We explore this behavior, explain its origins and consider how it applies to real datasets. Through careful experimentation on the UCI, CIFAR-10, and the UTKFace datasets, we find that the overfitting from overparameterized maximum marginal likelihood, in which the model is "somewhat Bayesian", can in certain scenarios be worse than that from not being Bayesian at all. We explain how and when DKL can still be successful by investigating optimization dynamics. We also find that failures of DKL can be rectified by a fully Bayesian treatment, which leads to the desired performance improvements over standard neural networks and Gaussian processes.
    Explainable Prediction of Text Complexity: The Missing Preliminaries for Text Simplification. (arXiv:2007.15823v2 [cs.CL] UPDATED)
    (2 min) Text simplification reduces the language complexity of professional content for accessibility purposes. End-to-end neural network models have been widely adopted to directly generate the simplified version of input text, usually functioning as a blackbox. We show that text simplification can be decomposed into a compact pipeline of tasks to ensure the transparency and explainability of the process. The first two steps in this pipeline are often neglected: 1) to predict whether a given piece of text needs to be simplified, and 2) if yes, to identify complex parts of the text. The two tasks can be solved separately using either lexical or deep learning methods, or solved jointly. Simply applying explainable complexity prediction as a preliminary step, the out-of-sample text simplification performance of the state-of-the-art, black-box simplification models can be improved by a large margin.
    Automatic Fault Detection for Deep Learning Programs Using Graph Transformations. (arXiv:2105.08095v2 [cs.SE] CROSS LISTED)
    (2 min) Nowadays, we are witnessing an increasing demand in both corporates and academia for exploiting Deep Learning (DL) to solve complex real-world problems. A DL program encodes the network structure of a desirable DL model and the process by which the model learns from the training dataset. Like any software, a DL program can be faulty, which implies substantial challenges of software quality assurance, especially in safety-critical domains. It is therefore crucial to equip DL development teams with efficient fault detection techniques and tools. In this paper, we propose NeuraLint, a model-based fault detection approach for DL programs, using meta-modelling and graph transformations. First, we design a meta-model for DL programs that includes their base skeleton and fundamental properties. Then, we construct a graph-based verification process that covers 23 rules defined on top of the meta-model and implemented as graph transformations to detect faults and design inefficiencies in the generated models (i.e., instances of the meta-model). First, the proposed approach is evaluated by finding faults and design inefficiencies in 28 synthesized examples built from common problems reported in the literature. Then NeuraLint successfully finds 64 faults and design inefficiencies in 34 real-world DL programs extracted from Stack Overflow posts and GitHub repositories. The results show that NeuraLint effectively detects faults and design issues in both synthesized and real-world examples with a recall of 70.5 % and a precision of 100 %. Although the proposed meta-model is designed for feedforward neural networks, it can be extended to support other neural network architectures such as recurrent neural networks. Researchers can also expand our set of verification rules to cover more types of issues in DL programs.
    SelfCF: A Simple Framework for Self-supervised Collaborative Filtering. (arXiv:2107.03019v1 [cs.IR])
    (2 min) Collaborative filtering (CF) is widely used to learn an informative latent representation of a user or item from observed interactions. Existing CF-based methods commonly adopt negative sampling to discriminate different items. That is, observed user-item pairs are treated as positive instances; unobserved pairs are considered as negative instances and are sampled under a defined distribution for training. Training with negative sampling on large datasets is computationally expensive. Further, negative items should be carefully sampled under the defined distribution, in order to avoid selecting an observed positive item in the training dataset. Unavoidably, some negative items sampled from the training dataset could be positive in the test set. Recently, self-supervised learning (SSL) has emerged as a powerful tool to learn a model without negative samples. In this paper, we propose a self-supervised collaborative filtering framework (SelfCF), that is specially designed for recommender scenario with implicit feedback. The main idea of SelfCF is to augment the output embeddings generated by backbone networks, because it is infeasible to augment raw input of user/item ids. We propose and study three output perturbation techniques that can be applied to different types of backbone networks including both traditional CF models and graph-based models. By encapsulating two popular recommendation models into the framework, our experiments on three datasets show that the best performance of our framework is comparable or better than the supervised counterpart. We also show that SelfCF can boost up the performance by up to 8.93\% on average, compared with another self-supervised framework as the baseline. Source codes are available at: https://github.com/enoche/SelfCF.
    Faults in Deep Reinforcement Learning Programs: A Taxonomy and A Detection Approach. (arXiv:2101.00135v2 [cs.SE] CROSS LISTED)
    (2 min) A growing demand is witnessed in both industry and academia for employing Deep Learning (DL) in various domains to solve real-world problems. Deep Reinforcement Learning (DRL) is the application of DL in the domain of Reinforcement Learning (RL). Like any software systems, DRL applications can fail because of faults in their programs. In this paper, we present the first attempt to categorize faults occurring in DRL programs. We manually analyzed 761 artifacts of DRL programs (from Stack Overflow posts and GitHub issues) developed using well-known DRL frameworks (OpenAI Gym, Dopamine, Keras-rl, Tensorforce) and identified faults reported by developers/users. We labeled and taxonomized the identified faults through several rounds of discussions. The resulting taxonomy is validated using an online survey with 19 developers/researchers. To allow for the automatic detection of faults in DRL programs, we have defined a meta-model of DRL programs and developed DRLinter, a model-based fault detection approach that leverages static analysis and graph transformations. The execution flow of DRLinter consists in parsing a DRL program to generate a model conforming to our meta-model and applying detection rules on the model to identify faults occurrences. The effectiveness of DRLinter is evaluated using 15 synthetic DRLprograms in which we injected faults observed in the analyzed artifacts of the taxonomy. The results show that DRLinter can successfully detect faults in all synthetic faulty programs.
    AutoDebias: Learning to Debias for Recommendation. (arXiv:2105.04170v2 [cs.LG] UPDATED)
    (2 min) Recommender systems rely on user behavior data like ratings and clicks to build personalization model. However, the collected data is observational rather than experimental, causing various biases in the data which significantly affect the learned model. Most existing work for recommendation debiasing, such as the inverse propensity scoring and imputation approaches, focuses on one or two specific biases, lacking the universal capacity that can account for mixed or even unknown biases in the data. Towards this research gap, we first analyze the origin of biases from the perspective of \textit{risk discrepancy} that represents the difference between the expectation empirical risk and the true risk. Remarkably, we derive a general learning framework that well summarizes most existing debiasing strategies by specifying some parameters of the general framework. This provides a valuable opportunity to develop a universal solution for debiasing, e.g., by learning the debiasing parameters from data. However, the training data lacks important signal of how the data is biased and what the unbiased data looks like. To move this idea forward, we propose \textit{AotoDebias} that leverages another (small) set of uniform data to optimize the debiasing parameters by solving the bi-level optimization problem with meta-learning. Through theoretical analyses, we derive the generalization bound for AutoDebias and prove its ability to acquire the appropriate debiasing strategy. Extensive experiments on two real datasets and a simulated dataset demonstrated effectiveness of AutoDebias. The code is available at \url{https://github.com/DongHande/AutoDebias}.
    Convolutional Neural Nets in Chemical Engineering: Foundations, Computations, and Applications. (arXiv:2101.04869v2 [cs.LG] UPDATED)
    (2 min) In this paper we review the mathematical foundations of convolutional neural nets (CNNs) with the goals of: i) highlighting connections with techniques from statistics, signal processing, linear algebra, differential equations, and optimization, ii) demystifying underlying computations, and iii) identifying new types of applications. CNNs are powerful machine learning models that highlight features from grid data to make predictions (regression and classification). The grid data object can be represented as vectors (in 1D), matrices (in 2D), or tensors (in 3D or higher dimensions) and can incorporate multiple channels (thus providing high flexibility in the input data representation). CNNs highlight features from the grid data by performing convolution operations with different types of operators. The operators highlight different types of features (e.g., patterns, gradients, geometrical features) and are learned by using optimization techniques. In other words, CNNs seek to identify optimal operators that best map the input data to the output data. A common misconception is that CNNs are only capable of processing image or video data but their application scope is much wider; specifically, datasets encountered in diverse applications can be expressed as grid data. Here, we show how to apply CNNs to new types of applications such as optimal control, flow cytometry, multivariate process monitoring, and molecular simulations.
    A Pseudo-Metric between Probability Distributions based on Depth-Trimmed Regions. (arXiv:2103.12711v2 [stat.ML] UPDATED)
    (2 min) Data depth is a non parametric statistical tool that measures centrality of any element $x\in\mathbb{R}^d$ with respect to (w.r.t.) a probability distribution or a data set. It is a natural median-oriented extension of the cumulative distribution function (cdf) to the multivariate case. Consequently, its upper level sets -- the depth-trimmed regions -- give rise to a definition of multivariate quantiles. In this work, we propose two new pseudo-metrics between continuous probability measures based on data depth and its associated central regions. The first one is constructed as the Lp-distance between data depth w.r.t. each distribution while the second one relies on the Hausdorff distance between their quantile regions. It can further be seen as an original way to extend the one-dimensional formulae of the Wasserstein distance, which involves quantiles and cdfs, to the multivariate space. After discussing the properties of these pseudo-metrics and providing conditions under which they define a distance, we highlight similarities with the Wasserstein distance. Interestingly, the derived non-asymptotic bounds show that in contrast to the Wasserstein distance, the proposed pseudo-metrics do not suffer from the curse of dimensionality. Moreover, based on the support function of a convex body, we propose an efficient approximation possessing linear time complexity w.r.t. the size of the data set and its dimension. The quality of this approximation as well as the performance of the proposed approach are illustrated in experiments. Furthermore, by construction the regions-based pseudo-metric appears to be robust w.r.t. both outliers and heavy tails, a behavior witnessed in the numerical experiments.
    Understanding the Security of Deepfake Detection. (arXiv:2107.02045v2 [cs.CR] UPDATED)
    (2 min) Deepfakes pose growing challenges to the trust of information on the Internet. Thus, detecting deepfakes has attracted increasing attentions from both academia and industry. State-of-the-art deepfake detection methods consist of two key components, i.e., face extractor and face classifier, which extract the face region in an image and classify it to be real/fake, respectively. Existing studies mainly focused on improving the detection performance in non-adversarial settings, leaving security of deepfake detection in adversarial settings largely unexplored. In this work, we aim to bridge the gap. In particular, we perform a systematic measurement study to understand the security of the state-of-the-art deepfake detection methods in adversarial settings. We use two large-scale public deepfakes data sources including FaceForensics++ and Facebook Deepfake Detection Challenge, where the deepfakes are fake face images; and we train state-of-the-art deepfake detection methods. These detection methods can achieve 0.94--0.99 accuracies in non-adversarial settings on these datasets. However, our measurement results uncover multiple security limitations of the deepfake detection methods in adversarial settings. First, we find that an attacker can evade a face extractor, i.e., the face extractor fails to extract the correct face regions, via adding small Gaussian noise to its deepfake images. Second, we find that a face classifier trained using deepfakes generated by one method cannot detect deepfakes generated by another method, i.e., an attacker can evade detection via generating deepfakes using a new method. Third, we find that an attacker can leverage backdoor attacks developed by the adversarial machine learning community to evade a face classifier. Our results highlight that deepfake detection should consider the adversarial nature of the problem.
    Improving the performance of EEG decoding using anchored-STFT in conjunction with gradient norm adversarial augmentation. (arXiv:2011.14694v3 [q-bio.QM] UPDATED)
    (3 min) Brain-computer interfaces (BCIs) enable direct communication between humans and machines by translating brain activity into control commands. EEG is one of the most common sources of neural signals because of its inexpensive and non-invasive nature. However, interpretation of EEG signals is non-trivial because EEG signals have a low spatial resolution and are often distorted with noise and artifacts. Therefore, it is possible that meaningful patterns for classifying EEG signals are deeply hidden. Nowadays, state-of-the-art deep-learning algorithms have proven to be quite efficient in learning hidden, meaningful patterns. However, the performance of the deep learning algorithms depends upon the quality and the amount of the provided training data. Hence, a better input formation (feature extraction) technique and a generative model to produce high-quality data can enable the deep learning algorithms to adapt high generalization quality. In this study, we proposed a novel input formation (feature extraction) method in conjunction with a novel deep learning based generative model to harness new training examples. The feature vectors are extracted using a modified Short Time Fourier Transform (STFT) called anchored-STFT. Anchored-STFT, inspired by wavelet transform, tries to minimize the tradeoff between time and frequency resolution. As a result, it extracts the inputs (feature vectors) with better time and frequency resolution compared to the standard STFT. Secondly, we introduced a novel generative adversarial data augmentation technique called gradient norm adversarial augmentation (GNAA) for generating more training data. Thirdly, we investigated the existence and significance of adversarial inputs in EEG data. Our approach obtained the kappa value of 0.814 for BCI competition II dataset III and 0.755 for BCI competition IV dataset 2b for session-to-session transfer on test data.
    Learning to Predict Error for MRI Reconstruction. (arXiv:2002.05582v3 [cs.LG] UPDATED)
    (2 min) In healthcare applications, predictive uncertainty has been used to assess predictive accuracy. In this paper, we demonstrate that predictive uncertainty estimated by the current methods does not highly correlate with prediction error by decomposing the latter into random and systematic errors, and showing that the former is equivalent to the variance of the random error. In addition, we observe that current methods unnecessarily compromise performance by modifying the model and training loss to estimate the target and uncertainty jointly. We show that estimating them separately without modifications improves performance. Following this, we propose a novel method that estimates the target labels and magnitude of the prediction error in two steps. We demonstrate this method on a large-scale MRI reconstruction task, and achieve significantly better results than the state-of-the-art uncertainty estimation methods.
    An empirical evaluation of active inference in multi-armed bandits. (arXiv:2101.08699v3 [cs.LG] UPDATED)
    (2 min) A key feature of sequential decision making under uncertainty is a need to balance between exploiting--choosing the best action according to the current knowledge, and exploring--obtaining information about values of other actions. The multi-armed bandit problem, a classical task that captures this trade-off, served as a vehicle in machine learning for developing bandit algorithms that proved to be useful in numerous industrial applications. The active inference framework, an approach to sequential decision making recently developed in neuroscience for understanding human and animal behaviour, is distinguished by its sophisticated strategy for resolving the exploration-exploitation trade-off. This makes active inference an exciting alternative to already established bandit algorithms. Here we derive an efficient and scalable approximate active inference algorithm and compare it to two state-of-the-art bandit algorithms: Bayesian upper confidence bound and optimistic Thompson sampling. This comparison is done on two types of bandit problems: a stationary and a dynamic switching bandit. Our empirical evaluation shows that the active inference algorithm does not produce efficient long-term behaviour in stationary bandits. However, in the more challenging switching bandit problem active inference performs substantially better than the two state-of-the-art bandit algorithms. The results open exciting venues for further research in theoretical and applied machine learning, as well as lend additional credibility to active inference as a general framework for studying human and animal behaviour.
    L2P: An Algorithm for Estimating Heavy-tailed Outcomes. (arXiv:1908.04628v2 [cs.LG] UPDATED)
    (2 min) Many real-world prediction tasks have outcome variables that have characteristic heavy-tail distributions. Examples include copies of books sold, auction prices of art pieces, demand for commodities in warehouses, etc. By learning heavy-tailed distributions, "big and rare" instances (e.g., the best-sellers) will have accurate predictions. Most existing approaches are not dedicated to learning heavy-tailed distribution; thus, they heavily under-predict such instances. To tackle this problem, we introduce Learning to Place (L2P), which exploits the pairwise relationships between instances for learning. In its training phase, L2P learns a pairwise preference classifier: is instance A > instance B? In its placing phase, L2P obtains a prediction by placing the new instance among the known instances. Based on its placement, the new instance is then assigned a value for its outcome variable. Experiments on real data show that L2P outperforms competing approaches in terms of accuracy and ability to reproduce heavy-tailed outcome distribution. In addition, L2P provides an interpretable model by placing each predicted instance in relation to its comparable neighbors. Interpretable models are highly desirable when lives and treasure are at stake.
    Enhancing the Spatio-temporal Observability of Grid-Edge Resources in Distribution Grids. (arXiv:2102.07801v2 [eess.SP] UPDATED)
    (2 min) Enhancing the spatio-temporal observability of distributed energy resources (DERs) is crucial for achieving secure and efficient operations in distribution grids. This paper puts forth a joint recovery framework for residential loads by leveraging the complimentary strengths of heterogeneous types of measurements. The proposed approaches integrate the low-resolution smart meter data collected for every load node with the fast-sampled feeder-level measurements provided by limited number of phasor measurement units. To address the lack of data, we exploit two key characteristics for the loads and DERs, namely the sparse changes due to infrequent activities of appliances and electric vehicles (EVs) and the locational dependence of solar photovoltaic (PV) generation. Accordingly, meaningful regularization terms are introduced to cast a convex load recovery problem, which will be further simplified to reduce computational complexity. The load recovery solutions can be utilized to identify the EV charging events at each load node and to infer the total behind-the-meter PV output. Numerical tests using real-world data have demonstrated the effectiveness of the proposed approaches in enhancing the visibility of these grid-edge DERs.
    A case for new neural network smoothness constraints. (arXiv:2012.07969v3 [stat.ML] UPDATED)
    (2 min) How sensitive should machine learning models be to input changes? We tackle the question of model smoothness and show that it is a useful inductive bias which aids generalization, adversarial robustness, generative modeling and reinforcement learning. We explore current methods of imposing smoothness constraints and observe they lack the flexibility to adapt to new tasks, they don't account for data modalities, they interact with losses, architectures and optimization in ways not yet fully understood. We conclude that new advances in the field are hinging on finding ways to incorporate data, tasks and learning into our definitions of smoothness.
    Formal derivation of Mesh Neural Networks with their Forward-Only gradient Propagation. (arXiv:1905.06684v5 [cs.LG] UPDATED)
    (2 min) This paper proposes the Mesh Neural Network (MNN), a novel architecture which allows neurons to be connected in any topology, to efficiently route information. In MNNs, information is propagated between neurons throughout a state transition function. State and error gradients are then directly computed from state updates without backward computation. The MNN architecture and the error propagation schema is formalized and derived in tensor algebra. The proposed computational model can fully supply a gradient descent process, and is potentially suitable for very large scale sparse NNs, due to its expressivity and training efficiency, with respect to NNs based on back-propagation and computational graphs.
    Monocular Depth Estimation via Listwise Ranking using the Plackett-Luce Model. (arXiv:2010.13118v4 [cs.CV] UPDATED)
    (2 min) In many real-world applications, the relative depth of objects in an image is crucial for scene understanding. Recent approaches mainly tackle the problem of depth prediction in monocular images by treating the problem as a regression task. Yet, being interested in an order relation in the first place, ranking methods suggest themselves as a natural alternative to regression, and indeed, ranking approaches leveraging pairwise comparisons as training information ("object A is closer to the camera than B") have shown promising performance on this problem. In this paper, we elaborate on the use of so-called listwise ranking as a generalization of the pairwise approach. Our method is based on the Plackett-Luce (PL) model, a probability distribution on rankings, which we combine with a state-of-the-art neural network architecture and a simple sampling strategy to reduce training complexity. Moreover, taking advantage of the representation of PL as a random utility model, the proposed predictor offers a natural way to recover (shift-invariant) metric depth information from ranking-only data provided at training time. An empirical evaluation on several benchmark datasets in a "zero-shot" setting demonstrates the effectiveness of our approach compared to existing ranking and regression methods.
    Solving the scalarization issues of Advantage-based Reinforcement Learning Algorithms. (arXiv:2004.04120v3 [cs.LG] UPDATED)
    (2 min) In this research, some of the issues that arise from the scalarization of the multi-objective optimization problem in the Advantage Actor Critic (A2C) reinforcement learning algorithm are investigated. The paper shows how a naive scalarization can lead to gradients overlapping. Furthermore, the possibility that the entropy regularization term can be a source of uncontrolled noise is discussed. With respect to the above issues, a technique to avoid gradient overlapping is proposed, while keeping the same loss formulation. Moreover, a method to avoid the uncontrolled noise, by sampling the actions from distributions with a desired minimum entropy, is investigated. Pilot experiments have been carried out to show how the proposed method speeds up the training. The proposed approach can be applied to any Advantage-based Reinforcement Learning algorithm.
    Path Planning using Neural A* Search. (arXiv:2009.07476v3 [cs.LG] UPDATED)
    (2 min) We present Neural A*, a novel data-driven search method for path planning problems. Despite the recent increasing attention to data-driven path planning, machine learning approaches to search-based planning are still challenging due to the discrete nature of search algorithms. In this work, we reformulate a canonical A* search algorithm to be differentiable and couple it with a convolutional encoder to form an end-to-end trainable neural network planner. Neural A* solves a path planning problem by encoding a problem instance to a guidance map and then performing the differentiable A* search with the guidance map. By learning to match the search results with ground-truth paths provided by experts, Neural A* can produce a path consistent with the ground truth accurately and efficiently. Our extensive experiments confirmed that Neural A* outperformed state-of-the-art data-driven planners in terms of the search optimality and efficiency trade-off. Furthermore, Neural A* successfully predicted realistic human trajectories by directly performing search-based planning on natural image inputs. Project page: https://omron-sinicx.github.io/neural-astar/
    Online Deterministic Annealing for Classification and Clustering. (arXiv:2102.05836v2 [cs.LG] UPDATED)
    (2 min) Inherent in virtually every iterative machine learning algorithm is the problem of hyper-parameter tuning, which includes three major design parameters: (a) the complexity of the model, e.g., the number of neurons in a neural network, (b) the initial conditions, which heavily affect the behavior of the algorithm, and (c) the dissimilarity measure used to quantify its performance. We introduce an online prototype-based learning algorithm that can be viewed as a progressively growing competitive-learning neural network architecture for classification and clustering. The learning rule of the proposed approach is formulated as an online gradient-free stochastic approximation algorithm that solves a sequence of appropriately defined optimization problems, simulating an annealing process. The annealing nature of the algorithm contributes to avoiding poor local minima, offers robustness with respect to the initial conditions, and provides a means to progressively increase the complexity of the learning model, through an intuitive bifurcation phenomenon. The proposed approach is interpretable, requires minimal hyper-parameter tuning, and allows online control over the performance-complexity trade-off. Finally, we show that Bregman divergences appear naturally as a family of dissimilarity measures that play a central role in both the performance and the computational complexity of the learning algorithm. Experimental results illustrate the properties and evaluate the performance of the proposed learning algorithm.
    How and Why to Use Experimental Data to Evaluate Methods for Observational Causal Inference. (arXiv:2010.03051v2 [stat.ME] UPDATED)
    (2 min) Methods that infer causal dependence from observational data are central to many areas of science, including medicine, economics, and the social sciences. A variety of theoretical properties of these methods have been proven, but empirical evaluation remains a challenge, largely due to the lack of observational data sets for which treatment effect is known. We describe and analyze observational sampling from randomized controlled trials (OSRCT), a method for evaluating causal inference methods using data from randomized controlled trials (RCTs). This method can be used to create constructed observational data sets with corresponding unbiased estimates of treatment effect, substantially increasing the number of data sets available for empirical evaluation of causal inference methods. We show that, in expectation, OSRCT creates data sets that are equivalent to those produced by randomly sampling from empirical data sets in which all potential outcomes are available. We then perform a large-scale evaluation of seven causal inference methods over 37 data sets, drawn from RCTs, as well as simulators, real-world computational systems, and observational data sets augmented with a synthetic response variable. We find notable performance differences when comparing across data from different sources, demonstrating the importance of using data from a variety of sources when evaluating any causal inference method.
    A data-driven approach to the forecasting of ground-level ozone concentration. (arXiv:2012.00685v4 [physics.ao-ph] UPDATED)
    (2 min) The ability to forecast the concentration of air pollutants in an urban region is crucial for decision-makers wishing to reduce the impact of pollution on public health through active measures (e.g. temporary traffic closures). In this study, we present a machine learning approach applied to the forecast of the day-ahead maximum value of the ozone concentration for several geographical locations in southern Switzerland. Due to the low density of measurement stations and to the complex orography of the use case terrain, we adopted feature selection methods instead of explicitly restricting relevant features to a neighbourhood of the prediction sites, as common in spatio-temporal forecasting methods. We then used Shapley values to assess the explainability of the learned models in terms of feature importance and feature interactions in relation to ozone predictions; our analysis suggests that the trained models effectively learned explanatory cross-dependencies among atmospheric variables. Finally, we show how weighting observations helps in increasing the accuracy of the forecasts for specific ranges of ozone's daily peak values.
    Learning 1-Dimensional Submanifolds for Subsequent Inference on Random Dot Product Graphs. (arXiv:2004.07348v4 [stat.ML] UPDATED)
    (2 min) A random dot product graph (RDPG) is a generative model for networks in which vertices correspond to positions in a latent Euclidean space and edge probabilities are determined by the dot products of the latent positions. We consider RDPGs for which the latent positions are randomly sampled from an unknown $1$-dimensional submanifold of the latent space. In principle, restricted inference, i.e., procedures that exploit the structure of the submanifold, should be more effective than unrestricted inference; however, it is not clear how to conduct restricted inference when the submanifold is unknown. We submit that techniques for manifold learning can be used to learn the unknown submanifold well enough to realize benefit from restricted inference. To illustrate, we test $1$- and $2$-sample hypotheses about the Fr\'{e}chet means of small communities of vertices, using the complete set of vertices to infer latent structure. We propose test statistics that deploy the Isomap procedure for manifold learning, using shortest path distances on neighborhood graphs constructed from estimated latent positions to estimate arc lengths on the unknown $1$-dimensional submanifold. Unlike conventional applications of Isomap, the estimated latent positions do not lie on the submanifold of interest. We extend existing convergence results for Isomap to this setting and use them to demonstrate that, as the number of auxiliary vertices increases, the power of our test converges to the power of the corresponding test when the submanifold is known.
    Your GAN is Secretly an Energy-based Model and You Should use Discriminator Driven Latent Sampling. (arXiv:2003.06060v3 [cs.LG] UPDATED)
    (2 min) We show that the sum of the implicit generator log-density $\log p_g$ of a GAN with the logit score of the discriminator defines an energy function which yields the true data density when the generator is imperfect but the discriminator is optimal, thus making it possible to improve on the typical generator (with implicit density $p_g$). To make that practical, we show that sampling from this modified density can be achieved by sampling in latent space according to an energy-based model induced by the sum of the latent prior log-density and the discriminator output score. This can be achieved by running a Langevin MCMC in latent space and then applying the generator function, which we call Discriminator Driven Latent Sampling~(DDLS). We show that DDLS is highly efficient compared to previous methods which work in the high-dimensional pixel space and can be applied to improve on previously trained GANs of many types. We evaluate DDLS on both synthetic and real-world datasets qualitatively and quantitatively. On CIFAR-10, DDLS substantially improves the Inception Score of an off-the-shelf pre-trained SN-GAN~\citep{sngan} from $8.22$ to $9.09$ which is even comparable to the class-conditional BigGAN~\citep{biggan} model. This achieves a new state-of-the-art in unconditional image synthesis setting without introducing extra parameters or additional training.
    Efficient Matrix-Free Approximations of Second-Order Information, with Applications to Pruning and Optimization. (arXiv:2107.03356v1 [cs.LG])
    (2 min) Efficiently approximating local curvature information of the loss function is a key tool for optimization and compression of deep neural networks. Yet, most existing methods to approximate second-order information have high computational or storage costs, which can limit their practicality. In this work, we investigate matrix-free, linear-time approaches for estimating Inverse-Hessian Vector Products (IHVPs) for the case when the Hessian can be approximated as a sum of rank-one matrices, as in the classic approximation of the Hessian by the empirical Fisher matrix. We propose two new algorithms as part of a framework called M-FAC: the first algorithm is tailored towards network compression and can compute the IHVP for dimension $d$, if the Hessian is given as a sum of $m$ rank-one matrices, using $O(dm^2)$ precomputation, $O(dm)$ cost for computing the IHVP, and query cost $O(m)$ for any single element of the inverse Hessian. The second algorithm targets an optimization setting, where we wish to compute the product between the inverse Hessian, estimated over a sliding window of optimization steps, and a given gradient direction, as required for preconditioned SGD. We give an algorithm with cost $O(dm + m^2)$ for computing the IHVP and $O(dm + m^3)$ for adding or removing any gradient from the sliding window. These two algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods. Implementations are available at [10] and [18].
    Auxiliary Diagnosing Coronary Stenosis Using Machine Learning. (arXiv:2007.10316v3 [q-bio.TO] UPDATED)
    (2 min) How to accurately classify and diagnose whether an individual has Coronary Stenosis (CS) without invasive physical examination? This problem has not been solved satisfactorily. To this end, the four machine learning (ML) algorithms, i.e., Boosted Tree (BT), Decision Tree (DT), Logistic Regression (LR) and Random Forest (RF) are employed in this paper. First, eleven features including basic information of an individual, symptoms and results of routine physical examination are selected, as well as one label is specified, indicating whether an individual suffers from different severity of coronary artery stenosis or not. On the basis of it, a sample set is constructed. Second, each of these four ML algorithms learns from the sample set to obtain the corresponding optimal classified results, respectively. The experimental results show that: RF performs better than other three algorithms, and the former algorithm classifies whether an individual has CS with an accuracy of 95.7% (=90/94).
    Mitigating Performance Saturation in Neural Marked Point Processes: Architectures and Loss Functions. (arXiv:2107.03354v1 [cs.LG])
    (2 min) Attributed event sequences are commonly encountered in practice. A recent research line focuses on incorporating neural networks with the statistical model -- marked point processes, which is the conventional tool for dealing with attributed event sequences. Neural marked point processes possess good interpretability of probabilistic models as well as the representational power of neural networks. However, we find that performance of neural marked point processes is not always increasing as the network architecture becomes more complicated and larger, which is what we call the performance saturation phenomenon. This is due to the fact that the generalization error of neural marked point processes is determined by both the network representational ability and the model specification at the same time. Therefore we can draw two major conclusions: first, simple network structures can perform no worse than complicated ones for some cases; second, using a proper probabilistic assumption is as equally, if not more, important as improving the complexity of the network. Based on this observation, we propose a simple graph-based network structure called GCHP, which utilizes only graph convolutional layers, thus it can be easily accelerated by the parallel mechanism. We directly consider the distribution of interarrival times instead of imposing a specific assumption on the conditional intensity function, and propose to use a likelihood ratio loss with a moment matching mechanism for optimization and model selection. Experimental results show that GCHP can significantly reduce training time and the likelihood ratio loss with interarrival time probability assumptions can greatly improve the model performance.
    Samplets: A new paradigm for data compression. (arXiv:2107.03337v1 [math.NA])
    (2 min) In this article, we introduce the novel concept of samplets by transferring the construction of Tausch-White wavelets to the realm of data. This way we obtain a multilevel representation of discrete data which directly enables data compression, detection of singularities and adaptivity. Applying samplets to represent kernel matrices, as they arise in kernel based learning or Gaussian process regression, we end up with quasi-sparse matrices. By thresholding small entries, these matrices are compressible to O(N log N) relevant entries, where N is the number of data points. This feature allows for the use of fill-in reducing reorderings to obtain a sparse factorization of the compressed matrices. Besides the comprehensive introduction to samplets and their properties, we present extensive numerical studies to benchmark the approach. Our results demonstrate that samplets mark a considerable step in the direction of making large data sets accessible for analysis.
    A Survey of Uncertainty in Deep Neural Networks. (arXiv:2107.03342v1 [cs.LG])
    (2 min) Due to their increasing spread, confidence in neural network predictions became more and more important. However, basic neural networks do not deliver certainty estimates or suffer from over or under confidence. Many researchers have been working on understanding and quantifying uncertainty in a neural network's prediction. As a result, different types and sources of uncertainty have been identified and a variety of approaches to measure and quantify uncertainty in neural networks have been proposed. This work gives a comprehensive overview of uncertainty estimation in neural networks, reviews recent advances in the field, highlights current challenges, and identifies potential research opportunities. It is intended to give anyone interested in uncertainty estimation in neural networks a broad overview and introduction, without presupposing prior knowledge in this field. A comprehensive introduction to the most crucial sources of uncertainty is given and their separation into reducible model uncertainty and not reducible data uncertainty is presented. The modeling of these uncertainties based on deterministic neural networks, Bayesian neural networks, ensemble of neural networks, and test-time data augmentation approaches is introduced and different branches of these fields as well as the latest developments are discussed. For a practical application, we discuss different measures of uncertainty, approaches for the calibration of neural networks and give an overview of existing baselines and implementations. Different examples from the wide spectrum of challenges in different fields give an idea of the needs and challenges regarding uncertainties in practical applications. Additionally, the practical limitations of current methods for mission- and safety-critical real world applications are discussed and an outlook on the next steps towards a broader usage of such methods is given.
    On the Convergence of Overlapping Schwarz Decomposition for Nonlinear Optimal Control. (arXiv:2005.06674v4 [math.OC] UPDATED)
    (2 min) We study the convergence properties of an overlapping Schwarz decomposition~algorithm for solving nonlinear optimal control problems (OCPs). The approach decomposes the time domain into a set of overlapping subdomains, and solves subproblems defined over such subdomains in parallel. Convergence is attained by updating primal-dual information at the boundaries of the overlapping regions. We show that the algorithm exhibits local linear convergence and that the convergence rate improves exponentially with the overlap size. Our convergence results rely on a sensitivity result for OCPs that we call "exponential decay of sensitivity" (EDS). Intuitively, EDS states that the impact of parametric perturbations at the boundaries of the domain (initial and final time) decays exponentially as one moves into the domain. We show that EDS holds for nonlinear OCPs under a uniform second-order sufficient condition, a controllability condition, and a uniform boundedness condition. We conduct numerical experiments using a quadrotor motion planning problem and a PDE control problem; and show that the approach is significantly more efficient than ADMM and as efficient as the centralized solver Ipopt.
    Graph Convolution for Semi-Supervised Classification: Improved Linear Separability and Out-of-Distribution Generalization. (arXiv:2102.06966v3 [cs.LG] UPDATED)
    (2 min) Recently there has been increased interest in semi-supervised classification in the presence of graphical information. A new class of learning models has emerged that relies, at its most basic level, on classifying the data after first applying a graph convolution. To understand the merits of this approach, we study the classification of a mixture of Gaussians, where the data corresponds to the node attributes of a stochastic block model. We show that graph convolution extends the regime in which the data is linearly separable by a factor of roughly $1/\sqrt{D}$, where $D$ is the expected degree of a node, as compared to the mixture model data on its own. Furthermore, we find that the linear classifier obtained by minimizing the cross-entropy loss after the graph convolution generalizes to out-of-distribution data where the unseen data can have different intra- and inter-class edge probabilities from the training data.
    Enhancing an Intelligent Digital Twin with a Self-organized Reconfiguration Management based on Adaptive Process Models. (arXiv:2107.03324v1 [cs.AI])
    (2 min) Shorter product life cycles and increasing individualization of production leads to an increased reconfiguration demand in the domain of industrial automation systems, which will be dominated by cyber-physical production systems in the future. In constantly changing systems, however, not all configuration alternatives of the almost infinite state space are fully understood. Thus, certain configurations can lead to process instability, a reduction in quality or machine failures. Therefore, this paper presents an approach that enhances an intelligent Digital Twin with a self-organized reconfiguration management based on adaptive process models in order to find optimized configurations more comprehensively.
    Probabilistic semi-nonnegative matrix factorization: a Skellam-based framework. (arXiv:2107.03317v1 [cs.LG])
    (2 min) We present a new probabilistic model to address semi-nonnegative matrix factorization (SNMF), called Skellam-SNMF. It is a hierarchical generative model consisting of prior components, Skellam-distributed hidden variables and observed data. Two inference algorithms are derived: Expectation-Maximization (EM) algorithm for maximum \emph{a posteriori} estimation and Variational Bayes EM (VBEM) for full Bayesian inference, including the estimation of parameters prior distribution. From this Skellam-based model, we also introduce a new divergence $\mathcal{D}$ between a real-valued target data $x$ and two nonnegative parameters $\lambda_{0}$ and $\lambda_{1}$ such that $\mathcal{D}\left(x\mid\lambda_{0},\lambda_{1}\right)=0\Leftrightarrow x=\lambda_{0}-\lambda_{1}$, which is a generalization of the Kullback-Leibler (KL) divergence. Finally, we conduct experimental studies on those new algorithms in order to understand their behavior and prove that they can outperform the classic SNMF approach on real data in a task of automatic clustering.
    RoFL: Attestable Robustness for Secure Federated Learning. (arXiv:2107.03311v1 [cs.CR])
    (2 min) Federated Learning is an emerging decentralized machine learning paradigm that allows a large number of clients to train a joint model without the need to share their private data. Participants instead only share ephemeral updates necessary to train the model. To ensure the confidentiality of the client updates, Federated Learning systems employ secure aggregation; clients encrypt their gradient updates, and only the aggregated model is revealed to the server. Achieving this level of data protection, however, presents new challenges to the robustness of Federated Learning, i.e., the ability to tolerate failures and attacks. Unfortunately, in this setting, a malicious client can now easily exert influence on the model behavior without being detected. As Federated Learning is being deployed in practice in a range of sensitive applications, its robustness is growing in importance. In this paper, we take a step towards understanding and improving the robustness of secure Federated Learning. We start this paper with a systematic study that evaluates and analyzes existing attack vectors and discusses potential defenses and assesses their effectiveness. We then present RoFL, a secure Federated Learning system that improves robustness against malicious clients through input checks on the encrypted model updates. RoFL extends Federated Learning's secure aggregation protocol to allow expressing a variety of properties and constraints on model updates using zero-knowledge proofs. To enable RoFL to scale to typical Federated Learning settings, we introduce several ML and cryptographic optimizations specific to Federated Learning. We implement and evaluate a prototype of RoFL and show that realistic ML models can be trained in a reasonable time while improving robustness.
    Simulated Data Generation Through Algorithmic Force Coefficient Estimation for AI-Based Robotic Projectile Launch Modeling. (arXiv:2105.12833v3 [cs.RO] UPDATED)
    (2 min) Modeling of non-rigid object launching and manipulation is complex considering the wide range of dynamics affecting trajectory, many of which may be unknown. Using physics models can be inaccurate because they cannot account for unknown factors and the effects of the deformation of the object as it is launched; moreover, deriving force coefficients for these models is not possible without extensive experimental testing. Recently, advancements in data-powered artificial intelligence methods have allowed learnable models and systems to emerge. It is desirable to train a model for launch prediction on a robot, as deep neural networks can account for immeasurable dynamics. However, the inability to collect large amounts of experimental data decreases performance of deep neural networks. Through estimating force coefficients, the accepted physics models can be leveraged to produce adequate supplemental data to artificially increase the size of the training set, yielding improved neural networks. In this paper, we introduce a new framework for algorithmic estimation of force coefficients for non-rigid object launching, which can be generalized to other domains, in order to generate large datasets. We implement a novel training algorithm and objective for our deep neural network to accurately model launch trajectory of non-rigid objects and predict whether they will hit a series of targets. Our experimental results demonstrate the effectiveness of using simulated data from force coefficient estimation and shows the importance of simulated data for training an effective neural network.
    AGD-Autoencoder: Attention Gated Deep Convolutional Autoencoder for Brain Tumor Segmentation. (arXiv:2107.03323v1 [eess.IV])
    (2 min) Brain tumor segmentation is a challenging problem in medical image analysis. The endpoint is to generate the salient masks that accurately identify brain tumor regions in an fMRI screening. In this paper, we propose a novel attention gate (AG model) for brain tumor segmentation that utilizes both the edge detecting unit and the attention gated network to highlight and segment the salient regions from fMRI images. This feature enables us to eliminate the necessity of having to explicitly point towards the damaged area(external tissue localization) and classify(classification) as per classical computer vision techniques. AGs can easily be integrated within the deep convolutional neural networks(CNNs). Minimal computional overhead is required while the AGs increase the sensitivity scores significantly. We show that the edge detector along with an attention gated mechanism provide a sufficient enough method for brain segmentation reaching an IOU of 0.78
    PHOTONAI -- A Python API for Rapid Machine Learning Model Development. (arXiv:2002.05426v4 [cs.LG] UPDATED)
    (2 min) PHOTONAI is a high-level Python API designed to simplify and accelerate machine learning model development. It functions as a unifying framework allowing the user to easily access and combine algorithms from different toolboxes into custom algorithm sequences. It is especially designed to support the iterative model development process and automates the repetitive training, hyperparameter optimization and evaluation tasks. Importantly, the workflow ensures unbiased performance estimates while still allowing the user to fully customize the machine learning analysis. PHOTONAI extends existing solutions with a novel pipeline implementation supporting more complex data streams, feature combinations, and algorithm selection. Metrics and results can be conveniently visualized using the PHOTONAI Explorer and predictive models are shareable in a standardized format for further external validation or application. A growing add-on ecosystem allows researchers to offer data modality specific algorithms to the community and enhance machine learning in the areas of the life sciences. Its practical utility is demonstrated on an exemplary medical machine learning problem, achieving a state-of-the-art solution in few lines of code. Source code is publicly available on Github, while examples and documentation can be found at www.photon-ai.com.
    A Modified Drake Equation for Assessing Adversarial Risk to Machine Learning Models. (arXiv:2103.02718v2 [cs.LG] UPDATED)
    (2 min) Machine learning models present a risk of adversarial attack when deployed in production. Quantifying the contributing factors and uncertainties using empirical measures could assist the industry with assessing the risk of downloading and deploying common model types. This work proposes modifying the traditional Drake Equation's formalism to estimate the number of potentially successful adversarial attacks on a deployed model. The Drake Equation is famously used for parameterizing uncertainties and it has been used in many research fields outside of its original intentions to estimate the number of radio-capable extra-terrestrial civilizations. While previous work has outlined methods for discovering vulnerabilities in public model architectures, the proposed equation seeks to provide a semi-quantitative benchmark for evaluating and estimating the potential risk factors for adversarial attacks.
    SoundStream: An End-to-End Neural Audio Codec. (arXiv:2107.03312v1 [cs.SD])
    (2 min) We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed by a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3kbps to 18kbps, with a negligible quality loss when compared with models trained at fixed bitrates. In addition, the model is amenable to a low latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at 24kHz sampling rate, SoundStream at 3kbps outperforms Opus at 12kbps and approaches EVS at 9.6kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.
    Anomaly detection and automatic labeling for solar cell quality inspection based on Generative Adversarial Network. (arXiv:2103.03518v2 [cs.CV] UPDATED)
    (3 min) Quality inspection applications in industry are required to move towards a zero-defect manufacturing scenario, withnon-destructive inspection and traceability of 100 % of produced parts. Developing robust fault detection and classification modelsfrom the start-up of the lines is challenging due to the difficulty in getting enough representative samples of the faulty patternsand the need to manually label them. This work presents a methodology to develop a robust inspection system, targeting thesepeculiarities, in the context of solar cell manufacturing. The methodology is divided into two phases: In the first phase, an anomalydetection model based on a Generative Adversarial Network (GAN) is employed. This model enables the detection and localizationof anomalous patterns within the solar cells from the beginning, using only non-defective samples for training and without anymanual labeling involved. In a second stage, as defective samples arise, the detected anomalies will be used as automaticallygenerated annotations for the supervised training of a Fully Convolutional Network that is capable of detecting multiple types offaults. The experimental results using 1873 EL images of monocrystalline cells show that (a) the anomaly detection scheme can beused to start detecting features with very little available data, (b) the anomaly detection may serve as automatic labeling in order totrain a supervised model, and (c) segmentation and classification results of supervised models trained with automatic labels arecomparable to the ones obtained from the models trained with manual labels.
    "Are you sure?": Preliminary Insights from Scaling Product Comparisons to Multiple Shops. (arXiv:2107.03256v1 [cs.IR])
    (2 min) Large eCommerce players introduced comparison tables as a new type of recommendations. However, building comparisons at scale without pre-existing training/taxonomy data remains an open challenge, especially within the operational constraints of shops in the long tail. We present preliminary results from building a comparison pipeline designed to scale in a multi-shop scenario: we describe our design choices and run extensive benchmarks on multiple shops to stress-test it. Finally, we run a small user study on property selection and conclude by discussing potential improvements and highlighting the questions that remain to be addressed.
    Universal Approximation for Log-concave Distributions using Well-conditioned Normalizing Flows. (arXiv:2107.02951v1 [cs.LG])
    (2 min) Normalizing flows are a widely used class of latent-variable generative models with a tractable likelihood. Affine-coupling (Dinh et al, 2014-16) models are a particularly common type of normalizing flows, for which the Jacobian of the latent-to-observable-variable transformation is triangular, allowing the likelihood to be computed in linear time. Despite the widespread usage of affine couplings, the special structure of the architecture makes understanding their representational power challenging. The question of universal approximation was only recently resolved by three parallel papers (Huang et al.,2020;Zhang et al.,2020;Koehler et al.,2020) -- who showed reasonably regular distributions can be approximated arbitrarily well using affine couplings -- albeit with networks with a nearly-singular Jacobian. As ill-conditioned Jacobians are an obstacle for likelihood-based training, the fundamental question remains: which distributions can be approximated using well-conditioned affine coupling flows? In this paper, we show that any log-concave distribution can be approximated using well-conditioned affine-coupling flows. In terms of proof techniques, we uncover and leverage deep connections between affine coupling architectures, underdamped Langevin dynamics (a stochastic differential equation often used to sample from Gibbs measures) and H\'enon maps (a structured dynamical system that appears in the study of symplectic diffeomorphisms). Our results also inform the practice of training affine couplings: we approximate a padded version of the input distribution with iid Gaussians -- a strategy which Koehler et al.(2020) empirically observed to result in better-conditioned flows, but had hitherto no theoretical grounding. Our proof can thus be seen as providing theoretical evidence for the benefits of Gaussian padding when training normalizing flows.
    Graphing else matters: exploiting aspect opinions and ratings in explainable graph-based recommendations. (arXiv:2107.03226v1 [cs.IR])
    (2 min) The success of neural network embeddings has entailed a renewed interest in using knowledge graphs for a wide variety of machine learning and information retrieval tasks. In particular, current recommendation methods based on graph embeddings have shown state-of-the-art performance. These methods commonly encode latent rating patterns and content features. Different from previous work, in this paper, we propose to exploit embeddings extracted from graphs that combine information from ratings and aspect-based opinions expressed in textual reviews. We then adapt and evaluate state-of-the-art graph embedding techniques over graphs generated from Amazon and Yelp reviews on six domains, outperforming baseline recommenders. Our approach has the advantage of providing explanations which leverage aspect-based opinions given by users about recommended items. Furthermore, we also provide examples of the applicability of recommendations utilizing aspect opinions as explanations in a visualization dashboard, which allows obtaining information about the most and least liked aspects of similar users obtained from the embeddings of an input graph.
    Evaluating the progress of Deep Reinforcement Learning in the real world: aligning domain-agnostic and domain-specific research. (arXiv:2107.03015v1 [cs.LG])
    (2 min) Deep Reinforcement Learning (DRL) is considered a potential framework to improve many real-world autonomous systems; it has attracted the attention of multiple and diverse fields. Nevertheless, the successful deployment in the real world is a test most of DRL models still need to pass. In this work we focus on this issue by reviewing and evaluating the research efforts from both domain-agnostic and domain-specific communities. On one hand, we offer a comprehensive summary of DRL challenges and summarize the different proposals to mitigate them; this helps identifying five gaps of domain-agnostic research. On the other hand, from the domain-specific perspective, we discuss different success stories and argue why other models might fail to be deployed. Finally, we take up on ways to move forward accounting for both perspectives.
    Exact Learning Augmented Naive Bayes Classifier. (arXiv:2107.03018v1 [cs.LG])
    (2 min) Earlier studies have shown that classification accuracies of Bayesian networks (BNs) obtained by maximizing the conditional log likelihood (CLL) of a class variable, given the feature variables, were higher than those obtained by maximizing the marginal likelihood (ML). However, differences between the performances of the two scores in the earlier studies may be attributed to the fact that they used approximate learning algorithms, not exact ones. This paper compares the classification accuracies of BNs with approximate learning using CLL to those with exact learning using ML. The results demonstrate that the classification accuracies of BNs obtained by maximizing the ML are higher than those obtained by maximizing the CLL for large data. However, the results also demonstrate that the classification accuracies of exact learning BNs using the ML are much worse than those of other methods when the sample size is small and the class variable has numerous parents. To resolve the problem, we propose an exact learning augmented naive Bayes classifier (ANB), which ensures a class variable with no parents. The proposed method is guaranteed to asymptotically estimate the identical class posterior to that of the exactly learned BN. Comparison experiments demonstrated the superior performance of the proposed method.
    Scaling up Continuous-Time Markov Chains Helps Resolve Underspecification. (arXiv:2107.02911v1 [cs.LG])
    (2 min) Modeling the time evolution of discrete sets of items (e.g., genetic mutations) is a fundamental problem in many biomedical applications. We approach this problem through the lens of continuous-time Markov chains, and show that the resulting learning task is generally underspecified in the usual setting of cross-sectional data. We explore a perhaps surprising remedy: including a number of additional independent items can help determine time order, and hence resolve underspecification. This is in sharp contrast to the common practice of limiting the analysis to a small subset of relevant items, which is followed largely due to poor scaling of existing methods. To put our theoretical insight into practice, we develop an approximate likelihood maximization method for learning continuous-time Markov chains, which can scale to hundreds of items and is orders of magnitude faster than previous methods. We demonstrate the effectiveness of our approach on synthetic and real cancer data.
    ADAPT : Awesome Domain Adaptation Python Toolbox. (arXiv:2107.03049v1 [cs.LG])
    (2 min) ADAPT is an open-source python library providing the implementation of several domain adaptation methods. The library is suited for scikit-learn estimator object (object which implement fit and predict methods) and tensorflow models. Most of the implemented methods are developed in an estimator agnostic fashion, offering various possibilities adapted to multiple usage. The library offers three modules corresponding to the three principal strategies of domain adaptation: (i) feature-based containing methods performing feature transformation; (ii) instance-based with the implementation of reweighting techniques and (iii) parameter-based proposing methods to adapt pre-trained models to novel observations. A full documentation is proposed online https://adapt-python.github.io/adapt/ with gallery of examples. Besides, the library presents an high test coverage.
    Coastal water quality prediction based on machine learning with feature interpretation and spatio-temporal analysis. (arXiv:2107.03230v1 [stat.AP])
    (3 min) Coastal water quality management is a public health concern, as poor coastal water quality can harbor pathogens that are dangerous to human health. Tourism-oriented countries need to actively monitor the condition of coastal water at tourist popular sites during the summer season. In this study, routine monitoring data of $Escherichia\ Coli$ and enterococci across 15 public beaches in the city of Rijeka, Croatia, were used to build machine learning models for predicting their levels based on environmental parameters as well as to investigate their relationships with environmental stressors. Gradient Boosting (Catboost, Xgboost), Random Forests, Support Vector Regression and Artificial Neural Networks were trained with measurements from all sampling sites and used to predict $E.\ Coli$ and enterococci values based on environmental features. The evaluation of stability and generalizability with 10-fold cross validation analysis of the machine learning models, showed that the Catboost algorithm performed best with R$^2$ values of 0.71 and 0.68 for predicting $E.\ Coli$ and enterococci, respectively, compared to other evaluated ML algorithms including Xgboost, Random Forests, Support Vector Regression and Artificial Neural Networks. We also use the SHapley Additive exPlanations technique to identify and interpret which features have the most predictive power. The results show that site salinity measured is the most important feature for forecasting both $E.\ Coli$ and enterococci levels. Finally, the spatial and temporal accuracy of both ML models were examined at sites with the lowest coastal water quality. The spatial $E. Coli$ and enterococci models achieved strong R$^2$ values of 0.85 and 0.83, while the temporal models achieved R$^2$ values of 0.74 and 0.67. The temporal model also achieved moderate R$^2$ values of 0.44 and 0.46 at a site with high coastal water quality.
    DER Forecast using Privacy Preserving Federated Learning. (arXiv:2107.03248v1 [eess.SY])
    (2 min) With increasing penetration of Distributed Energy Resources (DERs) in grid edge including renewable generation, flexible loads, and storage, accurate prediction of distributed generation and consumption at the consumer level becomes important. However, DER prediction based on the transmission of customer level data, either repeatedly or in large amounts, is not feasible due to privacy concerns. In this paper, a distributed machine learning approach, Federated Learning, is proposed to carry out DER forecasting using a network of IoT nodes, each of which transmits a model of the consumption and generation patterns without revealing consumer data. We consider a simulation study which includes 1000 DERs, and show that our method leads to an accurate prediction of preserve consumer privacy, while still leading to an accurate forecast. We also evaluate grid-specific performance metrics such as load swings and load curtailment and show that our FL algorithm leads to satisfactory performance. Simulations are also performed on the Pecan street dataset to demonstrate the validity of the proposed approach on real data.
    RAILS: A Robust Adversarial Immune-inspired Learning System. (arXiv:2107.02840v1 [cs.NE])
    (2 min) Adversarial attacks against deep neural networks (DNNs) are continuously evolving, requiring increasingly powerful defense strategies. We develop a novel adversarial defense framework inspired by the adaptive immune system: the Robust Adversarial Immune-inspired Learning System (RAILS). Initializing a population of exemplars that is balanced across classes, RAILS starts from a uniform label distribution that encourages diversity and debiases a potentially corrupted initial condition. RAILS implements an evolutionary optimization process to adjust the label distribution and achieve specificity towards ground truth. RAILS displays a tradeoff between robustness (diversity) and accuracy (specificity), providing a new immune-inspired perspective on adversarial learning. We empirically validate the benefits of RAILS through several adversarial image classification experiments on MNIST, SVHN, and CIFAR-10 datasets. For the PGD attack, RAILS is found to improve the robustness over existing methods by >= 5.62%, 12.5% and 10.32%, respectively, without appreciable loss of standard accuracy.
    Controlled Caption Generation for Images Through Adversarial Attacks. (arXiv:2107.03050v1 [cs.CV])
    (2 min) Deep learning is found to be vulnerable to adversarial examples. However, its adversarial susceptibility in image caption generation is under-explored. We study adversarial examples for vision and language models, which typically adopt an encoder-decoder framework consisting of two major components: a Convolutional Neural Network (i.e., CNN) for image feature extraction and a Recurrent Neural Network (RNN) for caption generation. In particular, we investigate attacks on the visual encoder's hidden layer that is fed to the subsequent recurrent network. The existing methods either attack the classification layer of the visual encoder or they back-propagate the gradients from the language model. In contrast, we propose a GAN-based algorithm for crafting adversarial examples for neural image captioning that mimics the internal representation of the CNN such that the resulting deep features of the input image enable a controlled incorrect caption generation through the recurrent network. Our contribution provides new insights for understanding adversarial attacks on vision systems with language component. The proposed method employs two strategies for a comprehensive evaluation. The first examines if a neural image captioning system can be misled to output targeted image captions. The second analyzes the possibility of keywords into the predicted captions. Experiments show that our algorithm can craft effective adversarial images based on the CNN hidden layers to fool captioning framework. Moreover, we discover the proposed attack to be highly transferable. Our work leads to new robustness implications for neural image captioning.
    Immunization of Pruning Attack in DNN Watermarking Using Constant Weight Code. (arXiv:2107.02961v1 [cs.CR])
    (2 min) To ensure protection of the intellectual property rights of DNN models, watermarking techniques have been investigated to insert side-information into the models without seriously degrading the performance of original task. One of the threats for the DNN watermarking is the pruning attack such that less important neurons in the model are pruned to make it faster and more compact as well as to remove the watermark. In this study, we investigate a channel coding approach to resist the pruning attack. As the channel model is completely different from conventional models like digital images, it has been an open problem what kind of encoding method is suitable for DNN watermarking. A novel encoding approach by using constant weight codes to immunize the effects of pruning attacks is presented. To the best of our knowledge, this is the first study that introduces an encoding technique for DNN watermarking to make it robust against pruning attacks.
    Immuno-mimetic Deep Neural Networks (Immuno-Net). (arXiv:2107.02842v1 [cs.NE])
    (2 min) Biomimetics has played a key role in the evolution of artificial neural networks. Thus far, in silico metaphors have been dominated by concepts from neuroscience and cognitive psychology. In this paper we introduce a different type of biomimetic model, one that borrows concepts from the immune system, for designing robust deep neural networks. This immuno-mimetic model leads to a new computational biology framework for robustification of deep neural networks against adversarial attacks. Within this Immuno-Net framework we define a robust adaptive immune-inspired learning system (Immuno-Net RAILS) that emulates, in silico, the adaptive biological mechanisms of B-cells that are used to defend a mammalian host against pathogenic attacks. When applied to image classification tasks on benchmark datasets, we demonstrate that Immuno-net RAILS results in improvement of as much as 12.5% in adversarial accuracy of a baseline method, the DkNN-robustified CNN, without appreciable loss of accuracy on clean data.
    Hub and Spoke Logistics Network Design for Urban Region with Clustering-Based Approach. (arXiv:2107.03080v1 [cs.LG])
    (2 min) This study aims to propose effective modeling and approach for designing a logistics network in the urban area in order to offer an efficient flow distribution network as a competitive strategy in the logistics industry where demand is sensitive to both price and time. A multi-stage approach is introduced to select the number of hubs and allocate spokes to the hubs for flow distribution and hubs' location detection. Specifically, a fuzzy clustering model with the objective function is to minimize the approximate transportation cost is employed, in the next phase is to focus on balancing the demand capacity among the hubs with the help of domain experts, afterward, the facility location vehicle routing problems within the network is introduced. To demonstrate the approach's advantages, an experiment was performed on the designed network and its actual transportation cost for the real operational data in which specific to the Ho Chi Minh city infrastructure conditions. Additionally, we show the flexibility of the designed network in the flow distribution and its computational experiments to develop the managerial insights which contribute to the network design decision-making process.
    Structured Denoising Diffusion Models in Discrete State-Spaces. (arXiv:2107.03006v1 [cs.LG])
    (2 min) Denoising diffusion probabilistic models (DDPMs) (Ho et al. 2020) have shown impressive results on image and waveform generation in continuous state spaces. Here, we introduce Discrete Denoising Diffusion Probabilistic Models (D3PMs), diffusion-like generative models for discrete data that generalize the multinomial diffusion model of Hoogeboom et al. 2021, by going beyond corruption processes with uniform transition probabilities. This includes corruption with transition matrices that mimic Gaussian kernels in continuous space, matrices based on nearest neighbors in embedding space, and matrices that introduce absorbing states. The third allows us to draw a connection between diffusion models and autoregressive and mask-based generative models. We show that the choice of transition matrix is an important design decision that leads to improved results in image and text domains. We also introduce a new loss function that combines the variational lower bound with an auxiliary cross entropy loss. For text, this model class achieves strong results on character-level text generation while scaling to large vocabularies on LM1B. On the image dataset CIFAR-10, our models approach the sample quality and exceed the log-likelihood of the continuous-space DDPM model.
    Solution of Physics-based Bayesian Inverse Problems with Deep Generative Priors. (arXiv:2107.02926v1 [stat.ML])
    (2 min) Inverse problems are notoriously difficult to solve because they can have no solutions, multiple solutions, or have solutions that vary significantly in response to small perturbations in measurements. Bayesian inference, which poses an inverse problem as a stochastic inference problem, addresses these difficulties and provides quantitative estimates of the inferred field and the associated uncertainty. However, it is difficult to employ when inferring vectors of large dimensions, and/or when prior information is available through previously acquired samples. In this paper, we describe how deep generative adversarial networks can be used to represent the prior distribution in Bayesian inference and overcome these challenges. We apply these ideas to inverse problems that are diverse in terms of the governing physical principles, sources of prior knowledge, type of measurement, and the extent of available information about measurement noise. In each case we apply the proposed approach to infer the most likely solution and quantitative estimates of uncertainty.
    Trans4E: Link Prediction on Scholarly Knowledge Graphs. (arXiv:2107.03297v1 [cs.AI])
    (2 min) The incompleteness of Knowledge Graphs (KGs) is a crucial issue affecting the quality of AI-based services. In the scholarly domain, KGs describing research publications typically lack important information, hindering our ability to analyse and predict research dynamics. In recent years, link prediction approaches based on Knowledge Graph Embedding models became the first aid for this issue. In this work, we present Trans4E, a novel embedding model that is particularly fit for KGs which include N to M relations with N$\gg$M. This is typical for KGs that categorize a large number of entities (e.g., research articles, patents, persons) according to a relatively small set of categories. Trans4E was applied on two large-scale knowledge graphs, the Academia/Industry DynAmics (AIDA) and Microsoft Academic Graph (MAG), for completing the information about Fields of Study (e.g., 'neural networks', 'machine learning', 'artificial intelligence'), and affiliation types (e.g., 'education', 'company', 'government'), improving the scope and accuracy of the resulting data. We evaluated our approach against alternative solutions on AIDA, MAG, and four other benchmarks (FB15k, FB15k-237, WN18, and WN18RR). Trans4E outperforms the other models when using low embedding dimensions and obtains competitive results in high dimensions.
    New Methods and Datasets for Group Anomaly Detection From Fundamental Physics. (arXiv:2107.02821v1 [stat.ML])
    (2 min) The identification of anomalous overdensities in data - group or collective anomaly detection - is a rich problem with a large number of real world applications. However, it has received relatively little attention in the broader ML community, as compared to point anomalies or other types of single instance outliers. One reason for this is the lack of powerful benchmark datasets. In this paper, we first explain how, after the Nobel-prize winning discovery of the Higgs boson, unsupervised group anomaly detection has become a new frontier of fundamental physics (where the motivation is to find new particles and forces). Then we propose a realistic synthetic benchmark dataset (LHCO2020) for the development of group anomaly detection algorithms. Finally, we compare several existing statistically-sound techniques for unsupervised group anomaly detection, and demonstrate their performance on the LHCO2020 dataset.
    Generalization Error Analysis of Neural networks with Gradient Based Regularization. (arXiv:2107.02797v1 [cs.LG])
    (2 min) We study gradient-based regularization methods for neural networks. We mainly focus on two regularization methods: the total variation and the Tikhonov regularization. Applying these methods is equivalent to using neural networks to solve some partial differential equations, mostly in high dimensions in practical applications. In this work, we introduce a general framework to analyze the generalization error of regularized networks. The error estimate relies on two assumptions on the approximation error and the quadrature error. Moreover, we conduct some experiments on the image classification tasks to show that gradient-based methods can significantly improve the generalization ability and adversarial robustness of neural networks. A graphical extension of the gradient-based methods are also considered in the experiments.
    Incorporating Label Uncertainty in Understanding Adversarial Robustness. (arXiv:2107.03250v1 [cs.LG])
    (2 min) A fundamental question in adversarial machine learning is whether a robust classifier exists for a given task. A line of research has made progress towards this goal by studying concentration of measure, but without considering data labels. We argue that the standard concentration fails to fully characterize the intrinsic robustness of a classification problem, since it ignores data labels which are essential to any classification task. Building on a novel definition of label uncertainty, we empirically demonstrate that error regions induced by state-of-the-art models tend to have much higher label uncertainty compared with randomly-selected subsets. This observation motivates us to adapt a concentration estimation algorithm to account for label uncertainty, resulting in more accurate intrinsic robustness measures for benchmark image classification problems. We further provide empirical evidence showing that adding an abstain option for classifiers based on label uncertainty can help improve both the clean and robust accuracies of models.
    Combined Global and Local Search for Optimization with Gaussian Process Models. (arXiv:2107.03217v1 [stat.ML])
    (2 min) Gaussian process (GP) model based optimization is widely applied in simulation and machine learning. In general, it first estimates a GP model based on a few observations from the true response and then employs this model to guide the search, aiming to quickly locate the global optimum. Despite its successful applications, it has several limitations that may hinder its broader usage. First, building an accurate GP model can be difficult and computationally expensive, especially when the response function is multi-modal or varies significantly over the design space. Second, even with an appropriate model, the search process can be trapped in suboptimal regions before moving to the global optimum due to the excessive effort spent around the current best solution. In this work, we adopt the Additive Global and Local GP (AGLGP) model in the optimization framework. The model is rooted in the inducing-points-based GP sparse approximations and is combined with independent local models in different regions. With these properties, the AGLGP model is suitable for multi-modal responses with relatively large data sizes. Based on this AGLGP model, we propose a Combined Global and Local search for Optimization (CGLO) algorithm. It first divides the whole design space into disjoint local regions and identifies a promising region with the global model. Next, a local model in the selected region is fit to guide detailed search within this region. The algorithm then switches back to the global step when a good local solution is found. The global and local natures of CGLO enable it to enjoy the benefits of both global and local search to efficiently locate the global optimum.
    Intensity Prediction of Tropical Cyclones using Long Short-Term Memory Network. (arXiv:2107.03187v1 [cs.LG])
    (2 min) Tropical cyclones can be of varied intensity and cause a huge loss of lives and property if the intensity is high enough. Therefore, the prediction of the intensity of tropical cyclones advance in time is of utmost importance. We propose a novel stacked bidirectional long short-term memory network (BiLSTM) based model architecture to predict the intensity of a tropical cyclone in terms of Maximum surface sustained wind speed (MSWS). The proposed model can predict MSWS well advance in time (up to 72 h) with very high accuracy. We have applied the model on tropical cyclones in the North Indian Ocean from 1982 to 2018 and checked its performance on two recent tropical cyclones, namely, Fani and Vayu. The model predicts MSWS (in knots) for the next 3, 12, 24, 36, 48, 60, and 72 hours with a mean absolute error of 1.52, 3.66, 5.88, 7.42, 8.96, 10.15, and 11.92, respectively.
    MD-split+: Practical Local Conformal Inference in High Dimensions. (arXiv:2107.03280v1 [stat.ML])
    (2 min) Quantifying uncertainty in model predictions is a common goal for practitioners seeking more than just point predictions. One tool for uncertainty quantification that requires minimal assumptions is conformal inference, which can help create probabilistically valid prediction regions for black box models. Classical conformal prediction only provides marginal validity, whereas in many situations locally valid prediction regions are desirable. Deciding how best to partition the feature space X when applying localized conformal prediction is still an open question. We present MD-split+, a practical local conformal approach that creates X partitions based on localized model performance of conditional density estimation models. Our method handles complex real-world data settings where such models may be misspecified, and scales to high-dimensional inputs. We discuss how our local partitions philosophically align with expected behavior from an unattainable conditional conformal inference approach. We also empirically compare our method against other local conformal approaches.
    From Zero to The Hero: A Collaborative Market Aware Recommendation System for Crowd Workers. (arXiv:2107.02890v1 [cs.SE])
    (2 min) The success of software crowdsourcing depends on active and trustworthy pool of worker supply. The uncertainty of crowd workers' behaviors makes it challenging to predict workers' success and plan accordingly. In a competitive crowdsourcing marketplace, competition for success over shared tasks adds another layer of uncertainty in crowd workers' decision-making process. Preliminary analysis on software worker behaviors reveals an alarming task dropping rate of 82.9%. These factors lead to the need for an automated recommendation system for CSD workers to improve the visibility and predictability of their success in the competition. To that end, this paper proposes a collaborative recommendation system for crowd workers. The proposed recommendation system method uses five input metrics based on workers' collaboration history in the pool, workers' preferences in taking tasks in terms of monetary prize and duration, workers' specialty, and workers' proficiency. The proposed method then recommends the most suitable tasks for a worker to compete on based on workers' probability of success in the task. Experimental results on 260 active crowd workers demonstrate that just following the top three success probabilities of task recommendations, workers can achieve success up to 86%
    Bi-Level Poisoning Attack Model and Countermeasure for Appliance Consumption Data of Smart Homes. (arXiv:2107.02897v1 [cs.CR])
    (2 min) Accurate building energy prediction is useful in various applications starting from building energy automation and management to optimal storage control. However, vulnerabilities should be considered when designing building energy prediction models, as intelligent attackers can deliberately influence the model performance using sophisticated attack models. These may consequently degrade the prediction accuracy, which may affect the efficiency and performance of the building energy management systems. In this paper, we investigate the impact of bi-level poisoning attacks on regression models of energy usage obtained from household appliances. Furthermore, an effective countermeasure against the poisoning attacks on the prediction model is proposed in this paper. Attacks and defenses are evaluated on a benchmark dataset. Experimental results show that an intelligent cyber-attacker can poison the prediction model to manipulate the decision. However, our proposed solution successfully ensures defense against such poisoning attacks effectively compared to other benchmark techniques.
    Deep Extrapolation for Attribute-Enhanced Generation. (arXiv:2107.02968v1 [cs.LG])
    (2 min) Attribute extrapolation in sample generation is challenging for deep neural networks operating beyond the training distribution. We formulate a new task for extrapolation in sequence generation, focusing on natural language and proteins, and propose GENhance, a generative framework that enhances attributes through a learned latent space. Trained on movie reviews and a computed protein stability dataset, GENhance can generate strongly-positive text reviews and highly stable protein sequences without being exposed to similar data during training. We release our benchmark tasks and models to contribute to the study of generative modeling extrapolation and data-driven design in biology and chemistry.
    Efficient Detection of Botnet Traffic by features selection and Decision Trees. (arXiv:2107.02896v1 [cs.CR])
    (2 min) Botnets are one of the online threats with the biggest presence, causing billionaire losses to global economies. Nowadays, the increasing number of devices connected to the Internet makes it necessary to analyze large amounts of network traffic data. In this work, we focus on increasing the performance on botnet traffic classification by selecting those features that further increase the detection rate. For this purpose we use two feature selection techniques, Information Gain and Gini Importance, which led to three pre-selected subsets of five, six and seven features. Then, we evaluate the three feature subsets along with three models, Decision Tree, Random Forest and k-Nearest Neighbors. To test the performance of the three feature vectors and the three models we generate two datasets based on the CTU-13 dataset, namely QB-CTU13 and EQB-CTU13. We measure the performance as the macro averaged F1 score over the computational time required to classify a sample. The results show that the highest performance is achieved by Decision Trees using a five feature set which obtained a mean F1 score of 85% classifying each sample in an average time of 0.78 microseconds.
    Particle Convolution for High Energy Physics. (arXiv:2107.02908v1 [hep-ph])
    (2 min) We introduce the Particle Convolution Network (PCN), a new type of equivariant neural network layer suitable for many tasks in jet physics. The particle convolution layer can be viewed as an extension of Deep Sets and Energy Flow network architectures, in which the permutation-invariant operator is promoted to a group convolution. While the PCN can be implemented for various kinds of symmetries, we consider the specific case of rotation about the jet axis the $\eta - \phi$ plane. In two standard benchmark tasks, q/g tagging and top tagging, we show that the rotational PCN (rPCN) achieves performance comparable to graph networks such as ParticleNet. Moreover, we show that it is possible to implement an IRC-safe rPCN, which significantly outperforms existing IRC-safe tagging methods on both tasks. We speculate that by generalizing the PCN to include additional convolutional symmetries relevant to jet physics, it may outperform the current state-of-the-art set by graph networks, while offering a new degree of control over physically-motivated inductive biases.
    Joint Embedding of Structural and Functional Brain Networks with Graph Neural Networks for Mental Illness Diagnosis. (arXiv:2107.03220v1 [q-bio.NC])
    (2 min) Multimodal brain networks characterize complex connectivities among different brain regions from both structural and functional aspects and provide a new means for mental disease analysis. Recently, Graph Neural Networks (GNNs) have become a de facto model for analyzing graph-structured data. However, how to employ GNNs to extract effective representations from brain networks in multiple modalities remains rarely explored. Moreover, as brain networks provide no initial node features, how to design informative node attributes and leverage edge weights for GNNs to learn is left unsolved. To this end, we develop a novel multiview GNN for multimodal brain networks. In particular, we regard each modality as a view for brain networks and employ contrastive learning for multimodal fusion. Then, we propose a GNN model which takes advantage of the message passing scheme by propagating messages based on degree statistics and brain region connectivities. Extensive experiments on two real-world disease datasets (HIV and Bipolar) demonstrate the effectiveness of our proposed method over state-of-the-art baselines.
    Nested Counterfactual Identification from Arbitrary Surrogate Experiments. (arXiv:2107.03190v1 [cs.AI])
    (2 min) The Ladder of Causation describes three qualitatively different types of activities an agent may be interested in engaging in, namely, seeing (observational), doing (interventional), and imagining (counterfactual) (Pearl and Mackenzie, 2018). The inferential challenge imposed by the causal hierarchy is that data is collected by an agent observing or intervening in a system (layers 1 and 2), while its goal may be to understand what would have happened had it taken a different course of action, contrary to what factually ended up happening (layer 3). While there exists a solid understanding of the conditions under which cross-layer inferences are allowed from observations to interventions, the results are somewhat scarcer when targeting counterfactual quantities. In this paper, we study the identification of nested counterfactuals from an arbitrary combination of observations and experiments. Specifically, building on a more explicit definition of nested counterfactuals, we prove the counterfactual unnesting theorem (CUT), which allows one to map arbitrary nested counterfactuals to unnested ones. For instance, applications in mediation and fairness analysis usually evoke notions of direct, indirect, and spurious effects, which naturally require nesting. Second, we introduce a sufficient and necessary graphical condition for counterfactual identification from an arbitrary combination of observational and experimental distributions. Lastly, we develop an efficient and complete algorithm for identifying nested counterfactuals; failure of the algorithm returning an expression for a query implies it is not identifiable.
    Supervised Bayesian Specification Inference from Demonstrations. (arXiv:2107.02912v1 [cs.AI])
    (2 min) When observing task demonstrations, human apprentices are able to identify whether a given task is executed correctly long before they gain expertise in actually performing that task. Prior research into learning from demonstrations (LfD) has failed to capture this notion of the acceptability of a task's execution; meanwhile, temporal logics provide a flexible language for expressing task specifications. Inspired by this, we present Bayesian specification inference, a probabilistic model for inferring task specification as a temporal logic formula. We incorporate methods from probabilistic programming to define our priors, along with a domain-independent likelihood function to enable sampling-based inference. We demonstrate the efficacy of our model for inferring specifications, with over 90% similarity observed between the inferred specification and the ground truth, both within a synthetic domain and during a real-world table setting task.
    Bio-Inspired Adversarial Attack Against Deep Neural Networks. (arXiv:2107.02895v1 [cs.CR])
    (2 min) The paper develops a new adversarial attack against deep neural networks (DNN), based on applying bio-inspired design to moving physical objects. To the best of our knowledge, this is the first work to introduce physical attacks with a moving object. Instead of following the dominating attack strategy in the existing literature, i.e., to introduce minor perturbations to a digital input or a stationary physical object, we show two new successful attack strategies in this paper. We show by superimposing several patterns onto one physical object, a DNN becomes confused and picks one of the patterns to assign a class label. Our experiment with three flapping wing robots demonstrates the possibility of developing an adversarial camouflage to cause a targeted mistake by DNN. We also show certain motion can reduce the dependency among consecutive frames in a video and make an object detector "blind", i.e., not able to detect an object exists in the video. Hence in a successful physical attack against DNN, targeted motion against the system should also be considered.
    Logit-based Uncertainty Measure in Classification. (arXiv:2107.02845v1 [cs.LG])
    (2 min) We introduce a new, reliable, and agnostic uncertainty measure for classification tasks called logit uncertainty. It is based on logit outputs of neural networks. We in particular show that this new uncertainty measure yields a superior performance compared to existing uncertainty measures on different tasks, including out of sample detection and finding erroneous predictions. We analyze theoretical foundations of the measure and explore a relationship with high density regions. We also demonstrate how to test uncertainty using intermediate outputs in training of generative adversarial networks. We propose two potential ways to utilize logit-based uncertainty in real world applications, and show that the uncertainty measure outperforms.
    Keiki: Towards Realistic Danmaku Generation via Sequential GANs. (arXiv:2107.02991v1 [cs.LG])
    (2 min) Search-based procedural content generation methods have recently been introduced for the autonomous creation of bullet hell games. Search-based methods, however, can hardly model patterns of danmakus -- the bullet hell shooting entity -- explicitly and the resulting levels often look non-realistic. In this paper, we present a novel bullet hell game platform named Keiki, which allows the representation of danmakus as a parametric sequence which, in turn, can model the sequential behaviours of danmakus. We employ three types of generative adversarial networks (GANs) and test Keiki across three metrics designed to quantify the quality of the generated danmakus. The time-series GAN and periodic spatial GAN show different yet competitive performance in terms of the evaluation metrics adopted, their deviation from human-designed danmakus, and the diversity of generated danmakus. The preliminary experimental studies presented here showcase that potential of time-series GANs for sequential content generation in games.
    GAN-based Data Augmentation for Chest X-ray Classification. (arXiv:2107.02970v1 [eess.IV])
    (2 min) A common problem in computer vision -- particularly in medical applications -- is a lack of sufficiently diverse, large sets of training data. These datasets often suffer from severe class imbalance. As a result, networks often overfit and are unable to generalize to novel examples. Generative Adversarial Networks (GANs) offer a novel method of synthetic data augmentation. In this work, we evaluate the use of GAN- based data augmentation to artificially expand the CheXpert dataset of chest radiographs. We compare performance to traditional augmentation and find that GAN-based augmentation leads to higher downstream performance for underrepresented classes. Furthermore, we see that this result is pronounced in low data regimens. This suggests that GAN-based augmentation a promising area of research to improve network performance when data collection is prohibitively expensive.
    Neural Contextual Bandits without Regret. (arXiv:2107.03144v1 [stat.ML])
    (2 min) Contextual bandits are a rich model for sequential decision making given side information, with important applications, e.g., in recommender systems. We propose novel algorithms for contextual bandits harnessing neural networks to approximate the unknown reward function. We resolve the open problem of proving sublinear regret bounds in this setting for general context sequences, considering both fully-connected and convolutional networks. To this end, we first analyze NTK-UCB, a kernelized bandit optimization algorithm employing the Neural Tangent Kernel (NTK), and bound its regret in terms of the NTK maximum information gain $\gamma_T$, a complexity parameter capturing the difficulty of learning. Our bounds on $\gamma_T$ for the NTK may be of independent interest. We then introduce our neural network based algorithm NN-UCB, and show that its regret closely tracks that of NTK-UCB. Under broad non-parametric assumptions about the reward function, our approach converges to the optimal policy at a $\tilde{\mathcal{O}}(T^{-1/2d})$ rate, where $d$ is the dimension of the context.
    Transfer Learning in Information Criteria-based Feature Selection. (arXiv:2107.02847v1 [stat.ML])
    (2 min) This paper investigates the effectiveness of transfer learning based on Mallows' Cp. We propose a procedure that combines transfer learning with Mallows' Cp (TLCp) and prove that it outperforms the conventional Mallows' Cp criterion in terms of accuracy and stability. Our theoretical results indicate that, for any sample size in the target domain, the proposed TLCp estimator performs better than the Cp estimator by the mean squared error (MSE) metric in the case of orthogonal predictors, provided that i) the dissimilarity between the tasks from source domain and target domain is small, and ii) the procedure parameters (complexity penalties) are tuned according to certain explicit rules. Moreover, we show that our transfer learning framework can be extended to other feature selection criteria, such as the Bayesian information criterion. By analyzing the solution of the orthogonalized Cp, we identify an estimator that asymptotically approximates the solution of the Cp criterion in the case of non-orthogonal predictors. Similar results are obtained for the non-orthogonal TLCp. Finally, simulation studies and applications with real data demonstrate the usefulness of the TLCp scheme.
    Harnessing Heterogeneity: Learning from Decomposed Feedback in Bayesian Modeling. (arXiv:2107.03003v1 [cs.LG])
    (2 min) There is significant interest in learning and optimizing a complex system composed of multiple sub-components, where these components may be agents or autonomous sensors. Among the rich literature on this topic, agent-based and domain-specific simulations can capture complex dynamics and subgroup interaction, but optimizing over such simulations can be computationally and algorithmically challenging. Bayesian approaches, such as Gaussian processes (GPs), can be used to learn a computationally tractable approximation to the underlying dynamics but typically neglect the detailed information about subgroups in the complicated system. We attempt to find the best of both worlds by proposing the idea of decomposed feedback, which captures group-based heterogeneity and dynamics. We introduce a novel decomposed GP regression to incorporate the subgroup decomposed feedback. Our modified regression has provably lower variance -- and thus a more accurate posterior -- compared to previous approaches; it also allows us to introduce a decomposed GP-UCB optimization algorithm that leverages subgroup feedback. The Bayesian nature of our method makes the optimization algorithm trackable with a theoretical guarantee on convergence and no-regret property. To demonstrate the wide applicability of this work, we execute our algorithm on two disparate social problems: infectious disease control in a heterogeneous population and allocation of distributed weather sensors. Experimental results show that our new method provides significant improvement compared to the state-of-the-art.
    Test for non-negligible adverse shifts. (arXiv:2107.02990v1 [stat.ML])
    (2 min) Statistical tests for dataset shift are susceptible to false alarms: they are sensitive to minor differences where there is in fact adequate sample coverage and predictive performance. We propose instead a robust framework for tests of dataset shift based on outlier scores, D-SOS for short. D-SOS detects adverse shifts and can identify false alarms caused by benign ones. It posits that a new (test) sample is not substantively worse than an old (training) sample, and not that the two are equal. The key idea is to reduce observations to outlier scores and compare contamination rates. Beyond comparing distributions, users can define what worse means in terms of predictive performance and other relevant notions. We show how versatile and practical D-SOS is for a wide range of real and simulated datasets. Unlike tests of equal distribution and of goodness-of-fit, the D-SOS tests are uniquely tailored to serve as robust performance metrics to monitor model drift and dataset shift.
    Bias-Tolerant Fair Classification. (arXiv:2107.03207v1 [cs.LG])
    (2 min) The label bias and selection bias are acknowledged as two reasons in data that will hinder the fairness of machine-learning outcomes. The label bias occurs when the labeling decision is disturbed by sensitive features, while the selection bias occurs when subjective bias exists during the data sampling. Even worse, models trained on such data can inherit or even intensify the discrimination. Most algorithmic fairness approaches perform an empirical risk minimization with predefined fairness constraints, which tends to trade-off accuracy for fairness. However, such methods would achieve the desired fairness level with the sacrifice of the benefits (receive positive outcomes) for individuals affected by the bias. Therefore, we propose a Bias-TolerantFAirRegularizedLoss (B-FARL), which tries to regain the benefits using data affected by label bias and selection bias. B-FARL takes the biased data as input, calls a model that approximates the one trained with fair but latent data, and thus prevents discrimination without constraints required. In addition, we show the effective components by decomposing B-FARL, and we utilize the meta-learning framework for the B-FARL optimization. The experimental results on real-world datasets show that our method is empirically effective in improving fairness towards the direction of true but latent labels.
    Episodic Bandits with Stochastic Experts. (arXiv:2107.03263v1 [cs.LG])
    (2 min) We study a version of the contextual bandit problem where an agent is given soft control of a node in a graph-structured environment through a set of stochastic expert policies. The agent interacts with the environment over episodes, with each episode having different context distributions; this results in the `best expert' changing across episodes. Our goal is to develop an agent that tracks the best expert over episodes. We introduce the Empirical Divergence-based UCB (ED-UCB) algorithm in this setting where the agent does not have any knowledge of the expert policies or changes in context distributions. With mild assumptions, we show that bootstrapping from $\tilde{O}(N\log(NT^2\sqrt{E}))$ samples results in a regret of $\tilde{O}(E(N+1) + \frac{N\sqrt{E}}{T^2})$. If the expert policies are known to the agent a priori, then we can improve the regret to $\tilde{O}(EN)$ without requiring any bootstrapping. Our analysis also tightens pre-existing logarithmic regret bounds to a problem-dependent constant in the non-episodic setting when expert policies are known. We finally empirically validate our findings through simulations.
    On Codomain Separability and Label Inference from (Noisy) Loss Functions. (arXiv:2107.03022v1 [cs.LG])
    (2 min) Machine learning classifiers rely on loss functions for performance evaluation, often on a private (hidden) dataset. Label inference was recently introduced as the problem of reconstructing the ground truth labels of this private dataset from just the (possibly perturbed) loss function values evaluated at chosen prediction vectors, without any other access to the hidden dataset. Existing results have demonstrated this inference is possible on specific loss functions like the cross-entropy loss. In this paper, we introduce the notion of codomain separability to formally study the necessary and sufficient conditions under which label inference is possible from any (noisy) loss function values. Using this notion, we show that for many commonly used loss functions, including multiclass cross-entropy with common activation functions and some Bregman divergence-based losses, it is possible to design label inference attacks for arbitrary noise levels. We demonstrate that these attacks can also be carried out through actual neural network models, and argue, both formally and empirically, the role of finite precision arithmetic in this setting.
    Scalable Teacher Forcing Network for Semi-Supervised Large Scale Data Streams. (arXiv:2107.02943v1 [cs.DC])
    (2 min) The large-scale data stream problem refers to high-speed information flow which cannot be processed in scalable manner under a traditional computing platform. This problem also imposes expensive labelling cost making the deployment of fully supervised algorithms unfeasible. On the other hand, the problem of semi-supervised large-scale data streams is little explored in the literature because most works are designed in the traditional single-node computing environments while also being fully supervised approaches. This paper offers Weakly Supervised Scalable Teacher Forcing Network (WeScatterNet) to cope with the scarcity of labelled samples and the large-scale data streams simultaneously. WeScatterNet is crafted under distributed computing platform of Apache Spark with a data-free model fusion strategy for model compression after parallel computing stage. It features an open network structure to address the global and local drift problems while integrating a data augmentation, annotation and auto-correction ($DA^3$) method for handling partially labelled data streams. The performance of WeScatterNet is numerically evaluated in the six large-scale data stream problems with only $25\%$ label proportions. It shows highly competitive performance even if compared with fully supervised learners with $100\%$ label proportions.
    A Closed-Form Approximation to the Conjugate Prior of the Dirichlet and Beta Distributions. (arXiv:2107.03183v1 [stat.ML])
    (2 min) We derive the conjugate prior of the Dirichlet and beta distributions and explore it with numerical examples to gain an intuitive understanding of the distribution itself, its hyperparameters, and conditions concerning its convergence. Due to the prior's intractability, we proceed to define and analyze a closed-form approximation. Finally, we provide an algorithm implementing this approximation that enables fully tractable Bayesian conjugate treatment of Dirichlet and beta likelihoods without the need for Monte Carlo simulations.
    Adversarial Machine Learning for Cybersecurity and Computer Vision: Current Developments and Challenges. (arXiv:2107.02894v1 [cs.CR])
    (2 min) We provide a comprehensive overview of adversarial machine learning focusing on two application domains, i.e., cybersecurity and computer vision. Research in adversarial machine learning addresses a significant threat to the wide application of machine learning techniques -- they are vulnerable to carefully crafted attacks from malicious adversaries. For example, deep neural networks fail to correctly classify adversarial images, which are generated by adding imperceptible perturbations to clean images.We first discuss three main categories of attacks against machine learning techniques -- poisoning attacks, evasion attacks, and privacy attacks. Then the corresponding defense approaches are introduced along with the weakness and limitations of the existing defense approaches. We notice adversarial samples in cybersecurity and computer vision are fundamentally different. While adversarial samples in cybersecurity often have different properties/distributions compared with training data, adversarial images in computer vision are created with minor input perturbations. This further complicates the development of robust learning techniques, because a robust learning technique must withstand different types of attacks.
    A Deep Residual Star Generative Adversarial Network for multi-domain Image Super-Resolution. (arXiv:2107.03145v1 [eess.IV])
    (2 min) Recently, most of state-of-the-art single image super-resolution (SISR) methods have attained impressive performance by using deep convolutional neural networks (DCNNs). The existing SR methods have limited performance due to a fixed degradation settings, i.e. usually a bicubic downscaling of low-resolution (LR) image. However, in real-world settings, the LR degradation process is unknown which can be bicubic LR, bilinear LR, nearest-neighbor LR, or real LR. Therefore, most SR methods are ineffective and inefficient in handling more than one degradation settings within a single network. To handle the multiple degradation, i.e. refers to multi-domain image super-resolution, we propose a deep Super-Resolution Residual StarGAN (SR2*GAN), a novel and scalable approach that super-resolves the LR images for the multiple LR domains using only a single model. The proposed scheme is trained in a StarGAN like network topology with a single generator and discriminator networks. We demonstrate the effectiveness of our proposed approach in quantitative and qualitative experiments compared to other state-of-the-art methods.
    Distributed adaptive algorithm based on the asymmetric cost of error functions. (arXiv:2107.03067v1 [cs.LG])
    (2 min) In this paper, a family of novel diffusion adaptive estimation algorithm is proposed from the asymmetric cost function perspective by combining diffusion strategy and the linear-linear cost (LLC), quadratic-quadratic cost (QQC), and linear-exponential cost (LEC), at all distributed network nodes, and named diffusion LLCLMS (DLLCLMS), diffusion QQCLMS (DQQCLMS), and diffusion LECLMS (DLECLMS), respectively. Then the stability of mean estimation error and computational complexity of those three diffusion algorithms are analyzed theoretically. Finally, several experiment simulation results are designed to verify the superiority of those three proposed diffusion algorithms. Experimental simulation results show that DLLCLMS, DQQCLMS, and DLECLMS algorithms are more robust to the input signal and impulsive noise than the DSELMS, DRVSSLMS, and DLLAD algorithms. In brief, theoretical analysis and experiment results show that those proposed DLLCLMS, DQQCLMS, and DLECLMS algorithms have superior performance when estimating the unknown linear system under the changeable impulsive noise environments and different types of input signals.
    RAM-VO: Less is more in Visual Odometry. (arXiv:2107.02974v1 [cs.RO])
    (2 min) Building vehicles capable of operating without human supervision requires the determination of the agent's pose. Visual Odometry (VO) algorithms estimate the egomotion using only visual changes from the input images. The most recent VO methods implement deep-learning techniques using convolutional neural networks (CNN) extensively, which add a substantial cost when dealing with high-resolution images. Furthermore, in VO tasks, more input data does not mean a better prediction; on the contrary, the architecture may filter out useless information. Therefore, the implementation of computationally efficient and lightweight architectures is essential. In this work, we propose the RAM-VO, an extension of the Recurrent Attention Model (RAM) for visual odometry tasks. RAM-VO improves the visual and temporal representation of information and implements the Proximal Policy Optimization (PPO) algorithm to learn robust policies. The results indicate that RAM-VO can perform regressions with six degrees of freedom from monocular input images using approximately 3 million parameters. In addition, experiments on the KITTI dataset demonstrate that RAM-VO achieves competitive results using only 5.7% of the available visual information.
    Scalable Data Balancing for Unlabeled Satellite Imagery. (arXiv:2107.03227v1 [cs.CV])
    (2 min) Data imbalance is a ubiquitous problem in machine learning. In large scale collected and annotated datasets, data imbalance is either mitigated manually by undersampling frequent classes and oversampling rare classes, or planned for with imputation and augmentation techniques. In both cases balancing data requires labels. In other words, only annotated data can be balanced. Collecting fully annotated datasets is challenging, especially for large scale satellite systems such as the unlabeled NASA's 35 PB Earth Imagery dataset. Although the NASA Earth Imagery dataset is unlabeled, there are implicit properties of the data source that we can rely on to hypothesize about its imbalance, such as distribution of land and water in the case of the Earth's imagery. We present a new iterative method to balance unlabeled data. Our method utilizes image embeddings as a proxy for image labels that can be used to balance data, and ultimately when trained increases overall accuracy.
    Probabilistic partition of unity networks: clustering based deep approximation. (arXiv:2107.03066v1 [cs.LG])
    (2 min) Partition of unity networks (POU-Nets) have been shown capable of realizing algebraic convergence rates for regression and solution of PDEs, but require empirical tuning of training parameters. We enrich POU-Nets with a Gaussian noise model to obtain a probabilistic generalization amenable to gradient-based minimization of a maximum likelihood loss. The resulting architecture provides spatial representations of both noiseless and noisy data as Gaussian mixtures with closed form expressions for variance which provides an estimator of local error. The training process yields remarkably sharp partitions of input space based upon correlation of function values. This classification of training points is amenable to a hierarchical refinement strategy that significantly improves the localization of the regression, allowing for higher-order polynomial approximation to be utilized. The framework scales more favorably to large data sets as compared to Gaussian process regression and allows for spatially varying uncertainty, leveraging the expressive power of deep neural networks while bypassing expensive training associated with other probabilistic deep learning methods. Compared to standard deep neural networks, the framework demonstrates hp-convergence without the use of regularizers to tune the localization of partitions. We provide benchmarks quantifying performance in high/low-dimensions, demonstrating that convergence rates depend only on the latent dimension of data within high-dimensional space. Finally, we introduce a new open-source data set of PDE-based simulations of a semiconductor device and perform unsupervised extraction of a physically interpretable reduced-order basis.

2021-07-07

  • cs.CL updates on arXiv.org

    Gender Recognition in Informal and Formal Language Scenarios via Transfer Learning. (arXiv:2107.02759v1 [cs.CL])
    (2 min) The interest in demographic information retrieval based on text data has increased in the research community because applications have shown success in different sectors such as security, marketing, heath-care, and others. Recognition and identification of demographic traits such as gender, age, location, or personality based on text data can help to improve different marketing strategies. For instance it makes it possible to segment and to personalize offers, thus products and services are exposed to the group of greatest interest. This type of technology has been discussed widely in documents from social media. However, the methods have been poorly studied in data with a more formal structure, where there is no access to emoticons, mentions, and other linguistic phenomena that are only present in social media. This paper proposes the use of recurrent and convolutional neural networks, and a transfer learning strategy for gender recognition in documents that are written in informal and formal languages. Models are tested in two different databases consisting of Tweets and call-center conversations. Accuracies of up to 75\% are achieved for both databases. The results also indicate that it is possible to transfer the knowledge from a system trained on a specific type of expressions or idioms such as those typically used in social media into a more formal type of text data, where the amount of data is more scarce and its structure is completely different.
    Deep Learning Schema-based Event Extraction: Literature Review and Current Trends. (arXiv:2107.02126v2 [cs.CL] UPDATED)
    (2 min) Schema-based event extraction is a critical technique to apprehend the essential content of events promptly. With the rapid development of deep learning technology, event extraction technology based on deep learning has become a research hotspot. Numerous methods, datasets, and evaluation metrics have been proposed in the literature, raising the need for a comprehensive and updated survey. This paper fills the gap by reviewing the state-of-the-art approaches, focusing on deep learning-based models. We summarize the task definition, paradigm, and models of schema-based event extraction and then discuss each of these in detail. We introduce benchmark datasets that support tests of predictions and evaluation metrics. A comprehensive comparison between different techniques is also provided in this survey. Finally, we conclude by summarizing future research directions facing the research area.
    From Talk to Action with Accountability: Monitoring the Public Discussion of Policy Makers with Deep Neural Networks and Topic Modelling. (arXiv:2010.08346v2 [cs.CL] UPDATED)
    (2 min) Decades of research on climate have provided a consensus that human activity has changed the climate and we are currently heading into a climate crisis. While public discussion and research efforts on climate change mitigation have increased, potential solutions need to not only be discussed but also effectively deployed. For preventing mismanagement and holding policy makers accountable, transparency and degree of information about government processes have been shown to be crucial. However, currently the quantity of information about climate change discussions and the range of sources make it increasingly difficult for the public and civil society to maintain an overview to hold politicians accountable. In response, we propose a multi-source topic aggregation system (MuSTAS) which processes policy makers speech and rhetoric from several publicly available sources into an easily digestible topic summary. MuSTAS uses novel multi-source hybrid latent Dirichlet allocation to model topics from a variety of documents. This topic digest will serve the general public and civil society in assessing where, how, and when politicians talk about climate and climate policies, enabling them to hold politicians accountable for their actions to mitigate climate change and lack thereof.
    Improving Coherence and Consistency in Neural Sequence Models with Dual-System, Neuro-Symbolic Reasoning. (arXiv:2107.02794v1 [cs.AI])
    (2 min) Human reasoning can often be understood as an interplay between two systems: the intuitive and associative ("System 1") and the deliberative and logical ("System 2"). Neural sequence models -- which have been increasingly successful at performing complex, structured tasks -- exhibit the advantages and failure modes of System 1: they are fast and learn patterns from data, but are often inconsistent and incoherent. In this work, we seek a lightweight, training-free means of improving existing System 1-like sequence models by adding System 2-inspired logical reasoning. We explore several variations on this theme in which candidate generations from a neural sequence model are examined for logical consistency by a symbolic reasoning module, which can either accept or reject the generations. Our approach uses neural inference to mediate between the neural System 1 and the logical System 2. Results in robust story generation and grounded instruction-following show that this approach can increase the coherence and accuracy of neurally-based generations.
    SemEval-2021 Task 11: NLPContributionGraph -- Structuring Scholarly NLP Contributions for a Research Knowledge Graph. (arXiv:2106.07385v2 [cs.CL] UPDATED)
    (2 min) There is currently a gap between the natural language expression of scholarly publications and their structured semantic content modeling to enable intelligent content search. With the volume of research growing exponentially every year, a search feature operating over semantically structured content is compelling. The SemEval-2021 Shared Task NLPContributionGraph (a.k.a. 'the NCG task') tasks participants to develop automated systems that structure contributions from NLP scholarly articles in the English language. Being the first-of-its-kind in the SemEval series, the task released structured data from NLP scholarly articles at three levels of information granularity, i.e. at sentence-level, phrase-level, and phrases organized as triples toward Knowledge Graph (KG) building. The sentence-level annotations comprised the few sentences about the article's contribution. The phrase-level annotations were scientific term and predicate phrases from the contribution sentences. Finally, the triples constituted the research overview KG. For the Shared Task, participating systems were then expected to automatically classify contribution sentences, extract scientific terms and relations from the sentences, and organize them as KG triples. Overall, the task drew a strong participation demographic of seven teams and 27 participants. The best end-to-end task system classified contribution sentences at 57.27% F1, phrases at 46.41% F1, and triples at 22.28% F1. While the absolute performance to generate triples remains low, in the conclusion of this article, the difficulty of producing such data and as a consequence of modeling it is highlighted.
    Attention over learned object embeddings enables complex visual reasoning. (arXiv:2012.08508v2 [cs.CV] UPDATED)
    (2 min) Neural networks have achieved success in a wide array of perceptual tasks but often fail at tasks involving both perception and higher-level reasoning. On these more challenging tasks, bespoke approaches (such as modular symbolic components, independent dynamics models or semantic parsers) targeted towards that specific type of task have typically performed better. The downside to these targeted approaches, however, is that they can be more brittle than general-purpose neural networks, requiring significant modification or even redesign according to the particular task at hand. Here, we propose a more general neural-network-based approach to dynamic visual reasoning problems that obtains state-of-the-art performance on three different domains, in each case outperforming bespoke modular approaches tailored specifically to the task. Our method relies on learned object-centric representations, self-attention and self-supervised dynamics learning, and all three elements together are required for strong performance to emerge. The success of this combination suggests that there may be no need to trade off flexibility for performance on problems involving spatio-temporal or causal-style reasoning. With the right soft biases and learning objectives in a neural network we may be able to attain the best of both worlds.
    Lexical Access Model for Italian -- Modeling human speech processing: identification of words in running speech toward lexical access based on the detection of landmarks and other acoustic cues to features. (arXiv:2107.02720v1 [eess.AS])
    (3 min) Modelling the process that a listener actuates in deriving the words intended by a speaker requires setting a hypothesis on how lexical items are stored in memory. This work aims at developing a system that imitates humans when identifying words in running speech and, in this way, provide a framework to better understand human speech processing. We build a speech recognizer for Italian based on the principles of Stevens' model of Lexical Access in which words are stored as hierarchical arrangements of distinctive features (Stevens, K. N. (2002). "Toward a model for lexical access based on acoustic landmarks and distinctive features," J. Acoust. Soc. Am., 111(4):1872-1891). Over the past few decades, the Speech Communication Group at the Massachusetts Institute of Technology (MIT) developed a speech recognition system for English based on this approach. Italian will be the first language beyond English to be explored; the extension to another language provides the opportunity to test the hypothesis that words are represented in memory as a set of hierarchically-arranged distinctive features, and reveal which of the underlying mechanisms may have a language-independent nature. This paper also introduces a new Lexical Access corpus, the LaMIT database, created and labeled specifically for this work, that will be provided freely to the speech research community. Future developments will test the hypothesis that specific acoustic discontinuities - called landmarks - that serve as cues to features, are language independent, while other cues may be language-dependent, with powerful implications for understanding how the human brain recognizes speech.
    When Does Pretraining Help? Assessing Self-Supervised Learning for Law and the CaseHOLD Dataset. (arXiv:2104.08671v3 [cs.CL] UPDATED)
    (3 min) While self-supervised learning has made rapid advances in natural language processing, it remains unclear when researchers should engage in resource-intensive domain-specific pretraining (domain pretraining). The law, puzzlingly, has yielded few documented instances of substantial gains to domain pretraining in spite of the fact that legal language is widely seen to be unique. We hypothesize that these existing results stem from the fact that existing legal NLP tasks are too easy and fail to meet conditions for when domain pretraining can help. To address this, we first present CaseHOLD (Case Holdings On Legal Decisions), a new dataset comprised of over 53,000+ multiple choice questions to identify the relevant holding of a cited case. This dataset presents a fundamental task to lawyers and is both legally meaningful and difficult from an NLP perspective (F1 of 0.4 with a BiLSTM baseline). Second, we assess performance gains on CaseHOLD and existing legal NLP datasets. While a Transformer architecture (BERT) pretrained on a general corpus (Google Books and Wikipedia) improves performance, domain pretraining (using corpus of approximately 3.5M decisions across all courts in the U.S. that is larger than BERT's) with a custom legal vocabulary exhibits the most substantial performance gains with CaseHOLD (gain of 7.2% on F1, representing a 12% improvement on BERT) and consistent performance gains across two other legal tasks. Third, we show that domain pretraining may be warranted when the task exhibits sufficient similarity to the pretraining corpus: the level of performance increase in three legal tasks was directly tied to the domain specificity of the task. Our findings inform when researchers should engage resource-intensive pretraining and show that Transformer-based architectures, too, learn embeddings suggestive of distinct legal language.
    ETHOS: an Online Hate Speech Detection Dataset. (arXiv:2006.08328v2 [cs.CL] UPDATED)
    (2 min) Online hate speech is a recent problem in our society that is rising at a steady pace by leveraging the vulnerabilities of the corresponding regimes that characterise most social media platforms. This phenomenon is primarily fostered by offensive comments, either during user interaction or in the form of a posted multimedia context. Nowadays, giant corporations own platforms where millions of users log in every day, and protection from exposure to similar phenomena appears to be necessary in order to comply with the corresponding legislation and maintain a high level of service quality. A robust and reliable system for detecting and preventing the uploading of relevant content will have a significant impact on our digitally interconnected society. Several aspects of our daily lives are undeniably linked to our social profiles, making us vulnerable to abusive behaviours. As a result, the lack of accurate hate speech detection mechanisms would severely degrade the overall user experience, although its erroneous operation would pose many ethical concerns. In this paper, we present 'ETHOS', a textual dataset with two variants: binary and multi-label, based on YouTube and Reddit comments validated using the Figure-Eight crowdsourcing platform. Furthermore, we present the annotation protocol used to create this dataset: an active sampling procedure for balancing our data in relation to the various aspects defined. Our key assumption is that, even gaining a small amount of labelled data from such a time-consuming process, we can guarantee hate speech occurrences in the examined material.
    Sawtooth Factorial Topic Embeddings Guided Gamma Belief Network. (arXiv:2107.02757v1 [cs.IR])
    (2 min) Hierarchical topic models such as the gamma belief network (GBN) have delivered promising results in mining multi-layer document representations and discovering interpretable topic taxonomies. However, they often assume in the prior that the topics at each layer are independently drawn from the Dirichlet distribution, ignoring the dependencies between the topics both at the same layer and across different layers. To relax this assumption, we propose sawtooth factorial topic embedding guided GBN, a deep generative model of documents that captures the dependencies and semantic similarities between the topics in the embedding space. Specifically, both the words and topics are represented as embedding vectors of the same dimension. The topic matrix at a layer is factorized into the product of a factor loading matrix and a topic embedding matrix, the transpose of which is set as the factor loading matrix of the layer above. Repeating this particular type of factorization, which shares components between adjacent layers, leads to a structure referred to as sawtooth factorization. An auto-encoding variational inference network is constructed to optimize the model parameter via stochastic gradient descent. Experiments on big corpora show that our models outperform other neural topic models on extracting deeper interpretable topics and deriving better document representations.
    What Helps Transformers Recognize Conversational Structure? Importance of Context, Punctuation, and Labels in Dialog Act Recognition. (arXiv:2107.02294v1 [cs.CL])
    (2 min) Dialog acts can be interpreted as the atomic units of a conversation, more fine-grained than utterances, characterized by a specific communicative function. The ability to structure a conversational transcript as a sequence of dialog acts -- dialog act recognition, including the segmentation -- is critical for understanding dialog. We apply two pre-trained transformer models, XLNet and Longformer, to this task in English and achieve strong results on Switchboard Dialog Act and Meeting Recorder Dialog Act corpora with dialog act segmentation error rates (DSER) of 8.4% and 14.2%. To understand the key factors affecting dialog act recognition, we perform a comparative analysis of models trained under different conditions. We find that the inclusion of a broader conversational context helps disambiguate many dialog act classes, especially those infrequent in the training data. The presence of punctuation in the transcripts has a massive effect on the models' performance, and a detailed analysis reveals specific segmentation patterns observed in its absence. Finally, we find that the label set specificity does not affect dialog act segmentation performance. These findings have significant practical implications for spoken language understanding applications that depend heavily on a good-quality segmentation being available.
    Enhanced Universal Dependency Parsing with Automated Concatenation of Embeddings. (arXiv:2107.02416v1 [cs.CL])
    (2 min) This paper describes the system used in submission from SHANGHAITECH team to the IWPT 2021 Shared Task. Our system is a graph-based parser with the technique of Automated Concatenation of Embeddings (ACE). Because recent work found that better word representations can be obtained by concatenating different types of embeddings, we use ACE to automatically find the better concatenation of embeddings for the task of enhanced universal dependencies. According to official results averaged on 17 languages, our system ranks 2nd over 9 teams.
    Location, Location: Enhancing the Evaluation of Text-to-Speech Synthesis Using the Rapid Prosody Transcription Paradigm. (arXiv:2107.02527v1 [eess.AS])
    (2 min) Text-to-Speech synthesis systems are generally evaluated using Mean Opinion Score (MOS) tests, where listeners score samples of synthetic speech on a Likert scale. A major drawback of MOS tests is that they only offer a general measure of overall quality-i.e., the naturalness of an utterance-and so cannot tell us where exactly synthesis errors occur. This can make evaluation of the appropriateness of prosodic variation within utterances inconclusive. To address this, we propose a novel evaluation method based on the Rapid Prosody Transcription paradigm. This allows listeners to mark the locations of errors in an utterance in real-time, providing a probabilistic representation of the perceptual errors that occur in the synthetic signal. We conduct experiments that confirm that the fine-grained evaluation can be mapped to system rankings of standard MOS tests, but the error marking gives a much more comprehensive assessment of synthesized prosody. In particular, for standard audiobook test set samples, we see that error marks consistently cluster around words at major prosodic boundaries indicated by punctuation. However, for question-answer based stimuli, where we control information structure, we see differences emerge in the ability of neural TTS systems to generate context-appropriate prosodic prominence.
    Empowering NGOs in Countering Online Hate Messages. (arXiv:2107.02472v1 [cs.CL])
    (2 min) Studies on online hate speech have mostly focused on the automated detection of harmful messages. Little attention has been devoted so far to the development of effective strategies to fight hate speech, in particular through the creation of counter-messages. While existing manual scrutiny and intervention strategies are time-consuming and not scalable, advances in natural language processing have the potential to provide a systematic approach to hatred management. In this paper, we introduce a novel ICT platform that NGO operators can use to monitor and analyze social media data, along with a counter-narrative suggestion tool. Our platform aims at increasing the efficiency and effectiveness of operators' activities against islamophobia. We test the platform with more than one hundred NGO operators in three countries through qualitative and quantitative evaluation. Results show that NGOs favor the platform solution with the suggestion tool, and that the time required to produce counter-narratives significantly decreases.
    Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering. (arXiv:2107.02331v1 [cs.CL])
    (2 min) Active learning promises to alleviate the massive data needs of supervised machine learning: it has successfully improved sample efficiency by an order of magnitude on traditional tasks like topic classification and object recognition. However, we uncover a striking contrast to this promise: across 5 models and 4 datasets on the task of visual question answering, a wide variety of active learning approaches fail to outperform random selection. To understand this discrepancy, we profile 8 active learning methods on a per-example basis, and identify the problem as collective outliers -- groups of examples that active learning methods prefer to acquire but models fail to learn (e.g., questions that ask about text in images or require external knowledge). Through systematic ablation experiments and qualitative visualizations, we verify that collective outliers are a general phenomenon responsible for degrading pool-based active learning. Notably, we show that active learning sample efficiency increases significantly as the number of collective outliers in the active learning pool decreases. We conclude with a discussion and prescriptive recommendations for mitigating the effects of these outliers in future work.
    The NiuTrans End-to-End Speech Translation System \\for IWSLT 2021 Offline Task. (arXiv:2107.02444v1 [cs.CL])
    (2 min) This paper describes the submission of the NiuTrans end-to-end speech translation system for the IWSLT 2021 offline task, which translates from the English audio to German text directly without intermediate transcription. We use the Transformer-based model architecture and enhance it by Conformer, relative position encoding, and stacked acoustic and textual encoding. To augment the training data, the English transcriptions are translated to German translations. Finally, we employ ensemble decoding to integrate the predictions from several models trained with the different datasets. Combining these techniques, we achieve 33.84 BLEU points on the MuST-C En-De test set, which shows the enormous potential of the end-to-end model.
    AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style. (arXiv:2107.02530v1 [cs.SD])
    (2 min) While recent text to speech (TTS) models perform very well in synthesizing reading-style (e.g., audiobook) speech, it is still challenging to synthesize spontaneous-style speech (e.g., podcast or conversation), mainly because of two reasons: 1) the lack of training data for spontaneous speech; 2) the difficulty in modeling the filled pauses (um and uh) and diverse rhythms in spontaneous speech. In this paper, we develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech. Specifically, 1) to insert filled pauses (FP) in the text sequence appropriately, we introduce an FP predictor to the TTS model; 2) to model the varying rhythms, we introduce a duration predictor based on mixture of experts (MoE), which contains three experts responsible for the generation of fast, medium and slow speech respectively, and fine-tune it as well as the pitch predictor for rhythm adaptation; 3) to adapt to other speaker timbre, we fine-tune some parameters in the decoder with few speech data. To address the challenge of lack of training data, we mine a spontaneous speech dataset to support our research this work and facilitate future research on spontaneous TTS. Experiments show that AdaSpeech 3 synthesizes speech with natural FP and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
    Experiments with adversarial attacks on text genres. (arXiv:2107.02246v1 [cs.CL])
    (2 min) Neural models based on pre-trained transformers, such as BERT or XLM-RoBERTa, demonstrate SOTA results in many NLP tasks, including non-topical classification, such as genre identification. However, often these approaches exhibit low reliability to minor alterations of the test texts. A related probelm concerns topical biases in the training corpus, for example, the prevalence of words on a specific topic in a specific genre can trick the genre classifier to recognise any text on this topic in this genre. In order to mitigate the reliability problem, this paper investigates techniques for attacking genre classifiers to understand the limitations of the transformer models and to improve their performance. While simple text attacks, such as those based on word replacement using keywords extracted by tf-idf, are not capable of deceiving powerful models like XLM-RoBERTa, we show that embedding-based algorithms which can replace some of the most ``significant'' words with words similar to them, for example, TextFooler, have the ability to influence model predictions in a significant proportion of cases.
    VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer. (arXiv:2107.02681v1 [cs.CL])
    (2 min) Since visual perception can give rich information beyond text descriptions for world understanding, there has been increasing interest in leveraging visual grounding for language learning. Recently, vokenization has attracted attention by using the predictions of a text-to-image retrieval model as labels for language model supervision. Despite its success, the method suffers from approximation error of using finite image labels and the lack of vocabulary diversity of a small image-text dataset. To overcome these limitations, we present VidLanKD, a video-language knowledge distillation method for improving language understanding. We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset. To avoid approximation error, we propose to use different knowledge distillation objectives. In addition, the use of a large-scale video-text dataset helps learn diverse and richer vocabularies. In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models, on several downstream language understanding tasks including GLUE, SQuAD, and SWAG. We also demonstrate the improved world knowledge, physical reasoning, and temporal reasoning capabilities of our model by evaluating on the GLUE-diagnostics, PIQA, and TRACIE datasets. Lastly, we present comprehensive ablation studies as well as visualizations of the learned text-to-video grounding results of our teacher and student language models. Our code and models are available at: https://github.com/zinengtang/VidLanKD
    Identifying negativity factors from social media text corpus using sentiment analysis method. (arXiv:2107.02175v1 [cs.CL])
    (2 min) Automatic sentiment analysis play vital role in decision making. Many organizations spend a lot of budget to understand their customer satisfaction by manually going over their feedback/comments or tweets. Automatic sentiment analysis can give overall picture of the comments received against any event, product, or activity. Usually, the comments/tweets are classified into two main classes that are negative or positive. However, the negative comments are too abstract to understand the basic reason or the context. organizations are interested to identify the exact reason for the negativity. In this research study, we hierarchically goes down into negative comments, and link them with more classes. Tweets are extracted from social media sites such as Twitter and Facebook. If the sentiment analysis classifies any tweet into negative class, then we further try to associates that negative comments with more possible negative classes. Based on expert opinions, the negative comments/tweets are further classified into 8 classes. Different machine learning algorithms are evaluated and their accuracy are reported.
    Probabilistic Graph Reasoning for Natural Proof Generation. (arXiv:2107.02418v1 [cs.CL])
    (2 min) In this paper, we investigate the problem of reasoning over natural language statements. Prior neural based approaches do not explicitly consider the inter-dependency among answers and their proofs. In this paper, we propose PRobr, a novel approach for joint answer prediction and proof generation. PRobr defines a joint probabilistic distribution over all possible proof graphs and answers via an induced graphical model. We then optimize the model using variational approximation on top of neural textual representation. Experiments on multiple datasets under diverse settings (fully supervised, few-shot and zero-shot evaluation) verify the effectiveness of PRobr, e.g., achieving 10%-30% improvement on QA accuracy in few/zero-shot evaluation. Our codes and models can be found at https://github.com/changzhisun/PRobr/.
    Weakly Supervised Named Entity Tagging with Learnable Logical Rules. (arXiv:2107.02282v1 [cs.CL])
    (2 min) We study the problem of building entity tagging systems by using a few rules as weak supervision. Previous methods mostly focus on disambiguation entity types based on contexts and expert-provided rules, while assuming entity spans are given. In this work, we propose a novel method TALLOR that bootstraps high-quality logical rules to train a neural tagger in a fully automated manner. Specifically, we introduce compound rules that are composed from simple rules to increase the precision of boundary detection and generate more diverse pseudo labels. We further design a dynamic label selection strategy to ensure pseudo label quality and therefore avoid overfitting the neural tagger. Experiments on three datasets demonstrate that our method outperforms other weakly supervised methods and even rivals a state-of-the-art distantly supervised tagger with a lexicon of over 2,000 terms when starting from only 20 simple rules. Our method can serve as a tool for rapidly building taggers in emerging domains and tasks. Case studies show that learned rules can potentially explain the predicted entities.
    Injecting Knowledge Base Information into End-to-End Joint Entity and Relation Extraction and Coreference Resolution. (arXiv:2107.02286v1 [cs.CL])
    (2 min) We consider a joint information extraction (IE) model, solving named entity recognition, coreference resolution and relation extraction jointly over the whole document. In particular, we study how to inject information from a knowledge base (KB) in such IE model, based on unsupervised entity linking. The used KB entity representations are learned from either (i) hyperlinked text documents (Wikipedia), or (ii) a knowledge graph (Wikidata), and appear complementary in raising IE performance. Representations of corresponding entity linking (EL) candidates are added to text span representations of the input document, and we experiment with (i) taking a weighted average of the EL candidate representations based on their prior (in Wikipedia), and (ii) using an attention scheme over the EL candidate list. Results demonstrate an increase of up to 5% F1-score for the evaluated IE tasks on two datasets. Despite a strong performance of the prior-based model, our quantitative and qualitative analysis reveals the advantage of using the attention-based approach.
    Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition. (arXiv:2107.02268v1 [cs.CL])
    (2 min) Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition (ASR). When using appropriate modeling units, e.g., byte-pair encoded characters, these systems are in principal open vocabulary systems. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, numbers or technical terms. To alleviate this problem we supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly. After the training of the ASR system, and when it has already been deployed, a relevant word can be added or subtracted instantly without the need for further training. In this paper we demonstrate that through this mechanism our system is able to recognize more than 85% of newly added words that it previously failed to recognize compared to a strong baseline.
    Long-Short Transformer: Efficient Transformers for Language and Vision. (arXiv:2107.02192v1 [cs.CV])
    (2 min) Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because self-attention mechanism has quadratic time and memory complexities with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters than previous method, while being faster and is able to handle 3$\times$ as long sequences compared to its full-attention version on the same hardware. On ImageNet, it can obtain the state-of-the-art results~(e.g., Top-1 accuracy 84.1% trained on 224$\times$224 ImageNet-1K only), while being more scalable on high-resolution images. The models and source code will be released soon.
    An NLG pipeline for a legal expert system: a work in progress. (arXiv:2107.02421v1 [cs.CL])
    (2 min) We present the NLG component for L4, a prototype domain-specific language (DSL) for drafting laws and contracts. As a concrete use case, we describe a pipeline for a legal expert system created from L4 code. The NLG component is used in two steps. The first step is to create an interview, whose answers are processed into a query for an automated reasoner. The second step is to render the answers of the reasoner in natural language.
    Transfer Learning for Improving Results on Russian Sentiment Datasets. (arXiv:2107.02499v1 [cs.CL])
    (2 min) In this study, we test transfer learning approach on Russian sentiment benchmark datasets using additional train sample created with distant supervision technique. We compare several variants of combining additional data with benchmark train samples. The best results were achieved using three-step approach of sequential training on general, thematic and original train samples. For most datasets, the results were improved by more than 3% to the current state-of-the-art methods. The BERT-NLI model treating sentiment classification problem as a natural language inference task reached the human level of sentiment analysis on one of the datasets.
    Sarcasm Detection: A Comparative Study. (arXiv:2107.02276v1 [cs.CL])
    (2 min) Sarcasm detection is the task of identifying irony containing utterances in sentiment-bearing text. However, the figurative and creative nature of sarcasm poses a great challenge for affective computing systems performing sentiment analysis. This article compiles and reviews the salient work in the literature of automatic sarcasm detection. Thus far, three main paradigm shifts have occurred in the way researchers have approached this task: 1) semi-supervised pattern extraction to identify implicit sentiment, 2) use of hashtag-based supervision, and 3) incorporation of context beyond target text. In this article, we provide a comprehensive review of the datasets, approaches, trends, and issues in sarcasm and irony detection.
  • cs.CV updates on arXiv.org

    DeepOPG: Improving Orthopantomogram Finding Summarization with Weak Supervision. (arXiv:2103.08290v2 [cs.CV] UPDATED)
    (2 min) Clinical finding summaries from an orthopantomogram, or a dental panoramic radiograph, have significant potential to improve patient communication and speed up clinical judgments. While orthopantomogram is a first-line tool for dental examinations, no existing work has explored the summarization of findings from it. A finding summary has to find teeth in the imaging study and label the teeth with several types of past treatments. To tackle the problem, we developDeepOPG that breaks the summarization process into functional segmentation and tooth localization, the latter of which is further refined by a novel dental coherence module. We also leverage weak supervision labels to improve detection results in a reinforcement learning scenario. Experiments show high efficacy of DeepOPG on finding summarization, achieving an overall AUC of 88.2% in detecting six types of findings. The proposed dental coherence and weak supervision both are shown to improve DeepOPG by adding 5.9% and 0.4% to AP@IoU=0.5.
    VTGAN: Semi-supervised Retinal Image Synthesis and Disease Prediction using Vision Transformers. (arXiv:2104.06757v2 [eess.IV] UPDATED)
    (2 min) In Fluorescein Angiography (FA), an exogenous dye is injected in the bloodstream to image the vascular structure of the retina. The injected dye can cause adverse reactions such as nausea, vomiting, anaphylactic shock, and even death. In contrast, color fundus imaging is a non-invasive technique used for photographing the retina but does not have sufficient fidelity for capturing its vascular structure. The only non-invasive method for capturing retinal vasculature is optical coherence tomography-angiography (OCTA). However, OCTA equipment is quite expensive, and stable imaging is limited to small areas on the retina. In this paper, we propose a novel conditional generative adversarial network (GAN) capable of simultaneously synthesizing FA images from fundus photographs while predicting retinal degeneration. The proposed system has the benefit of addressing the problem of imaging retinal vasculature in a non-invasive manner as well as predicting the existence of retinal abnormalities. We use a semi-supervised approach to train our GAN using multiple weighted losses on different modalities of data. Our experiments validate that the proposed architecture exceeds recent state-of-the-art generative networks for fundus-to-angiography synthesis. Moreover, our vision transformer-based discriminators generalize quite well on out-of-distribution data sets for retinal disease prediction.
    Efficient-CapsNet: Capsule Network with Self-Attention Routing. (arXiv:2101.12491v2 [cs.CV] UPDATED)
    (2 min) Deep convolutional neural networks, assisted by architectural design strategies, make extensive use of data augmentation techniques and layers with a high number of feature maps to embed object transformations. That is highly inefficient and for large datasets implies a massive redundancy of features detectors. Even though capsules networks are still in their infancy, they constitute a promising solution to extend current convolutional networks and endow artificial visual perception with a process to encode more efficiently all feature affine transformations. Indeed, a properly working capsule network should theoretically achieve higher results with a considerably lower number of parameters count due to intrinsic capability to generalize to novel viewpoints. Nevertheless, little attention has been given to this relevant aspect. In this paper, we investigate the efficiency of capsule networks and, pushing their capacity to the limits with an extreme architecture with barely 160K parameters, we prove that the proposed architecture is still able to achieve state-of-the-art results on three different datasets with only 2% of the original CapsNet parameters. Moreover, we replace dynamic routing with a novel non-iterative, highly parallelizable routing algorithm that can easily cope with a reduced number of capsules. Extensive experimentation with other capsule implementations has proved the effectiveness of our methodology and the capability of capsule networks to efficiently embed visual representations more prone to generalization.
    Minimizing L1 over L2 norms on the gradient. (arXiv:2101.00809v2 [math.NA] UPDATED)
    (2 min) In this paper, we study the L1/L2 minimization on the gradient for imaging applications. Several recent works have demonstrated that L1/L2 is better than the L1 norm when approximating the L0 norm to promote sparsity. Consequently, we postulate that applying L1/L2 on the gradient is better than the classic total variation (the L1 norm on the gradient) to enforce the sparsity of the image gradient. To verify our hypothesis, we consider a constrained formulation to reveal empirical evidence on the superiority of L1/L2 over L1 when recovering piecewise constant signals from low-frequency measurements. Numerically, we design a specific splitting scheme, under which we can prove subsequential and global convergence for the alternating direction method of multipliers (ADMM) under certain conditions. Experimentally, we demonstrate visible improvements of L1/L2 over L1 and other nonconvex regularizations for image recovery from low-frequency measurements and two medical applications of MRI and CT reconstruction. All the numerical results show the efficiency of our proposed approach.
    The GIST and RIST of Iterative Self-Training for Semi-Supervised Segmentation. (arXiv:2103.17105v2 [cs.CV] UPDATED)
    (2 min) We consider the task of semi-supervised semantic segmentation, where we aim to produce pixel-wise semantic object masks given only a small number of human-labeled training examples. We focus on iterative self-training methods in which we explore the behavior of self-training over multiple refinement stages. We show that iterative self-training leads to performance degradation if done na\"ively with a fixed ratio of human-labeled to pseudo-labeled training examples. We propose Greedy Iterative Self-Training (GIST) and Random Iterative Self-Training (RIST) strategies that alternate between training on either human-labeled data or pseudo-labeled data at each refinement stage, resulting in a performance boost rather than degradation. We further show that GIST and RIST can be combined with existing semi-supervised learning methods to boost performance.
    SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events. (arXiv:2103.15538v3 [cs.CV] UPDATED)
    (2 min) Traffic event cognition and reasoning in videos is an important task that has a wide range of applications in intelligent transportation, assisted driving, and autonomous vehicles. In this paper, we create a novel dataset, SUTD-TrafficQA (Traffic Question Answering), which takes the form of video QA based on the collected 10,080 in-the-wild videos and annotated 62,535 QA pairs, for benchmarking the cognitive capability of causal inference and event understanding models in complex traffic scenarios. Specifically, we propose 6 challenging reasoning tasks corresponding to various traffic scenarios, so as to evaluate the reasoning capability over different kinds of complex yet practical traffic events. Moreover, we propose Eclipse, a novel Efficient glimpse network via dynamic inference, in order to achieve computation-efficient and reliable video reasoning. The experiments show that our method achieves superior performance while reducing the computation cost significantly. The project page: https://github.com/SUTDCV/SUTD-TrafficQA.
    3DIoUMatch: Leveraging IoU Prediction for Semi-Supervised 3D Object Detection. (arXiv:2012.04355v3 [cs.CV] UPDATED)
    (2 min) 3D object detection is an important yet demanding task that heavily relies on difficult to obtain 3D annotations. To reduce the required amount of supervision, we propose 3DIoUMatch, a novel semi-supervised method for 3D object detection applicable to both indoor and outdoor scenes. We leverage a teacher-student mutual learning framework to propagate information from the labeled to the unlabeled train set in the form of pseudo-labels. However, due to the high task complexity, we observe that the pseudo-labels suffer from significant noise and are thus not directly usable. To that end, we introduce a confidence-based filtering mechanism, inspired by FixMatch. We set confidence thresholds based upon the predicted objectness and class probability to filter low-quality pseudo-labels. While effective, we observe that these two measures do not sufficiently capture localization quality. We therefore propose to use the estimated 3D IoU as a localization metric and set category-aware self-adjusted thresholds to filter poorly localized proposals. We adopt VoteNet as our backbone detector on indoor datasets while we use PV-RCNN on the autonomous driving dataset, KITTI. Our method consistently improves state-of-the-art methods on both ScanNet and SUN-RGBD benchmarks by significant margins under all label ratios (including fully labeled setting). For example, when training using only 10\% labeled data on ScanNet, 3DIoUMatch achieves 7.7% absolute improvement on mAP@0.25 and 8.5% absolute improvement on mAP@0.5 upon the prior art. On KITTI, we are the first to demonstrate semi-supervised 3D object detection and our method surpasses a fully supervised baseline from 1.8% to 7.6% under different label ratios and categories.
    PAN++: Towards Efficient and Accurate End-to-End Spotting of Arbitrarily-Shaped Text. (arXiv:2105.00405v3 [cs.CV] UPDATED)
    (2 min) Scene text detection and recognition have been well explored in the past few years. Despite the progress, efficient and accurate end-to-end spotting of arbitrarily-shaped text remains challenging. In this work, we propose an end-to-end text spotting framework, termed PAN++, which can efficiently detect and recognize text of arbitrary shapes in natural scenes. PAN++ is based on the kernel representation that reformulates a text line as a text kernel (central region) surrounded by peripheral pixels. By systematically comparing with existing scene text representations, we show that our kernel representation can not only describe arbitrarily-shaped text but also well distinguish adjacent text. Moreover, as a pixel-based representation, the kernel representation can be predicted by a single fully convolutional network, which is very friendly to real-time applications. Taking the advantages of the kernel representation, we design a series of components as follows: 1) a computationally efficient feature enhancement network composed of stacked Feature Pyramid Enhancement Modules (FPEMs); 2) a lightweight detection head cooperating with Pixel Aggregation (PA); and 3) an efficient attention-based recognition head with Masked RoI. Benefiting from the kernel representation and the tailored components, our method achieves high inference speed while maintaining competitive accuracy. Extensive experiments show the superiority of our method. For example, the proposed PAN++ achieves an end-to-end text spotting F-measure of 64.9 at 29.2 FPS on the Total-Text dataset, which significantly outperforms the previous best method. Code will be available at: https://git.io/PAN.
    PillarSegNet: Pillar-based Semantic Grid Map Estimation using Sparse LiDAR Data. (arXiv:2105.04169v2 [cs.CV] UPDATED)
    (2 min) Semantic understanding of the surrounding environment is essential for automated vehicles. The recent publication of the SemanticKITTI dataset stimulates the research on semantic segmentation of LiDAR point clouds in urban scenarios. While most existing approaches predict sparse pointwise semantic classes for the sparse input LiDAR scan, we propose PillarSegNet to be able to output a dense semantic grid map. In contrast to a previously proposed grid map method, PillarSegNet uses PointNet to learn features directly from the 3D point cloud and then conducts 2D semantic segmentation in the top view. To train and evaluate our approach, we use both sparse and dense ground truth, where the dense ground truth is obtained from multiple superimposed scans. Experimental results on the SemanticKITTI dataset show that PillarSegNet achieves a performance gain of about 10% mIoU over the state-of-the-art grid map method.
    ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training. (arXiv:2104.14129v2 [cs.LG] UPDATED)
    (2 min) The increasing size of neural network models has been critical for improvements in their accuracy, but device memory is not growing at the same rate. This creates fundamental challenges for training neural networks within limited memory environments. In this work, we propose ActNN, a memory-efficient training framework that stores randomly quantized activations for back propagation. We prove the convergence of ActNN for general network architectures, and we characterize the impact of quantization on the convergence via an exact expression for the gradient variance. Using our theory, we propose novel mixed-precision quantization strategies that exploit the activation's heterogeneity across feature dimensions, samples, and layers. These techniques can be readily applied to existing dynamic graph frameworks, such as PyTorch, simply by substituting the layers. We evaluate ActNN on mainstream computer vision models for classification, detection, and segmentation tasks. On all these tasks, ActNN compresses the activation to 2 bits on average, with negligible accuracy loss. ActNN reduces the memory footprint of the activation by 12x, and it enables training with a 6.6x to 14x larger batch size.
    Automatic size and pose homogenization with spatial transformer network to improve and accelerate pediatric segmentation. (arXiv:2107.02655v1 [cs.CV])
    (2 min) Due to a high heterogeneity in pose and size and to a limited number of available data, segmentation of pediatric images is challenging for deep learning methods. In this work, we propose a new CNN architecture that is pose and scale invariant thanks to the use of Spatial Transformer Network (STN). Our architecture is composed of three sequential modules that are estimated together during training: (i) a regression module to estimate a similarity matrix to normalize the input image to a reference one; (ii) a differentiable module to find the region of interest to segment; (iii) a segmentation module, based on the popular UNet architecture, to delineate the object. Unlike the original UNet, which strives to learn a complex mapping, including pose and scale variations, from a finite training dataset, our segmentation module learns a simpler mapping focusing on images with normalized pose and size. Furthermore, the use of an automatic bounding box detection through STN allows saving time and especially memory, while keeping similar performance. We test the proposed method in kidney and renal tumor segmentation on abdominal pediatric CT scanners. Results indicate that the estimated STN homogenization of size and pose accelerates the segmentation (25h), compared to standard data-augmentation (33h), while obtaining a similar quality for the kidney (88.01\% of Dice score) and improving the renal tumor delineation (from 85.52\% to 87.12\%).
    Differentially private federated deep learning for multi-site medical image segmentation. (arXiv:2107.02586v1 [eess.IV])
    (2 min) Collaborative machine learning techniques such as federated learning (FL) enable the training of models on effectively larger datasets without data transfer. Recent initiatives have demonstrated that segmentation models trained with FL can achieve performance similar to locally trained models. However, FL is not a fully privacy-preserving technique and privacy-centred attacks can disclose confidential patient data. Thus, supplementing FL with privacy-enhancing technologies (PTs) such as differential privacy (DP) is a requirement for clinical applications in a multi-institutional setting. The application of PTs to FL in medical imaging and the trade-offs between privacy guarantees and model utility, the ramifications on training performance and the susceptibility of the final models to attacks have not yet been conclusively investigated. Here we demonstrate the first application of differentially private gradient descent-based FL on the task of semantic segmentation in computed tomography. We find that high segmentation performance is possible under strong privacy guarantees with an acceptable training time penalty. We furthermore demonstrate the first successful gradient-based model inversion attack on a semantic segmentation model and show that the application of DP prevents it from divulging sensitive image features.
    Learned Visual Navigation for Under-Canopy Agricultural Robots. (arXiv:2107.02792v1 [cs.RO])
    (2 min) We describe a system for visually guided autonomous navigation of under-canopy farm robots. Low-cost under-canopy robots can drive between crop rows under the plant canopy and accomplish tasks that are infeasible for over-the-canopy drones or larger agricultural equipment. However, autonomously navigating them under the canopy presents a number of challenges: unreliable GPS and LiDAR, high cost of sensing, challenging farm terrain, clutter due to leaves and weeds, and large variability in appearance over the season and across crop types. We address these challenges by building a modular system that leverages machine learning for robust and generalizable perception from monocular RGB images from low-cost cameras, and model predictive control for accurate control in challenging terrain. Our system, CropFollow, is able to autonomously drive 485 meters per intervention on average, outperforming a state-of-the-art LiDAR based system (286 meters per intervention) in extensive field testing spanning over 25 km.
    Non-Local Representation based Mutual Affine-Transfer Network for Photorealistic Stylization. (arXiv:1907.10274v2 [cs.CV] UPDATED)
    (2 min) Photorealistic stylization aims to transfer the style of a reference photo onto a content photo in a natural fashion, such that the stylized image looks like a real photo taken by a camera. State-of-the-art methods stylize the image locally within each matched semantic region and are prone to global color inconsistency across semantic objects/parts, making the stylized image less photorealistic. To tackle the challenging issues, we propose a non-local representation scheme, constrained with a mutual affine-transfer network (NL-MAT). Through a dictionary-based decomposition, NL-MAT is able to successfully decouple matched non-local representations and color information of the image pair, such that the context correspondence between the image pair is incorporated naturally, which largely facilitates local style transfer in a global-consistent fashion. To the best of our knowledge, this is the first attempt to address the photorealistic stylization problem with a non-local representation scheme, such that no additional models or steps for semantic matching are required during stylization. Experimental results demonstrate that the proposed method is able to generate photorealistic results with local style transfer while preserving both the spatial structure and global color consistency of the content image.
    Coarse-to-fine Semantic Localization with HD Map for Autonomous Driving in Structural Scenes. (arXiv:2107.02557v1 [cs.CV])
    (2 min) Robust and accurate localization is an essential component for robotic navigation and autonomous driving. The use of cameras for localization with high definition map (HD Map) provides an affordable localization sensor set. Existing methods suffer from pose estimation failure due to error prone data association or initialization with accurate initial pose requirement. In this paper, we propose a cost-effective vehicle localization system with HD map for autonomous driving that uses cameras as primary sensors. To this end, we formulate vision-based localization as a data association problem that maps visual semantics to landmarks in HD map. Specifically, system initialization is finished in a coarse to fine manner by combining coarse GPS (Global Positioning System) measurement and fine pose searching. In tracking stage, vehicle pose is refined by implicitly aligning the semantic segmentation result between image and landmarks in HD maps with photometric consistency. Finally, vehicle pose is computed by pose graph optimization in a sliding window fashion. We evaluate our method on two datasets and demonstrate that the proposed approach yields promising localization results in different driving scenarios. Additionally, our approach is suitable for both monocular camera and multi-cameras that provides flexibility and improves robustness for the localization system.
    AE TextSpotter: Learning Visual and Linguistic Representation for Ambiguous Text Spotting. (arXiv:2008.00714v5 [cs.CV] UPDATED)
    (2 min) Scene text spotting aims to detect and recognize the entire word or sentence with multiple characters in natural images. It is still challenging because ambiguity often occurs when the spacing between characters is large or the characters are evenly spread in multiple rows and columns, making many visually plausible groupings of the characters (e.g. "BERLIN" is incorrectly detected as "BERL" and "IN" in Fig. 1(c)). Unlike previous works that merely employed visual features for text detection, this work proposes a novel text spotter, named Ambiguity Eliminating Text Spotter (AE TextSpotter), which learns both visual and linguistic features to significantly reduce ambiguity in text detection. The proposed AE TextSpotter has three important benefits. 1) The linguistic representation is learned together with the visual representation in a framework. To our knowledge, it is the first time to improve text detection by using a language model. 2) A carefully designed language module is utilized to reduce the detection confidence of incorrect text lines, making them easily pruned in the detection stage. 3) Extensive experiments show that AE TextSpotter outperforms other state-of-the-art methods by a large margin. For example, we carefully select a validation set of extremely ambiguous samples from the IC19-ReCTS dataset, where our approach surpasses other methods by more than 4%. The code has been released at https://github.com/whai362/AE_TextSpotter. The image list and evaluation scripts of the validation set have been released at https://github.com/whai362/TDA-ReCTS.
    DocSynth: A Layout Guided Approach for Controllable Document Image Synthesis. (arXiv:2107.02638v1 [cs.CV])
    (2 min) Despite significant progress on current state-of-the-art image generation models, synthesis of document images containing multiple and complex object layouts is a challenging task. This paper presents a novel approach, called DocSynth, to automatically synthesize document images based on a given layout. In this work, given a spatial layout (bounding boxes with object categories) as a reference by the user, our proposed DocSynth model learns to generate a set of realistic document images consistent with the defined layout. Also, this framework has been adapted to this work as a superior baseline model for creating synthetic document image datasets for augmenting real data during training for document layout analysis tasks. Different sets of learning objectives have been also used to improve the model performance. Quantitatively, we also compare the generated results of our model with real data using standard evaluation metrics. The results highlight that our model can successfully generate realistic and diverse document images with multiple objects. We also present a comprehensive qualitative analysis summary of the different scopes of synthetic image generation tasks. Lastly, to our knowledge this is the first work of its kind.
    Spatiotemporal Fusion in Remote Sensing. (arXiv:2107.02701v1 [cs.CV])
    (2 min) Remote sensing images and techniques are powerful tools to investigate earth surface. Data quality is the key to enhance remote sensing applications and obtaining a clear and noise-free set of data is very difficult in most situations due to the varying acquisition (e.g., atmosphere and season), sensor, and platform (e.g., satellite angles and sensor characteristics) conditions. With the increasing development of satellites, nowadays Terabytes of remote sensing images can be acquired every day. Therefore, information and data fusion can be particularly important in the remote sensing community. The fusion integrates data from various sources acquired asynchronously for information extraction, analysis, and quality improvement. In this chapter, we aim to discuss the theory of spatiotemporal fusion by investigating previous works, in addition to describing the basic concepts and some of its applications by summarizing our prior and ongoing works.
    Foreground-Aware Stylization and Consensus Pseudo-Labeling for Domain Adaptation of First-Person Hand Segmentation. (arXiv:2107.02718v1 [cs.CV])
    (2 min) Hand segmentation is a crucial task in first-person vision. Since first-person images exhibit strong bias in appearance among different environments, adapting a pre-trained segmentation model to a new domain is required in hand segmentation. Here, we focus on appearance gaps for hand regions and backgrounds separately. We propose (i) foreground-aware image stylization and (ii) consensus pseudo-labeling for domain adaptation of hand segmentation. We stylize source images independently for the foreground and background using target images as style. To resolve the domain shift that the stylization has not addressed, we apply careful pseudo-labeling by taking a consensus between the models trained on the source and stylized source images. We validated our method on domain adaptation of hand segmentation from real and simulation images. Our method achieved state-of-the-art performance in both settings. We also demonstrated promising results in challenging multi-target domain adaptation and domain generalization settings. Code is available at https://github.com/ut-vision/FgSty-CPL.
    Predicate correlation learning for scene graph generation. (arXiv:2107.02713v1 [cs.CV])
    (2 min) For a typical Scene Graph Generation (SGG) method, there is often a large gap in the performance of the predicates' head classes and tail classes. This phenomenon is mainly caused by the semantic overlap between different predicates as well as the long-tailed data distribution. In this paper, a Predicate Correlation Learning (PCL) method for SGG is proposed to address the above two problems by taking the correlation between predicates into consideration. To describe the semantic overlap between strong-correlated predicate classes, a Predicate Correlation Matrix (PCM) is defined to quantify the relationship between predicate pairs, which is dynamically updated to remove the matrix's long-tailed bias. In addition, PCM is integrated into a Predicate Correlation Loss function ($L_{PC}$) to reduce discouraging gradients of unannotated classes. The proposed method is evaluated on Visual Genome benchmark, where the performance of the tail classes is significantly improved when built on the existing methods.
    Memory-based Jitter: Improving Visual Recognition on Long-tailed Data with Diversity In Memory. (arXiv:2008.09809v6 [cs.CV] UPDATED)
    (3 min) This paper considers deep visual recognition on long-tailed data. To be general, we consider two applied scenarios, \ie, deep classification and deep metric learning. Under the long-tailed data distribution, the majority classes (\ie, tail classes) only occupy relatively few samples and are prone to lack of within-class diversity. A radical solution is to augment the tail classes with higher diversity. To this end, we introduce a simple and reliable method named Memory-based Jitter (MBJ). We observe that during training, the deep model constantly changes its parameters after every iteration, yielding the phenomenon of \emph{weight jitters}. Consequentially, given a same image as the input, two historical editions of the model generate two different features in the deeply-embedded space, resulting in \emph{feature jitters}. Using a memory bank, we collect these (model or feature) jitters across multiple training iterations and get the so-called Memory-based Jitter. The accumulated jitters enhance the within-class diversity for the tail classes and consequentially improves long-tailed visual recognition. With slight modifications, MBJ is applicable for two fundamental visual recognition tasks, \emph{i.e.}, deep image classification and deep metric learning (on long-tailed data). Extensive experiments on five long-tailed classification benchmarks and two deep metric learning benchmarks demonstrate significant improvement. Moreover, the achieved performance are on par with the state of the art on both tasks.
    Neural Human Video Rendering by Learning Dynamic Textures and Rendering-to-Video Translation. (arXiv:2001.04947v3 [cs.GR] UPDATED)
    (2 min) Synthesizing realistic videos of humans using neural networks has been a popular alternative to the conventional graphics-based rendering pipeline due to its high efficiency. Existing works typically formulate this as an image-to-image translation problem in 2D screen space, which leads to artifacts such as over-smoothing, missing body parts, and temporal instability of fine-scale detail, such as pose-dependent wrinkles in the clothing. In this paper, we propose a novel human video synthesis method that approaches these limiting factors by explicitly disentangling the learning of time-coherent fine-scale details from the embedding of the human in 2D screen space. More specifically, our method relies on the combination of two convolutional neural networks (CNNs). Given the pose information, the first CNN predicts a dynamic texture map that contains time-coherent high-frequency details, and the second CNN conditions the generation of the final video on the temporally coherent output of the first CNN. We demonstrate several applications of our approach, such as human reenactment and novel view synthesis from monocular video, where we show significant improvement over the state of the art both qualitatively and quantitatively.
    Combining EfficientNet and Vision Transformers for Video Deepfake Detection. (arXiv:2107.02612v1 [cs.CV])
    (2 min) Deepfakes are the result of digital manipulation to obtain credible videos in order to deceive the viewer. This is done through deep learning techniques based on autoencoders or GANs that become more accessible and accurate year after year, resulting in fake videos that are very difficult to distinguish from real ones. Traditionally, CNN networks have been used to perform deepfake detection, with the best results obtained using methods based on EfficientNet B7. In this study, we combine various types of Vision Transformers with a convolutional EfficientNet B0 used as a feature extractor, obtaining comparable results with some very recent methods that use Vision Transformers. Differently from the state-of-the-art approaches, we use neither distillation nor ensemble methods. The best model achieved an AUC of 0.951 and an F1 score of 88.0%, very close to the state-of-the-art on the DeepFake Detection Challenge (DFDC).
    General Purpose (GenP) Bioimage Ensemble of Handcrafted and Learned Features with Data Augmentation. (arXiv:1904.08084v4 [cs.CV] UPDATED)
    (2 min) Bioimage classification plays a crucial role in many biological problems. In this work, we present a new General Purpose (GenP) ensemble that boosts performance by combining local features, dense sampling features, and deep learning approaches. First, we introduce three new methods for data augmentation based on PCA/DCT; second, we show that different data augmentation approaches can boost the performance of an ensemble of CNNs; and, finally, we propose a set of handcrafted/learned descriptors that are highly generalizable. Each handcrafted descriptor is used to train a different Support Vector Machine (SVM), and the different SVMs are combined with the ensemble of CNNs. Our method is evaluated on a diverse set of bioimage classification problems. Results demonstrate that the proposed GenP bioimage ensemble obtains state-of-the-art performance without any ad-hoc dataset tuning of parameters (thus avoiding the risk of overfitting/overtraining).
    S2FGAN: Semantically Aware Interactive Sketch-to-Face Translation. (arXiv:2011.14785v2 [cs.CV] UPDATED)
    (2 min) Interactive facial image manipulation attempts to edit single and multiple face attributes using a photo-realistic face and/or semantic mask as input. In the absence of the photo-realistic image (only sketch/mask available), previous methods only retrieve the original face but ignore the potential of aiding model controllability and diversity in the translation process. This paper proposes a sketch-to-image generation framework called S2FGAN, aiming to improve users' ability to interpret and flexibility of face attribute editing from a simple sketch. The proposed framework modifies the constrained latent space semantics trained on Generative Adversarial Networks (GANs). We employ two latent spaces to control the face appearance and adjust the desired attributes of the generated face. Instead of constraining the translation process by using a reference image, the users can command the model to retouch the generated images by involving the semantic information in the generation process. In this way, our method can manipulate single or multiple face attributes by only specifying attributes to be changed. Extensive experimental results on CelebAMask-HQ dataset empirically shows our superior performance and effectiveness on this task. Our method successfully outperforms state-of-the-art methods on attribute manipulation by exploiting greater control of attribute intensity.
    Unsupervised learning of MRI tissue properties using MRI physics models. (arXiv:2107.02704v1 [eess.IV])
    (2 min) In neuroimaging, MRI tissue properties characterize underlying neurobiology, provide quantitative biomarkers for neurological disease detection and analysis, and can be used to synthesize arbitrary MRI contrasts. Estimating tissue properties from a single scan session using a protocol available on all clinical scanners promises to reduce scan time and cost, enable quantitative analysis in routine clinical scans and provide scan-independent biomarkers of disease. However, existing tissue properties estimation methods - most often $\mathbf{T_1}$ relaxation, $\mathbf{T_2^*}$ relaxation, and proton density ($\mathbf{PD}$) - require data from multiple scan sessions and cannot estimate all properties from a single clinically available MRI protocol such as the multiecho MRI scan. In addition, the widespread use of non-standard acquisition parameters across clinical imaging sites require estimation methods that can generalize across varying scanner parameters. However, existing learning methods are acquisition protocol specific and cannot estimate from heterogenous clinical data from different imaging sites. In this work we propose an unsupervised deep-learning strategy that employs MRI physics to estimate all three tissue properties from a single multiecho MRI scan session, and generalizes across varying acquisition parameters. The proposed strategy optimizes accurate synthesis of new MRI contrasts from estimated latent tissue properties, enabling unsupervised training, we also employ random acquisition parameters during training to achieve acquisition generalization. We provide the first demonstration of estimating all tissue properties from a single multiecho scan session. We demonstrate improved accuracy and generalizability for tissue property estimation and MRI synthesis.
    A deep-learning--based multimodal depth-aware dynamic hand gesture recognition system. (arXiv:2107.02543v1 [cs.CV])
    (2 min) Any spatio-temporal movement or reorientation of the hand, done with the intention of conveying a specific meaning, can be considered as a hand gesture. Inputs to hand gesture recognition systems can be in several forms, such as depth images, monocular RGB, or skeleton joint points. We observe that raw depth images possess low contrasts in the hand regions of interest (ROI). They do not highlight important details to learn, such as finger bending information (whether a finger is overlapping the palm, or another finger). Recently, in deep-learning--based dynamic hand gesture recognition, researchers are tying to fuse different input modalities (e.g. RGB or depth images and hand skeleton joint points) to improve the recognition accuracy. In this paper, we focus on dynamic hand gesture (DHG) recognition using depth quantized image features and hand skeleton joint points. In particular, we explore the effect of using depth-quantized features in Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) based multi-modal fusion networks. We find that our method improves existing results on the SHREC-DHG-14 dataset. Furthermore, using our method, we show that it is possible to reduce the resolution of the input images by more than four times and still obtain comparable or better accuracy to that of the resolutions used in previous methods.
    Unsupervised Knowledge-Transfer for Learned Image Reconstruction. (arXiv:2107.02572v1 [eess.IV])
    (2 min) Deep learning-based image reconstruction approaches have demonstrated impressive empirical performance in many imaging modalities. These approaches generally require a large amount of high-quality training data, which is often not available. To circumvent this issue, we develop a novel unsupervised knowledge-transfer paradigm for learned iterative reconstruction within a Bayesian framework. The proposed approach learns an iterative reconstruction network in two phases. The first phase trains a reconstruction network with a set of ordered pairs comprising of ground truth images and measurement data. The second phase fine-tunes the pretrained network to the measurement data without supervision. Furthermore, the framework delivers uncertainty information over the reconstructed image. We present extensive experimental results on low-dose and sparse-view computed tomography, showing that the proposed framework significantly improves reconstruction quality not only visually, but also quantitatively in terms of PSNR and SSIM, and is competitive with several state-of-the-art supervised and unsupervised reconstruction techniques.
    Attention over learned object embeddings enables complex visual reasoning. (arXiv:2012.08508v2 [cs.CV] UPDATED)
    (2 min) Neural networks have achieved success in a wide array of perceptual tasks but often fail at tasks involving both perception and higher-level reasoning. On these more challenging tasks, bespoke approaches (such as modular symbolic components, independent dynamics models or semantic parsers) targeted towards that specific type of task have typically performed better. The downside to these targeted approaches, however, is that they can be more brittle than general-purpose neural networks, requiring significant modification or even redesign according to the particular task at hand. Here, we propose a more general neural-network-based approach to dynamic visual reasoning problems that obtains state-of-the-art performance on three different domains, in each case outperforming bespoke modular approaches tailored specifically to the task. Our method relies on learned object-centric representations, self-attention and self-supervised dynamics learning, and all three elements together are required for strong performance to emerge. The success of this combination suggests that there may be no need to trade off flexibility for performance on problems involving spatio-temporal or causal-style reasoning. With the right soft biases and learning objectives in a neural network we may be able to attain the best of both worlds.
    Learning the Best Pooling Strategy for Visual Semantic Embedding. (arXiv:2011.04305v5 [cs.CV] UPDATED)
    (2 min) Visual Semantic Embedding (VSE) is a dominant approach for vision-language retrieval, which aims at learning a deep embedding space such that visual data are embedded close to their semantic text labels or descriptions. Recent VSE models use complex methods to better contextualize and aggregate multi-modal features into holistic embeddings. However, we discover that surprisingly simple (but carefully selected) global pooling functions (e.g., max pooling) outperform those complex models, across different feature extractors. Despite its simplicity and effectiveness, seeking the best pooling function for different data modality and feature extractor is costly and tedious, especially when the size of features varies (e.g., text, video). Therefore, we propose a Generalized Pooling Operator (GPO), which learns to automatically adapt itself to the best pooling strategy for different features, requiring no manual tuning while staying effective and efficient. We extend the VSE model using this proposed GPO and denote it as VSE$\infty$. Without bells and whistles, VSE$\infty$ outperforms previous VSE methods significantly on image-text retrieval benchmarks across popular feature extractors. With a simple adaptation, variants of VSE$\infty$ further demonstrate its strength by achieving the new state of the art on two video-text retrieval datasets. Comprehensive experiments and visualizations confirm that GPO always discovers the best pooling strategy and can be a plug-and-play feature aggregation module for standard VSE models. Code and pre-trained models are available at https://vse-infty.github.io.
    Red Blood Cell Segmentation with Overlapping Cell Separation and Classification on Imbalanced Dataset. (arXiv:2012.01321v3 [eess.IV] UPDATED)
    (2 min) Automated red blood cell (RBC) classification on blood smear images helps hematologists to analyze RBC lab results in a reduced time and cost. However, overlapping cells can cause incorrect predicted results, and so they have to be separated into multiple single RBCs before classifying. To classify multiple classes with deep learning, imbalance problems are common in medical imaging because normal samples are always higher than rare disease samples. This paper presents a new method to segment and classify RBCs from blood smear images, specifically to tackle cell overlapping and data imbalance problems. Focusing on overlapping cell separation, our segmentation process first estimates ellipses to represent RBCs. The method detects the concave points and then finds the ellipses using directed ellipse fitting. The accuracy from 20 blood smear images was 0.889. Classification requires balanced training datasets. However, some RBC types are rare. The imbalance ratio of this dataset was 34.538 for 12 RBC classes from 20,875 individual RBC samples. The use of machine learning for RBC classification with an imbalanced dataset is hence more challenging than many other applications. We analyzed techniques to deal with this problem. The best accuracy and F1-score were 0.921 and 0.8679, respectively, using EfficientNet-B1 with augmentation. Experimental results showed that the weight balancing technique with augmentation had the potential to deal with imbalance problems by improving the F1-score on minority classes, while data augmentation significantly improved the overall classification performance.
    Fractional order graph neural network. (arXiv:2001.04026v3 [cs.LG] UPDATED)
    (2 min) This paper proposes fractional order graph neural networks (FGNNs), optimized by the approximation strategy to address the challenges of local optimum of classic and fractional graph neural networks which are specialised at aggregating information from the feature and adjacent matrices of connected nodes and their neighbours to solve learning tasks on non-Euclidean data such as graphs. Meanwhile the approximate calculation of fractional order gradients also overcomes the high computational complexity of fractional order derivations. We further prove that such an approximation is feasible and the FGNN is unbiased towards global optimization solution. Extensive experiments on citation networks show that FGNN achieves great advantage over baseline models when selected appropriate fractional order.
    Depth-supervised NeRF: Fewer Views and Faster Training for Free. (arXiv:2107.02791v1 [cs.CV])
    (2 min) One common failure mode of Neural Radiance Field (NeRF) models is fitting incorrect geometries when given an insufficient number of input views. We propose DS-NeRF (Depth-supervised Neural Radiance Fields), a loss for learning neural radiance fields that takes advantage of readily-available depth supervision. Our key insight is that sparse depth supervision can be used to regularize the learned geometry, a crucial component for effectively rendering novel views using NeRF. We exploit the fact that current NeRF pipelines require images with known camera poses that are typically estimated by running structure-from-motion (SFM). Crucially, SFM also produces sparse 3D points that can be used as ``free" depth supervision during training: we simply add a loss to ensure that depth rendered along rays that intersect these 3D points is close to the observed depth. We find that DS-NeRF can render more accurate images given fewer training views while training 2-6x faster. With only two training views on real-world images, DS-NeRF significantly outperforms NeRF as well as other sparse-view variants. We show that our loss is compatible with these NeRF models, demonstrating that depth is a cheap and easily digestible supervisory signal. Finally, we show that DS-NeRF supports other types of depth supervision such as scanned depth sensors and RGBD reconstruction outputs.
    Representation Theoretic Patterns in Multi-Frequency Class Averaging for Three-Dimensional Cryo-Electron Microscopy. (arXiv:1906.01082v4 [eess.IV] UPDATED)
    (2 min) We develop in this paper a novel intrinsic classification algorithm -- multi-frequency class averaging (MFCA) -- for classifying noisy projection images obtained from three-dimensional cryo-electron microscopy (cryo-EM) by the similarity among their viewing directions. This new algorithm leverages multiple irreducible representations of the unitary group to introduce additional redundancy into the representation of the optimal in-plane rotational alignment, extending and outperforming the existing class averaging algorithm that uses only a single representation. The formal algebraic model and representation theoretic patterns of the proposed MFCA algorithm extend the framework of Hadani and Singer to arbitrary irreducible representations of the unitary group. We conceptually establish the consistency and stability of MFCA by inspecting the spectral properties of a generalized local parallel transport operator through the lens of Wigner $D$-matrices. We demonstrate the efficacy of the proposed algorithm with numerical experiments.
    Real-time Pose Estimation from Images for Multiple Humanoid Robots. (arXiv:2107.02675v1 [cs.RO])
    (2 min) Pose estimation commonly refers to computer vision methods that recognize people's body postures in images or videos. With recent advancements in deep learning, we now have compelling models to tackle the problem in real-time. Since these models are usually designed for human images, one needs to adapt existing models to work on other creatures, including robots. This paper examines different state-of-the-art pose estimation models and proposes a lightweight model that can work in real-time on humanoid robots in the RoboCup Humanoid League environment. Additionally, we present a novel dataset called the HumanoidRobotPose dataset. The results of this work have the potential to enable many advanced behaviors for soccer-playing robots.
    iPOKE: Poking a Still Image for Controlled Stochastic Video Synthesis. (arXiv:2107.02790v1 [cs.CV])
    (2 min) How would a static scene react to a local poke? What are the effects on other parts of an object if you could locally push it? There will be distinctive movement, despite evident variations caused by the stochastic nature of our world. These outcomes are governed by the characteristic kinematics of objects that dictate their overall motion caused by a local interaction. Conversely, the movement of an object provides crucial information about its underlying distinctive kinematics and the interdependencies between its parts. This two-way relation motivates learning a bijective mapping between object kinematics and plausible future image sequences. Therefore, we propose iPOKE - invertible Prediction of Object Kinematics - that, conditioned on an initial frame and a local poke, allows to sample object kinematics and establishes a one-to-one correspondence to the corresponding plausible videos, thereby providing a controlled stochastic video synthesis. In contrast to previous works, we do not generate arbitrary realistic videos, but provide efficient control of movements, while still capturing the stochastic nature of our environment and the diversity of plausible outcomes it entails. Moreover, our approach can transfer kinematics onto novel object instances and is not confined to particular object classes. Project page is available at https://bit.ly/3dJN4Lf
    Anomaly Detection using Edge Computing in Video Surveillance System: Review. (arXiv:2107.02778v1 [cs.CV])
    (2 min) The current concept of Smart Cities influences urban planners and researchers to provide modern, secured and sustainable infrastructure and give a decent quality of life to its residents. To fulfill this need video surveillance cameras have been deployed to enhance the safety and well-being of the citizens. Despite technical developments in modern science, abnormal event detection in surveillance video systems is challenging and requires exhaustive human efforts. In this paper, we surveyed various methodologies developed to detect anomalies in intelligent video surveillance. Firstly, we revisit the surveys on anomaly detection in the last decade. We then present a systematic categorization of methodologies developed for ease of understanding. Considering the notion of anomaly depends on context, we identify different objects-of-interest and publicly available datasets in anomaly detection. Since anomaly detection is considered a time-critical application of computer vision, our emphasis is on anomaly detection using edge devices and approaches explicitly designed for them. Further, we discuss the challenges and opportunities involved in anomaly detection at the edge.
    Confidence-based Out-of-Distribution Detection: A Comparative Study and Analysis. (arXiv:2107.02568v1 [cs.CV])
    (2 min) Image classification models deployed in the real world may receive inputs outside the intended data distribution. For critical applications such as clinical decision making, it is important that a model can detect such out-of-distribution (OOD) inputs and express its uncertainty. In this work, we assess the capability of various state-of-the-art approaches for confidence-based OOD detection through a comparative study and in-depth analysis. First, we leverage a computer vision benchmark to reproduce and compare multiple OOD detection methods. We then evaluate their capabilities on the challenging task of disease classification using chest X-rays. Our study shows that high performance in a computer vision task does not directly translate to accuracy in a medical imaging task. We analyse factors that affect performance of the methods between the two tasks. Our results provide useful insights for developing the next generation of OOD detection methods.
    Detecting Hypo-plastic Left Heart Syndrome in Fetal Ultrasound via Disease-specific Atlas Maps. (arXiv:2107.02643v1 [eess.IV])
    (2 min) Fetal ultrasound screening during pregnancy plays a vital role in the early detection of fetal malformations which have potential long-term health impacts. The level of skill required to diagnose such malformations from live ultrasound during examination is high and resources for screening are often limited. We present an interpretable, atlas-learning segmentation method for automatic diagnosis of Hypo-plastic Left Heart Syndrome (HLHS) from a single `4 Chamber Heart' view image. We propose to extend the recently introduced Image-and-Spatial Transformer Networks (Atlas-ISTN) into a framework that enables sensitising atlas generation to disease. In this framework we can jointly learn image segmentation, registration, atlas construction and disease prediction while providing a maximum level of clinical interpretability compared to direct image classification methods. As a result our segmentation allows diagnoses competitive with expert-derived manual diagnosis and yields an AUC-ROC of 0.978 (1043 cases for training, 260 for validation and 325 for testing).
    Depth-Aware Multi-Grid Deep Homography Estimation with Contextual Correlation. (arXiv:2107.02524v1 [cs.CV])
    (2 min) Homography estimation is an important task in computer vision, such as image stitching, video stabilization, and camera calibration. Traditional homography estimation methods heavily depend on the quantity and distribution of feature points, leading to poor robustness in textureless scenes. The learning solutions, on the contrary, try to learn robust deep features but demonstrate unsatisfying performance in the scenes of low overlap rates. In this paper, we address the two problems simultaneously, by designing a contextual correlation layer, which can capture the long-range correlation on feature maps and flexibly be bridged in a learning framework. In addition, considering that a single homography can not represent the complex spatial transformation in depth-varying images with parallax, we propose to predict multi-grid homography from global to local. Moreover, we equip our network with depth perception capability, by introducing a novel depth-aware shape-preserved loss. Extensive experiments demonstrate the superiority of our method over other state-of-the-art solutions in the synthetic benchmark dataset and real-world dataset. The codes and models will be available at https://github.com/nie-lang/Multi-Grid-Deep-Homogarphy.
    Hyperspectral Pansharpening Based on Improved Deep Image Prior and Residual Reconstruction. (arXiv:2107.02630v1 [cs.CV])
    (2 min) Hyperspectral pansharpening aims to synthesize a low-resolution hyperspectral image (LR-HSI) with a registered panchromatic image (PAN) to generate an enhanced HSI with high spectral and spatial resolution. Recently proposed HS pansharpening methods have obtained remarkable results using deep convolutional networks (ConvNets), which typically consist of three steps: (1) up-sampling the LR-HSI, (2) predicting the residual image via a ConvNet, and (3) obtaining the final fused HSI by adding the outputs from first and second steps. Recent methods have leveraged Deep Image Prior (DIP) to up-sample the LR-HSI due to its excellent ability to preserve both spatial and spectral information, without learning from large data sets. However, we observed that the quality of up-sampled HSIs can be further improved by introducing an additional spatial-domain constraint to the conventional spectral-domain energy function. We define our spatial-domain constraint as the $L_1$ distance between the predicted PAN image and the actual PAN image. To estimate the PAN image of the up-sampled HSI, we also propose a learnable spectral response function (SRF). Moreover, we noticed that the residual image between the up-sampled HSI and the reference HSI mainly consists of edge information and very fine structures. In order to accurately estimate fine information, we propose a novel over-complete network, called HyperKite, which focuses on learning high-level features by constraining the receptive from increasing in the deep layers. We perform experiments on three HSI datasets to demonstrate the superiority of our DIP-HyperKite over the state-of-the-art pansharpening methods. The deployment codes, pre-trained models, and final fusion outputs of our DIP-HyperKite and the methods used for the comparisons will be publicly made available at https://github.com/wgcban/DIP-HyperKite.git.
    Embracing the Dark Knowledge: Domain Generalization Using Regularized Knowledge Distillation. (arXiv:2107.02629v1 [cs.CV])
    (2 min) Though convolutional neural networks are widely used in different tasks, lack of generalization capability in the absence of sufficient and representative data is one of the challenges that hinder their practical application. In this paper, we propose a simple, effective, and plug-and-play training strategy named Knowledge Distillation for Domain Generalization (KDDG) which is built upon a knowledge distillation framework with the gradient filter as a novel regularization term. We find that both the ``richer dark knowledge" from the teacher network, as well as the gradient filter we proposed, can reduce the difficulty of learning the mapping which further improves the generalization ability of the model. We also conduct experiments extensively to show that our framework can significantly improve the generalization capability of deep neural networks in different tasks including image classification, segmentation, reinforcement learning by comparing our method with existing state-of-the-art domain generalization techniques. Last but not the least, we propose to adopt two metrics to analyze our proposed method in order to better understand how our proposed method benefits the generalization capability of deep neural networks.
    VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer. (arXiv:2107.02681v1 [cs.CL])
    (2 min) Since visual perception can give rich information beyond text descriptions for world understanding, there has been increasing interest in leveraging visual grounding for language learning. Recently, vokenization has attracted attention by using the predictions of a text-to-image retrieval model as labels for language model supervision. Despite its success, the method suffers from approximation error of using finite image labels and the lack of vocabulary diversity of a small image-text dataset. To overcome these limitations, we present VidLanKD, a video-language knowledge distillation method for improving language understanding. We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset. To avoid approximation error, we propose to use different knowledge distillation objectives. In addition, the use of a large-scale video-text dataset helps learn diverse and richer vocabularies. In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models, on several downstream language understanding tasks including GLUE, SQuAD, and SWAG. We also demonstrate the improved world knowledge, physical reasoning, and temporal reasoning capabilities of our model by evaluating on the GLUE-diagnostics, PIQA, and TRACIE datasets. Lastly, we present comprehensive ablation studies as well as visualizations of the learned text-to-video grounding results of our teacher and student language models. Our code and models are available at: https://github.com/zinengtang/VidLanKD
    Point Cloud Registration using Representative Overlapping Points. (arXiv:2107.02583v1 [cs.CV])
    (2 min) 3D point cloud registration is a fundamental task in robotics and computer vision. Recently, many learning-based point cloud registration methods based on correspondences have emerged. However, these methods heavily rely on such correspondences and meet great challenges with partial overlap. In this paper, we propose ROPNet, a new deep learning model using Representative Overlapping Points with discriminative features for registration that transforms partial-to-partial registration into partial-to-complete registration. Specifically, we propose a context-guided module which uses an encoder to extract global features for predicting point overlap score. To better find representative overlapping points, we use the extracted global features for coarse alignment. Then, we introduce a Transformer to enrich point features and remove non-representative points based on point overlap score and feature matching. A similarity matrix is built in a partial-to-complete mode, and finally, weighted SVD is adopted to estimate a transformation matrix. Extensive experiments over ModelNet40 using noisy and partially overlapping point clouds show that the proposed method outperforms traditional and learning-based methods, achieving state-of-the-art performance. The code is available at https://github.com/zhulf0804/ROPNet.
    Learning Semantic Segmentation of Large-Scale Point Clouds with Random Sampling. (arXiv:2107.02389v1 [cs.CV])
    (2 min) We study the problem of efficient semantic segmentation of large-scale 3D point clouds. By relying on expensive sampling techniques or computationally heavy pre/post-processing steps, most existing approaches are only able to be trained and operate over small-scale point clouds. In this paper, we introduce RandLA-Net, an efficient and lightweight neural architecture to directly infer per-point semantics for large-scale point clouds. The key to our approach is to use random point sampling instead of more complex point selection approaches. Although remarkably computation and memory efficient, random sampling can discard key features by chance. To overcome this, we introduce a novel local feature aggregation module to progressively increase the receptive field for each 3D point, thereby effectively preserving geometric details. Comparative experiments show that our RandLA-Net can process 1 million points in a single pass up to 200x faster than existing approaches. Moreover, extensive experiments on five large-scale point cloud datasets, including Semantic3D, SemanticKITTI, Toronto3D, NPM3D and S3DIS, demonstrate the state-of-the-art semantic segmentation performance of our RandLA-Net.
    NRST: Non-rigid Surface Tracking from Monocular Video. (arXiv:2107.02407v1 [cs.CV])
    (2 min) We propose an efficient method for non-rigid surface tracking from monocular RGB videos. Given a video and a template mesh, our algorithm sequentially registers the template non-rigidly to each frame. We formulate the per-frame registration as an optimization problem that includes a novel texture term specifically tailored towards tracking objects with uniform texture but fine-scale structure, such as the regular micro-structural patterns of fabric. Our texture term exploits the orientation information in the micro-structures of the objects, e.g., the yarn patterns of fabrics. This enables us to accurately track uniformly colored materials that have these high frequency micro-structures, for which traditional photometric terms are usually less effective. The results demonstrate the effectiveness of our method on both general textured non-rigid objects and monochromatic fabrics.
    Contrastive Multimodal Fusion with TupleInfoNCE. (arXiv:2107.02575v1 [cs.CV])
    (2 min) This paper proposes a method for representation learning of multimodal data using contrastive losses. A traditional approach is to contrast different modalities to learn the information shared between them. However, that approach could fail to learn the complementary synergies between modalities that might be useful for downstream tasks. Another approach is to concatenate all the modalities into a tuple and then contrast positive and negative tuple correspondences. However, that approach could consider only the stronger modalities while ignoring the weaker ones. To address these issues, we propose a novel contrastive learning objective, TupleInfoNCE. It contrasts tuples based not only on positive and negative correspondences but also by composing new negative tuples using modalities describing different scenes. Training with these additional negatives encourages the learning model to examine the correspondences among modalities in the same tuple, ensuring that weak modalities are not ignored. We provide a theoretical justification based on mutual information for why this approach works, and we propose a sample optimization algorithm to generate positive and negative samples to maximize training efficacy. We find that TupleInfoNCE significantly outperforms the previous state of the arts on three different downstream tasks.
    HybrUR: A Hybrid Physical-Neural Solution for Unsupervised Underwater Image Restoration. (arXiv:2107.02660v1 [cs.CV])
    (2 min) Robust vision restoration for an underwater image remains a challenging problem. For the lack of aligned underwater-terrestrial image pairs, the unsupervised method is more suited to this task. However, the pure data-driven unsupervised method usually has difficulty in achieving realistic color correction for lack of optical constraint. In this paper, we propose a data- and physics-driven unsupervised architecture that learns underwater vision restoration from unpaired underwater-terrestrial images. For sufficient domain transformation and detail preservation, the underwater degeneration needs to be explicitly constructed based on the optically unambiguous physics law. Thus, we employ the Jaffe-McGlamery degradation theory to design the generation models, and use neural networks to describe the process of underwater degradation. Furthermore, to overcome the problem of invalid gradient when optimizing the hybrid physical-neural model, we fully investigate the intrinsic correlation between the scene depth and the degradation factors for the backscattering estimation, to improve the restoration performance through physical constraints. Our experimental results show that the proposed method is able to perform high-quality restoration for unconstrained underwater images without any supervision. On multiple benchmarks, we outperform several state-of-the-art supervised and unsupervised approaches. We also demonstrate that our methods yield encouraging results on real-world applications.
    Stateless actor-critic for instance segmentation with high-level priors. (arXiv:2107.02600v1 [cs.CV])
    (2 min) Instance segmentation is an important computer vision problem which remains challenging despite impressive recent advances due to deep learning-based methods. Given sufficient training data, fully supervised methods can yield excellent performance, but annotation of ground-truth data remains a major bottleneck, especially for biomedical applications where it has to be performed by domain experts. The amount of labels required can be drastically reduced by using rules derived from prior knowledge to guide the segmentation. However, these rules are in general not differentiable and thus cannot be used with existing methods. Here, we relax this requirement by using stateless actor critic reinforcement learning, which enables non-differentiable rewards. We formulate the instance segmentation problem as graph partitioning and the actor critic predicts the edge weights driven by the rewards, which are based on the conformity of segmented instances to high-level priors on object shape, position or size. The experiments on toy and real datasets demonstrate that we can achieve excellent performance without any direct supervision based only on a rich set of priors.
    A Theory of the Distortion-Perception Tradeoff in Wasserstein Space. (arXiv:2107.02555v1 [eess.IV])
    (2 min) The lower the distortion of an estimator, the more the distribution of its outputs generally deviates from the distribution of the signals it attempts to estimate. This phenomenon, known as the perception-distortion tradeoff, has captured significant attention in image restoration, where it implies that fidelity to ground truth images comes at the expense of perceptual quality (deviation from statistics of natural images). However, despite the increasing popularity of performing comparisons on the perception-distortion plane, there remains an important open question: what is the minimal distortion that can be achieved under a given perception constraint? In this paper, we derive a closed form expression for this distortion-perception (DP) function for the mean squared-error (MSE) distortion and the Wasserstein-2 perception index. We prove that the DP function is always quadratic, regardless of the underlying distribution. This stems from the fact that estimators on the DP curve form a geodesic in Wasserstein space. In the Gaussian setting, we further provide a closed form expression for such estimators. For general distributions, we show how these estimators can be constructed from the estimators at the two extremes of the tradeoff: The global MSE minimizer, and a minimizer of the MSE under a perfect perceptual quality constraint. The latter can be obtained as a stochastic transformation of the former.
    Detecting Outliers with Poisson Image Interpolation. (arXiv:2107.02622v1 [cs.CV])
    (2 min) Supervised learning of every possible pathology is unrealistic for many primary care applications like health screening. Image anomaly detection methods that learn normal appearance from only healthy data have shown promising results recently. We propose an alternative to image reconstruction-based and image embedding-based methods and propose a new self-supervised method to tackle pathological anomaly detection. Our approach originates in the foreign patch interpolation (FPI) strategy that has shown superior performance on brain MRI and abdominal CT data. We propose to use a better patch interpolation strategy, Poisson image interpolation (PII), which makes our method suitable for applications in challenging data regimes. PII outperforms state-of-the-art methods by a good margin when tested on surrogate tasks like identifying common lung anomalies in chest X-rays or hypo-plastic left heart syndrome in prenatal, fetal cardiac ultrasound images. Code available at https://github.com/jemtan/PII.
    COVID-19 Pneumonia Severity Prediction using Hybrid Convolution-Attention Neural Architectures. (arXiv:2107.02672v1 [eess.IV])
    (2 min) This study proposed a novel framework for COVID-19 severity prediction, which is a combination of data-centric and model-centric approaches. First, we propose a data-centric pre-training for extremely scare data scenarios of the investigating dataset. Second, we propose two hybrid convolution-attention neural architectures that leverage the self-attention from Transformer and Hopfield networks. Our proposed approach achieves significant improvement from the conventional baseline approach. The best model from our proposed approach achieves $R^2 = 0.85 \pm 0.05$ and Pearson correlation coefficient $\rho = 0.92 \pm 0.02$ in geographic extend and $R^2 = 0.72 \pm 0.09, \rho = 0.85\pm 0.06$ in opacity prediction.
    Impact of deep learning-based image super-resolution on binary signal detection. (arXiv:2107.02338v1 [eess.IV])
    (2 min) Deep learning-based image super-resolution (DL-SR) has shown great promise in medical imaging applications. To date, most of the proposed methods for DL-SR have only been assessed by use of traditional measures of image quality (IQ) that are commonly employed in the field of computer vision. However, the impact of these methods on objective measures of image quality that are relevant to medical imaging tasks remains largely unexplored. In this study, we investigate the impact of DL-SR methods on binary signal detection performance. Two popular DL-SR methods, the super-resolution convolutional neural network (SRCNN) and the super-resolution generative adversarial network (SRGAN), were trained by use of simulated medical image data. Binary signal-known-exactly with background-known-statistically (SKE/BKS) and signal-known-statistically with background-known-statistically (SKS/BKS) detection tasks were formulated. Numerical observers, which included a neural network-approximated ideal observer and common linear numerical observers, were employed to assess the impact of DL-SR on task performance. The impact of the complexity of the DL-SR network architectures on task-performance was quantified. In addition, the utility of DL-SR for improving the task-performance of sub-optimal observers was investigated. Our numerical experiments confirmed that, as expected, DL-SR could improve traditional measures of IQ. However, for many of the study designs considered, the DL-SR methods provided little or no improvement in task performance and could even degrade it. It was observed that DL-SR could improve the task-performance of sub-optimal observers under certain conditions. The presented study highlights the urgent need for the objective assessment of DL-SR methods and suggests avenues for improving their efficacy in medical imaging applications.
    Semi-TCL: Semi-Supervised Track Contrastive Representation Learning. (arXiv:2107.02396v1 [cs.CV])
    (2 min) Online tracking of multiple objects in videos requires strong capacity of modeling and matching object appearances. Previous methods for learning appearance embedding mostly rely on instance-level matching without considering the temporal continuity provided by videos. We design a new instance-to-track matching objective to learn appearance embedding that compares a candidate detection to the embedding of the tracks persisted in the tracker. It enables us to learn not only from videos labeled with complete tracks, but also unlabeled or partially labeled videos. We implement this learning objective in a unified form following the spirit of constrastive loss. Experiments on multiple object tracking datasets demonstrate that our method can effectively learning discriminative appearance embeddings in a semi-supervised fashion and outperform state of the art methods on representative benchmarks.
    Attention-based Adversarial Appearance Learning of Augmented Pedestrians. (arXiv:2107.02673v1 [cs.CV])
    (2 min) Synthetic data became already an essential component of machine learning-based perception in the field of autonomous driving. Yet it still cannot replace real data completely due to the sim2real domain shift. In this work, we propose a method that leverages the advantages of the augmentation process and adversarial training to synthesize realistic data for the pedestrian recognition task. Our approach utilizes an attention mechanism driven by an adversarial loss to learn domain discrepancies and improve sim2real adaptation. Our experiments confirm that the proposed adaptation method is robust to such discrepancies and reveals both visual realism and semantic consistency. Furthermore, we evaluate our data generation pipeline on the task of pedestrian recognition and demonstrate that generated data resemble properties of the real domain.
    From General to Specific: Online Updating for Blind Super-Resolution. (arXiv:2107.02398v1 [cs.CV])
    (2 min) Most deep learning-based super-resolution (SR) methods are not image-specific: 1) They are exhaustively trained on datasets synthesized by predefined blur kernels (\eg bicubic), regardless of the domain gap with test images. 2) Their model weights are fixed during testing, which means that test images with various degradations are super-resolved by the same set of weights. However, degradations of real images are various and unknown (\ie blind SR). It is hard for a single model to perform well in all cases. To address these issues, we propose an online super-resolution (ONSR) method. It does not rely on predefined blur kernels and allows the model weights to be updated according to the degradation of the test image. Specifically, ONSR consists of two branches, namely internal branch (IB) and external branch (EB). IB could learn the specific degradation of the given test LR image, and EB could learn to super resolve images degraded by the learned degradation. In this way, ONSR could customize a specific model for each test image, and thus could be more tolerant with various degradations in real applications. Extensive experiments on both synthesized and real-world images show that ONSR can generate more visually favorable SR results and achieve state-of-the-art performance in blind SR.
    Rethinking Positional Encoding. (arXiv:2107.02561v1 [cs.LG])
    (2 min) It is well noted that coordinate based MLPs benefit greatly -- in terms of preserving high-frequency information -- through the encoding of coordinate positions as an array of Fourier features. Hitherto, the rationale for the effectiveness of these positional encodings has been solely studied through a Fourier lens. In this paper, we strive to broaden this understanding by showing that alternative non-Fourier embedding functions can indeed be used for positional encoding. Moreover, we show that their performance is entirely determined by a trade-off between the stable rank of the embedded matrix and the distance preservation between embedded coordinates. We further establish that the now ubiquitous Fourier feature mapping of position is a special case that fulfills these conditions. Consequently, we present a more general theory to analyze positional encoding in terms of shifted basis functions. To this end, we develop the necessary theoretical formulae and empirically verify that our theoretical claims hold in practice. Codes available at https://github.com/osiriszjq/Rethinking-positional-encoding.
    End-To-End Data-Dependent Routing in Multi-Path Neural Networks. (arXiv:2107.02450v1 [cs.CV])
    (2 min) Neural networks are known to give better performance with increased depth due to their ability to learn more abstract features. Although the deepening of networks has been well established, there is still room for efficient feature extraction within a layer which would reduce the need for mere parameter increment. The conventional widening of networks by having more filters in each layer introduces a quadratic increment of parameters. Having multiple parallel convolutional/dense operations in each layer solves this problem, but without any context-dependent allocation of resources among these operations: the parallel computations tend to learn similar features making the widening process less effective. Therefore, we propose the use of multi-path neural networks with data-dependent resource allocation among parallel computations within layers, which also lets an input to be routed end-to-end through these parallel paths. To do this, we first introduce a cross-prediction based algorithm between parallel tensors of subsequent layers. Second, we further reduce the routing overhead by introducing feature-dependent cross-connections between parallel tensors of successive layers. Our multi-path networks show superior performance to existing widening and adaptive feature extraction, and even ensembles, and deeper networks at similar complexity in the image recognition task.
    Semantic Segmentation Alternative Technique: Segmentation Domain Generation. (arXiv:2107.02525v1 [cs.CV])
    (2 min) Detecting objects of interest in images was always a compelling task to automate. In recent years this task was more and more explored using deep learning techniques, mostly using region-based convolutional networks. In this project we propose an alternative semantic segmentation technique making use of Generative Adversarial Networks. We consider semantic segmentation to be a domain transfer problem. Thus, we train a feed forward network (FFNN) to receive as input a seed real image and generate as output its segmentation mask.
    Generalizing Nucleus Recognition Model in Multi-source Images via Pruning. (arXiv:2107.02500v1 [cs.CV])
    (2 min) Ki67 is a significant biomarker in the diagnosis and prognosis of cancer, whose index can be evaluated by quantifying its expression in Ki67 immunohistochemistry (IHC) stained images. However, quantitative analysis on multi-source Ki67 images is yet a challenging task in practice due to cross-domain distribution differences, which result from imaging variation, staining styles, and lesion types. Many recent studies have made some efforts on domain generalization (DG), whereas there are still some noteworthy limitations. Specifically in the case of Ki67 images, learning invariant representation is at the mercy of the insufficient number of domains and the cell categories mismatching in different domains. In this paper, we propose a novel method to improve DG by searching the domain-agnostic subnetwork in a domain merging scenario. Partial model parameters are iteratively pruned according to the domain gap, which is caused by the data converting from a single domain into merged domains during training. In addition, the model is optimized by fine-tuning on merged domains to eliminate the interference of class mismatching among various domains. Furthermore, an appropriate implementation is attained by applying the pruning method to different parts of the framework. Compared with known DG methods, our method yields excellent performance in multiclass nucleus recognition of Ki67 IHC images, especially in the lost category cases. Moreover, our competitive results are also evaluated on the public dataset over the state-of-the-art DG methods.
    UACANet: Uncertainty Augmented Context Attention for Polyp Semgnetaion. (arXiv:2107.02368v1 [cs.CV])
    (2 min) We propose Uncertainty Augmented Context Attention network (UACANet) for polyp segmentation which consider a uncertain area of the saliency map. We construct a modified version of U-Net shape network with additional encoder and decoder and compute a saliency map in each bottom-up stream prediction module and propagate to the next prediction module. In each prediction module, previously predicted saliency map is utilized to compute foreground, background and uncertain area map and we aggregate the feature map with three area maps for each representation. Then we compute the relation between each representation and each pixel in the feature map. We conduct experiments on five popular polyp segmentation benchmarks, Kvasir, CVC-ClinicDB, ETIS, CVC-ColonDB and CVC-300, and achieve state-of-the-art performance. Especially, we achieve 76.6% mean Dice on ETIS dataset which is 13.8% improvement compared to the previous state-of-the-art method.
    Long-Short Transformer: Efficient Transformers for Language and Vision. (arXiv:2107.02192v1 [cs.CV])
    (2 min) Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because self-attention mechanism has quadratic time and memory complexities with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters than previous method, while being faster and is able to handle 3$\times$ as long sequences compared to its full-attention version on the same hardware. On ImageNet, it can obtain the state-of-the-art results~(e.g., Top-1 accuracy 84.1% trained on 224$\times$224 ImageNet-1K only), while being more scalable on high-resolution images. The models and source code will be released soon.
    Neural Mixture Models with Expectation-Maximization for End-to-end Deep Clustering. (arXiv:2107.02453v1 [cs.LG])
    (2 min) Any clustering algorithm must synchronously learn to model the clusters and allocate data to those clusters in the absence of labels. Mixture model-based methods model clusters with pre-defined statistical distributions and allocate data to those clusters based on the cluster likelihoods. They iteratively refine those distribution parameters and member assignments following the Expectation-Maximization (EM) algorithm. However, the cluster representability of such hand-designed distributions that employ a limited amount of parameters is not adequate for most real-world clustering tasks. In this paper, we realize mixture model-based clustering with a neural network where the final layer neurons, with the aid of an additional transformation, approximate cluster distribution outputs. The network parameters pose as the parameters of those distributions. The result is an elegant, much-generalized representation of clusters than a restricted mixture of hand-designed distributions. We train the network end-to-end via batch-wise EM iterations where the forward pass acts as the E-step and the backward pass acts as the M-step. In image clustering, the mixture-based EM objective can be used as the clustering objective along with existing representation learning methods. In particular, we show that when mixture-EM optimization is fused with consistency optimization, it improves the sole consistency optimization performance in clustering. Our trained networks outperform single-stage deep clustering methods that still depend on k-means, with unsupervised classification accuracy of 63.8% in STL10, 58% in CIFAR10, 25.9% in CIFAR100, and 98.9% in MNIST.
    Feature Fusion Vision Transformer Fine-Grained Visual Categorization. (arXiv:2107.02341v1 [cs.CV])
    (2 min) The core for tackling the fine-grained visual categorization (FGVC) is to learn subtleyet discriminative features. Most previous works achieve this by explicitly selecting thediscriminative parts or integrating the attention mechanism via CNN-based approaches.However, these methods enhance the computational complexity and make the modeldominated by the regions containing the most of the objects. Recently, vision trans-former (ViT) has achieved SOTA performance on general image recognition tasks. Theself-attention mechanism aggregates and weights the information from all patches to theclassification token, making it perfectly suitable for FGVC. Nonetheless, the classifi-cation token in the deep layer pays more attention to the global information, lackingthe local and low-level features that are essential for FGVC. In this work, we proposea novel pure transformer-based framework Feature Fusion Vision Transformer (FFVT)where we aggregate the important tokens from each transformer layer to compensate thelocal, low-level and middle-level information. We design a novel token selection mod-ule called mutual attention weight selection (MAWS) to guide the network effectivelyand efficiently towards selecting discriminative tokens without introducing extra param-eters. We verify the effectiveness of FFVT on three benchmarks where FFVT achievesthe state-of-the-art performance.
    Adapting Vehicle Detector to Target Domain by Adversarial Prediction Alignment. (arXiv:2107.02411v1 [cs.CV])
    (2 min) While recent advancement of domain adaptation techniques is significant, most of methods only align a feature extractor and do not adapt a classifier to target domain, which would be a cause of performance degradation. We propose novel domain adaptation technique for object detection that aligns prediction output space. In addition to feature alignment, we aligned predictions of locations and class confidences of our vehicle detector for satellite images by adversarial training. The proposed method significantly improved AP score by over 5%, which shows effectivity of our method for object detection tasks in satellite images.
    CoReD: Generalizing Fake Media Detection with Continual Representation using Distillation. (arXiv:2107.02408v1 [cs.CV])
    (2 min) Over the last few decades, artificial intelligence research has made tremendous strides, but it still heavily relies on fixed datasets in stationary environments. Continual learning is a growing field of research that examines how AI systems can learn sequentially from a continuous stream of linked data in the same way that biological systems do. Simultaneously, fake media such as deepfakes and synthetic face images have emerged as significant to current multimedia technologies. Recently, numerous method has been proposed which can detect deepfakes with high accuracy. However, they suffer significantly due to their reliance on fixed datasets in limited evaluation settings. Therefore, in this work, we apply continuous learning to neural networks' learning dynamics, emphasizing its potential to increase data efficiency significantly. We propose Continual Representation using Distillation (CoReD) method that employs the concept of Continual Learning (CoL), Representation Learning (ReL), and Knowledge Distillation (KD). We design CoReD to perform sequential domain adaptation tasks on new deepfake and GAN-generated synthetic face datasets, while effectively minimizing the catastrophic forgetting in a teacher-student model setting. Our extensive experimental results demonstrate that our method is efficient at domain adaptation to detect low-quality deepfakes videos and GAN-generated images from several datasets, outperforming the-state-of-art baseline methods.
    Domain Adaptation via CycleGAN for Retina Segmentation in Optical Coherence Tomography. (arXiv:2107.02345v1 [eess.IV])
    (2 min) With the FDA approval of Artificial Intelligence (AI) for point-of-care clinical diagnoses, model generalizability is of the utmost importance as clinical decision-making must be domain-agnostic. A method of tackling the problem is to increase the dataset to include images from a multitude of domains; while this technique is ideal, the security requirements of medical data is a major limitation. Additionally, researchers with developed tools benefit from the addition of open-sourced data, but are limited by the difference in domains. Herewith, we investigated the implementation of a Cycle-Consistent Generative Adversarial Networks (CycleGAN) for the domain adaptation of Optical Coherence Tomography (OCT) volumes. This study was done in collaboration with the Biomedical Optics Research Group and Functional & Anatomical Imaging & Shape Analysis Lab at Simon Fraser University. In this study, we investigated a learning-based approach of adapting the domain of a publicly available dataset, UK Biobank dataset (UKB). To evaluate the performance of domain adaptation, we utilized pre-existing retinal layer segmentation tools developed on a different set of RETOUCH OCT data. This study provides insight on state-of-the-art tools for domain adaptation compared to traditional processing techniques as well as a pipeline for adapting publicly available retinal data to the domains previously used by our collaborators.
    The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor Segmentation and Radiogenomic Classification. (arXiv:2107.02314v1 [cs.CV])
    (2 min) The BraTS 2021 challenge celebrates its 10th anniversary and is jointly organized by the Radiological Society of North America (RSNA), the American Society of Neuroradiology (ASNR), and the Medical Image Computing and Computer Assisted Interventions (MICCAI) society. Since its inception, BraTS has been focusing on being a common benchmarking venue for brain glioma segmentation algorithms, with well-curated multi-institutional multi-parametric magnetic resonance imaging (mpMRI) data. Gliomas are the most common primary malignancies of the central nervous system, with varying degrees of aggressiveness and prognosis. The RSNA-ASNR-MICCAI BraTS 2021 challenge targets the evaluation of computational algorithms assessing the same tumor compartmentalization, as well as the underlying tumor's molecular characterization, in pre-operative baseline mpMRI data from 2,000 patients. Specifically, the two tasks that BraTS 2021 focuses on are: a) the segmentation of the histologically distinct brain tumor sub-regions, and b) the classification of the tumor's O[6]-methylguanine-DNA methyltransferase (MGMT) promoter methylation status. The performance evaluation of all participating algorithms in BraTS 2021 will be conducted through the Sage Bionetworks Synapse platform (Task 1) and Kaggle (Task 2), concluding in distributing to the top ranked participants monetary awards of $60,000 collectively.
    Integrating Circle Kernels into Convolutional Neural Networks. (arXiv:2107.02451v1 [cs.CV])
    (2 min) The square kernel is a standard unit for contemporary Convolutional Neural Networks (CNNs), as it fits well on the tensor computation for the convolution operation. However, the receptive field in the human visual system is actually isotropic like a circle. Motivated by this observation, we propose using circle kernels with isotropic receptive fields for the convolution, and our training takes approximately equivalent amount of calculation when compared with the corresponding CNN with square kernels. Our preliminary experiments demonstrate the rationality of circle kernels. We then propose a kernel boosting strategy that integrates the circle kernels with square kernels for the training and inference, and we further let the kernel size/radius be learnable during the training. Note that we reparameterize the circle kernels or integrated kernels before the inference, thus taking no extra computation as well as the number of parameter overhead for the testing. Extensive experiments on several standard datasets, ImageNet, CIFAR-10 and CIFAR-100, using the circle kernels or integrated kernels on typical existing CNNs, show that our approach exhibits highly competitive performance. Specifically, on ImageNet with standard data augmentation, our approach dramatically boosts the performance of MobileNetV3-Small by 5.20% top-1 accuracy and 3.39% top-5 accuracy, and boosts the performance of MobileNetV3-Large by 2.16% top-1 accuracy and 1.18% top-5 accuracy.
    MSE Loss with Outlying Label for Imbalanced Classification. (arXiv:2107.02393v1 [cs.CV])
    (2 min) In this paper, we propose mean squared error (MSE) loss with outlying label for class imbalanced classification. Cross entropy (CE) loss, which is widely used for image recognition, is learned so that the probability value of true class is closer to one by back propagation. However, for imbalanced datasets, the learning is insufficient for the classes with a small number of samples. Therefore, we propose a novel classification method using the MSE loss that can be learned the relationships of all classes no matter which image is input. Unlike CE loss, MSE loss is possible to equalize the number of back propagation for all classes and to learn the feature space considering the relationships between classes as metric learning. Furthermore, instead of the usual one-hot teacher label, we use a novel teacher label that takes the number of class samples into account. This induces the outlying label which depends on the number of samples in each class, and the class with a small number of samples has outlying margin in a feature space. It is possible to create the feature space for separating high-difficulty classes and low-difficulty classes. By the experiments on imbalanced classification and semantic segmentation, we confirmed that the proposed method was much improved in comparison with standard CE loss and conventional methods, even though only the loss and teacher labels were changed.
    On Robustness of Lane Detection Models to Physical-World Adversarial Attacks in Autonomous Driving. (arXiv:2107.02488v1 [cs.CV])
    (2 min) After the 2017 TuSimple Lane Detection Challenge, its evaluation based on accuracy and F1 score has become the de facto standard to measure the performance of lane detection methods. In this work, we conduct the first large-scale empirical study to evaluate the robustness of state-of-the-art lane detection methods under physical-world adversarial attacks in autonomous driving. We evaluate 4 major types of lane detection approaches with the conventional evaluation and end-to-end evaluation in autonomous driving scenarios and then discuss the security proprieties of each lane detection model. We demonstrate that the conventional evaluation fails to reflect the robustness in end-to-end autonomous driving scenarios. Our results show that the most robust model on the conventional metrics is the least robust in the end-to-end evaluation. Although the competition dataset and its metrics have played a substantial role in developing performant lane detection methods along with the rapid development of deep neural networks, the conventional evaluation is becoming obsolete and the gap between the metrics and practicality is critical. We hope that our study will help the community make further progress in building a more comprehensive framework to evaluate lane detection models.
    Vision Xformers: Efficient Attention for Image Classification. (arXiv:2107.02239v1 [cs.CV])
    (2 min) Linear attention mechanisms provide hope for overcoming the bottleneck of quadratic complexity which restricts application of transformer models in vision tasks. We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers like Performer, Linformer and Nystr\"omformer of linear complexity creating Vision X-formers (ViX). We show that ViX performs better than ViT in image classification consuming lesser computing resources. We further show that replacing the embedding linear layer by convolutional layers in ViX further increases their performance. Our test on recent visions transformer models like LeViT and Compact Convolutional Transformer (CCT) show that replacing the attention with Nystr\"omformer or Performer saves GPU usage and memory without deteriorating performance. Incorporating these changes can democratize transformers by making them accessible to those with limited data and computing resources.
    Neighbor-Vote: Improving Monocular 3D Object Detection through Neighbor Distance Voting. (arXiv:2107.02493v1 [cs.CV])
    (2 min) As cameras are increasingly deployed in new application domains such as autonomous driving, performing 3D object detection on monocular images becomes an important task for visual scene understanding. Recent advances on monocular 3D object detection mainly rely on the ``pseudo-LiDAR'' generation, which performs monocular depth estimation and lifts the 2D pixels to pseudo 3D points. However, depth estimation from monocular images, due to its poor accuracy, leads to inevitable position shift of pseudo-LiDAR points within the object. Therefore, the predicted bounding boxes may suffer from inaccurate location and deformed shape. In this paper, we present a novel neighbor-voting method that incorporates neighbor predictions to ameliorate object detection from severely deformed pseudo-LiDAR point clouds. Specifically, each feature point around the object forms their own predictions, and then the ``consensus'' is achieved through voting. In this way, we can effectively combine the neighbors' predictions with local prediction and achieve more accurate 3D detection. To further enlarge the difference between the foreground region of interest (ROI) pseudo-LiDAR points and the background points, we also encode the ROI prediction scores of 2D foreground pixels into the corresponding pseudo-LiDAR points. We conduct extensive experiments on the KITTI benchmark to validate the merits of our proposed method. Our results on the bird's eye view detection outperform the state-of-the-art performance by a large margin, especially for the ``hard'' level detection.
    Double-Uncertainty Assisted Spatial and Temporal Regularization Weighting for Learning-based Registration. (arXiv:2107.02433v1 [cs.CV])
    (2 min) In order to tackle the difficulty associated with the ill-posed nature of the image registration problem, researchers use regularization to constrain the solution space. For most learning-based registration approaches, the regularization usually has a fixed weight and only constrains the spatial transformation. Such convention has two limitations: (1) The regularization strength of a specific image pair should be associated with the content of the images, thus the ``one value fits all'' scheme is not ideal; (2) Only spatially regularizing the transformation (but overlooking the temporal consistency of different estimations) may not be the best strategy to cope with the ill-posedness. In this study, we propose a mean-teacher based registration framework. This framework incorporates an additional \textit{temporal regularization} term by encouraging the teacher model's temporal ensemble prediction to be consistent with that of the student model. At each training step, it also automatically adjusts the weights of the \textit{spatial regularization} and the \textit{temporal regularization} by taking account of the transformation uncertainty and appearance uncertainty derived from the perturbed teacher model. We perform experiments on multi- and uni-modal registration tasks, and the results show that our strategy outperforms the traditional and learning-based benchmark methods.
    Connectivity Matters: Neural Network Pruning Through the Lens of Effective Sparsity. (arXiv:2107.02306v1 [cs.LG])
    (2 min) Neural network pruning is a fruitful area of research with surging interest in high sparsity regimes. Benchmarking in this domain heavily relies on faithful representation of the sparsity of subnetworks, which has been traditionally computed as the fraction of removed connections (direct sparsity). This definition, however, fails to recognize unpruned parameters that detached from input or output layers of underlying subnetworks, potentially underestimating actual effective sparsity: the fraction of inactivated connections. While this effect might be negligible for moderately pruned networks (up to 10-100 compression rates), we find that it plays an increasing role for thinner subnetworks, greatly distorting comparison between different pruning algorithms. For example, we show that effective compression of a randomly pruned LeNet-300-100 can be orders of magnitude larger than its direct counterpart, while no discrepancy is ever observed when using SynFlow for pruning [Tanaka et al., 2020]. In this work, we adopt the lens of effective sparsity to reevaluate several recent pruning algorithms on common benchmark architectures (e.g., LeNet-300-100, VGG-19, ResNet-18) and discover that their absolute and relative performance changes dramatically in this new and more appropriate framework. To aim for effective, rather than direct, sparsity, we develop a low-cost extension to most pruning algorithms. Further, equipped with effective sparsity as a reference frame, we partially reconfirm that random pruning with appropriate sparsity allocation across layers performs as well or better than more sophisticated algorithms for pruning at initialization [Su et al., 2020]. In response to this observation, using a simple analogy of pressure distribution in coupled cylinders from physics, we design novel layerwise sparsity quotas that outperform all existing baselines in the context of random pruning.
    Independent Encoder for Deep Hierarchical Unsupervised Image-to-Image Translation. (arXiv:2107.02494v1 [cs.CV])
    (2 min) The main challenges of image-to-image (I2I) translation are to make the translated image realistic and retain as much information from the source domain as possible. To address this issue, we propose a novel architecture, termed as IEGAN, which removes the encoder of each network and introduces an encoder that is independent of other networks. Compared with previous models, it embodies three advantages of our model: Firstly, it is more directly and comprehensively to grasp image information since the encoder no longer receives loss from generator and discriminator. Secondly, the independent encoder allows each network to focus more on its own goal which makes the translated image more realistic. Thirdly, the reduction in the number of encoders performs more unified image representation. However, when the independent encoder applies two down-sampling blocks, it's hard to extract semantic information. To tackle this problem, we propose deep and shallow information space containing characteristic and semantic information, which can guide the model to translate high-quality images under the task with significant shape or texture change. We compare IEGAN with other previous models, and conduct researches on semantic information consistency and component ablation at the same time. These experiments show the superiority and effectiveness of our architecture. Our code is published on: https://github.com/Elvinky/IEGAN.
    GCN-Based Linkage Prediction for Face Clusteringon Imbalanced Datasets: An Empirical Study. (arXiv:2107.02477v1 [cs.CV])
    (2 min) In recent years, benefiting from the expressivepower of Graph Convolutional Networks (GCNs),significant breakthroughs have been made in faceclustering. However, rare attention has been paidto GCN-based clustering on imbalanced data. Al-though imbalance problem has been extensivelystudied, the impact of imbalanced data on GCN-based linkage prediction task is quite different,which would cause problems in two aspects: im-balanced linkage labels and biased graph represen-tations. The problem of imbalanced linkage labelsis similar to that in image classification task, but thelatter is a particular problem in GCN-based clus-tering via linkage prediction. Significantly biasedgraph representations in training can cause catas-trophic overfitting of a GCN model. To tacklethese problems, we evaluate the feasibility of thoseexisting methods for imbalanced image classifica-tion problem on graphs with extensive experiments,and present a new method to alleviate the imbal-anced labels and also augment graph representa-tions using a Reverse-Imbalance Weighted Sam-pling (RIWS) strategy, followed with insightfulanalyses and discussions. A series of imbalancedbenchmark datasets synthesized from MS-Celeb-1M and DeepFashion will be openly available.
    Memory-aware curriculum federated learning for breast cancer classification. (arXiv:2107.02504v1 [cs.CV])
    (2 min) For early breast cancer detection, regular screening with mammography imaging is recommended. Routinary examinations result in datasets with a predominant amount of negative samples. A potential solution to such class-imbalance is joining forces across multiple institutions. Developing a collaborative computer-aided diagnosis system is challenging in different ways. Patient privacy and regulations need to be carefully respected. Data across institutions may be acquired from different devices or imaging protocols, leading to heterogeneous non-IID data. Also, for learning-based methods, new optimization strategies working on distributed data are required. Recently, federated learning has emerged as an effective tool for collaborative learning. In this setting, local models perform computation on their private data to update the global model. The order and the frequency of local updates influence the final global model. Hence, the order in which samples are locally presented to the optimizers plays an important role. In this work, we define a memory-aware curriculum learning method for the federated setting. Our curriculum controls the order of the training samples paying special attention to those that are forgotten after the deployment of the global model. Our approach is combined with unsupervised domain adaptation to deal with domain shift while preserving data privacy. We evaluate our method with three clinical datasets from different vendors. Our results verify the effectiveness of federated adversarial learning for the multi-site breast cancer classification. Moreover, we show that our proposed memory-aware curriculum method is beneficial to further improve classification performance. Our code is publicly available at: https://github.com/ameliajimenez/curriculum-federated-learning.
    Histogram of Cell Types: Deep Learning for Automated Bone Marrow Cytology. (arXiv:2107.02293v1 [eess.IV])
    (2 min) Bone marrow cytology is required to make a hematological diagnosis, influencing critical clinical decision points in hematology. However, bone marrow cytology is tedious, limited to experienced reference centers and associated with high inter-observer variability. This may lead to a delayed or incorrect diagnosis, leaving an unmet need for innovative supporting technologies. We have developed the first ever end-to-end deep learning-based technology for automated bone marrow cytology. Starting with a bone marrow aspirate digital whole slide image, our technology rapidly and automatically detects suitable regions for cytology, and subsequently identifies and classifies all bone marrow cells in each region. This collective cytomorphological information is captured in a novel representation called Histogram of Cell Types (HCT) quantifying bone marrow cell class probability distribution and acting as a cytological "patient fingerprint". The approach achieves high accuracy in region detection (0.97 accuracy and 0.99 ROC AUC), and cell detection and cell classification (0.75 mAP, 0.78 F1-score, Log-average miss rate of 0.31). HCT has potential to revolutionize hematopathology diagnostic workflows, leading to more cost-effective, accurate diagnosis and opening the door to precision medicine.
    Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering. (arXiv:2107.02331v1 [cs.CL])
    (2 min) Active learning promises to alleviate the massive data needs of supervised machine learning: it has successfully improved sample efficiency by an order of magnitude on traditional tasks like topic classification and object recognition. However, we uncover a striking contrast to this promise: across 5 models and 4 datasets on the task of visual question answering, a wide variety of active learning approaches fail to outperform random selection. To understand this discrepancy, we profile 8 active learning methods on a per-example basis, and identify the problem as collective outliers -- groups of examples that active learning methods prefer to acquire but models fail to learn (e.g., questions that ask about text in images or require external knowledge). Through systematic ablation experiments and qualitative visualizations, we verify that collective outliers are a general phenomenon responsible for degrading pool-based active learning. Notably, we show that active learning sample efficiency increases significantly as the number of collective outliers in the active learning pool decreases. We conclude with a discussion and prescriptive recommendations for mitigating the effects of these outliers in future work.
    A new smart-cropping pipeline for prostate segmentation using deep learning networks. (arXiv:2107.02476v1 [eess.IV])
    (2 min) Prostate segmentation from magnetic resonance imaging (MRI) is a challenging task. In recent years, several network architectures have been proposed to automate this process and alleviate the burden of manual annotation. Although the performance of these models has achieved promising results, there is still room for improvement before these models can be used safely and effectively in clinical practice. One of the major challenges in prostate MR image segmentation is the presence of class imbalance in the image labels where the background pixels dominate over the prostate. In the present work we propose a DL-based pipeline for cropping the region around the prostate from MRI images to produce a more balanced distribution of the foreground pixels (prostate) and the background pixels and improve segmentation accuracy. The effect of DL-cropping for improving the segmentation performance compared to standard center-cropping is assessed using five popular DL networks for prostate segmentation, namely U-net, U-net+, Res Unet++, Bridge U-net and Dense U-net. The proposed smart-cropping outperformed the standard center cropping in terms of segmentation accuracy for all the evaluated prostate segmentation networks. In terms of Dice score, the highest improvement was achieved for the U-net+ and ResU-net++ architectures corresponding to 8.9% and 8%, respectively.
    Exploring Deep Learning Methods for Real-Time Surgical Instrument Segmentation in Laparoscopy. (arXiv:2107.02319v1 [eess.IV])
    (2 min) Minimally invasive surgery is a surgical intervention used to examine the organs inside the abdomen and has been widely used due to its effectiveness over open surgery. Due to the hardware improvements such as high definition cameras, this procedure has significantly improved and new software methods have demonstrated potential for computer-assisted procedures. However, there exists challenges and requirements to improve detection and tracking of the position of the instruments during these surgical procedures. To this end, we evaluate and compare some popular deep learning methods that can be explored for the automated segmentation of surgical instruments in laparoscopy, an important step towards tool tracking. Our experimental results exhibit that the Dual decoder attention network (DDANet) produces a superior result compared to other recent deep learning methods. DDANet yields a Dice coefficient of 0.8739 and mean intersection-over-union of 0.8183 for the Robust Medical Instrument Segmentation (ROBUST-MIS) Challenge 2019 dataset, at a real-time speed of 101.36 frames-per-second that is critical for such procedures.
    LightFuse: Lightweight CNN based Dual-exposure Fusion. (arXiv:2107.02299v1 [cs.CV])
    (2 min) Deep convolutional neural networks (DCNN) aided high dynamic range (HDR) imaging recently received a lot of attention. The quality of DCNN generated HDR images have overperformed the traditional counterparts. However, DCNN is prone to be computationally intensive and power-hungry. To address the challenge, we propose LightFuse, a light-weight CNN-based algorithm for extreme dual-exposure image fusion, which can be implemented on various embedded computing platforms with limited power and hardware resources. Two sub-networks are utilized: a GlobalNet (G) and a DetailNet (D). The goal of G is to learn the global illumination information on the spatial dimension, whereas D aims to enhance local details on the channel dimension. Both G and D are based solely on depthwise convolution (D Conv) and pointwise convolution (P Conv) to reduce required parameters and computations. Experimental results display that the proposed technique could generate HDR images with plausible details in extremely exposed regions. Our PSNR score exceeds the other state-of-the-art approaches by 1.2 to 1.6 times and achieves 1.4 to 20 times FLOP and parameter reduction compared with others.
    Automated age-related macular degeneration area estimation -- first results. (arXiv:2107.02211v1 [eess.IV])
    (2 min) This work aims to research an automatic method for detecting Age-related Macular Degeneration (AMD) lesions in RGB eye fundus images. For this, we align invasively obtained eye fundus contrast images (the "golden standard" diagnostic) to the RGB ones and use them to hand-annotate the lesions. This is done using our custom-made tool. Using the data, we train and test five different convolutional neural networks: a custom one to classify healthy and AMD-affected eye fundi, and four well-known networks: ResNet50, ResNet101, MobileNetV3, and UNet to segment (localize) the AMD lesions in the affected eye fundus images. We achieve 93.55% accuracy or 69.71% Dice index as the preliminary best results in segmentation with MobileNetV3.
    Label noise in segmentation networks : mitigation must deal with bias. (arXiv:2107.02189v1 [cs.CV])
    (2 min) Imperfect labels limit the quality of predictions learned by deep neural networks. This is particularly relevant in medical image segmentation, where reference annotations are difficult to collect and vary significantly even across expert annotators. Prior work on mitigating label noise focused on simple models of mostly uniform noise. In this work, we explore biased and unbiased errors artificially introduced to brain tumour annotations on MRI data. We found that supervised and semi-supervised segmentation methods are robust or fairly robust to unbiased errors but sensitive to biased errors. It is therefore important to identify the sorts of errors expected in medical image labels and especially mitigate the biased errors.
    A Hierarchical Dual Model of Environment- and Place-Specific Utility for Visual Place Recognition. (arXiv:2107.02440v1 [cs.RO])
    (2 min) Visual Place Recognition (VPR) approaches have typically attempted to match places by identifying visual cues, image regions or landmarks that have high ``utility'' in identifying a specific place. But this concept of utility is not singular - rather it can take a range of forms. In this paper, we present a novel approach to deduce two key types of utility for VPR: the utility of visual cues `specific' to an environment, and to a particular place. We employ contrastive learning principles to estimate both the environment- and place-specific utility of Vector of Locally Aggregated Descriptors (VLAD) clusters in an unsupervised manner, which is then used to guide local feature matching through keypoint selection. By combining these two utility measures, our approach achieves state-of-the-art performance on three challenging benchmark datasets, while simultaneously reducing the required storage and compute time. We provide further analysis demonstrating that unsupervised cluster selection results in semantically meaningful results, that finer grained categorization often has higher utility for VPR than high level semantic categorization (e.g. building, road), and characterise how these two utility measures vary across different places and environments. Source code is made publicly available at https://github.com/Nik-V9/HEAPUtil.
    Morphological Classification of Galaxies in S-PLUS using an Ensemble of Convolutional Networks. (arXiv:2107.02287v1 [astro-ph.GA])
    (2 min) The universe is composed of galaxies that have diverse shapes. Once the structure of a galaxy is determined, it is possible to obtain important information about its formation and evolution. Morphologically classifying galaxies means cataloging them according to their visual appearance and the classification is linked to the physical properties of the galaxy. A morphological classification made through visual inspection is subject to biases introduced by subjective observations made by human volunteers. For this reason, systematic, objective and easily reproducible classification of galaxies has been gaining importance since the astronomer Edwin Hubble created his famous classification method. In this work, we combine accurate visual classifications of the Galaxy Zoo project with \emph {Deep Learning} methods. The goal is to find an efficient technique at human performance level classification, but in a systematic and automatic way, for classification of elliptical and spiral galaxies. For this, a neural network model was created through an Ensemble of four other convolutional models, allowing a greater accuracy in the classification than what would be obtained with any one individual. Details of the individual models and improvements made are also described. The present work is entirely based on the analysis of images (not parameter tables) from DR1 (www.datalab.noao.edu) of the Southern Photometric Local Universe Survey (S-PLUS). In terms of classification, we achieved, with the Ensemble, an accuracy of $\approx 99 \%$ in the test sample (using pre-trained networks).
    Graph Convolution for Re-ranking in Person Re-identification. (arXiv:2107.02220v1 [cs.CV])
    (2 min) Nowadays, deep learning is widely applied to extract features for similarity computation in person re-identification (re-ID) and have achieved great success. However, due to the non-overlapping between training and testing IDs, the difference between the data used for model training and the testing data makes the performance of learned feature degraded during testing. Hence, re-ranking is proposed to mitigate this issue and various algorithms have been developed. However, most of existing re-ranking methods focus on replacing the Euclidean distance with sophisticated distance metrics, which are not friendly to downstream tasks and hard to be used for fast retrieval of massive data in real applications. In this work, we propose a graph-based re-ranking method to improve learned features while still keeping Euclidean distance as the similarity metric. Inspired by graph convolution networks, we develop an operator to propagate features over an appropriate graph. Since graph is the essential key for the propagation, two important criteria are considered for designing the graph, and three different graphs are explored accordingly. Furthermore, a simple yet effective method is proposed to generate a profile vector for each tracklet in videos, which helps extend our method to video re-ID. Extensive experiments on three benchmark data sets, e.g., Market-1501, Duke, and MARS, demonstrate the effectiveness of our proposed approach.
    TransformerFusion: Monocular RGB Scene Reconstruction using Transformers. (arXiv:2107.02191v1 [cs.CV])
    (2 min) We introduce TransformerFusion, a transformer-based 3D scene reconstruction approach. From an input monocular RGB video, the video frames are processed by a transformer network that fuses the observations into a volumetric feature grid representing the scene; this feature grid is then decoded into an implicit 3D scene representation. Key to our approach is the transformer architecture that enables the network to learn to attend to the most relevant image frames for each 3D location in the scene, supervised only by the scene reconstruction task. Features are fused in a coarse-to-fine fashion, storing fine-level features only where needed, requiring lower memory storage and enabling fusion at interactive rates. The feature grid is then decoded to a higher-resolution scene reconstruction, using an MLP-based surface occupancy prediction from interpolated coarse-to-fine 3D features. Our approach results in an accurate surface reconstruction, outperforming state-of-the-art multi-view stereo depth estimation methods, fully-convolutional 3D reconstruction approaches, and approaches using LSTM- or GRU-based recurrent networks for video sequence fusion.
    An Ensemble Noise-Robust K-fold Cross-Validation Selection Method for Noisy Labels. (arXiv:2107.02347v1 [cs.LG])
    (2 min) We consider the problem of training robust and accurate deep neural networks (DNNs) when subject to various proportions of noisy labels. Large-scale datasets tend to contain mislabeled samples that can be memorized by DNNs, impeding the performance. With appropriate handling, this degradation can be alleviated. There are two problems to consider: how to distinguish clean samples and how to deal with noisy samples. In this paper, we present Ensemble Noise-robust K-fold Cross-Validation Selection (E-NKCVS) to effectively select clean samples from noisy data, solving the first problem. For the second problem, we create a new pseudo label for any sample determined to have an uncertain or likely corrupt label. E-NKCVS obtains multiple predicted labels for each sample and the entropy of these labels is used to tune the weight given to the pseudo label and the given label. Theoretical analysis and extensive verification of the algorithms in the noisy label setting are provided. We evaluate our approach on various image and text classification tasks where the labels have been manually corrupted with different noise ratios. Additionally, two large real-world noisy datasets are also used, Clothing-1M and WebVision. E-NKCVS is empirically shown to be highly tolerant to considerable proportions of label noise and has a consistent improvement over state-of-the-art methods. Especially on more difficult datasets with higher noise ratios, we can achieve a significant improvement over the second-best model. Moreover, our proposed approach can easily be integrated into existing DNN methods to improve their robustness against label noise.
    Polarized skylight orientation determination artificial neural network. (arXiv:2107.02328v1 [cs.CV])
    (2 min) This paper proposes an artificial neural network to determine orientation using polarized skylight. This neural network has specific dilated convolution, which can extract light intensity information of different polarization directions. Then, the degree of polarization (DOP) and angle of polarization (AOP) are directly extracted in the network. In addition, the exponential function encoding of orientation is designed as the network output, which can better reflect the insect's encoding of polarization information, and improve the accuracy of orientation determination. Finally, training and testing were conducted on a public polarized skylight navigation dataset, and the experimental results proved the stability and effectiveness of the network.
    Self-Adversarial Training incorporating Forgery Attention for Image Forgery Localization. (arXiv:2107.02434v1 [cs.CV])
    (2 min) Image editing techniques enable people to modify the content of an image without leaving visual traces and thus may cause serious security risks. Hence the detection and localization of these forgeries become quite necessary and challenging. Furthermore, unlike other tasks with extensive data, there is usually a lack of annotated forged images for training due to annotation difficulties. In this paper, we propose a self-adversarial training strategy and a reliable coarse-to-fine network that utilizes a self-attention mechanism to localize forged regions in forgery images. The self-attention module is based on a Channel-Wise High Pass Filter block (CW-HPF). CW-HPF leverages inter-channel relationships of features and extracts noise features by high pass filters. Based on the CW-HPF, a self-attention mechanism, called forgery attention, is proposed to capture rich contextual dependencies of intrinsic inconsistency extracted from tampered regions. Specifically, we append two types of attention modules on top of CW-HPF respectively to model internal interdependencies in spatial dimension and external dependencies among channels. We exploit a coarse-to-fine network to enhance the noise inconsistency between original and tampered regions. More importantly, to address the issue of insufficient training data, we design a self-adversarial training strategy that expands training data dynamically to achieve more robust performance. Specifically, in each training iteration, we perform adversarial attacks against our network to generate adversarial examples and train our model on them. Extensive experimental results demonstrate that our proposed algorithm steadily outperforms state-of-the-art methods by a clear margin in different benchmark datasets.
    Learning Disentangled Representation Implicitly via Transformer for Occluded Person Re-Identification. (arXiv:2107.02380v1 [cs.CV])
    (2 min) Person re-identification (re-ID) under various occlusions has been a long-standing challenge as person images with different types of occlusions often suffer from misalignment in image matching and ranking. Most existing methods tackle this challenge by aligning spatial features of body parts according to external semantic cues or feature similarities but this alignment approach is complicated and sensitive to noises. We design DRL-Net, a disentangled representation learning network that handles occluded re-ID without requiring strict person image alignment or any additional supervision. Leveraging transformer architectures, DRL-Net achieves alignment-free re-ID via global reasoning of local features of occluded person images. It measures image similarity by automatically disentangling the representation of undefined semantic components, e.g., human body parts or obstacles, under the guidance of semantic preference object queries in the transformer. In addition, we design a decorrelation constraint in the transformer decoder and impose it over object queries for better focus on different semantic components. To better eliminate interference from occlusions, we design a contrast feature learning technique (CFL) for better separation of occlusion features and discriminative ID features. Extensive experiments over occluded and holistic re-ID benchmarks (Occluded-DukeMTMC, Market1501 and DukeMTMC) show that the DRL-Net achieves superior re-ID performance consistently and outperforms the state-of-the-art by large margins for Occluded-DukeMTMC.
    VolNet: Estimating Human Body Part Volumes from a Single RGB Image. (arXiv:2107.02259v1 [cs.CV])
    (2 min) Human body volume estimation from a single RGB image is a challenging problem despite minimal attention from the research community. However VolNet, an architecture leveraging 2D and 3D pose estimation, body part segmentation and volume regression extracted from a single 2D RGB image combined with the subject's body height can be used to estimate the total body volume. VolNet is designed to predict the 2D and 3D pose as well as the body part segmentation in intermediate tasks. We generated a synthetic, large-scale dataset of photo-realistic images of human bodies with a wide range of body shapes and realistic poses called SURREALvols. By using Volnet and combining multiple stacked hourglass networks together with ResNeXt, our model correctly predicted the volume in ~82% of cases with a 10% tolerance threshold. This is a considerable improvement compared to state-of-the-art solutions such as BodyNet with only a ~38% success rate.
    A visual introduction to Gaussian Belief Propagation. (arXiv:2107.02308v1 [cs.AI])
    (2 min) In this article, we present a visual introduction to Gaussian Belief Propagation (GBP), an approximate probabilistic inference algorithm that operates by passing messages between the nodes of arbitrarily structured factor graphs. A special case of loopy belief propagation, GBP updates rely only on local information and will converge independently of the message schedule. Our key argument is that, given recent trends in computing hardware, GBP has the right computational properties to act as a scalable distributed probabilistic inference framework for future machine learning systems.
    FloorLevel-Net: Recognizing Floor-Level Lines with Height-Attention-Guided Multi-task Learning. (arXiv:2107.02462v1 [cs.CV])
    (2 min) The ability to recognize the position and order of the floor-level lines that divide adjacent building floors can benefit many applications, for example, urban augmented reality (AR). This work tackles the problem of locating floor-level lines in street-view images, using a supervised deep learning approach. Unfortunately, very little data is available for training such a network $-$ current street-view datasets contain either semantic annotations that lack geometric attributes, or rectified facades without perspective priors. To address this issue, we first compile a new dataset and develop a new data augmentation scheme to synthesize training samples by harassing (i) the rich semantics of existing rectified facades and (ii) perspective priors of buildings in diverse street views. Next, we design FloorLevel-Net, a multi-task learning network that associates explicit features of building facades and implicit floor-level lines, along with a height-attention mechanism to help enforce a vertical ordering of floor-level lines. The generated segmentations are then passed to a second-stage geometry post-processing to exploit self-constrained geometric priors for plausible and consistent reconstruction of floor-level lines. Quantitative and qualitative evaluations conducted on assorted facades in existing datasets and street views from Google demonstrate the effectiveness of our approach. Also, we present context-aware image overlay results and show the potentials of our approach in enriching AR-related applications.
  • cs.IR updates on arXiv.org

    Sawtooth Factorial Topic Embeddings Guided Gamma Belief Network. (arXiv:2107.02757v1 [cs.IR])
    (2 min) Hierarchical topic models such as the gamma belief network (GBN) have delivered promising results in mining multi-layer document representations and discovering interpretable topic taxonomies. However, they often assume in the prior that the topics at each layer are independently drawn from the Dirichlet distribution, ignoring the dependencies between the topics both at the same layer and across different layers. To relax this assumption, we propose sawtooth factorial topic embedding guided GBN, a deep generative model of documents that captures the dependencies and semantic similarities between the topics in the embedding space. Specifically, both the words and topics are represented as embedding vectors of the same dimension. The topic matrix at a layer is factorized into the product of a factor loading matrix and a topic embedding matrix, the transpose of which is set as the factor loading matrix of the layer above. Repeating this particular type of factorization, which shares components between adjacent layers, leads to a structure referred to as sawtooth factorization. An auto-encoding variational inference network is constructed to optimize the model parameter via stochastic gradient descent. Experiments on big corpora show that our models outperform other neural topic models on extracting deeper interpretable topics and deriving better document representations.
    CausalRec: Causal Inference for Visual Debiasing in Visually-Aware Recommendation. (arXiv:2107.02390v1 [cs.IR])
    (2 min) Visually-aware recommendation on E-commerce platforms aims to leverage visual information of items to predict a user's preference. It is commonly observed that user's attention to visual features does not always reflect the real preference. Although a user may click and view an item in light of a visual satisfaction of their expectations, a real purchase does not always occur due to the unsatisfaction of other essential features (e.g., brand, material, price). We refer to the reason for such a visually related interaction deviating from the real preference as a visual bias. Existing visually-aware models make use of the visual features as a separate collaborative signal similarly to other features to directly predict the user's preference without considering a potential bias, which gives rise to a visually biased recommendation. In this paper, we derive a causal graph to identify and analyze the visual bias of these existing methods. In this causal graph, the visual feature of an item acts as a mediator, which could introduce a spurious relationship between the user and the item. To eliminate this spurious relationship that misleads the prediction of the user's real preference, an intervention and a counterfactual inference are developed over the mediator. Particularly, the Total Indirect Effect is applied for a debiased prediction during the testing phase of the model. This causal inference framework is model agnostic such that it can be integrated into the existing methods. Furthermore, we propose a debiased visually-aware recommender system, denoted as CausalRec to effectively retain the supportive significance of the visual information and remove the visual bias. Extensive experiments are conducted on eight benchmark datasets, which shows the state-of-the-art performance of CausalRec and the efficacy of debiasing.
    KATRec: Knowledge Aware aTtentive Sequential Recommendations. (arXiv:2012.03323v3 [cs.IR] UPDATED)
    (2 min) Sequential recommendation systems model dynamic preferences of users based on their historical interactions with platforms. Despite recent progress, modeling short-term and long-term behavior of users in such systems is nontrivial and challenging. To address this, we present a solution enhanced by a knowledge graph called KATRec (Knowledge Aware aTtentive sequential Recommendations). KATRec learns the short and long-term interests of users by modeling their sequence of interacted items and leveraging pre-existing side information through a knowledge graph attention network. Our novel knowledge graph-enhanced sequential recommender contains item multi-relations at the entity-level and users' dynamic sequences at the item-level. KATRec improves item representation learning by considering higher-order connections and incorporating them in user preference representation while recommending the next item. Experiments on three public datasets show that KATRec outperforms state-of-the-art recommendation models and demonstrates the importance of modeling both temporal and side information to achieve high-quality recommendations.
    Gender Recognition in Informal and Formal Language Scenarios via Transfer Learning. (arXiv:2107.02759v1 [cs.CL])
    (2 min) The interest in demographic information retrieval based on text data has increased in the research community because applications have shown success in different sectors such as security, marketing, heath-care, and others. Recognition and identification of demographic traits such as gender, age, location, or personality based on text data can help to improve different marketing strategies. For instance it makes it possible to segment and to personalize offers, thus products and services are exposed to the group of greatest interest. This type of technology has been discussed widely in documents from social media. However, the methods have been poorly studied in data with a more formal structure, where there is no access to emoticons, mentions, and other linguistic phenomena that are only present in social media. This paper proposes the use of recurrent and convolutional neural networks, and a transfer learning strategy for gender recognition in documents that are written in informal and formal languages. Models are tested in two different databases consisting of Tweets and call-center conversations. Accuracies of up to 75\% are achieved for both databases. The results also indicate that it is possible to transfer the knowledge from a system trained on a specific type of expressions or idioms such as those typically used in social media into a more formal type of text data, where the amount of data is more scarce and its structure is completely different.
    SemEval-2021 Task 11: NLPContributionGraph -- Structuring Scholarly NLP Contributions for a Research Knowledge Graph. (arXiv:2106.07385v2 [cs.CL] UPDATED)
    (2 min) There is currently a gap between the natural language expression of scholarly publications and their structured semantic content modeling to enable intelligent content search. With the volume of research growing exponentially every year, a search feature operating over semantically structured content is compelling. The SemEval-2021 Shared Task NLPContributionGraph (a.k.a. 'the NCG task') tasks participants to develop automated systems that structure contributions from NLP scholarly articles in the English language. Being the first-of-its-kind in the SemEval series, the task released structured data from NLP scholarly articles at three levels of information granularity, i.e. at sentence-level, phrase-level, and phrases organized as triples toward Knowledge Graph (KG) building. The sentence-level annotations comprised the few sentences about the article's contribution. The phrase-level annotations were scientific term and predicate phrases from the contribution sentences. Finally, the triples constituted the research overview KG. For the Shared Task, participating systems were then expected to automatically classify contribution sentences, extract scientific terms and relations from the sentences, and organize them as KG triples. Overall, the task drew a strong participation demographic of seven teams and 27 participants. The best end-to-end task system classified contribution sentences at 57.27% F1, phrases at 46.41% F1, and triples at 22.28% F1. While the absolute performance to generate triples remains low, in the conclusion of this article, the difficulty of producing such data and as a consequence of modeling it is highlighted.
    Exploring the Scope of Using News Articles to Understand Development Patterns of Districts in India. (arXiv:2107.02765v1 [cs.IR])
    (2 min) Understanding what factors bring about socio-economic development may often suffer from the streetlight effect, of analyzing the effect of only those variables that have been measured and are therefore available for analysis. How do we check whether all worthwhile variables have been instrumented and considered when building an econometric development model? We attempt to address this question by building unsupervised learning methods to identify and rank news articles about diverse events occurring in different districts of India, that can provide insights about what may have transpired in the districts. This can help determine whether variables related to these events are indeed available or not to model the development of these districts. We also describe several other applications that emerge from this approach, such as to use news articles to understand why pairs of districts that may have had similar socio-economic indicators approximately ten years back ended up at different levels of development currently, and another application that generates a newsfeed of unusual news articles that do not conform to news articles about typical districts with a similar socio-economic profile. These applications outline the need for qualitative data to augment models based on quantitative data, and are meant to open up research on new ways to mine information from unstructured qualitative data to understand development.
    Overlapping Spaces for Compact Graph Representations. (arXiv:2007.02445v2 [cs.LG] UPDATED)
    (2 min) Various non-trivial spaces are becoming popular for embedding structured data such as graphs, texts, or images. Following spherical and hyperbolic spaces, more general product spaces have been proposed. However, searching for the best configuration of product space is a resource-intensive procedure, which reduces the practical applicability of the idea. We generalize the concept of product space and introduce an overlapping space that does not have the configuration search problem. The main idea is to allow subsets of coordinates to be shared between spaces of different types (Euclidean, hyperbolic, spherical). As a result, parameter optimization automatically learns the optimal configuration. Additionally, overlapping spaces allow for more compact representations since their geometry is more complex. Our experiments confirm that overlapping spaces outperform the competitors in graph embedding tasks. Here, we consider both distortion setup, where the aim is to preserve distances, and ranking setup, where the relative order should be preserved. The proposed method effectively solves the problem and outperforms the competitors in both settings. We also perform an empirical analysis in a realistic information retrieval task, where we compare all spaces by incorporating them into DSSM. In this case, the proposed overlapping space consistently achieves nearly optimal results without any configuration tuning. This allows for reducing training time, which can be significant in large-scale applications.
  • cs.LG updates on arXiv.org

    General Purpose (GenP) Bioimage Ensemble of Handcrafted and Learned Features with Data Augmentation. (arXiv:1904.08084v4 [cs.CV] UPDATED)
    (2 min) Bioimage classification plays a crucial role in many biological problems. In this work, we present a new General Purpose (GenP) ensemble that boosts performance by combining local features, dense sampling features, and deep learning approaches. First, we introduce three new methods for data augmentation based on PCA/DCT; second, we show that different data augmentation approaches can boost the performance of an ensemble of CNNs; and, finally, we propose a set of handcrafted/learned descriptors that are highly generalizable. Each handcrafted descriptor is used to train a different Support Vector Machine (SVM), and the different SVMs are combined with the ensemble of CNNs. Our method is evaluated on a diverse set of bioimage classification problems. Results demonstrate that the proposed GenP bioimage ensemble obtains state-of-the-art performance without any ad-hoc dataset tuning of parameters (thus avoiding the risk of overfitting/overtraining).
    Hyperspectral Pansharpening Based on Improved Deep Image Prior and Residual Reconstruction. (arXiv:2107.02630v1 [cs.CV])
    (2 min) Hyperspectral pansharpening aims to synthesize a low-resolution hyperspectral image (LR-HSI) with a registered panchromatic image (PAN) to generate an enhanced HSI with high spectral and spatial resolution. Recently proposed HS pansharpening methods have obtained remarkable results using deep convolutional networks (ConvNets), which typically consist of three steps: (1) up-sampling the LR-HSI, (2) predicting the residual image via a ConvNet, and (3) obtaining the final fused HSI by adding the outputs from first and second steps. Recent methods have leveraged Deep Image Prior (DIP) to up-sample the LR-HSI due to its excellent ability to preserve both spatial and spectral information, without learning from large data sets. However, we observed that the quality of up-sampled HSIs can be further improved by introducing an additional spatial-domain constraint to the conventional spectral-domain energy function. We define our spatial-domain constraint as the $L_1$ distance between the predicted PAN image and the actual PAN image. To estimate the PAN image of the up-sampled HSI, we also propose a learnable spectral response function (SRF). Moreover, we noticed that the residual image between the up-sampled HSI and the reference HSI mainly consists of edge information and very fine structures. In order to accurately estimate fine information, we propose a novel over-complete network, called HyperKite, which focuses on learning high-level features by constraining the receptive from increasing in the deep layers. We perform experiments on three HSI datasets to demonstrate the superiority of our DIP-HyperKite over the state-of-the-art pansharpening methods. The deployment codes, pre-trained models, and final fusion outputs of our DIP-HyperKite and the methods used for the comparisons will be publicly made available at https://github.com/wgcban/DIP-HyperKite.git.
    AdaRL: What, Where, and How to Adapt in Transfer Reinforcement Learning. (arXiv:2107.02729v1 [cs.LG])
    (2 min) Most approaches in reinforcement learning (RL) are data-hungry and specific to fixed environments. In this paper, we propose a principled framework for adaptive RL, called AdaRL, that adapts reliably to changes across domains. Specifically, we construct a generative environment model for the structural relationships among variables in the system and embed the changes in a compact way, which provides a clear and interpretable picture for locating what and where the changes are and how to adapt. Based on the environment model, we characterize a minimal set of representations, including both domain-specific factors and domain-shared state representations, that suffice for reliable and low-cost transfer. Moreover, we show that by explicitly leveraging a compact representation to encode changes, we can adapt the policy with only a few samples without further policy optimization in the target domain. We illustrate the efficacy of AdaRL through a series of experiments that allow for changes in different components of Cartpole and Atari games.
    Overlapping Spaces for Compact Graph Representations. (arXiv:2007.02445v2 [cs.LG] UPDATED)
    (2 min) Various non-trivial spaces are becoming popular for embedding structured data such as graphs, texts, or images. Following spherical and hyperbolic spaces, more general product spaces have been proposed. However, searching for the best configuration of product space is a resource-intensive procedure, which reduces the practical applicability of the idea. We generalize the concept of product space and introduce an overlapping space that does not have the configuration search problem. The main idea is to allow subsets of coordinates to be shared between spaces of different types (Euclidean, hyperbolic, spherical). As a result, parameter optimization automatically learns the optimal configuration. Additionally, overlapping spaces allow for more compact representations since their geometry is more complex. Our experiments confirm that overlapping spaces outperform the competitors in graph embedding tasks. Here, we consider both distortion setup, where the aim is to preserve distances, and ranking setup, where the relative order should be preserved. The proposed method effectively solves the problem and outperforms the competitors in both settings. We also perform an empirical analysis in a realistic information retrieval task, where we compare all spaces by incorporating them into DSSM. In this case, the proposed overlapping space consistently achieves nearly optimal results without any configuration tuning. This allows for reducing training time, which can be significant in large-scale applications.
    Bayesian Algorithm Execution: Estimating Computable Properties of Black-box Functions Using Mutual Information. (arXiv:2104.09460v2 [stat.ML] UPDATED)
    (2 min) In many real-world problems, we want to infer some property of an expensive black-box function $f$, given a budget of $T$ function evaluations. One example is budget constrained global optimization of $f$, for which Bayesian optimization is a popular method. Other properties of interest include local optima, level sets, integrals, or graph-structured information induced by $f$. Often, we can find an algorithm $\mathcal{A}$ to compute the desired property, but it may require far more than $T$ queries to execute. Given such an $\mathcal{A}$, and a prior distribution over $f$, we refer to the problem of inferring the output of $\mathcal{A}$ using $T$ evaluations as Bayesian Algorithm Execution (BAX). To tackle this problem, we present a procedure, InfoBAX, that sequentially chooses queries that maximize mutual information with respect to the algorithm's output. Applying this to Dijkstra's algorithm, for instance, we infer shortest paths in synthetic and real-world graphs with black-box edge costs. Using evolution strategies, we yield variants of Bayesian optimization that target local, rather than global, optima. On these problems, InfoBAX uses up to 500 times fewer queries to $f$ than required by the original algorithm. Our method is closely connected to other Bayesian optimal experimental design procedures such as entropy search methods and optimal sensor placement using Gaussian processes.
    Quantum Annealing Formulation for Binary Neural Networks. (arXiv:2107.02751v1 [quant-ph])
    (2 min) Quantum annealing is a promising paradigm for building practical quantum computers. Compared to other approaches, quantum annealing technology has been scaled up to a larger number of qubits. On the other hand, deep learning has been profoundly successful in pushing the boundaries of AI. It is thus natural to investigate potentially game changing technologies such as quantum annealers to augment the capabilities of deep learning. In this work, we explore binary neural networks, which are lightweight yet powerful models typically intended for resource constrained devices. Departing from current training regimes for binary networks that smooth/approximate the activation functions to make the network differentiable, we devise a quadratic unconstrained binary optimization formulation for the training problem. While the problem is intractable, i.e., the cost to estimate the binary weights scales exponentially with network size, we show how the problem can be optimized directly on a quantum annealer, thereby opening up to the potential gains of quantum computing. We experimentally validated our formulation via simulation and testing on an actual quantum annealer (D-Wave Advantage), the latter to the extent allowable by the capacity of current technology.
    DeepOPG: Improving Orthopantomogram Finding Summarization with Weak Supervision. (arXiv:2103.08290v2 [cs.CV] UPDATED)
    (2 min) Clinical finding summaries from an orthopantomogram, or a dental panoramic radiograph, have significant potential to improve patient communication and speed up clinical judgments. While orthopantomogram is a first-line tool for dental examinations, no existing work has explored the summarization of findings from it. A finding summary has to find teeth in the imaging study and label the teeth with several types of past treatments. To tackle the problem, we developDeepOPG that breaks the summarization process into functional segmentation and tooth localization, the latter of which is further refined by a novel dental coherence module. We also leverage weak supervision labels to improve detection results in a reinforcement learning scenario. Experiments show high efficacy of DeepOPG on finding summarization, achieving an overall AUC of 88.2% in detecting six types of findings. The proposed dental coherence and weak supervision both are shown to improve DeepOPG by adding 5.9% and 0.4% to AP@IoU=0.5.
    Using Experts' Opinions in Machine Learning Tasks. (arXiv:2008.04216v2 [cs.LG] UPDATED)
    (2 min) In machine learning tasks, especially in the tasks of prediction, scientists tend to rely solely on available historical data and disregard unproven insights, such as experts' opinions, polls, and betting odds. In this paper, we propose a general three-step framework for utilizing experts' insights in machine learning tasks and build four concrete models for a sports game prediction case study. For the case study, we have chosen the task of predicting NCAA Men's Basketball games, which has been the focus of a group of Kaggle competitions in recent years. Results highly suggest that the good performance and high scores of the past models are a result of chance, and not because of a good-performing and stable model. Furthermore, our proposed models can achieve more steady results with lower log loss average (best at 0.489) compared to the top solutions of the 2019 competition (>0.503), and reach the top 1%, 10% and 1% in the 2017, 2018 and 2019 leaderboards, respectively.
    Detecting Hypo-plastic Left Heart Syndrome in Fetal Ultrasound via Disease-specific Atlas Maps. (arXiv:2107.02643v1 [eess.IV])
    (2 min) Fetal ultrasound screening during pregnancy plays a vital role in the early detection of fetal malformations which have potential long-term health impacts. The level of skill required to diagnose such malformations from live ultrasound during examination is high and resources for screening are often limited. We present an interpretable, atlas-learning segmentation method for automatic diagnosis of Hypo-plastic Left Heart Syndrome (HLHS) from a single `4 Chamber Heart' view image. We propose to extend the recently introduced Image-and-Spatial Transformer Networks (Atlas-ISTN) into a framework that enables sensitising atlas generation to disease. In this framework we can jointly learn image segmentation, registration, atlas construction and disease prediction while providing a maximum level of clinical interpretability compared to direct image classification methods. As a result our segmentation allows diagnoses competitive with expert-derived manual diagnosis and yields an AUC-ROC of 0.978 (1043 cases for training, 260 for validation and 325 for testing).
    Differentiating through the Fr\'echet Mean. (arXiv:2003.00335v4 [stat.ML] UPDATED)
    (2 min) Recent advances in deep representation learning on Riemannian manifolds extend classical deep learning operations to better capture the geometry of the manifold. One possible extension is the Fr\'echet mean, the generalization of the Euclidean mean; however, it has been difficult to apply because it lacks a closed form with an easily computable derivative. In this paper, we show how to differentiate through the Fr\'echet mean for arbitrary Riemannian manifolds. Then, focusing on hyperbolic space, we derive explicit gradient expressions and a fast, accurate, and hyperparameter-free Fr\'echet mean solver. This fully integrates the Fr\'echet mean into the hyperbolic neural network pipeline. To demonstrate this integration, we present two case studies. First, we apply our Fr\'echet mean to the existing Hyperbolic Graph Convolutional Network, replacing its projected aggregation to obtain state-of-the-art results on datasets with high hyperbolicity. Second, to demonstrate the Fr\'echet mean's capacity to generalize Euclidean neural network operations, we develop a hyperbolic batch normalization method that gives an improvement parallel to the one observed in the Euclidean setting.
    NP-DRAW: A Non-Parametric Structured Latent Variable Model for Image Generation. (arXiv:2106.13435v2 [cs.CV] CROSS LISTED)
    (2 min) In this paper, we present a non-parametric structured latent variable model for image generation, called NP-DRAW, which sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas. Our key contributions are as follows. 1) We propose a non-parametric prior distribution over the appearance of image parts so that the latent variable ``what-to-draw'' per step becomes a categorical random variable. This improves the expressiveness and greatly eases the learning compared to Gaussians used in the literature. 2) We model the sequential dependency structure of parts via a Transformer, which is more powerful and easier to train compared to RNNs used in the literature. 3) We propose an effective heuristic parsing algorithm to pre-train the prior. Experiments on MNIST, Omniglot, CIFAR-10, and CelebA show that our method significantly outperforms previous structured image models like DRAW and AIR and is competitive to other generic generative models. Moreover, we show that our model's inherent compositionality and interpretability bring significant benefits in the low-data learning regime and latent space editing. Code is available at https://github.com/ZENGXH/NPDRAW.
    New Benchmarks for Learning on Non-Homophilous Graphs. (arXiv:2104.01404v2 [cs.LG] UPDATED)
    (2 min) Much data with graph structures satisfy the principle of homophily, meaning that connected nodes tend to be similar with respect to a specific attribute. As such, ubiquitous datasets for graph machine learning tasks have generally been highly homophilous, rewarding methods that leverage homophily as an inductive bias. Recent work has pointed out this particular focus, as new non-homophilous datasets have been introduced and graph representation learning models better suited for low-homophily settings have been developed. However, these datasets are small and poorly suited to truly testing the effectiveness of new methods in non-homophilous settings. We present a series of improved graph datasets with node label relationships that do not satisfy the homophily principle. Along with this, we introduce a new measure of the presence or absence of homophily that is better suited than existing measures in different regimes. We benchmark a range of simple methods and graph neural networks across our proposed datasets, drawing new insights for further research. Data and codes can be found at https://github.com/CUAI/Non-Homophily-Benchmarks.
    VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer. (arXiv:2107.02681v1 [cs.CL])
    (2 min) Since visual perception can give rich information beyond text descriptions for world understanding, there has been increasing interest in leveraging visual grounding for language learning. Recently, vokenization has attracted attention by using the predictions of a text-to-image retrieval model as labels for language model supervision. Despite its success, the method suffers from approximation error of using finite image labels and the lack of vocabulary diversity of a small image-text dataset. To overcome these limitations, we present VidLanKD, a video-language knowledge distillation method for improving language understanding. We train a multi-modal teacher model on a video-text dataset, and then transfer its knowledge to a student language model with a text dataset. To avoid approximation error, we propose to use different knowledge distillation objectives. In addition, the use of a large-scale video-text dataset helps learn diverse and richer vocabularies. In our experiments, VidLanKD achieves consistent improvements over text-only language models and vokenization models, on several downstream language understanding tasks including GLUE, SQuAD, and SWAG. We also demonstrate the improved world knowledge, physical reasoning, and temporal reasoning capabilities of our model by evaluating on the GLUE-diagnostics, PIQA, and TRACIE datasets. Lastly, we present comprehensive ablation studies as well as visualizations of the learned text-to-video grounding results of our teacher and student language models. Our code and models are available at: https://github.com/zinengtang/VidLanKD
    Midwifery Learning and Forecasting: Predicting Content Demand with User-Generated Logs. (arXiv:2107.02480v1 [stat.ML])
    (2 min) Every day, 800 women and 6,700 newborns die from complications related to pregnancy or childbirth. A well-trained midwife can prevent most of these maternal and newborn deaths. Data science models together with logs generated by users of online learning applications for midwives can help to improve their learning competencies. The goal is to use these rich behavioral data to push digital learning towards personalized content and to provide an adaptive learning journey. In this work, we evaluate various forecasting methods to determine the interest of future users on the different kind of contents available in the app, broken down by profession and region.
    Size-Invariant Graph Representations for Graph Classification Extrapolations. (arXiv:2103.05045v2 [cs.LG] UPDATED)
    (2 min) In general, graph representation learning methods assume that the train and test data come from the same distribution. In this work we consider an underexplored area of an otherwise rapidly developing field of graph representation learning: The task of out-of-distribution (OOD) graph classification, where train and test data have different distributions, with test data unavailable during training. Our work shows it is possible to use a causal model to learn approximately invariant representations that better extrapolate between train and test data. Finally, we conclude with synthetic and real-world dataset experiments showcasing the benefits of representations that are invariant to train/test distribution shifts.
    Dirichlet Energy Constrained Learning for Deep Graph Neural Networks. (arXiv:2107.02392v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) integrate deep architectures and topological structure modeling in an effective way. However, the performance of existing GNNs would decrease significantly when they stack many layers, because of the over-smoothing issue. Node embeddings tend to converge to similar vectors when GNNs keep recursively aggregating the representations of neighbors. To enable deep GNNs, several methods have been explored recently. But they are developed from either techniques in convolutional neural networks or heuristic strategies. There is no generalizable and theoretical principle to guide the design of deep GNNs. To this end, we analyze the bottleneck of deep GNNs by leveraging the Dirichlet energy of node embeddings, and propose a generalizable principle to guide the training of deep GNNs. Based on it, a novel deep GNN framework -- EGNN is designed. It could provide lower and upper constraints in terms of Dirichlet energy at each layer to avoid over-smoothing. Experimental results demonstrate that EGNN achieves state-of-the-art performance by using deep layers.
    Bayesian Nonparametric Modelling for Model-Free Reinforcement Learning in LTE-LAA and Wi-Fi Coexistence. (arXiv:2107.02431v1 [cs.LG])
    (2 min) With the arrival of next generation wireless communication, a growing number of new applications like internet of things, autonomous driving systems, and drone are crowding the unlicensed spectrum. Licensed network such as the long-term evolution (LTE) also comes to the unlicensed spectrum for better providing high-capacity contents with low cost. However, LTE was not designed to share resources with others. Previous solutions usually work on fixed scenarios. This work features a Nonparametric Bayesian reinforcement learning algorithm to cope with the coexistence between Wi-Fi and LTE licensed assisted access (LTE-LAA) agents in 5 GHz unlicensed spectrum. The coexistence problem is modeled as a decentralized partially-observable Markov decision process (Dec-POMDP) and Bayesian inference is adopted for policy learning with nonparametric prior to accommodate the uncertainty of policy for different agents. A fairness measure is introduced in the reward function to encourage fair sharing between agents. Variational inference for posterior model approximation is considered to make the algorithm computationally efficient. Simulation results demonstrate that this algorithm can reach high value with compact policy representations in few learning iterations.
    Self-training with noisy student model and semi-supervised loss function for dcase 2021 challenge task 4. (arXiv:2107.02569v1 [cs.SD])
    (2 min) This report proposes a polyphonic sound event detection (SED) method for the DCASE 2021 Challenge Task 4. The proposed SED model consists of two stages: a mean-teacher model for providing target labels regarding weakly labeled or unlabeled data and a self-training-based noisy student model for predicting strong labels for sound events. The mean-teacher model, which is based on the residual convolutional recurrent neural network (RCRNN) for the teacher and student model, is first trained using all the training data from a weakly labeled dataset, an unlabeled dataset, and a strongly labeled synthetic dataset. Then, the trained mean-teacher model predicts the strong label to each of the weakly labeled and unlabeled datasets, which is brought to the noisy student model in the second stage of the proposed SED model. Here, the structure of the noisy student model is identical to the RCRNN-based student model of the mean-teacher model in the first stage. Then, it is self-trained by adding feature noises, such as time-frequency shift, mixup, SpecAugment, and dropout-based model noise. In addition, a semi-supervised loss function is applied to train the noisy student model, which acts as label noise injection. The performance of the proposed SED model is evaluated on the validation set of the DCASE 2021 Challenge Task 4, and then, several ensemble models that combine five-fold validation models with different hyperparameters of the semi-supervised loss function are finally selected as our final models.
    Prioritized training on points that are learnable, worth learning, and not yet learned. (arXiv:2107.02565v1 [cs.LG])
    (2 min) We introduce Goldilocks Selection, a technique for faster model training which selects a sequence of training points that are "just right". We propose an information-theoretic acquisition function -- the reducible validation loss -- and compute it with a small proxy model -- GoldiProx -- to efficiently choose training points that maximize information about a validation set. We show that the "hard" (e.g. high loss) points usually selected in the optimization literature are typically noisy, while the "easy" (e.g. low noise) samples often prioritized for curriculum learning confer less information. Further, points with uncertain labels, typically targeted by active learning, tend to be less relevant to the task. In contrast, Goldilocks Selection chooses points that are "just right" and empirically outperforms the above approaches. Moreover, the selected sequence can transfer to other architectures; practitioners can share and reuse it without the need to recreate it.
    Automatic Testing With Reusable Adversarial Agents. (arXiv:1910.13645v3 [cs.LG] UPDATED)
    (2 min) Autonomous systems such as self-driving cars and general-purpose robots are safety-critical systems that operate in highly uncertain and dynamic environments. We propose an interactive multi-agent framework where the system-under-design is modeled as an ego agent and its environment is modeled by a number of adversarial (ado) agents. For example, a self-driving car is an ego agent whose behavior is influenced by ado agents such as pedestrians, bicyclists, traffic lights, road geometry etc. Given a logical specification of the correct behavior of the ego agent, and a set of constraints that encode reasonable adversarial behavior, our framework reduces the adversarial testing problem to the problem of synthesizing controllers for (constrained) ado agents that cause the ego agent to violate its specifications. Specifically, we explore the use of tabular and deep reinforcement learning approaches for synthesizing adversarial agents. We show that ado agents trained in this fashion are better than traditional falsification or testing techniques because they can generalize to ego agents and environments that differ from the original ego agent. We demonstrate the efficacy of our technique on two real-world case studies from the domain of self-driving cars.
    Provably Strict Generalisation Benefit for Equivariant Models. (arXiv:2102.10333v2 [stat.ML] UPDATED)
    (2 min) It is widely believed that engineering a model to be invariant/equivariant improves generalisation. Despite the growing popularity of this approach, a precise characterisation of the generalisation benefit is lacking. By considering the simplest case of linear models, this paper provides the first provably non-zero improvement in generalisation for invariant/equivariant models when the target distribution is invariant/equivariant with respect to a compact group. Moreover, our work reveals an interesting relationship between generalisation, the number of training examples and properties of the group action. Our results rest on an observation of the structure of function spaces under averaging operators which, along with its consequences for feature averaging, may be of independent interest.
    Risk bounds when learning infinitely many response functions by ordinary linear regression. (arXiv:2006.09223v2 [stat.ML] UPDATED)
    (2 min) Consider the problem of learning a large number of response functions simultaneously based on the same input variables. The training data consist of a single independent random sample of the input variables drawn from a common distribution together with the associated responses. The input variables are mapped into a high-dimensional linear space, called the feature space, and the response functions are modelled as linear functionals of the mapped features, with coefficients calibrated via ordinary least squares. We provide convergence guarantees on the worst-case excess prediction risk by controlling the convergence rate of the excess risk uniformly in the response function. The dimension of the feature map is allowed to tend to infinity with the sample size. The collection of response functions, although potentially infinite, is supposed to have a finite Vapnik-Chervonenkis dimension. The bound derived can be applied when building multiple surrogate models in a reasonable computing time.
    ActNN: Reducing Training Memory Footprint via 2-Bit Activation Compressed Training. (arXiv:2104.14129v2 [cs.LG] UPDATED)
    (2 min) The increasing size of neural network models has been critical for improvements in their accuracy, but device memory is not growing at the same rate. This creates fundamental challenges for training neural networks within limited memory environments. In this work, we propose ActNN, a memory-efficient training framework that stores randomly quantized activations for back propagation. We prove the convergence of ActNN for general network architectures, and we characterize the impact of quantization on the convergence via an exact expression for the gradient variance. Using our theory, we propose novel mixed-precision quantization strategies that exploit the activation's heterogeneity across feature dimensions, samples, and layers. These techniques can be readily applied to existing dynamic graph frameworks, such as PyTorch, simply by substituting the layers. We evaluate ActNN on mainstream computer vision models for classification, detection, and segmentation tasks. On all these tasks, ActNN compresses the activation to 2 bits on average, with negligible accuracy loss. ActNN reduces the memory footprint of the activation by 12x, and it enables training with a 6.6x to 14x larger batch size.
    SemEval-2021 Task 11: NLPContributionGraph -- Structuring Scholarly NLP Contributions for a Research Knowledge Graph. (arXiv:2106.07385v2 [cs.CL] UPDATED)
    (2 min) There is currently a gap between the natural language expression of scholarly publications and their structured semantic content modeling to enable intelligent content search. With the volume of research growing exponentially every year, a search feature operating over semantically structured content is compelling. The SemEval-2021 Shared Task NLPContributionGraph (a.k.a. 'the NCG task') tasks participants to develop automated systems that structure contributions from NLP scholarly articles in the English language. Being the first-of-its-kind in the SemEval series, the task released structured data from NLP scholarly articles at three levels of information granularity, i.e. at sentence-level, phrase-level, and phrases organized as triples toward Knowledge Graph (KG) building. The sentence-level annotations comprised the few sentences about the article's contribution. The phrase-level annotations were scientific term and predicate phrases from the contribution sentences. Finally, the triples constituted the research overview KG. For the Shared Task, participating systems were then expected to automatically classify contribution sentences, extract scientific terms and relations from the sentences, and organize them as KG triples. Overall, the task drew a strong participation demographic of seven teams and 27 participants. The best end-to-end task system classified contribution sentences at 57.27% F1, phrases at 46.41% F1, and triples at 22.28% F1. While the absolute performance to generate triples remains low, in the conclusion of this article, the difficulty of producing such data and as a consequence of modeling it is highlighted.
    Advanced Graph and Sequence Neural Networks for Molecular Property Prediction and Drug Discovery. (arXiv:2012.01981v3 [q-bio.QM] UPDATED)
    (2 min) Properties of molecules are indicative of their functions and thus are useful in many applications. With the advances of deep learning methods, computational approaches for predicting molecular properties are gaining increasing momentum. However, there lacks customized and advanced methods and comprehensive tools for this task currently. Here we develop a suite of comprehensive machine learning methods and tools spanning different computational models, molecular representations, and loss functions for molecular property prediction and drug discovery. Specifically, we represent molecules as both graphs and sequences. Built on these representations, we develop novel deep models for learning from molecular graphs and sequences. In order to learn effectively from highly imbalanced datasets, we develop advanced loss functions that optimize areas under precision-recall curves. Altogether, our work not only serves as a comprehensive tool, but also contributes towards developing novel and advanced graph and sequence learning methodologies. Results on both online and offline antibiotics discovery and molecular property prediction tasks show that our methods achieve consistent improvements over prior methods. In particular, our methods achieve #1 ranking in terms of both ROC-AUC and PRC-AUC on the AI Cures Open Challenge for drug discovery related to COVID-19. Our software is released as part of the MoleculeX library under AdvProp.
    Learning-based vs Model-free Adaptive Control of a MAV under Wind Gust. (arXiv:2101.12501v2 [cs.RO] UPDATED)
    (2 min) Navigation problems under unknown varying conditions are among the most important and well-studied problems in the control field. Classic model-based adaptive control methods can be applied only when a convenient model of the plant or environment is provided. Recent model-free adaptive control methods aim at removing this dependency by learning the physical characteristics of the plant and/or process directly from sensor feedback. Although there have been prior attempts at improving these techniques, it remains an open question as to whether it is possible to cope with real-world uncertainties in a control system that is fully based on either paradigm. We propose a conceptually simple learning-based approach composed of a full state feedback controller, tuned robustly by a deep reinforcement learning framework based on the Soft Actor-Critic algorithm. We compare it, in realistic simulations, to a model-free controller that uses the same deep reinforcement learning framework for the control of a micro aerial vehicle under wind gust. The results indicate the great potential of learning-based adaptive control methods in modern dynamical systems.
    LTL2Action: Generalizing LTL Instructions for Multi-Task RL. (arXiv:2102.06858v3 [cs.AI] UPDATED)
    (2 min) We address the problem of teaching a deep reinforcement learning (RL) agent to follow instructions in multi-task environments. Instructions are expressed in a well-known formal language -- linear temporal logic (LTL) -- and can specify a diversity of complex, temporally extended behaviours, including conditionals and alternative realizations. Our proposed learning approach exploits the compositional syntax and the semantics of LTL, enabling our RL agent to learn task-conditioned policies that generalize to new instructions, not observed during training. To reduce the overhead of learning LTL semantics, we introduce an environment-agnostic LTL pretraining scheme which improves sample-efficiency in downstream environments. Experiments on discrete and continuous domains target combinatorial task sets of up to $\sim10^{39}$ unique tasks and demonstrate the strength of our approach in learning to solve (unseen) tasks, given LTL instructions.
    An $\ell_p$ theory of PCA and spectral clustering. (arXiv:2006.14062v2 [math.ST] UPDATED)
    (2 min) Principal Component Analysis (PCA) is a powerful tool in statistics and machine learning. While existing study of PCA focuses on the recovery of principal components and their associated eigenvalues, there are few precise characterizations of individual principal component scores that yield low-dimensional embedding of samples. That hinders the analysis of various spectral methods. In this paper, we first develop an $\ell_p$ perturbation theory for a hollowed version of PCA in Hilbert spaces which provably improves upon the vanilla PCA in the presence of heteroscedastic noises. Through a novel $\ell_p$ analysis of eigenvectors, we investigate entrywise behaviors of principal component score vectors and show that they can be approximated by linear functionals of the Gram matrix in $\ell_p$ norm, which includes $\ell_2$ and $\ell_\infty$ as special examples. For sub-Gaussian mixture models, the choice of $p$ giving optimal bounds depends on the signal-to-noise ratio, which further yields optimality guarantees for spectral clustering. For contextual community detection, the $\ell_p$ theory leads to a simple spectral algorithm that achieves the information threshold for exact recovery. These also provide optimal recovery results for Gaussian mixture and stochastic block models as special cases.
    Learning optimal multigrid smoothers via neural networks. (arXiv:2102.12071v2 [math.NA] UPDATED)
    (2 min) Multigrid methods are one of the most efficient techniques for solving linear systems arising from Partial Differential Equations (PDEs) and graph Laplacians from machine learning applications. One of the key components of multigrid is smoothing, which aims at reducing high-frequency errors on each grid level. However, finding optimal smoothing algorithms is problem-dependent and can impose challenges for many problems. In this paper, we propose an efficient adaptive framework for learning optimized smoothers from operator stencils in the form of convolutional neural networks (CNNs). The CNNs are trained on small-scale problems from a given type of PDEs based on a supervised loss function derived from multigrid convergence theories, and can be applied to large-scale problems of the same class of PDEs. Numerical results on anisotropic rotated Laplacian problems demonstrate improved convergence rates and solution time compared with classical hand-crafted relaxation methods.
    LipBaB: Computing exact Lipschitz constant of ReLU networks. (arXiv:2105.05495v2 [cs.LG] UPDATED)
    (2 min) The Lipschitz constant of neural networks plays an important role in several contexts of deep learning ranging from robustness certification and regularization to stability analysis of systems with neural network controllers. Obtaining tight bounds of the Lipschitz constant is therefore important. We introduce LipBaB, a branch and bound framework to compute certified bounds of the local Lipschitz constant of deep neural networks with ReLU activation functions up to any desired precision. We achieve this by bounding the norm of the Jacobians, corresponding to different activation patterns of the network caused within the input domain. Our algorithm can provide provably exact computation of the Lipschitz constant for any p-norm.
    ETHOS: an Online Hate Speech Detection Dataset. (arXiv:2006.08328v2 [cs.CL] UPDATED)
    (2 min) Online hate speech is a recent problem in our society that is rising at a steady pace by leveraging the vulnerabilities of the corresponding regimes that characterise most social media platforms. This phenomenon is primarily fostered by offensive comments, either during user interaction or in the form of a posted multimedia context. Nowadays, giant corporations own platforms where millions of users log in every day, and protection from exposure to similar phenomena appears to be necessary in order to comply with the corresponding legislation and maintain a high level of service quality. A robust and reliable system for detecting and preventing the uploading of relevant content will have a significant impact on our digitally interconnected society. Several aspects of our daily lives are undeniably linked to our social profiles, making us vulnerable to abusive behaviours. As a result, the lack of accurate hate speech detection mechanisms would severely degrade the overall user experience, although its erroneous operation would pose many ethical concerns. In this paper, we present 'ETHOS', a textual dataset with two variants: binary and multi-label, based on YouTube and Reddit comments validated using the Figure-Eight crowdsourcing platform. Furthermore, we present the annotation protocol used to create this dataset: an active sampling procedure for balancing our data in relation to the various aspects defined. Our key assumption is that, even gaining a small amount of labelled data from such a time-consuming process, we can guarantee hate speech occurrences in the examined material.
    Effectiveness of MPC-friendly Softmax Replacement. (arXiv:2011.11202v2 [cs.LG] UPDATED)
    (2 min) Softmax is widely used in deep learning to map some representation to a probability distribution. As it is based on exp/log functions that are relatively expensive in multi-party computation, Mohassel and Zhang (2017) proposed a simpler replacement based on ReLU to be used in secure computation. However, we could not reproduce the accuracy they reported for training on MNIST with three fully connected layers. Later works (e.g., Wagh et al., 2019 and 2021) used the softmax replacement not for computing the output probability distribution but for approximating the gradient in back-propagation. In this work, we analyze the two uses of the replacement and compare them to softmax, both in terms of accuracy and cost in multi-party computation. We found that the replacement only provides a significant speed-up for a one-layer network while it always reduces accuracy, sometimes significantly. Thus we conclude that its usefulness is limited and one should use the original softmax function instead.
    Contrastive Losses and Solution Caching for Predict-and-Optimize. (arXiv:2011.05354v2 [cs.LG] UPDATED)
    (2 min) Many decision-making processes involve solving a combinatorial optimization problem with uncertain input that can be estimated from historic data. Recently, problems in this class have been successfully addressed via end-to-end learning approaches, which rely on solving one optimization problem for each training instance at every epoch. In this context, we provide two distinct contributions. First, we use a Noise Contrastive approach to motivate a family of surrogate loss functions, based on viewing non-optimal solutions as negative examples. Second, we address a major bottleneck of all predict-and-optimize approaches, i.e. the need to frequently recompute optimal solutions at training time. This is done via a solver-agnostic solution caching scheme, and by replacing optimization calls with a lookup in the solution cache. The method is formally based on an inner approximation of the feasible space and, combined with a cache lookup strategy, provides a controllable trade-off between training time and accuracy of the loss approximation. We empirically show that even a very slow growth rate is enough to match the quality of state-of-the-art methods, at a fraction of the computational cost.
    A Unified Off-Policy Evaluation Approach for General Value Function. (arXiv:2107.02711v1 [cs.LG])
    (2 min) General Value Function (GVF) is a powerful tool to represent both the {\em predictive} and {\em retrospective} knowledge in reinforcement learning (RL). In practice, often multiple interrelated GVFs need to be evaluated jointly with pre-collected off-policy samples. In the literature, the gradient temporal difference (GTD) learning method has been adopted to evaluate GVFs in the off-policy setting, but such an approach may suffer from a large estimation error even if the function approximation class is sufficiently expressive. Moreover, none of the previous work have formally established the convergence guarantee to the ground truth GVFs under the function approximation settings. In this paper, we address both issues through the lens of a class of GVFs with causal filtering, which cover a wide range of RL applications such as reward variance, value gradient, cost in anomaly detection, stationary distribution gradient, etc. We propose a new algorithm called GenTD for off-policy GVFs evaluation and show that GenTD learns multiple interrelated multi-dimensional GVFs as efficiently as a single canonical scalar value function. We further show that unlike GTD, the learned GVFs by GenTD are guaranteed to converge to the ground truth GVFs as long as the function approximation power is sufficiently large. To our best knowledge, GenTD is the first off-policy GVF evaluation algorithm that has global optimality guarantee.
    A deep-learning--based multimodal depth-aware dynamic hand gesture recognition system. (arXiv:2107.02543v1 [cs.CV])
    (2 min) Any spatio-temporal movement or reorientation of the hand, done with the intention of conveying a specific meaning, can be considered as a hand gesture. Inputs to hand gesture recognition systems can be in several forms, such as depth images, monocular RGB, or skeleton joint points. We observe that raw depth images possess low contrasts in the hand regions of interest (ROI). They do not highlight important details to learn, such as finger bending information (whether a finger is overlapping the palm, or another finger). Recently, in deep-learning--based dynamic hand gesture recognition, researchers are tying to fuse different input modalities (e.g. RGB or depth images and hand skeleton joint points) to improve the recognition accuracy. In this paper, we focus on dynamic hand gesture (DHG) recognition using depth quantized image features and hand skeleton joint points. In particular, we explore the effect of using depth-quantized features in Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN) based multi-modal fusion networks. We find that our method improves existing results on the SHREC-DHG-14 dataset. Furthermore, using our method, we show that it is possible to reduce the resolution of the input images by more than four times and still obtain comparable or better accuracy to that of the resolutions used in previous methods.
    Attention over learned object embeddings enables complex visual reasoning. (arXiv:2012.08508v2 [cs.CV] UPDATED)
    (2 min) Neural networks have achieved success in a wide array of perceptual tasks but often fail at tasks involving both perception and higher-level reasoning. On these more challenging tasks, bespoke approaches (such as modular symbolic components, independent dynamics models or semantic parsers) targeted towards that specific type of task have typically performed better. The downside to these targeted approaches, however, is that they can be more brittle than general-purpose neural networks, requiring significant modification or even redesign according to the particular task at hand. Here, we propose a more general neural-network-based approach to dynamic visual reasoning problems that obtains state-of-the-art performance on three different domains, in each case outperforming bespoke modular approaches tailored specifically to the task. Our method relies on learned object-centric representations, self-attention and self-supervised dynamics learning, and all three elements together are required for strong performance to emerge. The success of this combination suggests that there may be no need to trade off flexibility for performance on problems involving spatio-temporal or causal-style reasoning. With the right soft biases and learning objectives in a neural network we may be able to attain the best of both worlds.
    Parametric Complexity Bounds for Approximating PDEs with Neural Networks. (arXiv:2103.02138v2 [cs.LG] UPDATED)
    (2 min) Recent experiments have shown that deep networks can approximate solutions to high-dimensional PDEs, seemingly escaping the curse of dimensionality. However, questions regarding the theoretical basis for such approximations, including the required network size, remain open. In this paper, we investigate the representational power of neural networks for approximating solutions to linear elliptic PDEs with Dirichlet boundary conditions. We prove that when a PDE's coefficients are representable by small neural networks, the parameters required to approximate its solution scale polynomially with the input dimension $d$ and proportionally to the parameter counts of the coefficient networks. To this we end, we develop a proof technique that simulates gradient descent (in an appropriate Hilbert space) by growing a neural network architecture whose iterates each participate as sub-networks in their (slightly larger) successors, and converge to the solution of the PDE. We bound the size of the solution, showing a polynomial dependence on $d$ and no dependence on the volume of the domain.
    Variance Reduction for Matrix Computations with Applications to Gaussian Processes. (arXiv:2106.14565v2 [stat.ML] UPDATED)
    (2 min) In addition to recent developments in computing speed and memory, methodological advances have contributed to significant gains in the performance of stochastic simulation. In this paper, we focus on variance reduction for matrix computations via matrix factorization. We provide insights into existing variance reduction methods for estimating the entries of large matrices. Popular methods do not exploit the reduction in variance that is possible when the matrix is factorized. We show how computing the square root factorization of the matrix can achieve in some important cases arbitrarily better stochastic performance. In addition, we propose a factorized estimator for the trace of a product of matrices and numerically demonstrate that the estimator can be up to 1,000 times more efficient on certain problems of estimating the log-likelihood of a Gaussian process. Additionally, we provide a new estimator of the log-determinant of a positive semi-definite matrix where the log-determinant is treated as a normalizing constant of a probability density.
    Counterfactual Explanations in Sequential Decision Making Under Uncertainty. (arXiv:2107.02776v1 [cs.LG])
    (2 min) Methods to find counterfactual explanations have predominantly focused on one step decision making processes. In this work, we initiate the development of methods to find counterfactual explanations for decision making processes in which multiple, dependent actions are taken sequentially over time. We start by formally characterizing a sequence of actions and states using finite horizon Markov decision processes and the Gumbel-Max structural causal model. Building upon this characterization, we formally state the problem of finding counterfactual explanations for sequential decision making processes. In our problem formulation, the counterfactual explanation specifies an alternative sequence of actions differing in at most k actions from the observed sequence that could have led the observed process realization to a better outcome. Then, we introduce a polynomial time algorithm based on dynamic programming to build a counterfactual policy that is guaranteed to always provide the optimal counterfactual explanation on every possible realization of the counterfactual environment dynamics. We validate our algorithm using both synthetic and real data from cognitive behavioral therapy and show that the counterfactual explanations our algorithm finds can provide valuable insights to enhance sequential decision making under uncertainty.
    Improving Coherence and Consistency in Neural Sequence Models with Dual-System, Neuro-Symbolic Reasoning. (arXiv:2107.02794v1 [cs.AI])
    (2 min) Human reasoning can often be understood as an interplay between two systems: the intuitive and associative ("System 1") and the deliberative and logical ("System 2"). Neural sequence models -- which have been increasingly successful at performing complex, structured tasks -- exhibit the advantages and failure modes of System 1: they are fast and learn patterns from data, but are often inconsistent and incoherent. In this work, we seek a lightweight, training-free means of improving existing System 1-like sequence models by adding System 2-inspired logical reasoning. We explore several variations on this theme in which candidate generations from a neural sequence model are examined for logical consistency by a symbolic reasoning module, which can either accept or reject the generations. Our approach uses neural inference to mediate between the neural System 1 and the logical System 2. Results in robust story generation and grounded instruction-following show that this approach can increase the coherence and accuracy of neurally-based generations.
    Data-driven reduced order modeling of environmental hydrodynamics using deep autoencoders and neural ODEs. (arXiv:2107.02784v1 [cs.LG])
    (2 min) Model reduction for fluid flow simulation continues to be of great interest across a number of scientific and engineering fields. In a previous work [arXiv:2104.13962], we explored the use of Neural Ordinary Differential Equations (NODE) as a non-intrusive method for propagating the latent-space dynamics in reduced order models. Here, we investigate employing deep autoencoders for discovering the reduced basis representation, the dynamics of which are then approximated by NODE. The ability of deep autoencoders to represent the latent-space is compared to the traditional proper orthogonal decomposition (POD) approach, again in conjunction with NODE for capturing the dynamics. Additionally, we compare their behavior with two classical non-intrusive methods based on POD and radial basis function interpolation as well as dynamic mode decomposition. The test problems we consider include incompressible flow around a cylinder as well as a real-world application of shallow water hydrodynamics in an estuarine system. Our findings indicate that deep autoencoders can leverage nonlinear manifold learning to achieve a highly efficient compression of spatial information and define a latent-space that appears to be more suitable for capturing the temporal dynamics through the NODE framework.
    ADMM for Efficient Deep Learning with Global Convergence. (arXiv:1905.13611v4 [math.OC] UPDATED)
    (2 min) Alternating Direction Method of Multipliers (ADMM) has been used successfully in many conventional machine learning applications and is considered to be a useful alternative to Stochastic Gradient Descent (SGD) as a deep learning optimizer. However, as an emerging domain, several challenges remain, including 1) The lack of global convergence guarantees, 2) Slow convergence towards solutions, and 3) Cubic time complexity with regard to feature dimensions. In this paper, we propose a novel optimization framework for deep learning via ADMM (dlADMM) to address these challenges simultaneously. The parameters in each layer are updated backward and then forward so that the parameter information in each layer is exchanged efficiently. The time complexity is reduced from cubic to quadratic in (latent) feature dimensions via a dedicated algorithm design for subproblems that enhances them utilizing iterative quadratic approximations and backtracking. Finally, we provide the first proof of global convergence for an ADMM-based method (dlADMM) in a deep neural network problem under mild conditions. Experiments on benchmark datasets demonstrated that our proposed dlADMM algorithm outperforms most of the comparison methods.
    Adversarial Graph Disentanglement. (arXiv:2103.07295v2 [cs.LG] UPDATED)
    (2 min) A real-world graph has a complex topological structure, which is often formed by the interaction of different latent factors. Disentanglement of these latent factors can effectively improve the robustness and expressiveness of node representation of graph. However, most existing methods lack consideration of the intrinsic differences in relations between nodes caused by factor entanglement. In this paper, we propose an Adversarial Disentangled Graph Convolutional Network (ADGCN) for disentangled graph representation learning. Specifically, a component-specific aggregation approach is proposed to achieve micro-disentanglement by inferring latent components that caused the links between nodes. On the basis of micro-disentanglement, we further propose a macro-disentanglement adversarial regularizer to improve the separability among component distributions, thus restricting the interdependence among components. Additionally, to reveal the topological graph structure, a diversity-preserving node sampling approach is proposed, by which the graph structure can be progressively refined in a way of local structure awareness. The experimental results on various real-world graph data verify that our ADGCN obtains more favorable performance over currently available alternatives.
    Deep Network Approximation With Accuracy Independent of Number of Neurons. (arXiv:2107.02397v1 [cs.LG])
    (2 min) This paper develops simple feed-forward neural networks that achieve the universal approximation property for all continuous functions with a fixed finite number of neurons. These neural networks are simple because they are designed with a simple and computable continuous activation function $\sigma$ leveraging a triangular-wave function and a softsign function. We prove that $\sigma$-activated networks with width $36d(2d+1)$ and depth $11$ can approximate any continuous function on a $d$-dimensioanl hypercube within an arbitrarily small error. Hence, for supervised learning and its related regression problems, the hypothesis space generated by these networks with a size not smaller than $36d(2d+1)\times 11$ is dense in the space of continuous functions. Furthermore, classification functions arising from image and signal classification are in the hypothesis space generated by $\sigma$-activated networks with width $36d(2d+1)$ and depth $12$, when there exist pairwise disjoint closed bounded subsets of $\mathbb{R}^d$ such that the samples of the same class are located in the same subset.
    Classification with Rejection Based on Cost-sensitive Classification. (arXiv:2010.11748v4 [stat.ML] UPDATED)
    (2 min) The goal of classification with rejection is to avoid risky misclassification in error-critical applications such as medical diagnosis and product inspection. In this paper, based on the relationship between classification with rejection and cost-sensitive classification, we propose a novel method of classification with rejection by learning an ensemble of cost-sensitive classifiers, which satisfies all the following properties: (i) it can avoid estimating class-posterior probabilities, resulting in improved classification accuracy, (ii) it allows a flexible choice of losses including non-convex ones, (iii) it does not require complicated modifications when using different losses, (iv) it is applicable to both binary and multiclass cases, and (v) it is theoretically justifiable for any classification-calibrated loss. Experimental results demonstrate the usefulness of our proposed approach in clean-labeled, noisy-labeled, and positive-unlabeled classification.
    Dynamical System Parameter Identification using Deep Recurrent Cell Networks. (arXiv:2107.02427v1 [cs.LG])
    (2 min) In this paper, we investigate the parameter identification problem in dynamical systems through a deep learning approach. Focusing mainly on second-order, linear time-invariant dynamical systems, the topic of damping factor identification is studied. By utilizing a six-layer deep neural network with different recurrent cells, namely GRUs, LSTMs or BiLSTMs; and by feeding input-output sequence pairs captured from a dynamical system simulator, we search for an effective deep recurrent architecture in order to resolve damping factor identification problem. Our study results show that, although previously not utilized for this task in the literature, bidirectional gated recurrent cells (BiLSTMs) provide better parameter identification results when compared to unidirectional gated recurrent memory cells such as GRUs and LSTM. Thus, indicating that an input-output sequence pair of finite length, collected from a dynamical system and when observed anachronistically, may carry information in both time directions for prediction of a dynamical systems parameter.
    Neural Computing. (arXiv:2107.02744v1 [cs.NE])
    (2 min) This chapter aims to provide next-level understanding of the problems of the world and the solutions available to those problems, which lie very well within the domain of neural computing, and at the same time are intelligent in their approach, to invoke a sense of innovation among the educationalists, researchers, academic professionals, students and people concerned, by highlighting the work done by major researchers and innovators in this field and thus, encouraging the readers to develop newer and more advanced techniques for the same. By means of this chapter, the societal problems are discussed and various solutions are also given by means of the theories presented and researches done so far. Different types of neural networks discovered so far and applications of some of those neural networks are focused on, apart from their theoretical understanding, the working and core concepts involved in the applications.
    Depth-supervised NeRF: Fewer Views and Faster Training for Free. (arXiv:2107.02791v1 [cs.CV])
    (2 min) One common failure mode of Neural Radiance Field (NeRF) models is fitting incorrect geometries when given an insufficient number of input views. We propose DS-NeRF (Depth-supervised Neural Radiance Fields), a loss for learning neural radiance fields that takes advantage of readily-available depth supervision. Our key insight is that sparse depth supervision can be used to regularize the learned geometry, a crucial component for effectively rendering novel views using NeRF. We exploit the fact that current NeRF pipelines require images with known camera poses that are typically estimated by running structure-from-motion (SFM). Crucially, SFM also produces sparse 3D points that can be used as ``free" depth supervision during training: we simply add a loss to ensure that depth rendered along rays that intersect these 3D points is close to the observed depth. We find that DS-NeRF can render more accurate images given fewer training views while training 2-6x faster. With only two training views on real-world images, DS-NeRF significantly outperforms NeRF as well as other sparse-view variants. We show that our loss is compatible with these NeRF models, demonstrating that depth is a cheap and easily digestible supervisory signal. Finally, we show that DS-NeRF supports other types of depth supervision such as scanned depth sensors and RGBD reconstruction outputs.
    Maximizing Ensemble Diversity in Deep Q-Learning. (arXiv:2006.13823v2 [cs.LG] UPDATED)
    (2 min) The classic DQN algorithm is limited by the overestimation bias of the learned Q-function. Subsequent algorithms have proposed techniques to reduce this problem, without fully eliminating it. Recently, the Maxmin and Ensemble Q-learning algorithms have used different estimates provided by the ensembles of learners to reduce the overestimation bias. Unfortunately, these learners can converge to the same point in the parametric or representation space, falling back to the classic single neural network DQN. In this paper, we describe a regularization technique to maximize ensemble diversity in these algorithms. We propose and compare five regularization functions inspired from economics theory and consensus optimization. We show that the regularized approach significantly outperforms the Maxmin and Ensemble Q-learning algorithms as well as non-ensemble baselines.
    Differentially private federated deep learning for multi-site medical image segmentation. (arXiv:2107.02586v1 [eess.IV])
    (2 min) Collaborative machine learning techniques such as federated learning (FL) enable the training of models on effectively larger datasets without data transfer. Recent initiatives have demonstrated that segmentation models trained with FL can achieve performance similar to locally trained models. However, FL is not a fully privacy-preserving technique and privacy-centred attacks can disclose confidential patient data. Thus, supplementing FL with privacy-enhancing technologies (PTs) such as differential privacy (DP) is a requirement for clinical applications in a multi-institutional setting. The application of PTs to FL in medical imaging and the trade-offs between privacy guarantees and model utility, the ramifications on training performance and the susceptibility of the final models to attacks have not yet been conclusively investigated. Here we demonstrate the first application of differentially private gradient descent-based FL on the task of semantic segmentation in computed tomography. We find that high segmentation performance is possible under strong privacy guarantees with an acceptable training time penalty. We furthermore demonstrate the first successful gradient-based model inversion attack on a semantic segmentation model and show that the application of DP prevents it from divulging sensitive image features.
    Remote sensing, AI and innovative prediction methods for adapting cities to the impacts of the climate change. (arXiv:2107.02693v1 [cs.LG])
    (2 min) Urban areas are not only one of the biggest contributors to climate change, but also they are one of the most vulnerable areas with high populations who would together experience the negative impacts. In this paper, I address some of the opportunities brought by satellite remote sensing imaging and artificial intelligence (AI) in order to measure climate adaptation of cities automatically. I propose an AI-based framework which might be useful for extracting indicators from remote sensing images and might help with predictive estimation of future states of these climate adaptation related indicators. When such models become more robust and used in real-life applications, they might help decision makers and early responders to choose the best actions to sustain the wellbeing of society, natural resources and biodiversity. I underline that this is an open field and an ongoing research for many scientists, therefore I offer an in depth discussion on the challenges and limitations of AI-based methods and the predictive estimation models in general.
    Learned Visual Navigation for Under-Canopy Agricultural Robots. (arXiv:2107.02792v1 [cs.RO])
    (2 min) We describe a system for visually guided autonomous navigation of under-canopy farm robots. Low-cost under-canopy robots can drive between crop rows under the plant canopy and accomplish tasks that are infeasible for over-the-canopy drones or larger agricultural equipment. However, autonomously navigating them under the canopy presents a number of challenges: unreliable GPS and LiDAR, high cost of sensing, challenging farm terrain, clutter due to leaves and weeds, and large variability in appearance over the season and across crop types. We address these challenges by building a modular system that leverages machine learning for robust and generalizable perception from monocular RGB images from low-cost cameras, and model predictive control for accurate control in challenging terrain. Our system, CropFollow, is able to autonomously drive 485 meters per intervention on average, outperforming a state-of-the-art LiDAR based system (286 meters per intervention) in extensive field testing spanning over 25 km.
    The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games. (arXiv:2103.01955v2 [cs.LG] UPDATED)
    (2 min) Proximal Policy Optimization (PPO) is a popular on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due the belief that on-policy methods are significantly less sample efficient than their off-policy counterparts in multi-agent problems. In this work, we investigate Multi-Agent PPO (MAPPO), a variant of PPO which is specialized for multi-agent settings. Using a 1-GPU desktop, we show that MAPPO achieves surprisingly strong performance in three popular multi-agent testbeds: the particle-world environments, the Starcraft multi-agent challenge, and the Hanabi challenge, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. In the majority of environments, we find that compared to off-policy baselines, MAPPO achieves strong results while exhibiting comparable sample efficiency. Finally, through ablation studies, we present the implementation and algorithmic factors which are most influential to MAPPO's practical performance.
    Fractional order graph neural network. (arXiv:2001.04026v3 [cs.LG] UPDATED)
    (2 min) This paper proposes fractional order graph neural networks (FGNNs), optimized by the approximation strategy to address the challenges of local optimum of classic and fractional graph neural networks which are specialised at aggregating information from the feature and adjacent matrices of connected nodes and their neighbours to solve learning tasks on non-Euclidean data such as graphs. Meanwhile the approximate calculation of fractional order gradients also overcomes the high computational complexity of fractional order derivations. We further prove that such an approximation is feasible and the FGNN is unbiased towards global optimization solution. Extensive experiments on citation networks show that FGNN achieves great advantage over baseline models when selected appropriate fractional order.
    An Evaluation of Machine Learning and Deep Learning Models for Drought Prediction using Weather Data. (arXiv:2107.02517v1 [cs.LG])
    (2 min) Drought is a serious natural disaster that has a long duration and a wide range of influence. To decrease the drought-caused losses, drought prediction is the basis of making the corresponding drought prevention and disaster reduction measures. While this problem has been studied in the literature, it remains unknown whether drought can be precisely predicted or not with machine learning models using weather data. To answer this question, a real-world public dataset is leveraged in this study and different drought levels are predicted using the last 90 days of 18 meteorological indicators as the predictors. In a comprehensive approach, 16 machine learning models and 16 deep learning models are evaluated and compared. The results show no single model can achieve the best performance for all evaluation metrics simultaneously, which indicates the drought prediction problem is still challenging. As benchmarks for further studies, the code and results are publicly available in a Github repository.
    Geometric convergence of elliptical slice sampling. (arXiv:2105.03308v2 [stat.ML] UPDATED)
    (2 min) For Bayesian learning, given likelihood function and Gaussian prior, the elliptical slice sampler, introduced by Murray, Adams and MacKay 2010, provides a tool for the construction of a Markov chain for approximate sampling of the underlying posterior distribution. Besides of its wide applicability and simplicity its main feature is that no tuning is necessary. Under weak regularity assumptions on the posterior density we show that the corresponding Markov chain is geometrically ergodic and therefore yield qualitative convergence guarantees. We illustrate our result for Gaussian posteriors as they appear in Gaussian process regression, as well as in a setting of a multi-modal distribution. Remarkably, our numerical experiments indicate a dimension-independent performance of elliptical slice sampling even in situations where our ergodicity result does not apply.
    FedFog: Network-Aware Optimization of Federated Learning over Wireless Fog-Cloud Systems. (arXiv:2107.02755v1 [cs.LG])
    (2 min) Federated learning (FL) is capable of performing large distributed machine learning tasks across multiple edge users by periodically aggregating trained local parameters. To address key challenges of enabling FL over a wireless fog-cloud system (e.g., non-i.i.d. data, users' heterogeneity), we first propose an efficient FL algorithm (called FedFog) to perform the local aggregation of gradient parameters at fog servers and global training update at the cloud. Next, we employ FedFog in wireless fog-cloud systems by investigating a novel network-aware FL optimization problem that strikes the balance between the global loss and completion time. An iterative algorithm is then developed to obtain a precise measurement of the system performance, which helps design an efficient stopping criteria to output an appropriate number of global rounds. To mitigate the straggler effect, we propose a flexible user aggregation strategy that trains fast users first to obtain a certain level of accuracy before allowing slow users to join the global training updates. Extensive numerical results using several real-world FL tasks are provided to verify the theoretical convergence of FedFog. We also show that the proposed co-design of FL and communication is essential to substantially improve resource utilization while achieving comparable accuracy of the learning model.
    SAGE: Intrusion Alert-driven Attack Graph Extractor. (arXiv:2107.02783v1 [cs.CR])
    (2 min) Attack graphs (AG) are used to assess pathways availed by cyber adversaries to penetrate a network. State-of-the-art approaches for AG generation focus mostly on deriving dependencies between system vulnerabilities based on network scans and expert knowledge. In real-world operations however, it is costly and ineffective to rely on constant vulnerability scanning and expert-crafted AGs. We propose to automatically learn AGs based on actions observed through intrusion alerts, without prior expert knowledge. Specifically, we develop an unsupervised sequence learning system, SAGE, that leverages the temporal and probabilistic dependence between alerts in a suffix-based probabilistic deterministic finite automaton (S-PDFA) -- a model that accentuates infrequent severe alerts and summarizes paths leading to them. AGs are then derived from the S-PDFA. Tested with intrusion alerts collected through Collegiate Penetration Testing Competition, SAGE produces AGs that reflect the strategies used by participating teams. The resulting AGs are succinct, interpretable, and enable analysts to derive actionable insights, e.g., attackers tend to follow shorter paths after they have discovered a longer one.
    Physics-informed regularization and structure preservation for learning stable reduced models from data with operator inference. (arXiv:2107.02597v1 [math.NA])
    (2 min) Operator inference learns low-dimensional dynamical-system models with polynomial nonlinear terms from trajectories of high-dimensional physical systems (non-intrusive model reduction). This work focuses on the large class of physical systems that can be well described by models with quadratic nonlinear terms and proposes a regularizer for operator inference that induces a stability bias onto quadratic models. The proposed regularizer is physics informed in the sense that it penalizes quadratic terms with large norms and so explicitly leverages the quadratic model form that is given by the underlying physics. This means that the proposed approach judiciously learns from data and physical insights combined, rather than from either data or physics alone. Additionally, a formulation of operator inference is proposed that enforces model constraints for preserving structure such as symmetry and definiteness in the linear terms. Numerical results demonstrate that models learned with operator inference and the proposed regularizer and structure preservation are accurate and stable even in cases where using no regularization or Tikhonov regularization leads to models that are unstable.
    Sawtooth Factorial Topic Embeddings Guided Gamma Belief Network. (arXiv:2107.02757v1 [cs.IR])
    (2 min) Hierarchical topic models such as the gamma belief network (GBN) have delivered promising results in mining multi-layer document representations and discovering interpretable topic taxonomies. However, they often assume in the prior that the topics at each layer are independently drawn from the Dirichlet distribution, ignoring the dependencies between the topics both at the same layer and across different layers. To relax this assumption, we propose sawtooth factorial topic embedding guided GBN, a deep generative model of documents that captures the dependencies and semantic similarities between the topics in the embedding space. Specifically, both the words and topics are represented as embedding vectors of the same dimension. The topic matrix at a layer is factorized into the product of a factor loading matrix and a topic embedding matrix, the transpose of which is set as the factor loading matrix of the layer above. Repeating this particular type of factorization, which shares components between adjacent layers, leads to a structure referred to as sawtooth factorization. An auto-encoding variational inference network is constructed to optimize the model parameter via stochastic gradient descent. Experiments on big corpora show that our models outperform other neural topic models on extracting deeper interpretable topics and deriving better document representations.
    A Model-Driven Engineering Approach to Machine Learning and Software Modeling. (arXiv:2107.02689v1 [cs.SE])
    (2 min) Models are used in both the Software Engineering (SE) and the Artificial Intelligence (AI) communities. In the former case, models of software, which may specify the software system architecture on different levels of abstraction could be used in various stages of the Software Development Life-Cycle (SDLC), from early conceptualization and design, to verification, implementation, testing and evolution. However, in the latter case, i.e., AI, models may provide smart capabilities, such as prediction and decision making support. For instance, in Machine Learning (ML), which is the most popular sub-discipline of AI at the present time, mathematical models may learn useful patterns in the observed data instances and can become capable of making better predictions or recommendations in the future. The goal of this work is to create synergy by bringing models in the said communities together and proposing a holistic approach. We illustrate how software models can become capable of producing or dealing with data analytics and ML models. The main focus is on the Internet of Things (IoT) and smart Cyber-Physical Systems (CPS) use cases, where both ML and model-driven (model-based) SE play a key role. In particular, we implement the proposed approach in an open source prototype and validate it using two use cases from the IoT/CPS domain.
    Provable Lipschitz Certification for Generative Models. (arXiv:2107.02732v1 [cs.LG])
    (2 min) We present a scalable technique for upper bounding the Lipschitz constant of generative models. We relate this quantity to the maximal norm over the set of attainable vector-Jacobian products of a given generative model. We approximate this set by layerwise convex approximations using zonotopes. Our approach generalizes and improves upon prior work using zonotope transformers and we extend to Lipschitz estimation of neural networks with large output dimension. This provides efficient and tight bounds on small networks and can scale to generative models on VAE and DCGAN architectures.
    From Talk to Action with Accountability: Monitoring the Public Discussion of Policy Makers with Deep Neural Networks and Topic Modelling. (arXiv:2010.08346v2 [cs.CL] UPDATED)
    (2 min) Decades of research on climate have provided a consensus that human activity has changed the climate and we are currently heading into a climate crisis. While public discussion and research efforts on climate change mitigation have increased, potential solutions need to not only be discussed but also effectively deployed. For preventing mismanagement and holding policy makers accountable, transparency and degree of information about government processes have been shown to be crucial. However, currently the quantity of information about climate change discussions and the range of sources make it increasingly difficult for the public and civil society to maintain an overview to hold politicians accountable. In response, we propose a multi-source topic aggregation system (MuSTAS) which processes policy makers speech and rhetoric from several publicly available sources into an easily digestible topic summary. MuSTAS uses novel multi-source hybrid latent Dirichlet allocation to model topics from a variety of documents. This topic digest will serve the general public and civil society in assessing where, how, and when politicians talk about climate and climate policies, enabling them to hold politicians accountable for their actions to mitigate climate change and lack thereof.
    A new smart-cropping pipeline for prostate segmentation using deep learning networks. (arXiv:2107.02476v1 [eess.IV])
    (2 min) Prostate segmentation from magnetic resonance imaging (MRI) is a challenging task. In recent years, several network architectures have been proposed to automate this process and alleviate the burden of manual annotation. Although the performance of these models has achieved promising results, there is still room for improvement before these models can be used safely and effectively in clinical practice. One of the major challenges in prostate MR image segmentation is the presence of class imbalance in the image labels where the background pixels dominate over the prostate. In the present work we propose a DL-based pipeline for cropping the region around the prostate from MRI images to produce a more balanced distribution of the foreground pixels (prostate) and the background pixels and improve segmentation accuracy. The effect of DL-cropping for improving the segmentation performance compared to standard center-cropping is assessed using five popular DL networks for prostate segmentation, namely U-net, U-net+, Res Unet++, Bridge U-net and Dense U-net. The proposed smart-cropping outperformed the standard center cropping in terms of segmentation accuracy for all the evaluated prostate segmentation networks. In terms of Dice score, the highest improvement was achieved for the U-net+ and ResU-net++ architectures corresponding to 8.9% and 8%, respectively.
    ML-Quadrat & DriotData: A Model-Driven Engineering Tool and a Low-Code Platform for Smart IoT Services. (arXiv:2107.02692v1 [cs.SE])
    (2 min) In this paper, we present the novel early tool prototype of ML-Quadrat, which is an open source research prototype, based on the Eclipse Modeling Framework (EMF) and the state of the art in the literature of Model-Driven Software Engineering (MDSE) for smart Cyber-Physical Systems (CPS) and the Internet of Things (IoT). Its envisioned users are mostly software developers, who might not have deep knowledge and skills in the heterogeneous IoT platforms and the diverse Artificial Intelligence (AI) technologies, specifically regarding Data Analytics and Machine Learning (DAML). ML-Quadrat is released under the terms of the Apache 2.0 license on Github: https://github.com/arminmoin/ML-Quadrat. Additionally, the novel early tool prototype of DriotData, a Low-Code platform targeting citizen data scientists and citizen/end-user software developers is demonstrated. DriotData exploits and adopts ML-Quadrat and offers an extended version of it as a web-based service to companies, especially Small- and Medium-Sized Enterprises (SME). A basic web-based demo of the Minimum Viable Product (MVP) of DriotData is already available. Finally, a short video demonstrating the tools is available on YouTube: https://youtu.be/YCNFfhmy_JY.
    A Multi-Objective Approach for Sustainable Generative Audio Models. (arXiv:2107.02621v1 [cs.LG])
    (2 min) In recent years, the deep learning community has largely focused on the accuracy of deep generative models, resulting in impressive improvements in several research fields. However, this scientific race for quality comes at a tremendous computational cost, which incurs vast energy consumption and greenhouse gas emissions. If the current exponential growth of computational consumption persists, Artificial Intelligence (AI) will sadly become a considerable contributor to global warming. At the heart of this problem are the measures that we use as a scientific community to evaluate our work. Currently, researchers in the field of AI judge scientific works mostly based on the improvement in accuracy, log-likelihood, reconstruction or opinion scores, all of which entirely obliterates the actual computational cost of generative models. In this paper, we introduce the idea of relying on a multi-objective measure based on Pareto optimality, which simultaneously integrates the models accuracy, as well as the environmental impact of their training. By applying this measure on the current state-of-the-art in generative audio models, we show that this measure drastically changes the perceived significance of the results in the field, encouraging optimal training techniques and resource allocation. We hope that this type of measure will be widely adopted, in order to help the community to better evaluate the significance of their work, while bringing computational cost -- and in fine carbon emissions -- in the spotlight of AI research.
    Deep Learning Methods for Joint Optimization of Beamforming and Fronthaul Quantization in Cloud Radio Access Networks. (arXiv:2107.02520v1 [eess.SP])
    (2 min) Cooperative beamforming across access points (APs) and fronthaul quantization strategies are essential for cloud radio access network (C-RAN) systems. The nonconvexity of the C-RAN optimization problems, which is stemmed from per-AP power and fronthaul capacity constraints, requires high computational complexity for executing iterative algorithms. To resolve this issue, we investigate a deep learning approach where the optimization module is replaced with a well-trained deep neural network (DNN). An efficient learning solution is proposed which constructs a DNN to produce a low-dimensional representation of optimal beamforming and quantization strategies. Numerical results validate the advantages of the proposed learning solution.
    InfoNCE is a variational autoencoder. (arXiv:2107.02495v1 [stat.ML])
    (2 min) We show that a popular self-supervised learning method, InfoNCE, is a special case of a new family of unsupervised learning methods, the self-supervised variational autoencoder (SSVAE). SSVAEs circumvent the usual VAE requirement to reconstruct the data by using a carefully chosen implicit decoder. The InfoNCE objective was motivated as a simplified parametric mutual information estimator. Under one choice of prior, the SSVAE objective (i.e. the ELBO) is exactly equal to the mutual information (up to constants). Under an alternative choice of prior, the SSVAE objective is exactly equal to the simplified parametric mutual information estimator used in InfoNCE (up to constants). Importantly, the use of simplified parametric mutual information estimators is believed to be critical to obtain good high-level representations, and the SSVAE framework naturally provides a principled justification for using prior information to choose these estimators.
    A visual introduction to Gaussian Belief Propagation. (arXiv:2107.02308v1 [cs.AI])
    (2 min) In this article, we present a visual introduction to Gaussian Belief Propagation (GBP), an approximate probabilistic inference algorithm that operates by passing messages between the nodes of arbitrarily structured factor graphs. A special case of loopy belief propagation, GBP updates rely only on local information and will converge independently of the message schedule. Our key argument is that, given recent trends in computing hardware, GBP has the right computational properties to act as a scalable distributed probabilistic inference framework for future machine learning systems.
    EVARS-GPR: EVent-triggered Augmented Refitting of Gaussian Process Regression for Seasonal Data. (arXiv:2107.02463v1 [cs.LG])
    (2 min) Time series forecasting is a growing domain with diverse applications. However, changes of the system behavior over time due to internal or external influences are challenging. Therefore, predictions of a previously learned fore-casting model might not be useful anymore. In this paper, we present EVent-triggered Augmented Refitting of Gaussian Process Regression for Seasonal Data (EVARS-GPR), a novel online algorithm that is able to handle sudden shifts in the target variable scale of seasonal data. For this purpose, EVARS-GPR com-bines online change point detection with a refitting of the prediction model using data augmentation for samples prior to a change point. Our experiments on sim-ulated data show that EVARS-GPR is applicable for a wide range of output scale changes. EVARS-GPR has on average a 20.8 % lower RMSE on different real-world datasets compared to methods with a similar computational resource con-sumption. Furthermore, we show that our algorithm leads to a six-fold reduction of the averaged runtime in relation to all comparison partners with a periodical refitting strategy. In summary, we present a computationally efficient online fore-casting algorithm for seasonal time series with changes of the target variable scale and demonstrate its functionality on simulated as well as real-world data. All code is publicly available on GitHub: https://github.com/grimmlab/evars-gpr.
    Energy Forecasting in Smart Grid Systems: A Review of the State-of-the-art Techniques. (arXiv:2011.12598v2 [cs.LG] UPDATED)
    (2 min) Energy forecasting has a vital role to play in smart grid (SG) systems involving various applications such as demand-side management, load shedding, and optimum dispatch. Managing efficient forecasting while ensuring the least possible prediction error is one of the main challenges posed in the grid today, considering the uncertainty and granularity in SG data. This paper presents a comprehensive and application-oriented review of state-of-the-art forecasting methods for SG systems along with recent developments in probabilistic deep learning (PDL) considering different models and architectures. Traditional point forecasting methods including statistical, machine learning (ML), and deep learning (DL) are extensively investigated in terms of their applicability to energy forecasting. In addition, the significance of hybrid and data pre-processing techniques to support forecasting performance is also studied. A comparative case study using the Victorian electricity consumption and American electric power (AEP) datasets is conducted to analyze the performance of point and probabilistic forecasting methods. The analysis demonstrates higher accuracy of the long-short term memory (LSTM) models with appropriate hyper-parameter tuning among point forecasting methods especially when sample sizes are larger and involve nonlinear patterns with long sequences. Furthermore, Bayesian bidirectional LSTM (BLSTM) as a probabilistic method exhibit the highest accuracy in terms of least pinball score and root mean square error (RMSE).
    The QR decomposition for radial neural networks. (arXiv:2107.02550v1 [cs.LG])
    (2 min) We provide a theoretical framework for neural networks in terms of the representation theory of quivers, thus revealing symmetries of the parameter space of neural networks. An exploitation of these symmetries leads to a model compression algorithm for radial neural networks based on an analogue of the QR decomposition. A projected version of backpropogation on the original model matches usual backpropogation on the compressed model.
    Rethinking Positional Encoding. (arXiv:2107.02561v1 [cs.LG])
    (2 min) It is well noted that coordinate based MLPs benefit greatly -- in terms of preserving high-frequency information -- through the encoding of coordinate positions as an array of Fourier features. Hitherto, the rationale for the effectiveness of these positional encodings has been solely studied through a Fourier lens. In this paper, we strive to broaden this understanding by showing that alternative non-Fourier embedding functions can indeed be used for positional encoding. Moreover, we show that their performance is entirely determined by a trade-off between the stable rank of the embedded matrix and the distance preservation between embedded coordinates. We further establish that the now ubiquitous Fourier feature mapping of position is a special case that fulfills these conditions. Consequently, we present a more general theory to analyze positional encoding in terms of shifted basis functions. To this end, we develop the necessary theoretical formulae and empirically verify that our theoretical claims hold in practice. Codes available at https://github.com/osiriszjq/Rethinking-positional-encoding.
    Multi-Level Graph Contrastive Learning. (arXiv:2107.02639v1 [cs.LG])
    (2 min) Graph representation learning has attracted a surge of interest recently, whose target at learning discriminant embedding for each node in the graph. Most of these representation methods focus on supervised learning and heavily depend on label information. However, annotating graphs are expensive to obtain in the real world, especially in specialized domains (i.e. biology), as it needs the annotator to have the domain knowledge to label the graph. To approach this problem, self-supervised learning provides a feasible solution for graph representation learning. In this paper, we propose a Multi-Level Graph Contrastive Learning (MLGCL) framework for learning robust representation of graph data by contrasting space views of graphs. Specifically, we introduce a novel contrastive view - topological and feature space views. The original graph is first-order approximation structure and contains uncertainty or error, while the $k$NN graph generated by encoding features preserves high-order proximity. Thus $k$NN graph generated by encoding features not only provide a complementary view, but is more suitable to GNN encoder to extract discriminant representation. Furthermore, we develop a multi-level contrastive mode to preserve the local similarity and semantic similarity of graph-structured data simultaneously. Extensive experiments indicate MLGCL achieves promising results compared with the existing state-of-the-art graph representation learning methods on seven datasets.
    Leveraging Clinical Context for User-Centered Explainability: A Diabetes Use Case. (arXiv:2107.02359v1 [cs.LG])
    (2 min) Academic advances of AI models in high-precision domains, like healthcare, need to be made explainable in order to enhance real-world adoption. Our past studies and ongoing interactions indicate that medical experts can use AI systems with greater trust if there are ways to connect the model inferences about patients to explanations that are tied back to the context of use. Specifically, risk prediction is a complex problem of diagnostic and interventional importance to clinicians wherein they consult different sources to make decisions. To enable the adoption of the ever improving AI risk prediction models in practice, we have begun to explore techniques to contextualize such models along three dimensions of interest: the patients' clinical state, AI predictions about their risk of complications, and algorithmic explanations supporting the predictions. We validate the importance of these dimensions by implementing a proof-of-concept (POC) in type-2 diabetes (T2DM) use case where we assess the risk of chronic kidney disease (CKD) - a common T2DM comorbidity. Within the POC, we include risk prediction models for CKD, post-hoc explainers of the predictions, and other natural-language modules which operationalize domain knowledge and CPGs to provide context. With primary care physicians (PCP) as our end-users, we present our initial results and clinician feedback in this paper. Our POC approach covers multiple knowledge sources and clinical scenarios, blends knowledge to explain data and predictions to PCPs, and received an enthusiastic response from our medical expert.
    Causal Inference with Corrupted Data: Measurement Error, Missing Values, Discretization, and Differential Privacy. (arXiv:2107.02780v1 [econ.EM])
    (2 min) Even the most carefully curated economic data sets have variables that are noisy, missing, discretized, or privatized. The standard workflow for empirical research involves data cleaning followed by data analysis that typically ignores the bias and variance consequences of data cleaning. We formulate a semiparametric model for causal inference with corrupted data to encompass both data cleaning and data analysis. We propose a new end-to-end procedure for data cleaning, estimation, and inference with data cleaning-adjusted confidence intervals. We prove root-n consistency, Gaussian approximation, and semiparametric efficiency for our estimator of the causal parameter by finite sample arguments. Our key assumption is that the true covariates are approximately low rank. In our analysis, we provide nonasymptotic theoretical contributions to matrix completion, statistical learning, and semiparametric statistics. We verify the coverage of the data cleaning-adjusted confidence intervals in simulations.
    Does Dataset Complexity Matters for Model Explainers?. (arXiv:2107.02661v1 [cs.LG])
    (2 min) Strategies based on Explainable Artificial Intelligence - XAI have emerged in computing to promote a better understanding of predictions made by black box models. Most XAI-based tools used today explain these types of models, generating attribute rankings aimed at explaining the same, that is, the analysis of Attribute Importance. There is no consensus on which XAI tool generates a general rank of explainability, for this reason, several proposals for tools have emerged (Ciu, Dalex, Eli5, Lofo, Shap and Skater). Here, we present an experimental benchmark of explainable AI techniques capable of producing model-agnostic global explainability ranks based on tabular data related to different problems. Seeking to answer questions such as "Are the explanations generated by the different tools the same, similar or different?" and "How does data complexity play along model explainability?". The results from the construction of 82 computational models and 592 ranks give us some light on the other side of the problem of explainability: dataset complexity!
    Evaluating subgroup disparity using epistemic uncertainty in mammography. (arXiv:2107.02716v1 [cs.LG])
    (2 min) As machine learning (ML) continue to be integrated into healthcare systems that affect clinical decision making, new strategies will need to be incorporated in order to effectively detect and evaluate subgroup disparities to ensure accountability and generalizability in clinical workflows. In this paper, we explore how epistemic uncertainty can be used to evaluate disparity in patient demographics (race) and data acquisition (scanner) subgroups for breast density assessment on a dataset of 108,190 mammograms collected from 33 clinical sites. Our results show that even if aggregate performance is comparable, the choice of uncertainty quantification metric can significantly the subgroup level. We hope this analysis can promote further work on how uncertainty can be leveraged to increase transparency of machine learning applications for clinical deployment.
    Enabling Un-/Semi-Supervised Machine Learning for MDSE of the Real-World CPS/IoT Applications. (arXiv:2107.02690v1 [cs.SE])
    (2 min) In this paper, we propose a novel approach to support domain-specific Model-Driven Software Engineering (MDSE) for the real-world use-case scenarios of smart Cyber-Physical Systems (CPS) and the Internet of Things (IoT). We argue that the majority of available data in the nature for Artificial Intelligence (AI), specifically Machine Learning (ML) are unlabeled. Hence, unsupervised and/or semi-supervised ML approaches are the practical choices. However, prior work in the literature of MDSE has considered supervised ML approaches, which only work with labeled training data. Our proposed approach is fully implemented and integrated with an existing state-of-the-art MDSE tool to serve the CPS/IoT domain. Moreover, we validate the proposed approach using a portion of the open data of the REFIT reference dataset for the smart energy systems domain. Our model-to-code transformations (code generators) provide the full source code of the desired IoT services out of the model instances in an automated manner. Currently, we generate the source code in Java and Python. The Python code is responsible for the ML functionalities and uses the APIs of several ML libraries and frameworks, namely Scikit-Learn, Keras and TensorFlow. For unsupervised and semi-supervised learning, the APIs of Scikit-Learn are deployed. In addition to the pure MDSE approach, where certain ML methods, e.g., K-Means, Mini-Batch K-Means, DB-SCAN, Spectral Clustering, Gaussian Mixture Model, Self-Training, Label Propagation and Label Spreading are supported, a more flexible, hybrid approach is also enabled to support the practitioner in deploying a pre-trained ML model with any arbitrary architecture and learning algorithm.
    Implicit Variational Conditional Sampling with Normalizing Flows. (arXiv:2107.02474v1 [stat.ML])
    (2 min) We present a method for conditional sampling with normalizing flows when only part of an observation is available. We rely on the following fact: if the flow's domain can be partitioned in such a way that the flow restrictions to subdomains keep the bijectivity property, a lower bound to the conditioning variable log-probability can be derived. Simulation from the variational conditional flow then amends to solving an equality constraint. Our contribution is three-fold: a) we provide detailed insights on the choice of variational distributions; b) we propose how to partition the input space of the flow to preserve bijectivity property; c) we propose a set of methods to optimise the variational distribution in specific cases. Through extensive experiments, we show that our sampling method can be applied with success to invertible residual networks for inference and classification.
    Intrinsic uncertainties and where to find them. (arXiv:2107.02526v1 [cs.LG])
    (2 min) We introduce a framework for uncertainty estimation that both describes and extends many existing methods. We consider typical hyperparameters involved in classical training as random variables and marginalise them out to capture various sources of uncertainty in the parameter space. We investigate which forms and combinations of marginalisation are most useful from a practical point of view on standard benchmarking data sets. Moreover, we discuss how some marginalisations may produce reliable estimates of uncertainty without the need for extensive hyperparameter tuning and/or large-scale ensembling.
    Enhanced Universal Dependency Parsing with Automated Concatenation of Embeddings. (arXiv:2107.02416v1 [cs.CL])
    (2 min) This paper describes the system used in submission from SHANGHAITECH team to the IWPT 2021 Shared Task. Our system is a graph-based parser with the technique of Automated Concatenation of Embeddings (ACE). Because recent work found that better word representations can be obtained by concatenating different types of embeddings, we use ACE to automatically find the better concatenation of embeddings for the task of enhanced universal dependencies. According to official results averaged on 17 languages, our system ranks 2nd over 9 teams.
    Large Scale Model Predictive Control with Neural Networks and Primal Active Sets. (arXiv:1910.10835v2 [cs.LG] UPDATED)
    (2 min) This work presents an explicit-implicit procedure to compute a model predictive control (MPC) law with guarantees on recursive feasibility and asymptotic stability. The approach combines an offline-trained fully-connected neural network with an online primal active set solver. The neural network provides a control input initialization while the primal active set method ensures recursive feasibility and asymptotic stability. The neural network is trained with a primal-dual loss function, aiming to generate control sequences that are primal feasible and meet a desired level of suboptimality. Since the neural network alone does not guarantee constraint satisfaction, its output is used to warm start the primal active set method online. We demonstrate that this approach scales to large problems with thousands of optimization variables, which are challenging for current approaches. Our method achieves a 2x reduction in online inference time compared to the best method in a benchmark suite of different solver and initialization strategies.
    Meta-Reinforcement Learning for Heuristic Planning. (arXiv:2107.02603v1 [cs.AI])
    (2 min) In Meta-Reinforcement Learning (meta-RL) an agent is trained on a set of tasks to prepare for and learn faster in new, unseen, but related tasks. The training tasks are usually hand-crafted to be representative of the expected distribution of test tasks and hence all used in training. We show that given a set of training tasks, learning can be both faster and more effective (leading to better performance in the test tasks), if the training tasks are appropriately selected. We propose a task selection algorithm, Information-Theoretic Task Selection (ITTS), based on information theory, which optimizes the set of tasks used for training in meta-RL, irrespectively of how they are generated. The algorithm establishes which training tasks are both sufficiently relevant for the test tasks, and different enough from one another. We reproduce different meta-RL experiments from the literature and show that ITTS improves the final performance in all of them.
    Shell Language Processing: Unix command parsing for Machine Learning. (arXiv:2107.02438v1 [cs.LG])
    (2 min) In this article, we present a Shell Language Preprocessing (SLP) library, which implements tokenization and encoding directed on the parsing of Unix and Linux shell commands. We describe the rationale behind the need for a new approach with specific examples when conventional Natural Language Processing (NLP) pipelines fail. Furthermore, we evaluate our methodology on a security classification task against widely accepted information and communications technology (ICT) tokenization techniques and achieve significant improvement of an F1-score from 0.392 to 0.874.
    DEANN: Speeding up Kernel-Density Estimation using Approximate Nearest Neighbor Search. (arXiv:2107.02736v1 [cs.DS])
    (2 min) Kernel Density Estimation (KDE) is a nonparametric method for estimating the shape of a density function, given a set of samples from the distribution. Recently, locality-sensitive hashing, originally proposed as a tool for nearest neighbor search, has been shown to enable fast KDE data structures. However, these approaches do not take advantage of the many other advances that have been made in algorithms for nearest neighbor algorithms. We present an algorithm called Density Estimation from Approximate Nearest Neighbors (DEANN) where we apply Approximate Nearest Neighbor (ANN) algorithms as a black box subroutine to compute an unbiased KDE. The idea is to find points that have a large contribution to the KDE using ANN, compute their contribution exactly, and approximate the remainder with Random Sampling (RS). We present a theoretical argument that supports the idea that an ANN subroutine can speed up the evaluation. Furthermore, we provide a C++ implementation with a Python interface that can make use of an arbitrary ANN implementation as a subroutine for KDE evaluation. We show empirically that our implementation outperforms state of the art implementations in all high dimensional datasets we considered, and matches the performance of RS in cases where the ANN yield no gains in performance.
    Early Recognition of Ball Catching Success in Clinical Trials with RNN-Based Predictive Classification. (arXiv:2107.02442v1 [cs.LG])
    (2 min) Motor disturbances can affect the interaction with dynamic objects, such as catching a ball. A classification of clinical catching trials might give insight into the existence of pathological alterations in the relation of arm and ball movements. Accurate, but also early decisions are required to classify a catching attempt before the catcher's first ball contact. To obtain clinically valuable results, a significant decision confidence of at least 75% is required. Hence, three competing objectives have to be optimized at the same time: accuracy, earliness and decision-making confidence. Here we propose a coupled classification and prediction approach for early time series classification: a predictive, generative recurrent neural network (RNN) forecasts the next data points of ball trajectories based on already available observations; a discriminative RNN continuously generates classification guesses based on the available data points and the unrolled sequence predictions. We compare our approach, which we refer to as predictive sequential classification (PSC), to state-of-the-art sequence learners, including various RNN and temporal convolutional network (TCN) architectures. On this hard real-world task we can consistently demonstrate the superiority of PSC over all other models in terms of accuracy and confidence with respect to earliness of recognition. Specifically, PSC is able to confidently classify the success of catching trials as early as 123 milliseconds before the first ball contact. We conclude that PSC is a promising approach for early time series classification, when accurate and confident decisions are required.
    DTGAN: Differential Private Training for Tabular GANs. (arXiv:2107.02521v1 [cs.LG])
    (2 min) Tabular generative adversarial networks (TGAN) have recently emerged to cater to the need of synthesizing tabular data -- the most widely used data format. While synthetic tabular data offers the advantage of complying with privacy regulations, there still exists a risk of privacy leakage via inference attacks due to interpolating the properties of real data during training. Differential private (DP) training algorithms provide theoretical guarantees for training machine learning models by injecting statistical noise to prevent privacy leaks. However, the challenges of applying DP on TGAN are to determine the most optimal framework (i.e., PATE/DP-SGD) and neural network (i.e., Generator/Discriminator)to inject noise such that the data utility is well maintained under a given privacy guarantee. In this paper, we propose DTGAN, a novel conditional Wasserstein tabular GAN that comes in two variants DTGAN_G and DTGAN_D, for providing a detailed comparison of tabular GANs trained using DP-SGD for the generator vs discriminator, respectively. We elicit the privacy analysis associated with training the generator with complex loss functions (i.e., classification and information losses) needed for high quality tabular data synthesis. Additionally, we rigorously evaluate the theoretical privacy guarantees offered by DP empirically against membership and attribute inference attacks. Our results on 3 datasets show that the DP-SGD framework is superior to PATE and that a DP discriminator is more optimal for training convergence. Thus, we find (i) DTGAN_D is capable of maintaining the highest data utility across 4 ML models by up to 18% in terms of the average precision score for a strict privacy budget, epsilon = 1, as compared to the prior studies and (ii) DP effectively prevents privacy loss against inference attacks by restricting the success probability of membership attacks to be close to 50%.
    DriveML: An R Package for Driverless Machine Learning. (arXiv:2005.00478v2 [cs.LG] UPDATED)
    (2 min) In recent years, the concept of automated machine learning has become very popular. Automated Machine Learning (AutoML) mainly refers to the automated methods for model selection and hyper-parameter optimization of various algorithms such as random forests, gradient boosting, neural networks, etc. In this paper, we introduce a new package i.e. DriveML for automated machine learning. DriveML helps in implementing some of the pillars of an automated machine learning pipeline such as automated data preparation, feature engineering, model building and model explanation by running the function instead of writing lengthy R codes. The DriveML package is available in CRAN. We compare the DriveML package with other relevant packages in CRAN/Github and find that DriveML performs the best across different parameters. We also provide an illustration by applying the DriveML package with default configuration on a real world dataset. Overall, the main benefits of DriveML are in development time savings, reduce developer's errors, optimal tuning of machine learning models and reproducibility.
    GradDiv: Adversarial Robustness of Randomized Neural Networks via Gradient Diversity Regularization. (arXiv:2107.02425v1 [cs.LG])
    (2 min) Deep learning is vulnerable to adversarial examples. Many defenses based on randomized neural networks have been proposed to solve the problem, but fail to achieve robustness against attacks using proxy gradients such as the Expectation over Transformation (EOT) attack. We investigate the effect of the adversarial attacks using proxy gradients on randomized neural networks and demonstrate that it highly relies on the directional distribution of the loss gradients of the randomized neural network. We show in particular that proxy gradients are less effective when the gradients are more scattered. To this end, we propose Gradient Diversity (GradDiv) regularizations that minimize the concentration of the gradients to build a robust randomized neural network. Our experiments on MNIST, CIFAR10, and STL10 show that our proposed GradDiv regularizations improve the adversarial robustness of randomized neural networks against a variety of state-of-the-art attack methods. Moreover, our method efficiently reduces the transferability among sample models of randomized neural networks.
    CoReD: Generalizing Fake Media Detection with Continual Representation using Distillation. (arXiv:2107.02408v1 [cs.CV])
    (2 min) Over the last few decades, artificial intelligence research has made tremendous strides, but it still heavily relies on fixed datasets in stationary environments. Continual learning is a growing field of research that examines how AI systems can learn sequentially from a continuous stream of linked data in the same way that biological systems do. Simultaneously, fake media such as deepfakes and synthetic face images have emerged as significant to current multimedia technologies. Recently, numerous method has been proposed which can detect deepfakes with high accuracy. However, they suffer significantly due to their reliance on fixed datasets in limited evaluation settings. Therefore, in this work, we apply continuous learning to neural networks' learning dynamics, emphasizing its potential to increase data efficiency significantly. We propose Continual Representation using Distillation (CoReD) method that employs the concept of Continual Learning (CoL), Representation Learning (ReL), and Knowledge Distillation (KD). We design CoReD to perform sequential domain adaptation tasks on new deepfake and GAN-generated synthetic face datasets, while effectively minimizing the catastrophic forgetting in a teacher-student model setting. Our extensive experimental results demonstrate that our method is efficient at domain adaptation to detect low-quality deepfakes videos and GAN-generated images from several datasets, outperforming the-state-of-art baseline methods.
    Deep Visual Attention-Based Transfer Clustering. (arXiv:2107.02415v1 [cs.LG])
    (2 min) In this paper, we propose a methodology to improvise the technique of deep transfer clustering (DTC) when applied to the less variant data distribution. Clustering can be considered as the most important unsupervised learning problem. A simple definition of clustering can be stated as "the process of organizing objects into groups, whose members are similar in some way". Image clustering is a crucial but challenging task in the domain machine learning and computer vision. We have discussed the clustering of the data collection where the data is less variant. We have discussed the improvement by using attention-based classifiers rather than regular classifiers as the initial feature extractors in the deep transfer clustering. We have enforced the model to learn only the required region of interest in the images to get the differentiable and robust features that do not take into account the background. This paper is the improvement of the existing deep transfer clustering for less variant data distribution.
    Dueling Bandits with Team Comparisons. (arXiv:2107.02738v1 [cs.LG])
    (2 min) We introduce the dueling teams problem, a new online-learning setting in which the learner observes noisy comparisons of disjoint pairs of $k$-sized teams from a universe of $n$ players. The goal of the learner is to minimize the number of duels required to identify, with high probability, a Condorcet winning team, i.e., a team which wins against any other disjoint team (with probability at least $1/2$). Noisy comparisons are linked to a total order on the teams. We formalize our model by building upon the dueling bandits setting (Yue et al.2012) and provide several algorithms, both for stochastic and deterministic settings. For the stochastic setting, we provide a reduction to the classical dueling bandits setting, yielding an algorithm that identifies a Condorcet winning team within $\mathcal{O}((n + k \log (k)) \frac{\max(\log\log n, \log k)}{\Delta^2})$ duels, where $\Delta$ is a gap parameter. For deterministic feedback, we additionally present a gap-independent algorithm that identifies a Condorcet winning team within $\mathcal{O}(nk\log(k)+k^5)$ duels.
    On Generalization of Graph Autoencoders with Adversarial Training. (arXiv:2107.02658v1 [cs.LG])
    (2 min) Adversarial training is an approach for increasing model's resilience against adversarial perturbations. Such approaches have been demonstrated to result in models with feature representations that generalize better. However, limited works have been done on adversarial training of models on graph data. In this paper, we raise such a question { does adversarial training improve the generalization of graph representations. We formulate L2 and L1 versions of adversarial training in two powerful node embedding methods: graph autoencoder (GAE) and variational graph autoencoder (VGAE). We conduct extensive experiments on three main applications, i.e. link prediction, node clustering, graph anomaly detection of GAE and VGAE, and demonstrate that both L2 and L1 adversarial training boost the generalization of GAE and VGAE.
    Automatic size and pose homogenization with spatial transformer network to improve and accelerate pediatric segmentation. (arXiv:2107.02655v1 [cs.CV])
    (2 min) Due to a high heterogeneity in pose and size and to a limited number of available data, segmentation of pediatric images is challenging for deep learning methods. In this work, we propose a new CNN architecture that is pose and scale invariant thanks to the use of Spatial Transformer Network (STN). Our architecture is composed of three sequential modules that are estimated together during training: (i) a regression module to estimate a similarity matrix to normalize the input image to a reference one; (ii) a differentiable module to find the region of interest to segment; (iii) a segmentation module, based on the popular UNet architecture, to delineate the object. Unlike the original UNet, which strives to learn a complex mapping, including pose and scale variations, from a finite training dataset, our segmentation module learns a simpler mapping focusing on images with normalized pose and size. Furthermore, the use of an automatic bounding box detection through STN allows saving time and especially memory, while keeping similar performance. We test the proposed method in kidney and renal tumor segmentation on abdominal pediatric CT scanners. Results indicate that the estimated STN homogenization of size and pose accelerates the segmentation (25h), compared to standard data-augmentation (33h), while obtaining a similar quality for the kidney (88.01\% of Dice score) and improving the renal tumor delineation (from 85.52\% to 87.12\%).
    Physical Interaction as Communication: Learning Robot Objectives Online from Human Corrections. (arXiv:2107.02349v1 [cs.RO])
    (2 min) When a robot performs a task next to a human, physical interaction is inevitable: the human might push, pull, twist, or guide the robot. The state-of-the-art treats these interactions as disturbances that the robot should reject or avoid. At best, these robots respond safely while the human interacts; but after the human lets go, these robots simply return to their original behavior. We recognize that physical human-robot interaction (pHRI) is often intentional -- the human intervenes on purpose because the robot is not doing the task correctly. In this paper, we argue that when pHRI is intentional it is also informative: the robot can leverage interactions to learn how it should complete the rest of its current task even after the person lets go. We formalize pHRI as a dynamical system, where the human has in mind an objective function they want the robot to optimize, but the robot does not get direct access to the parameters of this objective -- they are internal to the human. Within our proposed framework human interactions become observations about the true objective. We introduce approximations to learn from and respond to pHRI in real-time. We recognize that not all human corrections are perfect: often users interact with the robot noisily, and so we improve the efficiency of robot learning from pHRI by reducing unintended learning. Finally, we conduct simulations and user studies on a robotic manipulator to compare our proposed approach to the state-of-the-art. Our results indicate that learning from pHRI leads to better task performance and improved human satisfaction.
    Causal Bandits on General Graphs. (arXiv:2107.02772v1 [cs.LG])
    (2 min) We study the problem of determining the best intervention in a Causal Bayesian Network (CBN) specified only by its causal graph. We model this as a stochastic multi-armed bandit (MAB) problem with side-information, where the interventions correspond to the arms of the bandit instance. First, we propose a simple regret minimization algorithm that takes as input a semi-Markovian causal graph with atomic interventions and possibly unobservable variables, and achieves $\tilde{O}(\sqrt{M/T})$ expected simple regret, where $M$ is dependent on the input CBN and could be very small compared to the number of arms. We also show that this is almost optimal for CBNs described by causal graphs having an $n$-ary tree structure. Our simple regret minimization results, both upper and lower bound, subsume previous results in the literature, which assumed additional structural restrictions on the input causal graph. In particular, our results indicate that the simple regret guarantee of our proposed algorithm can only be improved by considering more nuanced structural restrictions on the causal graph. Next, we propose a cumulative regret minimization algorithm that takes as input a general causal graph with all observable nodes and atomic interventions and performs better than the optimal MAB algorithm that does not take causal side-information into account. We also experimentally compare both our algorithms with the best known algorithms in the literature. To the best of our knowledge, this work gives the first simple and cumulative regret minimization algorithms for CBNs with general causal graphs under atomic interventions and having unobserved confounders.
    Mind Your Outliers! Investigating the Negative Impact of Outliers on Active Learning for Visual Question Answering. (arXiv:2107.02331v1 [cs.CL])
    (2 min) Active learning promises to alleviate the massive data needs of supervised machine learning: it has successfully improved sample efficiency by an order of magnitude on traditional tasks like topic classification and object recognition. However, we uncover a striking contrast to this promise: across 5 models and 4 datasets on the task of visual question answering, a wide variety of active learning approaches fail to outperform random selection. To understand this discrepancy, we profile 8 active learning methods on a per-example basis, and identify the problem as collective outliers -- groups of examples that active learning methods prefer to acquire but models fail to learn (e.g., questions that ask about text in images or require external knowledge). Through systematic ablation experiments and qualitative visualizations, we verify that collective outliers are a general phenomenon responsible for degrading pool-based active learning. Notably, we show that active learning sample efficiency increases significantly as the number of collective outliers in the active learning pool decreases. We conclude with a discussion and prescriptive recommendations for mitigating the effects of these outliers in future work.
    Semantic Segmentation Alternative Technique: Segmentation Domain Generation. (arXiv:2107.02525v1 [cs.CV])
    (2 min) Detecting objects of interest in images was always a compelling task to automate. In recent years this task was more and more explored using deep learning techniques, mostly using region-based convolutional networks. In this project we propose an alternative semantic segmentation technique making use of Generative Adversarial Networks. We consider semantic segmentation to be a domain transfer problem. Thus, we train a feed forward network (FFNN) to receive as input a seed real image and generate as output its segmentation mask.
    Automated age-related macular degeneration area estimation -- first results. (arXiv:2107.02211v1 [eess.IV])
    (2 min) This work aims to research an automatic method for detecting Age-related Macular Degeneration (AMD) lesions in RGB eye fundus images. For this, we align invasively obtained eye fundus contrast images (the "golden standard" diagnostic) to the RGB ones and use them to hand-annotate the lesions. This is done using our custom-made tool. Using the data, we train and test five different convolutional neural networks: a custom one to classify healthy and AMD-affected eye fundi, and four well-known networks: ResNet50, ResNet101, MobileNetV3, and UNet to segment (localize) the AMD lesions in the affected eye fundus images. We achieve 93.55% accuracy or 69.71% Dice index as the preliminary best results in segmentation with MobileNetV3.
    A comparison of LSTM and GRU networks for learning symbolic sequences. (arXiv:2107.02248v1 [cs.LG])
    (2 min) We explore relations between the hyper-parameters of a recurrent neural network (RNN) and the complexity of string sequences it is able to memorize. We compare long short-term memory (LSTM) networks and gated recurrent units (GRUs). We find that an increase of RNN depth does not necessarily result in better memorization capability when the training time is constrained. Our results also indicate that the learning rate and the number of units per layer are among the most important hyper-parameters to be tuned. Generally, GRUs outperform LSTM networks on low complexity sequences while on high complexity sequences LSTMs perform better.
    SplitAVG: A heterogeneity-aware federated deep learning method for medical imaging. (arXiv:2107.02375v1 [cs.LG])
    (2 min) Federated learning is an emerging research paradigm for enabling collaboratively training deep learning models without sharing patient data. However, the data from different institutions are usually heterogeneous across institutions, which may reduce the performance of models trained using federated learning. In this study, we propose a novel heterogeneity-aware federated learning method, SplitAVG, to overcome the performance drops from data heterogeneity in federated learning. Unlike previous federated methods that require complex heuristic training or hyper parameter tuning, our SplitAVG leverages the simple network split and feature map concatenation strategies to encourage the federated model training an unbiased estimator of the target data distribution. We compare SplitAVG with seven state-of-the-art federated learning methods, using centrally hosted training data as the baseline on a suite of both synthetic and real-world federated datasets. We find that the performance of models trained using all the comparison federated learning methods degraded significantly with the increasing degrees of data heterogeneity. In contrast, SplitAVG method achieves comparable results to the baseline method under all heterogeneous settings, that it achieves 96.2% of the accuracy and 110.4% of the mean absolute error obtained by the baseline in a diabetic retinopathy binary classification dataset and a bone age prediction dataset, respectively, on highly heterogeneous data partitions. We conclude that SplitAVG method can effectively overcome the performance drops from variability in data distributions across institutions. Experimental results also show that SplitAVG can be adapted to different base networks and generalized to various types of medical imaging tasks.
    An Ensemble Noise-Robust K-fold Cross-Validation Selection Method for Noisy Labels. (arXiv:2107.02347v1 [cs.LG])
    (2 min) We consider the problem of training robust and accurate deep neural networks (DNNs) when subject to various proportions of noisy labels. Large-scale datasets tend to contain mislabeled samples that can be memorized by DNNs, impeding the performance. With appropriate handling, this degradation can be alleviated. There are two problems to consider: how to distinguish clean samples and how to deal with noisy samples. In this paper, we present Ensemble Noise-robust K-fold Cross-Validation Selection (E-NKCVS) to effectively select clean samples from noisy data, solving the first problem. For the second problem, we create a new pseudo label for any sample determined to have an uncertain or likely corrupt label. E-NKCVS obtains multiple predicted labels for each sample and the entropy of these labels is used to tune the weight given to the pseudo label and the given label. Theoretical analysis and extensive verification of the algorithms in the noisy label setting are provided. We evaluate our approach on various image and text classification tasks where the labels have been manually corrupted with different noise ratios. Additionally, two large real-world noisy datasets are also used, Clothing-1M and WebVision. E-NKCVS is empirically shown to be highly tolerant to considerable proportions of label noise and has a consistent improvement over state-of-the-art methods. Especially on more difficult datasets with higher noise ratios, we can achieve a significant improvement over the second-best model. Moreover, our proposed approach can easily be integrated into existing DNN methods to improve their robustness against label noise.
    Meta-learning Amidst Heterogeneity and Ambiguity. (arXiv:2107.02228v1 [cs.LG])
    (2 min) Meta-learning aims to learn a model that can handle multiple tasks generated from an unknown but shared distribution. However, typical meta-learning algorithms have assumed the tasks to be similar such that a single meta-learner is sufficient to aggregate the variations in all aspects. In addition, there has been less consideration on uncertainty when limited information is given as context. In this paper, we devise a novel meta-learning framework, called Meta-learning Amidst Heterogeneity and Ambiguity (MAHA), that outperforms previous works in terms of prediction based on its ability on task identification. By extensively conducting several experiments in regression and classification, we demonstrate the validity of our model, which turns out to be robust to both task heterogeneity and ambiguity.
    Learning an Explicit Hyperparameter Prediction Policy Conditioned on Tasks. (arXiv:2107.02378v1 [cs.LG])
    (2 min) Meta learning has attracted much attention recently in machine learning community. Contrary to conventional machine learning aiming to learn inherent prediction rules to predict labels for new query data, meta learning aims to learn the learning methodology for machine learning from observed tasks, so as to generalize to new query tasks by leveraging the meta-learned learning methodology. In this study, we interpret such learning methodology as learning an explicit hyperparameter prediction policy shared by all training tasks. Specifically, this policy is represented as a parameterized function called meta-learner, mapping from a training/test task to its suitable hyperparameter setting, extracted from a pre-specified function set called meta learning machine. Such setting guarantees that the meta-learned learning methodology is able to flexibly fit diverse query tasks, instead of only obtaining fixed hyperparameters by many current meta learning methods, with less adaptability to query task's variations. Such understanding of meta learning also makes it easily succeed from traditional learning theory for analyzing its generalization bounds with general losses/tasks/models. The theory naturally leads to some feasible controlling strategies for ameliorating the quality of the extracted meta-learner, verified to be able to finely ameliorate its generalization capability in some typical meta learning applications, including few-shot regression, few-shot classification and domain generalization.
    DeepDDS: deep graph neural network with attention mechanism to predict synergistic drug combinations. (arXiv:2107.02467v1 [cs.LG])
    (2 min) Drug combination therapy has become a increasingly promising method in the treatment of cancer. However, the number of possible drug combinations is so huge that it is hard to screen synergistic drug combinations through wet-lab experiments. Therefore, computational screening has become an important way to prioritize drug combinations. Graph neural network have recently shown remarkable performance in the prediction of compound-protein interactions, but it has not been applied to the screening of drug combinations. In this paper, we proposed a deep learning model based on graph neural networks and attention mechanism to identify drug combinations that can effectively inhibit the viability of specific cancer cells. The feature embeddings of drug molecule structure and gene expression profiles were taken as input to multi-layer feedforward neural network to identify the synergistic drug combinations. We compared DeepDDS with classical machine learning methods and other deep learning-based methods on benchmark data set, and the leave-one-out experimental results showed that DeepDDS achieved better performance than competitive methods. Also, on an independent test set released by well-known pharmaceutical enterprise AstraZeneca, DeepDDS was superior to competitive methods by more than 16\% predictive precision. Furthermore, we explored the interpretability of the graph attention network, and found the correlation matrix of atomic features revealed important chemical substructures of drugs. We believed that DeepDDS is an effective tool that prioritized synergistic drug combinations for further wet-lab experiment validation.
    An Inverse QSAR Method Based on Linear Regression and Integer Programming. (arXiv:2107.02381v1 [cs.LG])
    (2 min) Recently a novel framework has been proposed for designing the molecular structure of chemical compounds using both artificial neural networks (ANNs) and mixed integer linear programming (MILP). In the framework, we first define a feature vector $f(C)$ of a chemical graph $C$ and construct an ANN that maps $x=f(C)$ to a predicted value $\eta(x)$ of a chemical property $\pi$ to $C$. After this, we formulate an MILP that simulates the computation process of $f(C)$ from $C$ and that of $\eta(x)$ from $x$. Given a target value $y^*$ of the chemical property $\pi$, we infer a chemical graph $C^\dagger$ such that $\eta(f(C^\dagger))=y^*$ by solving the MILP. In this paper, we use linear regression to construct a prediction function $\eta$ instead of ANNs. For this, we derive an MILP formulation that simulates the computation process of a prediction function by linear regression. The results of computational experiments suggest our method can infer chemical graphs with around up to 50 non-hydrogen atoms.
    Instant One-Shot Word-Learning for Context-Specific Neural Sequence-to-Sequence Speech Recognition. (arXiv:2107.02268v1 [cs.CL])
    (2 min) Neural sequence-to-sequence systems deliver state-of-the-art performance for automatic speech recognition (ASR). When using appropriate modeling units, e.g., byte-pair encoded characters, these systems are in principal open vocabulary systems. In practice, however, they often fail to recognize words not seen during training, e.g., named entities, numbers or technical terms. To alleviate this problem we supplement an end-to-end ASR system with a word/phrase memory and a mechanism to access this memory to recognize the words and phrases correctly. After the training of the ASR system, and when it has already been deployed, a relevant word can be added or subtracted instantly without the need for further training. In this paper we demonstrate that through this mechanism our system is able to recognize more than 85% of newly added words that it previously failed to recognize compared to a strong baseline.
    Energy and Thermal-aware Resource Management of Cloud Data Centres: A Taxonomy and Future Directions. (arXiv:2107.02342v1 [cs.DC])
    (2 min) This paper investigates the existing resource management approaches in Cloud Data Centres for energy and thermal efficiency. It identifies the need for integrated computing and cooling systems management and learning-based solutions in resource management systems. A taxonomy on energy and thermal efficient resource management in data centres is proposed based on an in-depth analysis of the literature. Furthermore, a detailed survey on existing approaches is conducted according to the taxonomy and recent advancements including machine learning-based resource management approaches and cooling management technologies are discussed.
    Sarcasm Detection: A Comparative Study. (arXiv:2107.02276v1 [cs.CL])
    (2 min) Sarcasm detection is the task of identifying irony containing utterances in sentiment-bearing text. However, the figurative and creative nature of sarcasm poses a great challenge for affective computing systems performing sentiment analysis. This article compiles and reviews the salient work in the literature of automatic sarcasm detection. Thus far, three main paradigm shifts have occurred in the way researchers have approached this task: 1) semi-supervised pattern extraction to identify implicit sentiment, 2) use of hashtag-based supervision, and 3) incorporation of context beyond target text. In this article, we provide a comprehensive review of the datasets, approaches, trends, and issues in sarcasm and irony detection.
    Vision Xformers: Efficient Attention for Image Classification. (arXiv:2107.02239v1 [cs.CV])
    (2 min) Linear attention mechanisms provide hope for overcoming the bottleneck of quadratic complexity which restricts application of transformer models in vision tasks. We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers like Performer, Linformer and Nystr\"omformer of linear complexity creating Vision X-formers (ViX). We show that ViX performs better than ViT in image classification consuming lesser computing resources. We further show that replacing the embedding linear layer by convolutional layers in ViX further increases their performance. Our test on recent visions transformer models like LeViT and Compact Convolutional Transformer (CCT) show that replacing the attention with Nystr\"omformer or Performer saves GPU usage and memory without deteriorating performance. Incorporating these changes can democratize transformers by making them accessible to those with limited data and computing resources.
    Histogram of Cell Types: Deep Learning for Automated Bone Marrow Cytology. (arXiv:2107.02293v1 [eess.IV])
    (2 min) Bone marrow cytology is required to make a hematological diagnosis, influencing critical clinical decision points in hematology. However, bone marrow cytology is tedious, limited to experienced reference centers and associated with high inter-observer variability. This may lead to a delayed or incorrect diagnosis, leaving an unmet need for innovative supporting technologies. We have developed the first ever end-to-end deep learning-based technology for automated bone marrow cytology. Starting with a bone marrow aspirate digital whole slide image, our technology rapidly and automatically detects suitable regions for cytology, and subsequently identifies and classifies all bone marrow cells in each region. This collective cytomorphological information is captured in a novel representation called Histogram of Cell Types (HCT) quantifying bone marrow cell class probability distribution and acting as a cytological "patient fingerprint". The approach achieves high accuracy in region detection (0.97 accuracy and 0.99 ROC AUC), and cell detection and cell classification (0.75 mAP, 0.78 F1-score, Log-average miss rate of 0.31). HCT has potential to revolutionize hematopathology diagnostic workflows, leading to more cost-effective, accurate diagnosis and opening the door to precision medicine.
    Memory-Sample Lower Bounds for Learning Parity with Noise. (arXiv:2107.02320v1 [cs.LG])
    (2 min) In this work, we show, for the well-studied problem of learning parity under noise, where a learner tries to learn $x=(x_1,\ldots,x_n) \in \{0,1\}^n$ from a stream of random linear equations over $\mathrm{F}_2$ that are correct with probability $\frac{1}{2}+\varepsilon$ and flipped with probability $\frac{1}{2}-\varepsilon$, that any learning algorithm requires either a memory of size $\Omega(n^2/\varepsilon)$ or an exponential number of samples. In fact, we study memory-sample lower bounds for a large class of learning problems, as characterized by [GRT'18], when the samples are noisy. A matrix $M: A \times X \rightarrow \{-1,1\}$ corresponds to the following learning problem with error parameter $\varepsilon$: an unknown element $x \in X$ is chosen uniformly at random. A learner tries to learn $x$ from a stream of samples, $(a_1, b_1), (a_2, b_2) \ldots$, where for every $i$, $a_i \in A$ is chosen uniformly at random and $b_i = M(a_i,x)$ with probability $1/2+\varepsilon$ and $b_i = -M(a_i,x)$ with probability $1/2-\varepsilon$ ($0<\varepsilon< \frac{1}{2}$). Assume that $k,\ell, r$ are such that any submatrix of $M$ of at least $2^{-k} \cdot |A|$ rows and at least $2^{-\ell} \cdot |X|$ columns, has a bias of at most $2^{-r}$. We show that any learning algorithm for the learning problem corresponding to $M$, with error, requires either a memory of size at least $\Omega\left(\frac{k \cdot \ell}{\varepsilon} \right)$, or at least $2^{\Omega(r)}$ samples. In particular, this shows that for a large class of learning problems, same as those in [GRT'18], any learning algorithm requires either a memory of size at least $\Omega\left(\frac{(\log |X|) \cdot (\log |A|)}{\varepsilon}\right)$ or an exponential number of noisy samples. Our proof is based on adapting the arguments in [Raz'17,GRT'18] to the noisy case.
    Featurized Density Ratio Estimation. (arXiv:2107.02212v1 [cs.LG])
    (2 min) Density ratio estimation serves as an important technique in the unsupervised machine learning toolbox. However, such ratios are difficult to estimate for complex, high-dimensional data, particularly when the densities of interest are sufficiently different. In our work, we propose to leverage an invertible generative model to map the two distributions into a common feature space prior to estimation. This featurization brings the densities closer together in latent space, sidestepping pathological scenarios where the learned density ratios in input space can be arbitrarily inaccurate. At the same time, the invertibility of our feature map guarantees that the ratios computed in feature space are equivalent to those in input space. Empirically, we demonstrate the efficacy of our approach in a variety of downstream tasks that require access to accurate density ratios such as mutual information estimation, targeted sampling in deep generative models, and classification with data augmentation.
    Physics-Informed Graph Learning for Robust Fault Location in Distribution Systems. (arXiv:2107.02275v1 [cs.LG])
    (2 min) The rapid growth of distributed energy resources potentially increases power grid instability. One promising strategy is to employ data in power grids to efficiently respond to abnormal events (e.g., faults) by detection and location. Unfortunately, most existing works lack physical interpretation and are vulnerable to the practical challenges: sparse observation, insufficient labeled datasets, and stochastic environment. We propose a physics-informed graph learning framework of two stages to handle these challenges when locating faults. Stage- I focuses on informing a graph neural network (GNN) with the geometrical structure of power grids; stage-II employs the physical similarity of labeled and unlabeled data samples to improve the location accuracy. We provide a random walk-based the underpinning of designing our GNNs to address the challenge of sparse observation and augment the correct prediction probability. We compare our approach with three baselines in the IEEE 123-node benchmark system, showing that the proposed method outperforms the others by significant margins, especially when label rates are low. Also, we validate the robustness of our algorithms to out-of-distribution-data (ODD) due to topology changes and load variations. Additionally, we adapt our graph learning framework to the IEEE 37-node test feeder and show high location performance with the proposed training strategy.
    Total Nitrogen Estimation in Agricultural Soils via Aerial Multispectral Imaging and LIBS. (arXiv:2107.02355v1 [eess.IV])
    (2 min) Measuring soil health indicators is an important and challenging task that affects farmers' decisions on timing, placement, and quantity of fertilizers applied in the farms. Most existing methods to measure soil health indicators (SHIs) are in-lab wet chemistry or spectroscopy-based methods, which require significant human input and effort, time-consuming, costly, and are low-throughput in nature. To address this challenge, we develop an artificial intelligence (AI)-driven near real-time unmanned aerial vehicle (UAV)-based multispectral sensing (UMS) solution to estimate total nitrogen (TN) of the soil, an important macro-nutrient or SHI that directly affects the crop health. Accurate prediction of soil TN can significantly increase crop yield through informed decision making on the timing of seed planting, and fertilizer quantity and timing. We train two machine learning models including multi-layer perceptron and support vector machine to predict the soil nitrogen using a suite of data classes including multispectral characteristics of the soil and crops in red, near-infrared, and green spectral bands, computed vegetation indices, and environmental variables including air temperature and relative humidity. To generate the ground-truth data or the training data for the machine learning models, we measure the total nitrogen of the soil samples (collected from a farm) using laser-induced breakdown spectroscopy (LIBS).
    "Garbage In, Garbage Out" Revisited: What Do Machine Learning Application Papers Report About Human-Labeled Training Data?. (arXiv:2107.02278v1 [cs.LG])
    (2 min) Supervised machine learning, in which models are automatically derived from labeled training data, is only as good as the quality of that data. This study builds on prior work that investigated to what extent 'best practices' around labeling training data were followed in applied ML publications within a single domain (social media platforms). In this paper, we expand by studying publications that apply supervised ML in a far broader spectrum of disciplines, focusing on human-labeled data. We report to what extent a random sample of ML application papers across disciplines give specific details about whether best practices were followed, while acknowledging that a greater range of application fields necessarily produces greater diversity of labeling and annotation methods. Because much of machine learning research and education only focuses on what is done once a "ground truth" or "gold standard" of training data is available, it is especially relevant to discuss issues around the equally-important aspect of whether such data is reliable in the first place. This determination becomes increasingly complex when applied to a variety of specialized fields, as labeling can range from a task requiring little-to-no background knowledge to one that must be performed by someone with career expertise.
    AdaSpeech 3: Adaptive Text to Speech for Spontaneous Style. (arXiv:2107.02530v1 [cs.SD])
    (2 min) While recent text to speech (TTS) models perform very well in synthesizing reading-style (e.g., audiobook) speech, it is still challenging to synthesize spontaneous-style speech (e.g., podcast or conversation), mainly because of two reasons: 1) the lack of training data for spontaneous speech; 2) the difficulty in modeling the filled pauses (um and uh) and diverse rhythms in spontaneous speech. In this paper, we develop AdaSpeech 3, an adaptive TTS system that fine-tunes a well-trained reading-style TTS model for spontaneous-style speech. Specifically, 1) to insert filled pauses (FP) in the text sequence appropriately, we introduce an FP predictor to the TTS model; 2) to model the varying rhythms, we introduce a duration predictor based on mixture of experts (MoE), which contains three experts responsible for the generation of fast, medium and slow speech respectively, and fine-tune it as well as the pitch predictor for rhythm adaptation; 3) to adapt to other speaker timbre, we fine-tune some parameters in the decoder with few speech data. To address the challenge of lack of training data, we mine a spontaneous speech dataset to support our research this work and facilitate future research on spontaneous TTS. Experiments show that AdaSpeech 3 synthesizes speech with natural FP and rhythms in spontaneous styles, and achieves much better MOS and SMOS scores than previous adaptive TTS systems.
    Equivariant bifurcation, quadratic equivariants, and symmetry breaking for the standard representation of $S_n$. (arXiv:2107.02422v1 [cs.LG])
    (2 min) Motivated by questions originating from the study of a class of shallow student-teacher neural networks, methods are developed for the analysis of spurious minima in classes of gradient equivariant dynamics related to neural nets. In the symmetric case, methods depend on the generic equivariant bifurcation theory of irreducible representations of the symmetric group on $n$ symbols, $S_n$; in particular, the standard representation of $S_n$. It is shown that spurious minima do not arise from spontaneous symmetry breaking but rather through a complex deformation of the landscape geometry that can be encoded by a generic $S_n$-equivariant bifurcation. We describe minimal models for forced symmetry breaking that give a lower bound on the dynamic complexity involved in the creation of spurious minima when there is no symmetry. Results on generic bifurcation when there are quadratic equivariants are also proved; this work extends and clarifies results of Ihrig & Golubitsky and Chossat, Lauterback & Melbourne on the instability of solutions when there are quadratic equivariants.
    Connectivity Matters: Neural Network Pruning Through the Lens of Effective Sparsity. (arXiv:2107.02306v1 [cs.LG])
    (2 min) Neural network pruning is a fruitful area of research with surging interest in high sparsity regimes. Benchmarking in this domain heavily relies on faithful representation of the sparsity of subnetworks, which has been traditionally computed as the fraction of removed connections (direct sparsity). This definition, however, fails to recognize unpruned parameters that detached from input or output layers of underlying subnetworks, potentially underestimating actual effective sparsity: the fraction of inactivated connections. While this effect might be negligible for moderately pruned networks (up to 10-100 compression rates), we find that it plays an increasing role for thinner subnetworks, greatly distorting comparison between different pruning algorithms. For example, we show that effective compression of a randomly pruned LeNet-300-100 can be orders of magnitude larger than its direct counterpart, while no discrepancy is ever observed when using SynFlow for pruning [Tanaka et al., 2020]. In this work, we adopt the lens of effective sparsity to reevaluate several recent pruning algorithms on common benchmark architectures (e.g., LeNet-300-100, VGG-19, ResNet-18) and discover that their absolute and relative performance changes dramatically in this new and more appropriate framework. To aim for effective, rather than direct, sparsity, we develop a low-cost extension to most pruning algorithms. Further, equipped with effective sparsity as a reference frame, we partially reconfirm that random pruning with appropriate sparsity allocation across layers performs as well or better than more sophisticated algorithms for pruning at initialization [Su et al., 2020]. In response to this observation, using a simple analogy of pressure distribution in coupled cylinders from physics, we design novel layerwise sparsity quotas that outperform all existing baselines in the context of random pruning.
    Asymptotics of Network Embeddings Learned via Subsampling. (arXiv:2107.02363v1 [stat.ML])
    (2 min) Network data are ubiquitous in modern machine learning, with tasks of interest including node classification, node clustering and link prediction. A frequent approach begins by learning an Euclidean embedding of the network, to which algorithms developed for vector-valued data are applied. For large networks, embeddings are learned using stochastic gradient methods where the sub-sampling scheme can be freely chosen. Despite the strong empirical performance of such methods, they are not well understood theoretically. Our work encapsulates representation methods using a subsampling approach, such as node2vec, into a single unifying framework. We prove, under the assumption that the graph is exchangeable, that the distribution of the learned embedding vectors asymptotically decouples. Moreover, we characterize the asymptotic distribution and provided rates of convergence, in terms of the latent parameters, which includes the choice of loss function and the embedding dimension. This provides a theoretical foundation to understand what the embedding vectors represent and how well these methods perform on downstream tasks. Notably, we observe that typically used loss functions may lead to shortcomings, such as a lack of Fisher consistency.
    Morphological Classification of Galaxies in S-PLUS using an Ensemble of Convolutional Networks. (arXiv:2107.02287v1 [astro-ph.GA])
    (2 min) The universe is composed of galaxies that have diverse shapes. Once the structure of a galaxy is determined, it is possible to obtain important information about its formation and evolution. Morphologically classifying galaxies means cataloging them according to their visual appearance and the classification is linked to the physical properties of the galaxy. A morphological classification made through visual inspection is subject to biases introduced by subjective observations made by human volunteers. For this reason, systematic, objective and easily reproducible classification of galaxies has been gaining importance since the astronomer Edwin Hubble created his famous classification method. In this work, we combine accurate visual classifications of the Galaxy Zoo project with \emph {Deep Learning} methods. The goal is to find an efficient technique at human performance level classification, but in a systematic and automatic way, for classification of elliptical and spiral galaxies. For this, a neural network model was created through an Ensemble of four other convolutional models, allowing a greater accuracy in the classification than what would be obtained with any one individual. Details of the individual models and improvements made are also described. The present work is entirely based on the analysis of images (not parameter tables) from DR1 (www.datalab.noao.edu) of the Southern Photometric Local Universe Survey (S-PLUS). In terms of classification, we achieved, with the Ensemble, an accuracy of $\approx 99 \%$ in the test sample (using pre-trained networks).
    Discrete-Valued Neural Communication. (arXiv:2107.02367v1 [cs.LG])
    (2 min) Deep learning has advanced from fully connected architectures to structured models organized into components, e.g., the transformer composed of positional elements, modular architectures divided into slots, and graph neural nets made up of nodes. In structured models, an interesting question is how to conduct dynamic and possibly sparse communication among the separate components. Here, we explore the hypothesis that restricting the transmitted information among components to discrete representations is a beneficial bottleneck. The motivating intuition is human language in which communication occurs through discrete symbols. Even though individuals have different understandings of what a ``"cat" is based on their specific experiences, the shared discrete token makes it possible for communication among individuals to be unimpeded by individual differences in internal representation. To discretize the values of concepts dynamically communicated among specialist components, we extend the quantization mechanism from the Vector-Quantized Variational Autoencoder to multi-headed discretization with shared codebooks and use it for discrete-valued neural communication (DVNC). Our experiments show that DVNC substantially improves systematic generalization in a variety of architectures -- transformers, modular architectures, and graph neural networks. We also show that the DVNC is robust to the choice of hyperparameters, making the method very useful in practice. Moreover, we establish a theoretical justification of our discretization process, proving that it has the ability to increase noise robustness and reduce the underlying dimensionality of the model.
    End-to-End Weak Supervision. (arXiv:2107.02233v1 [cs.LG])
    (2 min) Aggregating multiple sources of weak supervision (WS) can ease the data-labeling bottleneck prevalent in many machine learning applications, by replacing the tedious manual collection of ground truth labels. Current state of the art approaches that do not use any labeled training data, however, require two separate modeling steps: Learning a probabilistic latent variable model based on the WS sources -- making assumptions that rarely hold in practice -- followed by downstream model training. Importantly, the first step of modeling does not consider the performance of the downstream model. To address these caveats we propose an end-to-end approach for directly learning the downstream model by maximizing its agreement with probabilistic labels generated by reparameterizing previous probabilistic posteriors with a neural network. Our results show improved performance over prior work in terms of end model performance on downstream test sets, as well as in terms of improved robustness to dependencies among weak supervision sources.
    Efficient First-Order Contextual Bandits: Prediction, Allocation, and Triangular Discrimination. (arXiv:2107.02237v1 [cs.LG])
    (2 min) A recurring theme in statistical learning, online learning, and beyond is that faster convergence rates are possible for problems with low noise, often quantified by the performance of the best hypothesis; such results are known as first-order or small-loss guarantees. While first-order guarantees are relatively well understood in statistical and online learning, adapting to low noise in contextual bandits (and more broadly, decision making) presents major algorithmic challenges. In a COLT 2017 open problem, Agarwal, Krishnamurthy, Langford, Luo, and Schapire asked whether first-order guarantees are even possible for contextual bandits and -- if so -- whether they can be attained by efficient algorithms. We give a resolution to this question by providing an optimal and efficient reduction from contextual bandits to online regression with the logarithmic (or, cross-entropy) loss. Our algorithm is simple and practical, readily accommodates rich function classes, and requires no distributional assumptions beyond realizability. In a large-scale empirical evaluation, we find that our approach typically outperforms comparable non-first-order methods. On the technical side, we show that the logarithmic loss and an information-theoretic quantity called the triangular discrimination play a fundamental role in obtaining first-order guarantees, and we combine this observation with new refinements to the regression oracle reduction framework of Foster and Rakhlin. The use of triangular discrimination yields novel results even for the classical statistical learning model, and we anticipate that it will find broader use.
    CAP-RAM: A Charge-Domain In-Memory Computing 6T-SRAM for Accurate and Precision-Programmable CNN Inference. (arXiv:2107.02388v1 [cs.AR])
    (2 min) A compact, accurate, and bitwidth-programmable in-memory computing (IMC) static random-access memory (SRAM) macro, named CAP-RAM, is presented for energy-efficient convolutional neural network (CNN) inference. It leverages a novel charge-domain multiply-and-accumulate (MAC) mechanism and circuitry to achieve superior linearity under process variations compared to conventional IMC designs. The adopted semi-parallel architecture efficiently stores filters from multiple CNN layers by sharing eight standard 6T SRAM cells with one charge-domain MAC circuit. Moreover, up to six levels of bit-width of weights with two encoding schemes and eight levels of input activations are supported. A 7-bit charge-injection SAR (ciSAR) analog-to-digital converter (ADC) getting rid of sample and hold (S&H) and input/reference buffers further improves the overall energy efficiency and throughput. A 65-nm prototype validates the excellent linearity and computing accuracy of CAP-RAM. A single 512x128 macro stores a complete pruned and quantized CNN model to achieve 98.8% inference accuracy on the MNIST data set and 89.0% on the CIFAR-10 data set, with a 573.4-giga operations per second (GOPS) peak throughput and a 49.4-tera operations per second (TOPS)/W energy efficiency.
    Improving Text-to-Image Synthesis Using Contrastive Learning. (arXiv:2107.02423v1 [cs.LG])
    (2 min) The goal of text-to-image synthesis is to generate a visually realistic image that matches a given text description. In practice, the captions annotated by humans for the same image have large variance in terms of contents and the choice of words. The linguistic discrepancy between the captions of the identical image leads to the synthetic images deviating from the ground truth. To address this issue, we propose a contrastive learning approach to improve the quality and enhance the semantic consistency of synthetic images. In the pre-training stage, we utilize the contrastive learning approach to learn the consistent textual representations for the captions corresponding to the same image. Furthermore, in the following stage of GAN training, we employ the contrastive learning method to enhance the consistency between the generated images from the captions related to the same image. We evaluate our approach over two popular text-to-image synthesis models, AttnGAN and DM-GAN, on datasets CUB and COCO, respectively. Experimental results have shown that our approach can effectively improve the quality of synthetic images in terms of three metrics: IS, FID and R-precision. Especially, on the challenging COCO dataset, our approach boosts the FID significantly by 29.60% over AttnGAn and by 21.96% over DM-GAN.
    Impact of On-Chip Interconnect on In-Memory Acceleration of Deep Neural Networks. (arXiv:2107.02358v1 [cs.AR])
    (2 min) With the widespread use of Deep Neural Networks (DNNs), machine learning algorithms have evolved in two diverse directions -- one with ever-increasing connection density for better accuracy and the other with more compact sizing for energy efficiency. The increase in connection density increases on-chip data movement, which makes efficient on-chip communication a critical function of the DNN accelerator. The contribution of this work is threefold. First, we illustrate that the point-to-point (P2P)-based interconnect is incapable of handling a high volume of on-chip data movement for DNNs. Second, we evaluate P2P and network-on-chip (NoC) interconnect (with a regular topology such as a mesh) for SRAM- and ReRAM-based in-memory computing (IMC) architectures for a range of DNNs. This analysis shows the necessity for the optimal interconnect choice for an IMC DNN accelerator. Finally, we perform an experimental evaluation for different DNNs to empirically obtain the performance of the IMC architecture with both NoC-tree and NoC-mesh. We conclude that, at the tile level, NoC-tree is appropriate for compact DNNs employed at the edge, and NoC-mesh is necessary to accelerate DNNs with high connection density. Furthermore, we propose a technique to determine the optimal choice of interconnect for any given DNN. In this technique, we use analytical models of NoC to evaluate end-to-end communication latency of any given DNN. We demonstrate that the interconnect optimization in the IMC architecture results in up to 6$\times$ improvement in energy-delay-area product for VGG-19 inference compared to the state-of-the-art ReRAM-based IMC architectures.
    Weighted Gaussian Process Bandits for Non-stationary Environments. (arXiv:2107.02371v1 [cs.LG])
    (2 min) In this paper, we consider the Gaussian process (GP) bandit optimization problem in a non-stationary environment. To capture external changes, the black-box function is allowed to be time-varying within a reproducing kernel Hilbert space (RKHS). To this end, we develop WGP-UCB, a novel UCB-type algorithm based on weighted Gaussian process regression. A key challenge is how to cope with infinite-dimensional feature maps. To that end, we leverage kernel approximation techniques to prove a sublinear regret bound, which is the first (frequentist) sublinear regret guarantee on weighted time-varying bandits with general nonlinear rewards. This result generalizes both non-stationary linear bandits and standard GP-UCB algorithms. Further, a novel concentration inequality is achieved for weighted Gaussian process regression with general weights. We also provide universal upper bounds and weight-dependent upper bounds for weighted maximum information gains. These results are potentially of independent interest for applications such as news ranking and adaptive pricing, where weights can be adopted to capture the importance or quality of data. Finally, we conduct experiments to highlight the favorable gains of the proposed algorithm in many cases when compared to existing methods.
    Multi-Modal Mutual Information (MuMMI) Training for Robust Self-Supervised Deep Reinforcement Learning. (arXiv:2107.02339v1 [cs.LG])
    (2 min) This work focuses on learning useful and robust deep world models using multiple, possibly unreliable, sensors. We find that current methods do not sufficiently encourage a shared representation between modalities; this can cause poor performance on downstream tasks and over-reliance on specific sensors. As a solution, we contribute a new multi-modal deep latent state-space model, trained using a mutual information lower-bound. The key innovation is a specially-designed density ratio estimator that encourages consistency between the latent codes of each modality. We tasked our method to learn policies (in a self-supervised manner) on multi-modal Natural MuJoCo benchmarks and a challenging Table Wiping task. Experiments show our method significantly outperforms state-of-the-art deep reinforcement learning methods, particularly in the presence of missing observations.
    Effects of Smart Traffic Signal Control on Air Quality. (arXiv:2107.02361v1 [cs.MA])
    (2 min) Adaptive traffic signal control (ATSC) in urban traffic networks poses a challenging task due to the complicated dynamics arising in traffic systems. In recent years, several approaches based on multi-agent deep reinforcement learning (MARL) have been studied experimentally. These approaches propose distributed techniques in which each signalized intersection is seen as an agent in a stochastic game whose purpose is to optimize the flow of vehicles in its vicinity. In this setting, the systems evolves towards an equilibrium among the agents that shows beneficial for the whole traffic network. A recently developed multi-agent variant of the well-established advantage actor-critic (A2C) algorithm, called MA2C (multi-agent A2C) exploits the promising idea of some communication among the agents. In this view,the agents share their strategies with other neighbor agents, thereby stabilizing the learning process even when the agents grow in number and variety. We experimented MA2C in two traffic networks located in Bologna (Italy) and found that its action translates into a significant decrease of the amount of pollutants released into the environment.
    Dueling Bandits with Adversarial Sleeping. (arXiv:2107.02274v1 [cs.LG])
    (2 min) We introduce the problem of sleeping dueling bandits with stochastic preferences and adversarial availabilities (DB-SPAA). In almost all dueling bandit applications, the decision space often changes over time; eg, retail store management, online shopping, restaurant recommendation, search engine optimization, etc. Surprisingly, this `sleeping aspect' of dueling bandits has never been studied in the literature. Like dueling bandits, the goal is to compete with the best arm by sequentially querying the preference feedback of item pairs. The non-triviality however results due to the non-stationary item spaces that allow any arbitrary subsets items to go unavailable every round. The goal is to find an optimal `no-regret' policy that can identify the best available item at each round, as opposed to the standard `fixed best-arm regret objective' of dueling bandits. We first derive an instance-specific lower bound for DB-SPAA $\Omega( \sum_{i =1}^{K-1}\sum_{j=i+1}^K \frac{\log T}{\Delta(i,j)})$, where $K$ is the number of items and $\Delta(i,j)$ is the gap between items $i$ and $j$. This indicates that the sleeping problem with preference feedback is inherently more difficult than that for classical multi-armed bandits (MAB). We then propose two algorithms, with near optimal regret guarantees. Our results are corroborated empirically.
    Domain Adaptation via CycleGAN for Retina Segmentation in Optical Coherence Tomography. (arXiv:2107.02345v1 [eess.IV])
    (2 min) With the FDA approval of Artificial Intelligence (AI) for point-of-care clinical diagnoses, model generalizability is of the utmost importance as clinical decision-making must be domain-agnostic. A method of tackling the problem is to increase the dataset to include images from a multitude of domains; while this technique is ideal, the security requirements of medical data is a major limitation. Additionally, researchers with developed tools benefit from the addition of open-sourced data, but are limited by the difference in domains. Herewith, we investigated the implementation of a Cycle-Consistent Generative Adversarial Networks (CycleGAN) for the domain adaptation of Optical Coherence Tomography (OCT) volumes. This study was done in collaboration with the Biomedical Optics Research Group and Functional & Anatomical Imaging & Shape Analysis Lab at Simon Fraser University. In this study, we investigated a learning-based approach of adapting the domain of a publicly available dataset, UK Biobank dataset (UKB). To evaluate the performance of domain adaptation, we utilized pre-existing retinal layer segmentation tools developed on a different set of RETOUCH OCT data. This study provides insight on state-of-the-art tools for domain adaptation compared to traditional processing techniques as well as a pipeline for adapting publicly available retinal data to the domains previously used by our collaborators.
    A Review of Explainable Artificial Intelligence in Manufacturing. (arXiv:2107.02295v1 [cs.AI])
    (2 min) The implementation of Artificial Intelligence (AI) systems in the manufacturing domain enables higher production efficiency, outstanding performance, and safer operations, leveraging powerful tools such as deep learning and reinforcement learning techniques. Despite the high accuracy of these models, they are mostly considered black boxes: they are unintelligible to the human. Opaqueness affects trust in the system, a factor that is critical in the context of decision-making. We present an overview of Explainable Artificial Intelligence (XAI) techniques as a means of boosting the transparency of models. We analyze different metrics to evaluate these techniques and describe several application scenarios in the manufacturing domain.
    Generalization by design: Shortcuts to Generalization in Deep Learning. (arXiv:2107.02253v1 [cs.LG])
    (2 min) We take a geometrical viewpoint and present a unifying view on supervised deep learning with the Bregman divergence loss function - this entails frequent classification and prediction tasks. Motivated by simulations we suggest that there is principally no implicit bias of vanilla stochastic gradient descent training of deep models towards "simpler" functions. Instead, we show that good generalization may be instigated by bounded spectral products over layers leading to a novel geometric regularizer. It is revealed that in deep enough models such a regularizer enables both, extreme accuracy and generalization, to be reached. We associate popular regularization techniques like weight decay, drop out, batch normalization, and early stopping with this perspective. Backed up by theory we further demonstrate that "generalization by design" is practically possible and that good generalization may be encoded into the structure of the network. We design two such easy-to-use structural regularizers that insert an additional \textit{generalization layer} into a model architecture, one with a skip connection and another one with drop-out. We verify our theoretical results in experiments on various feedforward and convolutional architectures, including ResNets, and datasets (MNIST, CIFAR10, synthetic data). We believe this work opens up new avenues of research towards better generalizing architectures.
    Agents that Listen: High-Throughput Reinforcement Learning with Multiple Sensory Systems. (arXiv:2107.02195v1 [cs.LG])
    (2 min) Humans and other intelligent animals evolved highly sophisticated perception systems that combine multiple sensory modalities. On the other hand, state-of-the-art artificial agents rely mostly on visual inputs or structured low-dimensional observations provided by instrumented environments. Learning to act based on combined visual and auditory inputs is still a new topic of research that has not been explored beyond simple scenarios. To facilitate progress in this area we introduce a new version of VizDoom simulator to create a highly efficient learning environment that provides raw audio observations. We study the performance of different model architectures in a series of tasks that require the agent to recognize sounds and execute instructions given in natural language. Finally, we train our agent to play the full game of Doom and find that it can consistently defeat a traditional vision-based adversary. We are currently in the process of merging the augmented simulator with the main ViZDoom code repository. Video demonstrations and experiment code can be found at https://sites.google.com/view/sound-rl.
    DeepCEL0 for 2D Single Molecule Localization in Fluorescence Microscopy. (arXiv:2107.02281v1 [cs.LG])
    (2 min) In fluorescence microscopy, Single Molecule Localization Microscopy (SMLM) techniques aim at localizing with high precision high density fluorescent molecules by stochastically activating and imaging small subsets of blinking emitters. Super Resolution (SR) plays an important role in this field since it allows to go beyond the intrinsic light diffraction limit. In this work, we propose a deep learning-based algorithm for precise molecule localization of high density frames acquired by SMLM techniques whose $\ell_{2}$-based loss function is regularized by positivity and $\ell_{0}$-based constraints. The $\ell_{0}$ is relaxed through its Continuous Exact $\ell_{0}$ (CEL0) counterpart. The arising approach, named DeepCEL0, is parameter-free, more flexible, faster and provides more precise molecule localization maps if compared to the other state-of-the-art methods. We validate our approach on both simulated and real fluorescence microscopy data.
    A Short Note on the Relationship of Information Gain and Eluder Dimension. (arXiv:2107.02377v1 [cs.LG])
    (2 min) Eluder dimension and information gain are two widely used methods of complexity measures in bandit and reinforcement learning. Eluder dimension was originally proposed as a general complexity measure of function classes, but the common examples of where it is known to be small are function spaces (vector spaces). In these cases, the primary tool to upper bound the eluder dimension is the elliptic potential lemma. Interestingly, the elliptic potential lemma also features prominently in the analysis of linear bandits/reinforcement learning and their nonparametric generalization, the information gain. We show that this is not a coincidence -- eluder dimension and information gain are equivalent in a precise sense for reproducing kernel Hilbert spaces.
    Clustering Structure of Microstructure Measures. (arXiv:2107.02283v1 [q-fin.ST])
    (2 min) This paper builds the clustering model of measures of market microstructure features which are popular in predicting the stock returns. In a 10-second time frequency, we study the clustering structure of different measures to find out the best ones for predicting. In this way, we can predict more accurately with a limited number of predictors, which removes the noise and makes the model more interpretable.
    VolNet: Estimating Human Body Part Volumes from a Single RGB Image. (arXiv:2107.02259v1 [cs.CV])
    (2 min) Human body volume estimation from a single RGB image is a challenging problem despite minimal attention from the research community. However VolNet, an architecture leveraging 2D and 3D pose estimation, body part segmentation and volume regression extracted from a single 2D RGB image combined with the subject's body height can be used to estimate the total body volume. VolNet is designed to predict the 2D and 3D pose as well as the body part segmentation in intermediate tasks. We generated a synthetic, large-scale dataset of photo-realistic images of human bodies with a wide range of body shapes and realistic poses called SURREALvols. By using Volnet and combining multiple stacked hourglass networks together with ResNeXt, our model correctly predicted the volume in ~82% of cases with a 10% tolerance threshold. This is a considerable improvement compared to state-of-the-art solutions such as BodyNet with only a ~38% success rate.
    A Deep Learning-Based Particle-in-Cell Method for Plasma Simulations. (arXiv:2107.02232v1 [physics.plasm-ph])
    (2 min) We design and develop a new Particle-in-Cell (PIC) method for plasma simulations using Deep-Learning (DL) to calculate the electric field from the electron phase space. We train a Multilayer Perceptron (MLP) and a Convolutional Neural Network (CNN) to solve the two-stream instability test. We verify that the DL-based MLP PIC method produces the correct results using the two-stream instability: the DL-based PIC provides the expected growth rate of the two-stream instability. The DL-based PIC does not conserve the total energy and momentum. However, the DL-based PIC method is stable against the cold-beam instability, affecting traditional PIC methods. This work shows that integrating DL technologies into traditional computational methods is a viable approach for developing next-generation PIC algorithms.
    Neural Mixture Models with Expectation-Maximization for End-to-end Deep Clustering. (arXiv:2107.02453v1 [cs.LG])
    (2 min) Any clustering algorithm must synchronously learn to model the clusters and allocate data to those clusters in the absence of labels. Mixture model-based methods model clusters with pre-defined statistical distributions and allocate data to those clusters based on the cluster likelihoods. They iteratively refine those distribution parameters and member assignments following the Expectation-Maximization (EM) algorithm. However, the cluster representability of such hand-designed distributions that employ a limited amount of parameters is not adequate for most real-world clustering tasks. In this paper, we realize mixture model-based clustering with a neural network where the final layer neurons, with the aid of an additional transformation, approximate cluster distribution outputs. The network parameters pose as the parameters of those distributions. The result is an elegant, much-generalized representation of clusters than a restricted mixture of hand-designed distributions. We train the network end-to-end via batch-wise EM iterations where the forward pass acts as the E-step and the backward pass acts as the M-step. In image clustering, the mixture-based EM objective can be used as the clustering objective along with existing representation learning methods. In particular, we show that when mixture-EM optimization is fused with consistency optimization, it improves the sole consistency optimization performance in clustering. Our trained networks outperform single-stage deep clustering methods that still depend on k-means, with unsupervised classification accuracy of 63.8% in STL10, 58% in CIFAR10, 25.9% in CIFAR100, and 98.9% in MNIST.
    TransformerFusion: Monocular RGB Scene Reconstruction using Transformers. (arXiv:2107.02191v1 [cs.CV])
    (2 min) We introduce TransformerFusion, a transformer-based 3D scene reconstruction approach. From an input monocular RGB video, the video frames are processed by a transformer network that fuses the observations into a volumetric feature grid representing the scene; this feature grid is then decoded into an implicit 3D scene representation. Key to our approach is the transformer architecture that enables the network to learn to attend to the most relevant image frames for each 3D location in the scene, supervised only by the scene reconstruction task. Features are fused in a coarse-to-fine fashion, storing fine-level features only where needed, requiring lower memory storage and enabling fusion at interactive rates. The feature grid is then decoded to a higher-resolution scene reconstruction, using an MLP-based surface occupancy prediction from interpolated coarse-to-fine 3D features. Our approach results in an accurate surface reconstruction, outperforming state-of-the-art multi-view stereo depth estimation methods, fully-convolutional 3D reconstruction approaches, and approaches using LSTM- or GRU-based recurrent networks for video sequence fusion.
    Near-optimal inference in adaptive linear regression. (arXiv:2107.02266v1 [math.ST])
    (2 min) When data is collected in an adaptive manner, even simple methods like ordinary least squares can exhibit non-normal asymptotic behavior. As an undesirable consequence, hypothesis tests and confidence intervals based on asymptotic normality can lead to erroneous results. We propose an online debiasing estimator to correct these distributional anomalies in least squares estimation. Our proposed method takes advantage of the covariance structure present in the dataset and provides sharper estimates in directions for which more information has accrued. We establish an asymptotic normality property for our proposed online debiasing estimator under mild conditions on the data collection process, and provide asymptotically exact confidence intervals. We additionally prove a minimax lower bound for the adaptive linear regression problem, thereby providing a baseline by which to compare estimators. There are various conditions under which our proposed estimator achieves the minimax lower bound up to logarithmic factors. We demonstrate the usefulness of our theory via applications to multi-armed bandit, autoregressive time series estimation, and active learning with exploration.
    Long-Short Transformer: Efficient Transformers for Language and Vision. (arXiv:2107.02192v1 [cs.CV])
    (2 min) Transformers have achieved success in both language and vision domains. However, it is prohibitively expensive to scale them to long sequences such as long documents or high-resolution images, because self-attention mechanism has quadratic time and memory complexities with respect to the input sequence length. In this paper, we propose Long-Short Transformer (Transformer-LS), an efficient self-attention mechanism for modeling long sequences with linear complexity for both language and vision tasks. It aggregates a novel long-range attention with dynamic projection to model distant correlations and a short-term attention to capture fine-grained local correlations. We propose a dual normalization strategy to account for the scale mismatch between the two attention mechanisms. Transformer-LS can be applied to both autoregressive and bidirectional models without additional complexity. Our method outperforms the state-of-the-art models on multiple tasks in language and vision domains, including the Long Range Arena benchmark, autoregressive language modeling, and ImageNet classification. For instance, Transformer-LS achieves 0.97 test BPC on enwik8 using half the number of parameters than previous method, while being faster and is able to handle 3$\times$ as long sequences compared to its full-attention version on the same hardware. On ImageNet, it can obtain the state-of-the-art results~(e.g., Top-1 accuracy 84.1% trained on 224$\times$224 ImageNet-1K only), while being more scalable on high-resolution images. The models and source code will be released soon.
    Label noise in segmentation networks : mitigation must deal with bias. (arXiv:2107.02189v1 [cs.CV])
    (2 min) Imperfect labels limit the quality of predictions learned by deep neural networks. This is particularly relevant in medical image segmentation, where reference annotations are difficult to collect and vary significantly even across expert annotators. Prior work on mitigating label noise focused on simple models of mostly uniform noise. In this work, we explore biased and unbiased errors artificially introduced to brain tumour annotations on MRI data. We found that supervised and semi-supervised segmentation methods are robust or fairly robust to unbiased errors but sensitive to biased errors. It is therefore important to identify the sorts of errors expected in medical image labels and especially mitigate the biased errors.

2021-07-06

  • cs.CL updates on arXiv.org

    Doing Good or Doing Right? Exploring the Weakness of Commonsense Causal Reasoning Models. (arXiv:2107.01791v1 [cs.CL])
    (2 min) Pretrained language models (PLM) achieve surprising performance on the Choice of Plausible Alternatives (COPA) task. However, whether PLMs have truly acquired the ability of causal reasoning remains a question. In this paper, we investigate the problem of semantic similarity bias and reveal the vulnerability of current COPA models by certain attacks. Previous solutions that tackle the superficial cues of unbalanced token distribution still encounter the same problem of semantic bias, even more seriously due to the utilization of more training data. We mitigate this problem by simply adding a regularization loss and experimental results show that this solution not only improves the model's generalization ability, but also assists the models to perform more robustly on a challenging dataset, BCOPA-CE, which has unbiased token distribution and is more difficult for models to distinguish cause and effect.
    Coarse-to-Careful: Seeking Semantic-related Knowledge for Open-domain Commonsense Question Answering. (arXiv:2107.01592v1 [cs.CL])
    (2 min) It is prevalent to utilize external knowledge to help machine answer questions that need background commonsense, which faces a problem that unlimited knowledge will transmit noisy and misleading information. Towards the issue of introducing related knowledge, we propose a semantic-driven knowledge-aware QA framework, which controls the knowledge injection in a coarse-to-careful fashion. We devise a tailoring strategy to filter extracted knowledge under monitoring of the coarse semantic of question on the knowledge extraction stage. And we develop a semantic-aware knowledge fetching module that engages structural knowledge information and fuses proper knowledge according to the careful semantic of questions in a hierarchical way. Experiments demonstrate that the proposed approach promotes the performance on the CommonsenseQA dataset comparing with strong baselines.
    Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction. (arXiv:2105.09858v2 [cs.SD] UPDATED)
    (2 min) This paper presents a low-latency real-time (LLRT) non-parallel voice conversion (VC) framework based on cyclic variational autoencoder (CycleVAE) and multiband WaveRNN with data-driven linear prediction (MWDLP). CycleVAE is a robust non-parallel multispeaker spectral model, which utilizes a speaker-independent latent space and a speaker-dependent code to generate reconstructed/converted spectral features given the spectral features of an input speaker. On the other hand, MWDLP is an efficient and a high-quality neural vocoder that can handle multispeaker data and generate speech waveform for LLRT applications with CPU. To accommodate LLRT constraint with CPU, we propose a novel CycleVAE framework that utilizes mel-spectrogram as spectral features and is built with a sparse network architecture. Further, to improve the modeling performance, we also propose a novel fine-tuning procedure that refines the frame-rate CycleVAE network by utilizing the waveform loss from the MWDLP network. The experimental results demonstrate that the proposed framework achieves high-performance VC, while allowing for LLRT usage with a single-core of $2.1$--$2.7$ GHz CPU on a real-time factor of $0.87$--$0.95$, including input/output, feature extraction, on a frame shift of $10$ ms, a window length of $27.5$ ms, and $2$ lookup frames.
    Unified Interpretation of Softmax Cross-Entropy and Negative Sampling: With Case Study for Knowledge Graph Embedding. (arXiv:2106.07250v2 [cs.LG] UPDATED)
    (2 min) In knowledge graph embedding, the theoretical relationship between the softmax cross-entropy and negative sampling loss functions has not been investigated. This makes it difficult to fairly compare the results of the two different loss functions. We attempted to solve this problem by using the Bregman divergence to provide a unified interpretation of the softmax cross-entropy and negative sampling loss functions. Under this interpretation, we can derive theoretical findings for fair comparison. Experimental results on the FB15k-237 and WN18RR datasets show that the theoretical findings are valid in practical settings.
    End-to-end Neural Coreference Resolution Revisited: A Simple yet Effective Baseline. (arXiv:2107.01700v1 [cs.CL])
    (2 min) Since the first end-to-end neural coreference resolution model was introduced, many extensions to the model have been proposed, ranging from using higher-order inference to directly optimizing evaluation metrics using reinforcement learning. Despite improving the coreference resolution performance by a large margin, these extensions add a lot of extra complexity to the original model. Motivated by this observation and the recent advances in pre-trained Transformer language models, we propose a simple yet effective baseline for coreference resolution. Our model is a simplified version of the original neural coreference resolution model, however, it achieves impressive performance, outperforming all recent extended works on the public English OntoNotes benchmark. Our work provides evidence for the necessity of carefully justifying the complexity of existing or newly proposed models, as introducing a conceptual or practical simplification to an existing model can still yield competitive results.
    High-Fidelity and Low-Latency Universal Neural Vocoder based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling. (arXiv:2105.09856v2 [cs.SD] UPDATED)
    (2 min) This paper presents a novel high-fidelity and low-latency universal neural vocoder framework based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling (MWDLP). MWDLP employs a coarse-fine bit WaveRNN architecture for 10-bit mu-law waveform modeling. A sparse gated recurrent unit with a relatively large size of hidden units is utilized, while the multiband modeling is deployed to achieve real-time low-latency usage. A novel technique for data-driven linear prediction (LP) with discrete waveform modeling is proposed, where the LP coefficients are estimated in a data-driven manner. Moreover, a novel loss function using short-time Fourier transform (STFT) for discrete waveform modeling with Gumbel approximation is also proposed. The experimental results demonstrate that the proposed MWDLP framework generates high-fidelity synthetic speech for seen and unseen speakers and/or language on 300 speakers training data including clean and noisy/reverberant conditions, where the number of training utterances is limited to 60 per speaker, while allowing for real-time low-latency processing using a single core of $\sim\!$ 2.1--2.7 GHz CPU with $\sim\!$ 0.57--0.64 real-time factor including input/output and feature extraction.
    Polyphone Disambiguition in Mandarin Chinese with Semi-Supervised Learning. (arXiv:2102.00621v2 [cs.CL] UPDATED)
    (2 min) The majority of Chinese characters are monophonic, while a special group of characters, called polyphonic characters, have multiple pronunciations. As a prerequisite of performing speech-related generative tasks, the correct pronunciation must be identified among several candidates. This process is called Polyphone Disambiguation. Although the problem has been well explored with both knowledge-based and learning-based approaches, it remains challenging due to the lack of publicly available labeled datasets and the irregular nature of polyphone in Mandarin Chinese. In this paper, we propose a novel semi-supervised learning (SSL) framework for Mandarin Chinese polyphone disambiguation that can potentially leverage unlimited unlabeled text data. We explore the effect of various proxy labeling strategies including entropy-thresholding and lexicon-based labeling. Qualitative and quantitative experiments demonstrate that our method achieves state-of-the-art performance. In addition, we publish a novel dataset specifically for the polyphone disambiguation task to promote further researches.
    Re-Evaluating GermEval17 Using German Pre-Trained Language Models. (arXiv:2102.12330v2 [cs.CL] UPDATED)
    (2 min) The lack of a commonly used benchmark data set (collection) such as (Super-)GLUE (Wang et al., 2018, 2019) for the evaluation of non-English pre-trained language models is a severe shortcoming of current English-centric NLP-research. It concentrates a large part of the research on English, neglecting the uncertainty when transferring conclusions found for the English language to other languages. We evaluate the performance of the German and multilingual BERT-based models currently available via the huggingface transformers library on the four tasks of the GermEval17 workshop. We compare them to pre-BERT architectures (Wojatzki et al., 2017; Schmitt et al., 2018; Attia et al., 2018) as well as to an ELMo-based architecture (Biesialska et al., 2020) and a BERT-based approach (Guhr et al., 2020). The observed improvements are put in relation to those for similar tasks and similar models (pre-BERT vs. BERT-based) for the English language in order to draw tentative conclusions about whether the observed improvements are transferable to German or potentially other related languages.
    Speech SIMCLR: Combining Contrastive and Reconstruction Objective for Self-supervised Speech Representation Learning. (arXiv:2010.13991v2 [cs.CL] UPDATED)
    (2 min) Self-supervised visual pretraining has shown significant progress recently. Among those methods, SimCLR greatly advanced the state of the art in self-supervised and semi-supervised learning on ImageNet. The input feature representations for speech and visual tasks are both continuous, so it is natural to consider applying similar objective on speech representation learning. In this paper, we propose Speech SimCLR, a new self-supervised objective for speech representation learning. During training, Speech SimCLR applies augmentation on raw speech and its spectrogram. Its objective is the combination of contrastive loss that maximizes agreement between differently augmented samples in the latent space and reconstruction loss of input representation. The proposed method achieved competitive results on speech emotion recognition and speech recognition.
    MasakhaNER: Named Entity Recognition for African Languages. (arXiv:2103.11811v2 [cs.CL] UPDATED)
    (2 min) We take a step towards addressing the under-representation of the African continent in NLP research by creating the first large publicly available high-quality dataset for named entity recognition (NER) in ten African languages, bringing together a variety of stakeholders. We detail characteristics of the languages to help researchers understand the challenges that these languages pose for NER. We analyze our datasets and conduct an extensive empirical evaluation of state-of-the-art methods across both supervised and transfer learning settings. We release the data, code, and models in order to inspire future research on African NLP.
    DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling. (arXiv:2107.01875v1 [cs.SD])
    (2 min) Rap generation, which aims to produce lyrics and corresponding singing beats, needs to model both rhymes and rhythms. Previous works for rap generation focused on rhyming lyrics but ignored rhythmic beats, which are important for rap performance. In this paper, we develop DeepRapper, a Transformer-based rap generation system that can model both rhymes and rhythms. Since there is no available rap dataset with rhythmic beats, we develop a data mining pipeline to collect a large-scale rap dataset, which includes a large number of rap songs with aligned lyrics and rhythmic beats. Second, we design a Transformer-based autoregressive language model which carefully models rhymes and rhythms. Specifically, we generate lyrics in the reverse order with rhyme representation and constraint for rhyme enhancement and insert a beat symbol into lyrics for rhythm/beat modeling. To our knowledge, DeepRapper is the first system to generate rap with both rhymes and rhythms. Both objective and subjective evaluations demonstrate that DeepRapper generates creative and high-quality raps with rhymes and rhythms. Code will be released on GitHub.
    Vietnamese Complaint Detection on E-Commerce Websites. (arXiv:2104.11969v3 [cs.CL] UPDATED)
    (2 min) Customer product reviews play a role in improving the quality of products and services for business organizations or their brands. Complaining is an attitude that expresses dissatisfaction with an event or a product not meeting customer expectations. In this paper, we build a Open-domain Complaint Detection dataset (UIT-ViOCD), including 5,485 human-annotated reviews on four categories about product reviews on e-commerce sites. After the data collection phase, we proceed to the annotation task and achieve the inter-annotator agreement Am of 87%. Then, we present an extensive methodology for the research purposes and achieve 92.16% by F1-score for identifying complaints. With the results, in the future, we aim to build a system for open-domain complaint detection in E-commerce websites.
    Improved Ackermannian lower bound for the Petri nets reachability problem. (arXiv:2105.08551v2 [cs.FL] UPDATED)
    (2 min) Petri nets, equivalently presentable as vector addition systems with states, are an established model of concurrency with widespread applications. The reachability problem, where we ask whether from a given initial configuration there exists a sequence of valid execution steps reaching a given final configuration, is the central algorithmic problem for this model. The complexity of the problem has remained, until recently, one of the hardest open questions in verification of concurrent systems. A first upper bound has been provided only in 2015 by Leroux and Schmitz, then refined by the same authors to non-primitive recursive Ackermannian upper bound in 2019. The exponential space lower bound, shown by Lipton already in 1976, remained the only known for over 40 years until a breakthrough non-elementary lower bound by Czerwi{\'n}ski, Lasota, Lazic, Leroux and Mazowiecki in 2019. Finally, a matching Ackermannian lower bound announced this year by Czerwi{\'n}ski and Orlikowski, and independently by Leroux, established the complexity of the problem. Our contribution is an improvement of the former construction, making it conceptually simpler and more direct. On the way we improve the lower bound for vector addition systems with states in fixed dimension (or, equivalently, Petri nets with fixed number of places): while Czerwi{\'n}ski and Orlikowski prove $F_k$-hardness (hardness for $k$th level in Grzegorczyk Hierarchy) in dimension $6k$, and Leroux in dimension $4k+5$, our simplified construction yields $F_k$-hardness already in dimension $3k+2$.
    FFCI: A Framework for Interpretable Automatic Evaluation of Summarization. (arXiv:2011.13662v2 [cs.CL] UPDATED)
    (2 min) In this paper, we propose FFCI, a framework for fine-grained summarization evaluation that comprises four elements: faithfulness (degree of factual consistency with the source), focus (precision of summary content relative to the reference), coverage (recall of summary content relative to the reference), and inter-sentential coherence (document fluency between adjacent sentences). We construct a novel dataset for focus, coverage, and inter-sentential coherence, and develop automatic methods for evaluating each of the four dimensions of FFCI based on cross-comparison of evaluation metrics and model-based evaluation methods, including question answering (QA) approaches, STS, next-sentence prediction (NSP), and scores derived from 19 pre-trained language models. We then apply the developed metrics in evaluating a broad range of summarization models across two datasets, with some surprising findings.
    Towards One Model to Rule All: Multilingual Strategy for Dialectal Code-Switching Arabic ASR. (arXiv:2105.14779v2 [cs.CL] UPDATED)
    (2 min) With the advent of globalization, there is an increasing demand for multilingual automatic speech recognition (ASR), handling language and dialectal variation of spoken content. Recent studies show its efficacy over monolingual systems. In this study, we design a large multilingual end-to-end ASR using self-attention based conformer architecture. We trained the system using Arabic (Ar), English (En) and French (Fr) languages. We evaluate the system performance handling: (i) monolingual (Ar, En and Fr); (ii) multi-dialectal (Modern Standard Arabic, along with dialectal variation such as Egyptian and Moroccan); (iii) code-switching -- cross-lingual (Ar-En/Fr) and dialectal (MSA-Egyptian dialect) test cases, and compare with current state-of-the-art systems. Furthermore, we investigate the influence of different embedding/character representations including character vs word-piece; shared vs distinct input symbol per language. Our findings demonstrate the strength of such a model by outperforming state-of-the-art monolingual dialectal Arabic and code-switching Arabic ASR.
    Exploring Fluent Query Reformulations with Text-to-Text Transformers and Reinforcement Learning. (arXiv:2012.10033v2 [cs.CL] UPDATED)
    (2 min) Query reformulation aims to alter noisy or ambiguous text sequences into coherent ones closer to natural language questions. This is to prevent errors from propagating in a client-facing pipeline and promote better communication with users. Besides, it is crucial to maintain performance in downstream environments like question answering when rephrased queries are given as input. We show that under the previous framework (AQA), attempts to alter RL algorithms do not bring significant benefits to either reward acquisition or sequence fluency. Instead, we leverage a query-reformulating text-to-text transformer (QRT5) and apply policy-based RL algorithms to further nudge this reformulator and obtain better answers downstream by generating reward-acquiring query trajectories. QRT5 shows better sample efficiency in RL to achieve the same level of QA performance as the previous approach. It can generate reformulations with more readability based on query well-formedness evaluations and can generalize to out-of-sample data. Our framework is demonstrated to be flexible, allowing reward signals to be sourced from different downstream environments such as intent classification.
    A Survey of Data Augmentation Approaches for NLP. (arXiv:2105.03075v4 [cs.CL] UPDATED)
    (2 min) Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of existing literature in data augmentation for NLP and motivate additional work in this area. We also present a GitHub repository with a paper list that will be continuously updated at https://github.com/styfeng/DataAug4NLP
    Knodle: Modular Weakly Supervised Learning with PyTorch. (arXiv:2104.11557v3 [cs.LG] UPDATED)
    (2 min) Strategies for improving the training and prediction quality of weakly supervised machine learning models vary in how much they are tailored to a specific task or integrated with a specific model architecture. In this work, we introduce Knodle, a software framework that treats weak data annotations, deep learning models, and methods for improving weakly supervised training as separate, modular components. This modularization gives the training process access to fine-grained information such as data set characteristics, matches of heuristic rules, or elements of the deep learning model ultimately used for prediction. Hence, our framework can encompass a wide range of training methods for improving weak supervision, ranging from methods that only look at correlations of rules and output classes (independently of the machine learning model trained with the resulting labels), to those that harness the interplay of neural networks and weakly labeled data. We illustrate the benchmarking potential of the framework with a performance comparison of several reference implementations on a selection of datasets that are already available in Knodle. The framework is published as an open-source Python package knodle and available at https://github.com/knodle/knodle.
    The DCU-EPFL Enhanced Dependency Parser at the IWPT 2021 Shared Task. (arXiv:2107.01982v1 [cs.CL])
    (2 min) We describe the DCU-EPFL submission to the IWPT 2021 Shared Task on Parsing into Enhanced Universal Dependencies. The task involves parsing Enhanced UD graphs, which are an extension of the basic dependency trees designed to be more facilitative towards representing semantic structure. Evaluation is carried out on 29 treebanks in 17 languages and participants are required to parse the data from each language starting from raw strings. Our approach uses the Stanza pipeline to preprocess the text files, XLMRoBERTa to obtain contextualized token representations, and an edge-scoring and labeling model to predict the enhanced graph. Finally, we run a post-processing script to ensure all of our outputs are valid Enhanced UD graphs. Our system places 6th out of 9 participants with a coarse Enhanced Labeled Attachment Score (ELAS) of 83.57. We carry out additional post-deadline experiments which include using Trankit for pre-processing, XLM-RoBERTa-LARGE, treebank concatenation, and multitask learning between a basic and an enhanced dependency parser. All of these modifications improve our initial score and our final system has a coarse ELAS of 88.04.
    Thank you BART! Rewarding Pre-Trained Models Improves Formality Style Transfer. (arXiv:2105.06947v2 [cs.CL] UPDATED)
    (2 min) Scarcity of parallel data causes formality style transfer models to have scarce success in preserving content. We show that fine-tuning pre-trained language (GPT-2) and sequence-to-sequence (BART) models boosts content preservation, and that this is possible even with limited amounts of parallel data. Augmenting these models with rewards that target style and content -- the two core aspects of the task -- we achieve a new state-of-the-art.
    RiddleSense: Reasoning about Riddle Questions Featuring Linguistic Creativity and Commonsense Knowledge. (arXiv:2101.00376v2 [cs.CL] UPDATED)
    (2 min) Question: I have five fingers but I am not alive. What am I? Answer: a glove. Answering such a riddle-style question is a challenging cognitive process, in that it requires complex commonsense reasoning abilities, an understanding of figurative language, and counterfactual reasoning skills, which are all important abilities for advanced natural language understanding (NLU). However, there are currently no dedicated datasets aiming to test these abilities. Herein, we present RiddleSense, a new multiple-choice question answering task, which comes with the first large dataset (5.7k examples) for answering riddle-style commonsense questions. We systematically evaluate a wide range of models over the challenge, and point out that there is a large gap between the best-supervised model and human performance -- suggesting intriguing future research in the direction of higher-order commonsense reasoning and linguistic creativity towards building advanced NLU systems.
    BiERU: Bidirectional Emotional Recurrent Unit for Conversational Sentiment Analysis. (arXiv:2006.00492v3 [cs.CL] UPDATED)
    (2 min) Sentiment analysis in conversations has gained increasing attention in recent years for the growing amount of applications it can serve, e.g., sentiment analysis, recommender systems, and human-robot interaction. The main difference between conversational sentiment analysis and single sentence sentiment analysis is the existence of context information which may influence the sentiment of an utterance in a dialogue. How to effectively encode contextual information in dialogues, however, remains a challenge. Existing approaches employ complicated deep learning structures to distinguish different parties in a conversation and then model the context information. In this paper, we propose a fast, compact and parameter-efficient party-ignorant framework named bidirectional emotional recurrent unit for conversational sentiment analysis. In our system, a generalized neural tensor block followed by a two-channel classifier is designed to perform context compositionality and sentiment classification, respectively. Extensive experiments on three standard datasets demonstrate that our model outperforms the state of the art in most cases.
    Domain Adaptation for Sentiment Analysis Using Increased Intraclass Separation. (arXiv:2107.01598v1 [cs.CL])
    (2 min) Sentiment analysis is a costly yet necessary task for enterprises to study the opinions of their customers to improve their products and to determine optimal marketing strategies. Due to the existence of a wide range of domains across different products and services, cross-domain sentiment analysis methods have received significant attention. These methods mitigate the domain gap between different applications by training cross-domain generalizable classifiers which help to relax the need for data annotation for each domain. Most existing methods focus on learning domain-agnostic representations that are invariant with respect to both the source and the target domains. As a result, a classifier that is trained using the source domain annotated data would generalize well in a related target domain. We introduce a new domain adaptation method which induces large margins between different classes in an embedding space. This embedding space is trained to be domain-agnostic by matching the data distributions across the domains. Large intraclass margins in the source domain help to reduce the effect of "domain shift" on the classifier performance in the target domain. Theoretical and empirical analysis are provided to demonstrate that the proposed method is effective.
    A Survey of Knowledge-Enhanced Text Generation. (arXiv:2010.04389v2 [cs.CL] UPDATED)
    (2 min) The goal of text generation is to make machines express in human language. It is one of the most important yet challenging tasks in natural language processing (NLP). Since 2014, various neural encoder-decoder models pioneered by Seq2Seq have been proposed to achieve the goal by learning to map input text to output text. However, the input text alone often provides limited knowledge to generate the desired output, so the performance of text generation is still far from satisfaction in many real-world scenarios. To address this issue, researchers have considered incorporating various forms of knowledge beyond the input text into the generation models. This research direction is known as knowledge-enhanced text generation. In this survey, we present a comprehensive review of the research on knowledge enhanced text generation over the past five years. The main content includes two parts: (i) general methods and architectures for integrating knowledge into text generation; (ii) specific techniques and applications according to different forms of knowledge data. This survey can have broad audiences, researchers and practitioners, in academia and industry.
    Layer-Wise Cross-View Decoding for Sequence-to-Sequence Learning. (arXiv:2005.08081v5 [cs.CL] UPDATED)
    (2 min) In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder. While it is common practice to draw information from only the last encoder layer, recent work has proposed to use representations from different encoder layers for diversified levels of information. Nonetheless, the decoder still obtains only a single view of the source sequences, which might lead to insufficient training of the encoder layer stack due to the hierarchy bypassing problem. In this work, we propose layer-wise cross-view decoding, where for each decoder layer, together with the representations from the last encoder layer, which serve as a global view, those from other encoder layers are supplemented for a stereoscopic view of the source sequences. Systematic experiments show that we successfully address the hierarchy bypassing problem and substantially improve the performance of sequence-to-sequence learning with deep representations on diverse tasks.
    Contradiction Detection in Persian Text. (arXiv:2107.01987v1 [cs.CL])
    (2 min) Detection of semantic contradictory sentences is one of the most challenging and fundamental issues for NLP applications such as recognition of textual entailments. Contradiction in this study includes different types of semantic confrontation, such as conflict and antonymy. Due to lack of sufficient data to apply precise machine learning and specifically deep learning methods to Persian and other low resource languages, rule-based approaches that can function similarly to these systems will be of a great interest. Also recently, emergence of new methods such as transfer learning, has opened up the possibility of deep learning for low-resource languages. Considering two above points, in this study, along with a simple rule-base baseline, a novel rule-base system for identifying semantic contradiction along with a Bert base deep contradiction detection system for Persian texts have been introduced. The rule base system has used frequent rule mining method to extract appropriate contradiction rules using a development set. Extracted rules are tested for different categories of contradictory sentences. In this system the maximum f-measure among contradiction categories is obtained for negation about 90% and the average F-measure of system for all classes is about 76% which outperforms other algorithms on Persian texts. On the other hand, because of medium performance of rule base system for some categories of contradiction, we use a Bert base deep learning system using our translated dataset; with average F-measure of 73. Our hybrid system has f-measure of about 80.
    IITP at WAT 2021: System description for English-Hindi Multimodal Translation Task. (arXiv:2107.01656v1 [cs.CL])
    (2 min) Neural Machine Translation (NMT) is a predominant machine translation technology nowadays because of its end-to-end trainable flexibility. However, NMT still struggles to translate properly in low-resource settings specifically on distant language pairs. One way to overcome this is to use the information from other modalities if available. The idea is that despite differences in languages, both the source and target language speakers see the same thing and the visual representation of both the source and target is the same, which can positively assist the system. Multimodal information can help the NMT system to improve the translation by removing ambiguity on some phrases or words. We participate in the 8th Workshop on Asian Translation (WAT - 2021) for English-Hindi multimodal translation task and achieve 42.47 and 37.50 BLEU points for Evaluation and Challenge subset, respectively.
    CasEE: A Joint Learning Framework with Cascade Decoding for Overlapping Event Extraction. (arXiv:2107.01583v1 [cs.CL])
    (2 min) Event extraction (EE) is a crucial information extraction task that aims to extract event information in texts. Most existing methods assume that events appear in sentences without overlaps, which are not applicable to the complicated overlapping event extraction. This work systematically studies the realistic event overlapping problem, where a word may serve as triggers with several types or arguments with different roles. To tackle the above problem, we propose a novel joint learning framework with cascade decoding for overlapping event extraction, termed as CasEE. Particularly, CasEE sequentially performs type detection, trigger extraction and argument extraction, where the overlapped targets are extracted separately conditioned on the specific former prediction. All the subtasks are jointly learned in a framework to capture dependencies among the subtasks. The evaluation on a public event extraction benchmark FewFC demonstrates that CasEE achieves significant improvements on overlapping event extraction over previous competitive methods.
    Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition. (arXiv:2107.01569v1 [cs.CL])
    (2 min) We propose a cross-modal transformer-based neural correction models that refines the output of an automatic speech recognition (ASR) system so as to exclude ASR errors. Generally, neural correction models are composed of encoder-decoder networks, which can directly model sequence-to-sequence mapping problems. The most successful method is to use both input speech and its ASR output text as the input contexts for the encoder-decoder networks. However, the conventional method cannot take into account the relationships between these two different modal inputs because the input contexts are separately encoded for each modal. To effectively leverage the correlated information between the two different modal inputs, our proposed models encode two different contexts jointly on the basis of cross-modal self-attention using a transformer. We expect that cross-modal self-attention can effectively capture the relationships between two different modals for refining ASR hypotheses. We also introduce a shallow fusion technique to efficiently integrate the first-pass ASR model and our proposed neural correction model. Experiments on Japanese natural language ASR tasks demonstrated that our proposed models achieve better ASR performance than conventional neural correction models.
    Persian-WSD-Corpus: A Sense Annotated Corpus for Persian All-words Word Sense Disambiguation. (arXiv:2107.01540v1 [cs.CL])
    (2 min) Word Sense Disambiguation (WSD) is a long-standing task in Natural Language Processing(NLP) that aims to automatically identify the most relevant meaning of the words in a given context. Developing standard WSD test collections can be mentioned as an important prerequisite for developing and evaluating different WSD systems in the language of interest. Although many WSD test collections have been developed for a variety of languages, no standard All-words WSD benchmark is available for Persian. In this paper, we address this shortage for the Persian language by introducing SBU-WSD-Corpus, as the first standard test set for the Persian All-words WSD task. SBU-WSD-Corpus is manually annotated with senses from the Persian WordNet (FarsNet) sense inventory. To this end, three annotators used SAMP (a tool for sense annotation based on FarsNet lexical graph) to perform the annotation task. SBU-WSD-Corpus consists of 19 Persian documents in different domains such as Sports, Science, Arts, etc. It includes 5892 content words of Persian running text and 3371 manually sense annotated words (2073 nouns, 566 verbs, 610 adjectives, and 122 adverbs). Providing baselines for future studies on the Persian All-words WSD task, we evaluate several WSD models on SBU-WSD-Corpus. The corpus is publicly available at https://github.com/hrouhizadeh/SBU-WSD-Corpus.
    Can Transformers Jump Around Right in Natural Language? Assessing Performance Transfer from SCAN. (arXiv:2107.01366v1 [cs.CL])
    (2 min) Despite their practical success, modern seq2seq architectures are unable to generalize systematically on several SCAN tasks. Hence, it is not clear if SCAN-style compositional generalization is useful in realistic NLP tasks. In this work, we study the benefit that such compositionality brings about to several machine translation tasks. We present several focused modifications of Transformer that greatly improve generalization capabilities on SCAN and select one that remains on par with a vanilla Transformer on a standard machine translation (MT) task. Next, we study its performance in low-resource settings and on a newly introduced distribution-shifted English-French translation task. Overall, we find that improvements of a SCAN-capable model do not directly transfer to the resource-rich MT setup. In contrast, in the low-resource setup, general modifications lead to an improvement of up to 13.1% BLEU score w.r.t. a vanilla Transformer. Similarly, an improvement of 14% in an accuracy-based metric is achieved in the introduced compositional English-French translation task. This provides experimental evidence that the compositional generalization assessed in SCAN is particularly useful in resource-starved and domain-shifted scenarios.
    Audio-Oriented Multimodal Machine Comprehension: Task, Dataset and Model. (arXiv:2107.01571v1 [cs.CL])
    (2 min) While Machine Comprehension (MC) has attracted extensive research interests in recent years, existing approaches mainly belong to the category of Machine Reading Comprehension task which mines textual inputs (paragraphs and questions) to predict the answers (choices or text spans). However, there are a lot of MC tasks that accept audio input in addition to the textual input, e.g. English listening comprehension test. In this paper, we target the problem of Audio-Oriented Multimodal Machine Comprehension, and its goal is to answer questions based on the given audio and textual information. To solve this problem, we propose a Dynamic Inter- and Intra-modality Attention (DIIA) model to effectively fuse the two modalities (audio and textual). DIIA can work as an independent component and thus be easily integrated into existing MC models. Moreover, we further develop a Multimodal Knowledge Distillation (MKD) module to enable our multimodal MC model to accurately predict the answers based only on either the text or the audio. As a result, the proposed approach can handle various tasks including: Audio-Oriented Multimodal Machine Comprehension, Machine Reading Comprehension and Machine Listening Comprehension, in a single model, making fair comparisons possible between our model and the existing unimodal MC models. Experimental results and analysis prove the effectiveness of the proposed approaches. First, the proposed DIIA boosts the baseline models by up to 21.08% in terms of accuracy; Second, under the unimodal scenarios, the MKD module allows our multimodal MC model to significantly outperform the unimodal models by up to 18.87%, which are trained and tested with only audio or textual data.
    Neural-Symbolic Solver for Math Word Problems with Auxiliary Tasks. (arXiv:2107.01431v1 [cs.CL])
    (2 min) Previous math word problem solvers following the encoder-decoder paradigm fail to explicitly incorporate essential math symbolic constraints, leading to unexplainable and unreasonable predictions. Herein, we propose Neural-Symbolic Solver (NS-Solver) to explicitly and seamlessly incorporate different levels of symbolic constraints by auxiliary tasks. Our NS-Solver consists of a problem reader to encode problems, a programmer to generate symbolic equations, and a symbolic executor to obtain answers. Along with target expression supervision, our solver is also optimized via 4 new auxiliary objectives to enforce different symbolic reasoning: a) self-supervised number prediction task predicting both number quantity and number locations; b) commonsense constant prediction task predicting what prior knowledge (e.g. how many legs a chicken has) is required; c) program consistency checker computing the semantic loss between predicted equation and target equation to ensure reasonable equation mapping; d) duality exploiting task exploiting the quasi duality between symbolic equation generation and problem's part-of-speech generation to enhance the understanding ability of a solver. Besides, to provide a more realistic and challenging benchmark for developing a universal and scalable solver, we also construct a new large-scale MWP benchmark CM17K consisting of 4 kinds of MWPs (arithmetic, one-unknown linear, one-unknown non-linear, equation set) with more than 17K samples. Extensive experiments on Math23K and our CM17k demonstrate the superiority of our NS-Solver compared to state-of-the-art methods.
    Unified Autoregressive Modeling for Joint End-to-End Multi-Talker Overlapped Speech Recognition and Speaker Attribute Estimation. (arXiv:2107.01549v1 [cs.CL])
    (2 min) In this paper, we present a novel modeling method for single-channel multi-talker overlapped automatic speech recognition (ASR) systems. Fully neural network based end-to-end models have dramatically improved the performance of multi-taker overlapped ASR tasks. One promising approach for end-to-end modeling is autoregressive modeling with serialized output training in which transcriptions of multiple speakers are recursively generated one after another. This enables us to naturally capture relationships between speakers. However, the conventional modeling method cannot explicitly take into account the speaker attributes of individual utterances such as gender and age information. In fact, the performance deteriorates when each speaker is the same gender or is close in age. To address this problem, we propose unified autoregressive modeling for joint end-to-end multi-talker overlapped ASR and speaker attribute estimation. Our key idea is to handle gender and age estimation tasks within the unified autoregressive modeling. In the proposed method, transformer-based autoregressive model recursively generates not only textual tokens but also attribute tokens of each speaker. This enables us to effectively utilize speaker attributes for improving multi-talker overlapped ASR. Experiments on Japanese multi-talker overlapped ASR tasks demonstrate the effectiveness of the proposed method.
    Scarecrow: A Framework for Scrutinizing Machine Text. (arXiv:2107.01294v1 [cs.CL])
    (2 min) Modern neural text generation systems can produce remarkably fluent and grammatical texts. While earlier language models suffered from repetition and syntactic errors, the errors made by contemporary models are often semantic, narrative, or discourse failures. To facilitate research of these complex error types, we introduce a new structured, crowdsourced error annotation schema called Scarecrow. The error categories used in Scarecrow -- such as redundancy, commonsense errors, and incoherence -- were identified by combining expert analysis with several pilot rounds of ontology-free crowd annotation to arrive at a schema which covers the error phenomena found in real machine generated text. We use Scarecrow to collect 13k annotations of 1.3k human and machine generate paragraphs of English language news text, amounting to over 41k spans each labeled with its error category, severity, a natural language explanation, and antecedent span (where relevant). We collect annotations for text generated by state-of-the-art systems with varying known performance levels, from GPT-2 Small through the largest GPT-3. We isolate several factors for detailed analysis, including parameter count, training data, and decoding technique. Our results show both expected and surprising differences across these settings. These findings demonstrate the value of Scarecrow annotations in the assessment of current and future text generation systems. We release our complete annotation toolkit and dataset at https://yao-dou.github.io/scarecrow/.
    Arabic Code-Switching Speech Recognition using Monolingual Data. (arXiv:2107.01573v1 [cs.CL])
    (2 min) Code-switching in automatic speech recognition (ASR) is an important challenge due to globalization. Recent research in multilingual ASR shows potential improvement over monolingual systems. We study key issues related to multilingual modeling for ASR through a series of large-scale ASR experiments. Our innovative framework deploys a multi-graph approach in the weighted finite state transducers (WFST) framework. We compare our WFST decoding strategies with a transformer sequence to sequence system trained on the same data. Given a code-switching scenario between Arabic and English languages, our results show that the WFST decoding approaches were more suitable for the intersentential code-switching datasets. In addition, the transformer system performed better for intrasentential code-switching task. With this study, we release an artificially generated development and test sets, along with ecological code-switching test set, to benchmark the ASR performance.
    Towards Neural Diarization for Unlimited Numbers of Speakers Using Global and Local Attractors. (arXiv:2107.01545v1 [eess.AS])
    (2 min) Attractor-based end-to-end diarization is achieving comparable accuracy to the carefully tuned conventional clustering-based methods on challenging datasets. However, the main drawback is that it cannot deal with the case where the number of speakers is larger than the one observed during training. This is because its speaker counting relies on supervised learning. In this work, we introduce an unsupervised clustering process embedded in the attractor-based end-to-end diarization. We first split a sequence of frame-wise embeddings into short subsequences and then perform attractor-based diarization for each subsequence. Given subsequence-wise diarization results, inter-subsequence speaker correspondence is obtained by unsupervised clustering of the vectors computed from the attractors from all the subsequences. This makes it possible to produce diarization results of a large number of speakers for the whole recording even if the number of output speakers for each subsequence is limited. Experimental results showed that our method could produce accurate diarization results of an unseen number of speakers. Our method achieved 11.84 %, 28.33 %, and 19.49 % on the CALLHOME, DIHARD II, and DIHARD III datasets, respectively, each of which is better than the conventional end-to-end diarization methods.
    Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition. (arXiv:2107.01275v1 [eess.AS])
    (2 min) Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. Addressing overconfidence in such models, in this paper we introduce the concept of relaxed attention, which is a simple gradual injection of a uniform distribution to the encoder-decoder attention weights during training that is easily implemented with two lines of code. We investigate the effect of relaxed attention across different AED model architectures and two prominent ASR tasks, Wall Street Journal (WSJ) and Librispeech. We found that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models. On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming state of the art (4.20%) by 13.1% relative, while introducing only a single hyperparameter. Upon acceptance, models will be published on github.
  • cs.CV updates on arXiv.org

    Endo-Depth-and-Motion: Reconstruction and Tracking in Endoscopic Videos using Depth Networks and Photometric Constraints. (arXiv:2103.16525v2 [cs.CV] UPDATED)
    (2 min) Estimating a scene reconstruction and the camera motion from in-body videos is challenging due to several factors, e.g. the deformation of in-body cavities or the lack of texture. In this paper we present Endo-Depth-and-Motion, a pipeline that estimates the 6-degrees-of-freedom camera pose and dense 3D scene models from monocular endoscopic videos. Our approach leverages recent advances in self-supervised depth networks to generate pseudo-RGBD frames, then tracks the camera pose using photometric residuals and fuses the registered depth maps in a volumetric representation. We present an extensive experimental evaluation in the public dataset Hamlyn, showing high-quality results and comparisons against relevant baselines. We also release all models and code for future comparisons.
    A Deep Learning Object Detection Method for an Efficient Clusters Initialization. (arXiv:2104.13634v3 [cs.CV] UPDATED)
    (2 min) Clustering is an unsupervised machine learning method grouping data samples into clusters of similar objects. In practice, clustering has been used in numerous applications such as banking customers profiling, document retrieval, image segmentation, and e-commerce recommendation engines. However, the existing clustering techniques present significant limitations, from which is the dependability of their stability on the initialization parameters (e.g. number of clusters, centroids). Different solutions were presented in the literature to overcome this limitation (i.e. internal and external validation metrics). However, these solutions require high computational complexity and memory consumption, especially when dealing with big data. In this paper, we apply the recent object detection Deep Learning (DL) model, named YOLO-v5, to detect the initial clustering parameters such as the number of clusters with their sizes and centroids. Mainly, the proposed solution consists of adding a DL-based initialization phase making the clustering algorithms free of initialization. Two model solutions are provided in this work, one for isolated clusters and the other one for overlapping clusters. The features of the incoming dataset determine which model to use. Moreover, The results show that the proposed solution can provide near-optimal clusters initialization parameters with low computational and resources overhead compared to existing solutions.
    Unsupervised Domain Adaptation of Object Detectors: A Survey. (arXiv:2105.13502v2 [cs.CV] UPDATED)
    (2 min) Recent advances in deep learning have led to the development of accurate and efficient models for various computer vision applications such as classification, segmentation, and detection. However, learning highly accurate models relies on the availability of large-scale annotated datasets. Due to this, model performance drops drastically when evaluated on label-scarce datasets having visually distinct images, termed as domain adaptation problem. There is a plethora of works to adapt classification and segmentation models to label-scarce target datasets through unsupervised domain adaptation. Considering that detection is a fundamental task in computer vision, many recent works have focused on developing novel domain adaptive detection techniques. Here, we describe in detail the domain adaptation problem for detection and present an extensive survey of the various methods. Furthermore, we highlight strategies proposed and the associated shortcomings. Subsequently, we identify multiple aspects of the problem that are most promising for future research. We believe that this survey shall be valuable to the pattern recognition experts working in the fields of computer vision, biometrics, medical imaging, and autonomous navigation by introducing them to the problem, and familiarizing them with the current status of the progress while providing promising directions for future research.
    PixSet : An Opportunity for 3D Computer Vision to Go Beyond Point Clouds With a Full-Waveform LiDAR Dataset. (arXiv:2102.12010v2 [cs.RO] UPDATED)
    (2 min) Leddar PixSet is a new publicly available dataset (dataset.leddartech.com) for autonomous driving research and development. One key novelty of this dataset is the presence of full-waveform data from the Leddar Pixell sensor, a solid-state flash LiDAR. Full-waveform data has been shown to improve the performance of perception algorithms in airborne applications but is yet to be demonstrated for terrestrial applications such as autonomous driving. The PixSet dataset contains approximately 29k frames from 97 sequences recorded in high-density urban areas, using a set of various sensors (cameras, LiDARs, radar, IMU, etc.) Each frame has been manually annotated with 3D bounding boxes.
    A Self-Supervised Gait Encoding Approach with Locality-Awareness for 3D Skeleton Based Person Re-Identification. (arXiv:2009.03671v3 [cs.CV] UPDATED)
    (2 min) Person re-identification (Re-ID) via gait features within 3D skeleton sequences is a newly-emerging topic with several advantages. Existing solutions either rely on hand-crafted descriptors or supervised gait representation learning. This paper proposes a self-supervised gait encoding approach that can leverage unlabeled skeleton data to learn gait representations for person Re-ID. Specifically, we first create self-supervision by learning to reconstruct unlabeled skeleton sequences reversely, which involves richer high-level semantics to obtain better gait representations. Other pretext tasks are also explored to further improve self-supervised learning. Second, inspired by the fact that motion's continuity endows adjacent skeletons in one skeleton sequence and temporally consecutive skeleton sequences with higher correlations (referred as locality in 3D skeleton data), we propose a locality-aware attention mechanism and a locality-aware contrastive learning scheme, which aim to preserve locality-awareness on intra-sequence level and inter-sequence level respectively during self-supervised learning. Last, with context vectors learned by our locality-aware attention mechanism and contrastive learning scheme, a novel feature named Constrastive Attention-based Gait Encodings (CAGEs) is designed to represent gait effectively. Empirical evaluations show that our approach significantly outperforms skeleton-based counterparts by 15-40% Rank-1 accuracy, and it even achieves superior performance to numerous multi-modal methods with extra RGB or depth information. Our codes are available at https://github.com/Kali-Hac/Locality-Awareness-SGE.
    Part2Word: Learning Joint Embedding of Point Clouds and Text by Matching Parts to Words. (arXiv:2107.01872v1 [cs.CV])
    (2 min) It is important to learn joint embedding for 3D shapes and text in different shape understanding tasks, such as shape-text matching, retrieval, and shape captioning. Current multi-view based methods learn a mapping from multiple rendered views to text. However, these methods can not analyze 3D shapes well due to the self-occlusion and limitation of learning manifolds. To resolve this issue, we propose a method to learn joint embedding of point clouds and text by matching parts from shapes to words from sentences in a common space. Specifically, we first learn segmentation prior to segment point clouds into parts. Then, we map parts and words into an optimized space, where the parts and words can be matched with each other. In the optimized space, we represent a part by aggregating features of all points within the part, while representing each word with its context information, where we train our network to minimize the triplet ranking loss. Moreover, we also introduce cross-modal attention to capture the relationship of part-word in this matching procedure, which enhances joint embedding learning. Our experimental results outperform the state-of-the-art in multi-modal retrieval under the widely used benchmark.
    Least Squares Normalized Cross Correlation. (arXiv:1810.04320v2 [cs.CV] UPDATED)
    (2 min) Direct methods are widely used for alignment of models to images, due to their accuracy, since they minimize errors in the domain of measurement noise. They have leveraged least squares minimizations, for simple, efficient, variational optimization, since the seminal 1981 work of Lucas & Kanade, and normalized cross correlation (NCC), for robustness to intensity variations, since at least 1972. Despite the complementary benefits of these two well known methods, they have not been effectively combined to address local variations in intensity. Many ad-hoc NCC frameworks, sub-optimal least squares methods and image transformation approaches have thus been proposed instead, each with their own limitations. This work shows that a least squares optimization of NCC without approximation is not only possible, but straightforward and efficient. A robust, locally normalized formulation is introduced to mitigate local intensity variations and partial occlusions. Finally, sparse features with oriented patches are proposed for further efficiency. The resulting framework is simple to implement, computationally efficient and robust to local intensity variations. It is evaluated on the image alignment problem, showing improvements in both convergence rate and computation time over existing lighting invariant methods.
    OPA: Object Placement Assessment Dataset. (arXiv:2107.01889v1 [cs.CV])
    (2 min) Image composition aims to generate realistic composite image by inserting an object from one image into another background image, where the placement (e.g., location, size, occlusion) of inserted object may be unreasonable, which would significantly degrade the quality of the composite image. Although some works attempted to learn object placement to create realistic composite images, they did not focus on assessing the plausibility of object placement. In this paper, we focus on object placement assessment task, which verifies whether a composite image is plausible in terms of the object placement. To accomplish this task, we construct the first Object Placement Assessment (OPA) dataset consisting of composite images and their rationality labels. Dataset is available at https://github.com/bcmi/Object-Placement-Assessment-Dataset-OPA.
    Cognitive Visual Commonsense Reasoning Using Dynamic Working Memory. (arXiv:2107.01671v1 [cs.CV])
    (2 min) Visual Commonsense Reasoning (VCR) predicts an answer with corresponding rationale, given a question-image input. VCR is a recently introduced visual scene understanding task with a wide range of applications, including visual question answering, automated vehicle systems, and clinical decision support. Previous approaches to solving the VCR task generally rely on pre-training or exploiting memory with long dependency relationship encoded models. However, these approaches suffer from a lack of generalizability and prior knowledge. In this paper we propose a dynamic working memory based cognitive VCR network, which stores accumulated commonsense between sentences to provide prior knowledge for inference. Extensive experiments show that the proposed model yields significant improvements over existing methods on the benchmark VCR dataset. Moreover, the proposed model provides intuitive interpretation into visual commonsense reasoning. A Python implementation of our mechanism is publicly available at https://github.com/tanjatang/DMVCR
    Styleformer: Transformer based Generative Adversarial Networks with Style Vector. (arXiv:2106.07023v2 [cs.CV] UPDATED)
    (2 min) We propose Styleformer, which is a style-based generator for GAN architecture, but a convolution-free transformer-based generator. In our paper, we explain how a transformer can generate high-quality images, overcoming the disadvantage that convolution operations are difficult to capture global features in an image. Furthermore, we change the demodulation of StyleGAN2 and modify the existing transformer structure (e.g., residual connection, layer normalization) to create a strong style-based generator with a convolution-free structure. We also make Styleformer lighter by applying Linformer, enabling Styleformer to generate higher resolution images and result in improvements in terms of speed and memory. We experiment with the low-resolution image dataset such as CIFAR-10, as well as the high-resolution image dataset like LSUN-church. Styleformer records FID 2.82 and IS 9.94 on CIFAR-10, a benchmark dataset, which is comparable performance to the current state-of-the-art and outperforms all GAN-based generative models, including StyleGAN2-ADA with fewer parameters on the unconditional setting. We also both achieve new state-of-the-art with FID 15.17, IS 11.01, and FID 3.66, respectively on STL-10 and CelebA. We release our code at https://github.com/Jeeseung-Park/Styleformer.
    Unsupervised Activity Segmentation by Joint Representation Learning and Online Clustering. (arXiv:2105.13353v3 [cs.CV] UPDATED)
    (2 min) We present a novel approach for unsupervised activity segmentation, which uses video frame clustering as a pretext task and simultaneously performs representation learning and online clustering. This is in contrast with prior works where representation learning and clustering are often performed sequentially. We leverage temporal information in videos by employing temporal optimal transport and temporal coherence loss. In particular, we incorporate a temporal regularization term into the standard optimal transport module, which preserves the temporal order of the activity, yielding the temporal optimal transport module for computing pseudo-label cluster assignments. Next, the temporal coherence loss encourages neighboring video frames to be mapped to nearby points while distant video frames are mapped to farther away points in the embedding space. The combination of these two components results in effective representations for unsupervised activity segmentation. Furthermore, previous methods require storing learned features for the entire dataset before clustering them in an offline manner, whereas our approach processes one mini-batch at a time in an online manner. Extensive evaluations on three public datasets, i.e. 50-Salads, YouTube Instructions, and Breakfast, and our dataset, i.e., Desktop Assembly, show that our approach performs on par or better than previous methods for unsupervised activity segmentation, despite having significantly less memory constraints.
    A More Compact Object Detector Head Network with Feature Enhancement and Relational Reasoning. (arXiv:2106.14475v2 [cs.CV] UPDATED)
    (2 min) Modeling implicit feature interaction patterns is of significant importance to object detection tasks. However, in the two-stage detectors, due to the excessive use of hand-crafted components, it is very difficult to reason about the implicit relationship of the instance features. To tackle this problem, we analyze three different levels of feature interaction relationships, namely, the dependency relationship between the cropped local features and global features, the feature autocorrelation within the instance, and the cross-correlation relationship between the instances. To this end, we propose a more compact object detector head network (CODH), which can not only preserve global context information and condense the information density, but also allows instance-wise feature enhancement and relational reasoning in a larger matrix space. Without bells and whistles, our method can effectively improve the detection performance while significantly reducing the parameters of the model, e.g., with our method, the parameters of the head network is 0.6 times smaller than the state-of-the-art Cascade R-CNN, yet the performance boost is 1.3% on COCO test-dev. Without losing generality, we can also build a more lighter head network for other multi-stage detectors by assembling our method.
    Structure by Architecture: Disentangled Representations without Regularization. (arXiv:2006.07796v3 [cs.LG] UPDATED)
    (2 min) We study the problem of self-supervised structured representation learning using autoencoders for generative modeling. Unlike most methods which rely on matching an arbitrary, relatively unstructured, prior distribution for sampling, we propose a sampling technique that relies solely on the independence of latent variables, thereby avoiding the trade-off between reconstruction quality and generative performance inherent to VAEs. We design a novel autoencoder architecture capable of learning a structured representation without the need for aggressive regularization. Our structural decoders learn a hierarchy of latent variables, akin to structural causal models, thereby ordering the information without any additional regularization. We demonstrate how these models learn a representation that improves results in a variety of downstream tasks including generation, disentanglement, and extrapolation using several challenging and natural image datasets.
    Noise Sensitivity-Based Energy Efficient and Robust Adversary Detection in Neural Networks. (arXiv:2101.01543v2 [cs.CV] UPDATED)
    (2 min) Neural networks have achieved remarkable performance in computer vision, however they are vulnerable to adversarial examples. Adversarial examples are inputs that have been carefully perturbed to fool classifier networks, while appearing unchanged to humans. Based on prior works on detecting adversaries, we propose a structured methodology of augmenting a deep neural network (DNN) with a detector subnetwork. We use $\textit{Adversarial Noise Sensitivity}$ (ANS), a novel metric for measuring the adversarial gradient contribution of different intermediate layers of a network. Based on the ANS value, we append a detector to the most sensitive layer. In prior works, more complex detectors were added to a DNN, increasing the inference computational cost of the model. In contrast, our structured and strategic addition of a detector to a DNN reduces the complexity of the model while making the overall network adversarially resilient. Through comprehensive white-box and black-box experiments on MNIST, CIFAR-10, and CIFAR-100, we show that our method improves state-of-the-art detector robustness against adversarial examples. Furthermore, we validate the energy efficiency of our proposed adversarial detection methodology through an extensive energy analysis on various hardware scalable CMOS accelerator platforms. We also demonstrate the effects of quantization on our detector-appended networks.
    GraphXCOVID: Explainable Deep Graph Diffusion Pseudo-Labelling for Identifying COVID-19 on Chest X-rays. (arXiv:2010.00378v2 [cs.LG] UPDATED)
    (3 min) Can one learn to diagnose COVID-19 under extreme minimal supervision? Since the outbreak of the novel COVID-19 there has been a rush for developing Artificial Intelligence techniques for expert-level disease identification on Chest X-ray data. In particular, the use of deep supervised learning has become the go-to paradigm. However, the performance of such models is heavily dependent on the availability of a large and representative labelled dataset. The creation of which is a heavily expensive and time consuming task, and especially imposes a great challenge for a novel disease. Semi-supervised learning has shown the ability to match the incredible performance of supervised models whilst requiring a small fraction of the labelled examples. This makes the semi-supervised paradigm an attractive option for identifying COVID-19. In this work, we introduce a graph based deep semi-supervised framework for classifying COVID-19 from chest X-rays. Our framework introduces an optimisation model for graph diffusion that reinforces the natural relation among the tiny labelled set and the vast unlabelled data. We then connect the diffusion prediction output as pseudo-labels that are used in an iterative scheme in a deep net. We demonstrate, through our experiments, that our model is able to outperform the current leading supervised model with a tiny fraction of the labelled examples. Finally, we provide attention maps to accommodate the radiologist's mental model, better fitting their perceptual and cognitive abilities. These visualisation aims to assist the radiologist in judging whether the diagnostic is correct or not, and in consequence to accelerate the decision.
    FINT: Field-aware INTeraction Neural Network For CTR Prediction. (arXiv:2107.01999v1 [cs.IR])
    (2 min) As a critical component for online advertising and marking, click-through rate (CTR) prediction has draw lots of attentions from both industry and academia field. Recently, the deep learning has become the mainstream methodological choice for CTR. Despite of sustainable efforts have been made, existing approaches still pose several challenges. On the one hand, high-order interaction between the features is under-explored. On the other hand, high-order interactions may neglect the semantic information from the low-order fields. In this paper, we proposed a novel prediction method, named FINT, that employs the Field-aware INTeraction layer which captures high-order feature interactions while retaining the low-order field information. To empirically investigate the effectiveness and robustness of the FINT, we perform extensive experiments on the three realistic databases: KDD2012, Criteo and Avazu. The obtained results demonstrate that the FINT can significantly improve the performance compared to the existing methods, without increasing the amount of computation required. Moreover, the proposed method brought about 2.72\% increase to the advertising revenue of a big online video app through A/B testing. To better promote the research in CTR field, we will release our code as well as reference implementation of those baseline models in the final version.
    Pulmonary Vessel Segmentation based on Orthogonal Fused U-Net++ of Chest CT Images. (arXiv:2107.01502v1 [eess.IV])
    (2 min) Pulmonary vessel segmentation is important for clinical diagnosis of pulmonary diseases, while is also challenging due to the complicated structure. In this work, we present an effective framework and refinement process of pulmonary vessel segmentation from chest computed tomographic (CT) images. The key to our approach is a 2.5D segmentation network applied from three orthogonal axes, which presents a robust and fully automated pulmonary vessel segmentation result with lower network complexity and memory usage compared to 3D networks. The slice radius is introduced to convolve the adjacent information of the center slice and the multi-planar fusion optimizes the presentation of intra- and inter- slice features. Besides, the tree-like structure of the pulmonary vessel is extracted in the post-processing process, which is used for segmentation refining and pruning. In the evaluation experiments, three fusion methods are tested and the most promising one is compared with the state-of-the-art 2D and 3D structures on 300 cases of lung images randomly selected from LIDC dataset. Our method outperforms other network structures by a large margin and achieves by far the highest average DICE score of 0.9272 and precision of 0.9310, as per our knowledge from the pulmonary vessel segmentation models available in the literature.
    Custom Deep Neural Network for 3D Covid Chest CT-scan Classification. (arXiv:2107.01456v1 [eess.IV])
    (2 min) 3D CT-scan base on chest is one of the controversial topisc of the researcher nowadays. There are many tasks to diagnose the disease through CT-scan images, include Covid19. In this paper, we propose a method that custom and combine Deep Neural Network to classify the series of 3D CT-scans chest images. In our methods, we experiment with 2 backbones is DenseNet 121 and ResNet 101. In this proposal, we separate the experiment into 2 tasks, one is for 2 backbones combination of ResNet and DenseNet, one is for DenseNet backbones combination.
    Multi-View Correlation Distillation for Incremental Object Detection. (arXiv:2107.01787v1 [cs.CV])
    (2 min) In real applications, new object classes often emerge after the detection model has been trained on a prepared dataset with fixed classes. Due to the storage burden and the privacy of old data, sometimes it is impractical to train the model from scratch with both old and new data. Fine-tuning the old model with only new data will lead to a well-known phenomenon of catastrophic forgetting, which severely degrades the performance of modern object detectors. In this paper, we propose a novel \textbf{M}ulti-\textbf{V}iew \textbf{C}orrelation \textbf{D}istillation (MVCD) based incremental object detection method, which explores the correlations in the feature space of the two-stage object detector (Faster R-CNN). To better transfer the knowledge learned from the old classes and maintain the ability to learn new classes, we design correlation distillation losses from channel-wise, point-wise and instance-wise views to regularize the learning of the incremental model. A new metric named Stability-Plasticity-mAP is proposed to better evaluate both the stability for old classes and the plasticity for new classes in incremental object detection. The extensive experiments conducted on VOC2007 and COCO demonstrate that MVCD can effectively learn to detect objects of new classes and mitigate the problem of catastrophic forgetting.
    Towards Adversarial Patch Analysis and Certified Defense against Crowd Counting. (arXiv:2104.10868v2 [cs.CV] UPDATED)
    (3 min) Crowd counting has drawn much attention due to its importance in safety-critical surveillance systems. Especially, deep neural network (DNN) methods have significantly reduced estimation errors for crowd counting missions. Recent studies have demonstrated that DNNs are vulnerable to adversarial attacks, i.e., normal images with human-imperceptible perturbations could mislead DNNs to make false predictions. In this work, we propose a robust attack strategy called Adversarial Patch Attack with Momentum (APAM) to systematically evaluate the robustness of crowd counting models, where the attacker's goal is to create an adversarial perturbation that severely degrades their performances, thus leading to public safety accidents (e.g., stampede accidents). Especially, the proposed attack leverages the extreme-density background information of input images to generate robust adversarial patches via a series of transformations (e.g., interpolation, rotation, etc.). We observe that by perturbing less than 6\% of image pixels, our attacks severely degrade the performance of crowd counting systems, both digitally and physically. To better enhance the adversarial robustness of crowd counting models, we propose the first regression model-based Randomized Ablation (RA), which is more sufficient than Adversarial Training (ADT) (Mean Absolute Error of RA is 5 lower than ADT on clean samples and 30 lower than ADT on adversarial examples). Extensive experiments on five crowd counting models demonstrate the effectiveness and generality of the proposed method. Code is available at \url{https://github.com/harrywuhust2022/Adv-Crowd-analysis}.
    Enhancing Geometric Factors in Model Learning and Inference for Object Detection and Instance Segmentation. (arXiv:2005.03572v4 [cs.CV] UPDATED)
    (3 min) Deep learning-based object detection and instance segmentation have achieved unprecedented progress. In this paper, we propose Complete-IoU (CIoU) loss and Cluster-NMS for enhancing geometric factors in both bounding box regression and Non-Maximum Suppression (NMS), leading to notable gains of average precision (AP) and average recall (AR), without the sacrifice of inference efficiency. In particular, we consider three geometric factors, i.e., overlap area, normalized central point distance and aspect ratio, which are crucial for measuring bounding box regression in object detection and instance segmentation. The three geometric factors are then incorporated into CIoU loss for better distinguishing difficult regression cases. The training of deep models using CIoU loss results in consistent AP and AR improvements in comparison to widely adopted $\ell_n$-norm loss and IoU-based loss. Furthermore, we propose Cluster-NMS, where NMS during inference is done by implicitly clustering detected boxes and usually requires less iterations. Cluster-NMS is very efficient due to its pure GPU implementation, and geometric factors can be incorporated to improve both AP and AR. In the experiments, CIoU loss and Cluster-NMS have been applied to state-of-the-art instance segmentation (e.g., YOLACT and BlendMask-RT), and object detection (e.g., YOLO v3, SSD and Faster R-CNN) models. Taking YOLACT on MS COCO as an example, our method achieves performance gains as +1.7 AP and +6.2 AR$_{100}$ for object detection, and +0.9 AP and +3.5 AR$_{100}$ for instance segmentation, with 27.1 FPS on one NVIDIA GTX 1080Ti GPU. All the source code and trained models are available at https://github.com/Zzh-tju/CIoU
    Visual Time Series Forecasting: An Image-driven Approach. (arXiv:2107.01273v1 [cs.CV])
    (2 min) In this work, we address time-series forecasting as a computer vision task. We capture input data as an image and train a model to produce the subsequent image. This approach results in predicting distributions as opposed to pointwise values. To assess the robustness and quality of our approach, we examine various datasets and multiple evaluation metrics. Our experiments show that our forecasting tool is effective for cyclic data but somewhat less for irregular data such as stock prices. Importantly, when using image-based evaluation metrics, we find our method to outperform various baselines, including ARIMA, and a numerical variation of our deep learning approach.
    Extended Few-Shot Learning: Exploiting Existing Resources for Novel Tasks. (arXiv:2012.07176v3 [cs.LG] UPDATED)
    (2 min) In many practical few-shot learning problems, even though labeled examples are scarce, there are abundant auxiliary datasets that potentially contain useful information. We propose the problem of extended few-shot learning to study these scenarios. We then introduce a framework to address the challenges of efficiently selecting and effectively using auxiliary data in few-shot image classification. Given a large auxiliary dataset and a notion of semantic similarity among classes, we automatically select pseudo shots, which are labeled examples from other classes related to the target task. We show that naive approaches, such as (1) modeling these additional examples the same as the target task examples or (2) using them to learn features via transfer learning, only increase accuracy by a modest amount. Instead, we propose a masking module that adjusts the features of auxiliary data to be more similar to those of the target classes. We show that this masking module performs better than naively modeling the support examples and transfer learning by 4.68 and 6.03 percentage points, respectively.
    A Unified Model for Fingerprint Authentication and Presentation Attack Detection. (arXiv:2104.03255v2 [cs.CV] UPDATED)
    (2 min) Typical fingerprint recognition systems are comprised of a spoof detection module and a subsequent recognition module, running one after the other. In this paper, we reformulate the workings of a typical fingerprint recognition system. In particular, we posit that both spoof detection and fingerprint recognition are correlated tasks. Therefore, rather than performing the two tasks separately, we propose a joint model for spoof detection and matching to simultaneously perform both tasks without compromising the accuracy of either task. We demonstrate the capability of our joint model to obtain an authentication accuracy (1:1 matching) of TAR = 100% @ FAR = 0.1% on the FVC 2006 DB2A dataset while achieving a spoof detection ACE of 1.44% on the LiveDet 2015 dataset, both maintaining the performance of stand-alone methods. In practice, this reduces the time and memory requirements of the fingerprint recognition system by 50% and 40%, respectively; a significant advantage for recognition systems running on resource-constrained devices and communication channels.
    Learning to Relate Depth and Semantics for Unsupervised Domain Adaptation. (arXiv:2105.07830v2 [cs.CV] UPDATED)
    (2 min) We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting. Semantic segmentation and monocular depth estimation are shown to be complementary tasks; in a multi-task learning setting, a proper encoding of their relationships can further improve performance on both tasks. Motivated by this observation, we propose a novel Cross-Task Relation Layer (CTRL), which encodes task dependencies between the semantic and depth predictions. To capture the cross-task relationships, we propose a neural network architecture that contains task-specific and cross-task refinement heads. Furthermore, we propose an Iterative Self-Learning (ISL) training scheme, which exploits semantic pseudo-labels to provide extra supervision on the target domain. We experimentally observe improvements in both tasks' performance because the complementary information present in these tasks is better captured. Specifically, we show that: (1) our approach improves performance on all tasks when they are complementary and mutually dependent; (2) the CTRL helps to improve both semantic segmentation and depth estimation tasks performance in the challenging UDA setting; (3) the proposed ISL training scheme further improves the semantic segmentation performance. The implementation is available at https://github.com/susaha/ctrl-uda.
    Full interpretable machine learning in 2D with inline coordinates. (arXiv:2106.07568v2 [cs.LG] UPDATED)
    (2 min) This paper proposed a new methodology for machine learning in 2-dimensional space (2-D ML) in inline coordinates. It is a full machine learning approach that does not require to deal with n-dimensional data in n-dimensional space. It allows discovering n-D patterns in 2-D space without loss of n-D information using graph representations of n-D data in 2-D. Specifically, it can be done with the inline based coordinates in different modifications, including static and dynamic ones. The classification and regression algorithms based on these inline coordinates were introduced. A successful case study based on a benchmark data demonstrated the feasibility of the approach. This approach helps to consolidate further a whole new area of full 2-D machine learning as a promising ML methodology. It has advantages of abilities to involve actively the end-users into the discovering of models and their justification. Another advantage is providing interpretable ML models.
    VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach. (arXiv:2010.02358v5 [cs.CV] UPDATED)
    (2 min) We introduce a novel approach for scanned document representation to perform field extraction. It allows the simultaneous encoding of the textual, visual and layout information in a 3-axis tensor used as an input to a segmentation model. We improve the recent Chargrid and Wordgrid \cite{chargrid} models in several ways, first by taking into account the visual modality, then by boosting its robustness in regards to small datasets while keeping the inference time low. Our approach is tested on public and private document-image datasets, showing higher performances compared to the recent state-of-the-art methods.
    COVID-19 detection using deep convolutional neural networks and binary-differential-algorithm-based feature selection on X-ray images. (arXiv:2104.07279v3 [eess.IV] UPDATED)
    (2 min) The new Coronavirus is spreading rapidly, and it has taken the lives of many people so far. The virus has destructive effects on the human lung, and early detection is very important. Deep Convolution neural networks are such powerful tools in classifying images. Therefore, in this paper, a hybrid approach based on a deep network is presented. Feature vectors were extracted by applying a deep convolution neural network on the images, and useful features were selected by the binary differential meta-heuristic algorithm. These optimized features were given to the SVM classifier. A database consisting of three categories of images such as COVID-19, pneumonia, and healthy included in 1092 X-ray samples was considered. The proposed method achieved an accuracy of 99.43%, a sensitivity of 99.16%, and a specificity of 99.57%. Our results demonstrate that the suggested approach is better than recent studies on COVID-19 detection with X-ray images.
    Repurposing GANs for One-shot Semantic Part Segmentation. (arXiv:2103.04379v5 [cs.CV] UPDATED)
    (2 min) While GANs have shown success in realistic image generation, the idea of using GANs for other tasks unrelated to synthesis is underexplored. Do GANs learn meaningful structural parts of objects during their attempt to reproduce those objects? In this work, we test this hypothesis and propose a simple and effective approach based on GANs for semantic part segmentation that requires as few as one label example along with an unlabeled dataset. Our key idea is to leverage a trained GAN to extract pixel-wise representation from the input image and use it as feature vectors for a segmentation network. Our experiments demonstrate that GANs representation is "readily discriminative" and produces surprisingly good results that are comparable to those from supervised baselines trained with significantly more labels. We believe this novel repurposing of GANs underlies a new class of unsupervised representation learning that is applicable to many other tasks. More results are available at https://repurposegans.github.io/.
    Unknown Presentation Attack Detection against Rational Attackers. (arXiv:2010.01592v2 [cs.CV] UPDATED)
    (2 min) Despite the impressive progress in the field of presentation attack detection and multimedia forensics over the last decade, these systems are still vulnerable to attacks in real-life settings. Some of the challenges for existing solutions are the detection of unknown attacks, the ability to perform in adversarial settings, few-shot learning, and explainability. In this study, these limitations are approached by reliance on a game-theoretic view for modeling the interactions between the attacker and the detector. Consequently, a new optimization criterion is proposed and a set of requirements are defined for improving the performance of these systems in real-life settings. Furthermore, a novel detection technique is proposed using generator-based feature sets that are not biased towards any specific attack species. To further optimize the performance on known attacks, a new loss function coined categorical margin maximization loss (C-marmax) is proposed which gradually improves the performance against the most powerful attack. The proposed approach provides a more balanced performance across known and unknown attacks and achieves state-of-the-art performance in known and unknown attack detection cases against rational attackers. Lastly, the few-shot learning potential of the proposed approach is studied as well as its ability to provide pixel-level explainability.
    Body Meshes as Points. (arXiv:2105.02467v2 [cs.CV] UPDATED)
    (2 min) We consider the challenging multi-person 3D body mesh estimation task in this work. Existing methods are mostly two-stage based--one stage for person localization and the other stage for individual body mesh estimation, leading to redundant pipelines with high computation cost and degraded performance for complex scenes (e.g., occluded person instances). In this work, we present a single-stage model, Body Meshes as Points (BMP), to simplify the pipeline and lift both efficiency and performance. In particular, BMP adopts a new method that represents multiple person instances as points in the spatial-depth space where each point is associated with one body mesh. Hinging on such representations, BMP can directly predict body meshes for multiple persons in a single stage by concurrently localizing person instance points and estimating the corresponding body meshes. To better reason about depth ordering of all the persons within the same scene, BMP designs a simple yet effective inter-instance ordinal depth loss to obtain depth-coherent body mesh estimation. BMP also introduces a novel keypoint-aware augmentation to enhance model robustness to occluded person instances. Comprehensive experiments on benchmarks Panoptic, MuPoTS-3D and 3DPW clearly demonstrate the state-of-the-art efficiency of BMP for multi-person body mesh estimation, together with outstanding accuracy. Code can be found at: https://github.com/jfzhang95/BMP.
    Sensor-invariant Fingerprint ROI Segmentation Using Recurrent Adversarial Learning. (arXiv:2107.01361v1 [cs.CV])
    (2 min) A fingerprint region of interest (roi) segmentation algorithm is designed to separate the foreground fingerprint from the background noise. All the learning based state-of-the-art fingerprint roi segmentation algorithms proposed in the literature are benchmarked on scenarios when both training and testing databases consist of fingerprint images acquired from the same sensors. However, when testing is conducted on a different sensor, the segmentation performance obtained is often unsatisfactory. As a result, every time a new fingerprint sensor is used for testing, the fingerprint roi segmentation model needs to be re-trained with the fingerprint image acquired from the new sensor and its corresponding manually marked ROI. Manually marking fingerprint ROI is expensive because firstly, it is time consuming and more importantly, requires domain expertise. In order to save the human effort in generating annotations required by state-of-the-art, we propose a fingerprint roi segmentation model which aligns the features of fingerprint images derived from the unseen sensor such that they are similar to the ones obtained from the fingerprints whose ground truth roi masks are available for training. Specifically, we propose a recurrent adversarial learning based feature alignment network that helps the fingerprint roi segmentation model to learn sensor-invariant features. Consequently, sensor-invariant features learnt by the proposed roi segmentation model help it to achieve improved segmentation performance on fingerprints acquired from the new sensor. Experiments on publicly available FVC databases demonstrate the efficacy of the proposed work.
    Transformer in Transformer. (arXiv:2103.00112v2 [cs.CV] UPDATED)
    (2 min) Transformer is a new kind of neural architecture which encodes the input data as powerful features via the attention mechanism. Basically, the visual transformers first divide the input images into several local patches and then calculate both representations and their relationship. Since natural images are of high complexity with abundant detail and color information, the granularity of the patch dividing is not fine enough for excavating features of objects in different scales and locations. In this paper, we point out that the attention inside these local patches are also essential for building visual transformers with high performance and we explore a new architecture, namely, Transformer iN Transformer (TNT). Specifically, we regard the local patches (e.g., 16$\times$16) as "visual sentences" and present to further divide them into smaller patches (e.g., 4$\times$4) as "visual words". The attention of each word will be calculated with other words in the given visual sentence with negligible computational costs. Features of both words and sentences will be aggregated to enhance the representation ability. Experiments on several benchmarks demonstrate the effectiveness of the proposed TNT architecture, e.g., we achieve an $81.5%$ top-1 accuracy on the ImageNet, which is about $1.7%$ higher than that of the state-of-the-art visual transformer with similar computational cost. The PyTorch code is available at https://github.com/huawei-noah/CV-Backbones/tree/master/tnt_pytorch, and the MindSpore code is at https://gitee.com/mindspore/mindspore/tree/master/model_zoo/research/cv/TNT.
    Learning Domain Invariant Representations for Generalizable Person Re-Identification. (arXiv:2103.15890v2 [cs.CV] UPDATED)
    (2 min) Generalizable person Re-Identification (ReID) has attracted growing attention in recent computer vision community. In this work, we construct a structural causal model among identity labels, identity-specific factors (clothes/shoes color etc), and domain-specific factors (background, viewpoints etc). According to the causal analysis, we propose a novel Domain Invariant Representation Learning for generalizable person Re-Identification (DIR-ReID) framework. Specifically, we first propose to disentangle the identity-specific and domain-specific feature spaces, based on which we propose an effective algorithmic implementation for backdoor adjustment, essentially serving as a causal intervention towards the SCM. Extensive experiments have been conducted, showing that DIR-ReID outperforms state-of-the-art methods on large-scale domain generalization ReID benchmarks.
    Perceptual Adversarial Robustness: Defense Against Unseen Threat Models. (arXiv:2006.12655v4 [cs.LG] UPDATED)
    (3 min) A key challenge in adversarial robustness is the lack of a precise mathematical characterization of human perception, used in the very definition of adversarial attacks that are imperceptible to human eyes. Most current attacks and defenses try to avoid this issue by considering restrictive adversarial threat models such as those bounded by $L_2$ or $L_\infty$ distance, spatial perturbations, etc. However, models that are robust against any of these restrictive threat models are still fragile against other threat models. To resolve this issue, we propose adversarial training against the set of all imperceptible adversarial examples, approximated using deep neural networks. We call this threat model the neural perceptual threat model (NPTM); it includes adversarial examples with a bounded neural perceptual distance (a neural network-based approximation of the true perceptual distance) to natural images. Through an extensive perceptual study, we show that the neural perceptual distance correlates well with human judgements of perceptibility of adversarial examples, validating our threat model. Under the NPTM, we develop novel perceptual adversarial attacks and defenses. Because the NPTM is very broad, we find that Perceptual Adversarial Training (PAT) against a perceptual attack gives robustness against many other types of adversarial attacks. We test PAT on CIFAR-10 and ImageNet-100 against five diverse adversarial attacks. We find that PAT achieves state-of-the-art robustness against the union of these five attacks, more than doubling the accuracy over the next best model, without training against any of them. That is, PAT generalizes well to unforeseen perturbation types. This is vital in sensitive applications where a particular threat model cannot be assumed, and to the best of our knowledge, PAT is the first adversarial training defense with this property.
    Learning a Model for Inferring a Spatial Road Lane Network Graph using Self-Supervision. (arXiv:2107.01784v1 [cs.CV])
    (2 min) Interconnected road lanes are a central concept for navigating urban roads. Currently, most autonomous vehicles rely on preconstructed lane maps as designing an algorithmic model is difficult. However, the generation and maintenance of such maps is costly and hinders large-scale adoption of autonomous vehicle technology. This paper presents the first self-supervised learning method to train a model to infer a spatially grounded lane-level road network graph based on a dense segmented representation of the road scene generated from onboard sensors. A formal road lane network model is presented and proves that any structured road scene can be represented by a directed acyclic graph of at most depth three while retaining the notion of intersection regions, and that this is the most compressed representation. The formal model is implemented by a hybrid neural and search-based model, utilizing a novel barrier function loss formulation for robust learning from partial labels. Experiments are conducted for all common road intersection layouts. Results show that the model can generalize to new road layouts, unlike previous approaches, demonstrating its potential for real-world application as a practical learning-based lane-level map generator.
    Unsupervised Audiovisual Synthesis via Exemplar Autoencoders. (arXiv:2001.04463v3 [cs.CV] UPDATED)
    (2 min) We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set. We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target exemplar speech. In contrast to existing methods, the proposed approach can be easily extended to an arbitrarily large number of speakers and styles using only 3 minutes of target audio-video data, without requiring {\em any} training data for the input speaker. To do so, we learn audiovisual bottleneck representations that capture the structured linguistic content of speech. We outperform prior approaches on both audio and video synthesis, and provide extensive qualitative analysis on our project page -- https://www.cs.cmu.edu/~exemplar-ae/.
    Dynamic Feature Pyramid Networks for Object Detection. (arXiv:2012.00779v2 [cs.CV] UPDATED)
    (2 min) Feature pyramid network (FPN) is a critical component in modern object detection frameworks. The performance gain in most of the existing FPN variants is mainly attributed to the increase of computational burden. An attempt to enhance the FPN is enriching the spatial information by expanding the receptive fields, which is promising to largely improve the detection accuracy. In this paper, we first investigate how expanding the receptive fields affect the accuracy and computational costs of FPN. We explore a baseline model called inception FPN in which each lateral connection contains convolution filters with different kernel sizes. Moreover, we point out that not all objects need such a complicated calculation and propose a new dynamic FPN (DyFPN). The output features of DyFPN will be calculated by using the adaptively selected branch according to a dynamic gating operation. Therefore, the proposed method can provide a more efficient dynamic inference for achieving a better trade-off between accuracy and computational cost. Extensive experiments conducted on MS-COCO benchmark demonstrate that the proposed DyFPN significantly improves performance with the optimal allocation of computation resources. For instance, replacing inception FPN with DyFPN reduces about 40% of its FLOPs while maintaining similar high performance.
    COVID-Rate: An Automated Framework for Segmentation of COVID-19 Lesions from Chest CT Scans. (arXiv:2107.01527v1 [eess.IV])
    (2 min) Novel Coronavirus disease (COVID-19) is a highly contagious respiratory infection that has had devastating effects on the world. Recently, new COVID-19 variants are emerging making the situation more challenging and threatening. Evaluation and quantification of COVID-19 lung abnormalities based on chest Computed Tomography (CT) scans can help determining the disease stage, efficiently allocating limited healthcare resources, and making informed treatment decisions. During pandemic era, however, visual assessment and quantification of COVID-19 lung lesions by expert radiologists become expensive and prone to error, which raises an urgent quest to develop practical autonomous solutions. In this context, first, the paper introduces an open access COVID-19 CT segmentation dataset containing 433 CT images from 82 patients that have been annotated by an expert radiologist. Second, a Deep Neural Network (DNN)-based framework is proposed, referred to as the COVID-Rate, that autonomously segments lung abnormalities associated with COVID-19 from chest CT scans. Performance of the proposed COVID-Rate framework is evaluated through several experiments based on the introduced and external datasets. The results show a dice score of 0:802 and specificity and sensitivity of 0:997 and 0:832, respectively. Furthermore, the results indicate that the COVID-Rate model can efficiently segment COVID-19 lesions in both 2D CT images and whole lung volumes. Results on the external dataset illustrate generalization capabilities of the COVID-Rate model to CT images obtained from a different scanner.
    Gaze Estimation with an Ensemble of Four Architectures. (arXiv:2107.01980v1 [cs.CV])
    (2 min) This paper presents a method for gaze estimation according to face images. We train several gaze estimators adopting four different network architectures, including an architecture designed for gaze estimation (i.e.,iTracker-MHSA) and three originally designed for general computer vision tasks(i.e., BoTNet, HRNet, ResNeSt). Then, we select the best six estimators and ensemble their predictions through a linear combination. The method ranks the first on the leader-board of ETH-XGaze Competition, achieving an average angular error of $3.11^{\circ}$ on the ETH-XGaze test set.
    Deep Edge-Aware Interactive Colorization against Color-Bleeding Effects. (arXiv:2107.01619v1 [cs.CV])
    (2 min) Deep image colorization networks often suffer from the color-bleeding artifact, a problematic color spreading near the boundaries between adjacent objects. The color-bleeding artifacts debase the reality of generated outputs, limiting the applicability of colorization models on a practical application. Although previous approaches have tackled this problem in an automatic manner, they often generate imperfect outputs because their enhancements are available only in limited cases, such as having a high contrast of gray-scale value in an input image. Instead, leveraging user interactions would be a promising approach, since it can help the edge correction in the desired regions. In this paper, we propose a novel edge-enhancing framework for the regions of interest, by utilizing user scribbles that indicate where to enhance. Our method requires minimal user effort to obtain satisfactory enhancements. Experimental results on various datasets demonstrate that our interactive approach has outstanding performance in improving color-bleeding artifacts against the existing baselines.
    Image-to-Image Translation: Methods and Applications. (arXiv:2101.08629v2 [cs.CV] UPDATED)
    (2 min) Image-to-image translation (I2I) aims to transfer images from a source domain to a target domain while preserving the content representations. I2I has drawn increasing attention and made tremendous progress in recent years because of its wide range of applications in many computer vision and image processing problems, such as image synthesis, segmentation, style transfer, restoration, and pose estimation. In this paper, we provide an overview of the I2I works developed in recent years. We will analyze the key techniques of the existing I2I works and clarify the main progress the community has made. Additionally, we will elaborate on the effect of I2I on the research and industry community and point out remaining challenges in related fields.
    GANDA: A deep generative adversarial network predicts the spatial distribution of nanoparticles in tumor pixelly. (arXiv:2012.12561v2 [eess.IV] UPDATED)
    (2 min) Intratumoral nanoparticles (NPs) distribution is critical for the success of nanomedicine in imaging and treatment, but computational models to describe the NPs distribution remain unavailable due to the complex tumor-nano interactions. Here, we develop a Generative Adversarial Network for Distribution Analysis (GANDA) to describe and conditionally generates the intratumoral quantum dots (QDs) distribution after i.v. injection. This deep generative model is trained automatically by 27 775 patches of tumor vessels and cell nuclei decomposed from whole-slide images of 4T1 breast cancer sections. The GANDA model can conditionally generate images of intratumoral QDs distribution under the constraint of given tumor vessels and cell nuclei channels with the same spatial resolution (pixels-to-pixels), minimal loss (mean squared error, MSE = 1.871) and excellent reliability (intraclass correlation, ICC = 0.94). Quantitative analysis of QDs extravasation distance (ICC = 0.95) and subarea distribution (ICC = 0.99) is allowed on the generated images without knowing the real QDs distribution. We believe this deep generative model may provide opportunities to investigate how influencing factors affect NPs distribution in individual tumors and guide nanomedicine optimization for molecular imaging and personalized treatment.
    Direct Measure Matching for Crowd Counting. (arXiv:2107.01558v1 [cs.CV])
    (2 min) Traditional crowd counting approaches usually use Gaussian assumption to generate pseudo density ground truth, which suffers from problems like inaccurate estimation of the Gaussian kernel sizes. In this paper, we propose a new measure-based counting approach to regress the predicted density maps to the scattered point-annotated ground truth directly. First, crowd counting is formulated as a measure matching problem. Second, we derive a semi-balanced form of Sinkhorn divergence, based on which a Sinkhorn counting loss is designed for measure matching. Third, we propose a self-supervised mechanism by devising a Sinkhorn scale consistency loss to resist scale changes. Finally, an efficient optimization method is provided to minimize the overall loss function. Extensive experiments on four challenging crowd counting datasets namely ShanghaiTech, UCF-QNRF, JHU++, and NWPU have validated the proposed method.
    Anomaly Detection With Partitioning Overfitting Autoencoder Ensembles. (arXiv:2009.02755v6 [cs.LG] UPDATED)
    (2 min) In this paper, we propose POTATOES (Partitioning OverfiTting AuTOencoder EnSemble), a new method for unsupervised outlier detection (UOD). More precisely, given any autoencoder for UOD, this technique can be used to improve its accuracy while at the same time removing the burden of tuning its regularization. The idea is to not regularize at all, but to rather randomly partition the data into sufficiently many equally sized parts, overfit each part with its own autoencoder, and to use the maximum over all autoencoder reconstruction errors as the anomaly score. We apply our model to various realistic datasets and show that if the set of inliers is dense enough, our method indeed improves the UOD performance of a given autoencoder significantly. For reproducibility, the code is made available on github so the reader can recreate the results in this paper as well as apply the method to other autoencoders and datasets.
    Multi-view Graph Learning by Joint Modeling of Consistency and Inconsistency. (arXiv:2008.10208v2 [cs.LG] UPDATED)
    (2 min) Graph learning has emerged as a promising technique for multi-view clustering with its ability to learn a unified and robust graph from multiple views. However, existing graph learning methods mostly focus on the multi-view consistency issue, yet often neglect the inconsistency across multiple views, which makes them vulnerable to possibly low-quality or noisy datasets. To overcome this limitation, we propose a new multi-view graph learning framework, which for the first time simultaneously and explicitly models multi-view consistency and multi-view inconsistency in a unified objective function, through which the consistent and inconsistent parts of each single-view graph as well as the unified graph that fuses the consistent parts can be iteratively learned. Though optimizing the objective function is NP-hard, we design a highly efficient optimization algorithm which is able to obtain an approximate solution with linear time complexity in the number of edges in the unified graph. Furthermore, our multi-view graph learning approach can be applied to both similarity graphs and dissimilarity graphs, which lead to two graph fusion-based variants in our framework. Experiments on twelve multi-view datasets have demonstrated the robustness and efficiency of the proposed approach.
    A Rotation-Invariant Framework for Deep Point Cloud Analysis. (arXiv:2003.07238v2 [cs.CV] UPDATED)
    (2 min) Recently, many deep neural networks were designed to process 3D point clouds, but a common drawback is that rotation invariance is not ensured, leading to poor generalization to arbitrary orientations. In this paper, we introduce a new low-level purely rotation-invariant representation to replace common 3D Cartesian coordinates as the network inputs. Also, we present a network architecture to embed these representations into features, encoding local relations between points and their neighbors, and the global shape structure. To alleviate inevitable global information loss caused by the rotation-invariant representations, we further introduce a region relation convolution to encode local and non-local information. We evaluate our method on multiple point cloud analysis tasks, including shape classification, part segmentation, and shape retrieval. Experimental results show that our method achieves consistent, and also the best performance, on inputs at arbitrary orientations, compared with the state-of-the-arts.
    GraspME -- Grasp Manifold Estimator. (arXiv:2107.01836v1 [cs.RO])
    (2 min) In this paper, we introduce a Grasp Manifold Estimator (GraspME) to detect grasp affordances for objects directly in 2D camera images. To perform manipulation tasks autonomously it is crucial for robots to have such graspability models of the surrounding objects. Grasp manifolds have the advantage of providing continuously infinitely many grasps, which is not the case when using other grasp representations such as predefined grasp points. For instance, this property can be leveraged in motion optimization to define goal sets as implicit surface constraints in the robot configuration space. In this work, we restrict ourselves to the case of estimating possible end-effector positions directly from 2D camera images. To this extend, we define grasp manifolds via a set of key points and locate them in images using a Mask R-CNN backbone. Using learned features allows generalizing to different view angles, with potentially noisy images, and objects that were not part of the training set. We rely on simulation data only and perform experiments on simple and complex objects, including unseen ones. Our framework achieves an inference speed of 11.5 fps on a GPU, an average precision for keypoint estimation of 94.5% and a mean pixel distance of only 1.29. This shows that we can estimate the objects very well via bounding boxes and segmentation masks as well as approximate the correct grasp manifold's keypoint coordinates.
    On the Predictability of Pruning Across Scales. (arXiv:2006.10621v3 [cs.LG] UPDATED)
    (2 min) We show that the error of iteratively magnitude-pruned networks empirically follows a scaling law with interpretable coefficients that depend on the architecture and task. We functionally approximate the error of the pruned networks, showing it is predictable in terms of an invariant tying width, depth, and pruning level, such that networks of vastly different pruned densities are interchangeable. We demonstrate the accuracy of this approximation over orders of magnitude in depth, width, dataset size, and density. We show that the functional form holds (generalizes) for large scale data (e.g., ImageNet) and architectures (e.g., ResNets). As neural networks become ever larger and costlier to train, our findings suggest a framework for reasoning conceptually and analytically about a standard method for unstructured pruning.
    Faster-LTN: a neuro-symbolic, end-to-end object detection architecture. (arXiv:2107.01877v1 [cs.CV])
    (2 min) The detection of semantic relationships between objects represented in an image is one of the fundamental challenges in image interpretation. Neural-Symbolic techniques, such as Logic Tensor Networks (LTNs), allow the combination of semantic knowledge representation and reasoning with the ability to efficiently learn from examples typical of neural networks. We here propose Faster-LTN, an object detector composed of a convolutional backbone and an LTN. To the best of our knowledge, this is the first attempt to combine both frameworks in an end-to-end training setting. This architecture is trained by optimizing a grounded theory which combines labelled examples with prior knowledge, in the form of logical axioms. Experimental comparisons show competitive performance with respect to the traditional Faster R-CNN architecture.
    Similarity-Aware Fusion Network for 3D Semantic Segmentation. (arXiv:2107.01579v1 [cs.CV])
    (2 min) In this paper, we propose a similarity-aware fusion network (SAFNet) to adaptively fuse 2D images and 3D point clouds for 3D semantic segmentation. Existing fusion-based methods achieve remarkable performances by integrating information from multiple modalities. However, they heavily rely on the correspondence between 2D pixels and 3D points by projection and can only perform the information fusion in a fixed manner, and thus their performances cannot be easily migrated to a more realistic scenario where the collected data often lack strict pair-wise features for prediction. To address this, we employ a late fusion strategy where we first learn the geometric and contextual similarities between the input and back-projected (from 2D pixels) point clouds and utilize them to guide the fusion of two modalities to further exploit complementary information. Specifically, we employ a geometric similarity module (GSM) to directly compare the spatial coordinate distributions of pair-wise 3D neighborhoods, and a contextual similarity module (CSM) to aggregate and compare spatial contextual information of corresponding central points. The two proposed modules can effectively measure how much image features can help predictions, enabling the network to adaptively adjust the contributions of two modalities to the final prediction of each point. Experimental results on the ScanNetV2 benchmark demonstrate that SAFNet significantly outperforms existing state-of-the-art fusion-based approaches across various data integrity.
    GAN2GAN: Generative Noise Learning for Blind Denoising with Single Noisy Images. (arXiv:1905.10488v5 [eess.IV] UPDATED)
    (2 min) We tackle a challenging blind image denoising problem, in which only single distinct noisy images are available for training a denoiser, and no information about noise is known, except for it being zero-mean, additive, and independent of the clean image. In such a setting, which often occurs in practice, it is not possible to train a denoiser with the standard discriminative training or with the recently developed Noise2Noise (N2N) training; the former requires the underlying clean image for the given noisy image, and the latter requires two independently realized noisy image pair for a clean image. To that end, we propose GAN2GAN (Generated-Artificial-Noise to Generated-Artificial-Noise) method that first learns a generative model that can 1) simulate the noise in the given noisy images and 2) generate a rough, noisy estimates of the clean images, then 3) iteratively trains a denoiser with subsequently synthesized noisy image pairs (as in N2N), obtained from the generative model. In results, we show the denoiser trained with our GAN2GAN achieves an impressive denoising performance on both synthetic and real-world datasets for the blind denoising setting; it almost approaches the performance of the standard discriminatively-trained or N2N-trained models that have more information than ours, and it significantly outperforms the recent baseline for the same setting, \textit{e.g.}, Noise2Void, and a more conventional yet strong one, BM3D. The official code of our method is available at https://github.com/csm9493/GAN2GAN.
    A Novel Disaster Image Dataset and Characteristics Analysis using Attention Model. (arXiv:2107.01284v1 [cs.CV])
    (2 min) The advancement of deep learning technology has enabled us to develop systems that outperform any other classification technique. However, success of any empirical system depends on the quality and diversity of the data available to train the proposed system. In this research, we have carefully accumulated a relatively challenging dataset that contains images collected from various sources for three different disasters: fire, water and land. Besides this, we have also collected images for various damaged infrastructure due to natural or man made calamities and damaged human due to war or accidents. We have also accumulated image data for a class named non-damage that contains images with no such disaster or sign of damage in them. There are 13,720 manually annotated images in this dataset, each image is annotated by three individuals. We are also providing discriminating image class information annotated manually with bounding box for a set of 200 test images. Images are collected from different news portals, social media, and standard datasets made available by other researchers. A three layer attention model (TLAM) is trained and average five fold validation accuracy of 95.88% is achieved. Moreover, on the 200 unseen test images this accuracy is 96.48%. We also generate and compare attention maps for these test images to determine the characteristics of the trained attention model. Our dataset is available at https://niloy193.github.io/Disaster-Dataset
    Efficient Vision Transformers via Fine-Grained Manifold Distillation. (arXiv:2107.01378v1 [cs.CV])
    (2 min) This paper studies the model compression problem of vision transformers. Benefit from the self-attention module, transformer architectures have shown extraordinary performance on many computer vision tasks. Although the network performance is boosted, transformers are often required more computational resources including memory usage and the inference complexity. Compared with the existing knowledge distillation approaches, we propose to excavate useful information from the teacher transformer through the relationship between images and the divided patches. We then explore an efficient fine-grained manifold distillation approach that simultaneously calculates cross-images, cross-patch, and random-selected manifolds in teacher and student models. Experimental results conducted on several benchmarks demonstrate the superiority of the proposed algorithm for distilling portable transformer models with higher performance. For example, our approach achieves 75.06% Top-1 accuracy on the ImageNet-1k dataset for training a DeiT-Tiny model, which outperforms other ViT distillation methods.
    WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. (arXiv:1905.01489v3 [cs.CV] UPDATED)
    (2 min) Fisheye cameras are commonly employed for obtaining a large field of view in surveillance, augmented reality and in particular automotive applications. In spite of their prevalence, there are few public datasets for detailed evaluation of computer vision algorithms on fisheye images. We release the first extensive fisheye automotive dataset, WoodScape, named after Robert Wood who invented the fisheye camera in 1906. WoodScape comprises of four surround view cameras and nine tasks including segmentation, depth estimation, 3D bounding box detection and soiling detection. Semantic annotation of 40 classes at the instance level is provided for over 10,000 images and annotation for other tasks are provided for over 100,000 images. With WoodScape, we would like to encourage the community to adapt computer vision models for fisheye camera instead of using naive rectification.
    Joint Object Contour Points and Semantics for Instance Segmentation. (arXiv:2008.00460v3 [cs.CV] UPDATED)
    (2 min) The attributes of object contours has great significance for instance segmentation task. However, most of the current popular deep neural networks do not pay much attention to the object edge information. Inspired by the human annotation process when making instance segmentation datasets, in this paper, we propose Mask Point R-CNN aiming at promoting the neural network's attention to the object boundary. Specifically, we innovatively extend the original human keypoint detection task to the contour point detection of any object. Based on this analogy, we present an contour point detection auxiliary task to Mask R-CNN, which can boost the gradient flow between different tasks by effectively using feature fusion strategies and multi-task joint training. As a consequence, the model will be more sensitive to the edges of the object and can capture more geometric features. Quantitatively, the experimental results show that our approach outperforms vanilla Mask R-CNN by 3.8\% on Cityscapes dataset and 0.8\% on COCO dataset.
    A contextual analysis of multi-layer perceptron models in classifying hand-written digits and letters: limited resources. (arXiv:2107.01782v1 [cs.LG])
    (2 min) Classifying hand-written digits and letters has taken a big leap with the introduction of ConvNets. However, on very constrained hardware the time necessary to train such models would be high. Our main contribution is twofold. First, we extensively test an end-to-end vanilla neural network (MLP) approach in pure numpy without any pre-processing or feature extraction done beforehand. Second, we show that basic data mining operations can significantly improve the performance of the models in terms of computational time, without sacrificing much accuracy. We illustrate our claims on a simpler variant of the Extended MNIST dataset, called Balanced EMNIST dataset. Our experiments show that, without any data mining, we get increased generalization performance when using more hidden layers and regularization techniques, the best model achieving 84.83% accuracy on a test dataset. Using dimensionality reduction done by PCA we were able to increase that figure to 85.08% with only 10% of the original feature space, reducing the memory size needed by 64%. Finally, adding methods to remove possibly harmful training samples like deviation from the mean helped us to still achieve over 84% test accuracy but with only 32.8% of the original memory size for the training set. This compares favorably to the majority of literature results obtained through similar architectures. Although this approach gets outshined by state-of-the-art models, it does scale to some (AlexNet, VGGNet) trained on 50% of the same dataset.
    Bag of Instances Aggregation Boosts Self-supervised Learning. (arXiv:2107.01691v1 [cs.CV])
    (2 min) Recent advances in self-supervised learning have experienced remarkable progress, especially for contrastive learning based methods, which regard each image as well as its augmentations as an individual class and try to distinguish them from all other images. However, due to the large quantity of exemplars, this kind of pretext task intrinsically suffers from slow convergence and is hard for optimization. This is especially true for small scale models, which we find the performance drops dramatically comparing with its supervised counterpart. In this paper, we propose a simple but effective distillation strategy for unsupervised learning. The highlight is that the relationship among similar samples counts and can be seamlessly transferred to the student to boost the performance. Our method, termed as BINGO, which is short for \textbf{B}ag of \textbf{I}nsta\textbf{N}ces a\textbf{G}gregati\textbf{O}n, targets at transferring the relationship learned by the teacher to the student. Here bag of instances indicates a set of similar samples constructed by the teacher and are grouped within a bag, and the goal of distillation is to aggregate compact representations over the student with respect to instances in a bag. Notably, BINGO achieves new state-of-the-art performance on small scale models, \emph{i.e.}, 65.5% and 68.9% top-1 accuracies with linear evaluation on ImageNet, using ResNet-18 and ResNet-34 as backbone, respectively, surpassing baselines (52.5% and 57.4% top-1 accuracies) by a significant margin. The code will be available at \url{https://github.com/haohang96/bingo}.
    Better Compression with Deep Pre-Editing. (arXiv:2002.00113v2 [eess.IV] UPDATED)
    (2 min) Could we compress images via standard codecs while avoiding visible artifacts? The answer is obvious -- this is doable as long as the bit budget is generous enough. What if the allocated bit-rate for compression is insufficient? Then unfortunately, artifacts are a fact of life. Many attempts were made over the years to fight this phenomenon, with various degrees of success. In this work we aim to break the unholy connection between bit-rate and image quality, and propose a way to circumvent compression artifacts by pre-editing the incoming image and modifying its content to fit the given bits. We design this editing operation as a learned convolutional neural network, and formulate an optimization problem for its training. Our loss takes into account a proximity between the original image and the edited one, a bit-budget penalty over the proposed image, and a no-reference image quality measure for forcing the outcome to be visually pleasing. The proposed approach is demonstrated on the popular JPEG compression, showing savings in bits and/or improvements in visual quality, obtained with intricate editing effects.
    UCSL : A Machine Learning Expectation-Maximization framework for Unsupervised Clustering driven by Supervised Learning. (arXiv:2107.01988v1 [stat.ML])
    (2 min) Subtype Discovery consists in finding interpretable and consistent sub-parts of a dataset, which are also relevant to a certain supervised task. From a mathematical point of view, this can be defined as a clustering task driven by supervised learning in order to uncover subgroups in line with the supervised prediction. In this paper, we propose a general Expectation-Maximization ensemble framework entitled UCSL (Unsupervised Clustering driven by Supervised Learning). Our method is generic, it can integrate any clustering method and can be driven by both binary classification and regression. We propose to construct a non-linear model by merging multiple linear estimators, one per cluster. Each hyperplane is estimated so that it correctly discriminates - or predict - only one cluster. We use SVC or Logistic Regression for classification and SVR for regression. Furthermore, to perform cluster analysis within a more suitable space, we also propose a dimension-reduction algorithm that projects the data onto an orthonormal space relevant to the supervised task. We analyze the robustness and generalization capability of our algorithm using synthetic and experimental datasets. In particular, we validate its ability to identify suitable consistent sub-types by conducting a psychiatric-diseases cluster analysis with known ground-truth labels. The gain of the proposed method over previous state-of-the-art techniques is about +1.9 points in terms of balanced accuracy. Finally, we make codes and examples available in a scikit-learn-compatible Python package at https://github.com/neurospin-projects/2021_rlouiset_ucsl
    Robust End-to-End Offline Chinese Handwriting Text Page Spotter with Text Kernel. (arXiv:2107.01547v1 [cs.CV])
    (2 min) Offline Chinese handwriting text recognition is a long-standing research topic in the field of pattern recognition. In previous studies, text detection and recognition are separated, which leads to the fact that text recognition is highly dependent on the detection results. In this paper, we propose a robust end-to-end Chinese text page spotter framework. It unifies text detection and text recognition with text kernel that integrates global text feature information to optimize the recognition from multiple scales, which reduces the dependence of detection and improves the robustness of the system. Our method achieves state-of-the-art results on the CASIA-HWDB2.0-2.2 dataset and ICDAR-2013 competition dataset. Without any language model, the correct rates are 99.12% and 94.27% for line-level recognition, and 99.03% and 94.20% for page-level recognition, respectively.
    Continual Contrastive Self-supervised Learning for Image Classification. (arXiv:2107.01776v1 [cs.CV])
    (2 min) For artificial learning systems, continual learning over time from a stream of data is essential. The burgeoning studies on supervised continual learning have achieved great progress, while the study of catastrophic forgetting in unsupervised learning is still blank. Among unsupervised learning methods, self-supervise learning method shows tremendous potential on visual representation without any labeled data at scale. To improve the visual representation of self-supervised learning, larger and more varied data is needed. In the real world, unlabeled data is generated at all times. This circumstance provides a huge advantage for the learning of the self-supervised method. However, in the current paradigm, packing previous data and current data together and training it again is a waste of time and resources. Thus, a continual self-supervised learning method is badly needed. In this paper, we make the first attempt to implement the continual contrastive self-supervised learning by proposing a rehearsal method, which keeps a few exemplars from the previous data. Instead of directly combining saved exemplars with the current data set for training, we leverage self-supervised knowledge distillation to transfer contrastive information among previous data to the current network by mimicking similarity score distribution inferred by the old network over a set of saved exemplars. Moreover, we build an extra sample queue to assist the network to distinguish between previous and current data and prevent mutual interference while learning their own feature representation. Experimental results show that our method performs well on CIFAR100 and ImageNet-Sub. Compared with self-supervised baselines, which learning tasks one by one without taking any technique, we improve the image classification top-1 accuracy by 1.60% on CIFAR100 and 2.86% on ImageNet-Sub under 10 incremental steps setting.
    On The Distribution of Penultimate Activations of Classification Networks. (arXiv:2107.01900v1 [cs.LG])
    (2 min) This paper studies probability distributions ofpenultimate activations of classification networks.We show that, when a classification network istrained with the cross-entropy loss, its final classi-fication layer forms aGenerative-Discriminativepairwith a generative classifier based on a specificdistribution of penultimate activations. More im-portantly, the distribution is parameterized by theweights of the final fully-connected layer, and canbe considered as a generative model that synthe-sizes the penultimate activations without feedinginput data. We empirically demonstrate that thisgenerative model enables stable knowledge dis-tillation in the presence of domain shift, and cantransfer knowledge from a classifier to variationalautoencoders and generative adversarial networksfor class-conditional image generation.
    Web-Scale Generic Object Detection at Microsoft Bing. (arXiv:2107.01814v1 [cs.CV])
    (2 min) In this paper, we present Generic Object Detection (GenOD), one of the largest object detection systems deployed to a web-scale general visual search engine that can detect over 900 categories for all Microsoft Bing Visual Search queries in near real-time. It acts as a fundamental visual query understanding service that provides object-centric information and shows gains in multiple production scenarios, improving upon domain-specific models. We discuss the challenges of collecting data, training, deploying and updating such a large-scale object detection model with multiple dependencies. We discuss a data collection pipeline that reduces per-bounding box labeling cost by 81.5% and latency by 61.2% while improving on annotation quality. We show that GenOD can improve weighted average precision by over 20% compared to multiple domain-specific models. We also improve the model update agility by nearly 2 times with the proposed disjoint detector training compared to joint fine-tuning. Finally we demonstrate how GenOD benefits visual search applications by significantly improving object-level search relevance by 54.9% and user engagement by 59.9%.
    Depth Quality-Inspired Feature Manipulation for Efficient RGB-D Salient Object Detection. (arXiv:2107.01779v1 [cs.CV])
    (2 min) RGB-D salient object detection (SOD) recently has attracted increasing research interest by benefiting conventional RGB SOD with extra depth information. However, existing RGB-D SOD models often fail to perform well in terms of both efficiency and accuracy, which hinders their potential applications on mobile devices and real-world problems. An underlying challenge is that the model accuracy usually degrades when the model is simplified to have few parameters. To tackle this dilemma and also inspired by the fact that depth quality is a key factor influencing the accuracy, we propose a novel depth quality-inspired feature manipulation (DQFM) process, which is efficient itself and can serve as a gating mechanism for filtering depth features to greatly boost the accuracy. DQFM resorts to the alignment of low-level RGB and depth features, as well as holistic attention of the depth stream to explicitly control and enhance cross-modal fusion. We embed DQFM to obtain an efficient light-weight model called DFM-Net, where we also design a tailored depth backbone and a two-stage decoder for further efficiency consideration. Extensive experimental results demonstrate that our DFM-Net achieves state-of-the-art accuracy when comparing to existing non-efficient models, and meanwhile runs at 140ms on CPU (2.2$\times$ faster than the prior fastest efficient model) with only $\sim$8.5Mb model size (14.9% of the prior lightest). Our code will be made publicly available.
    COVID-VIT: Classification of COVID-19 from CT chest images based on vision transformer models. (arXiv:2107.01682v1 [eess.IV])
    (2 min) This paper is responding to the MIA-COV19 challenge to classify COVID from non-COVID based on CT lung images. The COVID-19 virus has devastated the world in the last eighteen months by infecting more than 182 million people and causing over 3.9 million deaths. The overarching aim is to predict the diagnosis of the COVID-19 virus from chest radiographs, through the development of explainable vision transformer deep learning techniques, leading to population screening in a more rapid, accurate and transparent way. In this competition, there are 5381 three-dimensional (3D) datasets in total, including 1552 for training, 374 for evaluation and 3455 for testing. While most of the data volumes are in axial view, there are a number of subjects' data are in coronal or sagittal views with 1 or 2 slices are in axial view. Hence, while 3D data based classification is investigated, in this competition, 2D images remains the main focus. Two deep learning methods are studied, which are vision transformer (ViT) based on attention models and DenseNet that is built upon conventional convolutional neural network (CNN). Initial evaluation results based on validation datasets whereby the ground truth is known indicate that ViT performs better than DenseNet with F1 scores being 0.76 and 0.72 respectively. Codes are available at GitHub at .
    Split-and-Bridge: Adaptable Class Incremental Learning within a Single Neural Network. (arXiv:2107.01349v1 [cs.LG])
    (2 min) Continual learning has been a major problem in the deep learning community, where the main challenge is how to effectively learn a series of newly arriving tasks without forgetting the knowledge of previous tasks. Initiated by Learning without Forgetting (LwF), many of the existing works report that knowledge distillation is effective to preserve the previous knowledge, and hence they commonly use a soft label for the old task, namely a knowledge distillation (KD) loss, together with a class label for the new task, namely a cross entropy (CE) loss, to form a composite loss for a single neural network. However, this approach suffers from learning the knowledge by a CE loss as a KD loss often more strongly influences the objective function when they are in a competitive situation within a single network. This could be a critical problem particularly in a class incremental scenario, where the knowledge across tasks as well as within the new task, both of which can only be acquired by a CE loss, is essentially learned due to the existence of a unified classifier. In this paper, we propose a novel continual learning method, called Split-and-Bridge, which can successfully address the above problem by partially splitting a neural network into two partitions for training the new task separated from the old task and re-connecting them for learning the knowledge across tasks. In our thorough experimental analysis, our Split-and-Bridge method outperforms the state-of-the-art competitors in KD-based continual learning.
    Exploring Data Pipelines through the Process Lens: a Reference Model forComputer Vision. (arXiv:2107.01824v1 [cs.CV])
    (2 min) Researchers have identified datasets used for training computer vision (CV) models as an important source of hazardous outcomes, and continue to examine popular CV datasets to expose their harms. These works tend to treat datasets as objects, or focus on particular steps in data production pipelines. We argue here that we could further systematize our analysis of harms by examining CV data pipelines through a process-oriented lens that captures the creation, the evolution and use of these datasets. As a step towards cultivating a process-oriented lens, we embarked on an empirical study of CV data pipelines informed by the field of method engineering. We present here a preliminary result: a reference model of CV data pipelines. Besides exploring the questions that this endeavor raises, we discuss how the process lens could support researchers in discovering understudied issues, and could help practitioners in making their processes more transparent.
    Ray-ONet: Efficient 3D Reconstruction From A Single RGB Image. (arXiv:2107.01899v1 [cs.CV])
    (2 min) We propose Ray-ONet to reconstruct detailed 3D models from monocular images efficiently. By predicting a series of occupancy probabilities along a ray that is back-projected from a pixel in the camera coordinate, our method Ray-ONet improves the reconstruction accuracy in comparison with Occupancy Networks (ONet), while reducing the network inference complexity to O($N^2$). As a result, Ray-ONet achieves state-of-the-art performance on the ShapeNet benchmark with more than 20$\times$ speed-up at $128^3$ resolution and maintains a similar memory footprint during inference.
    SM-SGE: A Self-Supervised Multi-Scale Skeleton Graph Encoding Framework for Person Re-Identification. (arXiv:2107.01903v1 [cs.CV])
    (2 min) Person re-identification via 3D skeletons is an emerging topic with great potential in security-critical applications. Existing methods typically learn body and motion features from the body-joint trajectory, whereas they lack a systematic way to model body structure and underlying relations of body components beyond the scale of body joints. In this paper, we for the first time propose a Self-supervised Multi-scale Skeleton Graph Encoding (SM-SGE) framework that comprehensively models human body, component relations, and skeleton dynamics from unlabeled skeleton graphs of various scales to learn an effective skeleton representation for person Re-ID. Specifically, we first devise multi-scale skeleton graphs with coarse-to-fine human body partitions, which enables us to model body structure and skeleton dynamics at multiple levels. Second, to mine inherent correlations between body components in skeletal motion, we propose a multi-scale graph relation network to learn structural relations between adjacent body-component nodes and collaborative relations among nodes of different scales, so as to capture more discriminative skeleton graph features. Last, we propose a novel multi-scale skeleton reconstruction mechanism to enable our framework to encode skeleton dynamics and high-level semantics from unlabeled skeleton graphs, which encourages learning a discriminative skeleton representation for person Re-ID. Extensive experiments show that SM-SGE outperforms most state-of-the-art skeleton-based methods. We further demonstrate its effectiveness on 3D skeleton data estimated from large-scale RGB videos. Our codes are open at https://github.com/Kali-Hac/SM-SGE.
    Towards Better Adversarial Synthesis of Human Images from Text. (arXiv:2107.01869v1 [cs.CV])
    (2 min) This paper proposes an approach that generates multiple 3D human meshes from text. The human shapes are represented by 3D meshes based on the SMPL model. The model's performance is evaluated on the COCO dataset, which contains challenging human shapes and intricate interactions between individuals. The model is able to capture the dynamics of the scene and the interactions between individuals based on text. We further show how using such a shape as input to image synthesis frameworks helps to constrain the network to synthesize humans with realistic human shapes.
    Controllable cardiac synthesis via disentangled anatomy arithmetic. (arXiv:2107.01748v1 [eess.IV])
    (2 min) Acquiring annotated data at scale with rare diseases or conditions remains a challenge. It would be extremely useful to have a method that controllably synthesizes images that can correct such underrepresentation. Assuming a proper latent representation, the idea of a "latent vector arithmetic" could offer the means of achieving such synthesis. A proper representation must encode the fidelity of the input data, preserve invariance and equivariance, and permit arithmetic operations. Motivated by the ability to disentangle images into spatial anatomy (tensor) factors and accompanying imaging (vector) representations, we propose a framework termed "disentangled anatomy arithmetic", in which a generative model learns to combine anatomical factors of different input images such that when they are re-entangled with the desired imaging modality (e.g. MRI), plausible new cardiac images are created with the target characteristics. To encourage a realistic combination of anatomy factors after the arithmetic step, we propose a localized noise injection network that precedes the generator. Our model is used to generate realistic images, pathology labels, and segmentation masks that are used to augment the existing datasets and subsequently improve post-hoc classification and segmentation tasks. Code is publicly available at https://github.com/vios-s/DAA-GAN.
    VinDr-RibCXR: A Benchmark Dataset for Automatic Segmentation and Labeling of Individual Ribs on Chest X-rays. (arXiv:2107.01327v1 [eess.IV])
    (2 min) We introduce a new benchmark dataset, namely VinDr-RibCXR, for automatic segmentation and labeling of individual ribs from chest X-ray (CXR) scans. The VinDr-RibCXR contains 245 CXRs with corresponding ground truth annotations provided by human experts. A set of state-of-the-art segmentation models are trained on 196 images from the VinDr-RibCXR to segment and label 20 individual ribs. Our best performing model obtains a Dice score of 0.834 (95% CI, 0.810--0.853) on an independent test set of 49 images. Our study, therefore, serves as a proof of concept and baseline performance for future research.
    Self-Contrastive Learning with Hard Negative Sampling for Self-supervised Point Cloud Learning. (arXiv:2107.01886v1 [cs.CV])
    (2 min) Point clouds have attracted increasing attention as a natural representation of 3D shapes. Significant progress has been made in developing methods for point cloud analysis, which often requires costly human annotation as supervision in practice. To address this issue, we propose a novel self-contrastive learning for self-supervised point cloud representation learning, aiming to capture both local geometric patterns and nonlocal semantic primitives based on the nonlocal self-similarity of point clouds. The contributions are two-fold: on the one hand, instead of contrasting among different point clouds as commonly employed in contrastive learning, we exploit self-similar point cloud patches within a single point cloud as positive samples and otherwise negative ones to facilitate the task of contrastive learning. Such self-contrastive learning is well aligned with the emerging paradigm of self-supervised learning for point cloud analysis. On the other hand, we actively learn hard negative samples that are close to positive samples in the representation space for discriminative feature learning, which are sampled conditional on each anchor patch leveraging on the degree of self-similarity. Experimental results show that the proposed method achieves state-of-the-art performance on widely used benchmark datasets for self-supervised point cloud segmentation and transfer learning for classification.
    Drone Detection Using Convolutional Neural Networks. (arXiv:2107.01435v1 [cs.CV])
    (2 min) In image processing, it is essential to detect and track air targets, especially UAVs. In this paper, we detect the flying drone using a fisheye camera. In the field of diagnosis and classification of objects, there are always many problems that prevent the development of rapid and significant progress in this area. During the previous decades, a couple of advanced classification methods such as convolutional neural networks and support vector machines have been developed. In this study, the drone was detected using three methods of classification of convolutional neural network (CNN), support vector machine (SVM), and nearest neighbor. The outcomes show that CNN, SVM, and nearest neighbor have total accuracy of 95%, 88%, and 80%, respectively. Compared with other classifiers with the same experimental conditions, the accuracy of the convolutional neural network classifier is satisfactory.
    SSPNet: Scale Selection Pyramid Network for Tiny Person Detection from UAV Images. (arXiv:2107.01548v1 [cs.CV])
    (2 min) With the increasing demand for search and rescue, it is highly demanded to detect objects of interest in large-scale images captured by Unmanned Aerial Vehicles (UAVs), which is quite challenging due to extremely small scales of objects. Most existing methods employed Feature Pyramid Network (FPN) to enrich shallow layers' features by combing deep layers' contextual features. However, under the limitation of the inconsistency in gradient computation across different layers, the shallow layers in FPN are not fully exploited to detect tiny objects. In this paper, we propose a Scale Selection Pyramid network (SSPNet) for tiny person detection, which consists of three components: Context Attention Module (CAM), Scale Enhancement Module (SEM), and Scale Selection Module (SSM). CAM takes account of context information to produce hierarchical attention heatmaps. SEM highlights features of specific scales at different layers, leading the detector to focus on objects of specific scales instead of vast backgrounds. SSM exploits adjacent layers' relationships to fulfill suitable feature sharing between deep layers and shallow layers, thereby avoiding the inconsistency in gradient computation across different layers. Besides, we propose a Weighted Negative Sampling (WNS) strategy to guide the detector to select more representative samples. Experiments on the TinyPerson benchmark show that our method outperforms other state-of-the-art (SOTA) detectors.
    Learning from scarce information: using synthetic data to classify Roman fine ware pottery. (arXiv:2107.01401v1 [cs.CV])
    (2 min) In this article we consider a version of the challenging problem of learning from datasets whose size is too limited to allow generalisation beyond the training set. To address the challenge we propose to use a transfer learning approach whereby the model is first trained on a synthetic dataset replicating features of the original objects. In this study the objects were smartphone photographs of near-complete Roman terra sigillata pottery vessels from the collection of the Museum of London. Taking the replicated features from published profile drawings of pottery forms allowed the integration of expert knowledge into the process through our synthetic data generator. After this first initial training the model was fine-tuned with data from photographs of real vessels. We show, through exhaustive experiments across several popular deep learning architectures, different test priors, and considering the impact of the photograph viewpoint and excessive damage to the vessels, that the proposed hybrid approach enables the creation of classifiers with appropriate generalisation performance. This performance is significantly better than that of classifiers trained exclusively on the original data which shows the promise of the approach to alleviate the fundamental issue of learning from small datasets.
    CT Image Harmonization for Enhancing Radiomics Studies. (arXiv:2107.01337v1 [eess.IV])
    (2 min) While remarkable advances have been made in Computed Tomography (CT), capturing CT images with non-standardized protocols causes low reproducibility regarding radiomic features, forming a barrier on CT image analysis in a large scale. RadiomicGAN is developed to effectively mitigate the discrepancy caused by using non-standard reconstruction kernels. RadiomicGAN consists of hybrid neural blocks including both pre-trained and trainable layers adopted to learn radiomic feature distributions efficiently. A novel training approach, called Dynamic Window-based Training, has been developed to smoothly transform the pre-trained model to the medical imaging domain. Model performance evaluated using 1401 radiomic features show that RadiomicGAN clearly outperforms the state-of-art image standardization models.
    A study of CNN capacity applied to Left Venticle Segmentation in Cardiac MRI. (arXiv:2107.01318v1 [eess.IV])
    (2 min) CNN (Convolutional Neural Network) models have been successfully used for segmentation of the left ventricle (LV) in cardiac MRI (Magnetic Resonance Imaging), providing clinical measurements.In practice, two questions arise with deployment of CNNs: 1) when is it better to use a shallow model instead of a deeper one? 2) how the size of a dataset might change the network performance? We propose a framework to answer them, by experimenting with deep and shallow versions of three U-Net families, trained from scratch in six subsets varying from 100 to 10,000 images, different network sizes, learning rates and regularization values. 1620 models were evaluated using 5-foldcross-validation by loss and DICE. The results indicate that: sample size affects performance more than architecture or hyper-parameters; in small samples the performance is more sensitive to hyper-parameters than architecture; the performance difference between shallow and deeper networks is not the same across families.
    Imaging dynamics beneath turbid media via parallelized single-photon detection. (arXiv:2107.01422v1 [physics.optics])
    (2 min) Noninvasive optical imaging through dynamic scattering media has numerous important biomedical applications but still remains a challenging task. While standard methods aim to form images based upon optical absorption or fluorescent emission, it is also well-established that the temporal correlation of scattered coherent light diffuses through tissue much like optical intensity. Few works to date, however, have aimed to experimentally measure and process such data to demonstrate deep-tissue imaging of decorrelation dynamics. In this work, we take advantage of a single-photon avalanche diode (SPAD) array camera, with over one thousand detectors, to simultaneously detect speckle fluctuations at the single-photon level from 12 different phantom tissue surface locations delivered via a customized fiber bundle array. We then apply a deep neural network to convert the acquired single-photon measurements into video of scattering dynamics beneath rapidly decorrelating liquid tissue phantoms. We demonstrate the ability to record video of dynamic events occurring 5-8 mm beneath a decorrelating tissue phantom with mm-scale resolution and at a 2.5-10 Hz frame rate.
    Data Uncertainty Guided Noise-aware Preprocessing Of Fingerprints. (arXiv:2107.01248v1 [cs.CV])
    (2 min) The effectiveness of fingerprint-based authentication systems on good quality fingerprints is established long back. However, the performance of standard fingerprint matching systems on noisy and poor quality fingerprints is far from satisfactory. Towards this, we propose a data uncertainty-based framework which enables the state-of-the-art fingerprint preprocessing models to quantify noise present in the input image and identify fingerprint regions with background noise and poor ridge clarity. Quantification of noise helps the model two folds: firstly, it makes the objective function adaptive to the noise in a particular input fingerprint and consequently, helps to achieve robust performance on noisy and distorted fingerprint regions. Secondly, it provides a noise variance map which indicates noisy pixels in the input fingerprint image. The predicted noise variance map enables the end-users to understand erroneous predictions due to noise present in the input image. Extensive experimental evaluation on 13 publicly available fingerprint databases, across different architectural choices and two fingerprint processing tasks demonstrate effectiveness of the proposed framework.
    Learning Hierarchical Graph Neural Networks for Image Clustering. (arXiv:2107.01319v1 [cs.CV])
    (2 min) We propose a hierarchical graph neural network (GNN) model that learns how to cluster a set of images into an unknown number of identities using a training set of images annotated with labels belonging to a disjoint set of identities. Our hierarchical GNN uses a novel approach to merge connected components predicted at each level of the hierarchy to form a new graph at the next level. Unlike fully unsupervised hierarchical clustering, the choice of grouping and complexity criteria stems naturally from supervision in the training set. The resulting method, Hi-LANDER, achieves an average of 54% improvement in F-score and 8% increase in Normalized Mutual Information (NMI) relative to current GNN-based clustering algorithms. Additionally, state-of-the-art GNN-based methods rely on separate models to predict linkage probabilities and node densities as intermediate steps of the clustering process. In contrast, our unified framework achieves a seven-fold decrease in computational cost. We release our training and inference code at https://github.com/dmlc/dgl/tree/master/examples/pytorch/hilander.
    EAR-NET: Error Attention Refining Network For Retinal Vessel Segmentation. (arXiv:2107.01351v1 [eess.IV])
    (2 min) The precise detection of blood vessels in retinal images is crucial to the early diagnosis of the retinal vascular diseases, e.g., diabetic, hypertensive and solar retinopathies. Existing works often fail in predicting the abnormal areas, e.g, sudden brighter and darker areas and are inclined to predict a pixel to background due to the significant class imbalance, leading to high accuracy and specificity while low sensitivity. To that end, we propose a novel error attention refining network (ERA-Net) that is capable of learning and predicting the potential false predictions in a two-stage manner for effective retinal vessel segmentation. The proposed ERA-Net in the refine stage drives the model to focus on and refine the segmentation errors produced in the initial training stage. To achieve this, unlike most previous attention approaches that run in an unsupervised manner, we introduce a novel error attention mechanism which considers the differences between the ground truth and the initial segmentation masks as the ground truth to supervise the attention map learning. Experimental results demonstrate that our method achieves state-of-the-art performance on two common retinal blood vessel datasets.
    Scene-aware Learning Network for Radar Object Detection. (arXiv:2107.01469v1 [cs.CV])
    (2 min) Object detection is essential to safe autonomous or assisted driving. Previous works usually utilize RGB images or LiDAR point clouds to identify and localize multiple objects in self-driving. However, cameras tend to fail in bad driving conditions, e.g. bad weather or weak lighting, while LiDAR scanners are too expensive to get widely deployed in commercial applications. Radar has been drawing more and more attention due to its robustness and low cost. In this paper, we propose a scene-aware radar learning framework for accurate and robust object detection. First, the learning framework contains branches conditioning on the scene category of the radar sequence; with each branch optimized for a specific type of scene. Second, three different 3D autoencoder-based architectures are proposed for radar object detection and ensemble learning is performed over the different architectures to further boost the final performance. Third, we propose novel scene-aware sequence mix augmentation (SceneMix) and scene-specific post-processing to generate more robust detection results. In the ROD2021 Challenge, we achieved a final result of average precision of 75.0% and an average recall of 81.0%. Moreover, in the parking lot scene, our framework ranks first with an average precision of 97.8% and an average recall of 98.6%, which demonstrates the effectiveness of our framework.
    SPI-GAN: Towards Single-Pixel Imaging through Generative Adversarial Network. (arXiv:2107.01330v1 [cs.CV])
    (2 min) Single-pixel imaging is a novel imaging scheme that has gained popularity due to its huge computational gain and potential for a low-cost alternative to imaging beyond the visible spectrum. The traditional reconstruction methods struggle to produce a clear recovery when one limits the number of illumination patterns from a spatial light modulator. As a remedy, several deep-learning-based solutions have been proposed which lack good generalization ability due to the architectural setup and loss functions. In this paper, we propose a generative adversarial network-based reconstruction framework for single-pixel imaging, referred to as SPI-GAN. Our method can reconstruct images with 17.92 dB PSNR and 0.487 SSIM, even if the sampling ratio drops to 5%. This facilitates much faster reconstruction making our method suitable for single-pixel video. Furthermore, our ResNet-like architecture for the generator leads to useful representation learning that allows us to reconstruct completely unseen objects. The experimental results demonstrate that SPI-GAN achieves significant performance gain, e.g. near 3dB PSNR gain, over the current state-of-the-art method.
    WisdomNet: Prognosis of COVID-19 with Slender Prospect of False Negative Cases and Vaticinating the Probability of Maturation to ARDS using Posteroanterior Chest X-Rays. (arXiv:2107.01392v1 [eess.IV])
    (3 min) Coronavirus is a large virus family consisting of diverse viruses, some of which disseminate among mammals and others cause sickness among humans. COVID-19 is highly contagious and is rapidly spreading, rendering its early diagnosis of preeminent status. Researchers, medical specialists and organizations all over the globe have been working tirelessly to combat this virus and help in its containment. In this paper, a novel neural network called WisdomNet has been proposed, for the diagnosis of COVID-19 using chest X-rays. The WisdomNet uses the concept of Wisdom of Crowds as its founding idea. It is a two-layered convolutional Neural Network (CNN), which takes chest x-ray images as input. Both layers of the proposed neural network consist of a number of neural networks each. The dataset used for this study consists of chest x-ray images of COVID-19 positive patients, compiled and shared by Dr. Cohen on GitHub, and the chest x-ray images of healthy lungs and lungs affected by viral and bacterial pneumonia were obtained from Kaggle. The network not only pinpoints the presence of COVID-19, but also gives the probability of the disease maturing into Acute Respiratory Distress Syndrome (ARDS). Thus, predicting the progression of the disease in the COVID-19 positive patients. The network also slender the occurrences of false negative cases by employing a high threshold value, thus aids in curbing the spread of the disease and gives an accuracy of 100% for successfully predicting COVID-19 among the chest x-rays of patients affected with COVID-19, bacterial and viral pneumonia.
    CInC Flow: Characterizable Invertible 3x3 Convolution. (arXiv:2107.01358v1 [cs.LG])
    (2 min) Normalizing flows are an essential alternative to GANs for generative modelling, which can be optimized directly on the maximum likelihood of the dataset. They also allow computation of the exact latent vector corresponding to an image since they are composed of invertible transformations. However, the requirement of invertibility of the transformation prevents standard and expressive neural network models such as CNNs from being directly used. Emergent convolutions were proposed to construct an invertible 3$\times$3 CNN layer using a pair of masked CNN layers, making them inefficient. We study conditions such that 3$\times$3 CNNs are invertible, allowing them to construct expressive normalizing flows. We derive necessary and sufficient conditions on a padded CNN for it to be invertible. Our conditions for invertibility are simple, can easily be maintained during the training process. Since we require only a single CNN layer for every effective invertible CNN layer, our approach is more efficient than emerging convolutions. We also proposed a coupling method, Quad-coupling. We benchmark our approach and show similar performance results to emergent convolutions while improving the model's efficiency.
    Demiguise Attack: Crafting Invisible Semantic Adversarial Perturbations with Perceptual Similarity. (arXiv:2107.01396v1 [cs.CV])
    (2 min) Deep neural networks (DNNs) have been found to be vulnerable to adversarial examples. Adversarial examples are malicious images with visually imperceptible perturbations. While these carefully crafted perturbations restricted with tight $\Lp$ norm bounds are small, they are still easily perceivable by humans. These perturbations also have limited success rates when attacking black-box models or models with defenses like noise reduction filters. To solve these problems, we propose Demiguise Attack, crafting ``unrestricted'' perturbations with Perceptual Similarity. Specifically, we can create powerful and photorealistic adversarial examples by manipulating semantic information based on Perceptual Similarity. Adversarial examples we generate are friendly to the human visual system (HVS), although the perturbations are of large magnitudes. We extend widely-used attacks with our approach, enhancing adversarial effectiveness impressively while contributing to imperceptibility. Extensive experiments show that the proposed method not only outperforms various state-of-the-art attacks in terms of fooling rate, transferability, and robustness against defenses but can also improve attacks effectively. In addition, we also notice that our implementation can simulate illumination and contrast changes that occur in real-world scenarios, which will contribute to exposing the blind spots of DNNs.
  • cs.IR updates on arXiv.org

    FINT: Field-aware INTeraction Neural Network For CTR Prediction. (arXiv:2107.01999v1 [cs.IR])
    (2 min) As a critical component for online advertising and marking, click-through rate (CTR) prediction has draw lots of attentions from both industry and academia field. Recently, the deep learning has become the mainstream methodological choice for CTR. Despite of sustainable efforts have been made, existing approaches still pose several challenges. On the one hand, high-order interaction between the features is under-explored. On the other hand, high-order interactions may neglect the semantic information from the low-order fields. In this paper, we proposed a novel prediction method, named FINT, that employs the Field-aware INTeraction layer which captures high-order feature interactions while retaining the low-order field information. To empirically investigate the effectiveness and robustness of the FINT, we perform extensive experiments on the three realistic databases: KDD2012, Criteo and Avazu. The obtained results demonstrate that the FINT can significantly improve the performance compared to the existing methods, without increasing the amount of computation required. Moreover, the proposed method brought about 2.72\% increase to the advertising revenue of a big online video app through A/B testing. To better promote the research in CTR field, we will release our code as well as reference implementation of those baseline models in the final version.
    Learning Complex Users' Preferences for Recommender Systems. (arXiv:2107.01529v1 [cs.IR])
    (2 min) Recommender systems (RSs) have emerged as very useful tools to help customers with their decision-making process, find items of their interest, and alleviate the information overload problem. There are two different lines of approaches in RSs: (1) general recommenders with the main goal of discovering long-term users' preferences, and (2) sequential recommenders with the main focus of capturing short-term users' preferences in a session of user-item interaction (here, a session refers to a record of purchasing multiple items in one shopping event). While considering short-term users' preferences may satisfy their current needs and interests, long-term users' preferences provide users with the items that they may interact with, eventually. In this thesis, we first focus on improving the performance of general RSs. Most of the existing general RSs tend to exploit the users' rating patterns on common items to detect similar users. The data sparsity problem (i.e. the lack of available information) is one of the major challenges for the current general RSs, and they may fail to have any recommendations when there are no common items of interest among users. We call this problem data sparsity with no feedback on common items (DSW-n-FCI). To overcome this problem, we propose a personality-based RS in which similar users are identified based on the similarity of their personality traits.
    Improved Representation Learning for Session-based Recommendation. (arXiv:2107.01516v1 [cs.IR])
    (2 min) Session-based recommendation systems suggest relevant items to users by modeling user behavior and preferences using short-term anonymous sessions. Existing methods leverage Graph Neural Networks (GNNs) that propagate and aggregate information from neighboring nodes i.e., local message passing. Such graph-based architectures have representational limits, as a single sub-graph is susceptible to overfit the sequential dependencies instead of accounting for complex transitions between items in different sessions. We propose using a Transformer in combination with a target attentive GNN, which allows richer Representation Learning. Our experimental results and ablation show that our proposed method outperforms the existing methods on real-world benchmark datasets.
    Assessing Viewpoint Diversity in Search Results Using Ranking Fairness Metrics. (arXiv:2010.14531v2 [cs.IR] UPDATED)
    (2 min) The way pages are ranked in search results influences whether the users of search engines are exposed to more homogeneous, or rather to more diverse viewpoints. However, this viewpoint diversity is not trivial to assess. In this paper we use existing and novel ranking fairness metrics to evaluate viewpoint diversity in search result rankings. We conduct a controlled simulation study that shows how ranking fairness metrics can be used for viewpoint diversity, how their outcome should be interpreted, and which metric is most suitable depending on the situation. This paper lays out important ground work for future research to measure and assess viewpoint diversity in real search result rankings.
    NOTE: Solution for KDD-CUP 2021 WikiKG90M-LSC. (arXiv:2107.01892v1 [cs.IR])
    (2 min) WikiKG90M in KDD Cup 2021 is a large encyclopedic knowledge graph, which could benefit various downstream applications such as question answering and recommender systems. Participants are invited to complete the knowledge graph by predicting missing triplets. Recent representation learning methods have achieved great success on standard datasets like FB15k-237. Thus, we train the advanced algorithms in different domains to learn the triplets, including OTE, QuatE, RotatE and TransE. Significantly, we modified OTE into NOTE (short for Norm-OTE) for better performance. Besides, we use both the DeepWalk and the post-smoothing technique to capture the graph structure for supplementation. In addition to the representations, we also use various statistical probabilities among the head entities, the relations and the tail entities for the final prediction. Experimental results show that the ensemble of state-of-the-art representation learning methods could draw on each others strengths. And we develop feature engineering from validation candidates for further improvements. Please note that we apply the same strategy on the test set for final inference. And these features may not be practical in the real world when considering ranking against all the entities.
    Attribute-aware Explainable Complementary Clothing Recommendation. (arXiv:2107.01655v1 [cs.IR])
    (2 min) Modelling mix-and-match relationships among fashion items has become increasingly demanding yet challenging for modern E-commerce recommender systems. When performing clothes matching, most existing approaches leverage the latent visual features extracted from fashion item images for compatibility modelling, which lacks explainability of generated matching results and can hardly convince users of the recommendations. Though recent methods start to incorporate pre-defined attribute information (e.g., colour, style, length, etc.) for learning item representations and improving the model interpretability, their utilisation of attribute information is still mainly reserved for enhancing the learned item representations and generating explanations via post-processing. As a result, this creates a severe bottleneck when we are trying to advance the recommendation accuracy and generating fine-grained explanations since the explicit attributes have only loose connections to the actual recommendation process. This work aims to tackle the explainability challenge in fashion recommendation tasks by proposing a novel Attribute-aware Fashion Recommender (AFRec). Specifically, AFRec recommender assesses the outfit compatibility by explicitly leveraging the extracted attribute-level representations from each item's visual feature. The attributes serve as the bridge between two fashion items, where we quantify the affinity of a pair of items through the learned compatibility between their attributes. Extensive experiments have demonstrated that, by making full use of the explicit attributes in the recommendation process, AFRec is able to achieve state-of-the-art recommendation accuracy and generate intuitive explanations at the same time.
  • cs.LG updates on arXiv.org

    Structure by Architecture: Disentangled Representations without Regularization. (arXiv:2006.07796v3 [cs.LG] UPDATED)
    (2 min) We study the problem of self-supervised structured representation learning using autoencoders for generative modeling. Unlike most methods which rely on matching an arbitrary, relatively unstructured, prior distribution for sampling, we propose a sampling technique that relies solely on the independence of latent variables, thereby avoiding the trade-off between reconstruction quality and generative performance inherent to VAEs. We design a novel autoencoder architecture capable of learning a structured representation without the need for aggressive regularization. Our structural decoders learn a hierarchy of latent variables, akin to structural causal models, thereby ordering the information without any additional regularization. We demonstrate how these models learn a representation that improves results in a variety of downstream tasks including generation, disentanglement, and extrapolation using several challenging and natural image datasets.
    Template-Based Graph Clustering. (arXiv:2107.01994v1 [stat.ML])
    (2 min) We propose a novel graph clustering method guided by additional information on the underlying structure of the clusters (or communities). The problem is formulated as the matching of a graph to a template with smaller dimension, hence matching $n$ vertices of the observed graph (to be clustered) to the $k$ vertices of a template graph, using its edges as support information, and relaxed on the set of orthonormal matrices in order to find a $k$ dimensional embedding. With relevant priors that encode the density of the clusters and their relationships, our method outperforms classical methods, especially for challenging cases.
    Unsupervised Domain Adaptation of Object Detectors: A Survey. (arXiv:2105.13502v2 [cs.CV] UPDATED)
    (2 min) Recent advances in deep learning have led to the development of accurate and efficient models for various computer vision applications such as classification, segmentation, and detection. However, learning highly accurate models relies on the availability of large-scale annotated datasets. Due to this, model performance drops drastically when evaluated on label-scarce datasets having visually distinct images, termed as domain adaptation problem. There is a plethora of works to adapt classification and segmentation models to label-scarce target datasets through unsupervised domain adaptation. Considering that detection is a fundamental task in computer vision, many recent works have focused on developing novel domain adaptive detection techniques. Here, we describe in detail the domain adaptation problem for detection and present an extensive survey of the various methods. Furthermore, we highlight strategies proposed and the associated shortcomings. Subsequently, we identify multiple aspects of the problem that are most promising for future research. We believe that this survey shall be valuable to the pattern recognition experts working in the fields of computer vision, biometrics, medical imaging, and autonomous navigation by introducing them to the problem, and familiarizing them with the current status of the progress while providing promising directions for future research.
    A Two-step Surface-based 3D Deep Learning Pipeline for Segmentation of Intracranial Aneurysms. (arXiv:2006.16161v2 [eess.IV] UPDATED)
    (2 min) The exact shape of intracranial aneurysms is critical in medical diagnosis and surgical planning. While voxel-based deep learning frameworks have been proposed for this segmentation task, their performance remains limited. In this study, we offer a two-step surface-based deep learning pipeline that achieves significantly higher performance. Our proposed model takes a surface model of entire principal brain arteries containing aneurysms as input and returns aneurysms surfaces as output. A user first generates a surface model by manually specifying multiple thresholds for time-of-flight magnetic resonance angiography images. The system then samples small surface fragments from the entire brain arteries and classifies the surface fragments according to whether aneurysms are present using a point-based deep learning network (PointNet++). Finally, the system applies surface segmentation (SO-Net) to surface fragments containing aneurysms. We conduct a direct comparison of segmentation performance by counting voxels between the proposed surface-based framework and the existing voxel-based method, in which our framework achieves a much higher dice similarity coefficient score (72%) than the prior approach (46%).
    Information Theoretic Meta Learning with Gaussian Processes. (arXiv:2009.03228v3 [cs.LG] UPDATED)
    (2 min) We formulate meta learning using information theoretic concepts; namely, mutual information and the information bottleneck. The idea is to learn a stochastic representation or encoding of the task description, given by a training set, that is highly informative about predicting the validation set. By making use of variational approximations to the mutual information, we derive a general and tractable framework for meta learning. This framework unifies existing gradient-based algorithms and also allows us to derive new algorithms. In particular, we develop a memory-based algorithm that uses Gaussian processes to obtain non-parametric encoding representations. We demonstrate our method on a few-shot regression problem and on four few-shot classification problems, obtaining competitive accuracy when compared to existing baselines.
    The Implicit Bias for Adaptive Optimization Algorithms on Homogeneous Neural Networks. (arXiv:2012.06244v3 [cs.LG] UPDATED)
    (2 min) Despite their overwhelming capacity to overfit, deep neural networks trained by specific optimization algorithms tend to generalize well to unseen data. Recently, researchers explained it by investigating the implicit regularization effect of optimization algorithms. A remarkable progress is the work (Lyu&Li, 2019), which proves gradient descent (GD) maximizes the margin of homogeneous deep neural networks. Except GD, adaptive algorithms such as AdaGrad, RMSProp and Adam are popular owing to their rapid training process. However, theoretical guarantee for the generalization of adaptive optimization algorithms is still lacking. In this paper, we study the implicit regularization of adaptive optimization algorithms when they are optimizing the logistic loss on homogeneous deep neural networks. We prove that adaptive algorithms that adopt exponential moving average strategy in conditioner (such as Adam and RMSProp) can maximize the margin of the neural network, while AdaGrad that directly sums historical squared gradients in conditioner can not. It indicates superiority on generalization of exponential moving average strategy in the design of the conditioner. Technically, we provide a unified framework to analyze convergent direction of adaptive optimization algorithms by constructing novel adaptive gradient flow and surrogate margin. Our experiments can well support the theoretical findings on convergent direction of adaptive optimization algorithms.
    Emotion Recognition of the Singing Voice: Toward a Real-Time Analysis Tool for Singers. (arXiv:2105.00173v2 [cs.SD] UPDATED)
    (2 min) Current computational-emotion research has focused on applying acoustic properties to analyze how emotions are perceived mathematically or used in natural language processing machine learning models. While recent interest has focused on analyzing emotions from the spoken voice, little experimentation has been performed to discover how emotions are recognized in the singing voice -- both in noiseless and noisy data (i.e., data that is either inaccurate, difficult to interpret, has corrupted/distorted/nonsense information like actual noise sounds in this case, or has a low ratio of usable/unusable information). Not only does this ignore the challenges of training machine learning models on more subjective data and testing them with much noisier data, but there is also a clear disconnect in progress between advancing the development of convolutional neural networks and the goal of emotionally cognizant artificial intelligence. By training a new model to include this type of information with a rich comprehension of psycho-acoustic properties, not only can models be trained to recognize information within extremely noisy data, but advancement can be made toward more complex biofeedback applications -- including creating a model which could recognize emotions given any human information (language, breath, voice, body, posture) and be used in any performance medium (music, speech, acting) or psychological assistance for patients with disorders such as BPD, alexithymia, autism, among others. This paper seeks to reflect and expand upon the findings of related research and present a stepping-stone toward this end goal.
    Two Sides of the Same Coin: Heterophily and Oversmoothing in Graph Convolutional Neural Networks. (arXiv:2102.06462v3 [cs.LG] UPDATED)
    (3 min) Most graph convolutional neural networks (GCNs) perform poorly in graphs where neighbors typically have different features/classes (heterophily) and when stacking multiple layers (oversmoothing). These two seemingly unrelated problems have been studied independently, but there is recent empirical evidence that solving one problem may benefit the other. In this work, going beyond empirical observations, we aim to: (1) propose a new perspective to analyze the heterophily and oversmoothing problems under a unified theoretical framework, (2) identify the common causes of the two problems based on the proposed framework, and (3) propose simple yet effective strategies that address the common causes. Focusing on the node classification task, we use linear separability of node representations as an indicator to reflect the performance of GCNs and we propose to study the linear separability by analyzing the statistical change of the node representations in the graph convolution. We find that the relative degree of a node (compared to its neighbors) and the heterophily level of a node's neighborhood are the root causes that influence the separability of node representations. Our analysis suggests that: (1) Nodes with high heterophily always produce less separable representations after graph convolution; (2) Even with low heterophily, degree disparity between nodes can influence the network dynamics and result in a pseudo-heterophily situation, which helps to explain oversmoothing. Based on our insights, we propose simple modifications to the GCN architecture -- i.e., degree corrections and signed messages -- which alleviate the root causes of these issues, and also show this empirically on 9 real networks. Compared to other approaches, which tend to work well in one regime but fail in others, our modified GCN model consistently performs well across all settings.
    Neural Granular Sound Synthesis. (arXiv:2008.01393v3 [cs.SD] UPDATED)
    (2 min) Granular sound synthesis is a popular audio generation technique based on rearranging sequences of small waveform windows. In order to control the synthesis, all grains in a given corpus are analyzed through a set of acoustic descriptors. This provides a representation reflecting some form of local similarities across the grains. However, the quality of this grain space is bound by that of the descriptors. Its traversal is not continuously invertible to signal and does not render any structured temporality. We demonstrate that generative neural networks can implement granular synthesis while alleviating most of its shortcomings. We efficiently replace its audio descriptor basis by a probabilistic latent space learned with a Variational Auto-Encoder. In this setting the learned grain space is invertible, meaning that we can continuously synthesize sound when traversing its dimensions. It also implies that original grains are not stored for synthesis. Another major advantage of our approach is to learn structured paths inside this latent space by training a higher-level temporal embedding over arranged grain sequences. The model can be applied to many types of libraries, including pitched notes or unpitched drums and environmental noises. We report experiments on the common granular synthesis processes as well as novel ones such as conditional sampling and morphing.
    Bayesian Learning-Based Adaptive Control for Safety Critical Systems. (arXiv:1910.02325v3 [eess.SY] UPDATED)
    (2 min) Deep learning has enjoyed much recent success, and applying state-of-the-art model learning methods to controls is an exciting prospect. However, there is a strong reluctance to use these methods on safety-critical systems, which have constraints on safety, stability, and real-time performance. We propose a framework which satisfies these constraints while allowing the use of deep neural networks for learning model uncertainties. Central to our method is the use of Bayesian model learning, which provides an avenue for maintaining appropriate degrees of caution in the face of the unknown. In the proposed approach, we develop an adaptive control framework leveraging the theory of stochastic CLFs (Control Lyapunov Functions) and stochastic CBFs (Control Barrier Functions) along with tractable Bayesian model learning via Gaussian Processes or Bayesian neural networks. Under reasonable assumptions, we guarantee stability and safety while adapting to unknown dynamics with probability 1. We demonstrate this architecture for high-speed terrestrial mobility targeting potential applications in safety-critical high-speed Mars rover missions.
    Recent Theoretical Advances in Non-Convex Optimization. (arXiv:2012.06188v2 [math.OC] UPDATED)
    (2 min) Motivated by recent increased interest in optimization algorithms for non-convex optimization in application to training deep neural networks and other optimization problems in data analysis, we give an overview of recent theoretical results on global performance guarantees of optimization algorithms for non-convex optimization. We start with classical arguments showing that general non-convex problems could not be solved efficiently in a reasonable time. Then we give a list of problems that can be solved efficiently to find the global minimizer by exploiting the structure of the problem as much as it is possible. Another way to deal with non-convexity is to relax the goal from finding the global minimum to finding a stationary point or a local minimum. For this setting, we first present known results for the convergence rates of deterministic first-order methods, which are then followed by a general theoretical analysis of optimal stochastic and randomized gradient schemes, and an overview of the stochastic first-order methods. After that, we discuss quite general classes of non-convex problems, such as minimization of $\alpha$-weakly-quasi-convex functions and functions that satisfy Polyak--Lojasiewicz condition, which still allow obtaining theoretical convergence guarantees of first-order methods. Then we consider higher-order and zeroth-order/derivative-free methods and their convergence rates for non-convex optimization problems.
    Lipschitzness Is All You Need To Tame Off-policy Generative Adversarial Imitation Learning. (arXiv:2006.16785v2 [cs.LG] UPDATED)
    (2 min) Despite the recent success of reinforcement learning in various domains, these approaches remain, for the most part, deterringly sensitive to hyper-parameters and are often riddled with essential engineering feats allowing their success. We consider the case of off-policy generative adversarial imitation learning, and perform an in-depth review, qualitative and quantitative, of the method. We show that forcing the learned reward function to be local Lipschitz-continuous is a sine qua non condition for the method to perform well. We then study the effects of this necessary condition and provide several theoretical results involving the local Lipschitzness of the state-value function. We complement these guarantees with empirical evidence attesting to the strong positive effect that the consistent satisfaction of the Lipschitzness constraint on the reward has on imitation performance. Finally, we tackle a generic pessimistic reward preconditioning add-on spawning a large class of reward shaping methods, which makes the base method it is plugged into provably more robust, as shown in several additional theoretical guarantees. We then discuss these through a fine-grained lens and share our insights. Crucially, the guarantees derived and reported in this work are valid for any reward satisfying the Lipschitzness condition, nothing is specific to imitation. As such, these may be of independent interest.
    Towards Understanding Ensemble, Knowledge Distillation and Self-Distillation in Deep Learning. (arXiv:2012.09816v2 [cs.LG] UPDATED)
    (2 min) We formally study how ensemble of deep learning models can improve test accuracy, and how the superior performance of ensemble can be distilled into a single model using knowledge distillation. We consider the challenging case where the ensemble is simply an average of the outputs of a few independently trained neural networks with the SAME architecture, trained using the SAME algorithm on the SAME data set, and they only differ by the random seeds used in the initialization. We empirically show that ensemble/knowledge distillation in deep learning works very differently from traditional learning theory, especially differently from ensemble of random feature mappings or the neural-tangent-kernel feature mappings, and is potentially out of the scope of existing theorems. Thus, to properly understand ensemble and knowledge distillation in deep learning, we develop a theory showing that when data has a structure we refer to as "multi-view", then ensemble of independently trained neural networks can provably improve test accuracy, and such superior test accuracy can also be provably distilled into a single model by training a single model to match the output of the ensemble instead of the true label. Our result sheds light on how ensemble works in deep learning in a way that is completely different from traditional theorems, and how the "dark knowledge" is hidden in the outputs of the ensemble -- that can be used in knowledge distillation -- comparing to the true data labels. In the end, we prove that self-distillation can also be viewed as implicitly combining ensemble and knowledge distillation to improve test accuracy.
    Improving Graph Neural Network Expressivity via Subgraph Isomorphism Counting. (arXiv:2006.09252v3 [cs.LG] UPDATED)
    (2 min) While Graph Neural Networks (GNNs) have achieved remarkable results in a variety of applications, recent studies exposed important shortcomings in their ability to capture the structure of the underlying graph. It has been shown that the expressive power of standard GNNs is bounded by the Weisfeiler-Leman (WL) graph isomorphism test, from which they inherit proven limitations such as the inability to detect and count graph substructures. On the other hand, there is significant empirical evidence, e.g. in network science and bioinformatics, that substructures are often intimately related to downstream tasks. To this end, we propose "Graph Substructure Networks" (GSN), a topologically-aware message passing scheme based on substructure encoding. We theoretically analyse the expressive power of our architecture, showing that it is strictly more expressive than the WL test, and provide sufficient conditions for universality. Importantly, we do not attempt to adhere to the WL hierarchy; this allows us to retain multiple attractive properties of standard GNNs such as locality and linear network complexity, while being able to disambiguate even hard instances of graph isomorphism. We perform an extensive experimental evaluation on graph classification and regression tasks and obtain state-of-the-art results in diverse real-world settings including molecular graphs and social networks. The code is publicly available at https://github.com/gbouritsas/graph-substructure-networks.
    Learning Distributional Programs for Relational Autocompletion. (arXiv:2001.08603v5 [cs.AI] UPDATED)
    (2 min) Relational autocompletion is the problem of automatically filling out some missing values in multi-relational data. We tackle this problem within the probabilistic logic programming framework of Distributional Clauses (DC), which supports both discrete and continuous probability distributions. Within this framework, we introduce DiceML { an approach to learn both the structure and the parameters of DC programs from relational data (with possibly missing data). To realize this, DiceML integrates statistical modeling and distributional clauses with rule learning. The distinguishing features of DiceML are that it 1) tackles autocompletion in relational data, 2) learns distributional clauses extended with statistical models, 3) deals with both discrete and continuous distributions, 4) can exploit background knowledge, and 5) uses an expectation-maximization based algorithm to cope with missing data. The empirical results show the promise of the approach, even when there is missing data.
    A Local Convergence Theory for Mildly Over-Parameterized Two-Layer Neural Network. (arXiv:2102.02410v2 [cs.LG] UPDATED)
    (2 min) While over-parameterization is widely believed to be crucial for the success of optimization for the neural networks, most existing theories on over-parameterization do not fully explain the reason -- they either work in the Neural Tangent Kernel regime where neurons don't move much, or require an enormous number of neurons. In practice, when the data is generated using a teacher neural network, even mildly over-parameterized neural networks can achieve 0 loss and recover the directions of teacher neurons. In this paper we develop a local convergence theory for mildly over-parameterized two-layer neural net. We show that as long as the loss is already lower than a threshold (polynomial in relevant parameters), all student neurons in an over-parameterized two-layer neural network will converge to one of teacher neurons, and the loss will go to 0. Our result holds for any number of student neurons as long as it is at least as large as the number of teacher neurons, and our convergence rate is independent of the number of student neurons. A key component of our analysis is the new characterization of local optimization landscape -- we show the gradient satisfies a special case of Lojasiewicz property which is different from local strong convexity or PL conditions used in previous work.
    A Bit More Bayesian: Domain-Invariant Learning with Uncertainty. (arXiv:2105.04030v2 [cs.LG] UPDATED)
    (2 min) Domain generalization is challenging due to the domain shift and the uncertainty caused by the inaccessibility of target domain data. In this paper, we address both challenges with a probabilistic framework based on variational Bayesian inference, by incorporating uncertainty into neural network weights. We couple domain invariance in a probabilistic formula with the variational Bayesian inference. This enables us to explore domain-invariant learning in a principled way. Specifically, we derive domain-invariant representations and classifiers, which are jointly established in a two-layer Bayesian neural network. We empirically demonstrate the effectiveness of our proposal on four widely used cross-domain visual recognition benchmarks. Ablation studies validate the synergistic benefits of our Bayesian treatment when jointly learning domain-invariant representations and classifiers for domain generalization. Further, our method consistently delivers state-of-the-art mean accuracy on all benchmarks.
    Machine Learning for Fraud Detection in E-Commerce: A Research Agenda. (arXiv:2107.01979v1 [cs.LG])
    (2 min) Fraud detection and prevention play an important part in ensuring the sustained operation of any e-commerce business. Machine learning (ML) often plays an important role in these anti-fraud operations, but the organizational context in which these ML models operate cannot be ignored. In this paper, we take an organization-centric view on the topic of fraud detection by formulating an operational model of the anti-fraud departments in e-commerce organizations. We derive 6 research topics and 12 practical challenges for fraud detection from this operational model. We summarize the state of the literature for each research topic, discuss potential solutions to the practical challenges, and identify 22 open research challenges.
    An Overview of Human Activity Recognition Using Wearable Sensors: Healthcare and Artificial Intelligence. (arXiv:2103.15990v3 [cs.HC] UPDATED)
    (2 min) With the rapid development of the internet of things (IoT) and artificial intelligence (AI) technologies, human activity recognition (HAR) has been applied in a variety of domains such as security and surveillance, human-robot interaction, and entertainment. Even though a number of surveys and review papers have been published, there is a lack of HAR overview papers focusing on healthcare applications that use wearable sensors. Therefore, we fill in the gap by presenting this overview paper. In particular, we present our projects to illustrate the system design of HAR applications for healthcare. Our projects include early mobility identification of human activities for intensive care unit (ICU) patients and gait analysis of Duchenne muscular dystrophy (DMD) patients. We cover essential components of designing HAR systems including sensor factors (e.g., type, number, and placement location), AI model selection (e.g., classical machine learning models versus deep learning models), and feature engineering. In addition, we highlight the challenges of such healthcare-oriented HAR systems and propose several research opportunities for both the medical and the computer science community.
    A Critical Connectivity Radius for Randomly-Generated, High Dimensional Data Points. (arXiv:1602.03822v6 [cs.LG] UPDATED)
    (2 min) We use random geometric graphs to describe clusters of higher dimensional data points which are bijectively mapped to a (possibly) lower dimensional space where an equivalent random cluster model is used to calculate the expected number of modes to be found when separating the data of a multi-modal data set into distinct clusters. Furthermore, as a function of the expected number of modes and the number of data points in the sample, an upper bound on a given distance measure is found such that data points have the greatest correlation if their mutual distances from a common center is less than or equal to the calculated bound. Anomalies are exposed, which lie outside of the union of all regularized clusters of data points. Finally, similarly to finding a hyperplane which can be shifted along its normal to expose the maximal distance between binary classes, it is shown that the union of regularized clusters can be used to define a hyperplane which can be shifted by a certain amount to separate the data into binary classes.
    A Survey of Knowledge-Enhanced Text Generation. (arXiv:2010.04389v2 [cs.CL] UPDATED)
    (2 min) The goal of text generation is to make machines express in human language. It is one of the most important yet challenging tasks in natural language processing (NLP). Since 2014, various neural encoder-decoder models pioneered by Seq2Seq have been proposed to achieve the goal by learning to map input text to output text. However, the input text alone often provides limited knowledge to generate the desired output, so the performance of text generation is still far from satisfaction in many real-world scenarios. To address this issue, researchers have considered incorporating various forms of knowledge beyond the input text into the generation models. This research direction is known as knowledge-enhanced text generation. In this survey, we present a comprehensive review of the research on knowledge enhanced text generation over the past five years. The main content includes two parts: (i) general methods and architectures for integrating knowledge into text generation; (ii) specific techniques and applications according to different forms of knowledge data. This survey can have broad audiences, researchers and practitioners, in academia and industry.
    Unsupervised Audiovisual Synthesis via Exemplar Autoencoders. (arXiv:2001.04463v3 [cs.CV] UPDATED)
    (2 min) We present an unsupervised approach that converts the input speech of any individual into audiovisual streams of potentially-infinitely many output speakers. Our approach builds on simple autoencoders that project out-of-sample data onto the distribution of the training set. We use Exemplar Autoencoders to learn the voice, stylistic prosody, and visual appearance of a specific target exemplar speech. In contrast to existing methods, the proposed approach can be easily extended to an arbitrarily large number of speakers and styles using only 3 minutes of target audio-video data, without requiring {\em any} training data for the input speaker. To do so, we learn audiovisual bottleneck representations that capture the structured linguistic content of speech. We outperform prior approaches on both audio and video synthesis, and provide extensive qualitative analysis on our project page -- https://www.cs.cmu.edu/~exemplar-ae/.
    Sample Efficient Reinforcement Learning via Model-Ensemble Exploration and Exploitation. (arXiv:2107.01825v1 [cs.LG])
    (2 min) Model-based deep reinforcement learning has achieved success in various domains that require high sample efficiencies, such as Go and robotics. However, there are some remaining issues, such as planning efficient explorations to learn more accurate dynamic models, evaluating the uncertainty of the learned models, and more rational utilization of models. To mitigate these issues, we present MEEE, a model-ensemble method that consists of optimistic exploration and weighted exploitation. During exploration, unlike prior methods directly selecting the optimal action that maximizes the expected accumulative return, our agent first generates a set of action candidates and then seeks out the optimal action that takes both expected return and future observation novelty into account. During exploitation, different discounted weights are assigned to imagined transition tuples according to their model uncertainty respectively, which will prevent model predictive error propagation in agent training. Experiments on several challenging continuous control benchmark tasks demonstrated that our approach outperforms other model-free and model-based state-of-the-art methods, especially in sample complexity.
    Oracle Lower Bounds for Stochastic Gradient Sampling Algorithms. (arXiv:2002.00291v3 [stat.ML] UPDATED)
    (2 min) We consider the problem of sampling from a strongly log-concave density in $\mathbb{R}^d$, and prove an information theoretic lower bound on the number of stochastic gradient queries of the log density needed. Several popular sampling algorithms (including many Markov chain Monte Carlo methods) operate by using stochastic gradients of the log density to generate a sample; our results establish an information theoretic limit for all these algorithms. We show that for every algorithm, there exists a well-conditioned strongly log-concave target density for which the distribution of points generated by the algorithm would be at least $\varepsilon$ away from the target in total variation distance if the number of gradient queries is less than $\Omega(\sigma^2 d/\varepsilon^2)$, where $\sigma^2 d$ is the variance of the stochastic gradient. Our lower bound follows by combining the ideas of Le Cam deficiency routinely used in the comparison of statistical experiments along with standard information theoretic tools used in lower bounding Bayes risk functions. To the best of our knowledge our results provide the first nontrivial dimension-dependent lower bound for this problem.
    Repurposing GANs for One-shot Semantic Part Segmentation. (arXiv:2103.04379v5 [cs.CV] UPDATED)
    (2 min) While GANs have shown success in realistic image generation, the idea of using GANs for other tasks unrelated to synthesis is underexplored. Do GANs learn meaningful structural parts of objects during their attempt to reproduce those objects? In this work, we test this hypothesis and propose a simple and effective approach based on GANs for semantic part segmentation that requires as few as one label example along with an unlabeled dataset. Our key idea is to leverage a trained GAN to extract pixel-wise representation from the input image and use it as feature vectors for a segmentation network. Our experiments demonstrate that GANs representation is "readily discriminative" and produces surprisingly good results that are comparable to those from supervised baselines trained with significantly more labels. We believe this novel repurposing of GANs underlies a new class of unsupervised representation learning that is applicable to many other tasks. More results are available at https://repurposegans.github.io/.
    Explainability via Interactivity? Supporting Nonexperts' Sensemaking of Pretrained CNN by Interacting with Their Daily Surroundings. (arXiv:2107.01996v1 [cs.HC])
    (2 min) Current research on Explainable AI (XAI) heavily targets on expert users (data scientists or AI developers). However, increasing importance has been argued for making AI more understandable to nonexperts, who are expected to leverage AI techniques, but have limited knowledge about AI. We present a mobile application to support nonexperts to interactively make sense of Convolutional Neural Networks (CNN); it allows users to play with a pretrained CNN by taking pictures of their surrounding objects. We use an up-to-date XAI technique (Class Activation Map) to intuitively visualize the model's decision (the most important image regions that lead to a certain result). Deployed in a university course, this playful learning tool was found to support design students to gain vivid understandings about the capabilities and limitations of pretrained CNNs in real-world environments. Concrete examples of students' playful explorations are reported to characterize their sensemaking processes reflecting different depths of thought.
    Optimal Dynamic Regret in Exp-Concave Online Learning. (arXiv:2104.11824v2 [cs.LG] UPDATED)
    (2 min) We consider the problem of the Zinkevich (2003)-style dynamic regret minimization in online learning with exp-concave losses. We show that whenever improper learning is allowed, a Strongly Adaptive online learner achieves the dynamic regret of $\tilde O^*(n^{1/3}C_n^{2/3} \vee 1)$ where $C_n$ is the total variation (a.k.a. path length) of the an arbitrary sequence of comparators that may not be known to the learner ahead of time. Achieving this rate was highly nontrivial even for squared losses in 1D where the best known upper bound was $O(\sqrt{nC_n} \vee \log n)$ (Yuan and Lamperski, 2019). Our new proof techniques make elegant use of the intricate structures of the primal and dual variables imposed by the KKT conditions and could be of independent interest. Finally, we apply our results to the classical statistical problem of locally adaptive non-parametric regression (Mammen, 1991; Donoho and Johnstone, 1998) and obtain a stronger and more flexible algorithm that do not require any statistical assumptions or any hyperparameter tuning.
    BiERU: Bidirectional Emotional Recurrent Unit for Conversational Sentiment Analysis. (arXiv:2006.00492v3 [cs.CL] UPDATED)
    (2 min) Sentiment analysis in conversations has gained increasing attention in recent years for the growing amount of applications it can serve, e.g., sentiment analysis, recommender systems, and human-robot interaction. The main difference between conversational sentiment analysis and single sentence sentiment analysis is the existence of context information which may influence the sentiment of an utterance in a dialogue. How to effectively encode contextual information in dialogues, however, remains a challenge. Existing approaches employ complicated deep learning structures to distinguish different parties in a conversation and then model the context information. In this paper, we propose a fast, compact and parameter-efficient party-ignorant framework named bidirectional emotional recurrent unit for conversational sentiment analysis. In our system, a generalized neural tensor block followed by a two-channel classifier is designed to perform context compositionality and sentiment classification, respectively. Extensive experiments on three standard datasets demonstrate that our model outperforms the state of the art in most cases.
    Perceptual Adversarial Robustness: Defense Against Unseen Threat Models. (arXiv:2006.12655v4 [cs.LG] UPDATED)
    (3 min) A key challenge in adversarial robustness is the lack of a precise mathematical characterization of human perception, used in the very definition of adversarial attacks that are imperceptible to human eyes. Most current attacks and defenses try to avoid this issue by considering restrictive adversarial threat models such as those bounded by $L_2$ or $L_\infty$ distance, spatial perturbations, etc. However, models that are robust against any of these restrictive threat models are still fragile against other threat models. To resolve this issue, we propose adversarial training against the set of all imperceptible adversarial examples, approximated using deep neural networks. We call this threat model the neural perceptual threat model (NPTM); it includes adversarial examples with a bounded neural perceptual distance (a neural network-based approximation of the true perceptual distance) to natural images. Through an extensive perceptual study, we show that the neural perceptual distance correlates well with human judgements of perceptibility of adversarial examples, validating our threat model. Under the NPTM, we develop novel perceptual adversarial attacks and defenses. Because the NPTM is very broad, we find that Perceptual Adversarial Training (PAT) against a perceptual attack gives robustness against many other types of adversarial attacks. We test PAT on CIFAR-10 and ImageNet-100 against five diverse adversarial attacks. We find that PAT achieves state-of-the-art robustness against the union of these five attacks, more than doubling the accuracy over the next best model, without training against any of them. That is, PAT generalizes well to unforeseen perturbation types. This is vital in sensitive applications where a particular threat model cannot be assumed, and to the best of our knowledge, PAT is the first adversarial training defense with this property.
    Unified Interpretation of Softmax Cross-Entropy and Negative Sampling: With Case Study for Knowledge Graph Embedding. (arXiv:2106.07250v2 [cs.LG] UPDATED)
    (2 min) In knowledge graph embedding, the theoretical relationship between the softmax cross-entropy and negative sampling loss functions has not been investigated. This makes it difficult to fairly compare the results of the two different loss functions. We attempted to solve this problem by using the Bregman divergence to provide a unified interpretation of the softmax cross-entropy and negative sampling loss functions. Under this interpretation, we can derive theoretical findings for fair comparison. Experimental results on the FB15k-237 and WN18RR datasets show that the theoretical findings are valid in practical settings.
    Learning Domain Invariant Representations for Generalizable Person Re-Identification. (arXiv:2103.15890v2 [cs.CV] UPDATED)
    (2 min) Generalizable person Re-Identification (ReID) has attracted growing attention in recent computer vision community. In this work, we construct a structural causal model among identity labels, identity-specific factors (clothes/shoes color etc), and domain-specific factors (background, viewpoints etc). According to the causal analysis, we propose a novel Domain Invariant Representation Learning for generalizable person Re-Identification (DIR-ReID) framework. Specifically, we first propose to disentangle the identity-specific and domain-specific feature spaces, based on which we propose an effective algorithmic implementation for backdoor adjustment, essentially serving as a causal intervention towards the SCM. Extensive experiments have been conducted, showing that DIR-ReID outperforms state-of-the-art methods on large-scale domain generalization ReID benchmarks.
    Low-Latency Real-Time Non-Parallel Voice Conversion based on Cyclic Variational Autoencoder and Multiband WaveRNN with Data-Driven Linear Prediction. (arXiv:2105.09858v2 [cs.SD] UPDATED)
    (2 min) This paper presents a low-latency real-time (LLRT) non-parallel voice conversion (VC) framework based on cyclic variational autoencoder (CycleVAE) and multiband WaveRNN with data-driven linear prediction (MWDLP). CycleVAE is a robust non-parallel multispeaker spectral model, which utilizes a speaker-independent latent space and a speaker-dependent code to generate reconstructed/converted spectral features given the spectral features of an input speaker. On the other hand, MWDLP is an efficient and a high-quality neural vocoder that can handle multispeaker data and generate speech waveform for LLRT applications with CPU. To accommodate LLRT constraint with CPU, we propose a novel CycleVAE framework that utilizes mel-spectrogram as spectral features and is built with a sparse network architecture. Further, to improve the modeling performance, we also propose a novel fine-tuning procedure that refines the frame-rate CycleVAE network by utilizing the waveform loss from the MWDLP network. The experimental results demonstrate that the proposed framework achieves high-performance VC, while allowing for LLRT usage with a single-core of $2.1$--$2.7$ GHz CPU on a real-time factor of $0.87$--$0.95$, including input/output, feature extraction, on a frame shift of $10$ ms, a window length of $27.5$ ms, and $2$ lookup frames.
    MARINA: Faster Non-Convex Distributed Learning with Compression. (arXiv:2102.07845v2 [cs.LG] UPDATED)
    (2 min) We develop and analyze MARINA: a new communication efficient method for non-convex distributed learning over heterogeneous datasets. MARINA employs a novel communication compression strategy based on the compression of gradient differences that is reminiscent of but different from the strategy employed in the DIANA method of Mishchenko et al. (2019). Unlike virtually all competing distributed first-order methods, including DIANA, ours is based on a carefully designed biased gradient estimator, which is the key to its superior theoretical and practical performance. The communication complexity bounds we prove for MARINA are evidently better than those of all previous first-order methods. Further, we develop and analyze two variants of MARINA: VR-MARINA and PP-MARINA. The first method is designed for the case when the local loss functions owned by clients are either of a finite sum or of an expectation form, and the second method allows for a partial participation of clients -- a feature important in federated learning. All our methods are superior to previous state-of-the-art methods in terms of oracle/communication complexity. Finally, we provide a convergence analysis of all methods for problems satisfying the Polyak-Lojasiewicz condition.
    Annotating Motion Primitives for Simplifying Action Search in Reinforcement Learning. (arXiv:2102.12017v3 [cs.LG] UPDATED)
    (2 min) Reinforcement learning in large-scale environments is challenging due to the many possible actions that can be taken in specific situations. We have previously developed a means of constraining, and hence speeding up, the search process through the use of motion primitives; motion primitives are sequences of pre-specified actions taken across a state series. As a byproduct of this work, we have found that if the motion primitives' motions and actions are labeled, then the search can be sped up further. Since motion primitives may initially lack such details, we propose a theoretically viewpoint-insensitive and speed-insensitive means of automatically annotating the underlying motions and actions. We do this through a differential-geometric, spatio-temporal kinematics descriptor, which analyzes how the poses of entities in two motion sequences change over time. We use this descriptor in conjunction with a weighted-nearest-neighbor classifier to label the primitives using a limited set of training examples. In our experiments, we achieve high motion and action annotation rates for human-action-derived primitives with as few as one training sample. We also demonstrate that reinforcement learning using accurately labeled trajectories leads to high-performing policies more quickly than standard reinforcement learning techniques. This is partly because motion primitives encode prior domain knowledge and preempt the need to re-discover that knowledge during training. It is also because agents can leverage the labels to systematically ignore action classes that do not facilitate task objectives, thereby reducing the action space.
    Full interpretable machine learning in 2D with inline coordinates. (arXiv:2106.07568v2 [cs.LG] UPDATED)
    (2 min) This paper proposed a new methodology for machine learning in 2-dimensional space (2-D ML) in inline coordinates. It is a full machine learning approach that does not require to deal with n-dimensional data in n-dimensional space. It allows discovering n-D patterns in 2-D space without loss of n-D information using graph representations of n-D data in 2-D. Specifically, it can be done with the inline based coordinates in different modifications, including static and dynamic ones. The classification and regression algorithms based on these inline coordinates were introduced. A successful case study based on a benchmark data demonstrated the feasibility of the approach. This approach helps to consolidate further a whole new area of full 2-D machine learning as a promising ML methodology. It has advantages of abilities to involve actively the end-users into the discovering of models and their justification. Another advantage is providing interpretable ML models.
    Imputation-Free Learning from Incomplete Observations. (arXiv:2107.01983v1 [cs.LG])
    (2 min) Although recent works have developed methods that can generate estimations (or imputations) of the missing entries in a dataset to facilitate downstream analysis, most depend on assumptions that may not align with real-world applications and could suffer from poor performance in subsequent tasks. This is particularly true if the data have large missingness rates or a small population. More importantly, the imputation error could be propagated into the prediction step that follows, causing the gradients used to train the prediction models to be biased. Consequently, in this work, we introduce the importance guided stochastic gradient descent (IGSGD) method to train multilayer perceptrons (MLPs) and long short-term memories (LSTMs) to directly perform inference from inputs containing missing values without imputation. Specifically, we employ reinforcement learning (RL) to adjust the gradients used to train the models via back-propagation. This not only reduces bias but allows the model to exploit the underlying information behind missingness patterns. We test the proposed approach on real-world time-series (i.e., MIMIC-III), tabular data obtained from an eye clinic, and a standard dataset (i.e., MNIST), where our imputation-free predictions outperform the traditional two-step imputation-based predictions using state-of-the-art imputation methods.
    Polymorphic dynamic programming by algebraic shortcut fusion. (arXiv:2107.01752v1 [cs.DS])
    (2 min) Dynamic programming (DP) is a broadly applicable algorithmic design paradigm for the efficient, exact solution of otherwise intractable, combinatorial problems. However, the design of such algorithms is often presented informally in an ad-hoc manner, and as a result is often difficult to apply correctly. In this paper, we present a rigorous algebraic formalism for systematically deriving novel DP algorithms, either from existing DP algorithms or from simple functional recurrences. These derivations lead to algorithms which are provably correct and polymorphic over any semiring, which means that they can be applied to the full scope of combinatorial problems expressible in terms of semirings. This includes, for example: optimization, optimal probability and Viterbi decoding, probabilistic marginalization, logical inference, fuzzy sets, differentiable softmax, and relational and provenance queries. The approach, building on many ideas from the existing literature on constructive algorithmics, exploits generic properties of (semiring) polymorphic functions, tupling and formal sums (lifting), and algebraic simplifications arising from constraint algebras. We demonstrate the effectiveness of this formalism for some example applications arising in signal processing, bioinformatics and reliability engineering.
    High-Fidelity and Low-Latency Universal Neural Vocoder based on Multiband WaveRNN with Data-Driven Linear Prediction for Discrete Waveform Modeling. (arXiv:2105.09856v2 [cs.SD] UPDATED)
    (2 min) This paper presents a novel high-fidelity and low-latency universal neural vocoder framework based on multiband WaveRNN with data-driven linear prediction for discrete waveform modeling (MWDLP). MWDLP employs a coarse-fine bit WaveRNN architecture for 10-bit mu-law waveform modeling. A sparse gated recurrent unit with a relatively large size of hidden units is utilized, while the multiband modeling is deployed to achieve real-time low-latency usage. A novel technique for data-driven linear prediction (LP) with discrete waveform modeling is proposed, where the LP coefficients are estimated in a data-driven manner. Moreover, a novel loss function using short-time Fourier transform (STFT) for discrete waveform modeling with Gumbel approximation is also proposed. The experimental results demonstrate that the proposed MWDLP framework generates high-fidelity synthetic speech for seen and unseen speakers and/or language on 300 speakers training data including clean and noisy/reverberant conditions, where the number of training utterances is limited to 60 per speaker, while allowing for real-time low-latency processing using a single core of $\sim\!$ 2.1--2.7 GHz CPU with $\sim\!$ 0.57--0.64 real-time factor including input/output and feature extraction.
    Randomized Dimensionality Reduction for Facility Location and Single-Linkage Clustering. (arXiv:2107.01804v1 [cs.DS])
    (2 min) Random dimensionality reduction is a versatile tool for speeding up algorithms for high-dimensional problems. We study its application to two clustering problems: the facility location problem, and the single-linkage hierarchical clustering problem, which is equivalent to computing the minimum spanning tree. We show that if we project the input pointset $X$ onto a random $d = O(d_X)$-dimensional subspace (where $d_X$ is the doubling dimension of $X$), then the optimum facility location cost in the projected space approximates the original cost up to a constant factor. We show an analogous statement for minimum spanning tree, but with the dimension $d$ having an extra $\log \log n$ term and the approximation factor being arbitrarily close to $1$. Furthermore, we extend these results to approximating solutions instead of just their costs. Lastly, we provide experimental results to validate the quality of solutions and the speedup due to the dimensionality reduction. Unlike several previous papers studying this approach in the context of $k$-means and $k$-medians, our dimension bound does not depend on the number of clusters but only on the intrinsic dimensionality of $X$.
    Autoencoding Slow Representations for Semi-supervised Data Efficient Regression. (arXiv:2012.06279v2 [cs.LG] UPDATED)
    (2 min) The slowness principle is a concept inspired by the visual cortex of the brain. It postulates that the underlying generative factors of a quickly varying sensory signal change on a slower time scale. Unsupervised learning of intermediate representations utilizing abundant unlabeled sensory data can be leveraged to perform data-efficient supervised downstream regression. In this paper, we propose a general formulation of slowness for unsupervised representation learning adding a slowness regularization term to the estimate lower bound of the beta-VAE to encourage temporal similarity in observation and latent space. Within this framework we compare existing slowness regularization terms such as the L1 and L2 loss used in existing end-to-end methods, the SlowVAE and propose a new term based on Brownian motion. We empirically evaluate these slowness regularization terms with respect to their downstream task performance and data efficiency. We find that slow representations lead to equal or better downstream task performance and data efficiency in different experiment domains when compared to representations without slowness regularization. Finally, we discuss how the Frechet Inception Distance (FID), traditionally used to determine the generative capabilities of GANs, can serve as a measure to predict the performance of pre-trained Autoencoder model in a supervised downstream task and accelerate hyperparameter search.
    Learning to Relate Depth and Semantics for Unsupervised Domain Adaptation. (arXiv:2105.07830v2 [cs.CV] UPDATED)
    (2 min) We present an approach for encoding visual task relationships to improve model performance in an Unsupervised Domain Adaptation (UDA) setting. Semantic segmentation and monocular depth estimation are shown to be complementary tasks; in a multi-task learning setting, a proper encoding of their relationships can further improve performance on both tasks. Motivated by this observation, we propose a novel Cross-Task Relation Layer (CTRL), which encodes task dependencies between the semantic and depth predictions. To capture the cross-task relationships, we propose a neural network architecture that contains task-specific and cross-task refinement heads. Furthermore, we propose an Iterative Self-Learning (ISL) training scheme, which exploits semantic pseudo-labels to provide extra supervision on the target domain. We experimentally observe improvements in both tasks' performance because the complementary information present in these tasks is better captured. Specifically, we show that: (1) our approach improves performance on all tasks when they are complementary and mutually dependent; (2) the CTRL helps to improve both semantic segmentation and depth estimation tasks performance in the challenging UDA setting; (3) the proposed ISL training scheme further improves the semantic segmentation performance. The implementation is available at https://github.com/susaha/ctrl-uda.
    Exploring Fluent Query Reformulations with Text-to-Text Transformers and Reinforcement Learning. (arXiv:2012.10033v2 [cs.CL] UPDATED)
    (2 min) Query reformulation aims to alter noisy or ambiguous text sequences into coherent ones closer to natural language questions. This is to prevent errors from propagating in a client-facing pipeline and promote better communication with users. Besides, it is crucial to maintain performance in downstream environments like question answering when rephrased queries are given as input. We show that under the previous framework (AQA), attempts to alter RL algorithms do not bring significant benefits to either reward acquisition or sequence fluency. Instead, we leverage a query-reformulating text-to-text transformer (QRT5) and apply policy-based RL algorithms to further nudge this reformulator and obtain better answers downstream by generating reward-acquiring query trajectories. QRT5 shows better sample efficiency in RL to achieve the same level of QA performance as the previous approach. It can generate reformulations with more readability based on query well-formedness evaluations and can generalize to out-of-sample data. Our framework is demonstrated to be flexible, allowing reward signals to be sourced from different downstream environments such as intent classification.
    Discovering Interpretable Machine Learning Models in Parallel Coordinates. (arXiv:2106.07474v2 [cs.LG] UPDATED)
    (2 min) This paper contributes to interpretable machine learning via visual knowledge discovery in parallel coordinates. The concepts of hypercubes and hyper-blocks are used as easily understandable by end-users in the visual form in parallel coordinates. The Hyper algorithm for classification with mixed and pure hyper-blocks (HBs) is proposed to discover hyper-blocks interactively and automatically in individual, multiple, overlapping, and non-overlapping setting. The combination of hyper-blocks with linguistic description of visual patterns is presented too. It is shown that Hyper models generalize decision trees. The Hyper algorithm was tested on the benchmark data from UCI ML repository. It allowed discovering pure and mixed HBs with all data and then with 10-fold cross validation. The links between hyper-blocks, dimension reduction and visualization are established. Major benefits of hyper-block technology and the Hyper algorithm are in their ability to discover and observe hyper-blocks by end-users including side by side visualizations making patterns visible for all classes. Another advantage of sets of HBs relative to the decision trees is the ability to avoid both data overgeneralization and overfitting.
    VisualWordGrid: Information Extraction From Scanned Documents Using A Multimodal Approach. (arXiv:2010.02358v5 [cs.CV] UPDATED)
    (2 min) We introduce a novel approach for scanned document representation to perform field extraction. It allows the simultaneous encoding of the textual, visual and layout information in a 3-axis tensor used as an input to a segmentation model. We improve the recent Chargrid and Wordgrid \cite{chargrid} models in several ways, first by taking into account the visual modality, then by boosting its robustness in regards to small datasets while keeping the inference time low. Our approach is tested on public and private document-image datasets, showing higher performances compared to the recent state-of-the-art methods.
    Endo-Depth-and-Motion: Reconstruction and Tracking in Endoscopic Videos using Depth Networks and Photometric Constraints. (arXiv:2103.16525v2 [cs.CV] UPDATED)
    (2 min) Estimating a scene reconstruction and the camera motion from in-body videos is challenging due to several factors, e.g. the deformation of in-body cavities or the lack of texture. In this paper we present Endo-Depth-and-Motion, a pipeline that estimates the 6-degrees-of-freedom camera pose and dense 3D scene models from monocular endoscopic videos. Our approach leverages recent advances in self-supervised depth networks to generate pseudo-RGBD frames, then tracks the camera pose using photometric residuals and fuses the registered depth maps in a volumetric representation. We present an extensive experimental evaluation in the public dataset Hamlyn, showing high-quality results and comparisons against relevant baselines. We also release all models and code for future comparisons.
    S4RL: Surprisingly Simple Self-Supervision for Offline Reinforcement Learning. (arXiv:2103.06326v2 [cs.LG] UPDATED)
    (2 min) Offline reinforcement learning proposes to learn policies from large collected datasets without interacting with the physical environment. These algorithms have made it possible to learn useful skills from data that can then be deployed in the environment in real-world settings where interactions may be costly or dangerous, such as autonomous driving or factories. However, current algorithms overfit to the dataset they are trained on and exhibit poor out-of-distribution generalization to the environment when deployed. In this paper, we study the effectiveness of performing data augmentations on the state space, and study 7 different augmentation schemes and how they behave with existing offline RL algorithms. We then combine the best data performing augmentation scheme with a state-of-the-art Q-learning technique, and improve the function approximation of the Q-networks by smoothening out the learned state-action space. We experimentally show that using this Surprisingly Simple Self-Supervision technique in RL (S4RL), we significantly improve over the current state-of-the-art algorithms on offline robot learning environments such as MetaWorld [1] and RoboSuite [2,3], and benchmark datasets such as D4RL [4].
    A Tutorial on Sparse Gaussian Processes and Variational Inference. (arXiv:2012.13962v11 [cs.LG] UPDATED)
    (3 min) Gaussian processes (GPs) provide a framework for Bayesian inference that can offer principled uncertainty estimates for a large range of problems. For example, if we consider regression problems with Gaussian likelihoods, a GP model enjoys a posterior in closed form. However, identifying the posterior GP scales cubically with the number of training examples and requires to store all examples in memory. In order to overcome these obstacles, sparse GPs have been proposed that approximate the true posterior GP with pseudo-training examples. Importantly, the number of pseudo-training examples is user-defined and enables control over computational and memory complexity. In the general case, sparse GPs do not enjoy closed-form solutions and one has to resort to approximate inference. In this context, a convenient choice for approximate inference is variational inference (VI), where the problem of Bayesian inference is cast as an optimization problem -- namely, to maximize a lower bound of the log marginal likelihood. This paves the way for a powerful and versatile framework, where pseudo-training examples are treated as optimization arguments of the approximate posterior that are jointly identified together with hyperparameters of the generative model (i.e. prior and likelihood). The framework can naturally handle a wide scope of supervised learning problems, ranging from regression with heteroscedastic and non-Gaussian likelihoods to classification problems with discrete labels, but also multilabel problems. The purpose of this tutorial is to provide access to the basic matter for readers without prior knowledge in both GPs and VI. A proper exposition to the subject enables also access to more recent advances (like importance-weighted VI as well as interdomain, multioutput and deep GPs) that can serve as an inspiration for new research ideas.
    COVID-19 detection using deep convolutional neural networks and binary-differential-algorithm-based feature selection on X-ray images. (arXiv:2104.07279v3 [eess.IV] UPDATED)
    (2 min) The new Coronavirus is spreading rapidly, and it has taken the lives of many people so far. The virus has destructive effects on the human lung, and early detection is very important. Deep Convolution neural networks are such powerful tools in classifying images. Therefore, in this paper, a hybrid approach based on a deep network is presented. Feature vectors were extracted by applying a deep convolution neural network on the images, and useful features were selected by the binary differential meta-heuristic algorithm. These optimized features were given to the SVM classifier. A database consisting of three categories of images such as COVID-19, pneumonia, and healthy included in 1092 X-ray samples was considered. The proposed method achieved an accuracy of 99.43%, a sensitivity of 99.16%, and a specificity of 99.57%. Our results demonstrate that the suggested approach is better than recent studies on COVID-19 detection with X-ray images.
    On the Predictability of Pruning Across Scales. (arXiv:2006.10621v3 [cs.LG] UPDATED)
    (2 min) We show that the error of iteratively magnitude-pruned networks empirically follows a scaling law with interpretable coefficients that depend on the architecture and task. We functionally approximate the error of the pruned networks, showing it is predictable in terms of an invariant tying width, depth, and pruning level, such that networks of vastly different pruned densities are interchangeable. We demonstrate the accuracy of this approximation over orders of magnitude in depth, width, dataset size, and density. We show that the functional form holds (generalizes) for large scale data (e.g., ImageNet) and architectures (e.g., ResNets). As neural networks become ever larger and costlier to train, our findings suggest a framework for reasoning conceptually and analytically about a standard method for unstructured pruning.
    Towards Rigorous Interpretations: a Formalisation of Feature Attribution. (arXiv:2104.12437v2 [cs.LG] UPDATED)
    (2 min) Feature attribution is often loosely presented as the process of selecting a subset of relevant features as a rationale of a prediction. Task-dependent by nature, precise definitions of "relevance" encountered in the literature are however not always consistent. This lack of clarity stems from the fact that we usually do not have access to any notion of ground-truth attribution and from a more general debate on what good interpretations are. In this paper we propose to formalise feature selection/attribution based on the concept of relaxed functional dependence. In particular, we extend our notions to the instance-wise setting and derive necessary properties for candidate selection solutions, while leaving room for task-dependence. By computing ground-truth attributions on synthetic datasets, we evaluate many state-of-the-art attribution methods and show that, even when optimised, some fail to verify the proposed properties and provide wrong solutions.
    Improper Reinforcement Learning with Gradient-based Policy Optimization. (arXiv:2102.08201v3 [cs.LG] UPDATED)
    (2 min) We consider an improper reinforcement learning setting where a learner is given $M$ base controllers for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially new controller that can outperform each of the base ones. This can be useful in tuning across controllers, learnt possibly in mismatched or simulated environments, to obtain a good controller for a given target environment with relatively few trials. \par We propose a gradient-based approach that operates over a class of improper mixtures of the controllers. We derive convergence rate guarantees for the approach assuming access to a gradient oracle. The value function of the mixture and its gradient may not be available in closed-form; however, we show that we can employ rollouts and simultaneous perturbation stochastic approximation (SPSA) for explicit gradient descent optimization. Numerical results on (i) the standard control theoretic benchmark of stabilizing an inverted pendulum and (ii) a constrained queueing task show that our improper policy optimization algorithm can stabilize the system even when the base policies at its disposal are unstable\footnote{Under review. Please do not distribute.}.
    A Survey of Data Augmentation Approaches for NLP. (arXiv:2105.03075v4 [cs.CL] UPDATED)
    (2 min) Data augmentation has recently seen increased interest in NLP due to more work in low-resource domains, new tasks, and the popularity of large-scale neural networks that require large amounts of training data. Despite this recent upsurge, this area is still relatively underexplored, perhaps due to the challenges posed by the discrete nature of language data. In this paper, we present a comprehensive and unifying survey of data augmentation for NLP by summarizing the literature in a structured manner. We first introduce and motivate data augmentation for NLP, and then discuss major methodologically representative approaches. Next, we highlight techniques that are used for popular NLP applications and tasks. We conclude by outlining current challenges and directions for future research. Overall, our paper aims to clarify the landscape of existing literature in data augmentation for NLP and motivate additional work in this area. We also present a GitHub repository with a paper list that will be continuously updated at https://github.com/styfeng/DataAug4NLP
    GraphXCOVID: Explainable Deep Graph Diffusion Pseudo-Labelling for Identifying COVID-19 on Chest X-rays. (arXiv:2010.00378v2 [cs.LG] UPDATED)
    (3 min) Can one learn to diagnose COVID-19 under extreme minimal supervision? Since the outbreak of the novel COVID-19 there has been a rush for developing Artificial Intelligence techniques for expert-level disease identification on Chest X-ray data. In particular, the use of deep supervised learning has become the go-to paradigm. However, the performance of such models is heavily dependent on the availability of a large and representative labelled dataset. The creation of which is a heavily expensive and time consuming task, and especially imposes a great challenge for a novel disease. Semi-supervised learning has shown the ability to match the incredible performance of supervised models whilst requiring a small fraction of the labelled examples. This makes the semi-supervised paradigm an attractive option for identifying COVID-19. In this work, we introduce a graph based deep semi-supervised framework for classifying COVID-19 from chest X-rays. Our framework introduces an optimisation model for graph diffusion that reinforces the natural relation among the tiny labelled set and the vast unlabelled data. We then connect the diffusion prediction output as pseudo-labels that are used in an iterative scheme in a deep net. We demonstrate, through our experiments, that our model is able to outperform the current leading supervised model with a tiny fraction of the labelled examples. Finally, we provide attention maps to accommodate the radiologist's mental model, better fitting their perceptual and cognitive abilities. These visualisation aims to assist the radiologist in judging whether the diagnostic is correct or not, and in consequence to accelerate the decision.
    GANDA: A deep generative adversarial network predicts the spatial distribution of nanoparticles in tumor pixelly. (arXiv:2012.12561v2 [eess.IV] UPDATED)
    (2 min) Intratumoral nanoparticles (NPs) distribution is critical for the success of nanomedicine in imaging and treatment, but computational models to describe the NPs distribution remain unavailable due to the complex tumor-nano interactions. Here, we develop a Generative Adversarial Network for Distribution Analysis (GANDA) to describe and conditionally generates the intratumoral quantum dots (QDs) distribution after i.v. injection. This deep generative model is trained automatically by 27 775 patches of tumor vessels and cell nuclei decomposed from whole-slide images of 4T1 breast cancer sections. The GANDA model can conditionally generate images of intratumoral QDs distribution under the constraint of given tumor vessels and cell nuclei channels with the same spatial resolution (pixels-to-pixels), minimal loss (mean squared error, MSE = 1.871) and excellent reliability (intraclass correlation, ICC = 0.94). Quantitative analysis of QDs extravasation distance (ICC = 0.95) and subarea distribution (ICC = 0.99) is allowed on the generated images without knowing the real QDs distribution. We believe this deep generative model may provide opportunities to investigate how influencing factors affect NPs distribution in individual tumors and guide nanomedicine optimization for molecular imaging and personalized treatment.
    Lottery Tickets in Linear Models: An Analysis of Iterative Magnitude Pruning. (arXiv:2007.08243v3 [cs.LG] UPDATED)
    (2 min) We analyse the pruning procedure behind the lottery ticket hypothesis arXiv:1803.03635v5, iterative magnitude pruning (IMP), when applied to linear models trained by gradient flow. We begin by presenting sufficient conditions on the statistical structure of the features under which IMP prunes those features that have smallest projection onto the data. Following this, we explore IMP as a method for sparse estimation.
    Zeroth-Order Supervised Policy Improvement. (arXiv:2006.06600v2 [cs.LG] UPDATED)
    (2 min) Policy gradient (PG) algorithms have been widely used in reinforcement learning (RL). However, PG algorithms rely on exploiting the value function being learned with the first-order update locally, which results in limited sample efficiency. In this work, we propose an alternative method called Zeroth-Order Supervised Policy Improvement (ZOSPI). ZOSPI exploits the estimated value function $Q$ globally while preserving the local exploitation of the PG methods based on zeroth-order policy optimization. This learning paradigm follows Q-learning but overcomes the difficulty of efficiently operating argmax in continuous action space. It finds max-valued action within a small number of samples. The policy learning of ZOSPI has two steps: First, it samples actions and evaluates those actions with a learned value estimator, and then it learns to perform the action with the highest value through supervised learning. We further demonstrate such a supervised learning framework can learn multi-modal policies. Experiments show that ZOSPI achieves competitive results on the continuous control benchmarks with a remarkable sample efficiency.
    Langevin Monte Carlo: random coordinate descent and variance reduction. (arXiv:2007.14209v5 [stat.ML] UPDATED)
    (2 min) Sampling from a log-concave distribution function on $\mathbb{R}^d$ (with $d\gg 1$) is a popular problem that has wide applications. In this paper we study the application of random coordinate descent method (RCD) on the Langevin Monte Carlo (LMC) sampling method, and we find two sides of the theory: 1. The direct application of RCD on LMC does reduce the number of finite differencing approximations per iteration, but it induces a large variance error term. More iterations are then needed, and ultimately the method gains no computational advantage; 2. When variance reduction techniques (such as SAGA and SVRG) are incorporated in RCD-LMC, the variance error term is reduced. The new methods, compared to the vanilla LMC, reduce the total computational cost by $d$ folds, and achieve the optimal cost rate. We perform our investigations in both overdamped and underdamped settings.
    Block-Term Tensor Decomposition Model Selection and Computation: The Bayesian Way. (arXiv:2101.02931v2 [stat.ME] UPDATED)
    (2 min) The so-called block-term decomposition (BTD) tensor model, especially in its rank-$(L_r,L_r,1)$ version, has been recently receiving increasing attention due to its enhanced ability of representing systems and signals that are composed of \emph{blocks} of rank higher than one, a scenario encountered in numerous and diverse applications. Uniqueness conditions and fitting methods have thus been thoroughly studied. Nevertheless, the challenging problem of estimating the BTD model structure, namely the number of block terms, $R$, and their individual ranks, $L_r$, has only recently started to attract significant attention, mainly through regularization-based approaches which entail the need to tune the regularization parameter(s). In this work, we build on ideas of sparse Bayesian learning (SBL) and put forward a fully automated Bayesian approach. Through a suitably crafted multi-level \emph{hierarchical} probabilistic model, which gives rise to heavy-tailed prior distributions for the BTD factors, structured sparsity is \emph{jointly} imposed. Ranks are then estimated from the numbers of blocks ($R$) and columns ($L_r$) of non-negligible energy. Approximate posterior inference is implemented, within the variational inference framework. The resulting iterative algorithm completely avoids hyperparameter tuning, which is a significant defect of regularization-based methods. Alternative probabilistic models are also explored and the connections with their regularization-based counterparts are brought to light with the aid of the associated maximum a-posteriori (MAP) estimators. We report simulation results with both synthetic and real-word data, which demonstrate the merits of the proposed method in terms of both rank estimation and model fitting as compared to state-of-the-art relevant methods.
    Reinforcement Learning and its Connections with Neuroscience and Psychology. (arXiv:2007.01099v4 [cs.LG] UPDATED)
    (2 min) Reinforcement learning methods have recently been very successful at performing complex sequential tasks like playing Atari games, Go and Poker. These algorithms have outperformed humans in several tasks by learning from scratch, using only scalar rewards obtained through interaction with their environment. While there certainly has been considerable independent innovation to produce such results, many core ideas in reinforcement learning are inspired by phenomena in animal learning, psychology and neuroscience. In this paper, we comprehensively review a large number of findings in both neuroscience and psychology that evidence reinforcement learning as a promising candidate for modeling learning and decision making in the brain. In doing so, we construct a mapping between various classes of modern RL algorithms and specific findings in both neurophysiological and behavioral literature. We then discuss the implications of this observed relationship between RL, neuroscience and psychology and its role in advancing research in both AI and brain science.
    Differentially Private False Discovery Rate Control. (arXiv:1807.04209v2 [math.ST] UPDATED)
    (2 min) Differential privacy provides a rigorous framework for privacy-preserving data analysis. This paper proposes the first differentially private procedure for controlling the false discovery rate (FDR) in multiple hypothesis testing. Inspired by the Benjamini-Hochberg procedure (BHq), our approach is to first repeatedly add noise to the logarithms of the $p$-values to ensure differential privacy and to select an approximately smallest $p$-value serving as a promising candidate at each iteration; the selected $p$-values are further supplied to the BHq and our private procedure releases only the rejected ones. Moreover, we develop a new technique that is based on a backward submartingale for proving FDR control of a broad class of multiple testing procedures, including our private procedure, and both the BHq step-up and step-down procedures. As a novel aspect, the proof works for arbitrary dependence between the true null and false null test statistics, while FDR control is maintained up to a small multiplicative factor.
    Multi-view Graph Learning by Joint Modeling of Consistency and Inconsistency. (arXiv:2008.10208v2 [cs.LG] UPDATED)
    (2 min) Graph learning has emerged as a promising technique for multi-view clustering with its ability to learn a unified and robust graph from multiple views. However, existing graph learning methods mostly focus on the multi-view consistency issue, yet often neglect the inconsistency across multiple views, which makes them vulnerable to possibly low-quality or noisy datasets. To overcome this limitation, we propose a new multi-view graph learning framework, which for the first time simultaneously and explicitly models multi-view consistency and multi-view inconsistency in a unified objective function, through which the consistent and inconsistent parts of each single-view graph as well as the unified graph that fuses the consistent parts can be iteratively learned. Though optimizing the objective function is NP-hard, we design a highly efficient optimization algorithm which is able to obtain an approximate solution with linear time complexity in the number of edges in the unified graph. Furthermore, our multi-view graph learning approach can be applied to both similarity graphs and dissimilarity graphs, which lead to two graph fusion-based variants in our framework. Experiments on twelve multi-view datasets have demonstrated the robustness and efficiency of the proposed approach.
    Re-Evaluating GermEval17 Using German Pre-Trained Language Models. (arXiv:2102.12330v2 [cs.CL] UPDATED)
    (2 min) The lack of a commonly used benchmark data set (collection) such as (Super-)GLUE (Wang et al., 2018, 2019) for the evaluation of non-English pre-trained language models is a severe shortcoming of current English-centric NLP-research. It concentrates a large part of the research on English, neglecting the uncertainty when transferring conclusions found for the English language to other languages. We evaluate the performance of the German and multilingual BERT-based models currently available via the huggingface transformers library on the four tasks of the GermEval17 workshop. We compare them to pre-BERT architectures (Wojatzki et al., 2017; Schmitt et al., 2018; Attia et al., 2018) as well as to an ELMo-based architecture (Biesialska et al., 2020) and a BERT-based approach (Guhr et al., 2020). The observed improvements are put in relation to those for similar tasks and similar models (pre-BERT vs. BERT-based) for the English language in order to draw tentative conclusions about whether the observed improvements are transferable to German or potentially other related languages.
    MaxVA: Fast Adaptation of Step Sizes by Maximizing Observed Variance of Gradients. (arXiv:2006.11918v4 [cs.LG] UPDATED)
    (2 min) Adaptive gradient methods such as RMSProp and Adam use exponential moving estimate of the squared gradient to compute adaptive step sizes, achieving better convergence than SGD in face of noisy objectives. However, Adam can have undesirable convergence behaviors due to unstable or extreme adaptive learning rates. Methods such as AMSGrad and AdaBound have been proposed to stabilize the adaptive learning rates of Adam in the later stage of training, but they do not outperform Adam in some practical tasks such as training Transformers \cite{transformer}. In this paper, we propose an adaptive learning rate principle, in which the running mean of squared gradient in Adam is replaced by a weighted mean, with weights chosen to maximize the estimated variance of each coordinate. This results in a faster adaptation to the local gradient variance, which leads to more desirable empirical convergence behaviors than Adam. We prove the proposed algorithm converges under mild assumptions for nonconvex stochastic optimization problems, and demonstrate the improved efficacy of our adaptive averaging approach on machine translation, natural language understanding and large-batch pretraining of BERT. The code is available at https://github.com/zhuchen03/MaxVA.
    XOmiVAE: an interpretable deep learning model for cancer classification using high-dimensional omics data. (arXiv:2105.12807v2 [q-bio.GN] UPDATED)
    (2 min) The lack of explainability is one of the most prominent disadvantages of deep learning applications in omics. This "black box" problem can undermine the credibility and limit the practical implementation of biomedical deep learning models. Here we present XOmiVAE, a variational autoencoder (VAE) based interpretable deep learning model for cancer classification using high-dimensional omics data. XOmiVAE is capable of revealing the contribution of each gene and latent dimension for each classification prediction, and the correlation between each gene and each latent dimension. It is also demonstrated that XOmiVAE can explain not only the supervised classification but the unsupervised clustering results from the deep learning network. To the best of our knowledge, XOmiVAE is one of the first activation level-based interpretable deep learning models explaining novel clusters generated by VAE. The explainable results generated by XOmiVAE were validated by both the performance of downstream tasks and the biomedical knowledge. In our experiments, XOmiVAE explanations of deep learning based cancer classification and clustering aligned with current domain knowledge including biological annotation and academic literature, which shows great potential for novel biomedical knowledge discovery from deep learning models.
    Knodle: Modular Weakly Supervised Learning with PyTorch. (arXiv:2104.11557v3 [cs.LG] UPDATED)
    (2 min) Strategies for improving the training and prediction quality of weakly supervised machine learning models vary in how much they are tailored to a specific task or integrated with a specific model architecture. In this work, we introduce Knodle, a software framework that treats weak data annotations, deep learning models, and methods for improving weakly supervised training as separate, modular components. This modularization gives the training process access to fine-grained information such as data set characteristics, matches of heuristic rules, or elements of the deep learning model ultimately used for prediction. Hence, our framework can encompass a wide range of training methods for improving weak supervision, ranging from methods that only look at correlations of rules and output classes (independently of the machine learning model trained with the resulting labels), to those that harness the interplay of neural networks and weakly labeled data. We illustrate the benchmarking potential of the framework with a performance comparison of several reference implementations on a selection of datasets that are already available in Knodle. The framework is published as an open-source Python package knodle and available at https://github.com/knodle/knodle.
    How Implicit Regularization of ReLU Neural Networks Characterizes the Learned Function -- Part I: the 1-D Case of Two Layers with Random First Layer. (arXiv:1911.02903v3 [cs.LG] UPDATED)
    (2 min) Today, various forms of neural networks are trained to perform approximation tasks in many fields. However, the estimates obtained are not fully understood on function space. Empirical results suggest that typical training algorithms favor regularized solutions. These observations motivate us to analyze properties of the neural networks found by gradient descent initialized close to zero, that is frequently employed to perform the training task. As a starting point, we consider one dimensional (shallow) ReLU neural networks in which weights are chosen randomly and only the terminal layer is trained. First, we rigorously show that for such networks ridge regularized regression corresponds in function space to regularizing the estimate's second derivative for fairly general loss functionals. For least squares regression, we show that the trained network converges to the smooth spline interpolation of the training data as the number of hidden nodes tends to infinity. Moreover, we derive a correspondence between the early stopped gradient descent and the smoothing spline regression. Our analysis might give valuable insight on the properties of the solutions obtained using gradient descent methods in general settings.
    Optimizing the Numbers of Queries and Replies in Federated Learning with Differential Privacy. (arXiv:2107.01895v1 [cs.LG])
    (2 min) Federated learning (FL) empowers distributed clients to collaboratively train a shared machine learning model through exchanging parameter information. Despite the fact that FL can protect clients' raw data, malicious users can still crack original data with disclosed parameters. To amend this flaw, differential privacy (DP) is incorporated into FL clients to disturb original parameters, which however can significantly impair the accuracy of the trained model. In this work, we study a crucial question which has been vastly overlooked by existing works: what are the optimal numbers of queries and replies in FL with DP so that the final model accuracy is maximized. In FL, the parameter server (PS) needs to query participating clients for multiple global iterations to complete training. Each client responds a query from the PS by conducting a local iteration. Our work investigates how many times the PS should query clients and how many times each client should reply the PS. We investigate two most extensively used DP mechanisms (i.e., the Laplace mechanism and Gaussian mechanisms). Through conducting convergence rate analysis, we can determine the optimal numbers of queries and replies in FL with DP so that the final model accuracy can be maximized. Finally, extensive experiments are conducted with publicly available datasets: MNIST and FEMNIST, to verify our analysis and the results demonstrate that properly setting the numbers of queries and replies can significantly improve the final model accuracy in FL with DP.
    WoodScape: A multi-task, multi-camera fisheye dataset for autonomous driving. (arXiv:1905.01489v3 [cs.CV] UPDATED)
    (2 min) Fisheye cameras are commonly employed for obtaining a large field of view in surveillance, augmented reality and in particular automotive applications. In spite of their prevalence, there are few public datasets for detailed evaluation of computer vision algorithms on fisheye images. We release the first extensive fisheye automotive dataset, WoodScape, named after Robert Wood who invented the fisheye camera in 1906. WoodScape comprises of four surround view cameras and nine tasks including segmentation, depth estimation, 3D bounding box detection and soiling detection. Semantic annotation of 40 classes at the instance level is provided for over 10,000 images and annotation for other tasks are provided for over 100,000 images. With WoodScape, we would like to encourage the community to adapt computer vision models for fisheye camera instead of using naive rectification.
    All Local Minima are Global for Two-Layer ReLU Neural Networks: The Hidden Convex Optimization Landscape. (arXiv:2006.05900v2 [cs.LG] UPDATED)
    (2 min) We prove that finding all globally optimal two-layer ReLU neural networks can be performed by solving a convex optimization program with cone constraints. Our analysis is novel, characterizes all optimal solutions, and does not leverage duality-based analysis which was recently used to lift neural network training into convex spaces. Given the set of solutions of our convex optimization program, we show how to construct exactly the entire set of optimal neural networks. We provide a detailed characterization of this optimal set and its invariant transformations. As additional consequences of our convex perspective, (i) we establish that Clarke stationary points found by stochastic gradient descent correspond to the global optimum of a subsampled convex problem (ii) we provide a polynomial-time algorithm for checking if a neural network is a global minimum of the training loss (iii) we provide an explicit construction of a continuous path between any neural network and the global minimum of its sublevel set and (iv) characterize the minimal size of the hidden layer so that the neural network optimization landscape has no spurious valleys. Overall, we provide a rich framework for studying the landscape of neural network training loss through convexity.
    A Self-Supervised Gait Encoding Approach with Locality-Awareness for 3D Skeleton Based Person Re-Identification. (arXiv:2009.03671v3 [cs.CV] UPDATED)
    (2 min) Person re-identification (Re-ID) via gait features within 3D skeleton sequences is a newly-emerging topic with several advantages. Existing solutions either rely on hand-crafted descriptors or supervised gait representation learning. This paper proposes a self-supervised gait encoding approach that can leverage unlabeled skeleton data to learn gait representations for person Re-ID. Specifically, we first create self-supervision by learning to reconstruct unlabeled skeleton sequences reversely, which involves richer high-level semantics to obtain better gait representations. Other pretext tasks are also explored to further improve self-supervised learning. Second, inspired by the fact that motion's continuity endows adjacent skeletons in one skeleton sequence and temporally consecutive skeleton sequences with higher correlations (referred as locality in 3D skeleton data), we propose a locality-aware attention mechanism and a locality-aware contrastive learning scheme, which aim to preserve locality-awareness on intra-sequence level and inter-sequence level respectively during self-supervised learning. Last, with context vectors learned by our locality-aware attention mechanism and contrastive learning scheme, a novel feature named Constrastive Attention-based Gait Encodings (CAGEs) is designed to represent gait effectively. Empirical evaluations show that our approach significantly outperforms skeleton-based counterparts by 15-40% Rank-1 accuracy, and it even achieves superior performance to numerous multi-modal methods with extra RGB or depth information. Our codes are available at https://github.com/Kali-Hac/Locality-Awareness-SGE.
    Android Malware Category and Family Detection and Identification using Machine Learning. (arXiv:2107.01927v1 [cs.CR])
    (2 min) Android malware is one of the most dangerous threats on the internet, and it's been on the rise for several years. Despite significant efforts in detecting and classifying android malware from innocuous android applications, there is still a long way to go. As a result, there is a need to provide a basic understanding of the behavior displayed by the most common Android malware categories and families. Each Android malware family and category has a distinct objective. As a result, it has impacted every corporate area, including healthcare, banking, transportation, government, and e-commerce. In this paper, we presented two machine-learning approaches for Dynamic Analysis of Android Malware: one for detecting and identifying Android Malware Categories and the other for detecting and identifying Android Malware Families, which was accomplished by analyzing a massive malware dataset with 14 prominent malware categories and 180 prominent malware families of CCCS-CIC-AndMal2020 dataset on Dynamic Layers. Our approach achieves in Android Malware Category detection more than 96 % accurate and achieves in Android Malware Family detection more than 99% accurate. Our approach provides a method for high-accuracy Dynamic Analysis of Android Malware while also shortening the time required to analyze smartphone malware.
    Feature Purification: How Adversarial Training Performs Robust Deep Learning. (arXiv:2005.10190v3 [cs.LG] UPDATED)
    (3 min) Despite the empirical success of using Adversarial Training to defend deep learning models against adversarial perturbations, so far, it still remains rather unclear what the principles are behind the existence of adversarial perturbations, and what adversarial training does to the neural network to remove them. In this paper, we present a principle that we call Feature Purification, where we show one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network; and more importantly, one of the goals of adversarial training is to remove such mixtures to purify hidden weights. We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle. Technically, we give, to the best of our knowledge, the first result proving that the following two can hold simultaneously for training a neural network with ReLU activation. (1) Training over the original data is indeed non-robust to small adversarial perturbations of some radius. (2) Adversarial training, even with an empirical perturbation algorithm such as FGM, can in fact be provably robust against ANY perturbations of the same radius. Finally, we also prove a complexity lower bound, showing that low complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, CANNOT defend against perturbations of this same radius, no matter what algorithms are used to train them.
    Distilling Knowledge from Ensembles of Acoustic Models for Joint CTC-Attention End-to-End Speech Recognition. (arXiv:2005.09310v3 [cs.LG] UPDATED)
    (2 min) Knowledge distillation has been widely used to compress existing deep learning models while preserving the performance on a wide range of applications. In the specific context of Automatic Speech Recognition (ASR), distillation from ensembles of acoustic models has recently shown promising results in increasing recognition performance. In this paper, we propose an extension of multi-teacher distillation methods to joint CTC-attention end-to-end ASR systems. We also introduce three novel distillation strategies. The core intuition behind them is to integrate the error rate metric to the teacher selection rather than solely focusing on the observed losses. In this way, we directly distill and optimize the student toward the relevant metric for speech recognition. We evaluate these strategies under a selection of training procedures on different datasets (TIMIT, Librispeech, Common Voice) and various languages (English, French, Italian). In particular, state-of-the-art error rates are reported on the Common Voice French, Italian and TIMIT datasets.
    KAISA: An Adaptive Second-order Optimizer Framework for Deep Neural Networks. (arXiv:2107.01739v1 [cs.LG])
    (2 min) Kronecker-factored Approximate Curvature (K-FAC) has recently been shown to converge faster in deep neural network (DNN) training than stochastic gradient descent (SGD); however, K-FAC's larger memory footprint hinders its applicability to large models. We present KAISA, a K-FAC-enabled, Adaptable, Improved, and ScAlable second-order optimizer framework that adapts the memory footprint, communication, and computation given specific models and hardware to achieve maximized performance and enhanced scalability. We quantify the tradeoffs between memory and communication cost and evaluate KAISA on large models, including ResNet-50, Mask R-CNN, U-Net, and BERT, on up to 128 NVIDIA A100 GPUs. Compared to the original optimizers, KAISA converges 18.1-36.3% faster across applications with the same global batch size. Under a fixed memory budget, KAISA converges 32.5% and 41.6% faster in ResNet-50 and BERT-Large, respectively. KAISA can balance memory and communication to achieve scaling efficiency equal to or better than the baseline optimizers.
    Taxonomy of Saliency Metrics for Channel Pruning. (arXiv:1906.04675v2 [cs.LG] UPDATED)
    (2 min) Pruning unimportant parameters can allow deep neural networks (DNNs) to reduce their heavy computation and memory requirements. A saliency metric estimates which parameters can be safely pruned with little impact on the classification performance of the DNN. Many saliency metrics have been proposed, each within the context of a wider pruning algorithm. The result is that it is difficult to separate the effectiveness of the saliency metric from the wider pruning algorithm that surrounds it. Similar-looking saliency metrics can yield very different results because of apparently minor design choices. We propose a taxonomy of saliency metrics based on four mostly-orthogonal principal components. We show that a broad range of metrics from the pruning literature can be grouped according to these components. Our taxonomy not only serves as a guide to prior work, but allows us to construct new saliency metrics by exploring novel combinations of our taxonomic components. We perform an in-depth experimental investigation of more than 300 saliency metrics. Our results provide decisive answers to open research questions, and demonstrate the importance of reduction and scaling when pruning groups of weights. We find that some of our constructed metrics can outperform the best existing state-of-the-art metrics for convolutional neural network channel pruning.
    Fast Rate Learning in Stochastic First Price Bidding. (arXiv:2107.01835v1 [cs.LG])
    (2 min) First-price auctions have largely replaced traditional bidding approaches based on Vickrey auctions in programmatic advertising. As far as learning is concerned, first-price auctions are more challenging because the optimal bidding strategy does not only depend on the value of the item but also requires some knowledge of the other bids. They have already given rise to several works in sequential learning, many of which consider models for which the value of the buyer or the opponents' maximal bid is chosen in an adversarial manner. Even in the simplest settings, this gives rise to algorithms whose regret grows as $\sqrt{T}$ with respect to the time horizon $T$. Focusing on the case where the buyer plays against a stationary stochastic environment, we show how to achieve significantly lower regret: when the opponents' maximal bid distribution is known we provide an algorithm whose regret can be as low as $\log^2(T)$; in the case where the distribution must be learnt sequentially, a generalization of this algorithm can achieve $T^{1/3+ \epsilon}$ regret, for any $\epsilon>0$. To obtain these results, we introduce two novel ideas that can be of interest in their own right. First, by transposing results obtained in the posted price setting, we provide conditions under which the first-price biding utility is locally quadratic around its optimum. Second, we leverage the observation that, on small sub-intervals, the concentration of the variations of the empirical distribution function may be controlled more accurately than by using the classical Dvoretzky-Kiefer-Wolfowitz inequality. Numerical simulations confirm that our algorithms converge much faster than alternatives proposed in the literature for various bid distributions, including for bids collected on an actual programmatic advertising platform.
    Federated Multi-Mini-Batch: An Efficient Training Approach to Federated Learning in Non-IID Environments. (arXiv:2011.07006v2 [cs.LG] UPDATED)
    (2 min) Federated learning has faced performance and network communication challenges, especially in the environments where the data is not independent and identically distributed (IID) across the clients. To address the former challenge, we introduce the federated-centralized concordance property and show that the federated single-mini-batch training approach can achieve comparable performance as the corresponding centralized training in the Non-IID environments. To deal with the latter, we present the federated multi-mini-batch approach and illustrate that it can establish a trade-off between the performance and communication efficiency and outperforms federated averaging in the Non-IID settings.
    Causally Invariant Predictor with Shift-Robustness. (arXiv:2107.01876v1 [stat.ML])
    (2 min) This paper proposes an invariant causal predictor that is robust to distribution shift across domains and maximally reserves the transferable invariant information. Based on a disentangled causal factorization, we formulate the distribution shift as soft interventions in the system, which covers a wide range of cases for distribution shift as we do not make prior specifications on the causal structure or the intervened variables. Instead of imposing regularizations to constrain the invariance of the predictor, we propose to predict by the intervened conditional expectation based on the do-operator and then prove that it is invariant across domains. More importantly, we prove that the proposed predictor is the robust predictor that minimizes the worst-case quadratic loss among the distributions of all domains. For empirical learning, we propose an intuitive and flexible estimating method based on data regeneration and present a local causal discovery procedure to guide the regeneration step. The key idea is to regenerate data such that the regenerated distribution is compatible with the intervened graph, which allows us to incorporate standard supervised learning methods with the regenerated data. Experimental results on both synthetic and real data demonstrate the efficacy of our predictor in improving the predictive accuracy and robustness across domains.
    S-TRIGGER: Continual State Representation Learning via Self-Triggered Generative Replay. (arXiv:1902.09434v2 [cs.LG] UPDATED)
    (2 min) We consider the problem of building a state representation model for control, in a continual learning setting. As the environment changes, the aim is to efficiently compress the sensory state's information without losing past knowledge, and then use Reinforcement Learning on the resulting features for efficient policy learning. To this end, we propose S-TRIGGER, a general method for Continual State Representation Learning applicable to Variational Auto-Encoders and its many variants. The method is based on Generative Replay, i.e. the use of generated samples to maintain past knowledge. It comes along with a statistically sound method for environment change detection, which self-triggers the Generative Replay. Our experiments on VAEs show that S-TRIGGER learns state representations that allows fast and high-performing Reinforcement Learning, while avoiding catastrophic forgetting. The resulting system is capable of autonomously learning new information without using past data and with a bounded system size. Code for our experiments is attached in Appendix.
    Detecting Concept Drift With Neural Network Model Uncertainty. (arXiv:2107.01873v1 [cs.LG])
    (2 min) Deployed machine learning models are confronted with the problem of changing data over time, a phenomenon also called concept drift. While existing approaches of concept drift detection already show convincing results, they require true labels as a prerequisite for successful drift detection. Especially in many real-world application scenarios-like the ones covered in this work-true labels are scarce, and their acquisition is expensive. Therefore, we introduce a new algorithm for drift detection, Uncertainty Drift Detection (UDD), which is able to detect drifts without access to true labels. Our approach is based on the uncertainty estimates provided by a deep neural network in combination with Monte Carlo Dropout. Structural changes over time are detected by applying the ADWIN technique on the uncertainty estimates, and detected drifts trigger a retraining of the prediction model. In contrast to input data-based drift detection, our approach considers the effects of the current input data on the properties of the prediction model rather than detecting change on the input data only (which can lead to unnecessary retrainings). We show that UDD outperforms other state-of-the-art strategies on two synthetic as well as ten real-world data sets for both regression and classification tasks.
    Provable Convergence of Nesterov Accelerated Method for Over-Parameterized Neural Networks. (arXiv:2107.01832v1 [cs.LG])
    (2 min) Despite the empirical success of deep learning, it still lacks theoretical understandings to explain why randomly initialized neural network trained by first-order optimization methods is able to achieve zero training loss, even though its landscape is non-convex and non-smooth. Recently, there are some works to demystifies this phenomenon under over-parameterized regime. In this work, we make further progress on this area by considering a commonly used momentum optimization algorithm: Nesterov accelerated method (NAG). We analyze the convergence of NAG for two-layer fully connected neural network with ReLU activation. Specifically, we prove that the error of NAG converges to zero at a linear convergence rate $1-\Theta(1/\sqrt{\kappa})$, where $\kappa > 1$ is determined by the initialization and the architecture of neural network. Comparing to the rate $1-\Theta(1/\kappa)$ of gradient descent, NAG achieves an acceleration. Besides, it also validates NAG and Heavy-ball method can achieve a similar convergence rate.
    Multiple-criteria Based Active Learning with Fixed-size Determinantal Point Processes. (arXiv:2107.01622v1 [cs.LG])
    (2 min) Active learning aims to achieve greater accuracy with less training data by selecting the most useful data samples from which it learns. Single-criterion based methods (i.e., informativeness and representativeness based methods) are simple and efficient; however, they lack adaptability to different real-world scenarios. In this paper, we introduce a multiple-criteria based active learning algorithm, which incorporates three complementary criteria, i.e., informativeness, representativeness and diversity, to make appropriate selections in the active learning rounds under different data types. We consider the selection process as a Determinantal Point Process, which good balance among these criteria. We refine the query selection strategy by both selecting the hardest unlabeled data sample and biasing towards the classifiers that are more suitable for the current data distribution. In addition, we also consider the dependencies and relationships between these data points in data selection by means of centroidbased clustering approaches. Through evaluations on synthetic and real-world datasets, we show that our method performs significantly better and is more stable than other multiple-criteria based AL algorithms.
    Randomized Neural Networks for Forecasting Time Series with Multiple Seasonality. (arXiv:2107.01705v1 [cs.LG])
    (2 min) This work contributes to the development of neural forecasting models with novel randomization-based learning methods. These methods improve the fitting abilities of the neural model, in comparison to the standard method, by generating network parameters in accordance with the data and target function features. A pattern-based representation of time series makes the proposed approach useful for forecasting time series with multiple seasonality. In the simulation study, we evaluate the performance of the proposed models and find that they can compete in terms of forecasting accuracy with fully-trained networks. Extremely fast and easy training, simple architecture, ease of implementation, high accuracy as well as dealing with nonstationarity and multiple seasonality in time series make the proposed model very attractive for a wide range of complex time series forecasting problems.
    An Explainable AI System for the Diagnosis of High Dimensional Biomedical Data. (arXiv:2107.01820v1 [cs.LG])
    (2 min) Typical state of the art flow cytometry data samples consists of measures of more than 100.000 cells in 10 or more features. AI systems are able to diagnose such data with almost the same accuracy as human experts. However, there is one central challenge in such systems: their decisions have far-reaching consequences for the health and life of people, and therefore, the decisions of AI systems need to be understandable and justifiable by humans. In this work, we present a novel explainable AI method, called ALPODS, which is able to classify (diagnose) cases based on clusters, i.e., subpopulations, in the high-dimensional data. ALPODS is able to explain its decisions in a form that is understandable for human experts. For the identified subpopulations, fuzzy reasoning rules expressed in the typical language of domain experts are generated. A visualization method based on these rules allows human experts to understand the reasoning used by the AI system. A comparison to a selection of state of the art explainable AI systems shows that ALPODS operates efficiently on known benchmark data and also on everyday routine case data.
    The Least Restriction for Offline Reinforcement Learning. (arXiv:2107.01757v1 [cs.LG])
    (2 min) Many practical applications of reinforcement learning (RL) constrain the agent to learn from a fixed offline dataset of logged interactions, which has already been gathered, without offering further possibility for data collection. However, commonly used off-policy RL algorithms, such as the Deep Q Network and the Deep Deterministic Policy Gradient, are incapable of learning without data correlated to the distribution under the current policy, making them ineffective for this offline setting. As the first step towards useful offline RL algorithms, we analysis the reason of instability in standard off-policy RL algorithms. It is due to the bootstrapping error. The key to avoiding this error, is ensuring that the agent's action space does not go out of the fixed offline dataset. Based on our consideration, a creative offline RL framework, the Least Restriction (LR), is proposed in this paper. The LR regards selecting an action as taking a sample from the probability distribution. It merely set a little limit for action selection, which not only avoid the action being out of the offline dataset but also remove all the unreasonable restrictions in earlier approaches (e.g. Batch-Constrained Deep Q-Learning). In the further, we will demonstrate that the LR, is able to learn robustly from different offline datasets, including random and suboptimal demonstrations, on a range of practical control tasks.
    Leveraging Evidential Deep Learning Uncertainties with Graph-based Clustering to Detect Anomalies. (arXiv:2107.01557v1 [cs.LG])
    (2 min) Understanding and representing traffic patterns are key to detecting anomalies in the maritime domain. To this end, we propose a novel graph-based traffic representation and association scheme to cluster trajectories of vessels using automatic identification system (AIS) data. We utilize the (un)clustered data to train a recurrent neural network (RNN)-based evidential regression model, which can predict a vessel's trajectory at future timesteps with its corresponding prediction uncertainty. This paper proposes the usage of a deep learning (DL)-based uncertainty estimation in detecting maritime anomalies, such as unusual vessel maneuvering. Furthermore, we utilize the evidential deep learning classifiers to detect unusual turns of vessels and the loss of AIS signal using predicted class probabilities with associated uncertainties. Our experimental results suggest that using graph-based clustered data improves the ability of the DL models to learn the temporal-spatial correlation of data and associated uncertainties. Using different AIS datasets and experiments, we demonstrate that the estimated prediction uncertainty yields fundamental information for the detection of traffic anomalies in the maritime and, possibly in other domains.
    A Comparison of the Delta Method and the Bootstrap in Deep Learning Classification. (arXiv:2107.01606v1 [cs.LG])
    (2 min) We validate the recently introduced deep learning classification adapted Delta method by a comparison with the classical Bootstrap. We show that there is a strong linear relationship between the quantified predictive epistemic uncertainty levels obtained from the two methods when applied on two LeNet-based neural network classifiers using the MNIST and CIFAR-10 datasets. Furthermore, we demonstrate that the Delta method offers a five times computation time reduction compared to the Bootstrap.
    An Information-Theoretic Approach for Automatically Determining the Number of States when Aggregating Markov Chains. (arXiv:2107.01799v1 [cs.IT])
    (2 min) A fundamental problem when aggregating Markov chains is the specification of the number of state groups. Too few state groups may fail to sufficiently capture the pertinent dynamics of the original, high-order Markov chain. Too many state groups may lead to a non-parsimonious, reduced-order Markov chain whose complexity rivals that of the original. In this paper, we show that an augmented value-of-information-based approach to aggregating Markov chains facilitates the determination of the number of state groups. The optimal state-group count coincides with the case where the complexity of the reduced-order chain is balanced against the mutual dependence between the original- and reduced-order chain dynamics.
    ARM-Net: Adaptive Relation Modeling Network for Structured Data. (arXiv:2107.01830v1 [cs.LG])
    (2 min) Relational databases are the de facto standard for storing and querying structured data, and extracting insights from structured data requires advanced analytics. Deep neural networks (DNNs) have achieved super-human prediction performance in particular data types, e.g., images. However, existing DNNs may not produce meaningful results when applied to structured data. The reason is that there are correlations and dependencies across combinations of attribute values in a table, and these do not follow simple additive patterns that can be easily mimicked by a DNN. The number of possible such cross features is combinatorial, making them computationally prohibitive to model. Furthermore, the deployment of learning models in real-world applications has also highlighted the need for interpretability, especially for high-stakes applications, which remains another issue of concern to DNNs. In this paper, we present ARM-Net, an adaptive relation modeling network tailored for structured data, and a lightweight framework ARMOR based on ARM-Net for relational data analytics. The key idea is to model feature interactions with cross features selectively and dynamically, by first transforming the input features into exponential space, and then determining the interaction order and interaction weights adaptively for each cross feature. We propose a novel sparse attention mechanism to dynamically generate the interaction weights given the input tuple, so that we can explicitly model cross features of arbitrary orders with noisy features filtered selectively. Then during model inference, ARM-Net can specify the cross features being used for each prediction for higher accuracy and better interpretability. Our extensive experiments on real-world datasets demonstrate that ARM-Net consistently outperforms existing models and provides more interpretable predictions for data-driven decision making.
    Learning a Model for Inferring a Spatial Road Lane Network Graph using Self-Supervision. (arXiv:2107.01784v1 [cs.CV])
    (2 min) Interconnected road lanes are a central concept for navigating urban roads. Currently, most autonomous vehicles rely on preconstructed lane maps as designing an algorithmic model is difficult. However, the generation and maintenance of such maps is costly and hinders large-scale adoption of autonomous vehicle technology. This paper presents the first self-supervised learning method to train a model to infer a spatially grounded lane-level road network graph based on a dense segmented representation of the road scene generated from onboard sensors. A formal road lane network model is presented and proves that any structured road scene can be represented by a directed acyclic graph of at most depth three while retaining the notion of intersection regions, and that this is the most compressed representation. The formal model is implemented by a hybrid neural and search-based model, utilizing a novel barrier function loss formulation for robust learning from partial labels. Experiments are conducted for all common road intersection layouts. Results show that the model can generalize to new road layouts, unlike previous approaches, demonstrating its potential for real-world application as a practical learning-based lane-level map generator.
    Deep Gaussian Process Emulation using Stochastic Imputation. (arXiv:2107.01590v1 [stat.ML])
    (2 min) We propose a novel deep Gaussian process (DGP) inference method for computer model emulation using stochastic imputation. By stochastically imputing the latent layers, the approach transforms the DGP into the linked GP, a state-of-the-art surrogate model formed by linking a system of feed-forward coupled GPs. This transformation renders a simple while efficient DGP training procedure that only involves optimizations of conventional stationary GPs. In addition, the analytically tractable mean and variance of the linked GP allows one to implement predictions from DGP emulators in a fast and accurate manner. We demonstrate the method in a series of synthetic examples and real-world applications, and show that it is a competitive candidate for efficient DGP surrogate modeling in comparison to the variational inference and the fully-Bayesian approach. A $\texttt{Python}$ package $\texttt{dgpsi}$ implementing the method is also produced and available at https://github.com/mingdeyu/DGP.
    Robust Online Convex Optimization in the Presence of Outliers. (arXiv:2107.01881v1 [cs.LG])
    (2 min) We consider online convex optimization when a number k of data points are outliers that may be corrupted. We model this by introducing the notion of robust regret, which measures the regret only on rounds that are not outliers. The aim for the learner is to achieve small robust regret, without knowing where the outliers are. If the outliers are chosen adversarially, we show that a simple filtering strategy on extreme gradients incurs O(k) additive overhead compared to the usual regret bounds, and that this is unimprovable, which means that k needs to be sublinear in the number of rounds. We further ask which additional assumptions would allow for a linear number of outliers. It turns out that the usual benign cases of independently, identically distributed (i.i.d.) observations or strongly convex losses are not sufficient. However, combining i.i.d. observations with the assumption that outliers are those observations that are in an extreme quantile of the distribution, does lead to sublinear robust regret, even though the expected number of outliers is linear.
    Automated Recovery of Issue-Commit Links Leveraging Both Textual and Non-textual Data. (arXiv:2107.01894v1 [cs.SE])
    (2 min) An issue documents discussions around required changes in issue-tracking systems, while a commit contains the change itself in the version control systems. Recovering links between issues and commits can facilitate many software evolution tasks such as bug localization, and software documentation. A previous study on over half a million issues from GitHub reports only about 42.2% of issues are manually linked by developers to their pertinent commits. Automating the linking of commit-issue pairs can contribute to the improvement of the said tasks. By far, current state-of-the-art approaches for automated commit-issue linking suffer from low precision, leading to unreliable results, sometimes to the point that imposes human supervision on the predicted links. The low performance gets even more severe when there is a lack of textual information in either commits or issues. Current approaches are also proven computationally expensive. We propose Hybrid-Linker to overcome such limitations by exploiting two information channels; (1) a non-textual-based component that operates on non-textual, automatically recorded information of the commit-issue pairs to predict a link, and (2) a textual-based one which does the same using textual information of the commit-issue pairs. Then, combining the results from the two classifiers, Hybrid-Linker makes the final prediction. Thus, every time one component falls short in predicting a link, the other component fills the gap and improves the results. We evaluate Hybrid-Linker against competing approaches, namely FRLink and DeepLink on a dataset of 12 projects. Hybrid-Linker achieves 90.1%, 87.8%, and 88.9% based on recall, precision, and F-measure, respectively. It also outperforms FRLink and DeepLink by 31.3%, and 41.3%, regarding the F-measure. Moreover, Hybrid-Linker exhibits extensive improvements in terms of performance as well.
    The Role of "Live" in Livestreaming Markets: Evidence Using Orthogonal Random Forest. (arXiv:2107.01629v1 [stat.ML])
    (2 min) The common belief about the growing medium of livestreaming is that its value lies in its "live" component. In this paper, we leverage data from a large livestreaming platform to examine this belief. We are able to do this as this platform also allows viewers to purchase the recorded version of the livestream. We summarize the value of livestreaming content by estimating how demand responds to price before, on the day of, and after the livestream. We do this by proposing a generalized Orthogonal Random Forest framework. This framework allows us to estimate heterogeneous treatment effects in the presence of high-dimensional confounders whose relationships with the treatment policy (i.e., price) are complex but partially known. We find significant dynamics in the price elasticity of demand over the temporal distance to the scheduled livestreaming day and after. Specifically, demand gradually becomes less price sensitive over time to the livestreaming day and is inelastic on the livestreaming day. Over the post-livestream period, demand is still sensitive to price, but much less than the pre-livestream period. This indicates that the vlaue of livestreaming persists beyond the live component. Finally, we provide suggestive evidence for the likely mechanisms driving our results. These are quality uncertainty reduction for the patterns pre- and post-livestream and the potential of real-time interaction with the creator on the day of the livestream.
    NOTE: Solution for KDD-CUP 2021 WikiKG90M-LSC. (arXiv:2107.01892v1 [cs.IR])
    (2 min) WikiKG90M in KDD Cup 2021 is a large encyclopedic knowledge graph, which could benefit various downstream applications such as question answering and recommender systems. Participants are invited to complete the knowledge graph by predicting missing triplets. Recent representation learning methods have achieved great success on standard datasets like FB15k-237. Thus, we train the advanced algorithms in different domains to learn the triplets, including OTE, QuatE, RotatE and TransE. Significantly, we modified OTE into NOTE (short for Norm-OTE) for better performance. Besides, we use both the DeepWalk and the post-smoothing technique to capture the graph structure for supplementation. In addition to the representations, we also use various statistical probabilities among the head entities, the relations and the tail entities for the final prediction. Experimental results show that the ensemble of state-of-the-art representation learning methods could draw on each others strengths. And we develop feature engineering from validation candidates for further improvements. Please note that we apply the same strategy on the test set for final inference. And these features may not be practical in the real world when considering ranking against all the entities.
    Statistical Theory for Imbalanced Binary Classification. (arXiv:2107.01777v1 [math.ST])
    (2 min) Within the vast body of statistical theory developed for binary classification, few meaningful results exist for imbalanced classification, in which data are dominated by samples from one of the two classes. Existing theory faces at least two main challenges. First, meaningful results must consider more complex performance measures than classification accuracy. To address this, we characterize a novel generalization of the Bayes-optimal classifier to any performance metric computed from the confusion matrix, and we use this to show how relative performance guarantees can be obtained in terms of the error of estimating the class probability function under uniform ($\mathcal{L}_\infty$) loss. Second, as we show, optimal classification performance depends on certain properties of class imbalance that have not previously been formalized. Specifically, we propose a novel sub-type of class imbalance, which we call Uniform Class Imbalance. We analyze how Uniform Class Imbalance influences optimal classifier performance and show that it necessitates different classifier behavior than other types of class imbalance. We further illustrate these two contributions in the case of $k$-nearest neighbor classification, for which we develop novel guarantees. Together, these results provide some of the first meaningful finite-sample statistical theory for imbalanced binary classification.
    Why is Pruning at Initialization Immune to Reinitializing and Shuffling?. (arXiv:2107.01808v1 [cs.LG])
    (2 min) Recent studies assessing the efficacy of pruning neural networks methods uncovered a surprising finding: when conducting ablation studies on existing pruning-at-initialization methods, namely SNIP, GraSP, SynFlow, and magnitude pruning, performances of these methods remain unchanged and sometimes even improve when randomly shuffling the mask positions within each layer (Layerwise Shuffling) or sampling new initial weight values (Reinit), while keeping pruning masks the same. We attempt to understand the reason behind such network immunity towards weight/mask modifications, by studying layer-wise statistics before and after randomization operations. We found that under each of the pruning-at-initialization methods, the distribution of unpruned weights changed minimally with randomization operations.
    Certifiably Robust Interpretation via Renyi Differential Privacy. (arXiv:2107.01561v1 [cs.LG])
    (2 min) Motivated by the recent discovery that the interpretation maps of CNNs could easily be manipulated by adversarial attacks against network interpretability, we study the problem of interpretation robustness from a new perspective of \Renyi differential privacy (RDP). The advantages of our Renyi-Robust-Smooth (RDP-based interpretation method) are three-folds. First, it can offer provable and certifiable top-$k$ robustness. That is, the top-$k$ important attributions of the interpretation map are provably robust under any input perturbation with bounded $\ell_d$-norm (for any $d\geq 1$, including $d = \infty$). Second, our proposed method offers $\sim10\%$ better experimental robustness than existing approaches in terms of the top-$k$ attributions. Remarkably, the accuracy of Renyi-Robust-Smooth also outperforms existing approaches. Third, our method can provide a smooth tradeoff between robustness and computational efficiency. Experimentally, its top-$k$ attributions are {\em twice} more robust than existing approaches when the computational resources are highly constrained.
    Towards Scheduling Federated Deep Learning using Meta-Gradients for Inter-Hospital Learning. (arXiv:2107.01707v1 [cs.LG])
    (2 min) Given the abundance and ease of access of personal data today, individual privacy has become of paramount importance, particularly in the healthcare domain. In this work, we aim to utilise patient data extracted from multiple hospital data centres to train a machine learning model without sacrificing patient privacy. We develop a scheduling algorithm in conjunction with a student-teacher algorithm that is deployed in a federated manner. This allows a central model to learn from batches of data at each federal node. The teacher acts between data centres to update the main task (student) algorithm using the data that is stored in the various data centres. We show that the scheduler, trained using meta-gradients, can effectively organise training and as a result train a machine learning model on a diverse dataset without needing explicit access to the patient data. We achieve state-of-the-art performance and show how our method overcomes some of the problems faced in the federated learning such as node poisoning. We further show how the scheduler can be used as a mechanism for transfer learning, allowing different teachers to work together in training a student for state-of-the-art performance.
    Winning at Any Cost -- Infringing the Cartel Prohibition With Reinforcement Learning. (arXiv:2107.01856v1 [cs.AI])
    (2 min) Pricing decisions are increasingly made by AI. Thanks to their ability to train with live market data while making decisions on the fly, deep reinforcement learning algorithms are especially effective in taking such pricing decisions. In e-commerce scenarios, multiple reinforcement learning agents can set prices based on their competitor's prices. Therefore, research states that agents might end up in a state of collusion in the long run. To further analyze this issue, we build a scenario that is based on a modified version of a prisoner's dilemma where three agents play the game of rock paper scissors. Our results indicate that the action selection can be dissected into specific stages, establishing the possibility to develop collusion prevention systems that are able to recognize situations which might lead to a collusion between competitors. We furthermore provide evidence for a situation where agents are capable of performing a tacit cooperation strategy without being explicitly trained to do so.
    Detecting Faults during Automatic Screwdriving: A Dataset and Use Case of Anomaly Detection for Automatic Screwdriving. (arXiv:2107.01955v1 [cs.LG])
    (2 min) Detecting faults in manufacturing applications can be difficult, especially if each fault model is to be engineered by hand. Data-driven approaches, using Machine Learning (ML) for detecting faults have recently gained increasing interest, where a ML model can be trained on a set of data from a manufacturing process. In this paper, we present a use case of using ML models for detecting faults during automated screwdriving operations, and introduce a new dataset containing fully monitored and registered data from a Universal Robot and OnRobot screwdriver during both normal and anomalous operations. We illustrate, with the use of two time-series ML models, how to detect faults in an automated screwdriving application.
    Latent structure blockmodels for Bayesian spectral graph clustering. (arXiv:2107.01734v1 [stat.ML])
    (2 min) Spectral embedding of network adjacency matrices often produces node representations living approximately around low-dimensional submanifold structures. In particular, hidden substructure is expected to arise when the graph is generated from a latent position model. Furthermore, the presence of communities within the network might generate community-specific submanifold structures in the embedding, but this is not explicitly accounted for in most statistical models for networks. In this article, a class of models called latent structure block models (LSBM) is proposed to address such scenarios, allowing for graph clustering when community-specific one dimensional manifold structure is present. LSBMs focus on a specific class of latent space model, the random dot product graph (RDPG), and assign a latent submanifold to the latent positions of each community. A Bayesian model for the embeddings arising from LSBMs is discussed, and shown to have a good performance on simulated and real world network data. The model is able to correctly recover the underlying communities living in a one-dimensional manifold, even when the parametric form of the underlying curves is unknown, achieving remarkable results on a variety of real data.
    Machine Learning for Malware Evolution Detection. (arXiv:2107.01627v1 [cs.CR])
    (2 min) Malware evolves over time and antivirus must adapt to such evolution. Hence, it is critical to detect those points in time where malware has evolved so that appropriate countermeasures can be undertaken. In this research, we perform a variety of experiments on a significant number of malware families to determine when malware evolution is likely to have occurred. All of the evolution detection techniques that we consider are based on machine learning and can be fully automated -- in particular, no reverse engineering or other labor-intensive manual analysis is required. Specifically, we consider analysis based on hidden Markov models (HMM) and the word embedding techniques HMM2Vec and Word2Vec.
    Q-SpiNN: A Framework for Quantizing Spiking Neural Networks. (arXiv:2107.01807v1 [cs.NE])
    (2 min) A prominent technique for reducing the memory footprint of Spiking Neural Networks (SNNs) without decreasing the accuracy significantly is quantization. However, the state-of-the-art only focus on employing the weight quantization directly from a specific quantization scheme, i.e., either the post-training quantization (PTQ) or the in-training quantization (ITQ), and do not consider (1) quantizing other SNN parameters (e.g., neuron membrane potential), (2) exploring different combinations of quantization approaches (i.e., quantization schemes, precision levels, and rounding schemes), and (3) selecting the SNN model with a good memory-accuracy trade-off at the end. Therefore, the memory saving offered by these state-of-the-art to meet the targeted accuracy is limited, thereby hindering processing SNNs on the resource-constrained systems (e.g., the IoT-Edge devices). Towards this, we propose Q-SpiNN, a novel quantization framework for memory-efficient SNNs. The key mechanisms of the Q-SpiNN are: (1) employing quantization for different SNN parameters based on their significance to the accuracy, (2) exploring different combinations of quantization schemes, precision levels, and rounding schemes to find efficient SNN model candidates, and (3) developing an algorithm that quantifies the benefit of the memory-accuracy trade-off obtained by the candidates, and selects the Pareto-optimal one. The experimental results show that, for the unsupervised network, the Q-SpiNN reduces the memory footprint by ca. 4x, while maintaining the accuracy within 1% from the baseline on the MNIST dataset. For the supervised network, the Q-SpiNN reduces the memory by ca. 2x, while keeping the accuracy within 2% from the baseline on the DVS-Gesture dataset.
    DeepRapper: Neural Rap Generation with Rhyme and Rhythm Modeling. (arXiv:2107.01875v1 [cs.SD])
    (2 min) Rap generation, which aims to produce lyrics and corresponding singing beats, needs to model both rhymes and rhythms. Previous works for rap generation focused on rhyming lyrics but ignored rhythmic beats, which are important for rap performance. In this paper, we develop DeepRapper, a Transformer-based rap generation system that can model both rhymes and rhythms. Since there is no available rap dataset with rhythmic beats, we develop a data mining pipeline to collect a large-scale rap dataset, which includes a large number of rap songs with aligned lyrics and rhythmic beats. Second, we design a Transformer-based autoregressive language model which carefully models rhymes and rhythms. Specifically, we generate lyrics in the reverse order with rhyme representation and constraint for rhyme enhancement and insert a beat symbol into lyrics for rhythm/beat modeling. To our knowledge, DeepRapper is the first system to generate rap with both rhymes and rhythms. Both objective and subjective evaluations demonstrate that DeepRapper generates creative and high-quality raps with rhymes and rhythms. Code will be released on GitHub.
    Universal Approximation of Functions on Sets. (arXiv:2107.01959v1 [cs.LG])
    (2 min) Modelling functions of sets, or equivalently, permutation-invariant functions, is a long-standing challenge in machine learning. Deep Sets is a popular method which is known to be a universal approximator for continuous set functions. We provide a theoretical analysis of Deep Sets which shows that this universal approximation property is only guaranteed if the model's latent space is sufficiently high-dimensional. If the latent space is even one dimension lower than necessary, there exist piecewise-affine functions for which Deep Sets performs no better than a na\"ive constant baseline, as judged by worst-case error. Deep Sets may be viewed as the most efficient incarnation of the Janossy pooling paradigm. We identify this paradigm as encompassing most currently popular set-learning methods. Based on this connection, we discuss the implications of our results for set learning more broadly, and identify some open questions on the universality of Janossy pooling in general.
    Poisoning Attack against Estimating from Pairwise Comparisons. (arXiv:2107.01854v1 [cs.LG])
    (2 min) As pairwise ranking becomes broadly employed for elections, sports competitions, recommendations, and so on, attackers have strong motivation and incentives to manipulate the ranking list. They could inject malicious comparisons into the training data to fool the victim. Such a technique is called poisoning attack in regression and classification tasks. In this paper, to the best of our knowledge, we initiate the first systematic investigation of data poisoning attacks on pairwise ranking algorithms, which can be formalized as the dynamic and static games between the ranker and the attacker and can be modeled as certain kinds of integer programming problems. To break the computational hurdle of the underlying integer programming problems, we reformulate them into the distributionally robust optimization (DRO) problems, which are computationally tractable. Based on such DRO formulations, we propose two efficient poisoning attack algorithms and establish the associated theoretical guarantees. The effectiveness of the suggested poisoning attack strategies is demonstrated by a series of toy simulations and several real data experiments. These experimental results show that the proposed methods can significantly reduce the performance of the ranker in the sense that the correlation between the true ranking list and the aggregated results can be decreased dramatically.
    Constrained Motion Planning Networks X. (arXiv:2010.08707v2 [cs.RO] UPDATED)
    (2 min) Constrained motion planning is a challenging field of research, aiming for computationally efficient methods that can find a collision-free path on the constraint manifolds between a given start and goal configuration. These planning problems come up surprisingly frequently, such as in robot manipulation for performing daily life assistive tasks. However, few solutions to constrained motion planning are available, and those that exist struggle with high computational time complexity in finding a path solution on the manifolds. To address this challenge, we present Constrained Motion Planning Networks X (CoMPNetX). It is a neural planning approach, comprising a conditional deep neural generator and discriminator with neural gradients-based fast projection operator. We also introduce neural task and scene representations conditioned on which the CoMPNetX generates implicit manifold configurations to turbo-charge any underlying classical planner such as Sampling-based Motion Planning methods for quickly solving complex constrained planning tasks. We show that our method finds path solutions with high success rates and lower computation times than state-of-the-art traditional path-finding tools on various challenging scenarios.
    A similarity-based Bayesian mixture-of-experts model. (arXiv:2012.02130v3 [stat.ML] UPDATED)
    (2 min) We present a new nonparametric mixture-of-experts model for multivariate regression problems, inspired by the probabilistic $k$-nearest neighbors algorithm. Using a conditionally specified model, predictions for out-of-sample inputs are based on similarities to each observed data point, yielding predictive distributions represented by Gaussian mixtures. Posterior inference is performed on the parameters of the mixture components as well as the distance metric using a mean-field variational Bayes algorithm accompanied with a stochastic gradient-based optimization procedure. The proposed method is especially advantageous in settings where inputs are of relatively high dimension in comparison to the data size, where input--output relationships are complex, and where predictive distributions may be skewed or multimodal. Computational studies on two synthetic datasets and one dataset comprising dose statistics of radiation therapy treatment plans show that our mixture-of-experts method performs similarly or better than a conditional Dirichlet process mixture model both in terms of validation metrics and visual inspection.
    Adversarial Robustness of Probabilistic Network Embedding for Link Prediction. (arXiv:2107.01936v1 [cs.SI])
    (2 min) In today's networked society, many real-world problems can be formalized as predicting links in networks, such as Facebook friendship suggestions, e-commerce recommendations, and the prediction of scientific collaborations in citation networks. Increasingly often, link prediction problem is tackled by means of network embedding methods, owing to their state-of-the-art performance. However, these methods lack transparency when compared to simpler baselines, and as a result their robustness against adversarial attacks is a possible point of concern: could one or a few small adversarial modifications to the network have a large impact on the link prediction performance when using a network embedding model? Prior research has already investigated adversarial robustness for network embedding models, focused on classification at the node and graph level. Robustness with respect to the link prediction downstream task, on the other hand, has been explored much less. This paper contributes to filling this gap, by studying adversarial robustness of Conditional Network Embedding (CNE), a state-of-the-art probabilistic network embedding model, for link prediction. More specifically, given CNE and a network, we measure the sensitivity of the link predictions of the model to small adversarial perturbations of the network, namely changes of the link status of a node pair. Thus, our approach allows one to identify the links and non-links in the network that are most vulnerable to such perturbations, for further investigation by an analyst. We analyze the characteristics of the most and least sensitive perturbations, and empirically confirm that our approach not only succeeds in identifying the most vulnerable links and non-links, but also that it does so in a time-efficient manner thanks to an effective approximation.
    Adaptive calibration for binary classification. (arXiv:2107.01726v1 [cs.LG])
    (2 min) This note proposes a way of making probability forecasting rules less sensitive to changes in data distribution, concentrating on the simple case of binary classification. This is important in applications of machine learning, where the quality of a trained predictor may drop significantly in the process of its exploitation. Our techniques are based on recent work on conformal test martingales and older work on prediction with expert advice, namely tracking the best expert.
    Matching a Desired Causal State via Shift Interventions. (arXiv:2107.01850v1 [stat.ME])
    (2 min) Transforming a causal system from a given initial state to a desired target state is an important task permeating multiple fields including control theory, biology, and materials science. In causal models, such transformations can be achieved by performing a set of interventions. In this paper, we consider the problem of identifying a shift intervention that matches the desired mean of a system through active learning. We define the Markov equivalence class that is identifiable from shift interventions and propose two active learning strategies that are guaranteed to exactly match a desired mean. We then derive a worst-case lower bound for the number of interventions required and show that these strategies are optimal for certain classes of graphs. In particular, we show that our strategies may require exponentially fewer interventions than the previously considered approaches, which optimize for structure learning in the underlying causal graph. In line with our theoretical results, we also demonstrate experimentally that our proposed active learning strategies require fewer interventions compared to several baselines.
    A Theoretical Analysis of Fine-tuning with Linear Teachers. (arXiv:2107.01641v1 [cs.LG])
    (2 min) Fine-tuning is a common practice in deep learning, achieving excellent generalization results on downstream tasks using relatively little training data. Although widely used in practice, it is lacking strong theoretical understanding. We analyze the sample complexity of this scheme for regression with linear teachers in several architectures. Intuitively, the success of fine-tuning depends on the similarity between the source tasks and the target task, however measuring it is non trivial. We show that a relevant measure considers the relation between the source task, the target task and the covariance structure of the target data. In the setting of linear regression, we show that under realistic settings a substantial sample complexity reduction is plausible when the above measure is low. For deep linear regression, we present a novel result regarding the inductive bias of gradient-based training when the network is initialized with pretrained weights. Using this result we show that the similarity measure for this setting is also affected by the depth of the network. We further present results on shallow ReLU models, and analyze the dependence of sample complexity there on source and target tasks. We empirically demonstrate our results for both synthetic and realistic data.
    A Framework for Evaluating the Cybersecurity Risk of Real World, Machine Learning Production Systems. (arXiv:2107.01806v1 [cs.CR])
    (2 min) Although cyberattacks on machine learning (ML) production systems can be destructive, many industry practitioners are ill equipped, lacking tactical and strategic tools that would allow them to analyze, detect, protect against, and respond to cyberattacks targeting their ML-based systems. In this paper, we take a significant step toward securing ML production systems by integrating these systems and their vulnerabilities into cybersecurity risk assessment frameworks. Specifically, we performed a comprehensive threat analysis of ML production systems and developed an extension to the MulVAL attack graph generation and analysis framework to incorporate cyberattacks on ML production systems. Using the proposed extension, security practitioners can apply attack graph analysis methods in environments that include ML components, thus providing security experts with a practical tool for evaluating the impact and quantifying the risk of a cyberattack targeting an ML production system.
    Data-Driven Learning of Feedforward Neural Networks with Different Activation Functions. (arXiv:2107.01702v1 [cs.LG])
    (2 min) This work contributes to the development of a new data-driven method (D-DM) of feedforward neural networks (FNNs) learning. This method was proposed recently as a way of improving randomized learning of FNNs by adjusting the network parameters to the target function fluctuations. The method employs logistic sigmoid activation functions for hidden nodes. In this study, we introduce other activation functions, such as bipolar sigmoid, sine function, saturating linear functions, reLU, and softplus. We derive formulas for their parameters, i.e. weights and biases. In the simulation study, we evaluate the performance of FNN data-driven learning with different activation functions. The results indicate that the sigmoid activation functions perform much better than others in the approximation of complex, fluctuated target functions.
    Sibling Regression for Generalized Linear Models. (arXiv:2107.01338v1 [stat.ME])
    (2 min) Field observations form the basis of many scientific studies, especially in ecological and social sciences. Despite efforts to conduct such surveys in a standardized way, observations can be prone to systematic measurement errors. The removal of systematic variability introduced by the observation process, if possible, can greatly increase the value of this data. Existing non-parametric techniques for correcting such errors assume linear additive noise models. This leads to biased estimates when applied to generalized linear models (GLM). We present an approach based on residual functions to address this limitation. We then demonstrate its effectiveness on synthetic data and show it reduces systematic detection variability in moth surveys.
    On the Efficiency of Various Deep Transfer Learning Models in Glitch Waveform Detection in Gravitational-Wave Data. (arXiv:2107.01863v1 [gr-qc])
    (2 min) LIGO is considered the most sensitive and complicated gravitational experiment ever built. Its main objective is to detect the gravitational wave from the strongest events in the universe by observing if the length of its 4-kilometer arms change by a distance 10,000 times smaller than the diameter of a proton. Due to its sensitivity, LIGO is prone to the disturbance of external noises which affects the data being collected to detect the gravitational wave. These noises are commonly called by the LIGO community as glitches. The objective of this study is to evaluate the effeciency of various deep trasnfer learning models namely VGG19, ResNet50V2, VGG16 and ResNet101 to detect glitch waveform in gravitational wave data. The accuracy achieved by the said models are 98.98%, 98.35%, 97.56% and 94.73% respectively. Even though the models achieved fairly high accuracy, it is observed that all of the model suffered from the lack of data for certain classes which is the main concern found in the experiment
    Differentially Private Sliced Wasserstein Distance. (arXiv:2107.01848v1 [cs.LG])
    (2 min) Developing machine learning methods that are privacy preserving is today a central topic of research, with huge practical impacts. Among the numerous ways to address privacy-preserving learning, we here take the perspective of computing the divergences between distributions under the Differential Privacy (DP) framework -- being able to compute divergences between distributions is pivotal for many machine learning problems, such as learning generative models or domain adaptation problems. Instead of resorting to the popular gradient-based sanitization method for DP, we tackle the problem at its roots by focusing on the Sliced Wasserstein Distance and seamlessly making it differentially private. Our main contribution is as follows: we analyze the property of adding a Gaussian perturbation to the intrinsic randomized mechanism of the Sliced Wasserstein Distance, and we establish the sensitivityof the resulting differentially private mechanism. One of our important findings is that this DP mechanism transforms the Sliced Wasserstein distance into another distance, that we call the Smoothed Sliced Wasserstein Distance. This new differentially private distribution distance can be plugged into generative models and domain adaptation algorithms in a transparent way, and we empirically show that it yields highly competitive performance compared with gradient-based DP approaches from the literature, with almost no loss in accuracy for the domain adaptation problems that we consider.
    Physics-Guided Deep Learning for Dynamical Systems: A survey. (arXiv:2107.01272v1 [cs.LG])
    (2 min) Modeling complex physical dynamics is a fundamental task in science and engineering. Traditional physics-based models are interpretable but rely on rigid assumptions. And the direct numerical approximation is usually computationally intensive, requiring significant computational resources and expertise. While deep learning (DL) provides novel alternatives for efficiently recognizing complex patterns and emulating nonlinear dynamics, it does not necessarily obey the governing laws of physical systems, nor do they generalize well across different systems. Thus, the study of physics-guided DL emerged and has gained great progress. It aims to take the best from both physics-based modeling and state-of-the-art DL models to better solve scientific problems. In this paper, we provide a structured overview of existing methodologies of integrating prior physical knowledge or physics-based modeling into DL and discuss the emerging opportunities.
    Cross-Modal Transformer-Based Neural Correction Models for Automatic Speech Recognition. (arXiv:2107.01569v1 [cs.CL])
    (2 min) We propose a cross-modal transformer-based neural correction models that refines the output of an automatic speech recognition (ASR) system so as to exclude ASR errors. Generally, neural correction models are composed of encoder-decoder networks, which can directly model sequence-to-sequence mapping problems. The most successful method is to use both input speech and its ASR output text as the input contexts for the encoder-decoder networks. However, the conventional method cannot take into account the relationships between these two different modal inputs because the input contexts are separately encoded for each modal. To effectively leverage the correlated information between the two different modal inputs, our proposed models encode two different contexts jointly on the basis of cross-modal self-attention using a transformer. We expect that cross-modal self-attention can effectively capture the relationships between two different modals for refining ASR hypotheses. We also introduce a shallow fusion technique to efficiently integrate the first-pass ASR model and our proposed neural correction model. Experiments on Japanese natural language ASR tasks demonstrated that our proposed models achieve better ASR performance than conventional neural correction models.
    Autoencoder based Randomized Learning of Feedforward Neural Networks for Regression. (arXiv:2107.01711v1 [cs.LG])
    (2 min) Feedforward neural networks are widely used as universal predictive models to fit data distribution. Common gradient-based learning, however, suffers from many drawbacks making the training process ineffective and time-consuming. Alternative randomized learning does not use gradients but selects hidden node parameters randomly. This makes the training process extremely fast. However, the problem in randomized learning is how to determine the random parameters. A recently proposed method uses autoencoders for unsupervised parameter learning. This method showed superior performance on classification tasks. In this work, we apply this method to regression problems, and, finding that it has some drawbacks, we show how to improve it. We propose a learning method of autoencoders that controls the produced random weights. We also propose how to determine the biases of hidden nodes. We empirically compare autoencoder based learning with other randomized learning methods proposed recently for regression and find that despite the proposed improvement of the autoencoder based learning, it does not outperform its competitors in fitting accuracy. Moreover, the method is much more complex than its competitors.
    Average-Case Communication Complexity of Statistical Problems. (arXiv:2107.01335v1 [cs.CC])
    (2 min) We study statistical problems, such as planted clique, its variants, and sparse principal component analysis in the context of average-case communication complexity. Our motivation is to understand the statistical-computational trade-offs in streaming, sketching, and query-based models. Communication complexity is the main tool for proving lower bounds in these models, yet many prior results do not hold in an average-case setting. We provide a general reduction method that preserves the input distribution for problems involving a random graph or matrix with planted structure. Then, we derive two-party and multi-party communication lower bounds for detecting or finding planted cliques, bipartite cliques, and related problems. As a consequence, we obtain new bounds on the query complexity in the edge-probe, vector-matrix-vector, matrix-vector, linear sketching, and $\mathbb{F}_2$-sketching models. Many of these results are nearly tight, and we use our techniques to provide simple proofs of some known lower bounds for the edge-probe model.
    Visual Time Series Forecasting: An Image-driven Approach. (arXiv:2107.01273v1 [cs.CV])
    (2 min) In this work, we address time-series forecasting as a computer vision task. We capture input data as an image and train a model to produce the subsequent image. This approach results in predicting distributions as opposed to pointwise values. To assess the robustness and quality of our approach, we examine various datasets and multiple evaluation metrics. Our experiments show that our forecasting tool is effective for cyclic data but somewhat less for irregular data such as stock prices. Importantly, when using image-based evaluation metrics, we find our method to outperform various baselines, including ARIMA, and a numerical variation of our deep learning approach.
    Robust Restless Bandits: Tackling Interval Uncertainty with Deep Reinforcement Learning. (arXiv:2107.01689v1 [cs.LG])
    (2 min) We introduce Robust Restless Bandits, a challenging generalization of restless multi-arm bandits (RMAB). RMABs have been widely studied for intervention planning with limited resources. However, most works make the unrealistic assumption that the transition dynamics are known perfectly, restricting the applicability of existing methods to real-world scenarios. To make RMABs more useful in settings with uncertain dynamics: (i) We introduce the Robust RMAB problem and develop solutions for a minimax regret objective when transitions are given by interval uncertainties; (ii) We develop a double oracle algorithm for solving Robust RMABs and demonstrate its effectiveness on three experimental domains; (iii) To enable our double oracle approach, we introduce RMABPPO, a novel deep reinforcement learning algorithm for solving RMABs. RMABPPO hinges on learning an auxiliary "$\lambda$-network" that allows each arm's learning to decouple, greatly reducing sample complexity required for training; (iv) Under minimax regret, the adversary in the double oracle approach is notoriously difficult to implement due to non-stationarity. To address this, we formulate the adversary oracle as a multi-agent reinforcement learning problem and solve it with a multi-agent extension of RMABPPO, which may be of independent interest as the first known algorithm for this setting. Code is available at https://github.com/killian-34/RobustRMAB.
    Low-Dimensional State and Action Representation Learning with MDP Homomorphism Metrics. (arXiv:2107.01677v1 [cs.LG])
    (2 min) Deep Reinforcement Learning has shown its ability in solving complicated problems directly from high-dimensional observations. However, in end-to-end settings, Reinforcement Learning algorithms are not sample-efficient and requires long training times and quantities of data. In this work, we proposed a framework for sample-efficient Reinforcement Learning that take advantage of state and action representations to transform a high-dimensional problem into a low-dimensional one. Moreover, we seek to find the optimal policy mapping latent states to latent actions. Because now the policy is learned on abstract representations, we enforce, using auxiliary loss functions, the lifting of such policy to the original problem domain. Results show that the novel framework can efficiently learn low-dimensional and interpretable state and action representations and the optimal latent policy.
    Attribute-aware Explainable Complementary Clothing Recommendation. (arXiv:2107.01655v1 [cs.IR])
    (2 min) Modelling mix-and-match relationships among fashion items has become increasingly demanding yet challenging for modern E-commerce recommender systems. When performing clothes matching, most existing approaches leverage the latent visual features extracted from fashion item images for compatibility modelling, which lacks explainability of generated matching results and can hardly convince users of the recommendations. Though recent methods start to incorporate pre-defined attribute information (e.g., colour, style, length, etc.) for learning item representations and improving the model interpretability, their utilisation of attribute information is still mainly reserved for enhancing the learned item representations and generating explanations via post-processing. As a result, this creates a severe bottleneck when we are trying to advance the recommendation accuracy and generating fine-grained explanations since the explicit attributes have only loose connections to the actual recommendation process. This work aims to tackle the explainability challenge in fashion recommendation tasks by proposing a novel Attribute-aware Fashion Recommender (AFRec). Specifically, AFRec recommender assesses the outfit compatibility by explicitly leveraging the extracted attribute-level representations from each item's visual feature. The attributes serve as the bridge between two fashion items, where we quantify the affinity of a pair of items through the learned compatibility between their attributes. Extensive experiments have demonstrated that, by making full use of the explicit attributes in the recommendation process, AFRec is able to achieve state-of-the-art recommendation accuracy and generate intuitive explanations at the same time.
    Fair Decision Rules for Binary Classification. (arXiv:2107.01325v1 [cs.LG])
    (2 min) In recent years, machine learning has begun automating decision making in fields as varied as college admissions, credit lending, and criminal sentencing. The socially sensitive nature of some of these applications together with increasing regulatory constraints has necessitated the need for algorithms that are both fair and interpretable. In this paper we consider the problem of building Boolean rule sets in disjunctive normal form (DNF), an interpretable model for binary classification, subject to fairness constraints. We formulate the problem as an integer program that maximizes classification accuracy with explicit constraints on two different measures of classification parity: equality of opportunity and equalized odds. Column generation framework, with a novel formulation, is used to efficiently search over exponentially many possible rules. When combined with faster heuristics, our method can deal with large data-sets. Compared to other fair and interpretable classifiers, our method is able to find rule sets that meet stricter notions of fairness with a modest trade-off in accuracy.
    Boosting Transferability of Targeted Adversarial Examples via Hierarchical Generative Networks. (arXiv:2107.01809v1 [cs.LG])
    (2 min) Transfer-based adversarial attacks can effectively evaluate model robustness in the black-box setting. Though several methods have demonstrated impressive transferability of untargeted adversarial examples, targeted adversarial transferability is still challenging. The existing methods either have low targeted transferability or sacrifice computational efficiency. In this paper, we develop a simple yet practical framework to efficiently craft targeted transfer-based adversarial examples. Specifically, we propose a conditional generative attacking model, which can generate the adversarial examples targeted at different classes by simply altering the class embedding and share a single backbone. Extensive experiments demonstrate that our method improves the success rates of targeted black-box attacks by a significant margin over the existing methods -- it reaches an average success rate of 29.6\% against six diverse models based only on one substitute white-box model in the standard testing of NeurIPS 2017 competition, which outperforms the state-of-the-art gradient-based attack methods (with an average success rate of $<$2\%) by a large margin. Moreover, the proposed method is also more efficient beyond an order of magnitude than gradient-based methods.
    Survey: Leakage and Privacy at Inference Time. (arXiv:2107.01614v1 [cs.LG])
    (2 min) Leakage of data from publicly available Machine Learning (ML) models is an area of growing significance as commercial and government applications of ML can draw on multiple sources of data, potentially including users' and clients' sensitive data. We provide a comprehensive survey of contemporary advances on several fronts, covering involuntary data leakage which is natural to ML models, potential malevolent leakage which is caused by privacy attacks, and currently available defence mechanisms. We focus on inference-time leakage, as the most likely scenario for publicly available models. We first discuss what leakage is in the context of different data, tasks, and model architectures. We then propose a taxonomy across involuntary and malevolent leakage, available defences, followed by the currently available assessment metrics and applications. We conclude with outstanding challenges and open questions, outlining some promising directions for future research.
    A contextual analysis of multi-layer perceptron models in classifying hand-written digits and letters: limited resources. (arXiv:2107.01782v1 [cs.LG])
    (2 min) Classifying hand-written digits and letters has taken a big leap with the introduction of ConvNets. However, on very constrained hardware the time necessary to train such models would be high. Our main contribution is twofold. First, we extensively test an end-to-end vanilla neural network (MLP) approach in pure numpy without any pre-processing or feature extraction done beforehand. Second, we show that basic data mining operations can significantly improve the performance of the models in terms of computational time, without sacrificing much accuracy. We illustrate our claims on a simpler variant of the Extended MNIST dataset, called Balanced EMNIST dataset. Our experiments show that, without any data mining, we get increased generalization performance when using more hidden layers and regularization techniques, the best model achieving 84.83% accuracy on a test dataset. Using dimensionality reduction done by PCA we were able to increase that figure to 85.08% with only 10% of the original feature space, reducing the memory size needed by 64%. Finally, adding methods to remove possibly harmful training samples like deviation from the mean helped us to still achieve over 84% test accuracy but with only 32.8% of the original memory size for the training set. This compares favorably to the majority of literature results obtained through similar architectures. Although this approach gets outshined by state-of-the-art models, it does scale to some (AlexNet, VGGNet) trained on 50% of the same dataset.
    A Lottery Ticket Hypothesis Framework for Low-Complexity Device-Robust Neural Acoustic Scene Classification. (arXiv:2107.01461v1 [cs.SD])
    (2 min) We propose a novel neural model compression strategy combining data augmentation, knowledge transfer, pruning, and quantization for device-robust acoustic scene classification (ASC). Specifically, we tackle the ASC task in a low-resource environment leveraging a recently proposed advanced neural network pruning mechanism, namely Lottery Ticket Hypothesis (LTH), to find a sub-network neural model associated with a small amount non-zero model parameters. The effectiveness of LTH for low-complexity acoustic modeling is assessed by investigating various data augmentation and compression schemes, and we report an efficient joint framework for low-complexity multi-device ASC, called Acoustic Lottery. Acoustic Lottery could compress an ASC model over $1/10^{4}$ and attain a superior performance (validation accuracy of 74.01% and Log loss of 0.76) compared to its not compressed seed model. All results reported in this work are based on a joint effort of four groups, namely GT-USTC-UKE-Tencent, aiming to address the "Low-Complexity Acoustic Scene Classification (ASC) with Multiple Devices" in the DCASE 2021 Challenge Task 1a.
    Random Neural Networks in the Infinite Width Limit as Gaussian Processes. (arXiv:2107.01562v1 [math.PR])
    (2 min) This article gives a new proof that fully connected neural networks with random weights and biases converge to Gaussian processes in the regime where the input dimension, output dimension, and depth are kept fixed, while the hidden layer widths tend to infinity. Unlike prior work, convergence is shown assuming only moment conditions for the distribution of weights and for quite general non-linearities.
    Split-and-Bridge: Adaptable Class Incremental Learning within a Single Neural Network. (arXiv:2107.01349v1 [cs.LG])
    (2 min) Continual learning has been a major problem in the deep learning community, where the main challenge is how to effectively learn a series of newly arriving tasks without forgetting the knowledge of previous tasks. Initiated by Learning without Forgetting (LwF), many of the existing works report that knowledge distillation is effective to preserve the previous knowledge, and hence they commonly use a soft label for the old task, namely a knowledge distillation (KD) loss, together with a class label for the new task, namely a cross entropy (CE) loss, to form a composite loss for a single neural network. However, this approach suffers from learning the knowledge by a CE loss as a KD loss often more strongly influences the objective function when they are in a competitive situation within a single network. This could be a critical problem particularly in a class incremental scenario, where the knowledge across tasks as well as within the new task, both of which can only be acquired by a CE loss, is essentially learned due to the existence of a unified classifier. In this paper, we propose a novel continual learning method, called Split-and-Bridge, which can successfully address the above problem by partially splitting a neural network into two partitions for training the new task separated from the old task and re-connecting them for learning the knowledge across tasks. In our thorough experimental analysis, our Split-and-Bridge method outperforms the state-of-the-art competitors in KD-based continual learning.
    Improved Representation Learning for Session-based Recommendation. (arXiv:2107.01516v1 [cs.IR])
    (2 min) Session-based recommendation systems suggest relevant items to users by modeling user behavior and preferences using short-term anonymous sessions. Existing methods leverage Graph Neural Networks (GNNs) that propagate and aggregate information from neighboring nodes i.e., local message passing. Such graph-based architectures have representational limits, as a single sub-graph is susceptible to overfit the sequential dependencies instead of accounting for complex transitions between items in different sessions. We propose using a Transformer in combination with a target attentive GNN, which allows richer Representation Learning. Our experimental results and ablation show that our proposed method outperforms the existing methods on real-world benchmark datasets.
    Learning in nonatomic games, Part I: Finite action spaces and population games. (arXiv:2107.01595v1 [cs.GT])
    (2 min) We examine the long-run behavior of a wide range of dynamics for learning in nonatomic games, in both discrete and continuous time. The class of dynamics under consideration includes fictitious play and its regularized variants, the best-reply dynamics (again, possibly regularized), as well as the dynamics of dual averaging / "follow the regularized leader" (which themselves include as special cases the replicator dynamics and Friedman's projection dynamics). Our analysis concerns both the actual trajectory of play and its time-average, and we cover potential and monotone games, as well as games with an evolutionarily stable state (global or otherwise). We focus exclusively on games with finite action spaces; nonatomic games with continuous action spaces are treated in detail in Part II of this paper.
    A Typology of Data Anomalies. (arXiv:2107.01615v1 [cs.LG])
    (2 min) Anomalies are cases that are in some way unusual and do not appear to fit the general patterns present in the dataset. Several conceptualizations exist to distinguish between different types of anomalies. However, these are either too specific to be generally applicable or so abstract that they neither provide concrete insight into the nature of anomaly types nor facilitate the functional evaluation of anomaly detection algorithms. With the recent criticism on 'black box' algorithms and analytics it has become clear that this is an undesirable situation. This paper therefore introduces a general typology of anomalies that offers a clear and tangible definition of the different types of anomalies in datasets. The typology also facilitates the evaluation of the functional capabilities of anomaly detection algorithms and as a framework assists in analyzing the conceptual levels of data, patterns and anomalies. Finally, it serves as an analytical tool for studying anomaly types from other typologies.
    UCSL : A Machine Learning Expectation-Maximization framework for Unsupervised Clustering driven by Supervised Learning. (arXiv:2107.01988v1 [stat.ML])
    (2 min) Subtype Discovery consists in finding interpretable and consistent sub-parts of a dataset, which are also relevant to a certain supervised task. From a mathematical point of view, this can be defined as a clustering task driven by supervised learning in order to uncover subgroups in line with the supervised prediction. In this paper, we propose a general Expectation-Maximization ensemble framework entitled UCSL (Unsupervised Clustering driven by Supervised Learning). Our method is generic, it can integrate any clustering method and can be driven by both binary classification and regression. We propose to construct a non-linear model by merging multiple linear estimators, one per cluster. Each hyperplane is estimated so that it correctly discriminates - or predict - only one cluster. We use SVC or Logistic Regression for classification and SVR for regression. Furthermore, to perform cluster analysis within a more suitable space, we also propose a dimension-reduction algorithm that projects the data onto an orthonormal space relevant to the supervised task. We analyze the robustness and generalization capability of our algorithm using synthetic and experimental datasets. In particular, we validate its ability to identify suitable consistent sub-types by conducting a psychiatric-diseases cluster analysis with known ground-truth labels. The gain of the proposed method over previous state-of-the-art techniques is about +1.9 points in terms of balanced accuracy. Finally, we make codes and examples available in a scikit-learn-compatible Python package at https://github.com/neurospin-projects/2021_rlouiset_ucsl
    How Does the Task Landscape Affect MAML Performance?. (arXiv:2010.14672v3 [cs.LG] UPDATED)
    (2 min) Model-Agnostic Meta-Learning (MAML) has become increasingly popular for training models that can quickly adapt to new tasks via one or few stochastic gradient descent steps. However, the MAML objective is significantly more difficult to optimize compared to standard Empirical Risk Minimization (ERM), and little is understood about how much MAML improves over ERM in terms of the fast adaptability of their solutions in various scenarios. We analytically address this issue in a linear regression setting consisting of a mixture of easy and hard tasks, where hardness is related to the condition number of the task's loss function. Specifically, we prove that in order for MAML to achieve substantial gain over ERM, (i) there must be some discrepancy in hardness among the tasks, and (ii) the optimal solutions of the hard tasks must be closely packed with the center far from the center of the easy tasks optimal solutions. We also give numerical and analytical results suggesting that these insights also apply to two-layer neural networks. Finally, we provide few-shot image classification experiments that support our insights for when MAML should be used and emphasize the importance of training MAML on hard tasks in practice.
    Automating Generative Deep Learning for Artistic Purposes: Challenges and Opportunities. (arXiv:2107.01858v1 [cs.LG])
    (2 min) We present a framework for automating generative deep learning with a specific focus on artistic applications. The framework provides opportunities to hand over creative responsibilities to a generative system as targets for automation. For the definition of targets, we adopt core concepts from automated machine learning and an analysis of generative deep learning pipelines, both in standard and artistic settings. To motivate the framework, we argue that automation aligns well with the goal of increasing the creative responsibility of a generative system, a central theme in computational creativity research. We understand automation as the challenge of granting a generative system more creative autonomy, by framing the interaction between the user and the system as a co-creative process. The development of the framework is informed by our analysis of the relationship between automation and creative autonomy. An illustrative example shows how the framework can give inspiration and guidance in the process of handing over creative responsibility.
    Ensemble and Auxiliary Tasks for Data-Efficient Deep Reinforcement Learning. (arXiv:2107.01904v1 [cs.LG])
    (2 min) Ensemble and auxiliary tasks are both well known to improve the performance of machine learning models when data is limited. However, the interaction between these two methods is not well studied, particularly in the context of deep reinforcement learning. In this paper, we study the effects of ensemble and auxiliary tasks when combined with the deep Q-learning algorithm. We perform a case study on ATARI games under limited data constraint. Moreover, we derive a refined bias-variance-covariance decomposition to analyze the different ways of learning ensembles and using auxiliary tasks, and use the analysis to help provide some understanding of the case study. Our code is open source and available at https://github.com/NUS-LID/RENAULT.
    Adversarially Robust Kernel Smoothing. (arXiv:2102.08474v3 [cs.LG] UPDATED)
    (2 min) We propose the adversarially robust kernel smoothing (ARKS) algorithm, combining kernel smoothing, robust optimization, and adversarial training for robust learning. Our methods are motivated by the convex analysis perspective of distributionally robust optimization based on probability metrics, such as the Wasserstein distance and the maximum mean discrepancy. We adapt the integral operator using supremal convolution in convex analysis to form a novel function majorant used for enforcing robustness. Our method is simple in form and applies to general loss functions and machine learning models. Furthermore, we report experiments with general machine learning models, such as deep neural networks, to demonstrate that ARKS performs competitively with the state-of-the-art methods based on the Wasserstein distance.
    Anomaly Detection With Partitioning Overfitting Autoencoder Ensembles. (arXiv:2009.02755v6 [cs.LG] UPDATED)
    (2 min) In this paper, we propose POTATOES (Partitioning OverfiTting AuTOencoder EnSemble), a new method for unsupervised outlier detection (UOD). More precisely, given any autoencoder for UOD, this technique can be used to improve its accuracy while at the same time removing the burden of tuning its regularization. The idea is to not regularize at all, but to rather randomly partition the data into sufficiently many equally sized parts, overfit each part with its own autoencoder, and to use the maximum over all autoencoder reconstruction errors as the anomaly score. We apply our model to various realistic datasets and show that if the set of inliers is dense enough, our method indeed improves the UOD performance of a given autoencoder significantly. For reproducibility, the code is made available on github so the reader can recreate the results in this paper as well as apply the method to other autoencoders and datasets.
    EasyFL: A Low-code Federated Learning Platform For Dummies. (arXiv:2105.07603v2 [cs.DC] UPDATED)
    (2 min) Academia and industry have developed several platforms to support the popular privacy-preserving distributed learning method -- Federated Learning (FL). However, these platforms are complex to use and require a deep understanding of FL, which imposes high barriers to entry for beginners, limits the productivity of researchers, and compromises deployment efficiency. In this paper, we propose the first low-code FL platform, EasyFL, to enable users with various levels of expertise to experiment and prototype FL applications with little coding. We achieve this goal while ensuring great flexibility and extensibility for customization by unifying simple API design, modular design, and granular training flow abstraction. With only a few lines of code, EasyFL empowers them with many out-of-the-box functionalities to accelerate experimentation and deployment. These practical functionalities are heterogeneity simulation, comprehensive tracking, distributed training optimization, and seamless deployment. They are proposed based on challenges identified in the proposed FL life cycle. Compared with other platforms, EasyFL not only requires just three lines of code (at least 10x lesser) to build a vanilla FL application but also incurs lower training overhead. Besides, our evaluations demonstrate that EasyFL expedites distributed training by 1.5x. It also improves the efficiency of deployment. We believe that EasyFL will increase the productivity of researchers and democratize FL to wider audiences.
    Quantile Encoder: Tackling High Cardinality Categorical Features in Regression Problems. (arXiv:2105.13783v2 [cs.LG] UPDATED)
    (2 min) Regression problems have been widely studied in machinelearning literature resulting in a plethora of regression models and performance measures. However, there are few techniques specially dedicated to solve the problem of how to incorporate categorical features to regression problems. Usually, categorical feature encoders are general enough to cover both classification and regression problems. This lack of specificity results in underperforming regression models. In this paper,we provide an in-depth analysis of how to tackle high cardinality categor-ical features with the quantile. Our proposal outperforms state-of-the-encoders, including the traditional statistical mean target encoder, when considering the Mean Absolute Error, especially in the presence of long-tailed or skewed distributions. Besides, to deal with possible overfitting when there are categories with small support, our encoder benefits from additive smoothing. Finally, we describe how to expand the encoded values by creating a set of features with different quantiles. This expanded encoder provides a more informative output about the categorical feature in question, further boosting the performance of the regression model.
    Partition and Code: learning how to compress graphs. (arXiv:2107.01952v1 [cs.LG])
    (2 min) Can we use machine learning to compress graph data? The absence of ordering in graphs poses a significant challenge to conventional compression algorithms, limiting their attainable gains as well as their ability to discover relevant patterns. On the other hand, most graph compression approaches rely on domain-dependent handcrafted representations and cannot adapt to different underlying graph distributions. This work aims to establish the necessary principles a lossless graph compression method should follow to approach the entropy storage lower bound. Instead of making rigid assumptions about the graph distribution, we formulate the compressor as a probabilistic model that can be learned from data and generalise to unseen instances. Our "Partition and Code" framework entails three steps: first, a partitioning algorithm decomposes the graph into elementary structures, then these are mapped to the elements of a small dictionary on which we learn a probability distribution, and finally, an entropy encoder translates the representation into bits. All three steps are parametric and can be trained with gradient descent. We theoretically compare the compression quality of several graph encodings and prove, under mild conditions, a total ordering of their expected description lengths. Moreover, we show that, under the same conditions, PnC achieves compression gains w.r.t. the baselines that grow either linearly or quadratically with the number of vertices. Our algorithms are quantitatively evaluated on diverse real-world networks obtaining significant performance improvements with respect to different families of non-parametric and parametric graph compressors.
    GAN2GAN: Generative Noise Learning for Blind Denoising with Single Noisy Images. (arXiv:1905.10488v5 [eess.IV] UPDATED)
    (2 min) We tackle a challenging blind image denoising problem, in which only single distinct noisy images are available for training a denoiser, and no information about noise is known, except for it being zero-mean, additive, and independent of the clean image. In such a setting, which often occurs in practice, it is not possible to train a denoiser with the standard discriminative training or with the recently developed Noise2Noise (N2N) training; the former requires the underlying clean image for the given noisy image, and the latter requires two independently realized noisy image pair for a clean image. To that end, we propose GAN2GAN (Generated-Artificial-Noise to Generated-Artificial-Noise) method that first learns a generative model that can 1) simulate the noise in the given noisy images and 2) generate a rough, noisy estimates of the clean images, then 3) iteratively trains a denoiser with subsequently synthesized noisy image pairs (as in N2N), obtained from the generative model. In results, we show the denoiser trained with our GAN2GAN achieves an impressive denoising performance on both synthetic and real-world datasets for the blind denoising setting; it almost approaches the performance of the standard discriminatively-trained or N2N-trained models that have more information than ours, and it significantly outperforms the recent baseline for the same setting, \textit{e.g.}, Noise2Void, and a more conventional yet strong one, BM3D. The official code of our method is available at https://github.com/csm9493/GAN2GAN.
    The MineRL BASALT Competition on Learning from Human Feedback. (arXiv:2107.01969v1 [cs.LG])
    (3 min) The last decade has seen a significant increase of interest in deep learning research, with many public successes that have demonstrated its potential. As such, these systems are now being incorporated into commercial products. With this comes an additional challenge: how can we build AI systems that solve tasks where there is not a crisp, well-defined specification? While multiple solutions have been proposed, in this competition we focus on one in particular: learning from human feedback. Rather than training AI systems using a predefined reward function or using a labeled dataset with a predefined set of categories, we instead train the AI system using a learning signal derived from some form of human feedback, which can evolve over time as the understanding of the task changes, or as the capabilities of the AI system improve. The MineRL BASALT competition aims to spur forward research on this important class of techniques. We design a suite of four tasks in Minecraft for which we expect it will be hard to write down hardcoded reward functions. These tasks are defined by a paragraph of natural language: for example, "create a waterfall and take a scenic picture of it", with additional clarifying details. Participants must train a separate agent for each task, using any method they want. Agents are then evaluated by humans who have read the task description. To help participants get started, we provide a dataset of human demonstrations on each of the four tasks, as well as an imitation learning baseline that leverages these demonstrations. Our hope is that this competition will improve our ability to build AI systems that do what their designers intend them to do, even when the intent cannot be easily formalized. Besides allowing AI to solve more tasks, this can also enable more effective regulation of AI systems, as well as making progress on the value alignment problem.
    Novel Policy Seeking with Constrained Optimization. (arXiv:2005.10696v2 [cs.LG] UPDATED)
    (2 min) In problem-solving, we humans can come up with multiple novel solutions to the same problem. However, reinforcement learning algorithms can only produce a set of monotonous policies that maximize the cumulative reward but lack diversity and novelty. In this work, we address the problem of generating novel policies in reinforcement learning tasks. Instead of following the multi-objective framework used in existing methods, we propose to rethink the problem under a novel perspective of constrained optimization. We first introduce a new metric to evaluate the difference between policies and then design two practical novel policy generation methods following the new perspective. The two proposed methods, namely the Constrained Task Novel Bisector (CTNB) and the Interior Policy Differentiation (IPD), are derived from the feasible direction method and the interior point method commonly known in the constrained optimization literature. Experimental comparisons on the MuJoCo control suite show our methods can achieve substantial improvement over previous novelty-seeking methods in terms of both the novelty of policies and their performances in the primal task.
    Extended Few-Shot Learning: Exploiting Existing Resources for Novel Tasks. (arXiv:2012.07176v3 [cs.LG] UPDATED)
    (2 min) In many practical few-shot learning problems, even though labeled examples are scarce, there are abundant auxiliary datasets that potentially contain useful information. We propose the problem of extended few-shot learning to study these scenarios. We then introduce a framework to address the challenges of efficiently selecting and effectively using auxiliary data in few-shot image classification. Given a large auxiliary dataset and a notion of semantic similarity among classes, we automatically select pseudo shots, which are labeled examples from other classes related to the target task. We show that naive approaches, such as (1) modeling these additional examples the same as the target task examples or (2) using them to learn features via transfer learning, only increase accuracy by a modest amount. Instead, we propose a masking module that adjusts the features of auxiliary data to be more similar to those of the target classes. We show that this masking module performs better than naively modeling the support examples and transfer learning by 4.68 and 6.03 percentage points, respectively.
    Single Model for Influenza Forecasting of Multiple Countries by Multi-task Learning. (arXiv:2107.01760v1 [cs.LG])
    (2 min) The accurate forecasting of infectious epidemic diseases such as influenza is a crucial task undertaken by medical institutions. Although numerous flu forecasting methods and models based mainly on historical flu activity data and online user-generated contents have been proposed in previous studies, no flu forecasting model targeting multiple countries using two types of data exists at present. Our paper leverages multi-task learning to tackle the challenge of building one flu forecasting model targeting multiple countries; each country as each task. Also, to develop the flu prediction model with higher performance, we solved two issues; finding suitable search queries, which are part of the user-generated contents, and how to leverage search queries efficiently in the model creation. For the first issue, we propose the transfer approaches from English to other languages. For the second issue, we propose a novel flu forecasting model that takes advantage of search queries using an attention mechanism and extend the model to a multi-task model for multiple countries' flu forecasts. Experiments on forecasting flu epidemics in five countries demonstrate that our model significantly improved the performance by leveraging the search queries and multi-task learning compared to the baselines.
    FedSiam: Towards Adaptive Federated Semi-Supervised Learning. (arXiv:2012.03292v2 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) has emerged as an effective technique to co-training machine learning models without actually sharing data and leaking privacy. However, most existing FL methods focus on the supervised setting and ignore the utilization of unlabeled data. Although there are a few existing studies trying to incorporate unlabeled data into FL, they all fail to maintain performance guarantees or generalization ability in various real-world settings. In this paper, we focus on designing a general framework FedSiam to tackle different scenarios of federated semi-supervised learning, including four settings in the labels-at-client scenario and two setting in the labels-at-server scenario. FedSiam is built upon a siamese network into FL with a momentum update to handle the non-IID challenges introduced by unlabeled data. We further propose a new metric to measure the divergence of local model layers within the siamese network. Based on the divergence, FedSiam can automatically select layer-level parameters to be uploaded to the server in an adaptive manner. Experimental results on three datasets under two scenarios with different data distribution settings demonstrate that the proposed FedSiam framework outperforms state-of-the-art baselines.
    Learning Cost Functions for Optimal Transport. (arXiv:2002.09650v2 [cs.LG] UPDATED)
    (2 min) Inverse optimal transport (OT) refers to the problem of learning the cost function for OT from observed transport plan or its samples. In this paper, we derive an unconstrained convex optimization formulation of the inverse OT problem, which can be further augmented by any customizable regularization. We provide a comprehensive characterization of the properties of inverse OT, including uniqueness of solutions. We also develop two numerical algorithms, one is a fast matrix scaling method based on the Sinkhorn-Knopp algorithm for discrete OT, and the other one is a learning based algorithm that parameterizes the cost function as a deep neural network for continuous OT. The novel framework proposed in the work avoids repeatedly solving a forward OT in each iteration which has been a thorny computational bottleneck for the bi-level optimization in existing inverse OT approaches. Numerical results demonstrate promising efficiency and accuracy advantages of the proposed algorithms over existing state-of-the-art methods.
    Layer-Wise Cross-View Decoding for Sequence-to-Sequence Learning. (arXiv:2005.08081v5 [cs.CL] UPDATED)
    (2 min) In sequence-to-sequence learning, the decoder relies on the attention mechanism to efficiently extract information from the encoder. While it is common practice to draw information from only the last encoder layer, recent work has proposed to use representations from different encoder layers for diversified levels of information. Nonetheless, the decoder still obtains only a single view of the source sequences, which might lead to insufficient training of the encoder layer stack due to the hierarchy bypassing problem. In this work, we propose layer-wise cross-view decoding, where for each decoder layer, together with the representations from the last encoder layer, which serve as a global view, those from other encoder layers are supplemented for a stereoscopic view of the source sequences. Systematic experiments show that we successfully address the hierarchy bypassing problem and substantially improve the performance of sequence-to-sequence learning with deep representations on diverse tasks.
    Unknown Presentation Attack Detection against Rational Attackers. (arXiv:2010.01592v2 [cs.CV] UPDATED)
    (2 min) Despite the impressive progress in the field of presentation attack detection and multimedia forensics over the last decade, these systems are still vulnerable to attacks in real-life settings. Some of the challenges for existing solutions are the detection of unknown attacks, the ability to perform in adversarial settings, few-shot learning, and explainability. In this study, these limitations are approached by reliance on a game-theoretic view for modeling the interactions between the attacker and the detector. Consequently, a new optimization criterion is proposed and a set of requirements are defined for improving the performance of these systems in real-life settings. Furthermore, a novel detection technique is proposed using generator-based feature sets that are not biased towards any specific attack species. To further optimize the performance on known attacks, a new loss function coined categorical margin maximization loss (C-marmax) is proposed which gradually improves the performance against the most powerful attack. The proposed approach provides a more balanced performance across known and unknown attacks and achieves state-of-the-art performance in known and unknown attack detection cases against rational attackers. Lastly, the few-shot learning potential of the proposed approach is studied as well as its ability to provide pixel-level explainability.
    When and How to Fool Explainable Models (and Humans) with Adversarial Examples. (arXiv:2107.01943v1 [cs.LG])
    (2 min) Reliable deployment of machine learning models such as neural networks continues to be challenging due to several limitations. Some of the main shortcomings are the lack of interpretability and the lack of robustness against adversarial examples or out-of-distribution inputs. In this paper, we explore the possibilities and limits of adversarial attacks for explainable machine learning models. First, we extend the notion of adversarial examples to fit in explainable machine learning scenarios, in which the inputs, the output classifications and the explanations of the model's decisions are assessed by humans. Next, we propose a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment, introducing novel attack paradigms. In particular, our framework considers a wide range of relevant (yet often ignored) factors such as the type of problem, the user expertise or the objective of the explanations in order to identify the attack strategies that should be adopted in each scenario to successfully deceive the model (and the human). These contributions intend to serve as a basis for a more rigorous and realistic study of adversarial examples in the field of explainable machine learning.
    Learning Bayesian Networks through Birkhoff Polytope: A Relaxation Method. (arXiv:2107.01658v1 [stat.ML])
    (2 min) We establish a novel framework for learning a directed acyclic graph (DAG) when data are generated from a Gaussian, linear structural equation model. It consists of two parts: (1) introduce a permutation matrix as a new parameter within a regularized Gaussian log-likelihood to represent variable ordering; and (2) given the ordering, estimate the DAG structure through sparse Cholesky factor of the inverse covariance matrix. For permutation matrix estimation, we propose a relaxation technique that avoids the NP-hard combinatorial problem of order estimation. Given an ordering, a sparse Cholesky factor is estimated using a cyclic coordinatewise descent algorithm which decouples row-wise. Our framework recovers DAGs without the need for an expensive verification of the acyclicity constraint or enumeration of possible parent sets. We establish numerical convergence of the algorithm, and consistency of the Cholesky factor estimator when the order of variables is known. Through several simulated and macro-economic datasets, we study the scope and performance of the proposed methodology.
    On The Distribution of Penultimate Activations of Classification Networks. (arXiv:2107.01900v1 [cs.LG])
    (2 min) This paper studies probability distributions ofpenultimate activations of classification networks.We show that, when a classification network istrained with the cross-entropy loss, its final classi-fication layer forms aGenerative-Discriminativepairwith a generative classifier based on a specificdistribution of penultimate activations. More im-portantly, the distribution is parameterized by theweights of the final fully-connected layer, and canbe considered as a generative model that synthe-sizes the penultimate activations without feedinginput data. We empirically demonstrate that thisgenerative model enables stable knowledge dis-tillation in the presence of domain shift, and cantransfer knowledge from a classifier to variationalautoencoders and generative adversarial networksfor class-conditional image generation.
    Slope and generalization properties of neural networks. (arXiv:2107.01473v1 [stat.ML])
    (2 min) Neural networks are very successful tools in for example advanced classification. From a statistical point of view, fitting a neural network may be seen as a kind of regression, where we seek a function from the input space to a space of classification probabilities that follows the "general" shape of the data, but avoids overfitting by avoiding memorization of individual data points. In statistics, this can be done by controlling the geometric complexity of the regression function. We propose to do something similar when fitting neural networks by controlling the slope of the network. After defining the slope and discussing some of its theoretical properties, we go on to show empirically in examples, using ReLU networks, that the distribution of the slope of a well-trained neural network classifier is generally independent of the width of the layers in a fully connected network, and that the mean of the distribution only has a weak dependence on the model architecture in general. The slope is of similar size throughout the relevant volume, and varies smoothly. It also behaves as predicted in rescaling examples. We discuss possible applications of the slope concept, such as using it as a part of the loss function or stopping criterion during network training, or ranking data sets in terms of their complexity.
    A Uniformly Consistent Estimator of non-Gaussian Causal Effects Under the k-Triangle-Faithfulness Assumption. (arXiv:2107.01333v1 [stat.ML])
    (2 min) Kalisch and B\"{u}hlmann (2007) showed that for linear Gaussian models, under the Causal Markov Assumption, the Strong Causal Faithfulness Assumption, and the assumption of causal sufficiency, the PC algorithm is a uniformly consistent estimator of the Markov Equivalence Class of the true causal DAG for linear Gaussian models; it follows from this that for the identifiable causal effects in the Markov Equivalence Class, there are uniformly consistent estimators of causal effects as well. The $k$-Triangle-Faithfulness Assumption is a strictly weaker assumption that avoids some implausible implications of the Strong Causal Faithfulness Assumption and also allows for uniformly consistent estimates of Markov Equivalence Classes (in a weakened sense), and of identifiable causal effects. However, both of these assumptions are restricted to linear Gaussian models. We propose the Generalized $k$-Triangle Faithfulness, which can be applied to any smooth distribution. In addition, under the Generalized $k$-Triangle Faithfulness Assumption, we describe the Edge Estimation Algorithm that provides uniformly consistent estimates of causal effects in some cases (and otherwise outputs "can't tell"), and the \textit{Very Conservative }$SGS$ Algorithm that (in a slightly weaker sense) is a uniformly consistent estimator of the Markov equivalence class of the true DAG.
    Optimizing ROC Curves with a Sort-Based Surrogate Loss Function for Binary Classification and Changepoint Detection. (arXiv:2107.01285v1 [stat.ML])
    (2 min) Receiver Operating Characteristic (ROC) curves are plots of true positive rate versus false positive rate which are useful for evaluating binary classification models, but difficult to use for learning since the Area Under the Curve (AUC) is non-convex. ROC curves can also be used in other problems that have false positive and true positive rates such as changepoint detection. We show that in this more general context, the ROC curve can have loops, points with highly sub-optimal error rates, and AUC greater than one. This observation motivates a new optimization objective: rather than maximizing the AUC, we would like a monotonic ROC curve with AUC=1 that avoids points with large values for Min(FP,FN). We propose a convex relaxation of this objective that results in a new surrogate loss function called the AUM, short for Area Under Min(FP, FN). Whereas previous loss functions are based on summing over all labeled examples or pairs, the AUM requires a sort and a sum over the sequence of points on the ROC curve. We show that AUM directional derivatives can be efficiently computed and used in a gradient descent learning algorithm. In our empirical study of supervised binary classification and changepoint detection problems, we show that our new AUM minimization learning algorithm results in improved AUC and comparable speed relative to previous baselines.
    Bayesian decision-making under misspecified priors with applications to meta-learning. (arXiv:2107.01509v1 [cs.LG])
    (2 min) Thompson sampling and other Bayesian sequential decision-making algorithms are among the most popular approaches to tackle explore/exploit trade-offs in (contextual) bandits. The choice of prior in these algorithms offers flexibility to encode domain knowledge but can also lead to poor performance when misspecified. In this paper, we demonstrate that performance degrades gracefully with misspecification. We prove that the expected reward accrued by Thompson sampling (TS) with a misspecified prior differs by at most $\tilde{\mathcal{O}}(H^2 \epsilon)$ from TS with a well specified prior, where $\epsilon$ is the total-variation distance between priors and $H$ is the learning horizon. Our bound does not require the prior to have any parametric form. For priors with bounded support, our bound is independent of the cardinality or structure of the action space, and we show that it is tight up to universal constants in the worst case. Building on our sensitivity analysis, we establish generic PAC guarantees for algorithms in the recently studied Bayesian meta-learning setting and derive corollaries for various families of priors. Our results generalize along two axes: (1) they apply to a broader family of Bayesian decision-making algorithms, including a Monte-Carlo implementation of the knowledge gradient algorithm (KG), and (2) they apply to Bayesian POMDPs, the most general Bayesian decision-making setting, encompassing contextual bandits as a special case. Through numerical simulations, we illustrate how prior misspecification and the deployment of one-step look-ahead (as in KG) can impact the convergence of meta-learning in multi-armed and contextual bandits with structured and correlated priors.
    Solving Machine Learning Problems. (arXiv:2107.01238v1 [cs.LG])
    (2 min) Can a machine learn Machine Learning? This work trains a machine learning model to solve machine learning problems from a University undergraduate level course. We generate a new training set of questions and answers consisting of course exercises, homework, and quiz questions from MIT's 6.036 Introduction to Machine Learning course and train a machine learning model to answer these questions. Our system demonstrates an overall accuracy of 96% for open-response questions and 97% for multiple-choice questions, compared with MIT students' average of 93%, achieving grade A performance in the course, all in real-time. Questions cover all 12 topics taught in the course, excluding coding questions or questions with images. Topics include: (i) basic machine learning principles; (ii) perceptrons; (iii) feature extraction and selection; (iv) logistic regression; (v) regression; (vi) neural networks; (vii) advanced neural networks; (viii) convolutional neural networks; (ix) recurrent neural networks; (x) state machines and MDPs; (xi) reinforcement learning; and (xii) decision trees. Our system uses Transformer models within an encoder-decoder architecture with graph and tree representations. An important aspect of our approach is a data-augmentation scheme for generating new example problems. We also train a machine learning model to generate problem hints. Thus, our system automatically generates new questions across topics, answers both open-response questions and multiple-choice questions, classifies problems, and generates problem hints, pushing the envelope of AI for STEM education.
    Exact Backpropagation in Binary Weighted Networks with Group Weight Transformations. (arXiv:2107.01400v1 [cs.LG])
    (2 min) Quantization based model compression serves as high performing and fast approach for inference that yields highly compressed models compared to their full-precision floating point counterparts. The most extreme quantization is a 1-bit representation of parameters such that they have only two possible values, typically -1(0) or +1. Models that constrain the weights to binary values enable efficient implementation of the ubiquitous dot product by additions only without requiring floating point multiplications which is beneficial for resources constrained inference. The main contribution of this work is the introduction of a method to smooth the combinatorial problem of determining a binary vector of weights to minimize the expected loss for a given objective by means of empirical risk minimization with backpropagation. This is achieved by approximating a multivariate binary state over the weights utilizing a deterministic and differentiable transformation of real-valued continuous parameters. The proposed method adds little overhead in training, can be readily applied without any substantial modifications to the original architecture, does not introduce additional saturating non-linearities or auxiliary losses, and does not prohibit applying other methods for binarizing the activations. It is demonstrated that contrary to common assertions made in the literature, binary weighted networks can train well with the same standard optimization techniques and similar hyperparameters settings as their full-precision counterparts, namely momentum SGD with large learning rates and $L_2$ regularization. The source code is publicly available at https://bitbucket.org/YanivShu/binary_weighted_networks_public
    Auxiliary-Classifier GAN for Malware Analysis. (arXiv:2107.01620v1 [cs.CR])
    (2 min) Generative adversarial networks (GAN) are a class of powerful machine learning techniques, where both a generative and discriminative model are trained simultaneously. GANs have been used, for example, to successfully generate "deep fake" images. A recent trend in malware research consists of treating executables as images and employing image-based analysis techniques. In this research, we generate fake malware images using auxiliary classifier GANs (AC-GAN), and we consider the effectiveness of various techniques for classifying the resulting images. Our results indicate that the resulting multiclass classification problem is challenging, yet we can obtain strong results when restricting the problem to distinguishing between real and fake samples. While the AC-GAN generated images often appear to be very similar to real malware images, we conclude that from a deep learning perspective, the AC-GAN generated samples do not rise to the level of deep fake malware images.
    CT Image Harmonization for Enhancing Radiomics Studies. (arXiv:2107.01337v1 [eess.IV])
    (2 min) While remarkable advances have been made in Computed Tomography (CT), capturing CT images with non-standardized protocols causes low reproducibility regarding radiomic features, forming a barrier on CT image analysis in a large scale. RadiomicGAN is developed to effectively mitigate the discrepancy caused by using non-standard reconstruction kernels. RadiomicGAN consists of hybrid neural blocks including both pre-trained and trainable layers adopted to learn radiomic feature distributions efficiently. A novel training approach, called Dynamic Window-based Training, has been developed to smoothly transform the pre-trained model to the medical imaging domain. Model performance evaluated using 1401 radiomic features show that RadiomicGAN clearly outperforms the state-of-art image standardization models.
    Class Introspection: A Novel Technique for Detecting Unlabeled Subclasses by Leveraging Classifier Explainability Methods. (arXiv:2107.01657v1 [cs.LG])
    (2 min) Detecting latent structure within a dataset is a crucial step in performing analysis of a dataset. However, existing state-of-the-art techniques for subclass discovery are limited: either they are limited to detecting very small numbers of outliers or they lack the statistical power to deal with complex data such as image or audio. This paper proposes a solution to this subclass discovery problem: by leveraging instance explanation methods, an existing classifier can be extended to detect latent classes via differences in the classifier's internal decisions about each instance. This works not only with simple classification techniques but also with deep neural networks, allowing for a powerful and flexible approach to detecting latent structure within datasets. Effectively, this represents a projection of the dataset into the classifier's "explanation space," and preliminary results show that this technique outperforms the baseline for the detection of latent classes even with limited processing. This paper also contains a pipeline for analyzing classifiers automatically, and a web application for interactively exploring the results from this technique.
    Pool of Experts: Realtime Querying Specialized Knowledge in Massive Neural Networks. (arXiv:2107.01354v1 [cs.DB])
    (2 min) In spite of the great success of deep learning technologies, training and delivery of a practically serviceable model is still a highly time-consuming process. Furthermore, a resulting model is usually too generic and heavyweight, and hence essentially goes through another expensive model compression phase to fit in a resource-limited device like embedded systems. Inspired by the fact that a machine learning task specifically requested by mobile users is often much simpler than it is supported by a massive generic model, this paper proposes a framework, called Pool of Experts (PoE), that instantly builds a lightweight and task-specific model without any training process. For a realtime model querying service, PoE first extracts a pool of primitive components, called experts, from a well-trained and sufficiently generic network by exploiting a novel conditional knowledge distillation method, and then performs our train-free knowledge consolidation to quickly combine necessary experts into a lightweight network for a target task. Thanks to this train-free property, in our thorough empirical study, PoE can build a fairly accurate yet compact model in a realtime manner, whereas it takes a few minutes per query for the other training methods to achieve a similar level of the accuracy.
    Learning ODEs via Diffeomorphisms for Fast and Robust Integration. (arXiv:2107.01650v1 [cs.LG])
    (2 min) Advances in differentiable numerical integrators have enabled the use of gradient descent techniques to learn ordinary differential equations (ODEs). In the context of machine learning, differentiable solvers are central for Neural ODEs (NODEs), a class of deep learning models with continuous depth, rather than discrete layers. However, these integrators can be unsatisfactorily slow and inaccurate when learning systems of ODEs from long sequences, or when solutions of the system vary at widely different timescales in each dimension. In this paper we propose an alternative approach to learning ODEs from data: we represent the underlying ODE as a vector field that is related to another base vector field by a differentiable bijection, modelled by an invertible neural network. By restricting the base ODE to be amenable to integration, we can drastically speed up and improve the robustness of integration. We demonstrate the efficacy of our method in training and evaluating continuous neural networks models, as well as in learning benchmark ODE systems. We observe improvements of up to two orders of magnitude when integrating learned ODEs with GPUs computation.
    Exploring a Handwriting Programming Language for Educational Robots. (arXiv:2105.04963v2 [cs.PL] UPDATED)
    (2 min) Recently, introducing computer science and educational robots in compulsory education has received increasing attention. However, the use of screens in classrooms is often met with resistance, especially in primary school. To address this issue, this study presents the development of a handwriting-based programming language for educational robots. Aiming to align better with existing classroom practices, it allows students to program a robot by drawing symbols with ordinary pens and paper. Regular smartphones are leveraged to process the hand-drawn instructions using computer vision and machine learning algorithms, and send the commands to the robot for execution. To align with the local computer science curriculum, an appropriate playground and scaffolded learning tasks were designed. The system was evaluated in a preliminary test with eight teachers, developers and educational researchers. While the participants pointed out that some technical aspects could be improved, they also acknowledged the potential of the approach to make computer science education in primary school more accessible.
    Scale Mixtures of Neural Network Gaussian Processes. (arXiv:2107.01408v1 [stat.ML])
    (2 min) Recent works have revealed that infinitely-wide feed-forward or recurrent neural networks of any architecture correspond to Gaussian processes referred to as $\mathrm{NNGP}$. While these works have extended the class of neural networks converging to Gaussian processes significantly, however, there has been little focus on broadening the class of stochastic processes that such neural networks converge to. In this work, inspired by the scale mixture of Gaussian random variables, we propose the scale mixture of $\mathrm{NNGP}$ for which we introduce a prior distribution on the scale of the last-layer parameters. We show that simply introducing a scale prior on the last-layer parameters can turn infinitely-wide neural networks of any architecture into a richer class of stochastic processes. Especially, with certain scale priors, we obtain heavy-tailed stochastic processes, and we recover Student's $t$ processes in the case of inverse gamma priors. We further analyze the distributions of the neural networks initialized with our prior setting and trained with gradient descents and obtain similar results as for $\mathrm{NNGP}$. We present a practical posterior-inference algorithm for the scale mixture of $\mathrm{NNGP}$ and empirically demonstrate its usefulness on regression and classification tasks.
    Subspace Clustering Based Analysis of Neural Networks. (arXiv:2107.01296v1 [cs.LG])
    (2 min) Tools to analyze the latent space of deep neural networks provide a step towards better understanding them. In this work, we motivate sparse subspace clustering (SSC) with an aim to learn affinity graphs from the latent structure of a given neural network layer trained over a set of inputs. We then use tools from Community Detection to quantify structures present in the input. These experiments reveal that as we go deeper in a network, inputs tend to have an increasing affinity to other inputs of the same class. Subsequently, we utilise matrix similarity measures to perform layer-wise comparisons between affinity graphs. In doing so we first demonstrate that when comparing a given layer currently under training to its final state, the shallower the layer of the network, the quicker it is to converge than the deeper layers. When performing a pairwise analysis of the entire network architecture, we observe that, as the network increases in size, it reorganises from a state where each layer is moderately similar to its neighbours, to a state where layers within a block have high similarity than to layers in other blocks. Finally, we analyze the learned affinity graphs of the final convolutional layer of the network and demonstrate how an input's local neighbourhood affects its classification by the network.
    SPI-GAN: Towards Single-Pixel Imaging through Generative Adversarial Network. (arXiv:2107.01330v1 [cs.CV])
    (2 min) Single-pixel imaging is a novel imaging scheme that has gained popularity due to its huge computational gain and potential for a low-cost alternative to imaging beyond the visible spectrum. The traditional reconstruction methods struggle to produce a clear recovery when one limits the number of illumination patterns from a spatial light modulator. As a remedy, several deep-learning-based solutions have been proposed which lack good generalization ability due to the architectural setup and loss functions. In this paper, we propose a generative adversarial network-based reconstruction framework for single-pixel imaging, referred to as SPI-GAN. Our method can reconstruct images with 17.92 dB PSNR and 0.487 SSIM, even if the sampling ratio drops to 5%. This facilitates much faster reconstruction making our method suitable for single-pixel video. Furthermore, our ResNet-like architecture for the generator leads to useful representation learning that allows us to reconstruct completely unseen objects. The experimental results demonstrate that SPI-GAN achieves significant performance gain, e.g. near 3dB PSNR gain, over the current state-of-the-art method.
    Learning Hierarchical Graph Neural Networks for Image Clustering. (arXiv:2107.01319v1 [cs.CV])
    (2 min) We propose a hierarchical graph neural network (GNN) model that learns how to cluster a set of images into an unknown number of identities using a training set of images annotated with labels belonging to a disjoint set of identities. Our hierarchical GNN uses a novel approach to merge connected components predicted at each level of the hierarchy to form a new graph at the next level. Unlike fully unsupervised hierarchical clustering, the choice of grouping and complexity criteria stems naturally from supervision in the training set. The resulting method, Hi-LANDER, achieves an average of 54% improvement in F-score and 8% increase in Normalized Mutual Information (NMI) relative to current GNN-based clustering algorithms. Additionally, state-of-the-art GNN-based methods rely on separate models to predict linkage probabilities and node densities as intermediate steps of the clustering process. In contrast, our unified framework achieves a seven-fold decrease in computational cost. We release our training and inference code at https://github.com/dmlc/dgl/tree/master/examples/pytorch/hilander.
    Learning Debiased Representation via Disentangled Feature Augmentation. (arXiv:2107.01372v1 [cs.LG])
    (2 min) Image classification models tend to make decisions based on peripheral attributes of data items that have strong correlation with a target variable (i.e., dataset bias). These biased models suffer from the poor generalization capability when evaluated on unbiased datasets. Existing approaches for debiasing often identify and emphasize those samples with no such correlation (i.e., bias-conflicting) without defining the bias type in advance. However, such bias-conflicting samples are significantly scarce in biased datasets, limiting the debiasing capability of these approaches. This paper first presents an empirical analysis revealing that training with "diverse" bias-conflicting samples beyond a given training set is crucial for debiasing as well as the generalization capability. Based on this observation, we propose a novel feature-level data augmentation technique in order to synthesize diverse bias-conflicting samples. To this end, our method learns the disentangled representation of (1) the intrinsic attributes (i.e., those inherently defining a certain class) and (2) bias attributes (i.e., peripheral attributes causing the bias), from a large number of bias-aligned samples, the bias attributes of which have strong correlation with the target variable. Using the disentangled representation, we synthesize bias-conflicting samples that contain the diverse intrinsic attributes of bias-aligned samples by swapping their latent features. By utilizing these diversified bias-conflicting features during the training, our approach achieves superior classification accuracy and debiasing results against the existing baselines on both synthetic as well as real-world datasets.
    Maximum Entropy Weighted Independent Set Pooling for Graph Neural Networks. (arXiv:2107.01410v1 [cs.LG])
    (2 min) In this paper, we propose a novel pooling layer for graph neural networks based on maximizing the mutual information between the pooled graph and the input graph. Since the maximum mutual information is difficult to compute, we employ the Shannon capacity of a graph as an inductive bias to our pooling method. More precisely, we show that the input graph to the pooling layer can be viewed as a representation of a noisy communication channel. For such a channel, sending the symbols belonging to an independent set of the graph yields a reliable and error-free transmission of information. We show that reaching the maximum mutual information is equivalent to finding a maximum weight independent set of the graph where the weights convey entropy contents. Through this communication theoretic standpoint, we provide a distinct perspective for posing the problem of graph pooling as maximizing the information transmission rate across a noisy communication channel, implemented by a graph neural network. We evaluate our method, referred to as Maximum Entropy Weighted Independent Set Pooling (MEWISPool), on graph classification tasks and the combinatorial optimization problem of the maximum independent set. Empirical results demonstrate that our method achieves the state-of-the-art and competitive results on graph classification tasks and the maximum independent set problem in several benchmark datasets.
    Traffic Signal Control with Communicative Deep Reinforcement Learning Agents: a Case Study. (arXiv:2107.01347v1 [cs.MA])
    (2 min) In this work we theoretically and experimentally analyze Multi-Agent Advantage Actor-Critic (MA2C) and Independent Advantage Actor-Critic (IA2C), two recently proposed multi-agent reinforcement learning methods that can be applied to control traffic signals in urban areas. The two methods differ in their use of a reward calculated locally or globally and in the management of agents' communication. We analyze the methods theoretically with the framework provided by non-Markov decision processes, which provides useful insights in the analysis of the algorithms. Moreover, we analyze the efficacy and the robustness of the methods experimentally by testing them in two traffic areas in the Bologna (Italy) area, simulated by SUMO, a software tool. The experimental results indicate that MA2C achieves the best performance in the majority of cases, outperforms the alternative method considered, and displays sufficient stability during the learning process.
    Two Ridge Solutions for the Incremental Broad Learning System on Added Nodes. (arXiv:1911.04872v2 [cs.LG] UPDATED)
    (2 min) The original Broad Learning System (BLS) on new added nodes and its existing efficient implementation both assume the ridge parameter is near 0 in the ridge inverse to approximate the generalized inverse, and compute the generalized inverse solution for the output weights. In this paper, we propose two ridge solutions for the output weights in the BLS on added nodes, where the ridge parameter can be any positive real number. One of the proposed ridge solutions computes the output weights from the inverse Cholesky factor, which is updated by extending the existing inverse Cholesky factorization. The other proposed ridge solution computes the output weights from the ridge inverse, and updates the ridge inverse by extending the Greville method that can only computes the generalized inverse of a partitioned matrix. The proposed BLS algorithm based on the ridge inverse requires the same complexity as the original BLS algorithm, while the proposed BLS algorithm based on the inverse Cholesky factor requires less complexity and training time than the original BLS and the existing efficient BLS. Both the proposed ridge solutions for BLS achieve the same testing accuracy as the standard ridge solution in the numerical experiments. The difference between the testing accuracy of the proposed ridge solutions and that of the existing generalized inverse solutions is negligible when the ridge parameter is very small, and becomes too big to be ignored when the ridge parameter is not very small. When the ridge parameter is not near 0, usually the proposed two ridge solutions for BLS achieve better testing accuracy than the existing generalized inverse solutions for BLS, and then the former are more preferred than the latter.
    Designing Machine Learning Pipeline Toolkit for AutoML Surrogate Modeling Optimization. (arXiv:2107.01253v1 [cs.LG])
    (2 min) The pipeline optimization problem in machine learning requires simultaneous optimization of pipeline structures and parameter adaptation of their elements. Having an elegant way to express these structures can help lessen the complexity in the management and analysis of their performances together with the different choices of optimization strategies. With these issues in mind, we created the AMLP toolkit which facilitates the creation and evaluation of complex machine learning pipeline structures using simple expressions. We use AMLP to find optimal pipeline signatures, datamine them, and use these datamined features to speed-up learning and prediction. We formulated a two-stage pipeline optimization with surrogate modeling in AMLP which outperforms other AutoML approaches with a 4-hour time budget in less than 5 minutes of AMLP computation time.
    Non-Comparative Fairness for Human-Auditing and Its Relation to Traditional Fairness Notions. (arXiv:2107.01277v1 [cs.LG])
    (2 min) Bias evaluation in machine-learning based services (MLS) based on traditional algorithmic fairness notions that rely on comparative principles is practically difficult, making it necessary to rely on human auditor feedback. However, in spite of taking rigorous training on various comparative fairness notions, human auditors are known to disagree on various aspects of fairness notions in practice, making it difficult to collect reliable feedback. This paper offers a paradigm shift to the domain of algorithmic fairness via proposing a new fairness notion based on the principle of non-comparative justice. In contrary to traditional fairness notions where the outcomes of two individuals/groups are compared, our proposed notion compares the MLS' outcome with a desired outcome for each input. This desired outcome naturally describes a human auditor's expectation, and can be easily used to evaluate MLS on crowd-auditing platforms. We show that any MLS can be deemed fair from the perspective of comparative fairness (be it in terms of individual fairness, statistical parity, equal opportunity or calibration) if it is non-comparatively fair with respect to a fair auditor. We also show that the converse holds true in the context of individual fairness. Given that such an evaluation relies on the trustworthiness of the auditor, we also present an approach to identify fair and reliable auditors by estimating their biases with respect to a given set of sensitive attributes, as well as quantify the uncertainty in the estimation of biases within a given MLS. Furthermore, all of the above results are also validated on COMPAS, German credit and Adult Census Income datasets.
    SGLB: Stochastic Gradient Langevin Boosting. (arXiv:2001.07248v4 [cs.LG] UPDATED)
    (2 min) This paper introduces Stochastic Gradient Langevin Boosting (SGLB) - a powerful and efficient machine learning framework that may deal with a wide range of loss functions and has provable generalization guarantees. The method is based on a special form of the Langevin diffusion equation specifically designed for gradient boosting. This allows us to theoretically guarantee the global convergence even for multimodal loss functions, while standard gradient boosting algorithms can guarantee only local optimum. We also empirically show that SGLB outperforms classic gradient boosting when applied to classification tasks with 0-1 loss function, which is known to be multimodal.
    Privacy-Preserving Representation Learning on Graphs: A Mutual Information Perspective. (arXiv:2107.01475v1 [cs.LG])
    (2 min) Learning with graphs has attracted significant attention recently. Existing representation learning methods on graphs have achieved state-of-the-art performance on various graph-related tasks such as node classification, link prediction, etc. However, we observe that these methods could leak serious private information. For instance, one can accurately infer the links (or node identity) in a graph from a node classifier (or link predictor) trained on the learnt node representations by existing methods. To address the issue, we propose a privacy-preserving representation learning framework on graphs from the \emph{mutual information} perspective. Specifically, our framework includes a primary learning task and a privacy protection task, and we consider node classification and link prediction as the two tasks of interest. Our goal is to learn node representations such that they can be used to achieve high performance for the primary learning task, while obtaining performance for the privacy protection task close to random guessing. We formally formulate our goal via mutual information objectives. However, it is intractable to compute mutual information in practice. Then, we derive tractable variational bounds for the mutual information terms, where each bound can be parameterized via a neural network. Next, we train these parameterized neural networks to approximate the true mutual information and learn privacy-preserving node representations. We finally evaluate our framework on various graph datasets.
    The Last-Iterate Convergence Rate of Optimistic Mirror Descent in Stochastic Variational Inequalities. (arXiv:2107.01906v1 [math.OC])
    (2 min) In this paper, we analyze the local convergence rate of optimistic mirror descent methods in stochastic variational inequalities, a class of optimization problems with important applications to learning theory and machine learning. Our analysis reveals an intricate relation between the algorithm's rate of convergence and the local geometry induced by the method's underlying Bregman function. We quantify this relation by means of the Legendre exponent, a notion that we introduce to measure the growth rate of the Bregman divergence relative to the ambient norm near a solution. We show that this exponent determines both the optimal step-size policy of the algorithm and the optimal rates attained, explaining in this way the differences observed for some popular Bregman functions (Euclidean projection, negative entropy, fractional power, etc.).
    Learning Complex Users' Preferences for Recommender Systems. (arXiv:2107.01529v1 [cs.IR])
    (2 min) Recommender systems (RSs) have emerged as very useful tools to help customers with their decision-making process, find items of their interest, and alleviate the information overload problem. There are two different lines of approaches in RSs: (1) general recommenders with the main goal of discovering long-term users' preferences, and (2) sequential recommenders with the main focus of capturing short-term users' preferences in a session of user-item interaction (here, a session refers to a record of purchasing multiple items in one shopping event). While considering short-term users' preferences may satisfy their current needs and interests, long-term users' preferences provide users with the items that they may interact with, eventually. In this thesis, we first focus on improving the performance of general RSs. Most of the existing general RSs tend to exploit the users' rating patterns on common items to detect similar users. The data sparsity problem (i.e. the lack of available information) is one of the major challenges for the current general RSs, and they may fail to have any recommendations when there are no common items of interest among users. We call this problem data sparsity with no feedback on common items (DSW-n-FCI). To overcome this problem, we propose a personality-based RS in which similar users are identified based on the similarity of their personality traits.
    Dual Causal/Non-Causal Self-Attention for Streaming End-to-End Speech Recognition. (arXiv:2107.01269v1 [eess.AS])
    (2 min) Attention-based end-to-end automatic speech recognition (ASR) systems have recently demonstrated state-of-the-art results for numerous tasks. However, the application of self-attention and attention-based encoder-decoder models remains challenging for streaming ASR, where each word must be recognized shortly after it was spoken. In this work, we present the dual causal/non-causal self-attention (DCN) architecture, which in contrast to restricted self-attention prevents the overall context to grow beyond the look-ahead of a single layer when used in a deep architecture. DCN is compared to chunk-based and restricted self-attention using streaming transformer and conformer architectures, showing improved ASR performance over restricted self-attention and competitive ASR results compared to chunk-based self-attention, while providing the advantage of frame-synchronous processing. Combined with triggered attention, the proposed streaming end-to-end ASR systems obtained state-of-the-art results on the LibriSpeech, HKUST, and Switchboard ASR tasks.
    Byzantine-robust Federated Learning through Spatial-temporal Analysis of Local Model Updates. (arXiv:2107.01477v1 [cs.LG])
    (2 min) Federated Learning (FL) enables multiple distributed clients (e.g., mobile devices) to collaboratively train a centralized model while keeping the training data locally on the client. Compared to traditional centralized machine learning, FL offers many favorable features such as offloading operations which would usually be performed by a central server and reducing risks of serious privacy leakage. However, Byzantine clients that send incorrect or disruptive updates due to system failures or adversarial attacks may disturb the joint learning process, consequently degrading the performance of the resulting model. In this paper, we propose to mitigate these failures and attacks from a spatial-temporal perspective. Specifically, we use a clustering-based method to detect and exclude incorrect updates by leveraging their geometric properties in the parameter space. Moreover, to further handle malicious clients with time-varying behaviors, we propose to adaptively adjust the learning rate according to momentum-based update speculation. Extensive experiments on 4 public datasets demonstrate that our algorithm achieves enhanced robustness comparing to existing methods under both cross-silo and cross-device FL settings with faulty/malicious clients.
    Where is the Grass Greener? Revisiting Generalized Policy Iteration for Offline Reinforcement Learning. (arXiv:2107.01407v1 [cs.LG])
    (2 min) The performance of state-of-the-art baselines in the offline RL regime varies widely over the spectrum of dataset qualities, ranging from "far-from-optimal" random data to "close-to-optimal" expert demonstrations. We re-implement these under a fair, unified, and highly factorized framework, and show that when a given baseline outperforms its competing counterparts on one end of the spectrum, it never does on the other end. This consistent trend prevents us from naming a victor that outperforms the rest across the board. We attribute the asymmetry in performance between the two ends of the quality spectrum to the amount of inductive bias injected into the agent to entice it to posit that the behavior underlying the offline dataset is optimal for the task. The more bias is injected, the higher the agent performs, provided the dataset is close-to-optimal. Otherwise, its effect is brutally detrimental. Adopting an advantage-weighted regression template as base, we conduct an investigation which corroborates that injections of such optimality inductive bias, when not done parsimoniously, makes the agent subpar in the datasets it was dominant as soon as the offline policy is sub-optimal. In an effort to design methods that perform well across the whole spectrum, we revisit the generalized policy iteration scheme for the offline regime, and study the impact of nine distinct newly-introduced proposal distributions over actions, involved in proposed generalization of the policy evaluation and policy improvement update rules. We show that certain orchestrations strike the right balance and can improve the performance on one end of the spectrum without harming it on the other end.
    Spatiotemporal convolutional network for time-series prediction and causal inference. (arXiv:2107.01353v1 [cs.LG])
    (2 min) Making predictions in a robust way is not easy for nonlinear systems. In this work, a neural network computing framework, i.e., a spatiotemporal convolutional network (STCN), was developed to efficiently and accurately render a multistep-ahead prediction of a time series by employing a spatial-temporal information (STI) transformation. The STCN combines the advantages of both the temporal convolutional network (TCN) and the STI equation, which maps the high-dimensional/spatial data to the future temporal values of a target variable, thus naturally providing the prediction of the target variable. From the observed variables, the STCN also infers the causal factors of the target variable in the sense of Granger causality, which are in turn selected as effective spatial information to improve the prediction robustness. The STCN was successfully applied to both benchmark systems and real-world datasets, all of which show superior and robust performance in multistep-ahead prediction, even when the data were perturbed by noise. From both theoretical and computational viewpoints, the STCN has great potential in practical applications in artificial intelligence (AI) or machine learning fields as a model-free method based only on the observed data, and also opens a new way to explore the observed high-dimensional data in a dynamical manner for machine learning.
    Sparse Linear Networks with a Fixed Butterfly Structure: Theory and Practice. (arXiv:2007.08864v2 [cs.LG] UPDATED)
    (2 min) A butterfly network consists of logarithmically many layers, each with a linear number of non-zero weights (pre-specified). The fast Johnson-Lindenstrauss transform (FJLT) can be represented as a butterfly network followed by a projection onto a random subset of the coordinates. Moreover, a random matrix based on FJLT with high probability approximates the action of any matrix on a vector. Motivated by these facts, we propose to replace a dense linear layer in any neural network by an architecture based on the butterfly network. The proposed architecture significantly improves upon the quadratic number of weights required in a standard dense layer to nearly linear with little compromise in expressibility of the resulting operator. In a collection of wide variety of experiments, including supervised prediction on both the NLP and vision data, we show that this not only produces results that match and at times outperform existing well-known architectures, but it also offers faster training and prediction in deployment. To understand the optimization problems posed by neural networks with a butterfly network, we also study the optimization landscape of the encoder-decoder network, where the encoder is replaced by a butterfly network followed by a dense linear layer in smaller dimension. Theoretical result presented in the paper explains why the training speed and outcome are not compromised by our proposed approach.
    Mava: a research framework for distributed multi-agent reinforcement learning. (arXiv:2107.01460v1 [cs.LG])
    (2 min) Breakthrough advances in reinforcement learning (RL) research have led to a surge in the development and application of RL. To support the field and its rapid growth, several frameworks have emerged that aim to help the community more easily build effective and scalable agents. However, very few of these frameworks exclusively support multi-agent RL (MARL), an increasingly active field in itself, concerned with decentralised decision-making problems. In this work, we attempt to fill this gap by presenting Mava: a research framework specifically designed for building scalable MARL systems. Mava provides useful components, abstractions, utilities and tools for MARL and allows for simple scaling for multi-process system training and execution, while providing a high level of flexibility and composability. Mava is built on top of DeepMind's Acme \citep{hoffman2020acme}, and therefore integrates with, and greatly benefits from, a wide range of already existing single-agent RL components made available in Acme. Several MARL baseline systems have already been implemented in Mava. These implementations serve as examples showcasing Mava's reusable features, such as interchangeable system architectures, communication and mixing modules. Furthermore, these implementations allow existing MARL algorithms to be easily reproduced and extended. We provide experimental results for these implementations on a wide range of multi-agent environments and highlight the benefits of distributed system training.
    WisdomNet: Prognosis of COVID-19 with Slender Prospect of False Negative Cases and Vaticinating the Probability of Maturation to ARDS using Posteroanterior Chest X-Rays. (arXiv:2107.01392v1 [eess.IV])
    (3 min) Coronavirus is a large virus family consisting of diverse viruses, some of which disseminate among mammals and others cause sickness among humans. COVID-19 is highly contagious and is rapidly spreading, rendering its early diagnosis of preeminent status. Researchers, medical specialists and organizations all over the globe have been working tirelessly to combat this virus and help in its containment. In this paper, a novel neural network called WisdomNet has been proposed, for the diagnosis of COVID-19 using chest X-rays. The WisdomNet uses the concept of Wisdom of Crowds as its founding idea. It is a two-layered convolutional Neural Network (CNN), which takes chest x-ray images as input. Both layers of the proposed neural network consist of a number of neural networks each. The dataset used for this study consists of chest x-ray images of COVID-19 positive patients, compiled and shared by Dr. Cohen on GitHub, and the chest x-ray images of healthy lungs and lungs affected by viral and bacterial pneumonia were obtained from Kaggle. The network not only pinpoints the presence of COVID-19, but also gives the probability of the disease maturing into Acute Respiratory Distress Syndrome (ARDS). Thus, predicting the progression of the disease in the COVID-19 positive patients. The network also slender the occurrences of false negative cases by employing a high threshold value, thus aids in curbing the spread of the disease and gives an accuracy of 100% for successfully predicting COVID-19 among the chest x-rays of patients affected with COVID-19, bacterial and viral pneumonia.
    Memory and attention in deep learning. (arXiv:2107.01390v1 [cs.LG])
    (2 min) Intelligence necessitates memory. Without memory, humans fail to perform various nontrivial tasks such as reading novels, playing games or solving maths. As the ultimate goal of machine learning is to derive intelligent systems that learn and act automatically just like human, memory construction for machine is inevitable. Artificial neural networks model neurons and synapses in the brain by interconnecting computational units via weights, which is a typical class of machine learning algorithms that resembles memory structure. Their descendants with more complicated modeling techniques (a.k.a deep learning) have been successfully applied to many practical problems and demonstrated the importance of memory in the learning process of machinery systems. Recent progresses on modeling memory in deep learning have revolved around external memory constructions, which are highly inspired by computational Turing models and biological neuronal systems. Attention mechanisms are derived to support acquisition and retention operations on the external memory. Despite the lack of theoretical foundations, these approaches have shown promises to help machinery systems reach a higher level of intelligence. The aim of this thesis is to advance the understanding on memory and attention in deep learning. Its contributions include: (i) presenting a collection of taxonomies for memory, (ii) constructing new memory-augmented neural networks (MANNs) that support multiple control and memory units, (iii) introducing variability via memory in sequential generative models, (iv) searching for optimal writing operations to maximise the memorisation capacity in slot-based memory networks, and (v) simulating the Universal Turing Machine via Neural Stored-program Memory-a new kind of external memory for neural networks.
    AdaL: Adaptive Gradient Transformation Contributes to Convergences and Generalizations. (arXiv:2107.01525v1 [cs.LG])
    (2 min) Adaptive optimization methods have been widely used in deep learning. They scale the learning rates adaptively according to the past gradient, which has been shown to be effective to accelerate the convergence. However, they suffer from poor generalization performance compared with SGD. Recent studies point that smoothing exponential gradient noise leads to generalization degeneration phenomenon. Inspired by this, we propose AdaL, with a transformation on the original gradient. AdaL accelerates the convergence by amplifying the gradient in the early stage, as well as dampens the oscillation and stabilizes the optimization by shrinking the gradient later. Such modification alleviates the smoothness of gradient noise, which produces better generalization performance. We have theoretically proved the convergence of AdaL and demonstrated its effectiveness on several benchmarks.
    Truncated Marginal Neural Ratio Estimation. (arXiv:2107.01214v1 [stat.ML])
    (2 min) Parametric stochastic simulators are ubiquitous in science, often featuring high-dimensional input parameters and/or an intractable likelihood. Performing Bayesian parameter inference in this context can be challenging. We present a neural simulator-based inference algorithm which simultaneously offers simulation efficiency and fast empirical posterior testability, which is unique among modern algorithms. Our approach is simulation efficient by simultaneously estimating low-dimensional marginal posteriors instead of the joint posterior and by proposing simulations targeted to an observation of interest via a prior suitably truncated by an indicator function. Furthermore, by estimating a locally amortized posterior our algorithm enables efficient empirical tests of the robustness of the inference results. Such tests are important for sanity-checking inference in real-world applications, which do not feature a known ground truth. We perform experiments on a marginalized version of the simulation-based inference benchmark and two complex and narrow posteriors, highlighting the simulator efficiency of our algorithm as well as the quality of the estimated marginal posteriors. Implementation on GitHub.
    SHORING: Design Provable Conditional High-Order Interaction Network via Symbolic Testing. (arXiv:2107.01326v1 [cs.LG])
    (2 min) Deep learning provides a promising way to extract effective representations from raw data in an end-to-end fashion and has proven its effectiveness in various domains such as computer vision, natural language processing, etc. However, in domains such as content/product recommendation and risk management, where sequence of event data is the most used raw data form and experts derived features are more commonly used, deep learning models struggle to dominate the game. In this paper, we propose a symbolic testing framework that helps to answer the question of what kinds of expert-derived features could be learned by a neural network. Inspired by this testing framework, we introduce an efficient architecture named SHORING, which contains two components: \textit{event network} and \textit{sequence network}. The \textit{event} network learns arbitrarily yet efficiently high-order \textit{event-level} embeddings via a provable reparameterization trick, the \textit{sequence} network aggregates from sequence of \textit{event-level} embeddings. We argue that SHORING is capable of learning certain standard symbolic expressions which the standard multi-head self-attention network fails to learn, and conduct comprehensive experiments and ablation studies on four synthetic datasets and three real-world datasets. The results show that SHORING empirically outperforms the state-of-the-art methods.
    Short-term probabilistic photovoltaic power forecast based on deep convolutional long short-term memory network and kernel density estimation. (arXiv:2107.01343v1 [cs.LG])
    (2 min) Solar energy is a clean and renewable energy. Photovoltaic (PV) power is an important way to utilize solar energy. Accurate PV power forecast is crucial to the large-scale application of PV power and the stability of electricity grid. This paper proposes a novel method for short-term photovoltaic power forecast using deep convolutional long short-term memory (ConvLSTM) network and kernel density estimation (KDE). In the proposed method, ConvLSTM is used to forecast the future photovoltaic power and KDE is used for estimating the joint probabilistic density function and giving the probabilistic confidence interval. Experiments in an actual photovoltaic power station verify the effectiveness of the proposed method. Comparison experiments with convolutional neural network (CNN) and long short-term memory network (LSTM)shows that ConvLSTM can combine the advantages of both CNN and LSTM and significantly outperform CNN and LSTM in terms of forecast accuracy. Through further comparison with other five conventional methods including multilayer perceptron (MLP), support vector regression (SVR), extreme learning machine (ELM), classification and regression tree (CART) and gradient boosting decision tree (GBDT), ConvLSTM can significantly improve the forecast accuracy by more than 20% for most of the five methods and the superiorities of ConvLSTM are further verified.
    Pulmonary Vessel Segmentation based on Orthogonal Fused U-Net++ of Chest CT Images. (arXiv:2107.01502v1 [eess.IV])
    (2 min) Pulmonary vessel segmentation is important for clinical diagnosis of pulmonary diseases, while is also challenging due to the complicated structure. In this work, we present an effective framework and refinement process of pulmonary vessel segmentation from chest computed tomographic (CT) images. The key to our approach is a 2.5D segmentation network applied from three orthogonal axes, which presents a robust and fully automated pulmonary vessel segmentation result with lower network complexity and memory usage compared to 3D networks. The slice radius is introduced to convolve the adjacent information of the center slice and the multi-planar fusion optimizes the presentation of intra- and inter- slice features. Besides, the tree-like structure of the pulmonary vessel is extracted in the post-processing process, which is used for segmentation refining and pruning. In the evaluation experiments, three fusion methods are tested and the most promising one is compared with the state-of-the-art 2D and 3D structures on 300 cases of lung images randomly selected from LIDC dataset. Our method outperforms other network structures by a large margin and achieves by far the highest average DICE score of 0.9272 and precision of 0.9310, as per our knowledge from the pulmonary vessel segmentation models available in the literature.
    Domain Adaptation for Sentiment Analysis Using Increased Intraclass Separation. (arXiv:2107.01598v1 [cs.CL])
    (2 min) Sentiment analysis is a costly yet necessary task for enterprises to study the opinions of their customers to improve their products and to determine optimal marketing strategies. Due to the existence of a wide range of domains across different products and services, cross-domain sentiment analysis methods have received significant attention. These methods mitigate the domain gap between different applications by training cross-domain generalizable classifiers which help to relax the need for data annotation for each domain. Most existing methods focus on learning domain-agnostic representations that are invariant with respect to both the source and the target domains. As a result, a classifier that is trained using the source domain annotated data would generalize well in a related target domain. We introduce a new domain adaptation method which induces large margins between different classes in an embedding space. This embedding space is trained to be domain-agnostic by matching the data distributions across the domains. Large intraclass margins in the source domain help to reduce the effect of "domain shift" on the classifier performance in the target domain. Theoretical and empirical analysis are provided to demonstrate that the proposed method is effective.
    A convolutional neural network for prestack fracture detection. (arXiv:2107.01466v1 [physics.geo-ph])
    (2 min) Fractures are widely developed in hydrocarbon reservoirs and constitute the accumulation spaces and transport channels of oil and gas. Fracture detection is a fundamental task for reservoir characterization. From prestack seismic gathers, anisotropic analysis and inversion were commonly applied to characterize the dominant orientations and relative intensities of fractures. However, the existing methods were mostly based on the vertical aligned facture hypothesis, it is impossible for them to recognize fracture dip. Furthermore, it is difficult or impractical for existing methods to attain the real fracture densities. Based on data-driven deep learning, this paper designed a convolutional neural network to perform prestack fracture detection. Capitalizing on the connections between seismic responses and fracture parameters, a suitable azimuth dataset was firstly generated through fracture effective medium modeling and anisotropic plane wave analyzing. Then a multi-input and multi-output convolutional neural network was constructed to simultaneously detect fracture density, dip and strike azimuth. The application on a practical survey validated the effectiveness of the proposed CNN model.
    Smoothed Differential Privacy. (arXiv:2107.01559v1 [cs.CR])
    (2 min) Differential privacy (DP) is a widely-accepted and widely-applied notion of privacy based on worst-case analysis. Often, DP classifies most mechanisms without external noise as non-private [Dwork et al., 2014], and external noises, such as Gaussian noise or Laplacian noise [Dwork et al., 2006], are introduced to improve privacy. In many real-world applications, however, adding external noise is undesirable and sometimes prohibited. For example, presidential elections often require a deterministic rule to be used [Liu et al., 2020], and small noises can lead to dramatic decreases in the prediction accuracy of deep neural networks, especially the underrepresented classes [Bagdasaryan et al., 2019]. In this paper, we propose a natural extension and relaxation of DP following the worst average-case idea behind the celebrated smoothed analysis [Spielman and Teng, 2004]. Our notion, the smoothed DP, can effectively measure the privacy leakage of mechanisms without external noises under realistic settings. We prove several strong properties of the smoothed DP, including composability, robustness to post-processing and etc. We proved that any discrete mechanism with sampling procedures is more private than what DP predicts. In comparison, many continuous mechanisms with sampling procedures are still non-private under smoothed DP. Experimentally, we first verified that the discrete sampling mechanisms are private in real-world elections. Then, we apply the smoothed DP notion on quantized gradient descent, which indicates some neural networks can be private without adding any extra noises. We believe that these results contribute to the theoretical foundation of realistic privacy measures beyond worst-case analysis.
    BAGUA: Scaling up Distributed Learning with System Relaxations. (arXiv:2107.01499v1 [cs.LG])
    (2 min) Recently years have witnessed a growing list of systems for distributed data-parallel training. Existing systems largely fit into two paradigms, i.e., parameter server and MPI-style collective operations. On the algorithmic side, researchers have proposed a wide range of techniques to lower the communication via system relaxations: quantization, decentralization, and communication delay. However, most, if not all, existing systems only rely on standard synchronous and asynchronous stochastic gradient (SG) based optimization, therefore, cannot take advantage of all possible optimizations that the machine learning community has been developing recently. Given this emerging gap between the current landscapes of systems and theory, we build BAGUA, a communication framework whose design goal is to provide a system abstraction that is both flexible and modular to support state-of-the-art system relaxation techniques of distributed training. Powered by the new system design, BAGUA has a great ability to implement and extend various state-of-the-art distributed learning algorithms. In a production cluster with up to 16 machines (128 GPUs), BAGUA can outperform PyTorch-DDP, Horovod and BytePS in the end-to-end training time by a significant margin (up to 1.95 times) across a diverse range of tasks. Moreover, we conduct a rigorous tradeoff exploration showing that different algorithms and system relaxations achieve the best performance over different network conditions.
    Supervised Off-Policy Ranking. (arXiv:2107.01360v1 [cs.LG])
    (2 min) Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy. Previous OPE methods mainly focus on precisely estimating the true performance of a policy. We observe that in many applications, (1) the end goal of OPE is to compare two or multiple candidate policies and choose a good one, which is actually a much simpler task than evaluating their true performance; and (2) there are usually multiple policies that have been deployed in real-world systems and thus whose true performance is known through serving real users. Inspired by the two observations, in this work, we define a new problem, supervised off-policy ranking (SOPR), which aims to rank a set of new/target policies based on supervised learning by leveraging off-policy data and policies with known performance. We further propose a method for supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance rather than estimating their precise performance. Our method leverages logged states and policies to learn a Transformer based model that maps offline interaction data including logged states and the actions taken by a target policy on these states to a score. Experiments on different games, datasets, training policy sets, and test policy sets show that our method outperforms strong baseline OPE methods in terms of both rank correlation and performance gap between the truly best and the best of the ranked top three policies. Furthermore, our method is more stable than baseline methods.
    Beyond Value-Function Gaps: Improved Instance-Dependent Regret Bounds for Episodic Reinforcement Learning. (arXiv:2107.01264v1 [cs.LG])
    (2 min) We provide improved gap-dependent regret bounds for reinforcement learning in finite episodic Markov decision processes. Compared to prior work, our bounds depend on alternative definitions of gaps. These definitions are based on the insight that, in order to achieve a favorable regret, an algorithm does not need to learn how to behave optimally in states that are not reached by an optimal policy. We prove tighter upper regret bounds for optimistic algorithms and accompany them with new information-theoretic lower bounds for a large class of MDPs. Our results show that optimistic algorithms can not achieve the information-theoretic lower bounds even in deterministic MDPs unless there is a unique optimal policy.
    Cluster Representatives Selection in Non-Metric Spaces for Nearest Prototype Classification. (arXiv:2107.01345v1 [cs.LG])
    (2 min) The nearest prototype classification is a less computationally intensive replacement for the $k$-NN method, especially when large datasets are considered. In metric spaces, centroids are often used as prototypes to represent whole clusters. The selection of cluster prototypes in non-metric spaces is more challenging as the idea of computing centroids is not directly applicable. In this paper, we present CRS, a novel method for selecting a small yet representative subset of objects as a cluster prototype. Memory and computationally efficient selection of representatives is enabled by leveraging the similarity graph representation of each cluster created by the NN-Descent algorithm. CRS can be used in an arbitrary metric or non-metric space because of the graph-based approach, which requires only a pairwise similarity measure. As we demonstrate in the experimental evaluation, our method outperforms the state of the art techniques on multiple datasets from different domains.
    Minimum Wasserstein Distance Estimator under Finite Location-scale Mixtures. (arXiv:2107.01323v1 [stat.ML])
    (2 min) When a population exhibits heterogeneity, we often model it via a finite mixture: decompose it into several different but homogeneous subpopulations. Contemporary practice favors learning the mixtures by maximizing the likelihood for statistical efficiency and the convenient EM-algorithm for numerical computation. Yet the maximum likelihood estimate (MLE) is not well defined for the most widely used finite normal mixture in particular and for finite location-scale mixture in general. We hence investigate feasible alternatives to MLE such as minimum distance estimators. Recently, the Wasserstein distance has drawn increased attention in the machine learning community. It has intuitive geometric interpretation and is successfully employed in many new applications. Do we gain anything by learning finite location-scale mixtures via a minimum Wasserstein distance estimator (MWDE)? This paper investigates this possibility in several respects. We find that the MWDE is consistent and derive a numerical solution under finite location-scale mixtures. We study its robustness against outliers and mild model mis-specifications. Our moderate scaled simulation study shows the MWDE suffers some efficiency loss against a penalized version of MLE in general without noticeable gain in robustness. We reaffirm the general superiority of the likelihood based learning strategies even for the non-regular finite location-scale mixtures.
    Can Transformers Jump Around Right in Natural Language? Assessing Performance Transfer from SCAN. (arXiv:2107.01366v1 [cs.CL])
    (2 min) Despite their practical success, modern seq2seq architectures are unable to generalize systematically on several SCAN tasks. Hence, it is not clear if SCAN-style compositional generalization is useful in realistic NLP tasks. In this work, we study the benefit that such compositionality brings about to several machine translation tasks. We present several focused modifications of Transformer that greatly improve generalization capabilities on SCAN and select one that remains on par with a vanilla Transformer on a standard machine translation (MT) task. Next, we study its performance in low-resource settings and on a newly introduced distribution-shifted English-French translation task. Overall, we find that improvements of a SCAN-capable model do not directly transfer to the resource-rich MT setup. In contrast, in the low-resource setup, general modifications lead to an improvement of up to 13.1% BLEU score w.r.t. a vanilla Transformer. Similarly, an improvement of 14% in an accuracy-based metric is achieved in the introduced compositional English-French translation task. This provides experimental evidence that the compositional generalization assessed in SCAN is particularly useful in resource-starved and domain-shifted scenarios.
    Incorporating Reachability Knowledge into a Multi-Spatial Graph Convolution Based Seq2Seq Model for Traffic Forecasting. (arXiv:2107.01528v1 [cs.LG])
    (2 min) Accurate traffic state prediction is the foundation of transportation control and guidance. It is very challenging due to the complex spatiotemporal dependencies in traffic data. Existing works cannot perform well for multi-step traffic prediction that involves long future time period. The spatiotemporal information dilution becomes serve when the time gap between input step and predicted step is large, especially when traffic data is not sufficient or noisy. To address this issue, we propose a multi-spatial graph convolution based Seq2Seq model. Our main novelties are three aspects: (1) We enrich the spatiotemporal information of model inputs by fusing multi-view features (time, location and traffic states) (2) We build multiple kinds of spatial correlations based on both prior knowledge and data-driven knowledge to improve model performance especially in insufficient or noisy data cases. (3) A spatiotemporal attention mechanism based on reachability knowledge is novelly designed to produce high-level features fed into decoder of Seq2Seq directly to ease information dilution. Our model is evaluated on two real world traffic datasets and achieves better performance than other competitors.
    Data-driven mapping between functional connectomes using optimal transport. (arXiv:2107.01303v1 [q-bio.NC])
    (2 min) Functional connectomes derived from functional magnetic resonance imaging have long been used to understand the functional organization of the brain. Nevertheless, a connectome is intrinsically linked to the atlas used to create it. In other words, a connectome generated from one atlas is different in scale and resolution compared to a connectome generated from another atlas. Being able to map connectomes and derived results between different atlases without additional pre-processing is a crucial step in improving interpretation and generalization between studies that use different atlases. Here, we use optimal transport, a powerful mathematical technique, to find an optimum mapping between two atlases. This mapping is then used to transform time series from one atlas to another in order to reconstruct a connectome. We validate our approach by comparing transformed connectomes against their "gold-standard" counterparts (i.e., connectomes generated directly from an atlas) and demonstrate the utility of transformed connectomes by applying these connectomes to predictive models based on a different atlas. We show that these transformed connectomes are significantly similar to their "gold-standard" counterparts and maintain individual differences in brain-behavior associations, demonstrating both the validity of our approach and its utility in downstream analyses. Overall, our approach is a promising avenue to increase the generalization of connectome-based results across different atlases.
    Implicit Greedy Rank Learning in Autoencoders via Overparameterized Linear Networks. (arXiv:2107.01301v1 [cs.LG])
    (2 min) Deep linear networks trained with gradient descent yield low rank solutions, as is typically studied in matrix factorization. In this paper, we take a step further and analyze implicit rank regularization in autoencoders. We show greedy learning of low-rank latent codes induced by a linear sub-network at the autoencoder bottleneck. We further propose orthogonal initialization and principled learning rate adjustment to mitigate sensitivity of training dynamics to spectral prior and linear depth. With linear autoencoders on synthetic data, our method converges stably to ground-truth latent code rank. With nonlinear autoencoders, our method converges to latent ranks optimal for downstream classification and image sampling.
    On Positional and Structural Node Features for Graph Neural Networks on Non-attributed Graphs. (arXiv:2107.01495v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) have been widely used in various graph-related problems such as node classification and graph classification, where the superior performance is mainly established when natural node features are available. However, it is not well understood how GNNs work without natural node features, especially regarding the various ways to construct artificial ones. In this paper, we point out the two types of artificial node features,i.e., positional and structural node features, and provide insights on why each of them is more appropriate for certain tasks,i.e., positional node classification, structural node classification, and graph classification. Extensive experimental results on 10 benchmark datasets validate our insights, thus leading to a practical guideline on the choices between different artificial node features for GNNs on non-attributed graphs. The code is available at https://github.com/zjzijielu/gnn-exp/.
    Relaxed Attention: A Simple Method to Boost Performance of End-to-End Automatic Speech Recognition. (arXiv:2107.01275v1 [eess.AS])
    (2 min) Recently, attention-based encoder-decoder (AED) models have shown high performance for end-to-end automatic speech recognition (ASR) across several tasks. Addressing overconfidence in such models, in this paper we introduce the concept of relaxed attention, which is a simple gradual injection of a uniform distribution to the encoder-decoder attention weights during training that is easily implemented with two lines of code. We investigate the effect of relaxed attention across different AED model architectures and two prominent ASR tasks, Wall Street Journal (WSJ) and Librispeech. We found that transformers trained with relaxed attention outperform the standard baseline models consistently during decoding with external language models. On WSJ, we set a new benchmark for transformer-based end-to-end speech recognition with a word error rate of 3.65%, outperforming state of the art (4.20%) by 13.1% relative, while introducing only a single hyperparameter. Upon acceptance, models will be published on github.
    Isotonic Data Augmentation for Knowledge Distillation. (arXiv:2107.01412v1 [cs.LG])
    (2 min) Knowledge distillation uses both real hard labels and soft labels predicted by teacher models as supervision. Intuitively, we expect the soft labels and hard labels to be concordant w.r.t. their orders of probabilities. However, we found {\it critical order violations} between hard labels and soft labels in augmented samples. For example, for an augmented sample $x=0.7*panda+0.3*cat$, we expect the order of meaningful soft labels to be $P_\text{soft}(panda|x)>P_\text{soft}(cat|x)>P_\text{soft}(other|x)$. But real soft labels usually violate the order, e.g. $P_\text{soft}(tiger|x)>P_\text{soft}(panda|x)>P_\text{soft}(cat|x)$. We attribute this to the unsatisfactory generalization ability of the teacher, which leads to the prediction error of augmented samples. Empirically, we found the violations are common and injure the knowledge transfer.In this paper, we introduce order restrictions to data augmentation for knowledge distillation, which is denoted as isotonic data augmentation (IDA). We use isotonic regression (IR) -- a classic technique from statistics -- to eliminate the order violations. We show that IDA can be modeled as a tree-structured IR problem. We thereby adapt the classical IRT-BIN algorithm for optimal solutions with $O(c \log c)$ time complexity, where $c$ is the number of labels. In order to further reduce the time complexity, we also \cwy{propose} a GPU-friendly approximation with linear time complexity. We have verified on variant datasets and data augmentation techniques that our proposed IDA algorithms effectively increases the accuracy of knowledge distillation by eliminating the rank violations.
    Examining average and discounted reward optimality criteria in reinforcement learning. (arXiv:2107.01348v1 [cs.LG])
    (2 min) In reinforcement learning (RL), the goal is to obtain an optimal policy, for which the optimality criterion is fundamentally important. Two major optimality criteria are average and discounted rewards, where the later is typically considered as an approximation to the former. While the discounted reward is more popular, it is problematic to apply in environments that have no natural notion of discounting. This motivates us to revisit a) the progression of optimality criteria in dynamic programming, b) justification for and complication of an artificial discount factor, and c) benefits of directly maximizing the average reward. Our contributions include a thorough examination of the relationship between average and discounted rewards, as well as a discussion of their pros and cons in RL. We emphasize that average-reward RL methods possess the ingredient and mechanism for developing the general discounting-free optimality criterion (Veinott, 1969) in RL.
    Clustering of Time Series Data with Prior Geographical Information. (arXiv:2107.01310v1 [cs.LG])
    (2 min) Time Series data are broadly studied in various domains of transportation systems. Traffic data area challenging example of spatio-temporal data, as it is multi-variate time series with high correlations in spatial and temporal neighborhoods. Spatio-temporal clustering of traffic flow data find similar patterns in both spatial and temporal domain, where it provides better capability for analyzing a transportation network, and improving related machine learning models, such as traffic flow prediction and anomaly detection. In this paper, we propose a spatio-temporal clustering model, where it clusters time series data based on spatial and temporal contexts. We propose a variation of a Deep Embedded Clustering(DEC) model for finding spatio-temporal clusters. The proposed model Spatial-DEC (S-DEC) use prior geographical information in building latent feature representations. We also define evaluation metrics for spatio-temporal clusters. Not only do the obtained clusters have better temporal similarity when evaluated using DTW distance, but also the clusters better represents spatial connectivity and dis-connectivity. We use traffic flow data obtained by PeMS in our analysis. The results show that the proposed Spatial-DEC can find more desired spatio-temporal clusters.
    Prescient teleoperation of humanoid robots. (arXiv:2107.01281v1 [cs.RO])
    (2 min) Humanoid robots could be versatile and intuitive human avatars that operate remotely in inaccessible places: the robot could reproduce in the remote location the movements of an operator equipped with a wearable motion capture device while sending visual feedback to the operator. While substantial progress has been made on transferring ("retargeting") human motions to humanoid robots, a major problem preventing the deployment of such systems in real applications is the presence of communication delays between the human input and the feedback from the robot: even a few hundred milliseconds of delay can irreversibly disturb the operator, let alone a few seconds. To overcome these delays, we introduce a system in which a humanoid robot executes commands before it actually receives them, so that the visual feedback appears to be synchronized to the operator, whereas the robot executed the commands in the past. To do so, the robot continuously predicts future commands by querying a machine learning model that is trained on past trajectories and conditioned on the last received commands. In our experiments, an operator was able to successfully control a humanoid robot (32 degrees of freedom) with stochastic delays up to 2 seconds in several whole-body manipulation tasks, including reaching different targets, picking up, and placing a box at distinct locations.
    CInC Flow: Characterizable Invertible 3x3 Convolution. (arXiv:2107.01358v1 [cs.LG])
    (2 min) Normalizing flows are an essential alternative to GANs for generative modelling, which can be optimized directly on the maximum likelihood of the dataset. They also allow computation of the exact latent vector corresponding to an image since they are composed of invertible transformations. However, the requirement of invertibility of the transformation prevents standard and expressive neural network models such as CNNs from being directly used. Emergent convolutions were proposed to construct an invertible 3$\times$3 CNN layer using a pair of masked CNN layers, making them inefficient. We study conditions such that 3$\times$3 CNNs are invertible, allowing them to construct expressive normalizing flows. We derive necessary and sufficient conditions on a padded CNN for it to be invertible. Our conditions for invertibility are simple, can easily be maintained during the training process. Since we require only a single CNN layer for every effective invertible CNN layer, our approach is more efficient than emerging convolutions. We also proposed a coupling method, Quad-coupling. We benchmark our approach and show similar performance results to emergent convolutions while improving the model's efficiency.

2021-07-05

  • cs.CL updates on arXiv.org

    Many-to-English Machine Translation Tools, Data, and Pretrained Models. (arXiv:2104.00290v2 [cs.CL] UPDATED)
    (2 min) While there are more than 7000 languages in the world, most translation research efforts have targeted a few high-resource languages. Commercial translation systems support only one hundred languages or fewer, and do not make these models available for transfer to low resource languages. In this work, we present useful tools for machine translation research: MTData, NLCodec, and RTG. We demonstrate their usefulness by creating a multilingual neural machine translation model capable of translating from 500 source languages to English. We make this multilingual model readily downloadable and usable as a service, or as a parent model for transfer-learning to even lower-resource languages.
    Data Augmentation by Concatenation for Low-Resource Translation: A Mystery and a Solution. (arXiv:2105.01691v2 [cs.CL] UPDATED)
    (2 min) In this paper, we investigate the driving factors behind concatenation, a simple but effective data augmentation method for low-resource neural machine translation. Our experiments suggest that discourse context is unlikely the cause for the improvement of about +1 BLEU across four language pairs. Instead, we demonstrate that the improvement comes from three other factors unrelated to discourse: context diversity, length diversity, and (to a lesser extent) position shifting.
    UIT-ISE-NLP at SemEval-2021 Task 5: Toxic Spans Detection with BiLSTM-CRF and ToxicBERT Comment Classification. (arXiv:2104.10100v3 [cs.CL] UPDATED)
    (2 min) We present our works on SemEval-2021 Task 5 about Toxic Spans Detection. This task aims to build a model for identifying toxic words in whole posts. We use the BiLSTM-CRF model combining with ToxicBERT Classification to train the detection model for identifying toxic words in posts. Our model achieves 62.23% by F1-score on the Toxic Spans Detection task.
    Temporally Correlated Task Scheduling for Sequence Learning. (arXiv:2007.05290v2 [cs.CL] UPDATED)
    (2 min) Sequence learning has attracted much research attention from the machine learning community in recent years. In many applications, a sequence learning task is usually associated with multiple temporally correlated auxiliary tasks, which are different in terms of how much input information to use or which future step to predict. For example, (i) in simultaneous machine translation, one can conduct translation under different latency (i.e., how many input words to read/wait before translation); (ii) in stock trend forecasting, one can predict the price of a stock in different future days (e.g., tomorrow, the day after tomorrow). While it is clear that those temporally correlated tasks can help each other, there is a very limited exploration on how to better leverage multiple auxiliary tasks to boost the performance of the main task. In this work, we introduce a learnable scheduler to sequence learning, which can adaptively select auxiliary tasks for training depending on the model status and the current training data. The scheduler and the model for the main task are jointly trained through bi-level optimization. Experiments show that our method significantly improves the performance of simultaneous machine translation and stock trend forecasting.
    Multitask Learning for Grapheme-to-Phoneme Conversion of Anglicisms in German Speech Recognition. (arXiv:2105.12708v2 [cs.CL] UPDATED)
    (2 min) Loanwords, such as Anglicisms, are a challenge in German speech recognition. Due to their irregular pronunciation compared to native German words, automatically generated pronunciation dictionaries often include faulty phoneme sequences for Anglicisms. In this work, we propose a multitask sequence-to-sequence approach for grapheme-to-phoneme conversion to improve the phonetization of Anglicisms. We extended a grapheme-to-phoneme model with a classifier to distinguish Anglicisms from native German words. With this approach, the model learns to generate pronunciations differently depending on the classification result. We used our model to create supplementary Anglicism pronunciation dictionaries that are added to an existing German speech recognition model. Tested on a dedicated Anglicism evaluation set, we improved the recognition of Anglicisms compared to a baseline model, reducing the word error rate by 1 % and the Anglicism error rate by 3 %. We show that multitask learning can help solving the challenge of loanwords in German speech recognition.
    Conversational Machine Reading Comprehension for Vietnamese Healthcare Texts. (arXiv:2105.01542v5 [cs.CL] UPDATED)
    (2 min) Machine reading comprehension (MRC) is a sub-field in natural language processing that aims to assist computers understand unstructured texts and then answer questions related to them. In practice, the conversation is an essential way to communicate and transfer information. To help machines understand conversation texts, we present UIT-ViCoQA, a new corpus for conversational machine reading comprehension in the Vietnamese language. This corpus consists of 10,000 questions with answers over 2,000 conversations about health news articles. Then, we evaluate several baseline approaches for conversational machine comprehension on the UIT-ViCoQA corpus. The best model obtains an F1 score of 45.27%, which is 30.91 points behind human performance (76.18%), indicating that there is ample room for improvement. Our dataset is available at our website: this http URL for research purposes.
    Text-guided Legal Knowledge Graph Reasoning. (arXiv:2104.02284v2 [cs.AI] UPDATED)
    (2 min) Recent years have witnessed the prosperity of legal artificial intelligence with the development of technologies. In this paper, we propose a novel legal application of legal provision prediction (LPP), which aims to predict the related legal provisions of affairs. We formulate this task as a challenging knowledge graph completion problem, which requires not only text understanding but also graph reasoning. To this end, we propose a novel text-guided graph reasoning approach. We collect amounts of real-world legal provision data from the Guangdong government service website and construct a legal dataset called LegalLPP. Extensive experimental results on the dataset show that our approach achieves better performance compared with baselines. The code and dataset are available in \url{https://github.com/zjunlp/LegalPP} for reproducibility.
    Improving Patent Mining and Relevance Classification using Transformers. (arXiv:2105.03979v2 [cs.CL] UPDATED)
    (2 min) Patent analysis and mining are time-consuming and costly processes for companies, but nevertheless essential if they are willing to remain competitive. To face the overload induced by numerous patents, the idea is to automatically filter them, bringing only few to read to experts. This paper reports a successful application of fine-tuning and retraining on pre-trained deep Natural Language Processing models on patent classification. The solution that we propose combines several state-of-the-art treatments to achieve our goal - decrease the workload while preserving recall and precision metrics.
    ExplainaBoard: An Explainable Leaderboard for NLP. (arXiv:2104.06387v2 [cs.CL] UPDATED)
    (2 min) With the rapid development of NLP research, leaderboards have emerged as one tool to track the performance of various systems on various NLP tasks. They are effective in this goal to some extent, but generally present a rather simplistic one-dimensional view of the submitted systems, communicated only through holistic accuracy numbers. In this paper, we present a new conceptualization and implementation of NLP evaluation: the ExplainaBoard, which in addition to inheriting the functionality of the standard leaderboard, also allows researchers to (i) diagnose strengths and weaknesses of a single system (e.g.~what is the best-performing system bad at?) (ii) interpret relationships between multiple systems. (e.g.~where does system A outperform system B? What if we combine systems A, B, and C?) and (iii) examine prediction results closely (e.g.~what are common errors made by multiple systems, or in what contexts do particular errors occur?). So far, ExplainaBoard covers more than 400 systems, 50 datasets, 40 languages, and 12 tasks. ExplainaBoard keeps updated and is recently upgraded by supporting (1) multilingual multi-task benchmark, (2) meta-evaluation, and (3) more complicated task: machine translation, which reviewers also suggested.} We not only released an online platform on the website \url{this http URL} but also make our evaluation tool an API with MIT Licence at Github \url{https://github.com/neulab/explainaBoard} and PyPi \url{https://pypi.org/project/interpret-eval/} that allows users to conveniently assess their models offline. We additionally release all output files from systems that we have run or collected to motivate "output-driven" research in the future.
    Evaluating Gender Bias in Speech Translation. (arXiv:2010.14465v3 [cs.CL] UPDATED)
    (2 min) The scientific community is increasingly aware of the necessity to embrace pluralism and consistently represent major and minor social groups. Currently, there are no standard evaluation techniques for different types of biases. Accordingly, there is an urgent need to provide evaluation sets and protocols to measure existing biases in our automatic systems. Evaluating the biases should be an essential step towards mitigating them in the systems. This paper introduces WinoST, a new freely available challenge set for evaluating gender bias in speech translation. WinoST is the speech version of WinoMT which is a MT challenge set and both follow an evaluation protocol to measure gender accuracy. Using a state-of-the-art end-to-end speech translation system, we report the gender bias evaluation on four language pairs and we show that gender accuracy in speech translation is more than 23% lower than in MT.
    DRIFT: A Toolkit for Diachronic Analysis of Scientific Literature. (arXiv:2107.01198v1 [cs.CL])
    (2 min) In this work, we present to the NLP community, and to the wider research community as a whole, an application for the diachronic analysis of research corpora. We open source an easy-to-use tool coined: DRIFT, which allows researchers to track research trends and development over the years. The analysis methods are collated from well-cited research works, with a few of our own methods added for good measure. Succinctly put, some of the analysis methods are: keyword extraction, word clouds, predicting declining/stagnant/growing trends using Productivity, tracking bi-grams using Acceleration plots, finding the Semantic Drift of words, tracking trends using similarity, etc. To demonstrate the utility and efficacy of our tool, we perform a case study on the cs.CL corpus of the arXiv repository and draw inferences from the analysis methods. The toolkit and the associated code are available here: https://github.com/rajaswa/DRIFT.
    Misinformation Detection on YouTube Using Video Captions. (arXiv:2107.00941v1 [cs.LG])
    (2 min) Millions of people use platforms such as YouTube, Facebook, Twitter, and other mass media. Due to the accessibility of these platforms, they are often used to establish a narrative, conduct propaganda, and disseminate misinformation. This work proposes an approach that uses state-of-the-art NLP techniques to extract features from video captions (subtitles). To evaluate our approach, we utilize a publicly accessible and labeled dataset for classifying videos as misinformation or not. The motivation behind exploring video captions stems from our analysis of videos metadata. Attributes such as the number of views, likes, dislikes, and comments are ineffective as videos are hard to differentiate using this information. Using caption dataset, the proposed models can classify videos among three classes (Misinformation, Debunking Misinformation, and Neutral) with 0.85 to 0.90 F1-score. To emphasize the relevance of the misinformation class, we re-formulate our classification problem as a two-class classification - Misinformation vs. others (Debunking Misinformation and Neutral). In our experiments, the proposed models can classify videos with 0.92 to 0.95 F1-score and 0.78 to 0.90 AUC ROC.
    Language Identification of Hindi-English tweets using code-mixed BERT. (arXiv:2107.01202v1 [cs.CL])
    (2 min) Language identification of social media text has been an interesting problem of study in recent years. Social media messages are predominantly in code mixed in non-English speaking states. Prior knowledge by pre-training contextual embeddings have shown state of the art results for a range of downstream tasks. Recently, models such as BERT have shown that using a large amount of unlabeled data, the pretrained language models are even more beneficial for learning common language representations. Extensive experiments exploiting transfer learning and fine-tuning BERT models to identify language on Twitter are presented in this paper. The work utilizes a data collection of Hindi-English-Urdu codemixed text for language pre-training and Hindi-English codemixed for subsequent word-level language classification. The results show that the representations pre-trained over codemixed data produce better results by their monolingual counterpart.
    Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions. (arXiv:2107.00789v1 [cs.RO])
    (2 min) There have been many studies in robotics to improve the communication skills of domestic service robots. Most studies, however, have not fully benefited from recent advances in deep neural networks because the training datasets are not large enough. In this paper, our aim is to augment the datasets based on a crossmodal language generation model. We propose the Case Relation Transformer (CRT), which generates a fetching instruction sentence from an image, such as "Move the blue flip-flop to the lower left box." Unlike existing methods, the CRT uses the Transformer to integrate the visual features and geometry features of objects in the image. The CRT can handle the objects because of the Case Relation Block. We conducted comparison experiments and a human evaluation. The experimental results show the CRT outperforms baseline methods.
    Predicting Decisions in Language Based Persuasion Games. (arXiv:2012.09966v3 [cs.AI] UPDATED)
    (3 min) Sender-receiver interactions, and specifically persuasion games, are widely researched in economic modeling and artificial intelligence, and serve as a solid foundation for powerful applications. However, in the classic persuasion games setting, the messages sent from the expert to the decision-maker are abstract or well-structured application-specific signals rather than natural (human) language messages, although natural language is a very common communication signal in real-world persuasion setups. This paper addresses the use of natural language in persuasion games, exploring its impact on the decisions made by the players and aiming to construct effective models for the prediction of these decisions. For this purpose, we conduct an online repeated interaction experiment. At each trial of the interaction, an informed expert aims to sell an uninformed decision-maker a vacation in a hotel, by sending her a review that describes the hotel. While the expert is exposed to several scored reviews, the decision-maker observes only the single review sent by the expert, and her payoff in case she chooses to take the hotel is a random draw from the review score distribution available to the expert only. The expert's payoff, in turn, depends on the number of times the decision-maker chooses the hotel. We consider a number of modeling approaches for this setup, differing from each other in the model type (deep neural network (DNN) vs. linear classifier), the type of features used by the model (textual, behavioral or both) and the source of the textual features (DNN-based vs. hand-crafted). Our results demonstrate that given a prefix of the interaction sequence, our models can predict the future decisions of the decision-maker, particularly when a sequential modeling approach and hand-crafted textual features are applied.
    Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots. (arXiv:2107.00811v1 [cs.RO])
    (2 min) Currently, domestic service robots have an insufficient ability to interact naturally through language. This is because understanding human instructions is complicated by various ambiguities and missing information. In existing methods, the referring expressions that specify the relationships between objects are insufficiently modeled. In this paper, we propose Target-dependent UNITER, which learns the relationship between the target object and other objects directly by focusing on the relevant regions within an image, rather than the whole image. Our method is an extension of the UNITER-based Transformer that can be pretrained on general-purpose datasets. We extend the UNITER approach by introducing a new architecture for handling the target candidates. Our model is validated on two standard datasets, and the results show that Target-dependent UNITER outperforms the baseline method in terms of classification accuracy.
    SocialAI: Benchmarking Socio-Cognitive Abilities in Deep Reinforcement Learning Agents. (arXiv:2107.00956v1 [cs.LG])
    (2 min) Building embodied autonomous agents capable of participating in social interactions with humans is one of the main challenges in AI. Within the Deep Reinforcement Learning (DRL) field, this objective motivated multiple works on embodied language use. However, current approaches focus on language as a communication tool in very simplified and non-diverse social situations: the "naturalness" of language is reduced to the concept of high vocabulary size and variability. In this paper, we argue that aiming towards human-level AI requires a broader set of key social skills: 1) language use in complex and variable social contexts; 2) beyond language, complex embodied communication in multimodal settings within constantly evolving social worlds. We explain how concepts from cognitive sciences could help AI to draw a roadmap towards human-like intelligence, with a focus on its social dimensions. As a first step, we propose to expand current research to a broader set of core social skills. To do this, we present SocialAI, a benchmark to assess the acquisition of social skills of DRL agents using multiple grid-world environments featuring other (scripted) social agents. We then study the limits of a recent SOTA DRL approach when tested on SocialAI and discuss important next steps towards proficient social agents. Videos and code are available at https://sites.google.com/view/socialai.
    Data Centric Domain Adaptation for Historical Text with OCR Errors. (arXiv:2107.00927v1 [cs.CL])
    (2 min) We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings; and OCR errors by injecting synthetic OCR errors into the source domain and address data centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Our cross-domain as well as our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
    Concept Identification of Directly and Indirectly Related Mentions Referring to Groups of Persons. (arXiv:2107.00955v1 [cs.CL])
    (2 min) Unsupervised concept identification through clustering, i.e., identification of semantically related words and phrases, is a common approach to identify contextual primitives employed in various use cases, e.g., text dimension reduction, i.e., replace words with the concepts to reduce the vocabulary size, summarization, and named entity resolution. We demonstrate the first results of an unsupervised approach for the identification of groups of persons as actors extracted from a set of related articles. Specifically, the approach clusters mentions of groups of persons that act as non-named entity actors in the texts, e.g., "migrant families" = "asylum-seekers." Compared to our baseline, the approach keeps the mentions of the geopolitical entities separated, e.g., "Iran leaders" != "European leaders," and clusters (in)directly related mentions with diverse wording, e.g., "American officials" = "Trump Administration."
    R2D2: Recursive Transformer based on Differentiable Tree for Interpretable Hierarchical Language Modeling. (arXiv:2107.00967v1 [cs.CL])
    (2 min) Human language understanding operates at multiple levels of granularity (e.g., words, phrases, and sentences) with increasing levels of abstraction that can be hierarchically combined. However, existing deep models with stacked layers do not explicitly model any sort of hierarchical process. This paper proposes a recursive Transformer model based on differentiable CKY style binary trees to emulate the composition process. We extend the bidirectional language model pre-training objective to this architecture, attempting to predict each word given its left and right abstraction nodes. To scale up our approach, we also introduce an efficient pruned tree induction algorithm to enable encoding in just a linear number of composition steps. Experimental results on language modeling and unsupervised parsing show the effectiveness of our approach.
    Transformer-F: A Transformer network with effective methods for learning universal sentence representation. (arXiv:2107.00653v1 [cs.CL])
    (2 min) The Transformer model is widely used in natural language processing for sentence representation. However, the previous Transformer-based models focus on function words that have limited meaning in most cases and could merely extract high-level semantic abstraction features. In this paper, two approaches are introduced to improve the performance of Transformers. We calculated the attention score by multiplying the part-of-speech weight vector with the correlation coefficient, which helps extract the words with more practical meaning. The weight vector is obtained by the input text sequence based on the importance of the part-of-speech. Furthermore, we fuse the features of each layer to make the sentence representation results more comprehensive and accurate. In experiments, we demonstrate the effectiveness of our model Transformer-F on three standard text classification datasets. Experimental results show that our proposed model significantly boosts the performance of text classification as compared to the baseline model. Specifically, we obtain a 5.28% relative improvement over the vanilla Transformer on the simple tasks.
    Normalizing Flow based Hidden Markov Models for Classification of Speech Phones with Explainability. (arXiv:2107.00730v1 [cs.LG])
    (2 min) In pursuit of explainability, we develop generative models for sequential data. The proposed models provide state-of-the-art classification results and robust performance for speech phone classification. We combine modern neural networks (normalizing flows) and traditional generative models (hidden Markov models - HMMs). Normalizing flow-based mixture models (NMMs) are used to model the conditional probability distribution given the hidden state in the HMMs. Model parameters are learned through judicious combinations of time-tested Bayesian learning methods and contemporary neural network learning methods. We mainly combine expectation-maximization (EM) and mini-batch gradient descent. The proposed generative models can compute likelihood of a data and hence directly suitable for maximum-likelihood (ML) classification approach. Due to structural flexibility of HMMs, we can use different normalizing flow models. This leads to different types of HMMs providing diversity in data modeling capacity. The diversity provides an opportunity for easy decision fusion from different models. For a standard speech phone classification setup involving 39 phones (classes) and the TIMIT dataset, we show that the use of standard features called mel-frequency-cepstral-coeffcients (MFCCs), the proposed generative models, and the decision fusion together can achieve $86.6\%$ accuracy by generative training only. This result is close to state-of-the-art results, for examples, $86.2\%$ accuracy of PyTorch-Kaldi toolkit [1], and $85.1\%$ accuracy using light gated recurrent units [2]. We do not use any discriminative learning approach and related sophisticated features in this article.
    Unsupervised Spoken Utterance Classification. (arXiv:2107.01068v1 [cs.CL])
    (2 min) An intelligent virtual assistant (IVA) enables effortless conversations in call routing through spoken utterance classification (SUC) which is a special form of spoken language understanding (SLU). Building a SUC system requires a large amount of supervised in-domain data that is not always available. In this paper, we introduce an unsupervised spoken utterance classification approach (USUC) that does not require any in-domain data except for the intent labels and a few para-phrases per intent. USUC is consisting of a KNN classifier (K=1) and a complex embedding model trained on a large amount of unsupervised customer service corpus. Among all embedding models, we demonstrate that Elmo works best for USUC. However, an Elmo model is too slow to be used at run-time for call routing. To resolve this issue, first, we compute the uni- and bi-gram embedding vectors offline and we build a lookup table of n-grams and their corresponding embedding vector. Then we use this table to compute sentence embedding vectors at run-time, along with back-off techniques for unseen n-grams. Experiments show that USUC outperforms the traditional utterance classification methods by reducing the classification error rate from 32.9% to 27.0% without requiring supervised data. Moreover, our lookup and back-off technique increases the processing speed from 16 utterances per second to 118 utterances per second.
    An Investigation of the (In)effectiveness of Counterfactually Augmented Data. (arXiv:2107.00753v1 [cs.CL])
    (2 min) While pretrained language models achieve excellent performance on natural language understanding benchmarks, they tend to rely on spurious correlations and generalize poorly to out-of-distribution (OOD) data. Recent work has explored using counterfactually-augmented data (CAD) -- data generated by minimally perturbing examples to flip the ground-truth label -- to identify robust features that are invariant under distribution shift. However, empirical results using CAD for OOD generalization have been mixed. To explain this discrepancy, we draw insights from a linear Gaussian model and demonstrate the pitfalls of CAD. Specifically, we show that (a) while CAD is effective at identifying robust features, it may prevent the model from learning unperturbed robust features, and (b) CAD may exacerbate existing spurious correlations in the data. Our results show that the lack of perturbation diversity in current CAD datasets limits its effectiveness on OOD generalization, calling for innovative crowdsourcing procedures to elicit diverse perturbation of examples.
    DUKweb: Diachronic word representations from the UK Web Archive corpus. (arXiv:2107.01076v1 [cs.CL])
    (2 min) Lexical semantic change (detecting shifts in the meaning and usage of words) is an important task for social and cultural studies as well as for Natural Language Processing applications. Diachronic word embeddings (time-sensitive vector representations of words that preserve their meaning) have become the standard resource for this task. However, given the significant computational resources needed for their generation, very few resources exist that make diachronic word embeddings available to the scientific community. In this paper we present DUKweb, a set of large-scale resources designed for the diachronic analysis of contemporary English. DUKweb was created from the JISC UK Web Domain Dataset (1996-2013), a very large archive which collects resources from the Internet Archive that were hosted on domains ending in `.uk'. DUKweb consists of a series word co-occurrence matrices and two types of word embeddings for each year in the JISC UK Web Domain dataset. We show the reuse potential of DUKweb and its quality standards via a case study on word meaning change detection.
    Learned Token Pruning for Transformers. (arXiv:2107.00910v1 [cs.CL])
    (2 min) A major challenge in deploying transformer models is their prohibitive inference cost, which quadratically scales with the input sequence length. This makes it especially difficult to use transformers for processing long sequences. To address this, we present a novel Learned Token Pruning (LTP) method that reduces redundant tokens as the data passes through the different layers of the transformer. In particular, LTP prunes tokens with an attention score below a threshold value, which is learned during training. Importantly, our threshold based method avoids algorithmically expensive operations such as top-k token selection which are used in prior token pruning methods, and also leads to structured pruning. We extensively test the performance of our approach on multiple GLUE tasks and show that our learned threshold based method consistently outperforms the prior state-of-the-art top-k token based method by up to ~2% higher accuracy with the same amount of FLOPs. Furthermore, our preliminary results show up to 1.4x and 1.9x throughput improvement on Tesla T4 GPU and Intel Haswell CPU, respectively, with less than 1% of accuracy drop (and up to 2.1x FLOPs reduction). Our code has been developed in PyTorch and has been open-sourced.
    Heterogeneous Graph Attention Network for Multi-hop Machine Reading Comprehension. (arXiv:2107.00841v1 [cs.CL])
    (2 min) Multi-hop machine reading comprehension is a challenging task in natural language processing, which requires more reasoning ability and explainability. Spectral models based on graph convolutional networks grant the inferring abilities and lead to competitive results, however, part of them still face the challenge of analyzing the reasoning in a human-understandable way. Inspired by the concept of the Grandmother Cells in cognitive neuroscience, a spatial graph attention framework named crname, imitating the procedure was proposed. This model is designed to assemble the semantic features in multi-angle representations and automatically concentrate or alleviate the information for reasoning. The name "crname" is a metaphor for the pattern of the model: regard the subjects of queries as the start points of clues, take the reasoning entities as bridge points, and consider the latent candidate entities as the grandmother cells, and the clues end up in candidate entities. The proposed model allows us to visualize the reasoning graph and analyze the importance of edges connecting two entities and the selectivity in the mention and candidate nodes, which can be easier to be comprehended empirically. The official evaluations in open-domain multi-hop reading dataset WikiHop and Drug-drug Interactions dataset MedHop prove the validity of our approach and show the probability of the application of the model in the molecular biology domain.
    Ethics Sheets for AI Tasks. (arXiv:2107.01183v1 [cs.AI])
    (2 min) Several high-profile events, such as the use of biased recidivism systems and mass testing of emotion recognition systems on vulnerable sub-populations, have highlighted how technology will often lead to more adverse outcomes for those that are already marginalized. In this paper, I will make a case for thinking about ethical considerations not just at the level of individual models and datasets, but also at the level of AI tasks. I will present a new form of such an effort, Ethics Sheets for AI Tasks, dedicated to fleshing out the assumptions and ethical considerations hidden in how a task is commonly framed and in the choices we make regarding the data, method, and evaluation. Finally, I will provide an example ethics sheet for automatic emotion recognition. Together with Data Sheets for datasets and Model Cards for AI systems, Ethics Sheets aid in the development and deployment of responsible AI systems.
    A Primer on Pretrained Multilingual Language Models. (arXiv:2107.00676v1 [cs.CL])
    (2 min) Multilingual Language Models (MLLMs) such as mBERT, XLM, XLM-R, \textit{etc.} have emerged as a viable option for bringing the power of pretraining to a large number of languages. Given their success in zero shot transfer learning, there has emerged a large body of work in (i) building bigger MLLMs covering a large number of languages (ii) creating exhaustive benchmarks covering a wider variety of tasks and languages for evaluating MLLMs (iii) analysing the performance of MLLMs on monolingual, zero shot crosslingual and bilingual tasks (iv) understanding the universal language patterns (if any) learnt by MLLMs and (v) augmenting the (often) limited capacity of MLLMs to improve their performance on seen or even unseen languages. In this survey, we review the existing literature covering the above broad areas of research pertaining to MLLMs. Based on our survey, we recommend some promising directions of future research.
    He Thinks He Knows Better than the Doctors: BERT for Event Factuality Fails on Pragmatics. (arXiv:2107.00807v1 [cs.CL])
    (2 min) We investigate how well BERT performs on predicting factuality in several existing English datasets, encompassing various linguistic constructions. Although BERT obtains a strong performance on most datasets, it does so by exploiting common surface patterns that correlate with certain factuality labels, and it fails on instances where pragmatic reasoning is necessary. Contrary to what the high performance suggests, we are still far from having a robust system for factuality prediction.
    Interactive decoding of words from visual speech recognition models. (arXiv:2107.00692v1 [cs.CL])
    (2 min) This work describes an interactive decoding method to improve the performance of visual speech recognition systems using user input to compensate for the inherent ambiguity of the task. Unlike most phoneme-to-word decoding pipelines, which produce phonemes and feed these through a finite state transducer, our method instead expands words in lockstep, facilitating the insertion of interaction points at each word position. Interaction points enable us to solicit input during decoding, allowing users to interactively direct the decoding process. We simulate the behavior of user input using an oracle to give an automated evaluation, and show promise for the use of this method for text input.
  • cs.CV updates on arXiv.org

    Ensemble of Loss Functions to Improve Generalizability of Deep Metric Learning methods. (arXiv:2107.01130v1 [cs.CV])
    (2 min) Deep Metric Learning (DML) learns a non-linear semantic embedding from input data that brings similar pairs together while keeps dissimilar data away from each other. To this end, many different methods are proposed in the last decade with promising results in various applications. The success of a DML algorithm greatly depends on its loss function. However, no loss function is perfect, and it deals only with some aspects of an optimal similarity embedding. Besides, the generalizability of the DML on unseen categories during the test stage is an important matter that is not considered by existing loss functions. To address these challenges, we propose novel approaches to combine different losses built on top of a shared deep feature extractor. The proposed ensemble of losses enforces the deep model to extract features that are consistent with all losses. Since the selected losses are diverse and each emphasizes different aspects of an optimal semantic embedding, our effective combining methods yield a considerable improvement over any individual loss and generalize well on unseen categories. Here, there is no limitation in choosing loss functions, and our methods can work with any set of existing ones. Besides, they can optimize each loss function as well as its weight in an end-to-end paradigm with no need to adjust any hyper-parameter. We evaluate our methods on some popular datasets from the machine vision domain in conventional Zero-Shot-Learning (ZSL) settings. The results are very encouraging and show that our methods outperform all baseline losses by a large margin in all datasets.
    NTIRE 2021 Multi-modal Aerial View Object Classification Challenge. (arXiv:2107.01189v1 [cs.CV])
    (2 min) In this paper, we introduce the first Challenge on Multi-modal Aerial View Object Classification (MAVOC) in conjunction with the NTIRE 2021 workshop at CVPR. This challenge is composed of two different tracks using EO andSAR imagery. Both EO and SAR sensors possess different advantages and drawbacks. The purpose of this competition is to analyze how to use both sets of sensory information in complementary ways. We discuss the top methods submitted for this competition and evaluate their results on our blind test set. Our challenge results show significant improvement of more than 15% accuracy from our current baselines for each track of the competition
    Polarized Self-Attention: Towards High-quality Pixel-wise Regression. (arXiv:2107.00782v1 [cs.CV])
    (2 min) Pixel-wise regression is probably the most common problem in fine-grained computer vision tasks, such as estimating keypoint heatmaps and segmentation masks. These regression problems are very challenging particularly because they require, at low computation overheads, modeling long-range dependencies on high-resolution inputs/outputs to estimate the highly nonlinear pixel-wise semantics. While attention mechanisms in Deep Convolutional Neural Networks(DCNNs) has become popular for boosting long-range dependencies, element-specific attention, such as Nonlocal blocks, is highly complex and noise-sensitive to learn, and most of simplified attention hybrids try to reach the best compromise among multiple types of tasks. In this paper, we present the Polarized Self-Attention(PSA) block that incorporates two critical designs towards high-quality pixel-wise regression: (1) Polarized filtering: keeping high internal resolution in both channel and spatial attention computation while completely collapsing input tensors along their counterpart dimensions. (2) Enhancement: composing non-linearity that directly fits the output distribution of typical fine-grained regression, such as the 2D Gaussian distribution (keypoint heatmaps), or the 2D Binormial distribution (binary segmentation masks). PSA appears to have exhausted the representation capacity within its channel-only and spatial-only branches, such that there is only marginal metric differences between its sequential and parallel layouts. Experimental results show that PSA boosts standard baselines by $2-4$ points, and boosts state-of-the-arts by $1-2$ points on 2D pose estimation and semantic segmentation benchmarks.
    Mixed Supervision Learning for Whole Slide Image Classification. (arXiv:2107.00934v1 [cs.CV])
    (2 min) Weak supervision learning on classification labels has demonstrated high performance in various tasks. When a few pixel-level fine annotations are also affordable, it is natural to leverage both of the pixel-level (e.g., segmentation) and image level (e.g., classification) annotation to further improve the performance. In computational pathology, however, such weak or mixed supervision learning is still a challenging task, since the high resolution of whole slide images makes it unattainable to perform end-to-end training of classification models. An alternative approach is to analyze such data by patch-base model training, i.e., using self-supervised learning to generate pixel-level pseudo labels for patches. However, such methods usually have model drifting issues, i.e., hard to converge, because the noise accumulates during the self-training process. To handle those problems, we propose a mixed supervision learning framework for super high-resolution images to effectively utilize their various labels (e.g., sufficient image-level coarse annotations and a few pixel-level fine labels). During the patch training stage, this framework can make use of coarse image-level labels to refine self-supervised learning and generate high-quality pixel-level pseudo labels. A comprehensive strategy is proposed to suppress pixel-level false positives and false negatives. Three real-world datasets with very large number of images (i.e., more than 10,000 whole slide images) and various types of labels are used to evaluate the effectiveness of mixed supervision learning. We reduced the false positive rate by around one third compared to state of the art while retaining 100\% sensitivity, in the task of image-level classification.
    LensID: A CNN-RNN-Based Framework Towards Lens Irregularity Detection in Cataract Surgery Videos. (arXiv:2107.00875v1 [eess.IV])
    (2 min) A critical complication after cataract surgery is the dislocation of the lens implant leading to vision deterioration and eye trauma. In order to reduce the risk of this complication, it is vital to discover the risk factors during the surgery. However, studying the relationship between lens dislocation and its suspicious risk factors using numerous videos is a time-extensive procedure. Hence, the surgeons demand an automatic approach to enable a larger-scale and, accordingly, more reliable study. In this paper, we propose a novel framework as the major step towards lens irregularity detection. In particular, we propose (I) an end-to-end recurrent neural network to recognize the lens-implantation phase and (II) a novel semantic segmentation network to segment the lens and pupil after the implantation phase. The phase recognition results reveal the effectiveness of the proposed surgical phase recognition approach. Moreover, the segmentation results confirm the proposed segmentation network's effectiveness compared to state-of-the-art rival approaches.
    On Measuring and Controlling the Spectral Bias of the Deep Image Prior. (arXiv:2107.01125v1 [eess.IV])
    (2 min) The deep image prior has demonstrated the remarkable ability that untrained networks can address inverse imaging problems, such as denoising, inpainting and super-resolution, by optimizing on just a single degraded image. Despite its promise, it suffers from two limitations. First, it remains unclear how one can control the prior beyond the choice of the network architecture. Second, it requires an oracle to determine when to stop the optimization as the performance degrades after reaching a peak. In this paper, we study the deep image prior from a spectral bias perspective to address these problems. By introducing a frequency-band correspondence measure, we observe that deep image priors for inverse imaging exhibit a spectral bias during optimization, where low-frequency image signals are learned faster and better than high-frequency noise signals. This pinpoints why degraded images can be denoised or inpainted when the optimization is stopped at the right time. Based on our observations, we propose to control the spectral bias in the deep image prior to prevent performance degradation and to speed up optimization convergence. We do so in the two core layer types of inverse imaging networks: the convolution layer and the upsampling layer. We present a Lipschitz-controlled approach for the convolution and a Gaussian-controlled approach for the upsampling layer. We further introduce a stopping criterion to avoid superfluous computation. The experiments on denoising, inpainting and super-resolution show that our method no longer suffers from performance degradation during optimization, relieving us from the need for an oracle criterion to stop early. We further outline a stopping criterion to avoid superfluous computation. Finally, we show that our approach obtains favorable restoration results compared to current approaches, across all tasks.
    Cooperative Training and Latent Space Data Augmentation for Robust Medical Image Segmentation. (arXiv:2107.01079v1 [cs.CV])
    (2 min) Deep learning-based segmentation methods are vulnerable to unforeseen data distribution shifts during deployment, e.g. change of image appearances or contrasts caused by different scanners, unexpected imaging artifacts etc. In this paper, we present a cooperative framework for training image segmentation models and a latent space augmentation method for generating hard examples. Both contributions improve model generalization and robustness with limited data. The cooperative training framework consists of a fast-thinking network (FTN) and a slow-thinking network (STN). The FTN learns decoupled image features and shape features for image reconstruction and segmentation tasks. The STN learns shape priors for segmentation correction and refinement. The two networks are trained in a cooperative manner. The latent space augmentation generates challenging examples for training by masking the decoupled latent space in both channel-wise and spatial-wise manners. We performed extensive experiments on public cardiac imaging datasets. Using only 10 subjects from a single site for training, we demonstrated improved cross-site segmentation performance and increased robustness against various unforeseen imaging artifacts compared to strong baseline methods. Particularly, cooperative training with latent space data augmentation yields 15% improvement in terms of average Dice score when compared to a standard training method.
    Long-Short Ensemble Network for Bipolar Manic-Euthymic State Recognition Based on Wrist-worn Sensors. (arXiv:2107.00710v1 [cs.LG])
    (2 min) Manic episodes of bipolar disorder can lead to uncritical behaviour and delusional psychosis, often with destructive consequences for those affected and their surroundings. Early detection and intervention of a manic episode are crucial to prevent escalation, hospital admission and premature death. However, people with bipolar disorder may not recognize that they are experiencing a manic episode and symptoms such as euphoria and increased productivity can also deter affected individuals from seeking help. This work proposes to perform user-independent, automatic mood-state detection based on actigraphy and electrodermal activity acquired from a wrist-worn device during mania and after recovery (euthymia). This paper proposes a new deep learning-based ensemble method leveraging long (20h) and short (5 minutes) time-intervals to discriminate between the mood-states. When tested on 47 bipolar patients, the proposed classification scheme achieves an average accuracy of 91.59% in euthymic/manic mood-state recognition.
    Topo-boundary: A Benchmark Dataset on Topological Road-boundary Detection Using Aerial Images for Autonomous Driving. (arXiv:2103.17119v2 [cs.CV] UPDATED)
    (2 min) Road-boundary detection is important for autonomous driving. It can be used to constrain autonomous vehicles running on road areas to ensure driving safety. Compared with online road-boundary detection using on-vehicle cameras/Lidars, offline detection using aerial images could alleviate the severe occlusion issue. Moreover, the offline detection results can be directly employed to annotate high-definition (HD) maps. In recent years, deep-learning technologies have been used in offline detection. But there still lacks a publicly available dataset for this task, which hinders the research progress in this area. So in this paper, we propose a new benchmark dataset, named \textit{Topo-boundary}, for offline topological road-boundary detection. The dataset contains 25,295 $1000\times1000$-sized 4-channel aerial images. Each image is provided with 8 training labels for different sub-tasks. We also design a new entropy-based metric for connectivity evaluation, which could better handle noises or outliers. We implement and evaluate 3 segmentation-based baselines and 5 graph-based baselines using the dataset. We also propose a new imitation-learning-based baseline which is enhanced from our previous work. The superiority of our enhancement is demonstrated from the comparison. The dataset and our-implemented code for the baselines are available at \texttt{\url{https://tonyxuqaq.github.io/Topo-boundary/}}.
    Comparison of end-to-end neural network architectures and data augmentation methods for automatic infant motility assessment using wearable sensors. (arXiv:2107.01086v1 [cs.CV])
    (2 min) Infant motility assessment using intelligent wearables is a promising new approach for assessment of infant neurophysiological development, and where efficient signal analysis plays a central role. This study investigates the use of different end-to-end neural network architectures for processing infant motility data from wearable sensors. We focus on the performance and computational burden of alternative sensor encoder and time-series modelling modules and their combinations. In addition, we explore the benefits of data augmentation methods in ideal and non-ideal recording conditions. The experiments are conducted using a data-set of multi-sensor movement recordings from 7-month-old infants, as captured by a recently proposed smart jumpsuit for infant motility assessment. Our results indicate that the choice of the encoder module has a major impact on classifier performance. For sensor encoders, the best performance was obtained with parallel 2-dimensional convolutions for intra-sensor channel fusion with shared weights for all sensors. The results also indicate that a relatively compact feature representation is obtainable for within-sensor feature extraction without a drastic loss to classifier performance. Comparison of time-series models revealed that feed-forward dilated convolutions with residual and skip connections outperformed all RNN-based models in performance, training time, and training stability. The experiments also indicate that data augmentation improves model robustness in simulated packet loss or sensor dropout scenarios. In particular, signal- and sensor-dropout-based augmentation strategies provided considerable boosts to performance without negatively affecting the baseline performance. Overall the results provide tangible suggestions on how to optimize end-to-end neural network training for multi-channel movement sensor data.
    Audio-visual Attentive Fusion for Continuous Emotion Recognition. (arXiv:2107.01175v1 [cs.CV])
    (2 min) We propose an audio-visual spatial-temporal deep neural network with: (1) a visual block containing a pretrained 2D-CNN followed by a temporal convolutional network (TCN); (2) an aural block containing several parallel TCNs; and (3) a leader-follower attentive fusion block combining the audio-visual information. The TCN with large history coverage enables our model to exploit spatial-temporal information within a much larger window length (i.e., 300) than that from the baseline and state-of-the-art methods (i.e., 36 or 48). The fusion block emphasizes the visual modality while exploits the noisy aural modality using the inter-modality attention mechanism. To make full use of the data and alleviate over-fitting, cross-validation is carried out on the training and validation set. The concordance correlation coefficient (CCC) centering is used to merge the results from each fold. On the development set, the achieved CCC is 0.410 for valence and 0.661 for arousal, which significantly outperforms the baseline method with the corresponding CCC of 0.210 and 0.230 for valence and arousal, respectively. The code is available at https://github.com/sucv/ABAW2.
    Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive Learners With FlatNCE. (arXiv:2107.01152v1 [stat.ML])
    (2 min) InfoNCE-based contrastive representation learners, such as SimCLR, have been tremendously successful in recent years. However, these contrastive schemes are notoriously resource demanding, as their effectiveness breaks down with small-batch training (i.e., the log-K curse, whereas K is the batch-size). In this work, we reveal mathematically why contrastive learners fail in the small-batch-size regime, and present a novel simple, non-trivial contrastive objective named FlatNCE, which fixes this issue. Unlike InfoNCE, our FlatNCE no longer explicitly appeals to a discriminative classification goal for contrastive learning. Theoretically, we show FlatNCE is the mathematical dual formulation of InfoNCE, thus bridging the classical literature on energy modeling; and empirically, we demonstrate that, with minimal modification of code, FlatNCE enables immediate performance boost independent of the subject-matter engineering efforts. The significance of this work is furthered by the powerful generalization of contrastive learning techniques, and the introduction of new tools to monitor and diagnose contrastive training. We substantiate our claims with empirical evidence on CIFAR10, ImageNet, and other datasets, where FlatNCE consistently outperforms InfoNCE.
    Neural Marching Cubes. (arXiv:2106.11272v2 [cs.CV] UPDATED)
    (2 min) We introduce Neural Marching Cubes (NMC), a data-driven approach for extracting a triangle mesh from a discretized implicit field. Classical MC is defined by coarse tessellation templates isolated to individual cubes. While more refined tessellations have been proposed, they all make heuristic assumptions, such as trilinearity, when determining the vertex positions and local mesh topologies in each cube. In principle, none of these approaches can reconstruct geometric features that reveal coherence or dependencies between nearby cubes (e.g., a sharp edge), as such information is unaccounted for, resulting in poor estimates of the true underlying implicit field. To tackle these challenges, we re-cast MC from a deep learning perspective, by designing tessellation templates more apt at preserving geometric features, and learning the vertex positions and mesh topologies from training meshes, to account for contextual information from nearby cubes. We develop a compact per-cube parameterization to represent the output triangle mesh, while being compatible with neural processing, so that a simple 3D convolutional network can be employed for the training. We show that all topological cases in each cube that are applicable to our design can be easily derived using our representation, and the resulting tessellations can also be obtained naturally and efficiently by following a few design guidelines. In addition, our network learns local features with limited receptive fields, hence it generalizes well to new shapes and new datasets. We evaluate our neural MC approach by quantitative and qualitative comparisons to all well-known MC variants. In particular, we demonstrate the ability of our network to recover sharp features such as edges and corners, a long-standing issue of MC and its variants. Our network also reconstructs local mesh topologies more accurately than previous approaches.
    PointGuard: Provably Robust 3D Point Cloud Classification. (arXiv:2103.03046v2 [cs.CR] UPDATED)
    (2 min) 3D point cloud classification has many safety-critical applications such as autonomous driving and robotic grasping. However, several studies showed that it is vulnerable to adversarial attacks. In particular, an attacker can make a classifier predict an incorrect label for a 3D point cloud via carefully modifying, adding, and/or deleting a small number of its points. Randomized smoothing is state-of-the-art technique to build certifiably robust 2D image classifiers. However, when applied to 3D point cloud classification, randomized smoothing can only certify robustness against adversarially modified points. In this work, we propose PointGuard, the first defense that has provable robustness guarantees against adversarially modified, added, and/or deleted points. Specifically, given a 3D point cloud and an arbitrary point cloud classifier, our PointGuard first creates multiple subsampled point clouds, each of which contains a random subset of the points in the original point cloud; then our PointGuard predicts the label of the original point cloud as the majority vote among the labels of the subsampled point clouds predicted by the point cloud classifier. Our first major theoretical contribution is that we show PointGuard provably predicts the same label for a 3D point cloud when the number of adversarially modified, added, and/or deleted points is bounded. Our second major theoretical contribution is that we prove the tightness of our derived bound when no assumptions on the point cloud classifier are made. Moreover, we design an efficient algorithm to compute our certified robustness guarantees. We also empirically evaluate PointGuard on ModelNet40 and ScanNet benchmark datasets.
    1st Place Solutions for UG2+ Challenge 2021 -- (Semi-)supervised Face detection in the low light condition. (arXiv:2107.00818v1 [cs.CV])
    (2 min) In this technical report, we briefly introduce the solution of our team "TAL-ai" for (Semi-) supervised Face detection in the low light condition in UG2+ Challenge in CVPR 2021. By conducting several experiments with popular image enhancement methods and image transfer methods, we pulled the low light image and the normal image to a more closer domain. And it is observed that using these data to training can achieve better performance. We also adapt several popular object detection frameworks, e.g., DetectoRS, Cascade-RCNN, and large backbone like Swin-transformer. Finally, we ensemble several models which achieved mAP 74.89 on the testing set, ranking 1st on the final leaderboard.
    Visual Relationship Forecasting in Videos. (arXiv:2107.01181v1 [cs.CV])
    (2 min) Real-world scenarios often require the anticipation of object interactions in unknown future, which would assist the decision-making process of both humans and agents. To meet this challenge, we present a new task named Visual Relationship Forecasting (VRF) in videos to explore the prediction of visual relationships in a reasoning manner. Specifically, given a subject-object pair with H existing frames, VRF aims to predict their future interactions for the next T frames without visual evidence. To evaluate the VRF task, we introduce two video datasets named VRF-AG and VRF-VidOR, with a series of spatio-temporally localized visual relation annotations in a video. These two datasets densely annotate 13 and 35 visual relationships in 1923 and 13447 video clips, respectively. In addition, we present a novel Graph Convolutional Transformer (GCT) framework, which captures both object-level and frame-level dependencies by spatio-temporal Graph Convolution Network and Transformer. Experimental results on both VRF-AG and VRF-VidOR datasets demonstrate that GCT outperforms the state-of-the-art sequence modelling methods on visual relationship forecasting.
    Unsupervised Single Image Super-resolution Under Complex Noise. (arXiv:2107.00986v1 [cs.CV])
    (2 min) While the researches on single image super-resolution (SISR), especially equipped with deep neural networks (DNNs), have achieved tremendous successes recently, they still suffer from two major limitations. Firstly, the real image degradation is usually unknown and highly variant from one to another, making it extremely hard to train a single model to handle the general SISR task. Secondly, most of current methods mainly focus on the downsampling process of the degradation, but ignore or underestimate the inevitable noise contamination. For example, the commonly-used independent and identically distributed (i.i.d.) Gaussian noise distribution always largely deviates from the real image noise (e.g., camera sensor noise), which limits their performance in real scenarios. To address these issues, this paper proposes a model-based unsupervised SISR method to deal with the general SISR task with unknown degradations. Instead of the traditional i.i.d. Gaussian noise assumption, a novel patch-based non-i.i.d. noise modeling method is proposed to fit the complex real noise. Besides, a deep generator parameterized by a DNN is used to map the latent variable to the high-resolution image, and the conventional hyper-Laplacian prior is also elaborately embedded into such generator to further constrain the image gradients. Finally, a Monte Carlo EM algorithm is designed to solve our model, which provides a general inference framework to update the image generator both w.r.t. the latent variable and the network parameters. Comprehensive experiments demonstrate that the proposed method can evidently surpass the current state of the art (SotA) method (about 1dB PSNR) not only with a slighter model (0.34M vs. 2.40M) but also faster speed.
    Optical Braille Recognition using Circular Hough Transform. (arXiv:2107.00993v1 [cs.CV])
    (2 min) Braille has empowered visually challenged community to read and write. But at the same time, it has created a gap due to widespread inability of non-Braille users to understand Braille scripts. This gap has fuelled researchers to propose Optical Braille Recognition techniques to convert Braille documents to natural language. The main motivation of this work is to cement the communication gap at academic institutions by translating personal documents of blind students. This has been accomplished by proposing an economical and effective technique which digitizes Braille documents using a smartphone camera. For any given Braille image, a dot detection mechanism based on Hough transform is proposed which is invariant to skewness, noise and other deterrents. The detected dots are then clustered into Braille cells using distance-based clustering algorithm. In succession, the standard physical parameters of each Braille cells are estimated for feature extraction and classification as natural language characters. The comprehensive evaluation of this technique on the proposed dataset of 54 Braille scripts has yielded into accuracy of 98.71%.
    Self-Supervised Training Enhances Online Continual Learning. (arXiv:2103.14010v2 [cs.CV] UPDATED)
    (2 min) In continual learning, a system must incrementally learn from a non-stationary data stream without catastrophic forgetting. Recently, multiple methods have been devised for incrementally learning classes on large-scale image classification tasks, such as ImageNet. State-of-the-art continual learning methods use an initial supervised pre-training phase, in which the first 10% - 50% of the classes in a dataset are used to learn representations in an offline manner before continual learning of new classes begins. We hypothesize that self-supervised pre-training could yield features that generalize better than supervised learning, especially when the number of samples used for pre-training is small. We test this hypothesis using the self-supervised MoCo-V2, Barlow Twins, and SwAV algorithms. On ImageNet, we find that these methods outperform supervised pre-training considerably for online continual learning, and the gains are larger when fewer samples are available. Our findings are consistent across three online continual learning algorithms. Our best system achieves a 14.95% relative increase in top-1 accuracy on class incremental ImageNet over the prior state of the art for online continual learning.
    Generative Max-Mahalanobis Classifiers for Image Classification, Generation and More. (arXiv:2101.00122v4 [cs.CV] UPDATED)
    (2 min) Joint Energy-based Model (JEM) of Grathwohl et al. shows that a standard softmax classifier can be reinterpreted as an energy-based model (EBM) for the joint distribution p(x,y); the resulting model can be optimized to improve calibration, robustness, and out-of-distribution detection, while generating samples rivaling the quality of recent GAN-based approaches. However, the softmax classifier that JEM exploits is inherently discriminative and its latent feature space is not well formulated as probabilistic distributions, which may hinder its potential for image generation and incur training instability. We hypothesize that generative classifiers, such as Linear Discriminant Analysis (LDA), might be more suitable for image generation since generative classifiers model the data generation process explicitly. This paper therefore investigates an LDA classifier for image classification and generation. In particular, the Max-Mahalanobis Classifier (MMC), a special case of LDA, fits our goal very well. We show that our Generative MMC (GMMC) can be trained discriminatively, generatively, or jointly for image classification and generation. Extensive experiments on multiple datasets show that GMMC achieves state-of-the-art discriminative and generative performances, while outperforming JEM in calibration, adversarial robustness, and out-of-distribution detection by a significant margin. Our source code is available at https://github.com/sndnyang/GMMC.
    Automatic Plant Cover Estimation with Convolutional Neural Networks. (arXiv:2106.11154v2 [cs.CV] UPDATED)
    (2 min) Monitoring the responses of plants to environmental changes is essential for plant biodiversity research. This, however, is currently still being done manually by botanists in the field. This work is very laborious, and the data obtained is, though following a standardized method to estimate plant coverage, usually subjective and has a coarse temporal resolution. To remedy these caveats, we investigate approaches using convolutional neural networks (CNNs) to automatically extract the relevant data from images, focusing on plant community composition and species coverages of 9 herbaceous plant species. To this end, we investigate several standard CNN architectures and different pretraining methods. We find that we outperform our previous approach at higher image resolutions using a custom CNN with a mean absolute error of 5.16%. In addition to these investigations, we also conduct an error analysis based on the temporal aspect of the plant cover images. This analysis gives insight into where problems for automatic approaches lie, like occlusion and likely misclassifications caused by temporal changes.
    Predicting Clinical Outcomes in COVID-19 using Radiomics and Deep Learning on Chest Radiographs: A Multi-Institutional Study. (arXiv:2007.08028v2 [q-bio.QM] UPDATED)
    (3 min) We predict mechanical ventilation requirement and mortality using computational modeling of chest radiographs (CXRs) for coronavirus disease 2019 (COVID-19) patients. This two-center, retrospective study analyzed 530 deidentified CXRs from 515 COVID-19 patients treated at Stony Brook University Hospital and Newark Beth Israel Medical Center between March and August 2020. DL and machine learning classifiers to predict mechanical ventilation requirement and mortality were trained and evaluated using patient CXRs. A novel radiomic embedding framework was also explored for outcome prediction. All results are compared against radiologist grading of CXRs (zone-wise expert severity scores). Radiomic and DL classification models had mAUCs of 0.78+/-0.02 and 0.81+/-0.04, compared with expert scores mAUCs of 0.75+/-0.02 and 0.79+/-0.05 for mechanical ventilation requirement and mortality prediction, respectively. Combined classifiers using both radiomics and expert severity scores resulted in mAUCs of 0.79+/-0.04 and 0.83+/-0.04 for each prediction task, demonstrating improvement over either artificial intelligence or radiologist interpretation alone. Our results also suggest instances where inclusion of radiomic features in DL improves model predictions, something that might be explored in other pathologies. The models proposed in this study and the prognostic information they provide might aid physician decision making and resource allocation during the COVID-19 pandemic.
    Sub-millisecond Video Synchronization of Multiple Android Smartphones. (arXiv:2107.00987v1 [cs.CV])
    (2 min) This paper addresses the problem of building an affordable easy-to-setup synchronized multi-view camera system, which is in demand for many Computer Vision and Robotics applications in high-dynamic environments. In our work, we propose a solution for this problem - a publicly-available Android application for synchronized video recording on multiple smartphones with sub-millisecond accuracy. We present a generalized mathematical model of timestamping for Android smartphones and prove its applicability on 47 different physical devices. Also, we estimate the time drift parameter for those smartphones, which is less than 1.2 millisecond per minute for most of the considered devices, that makes smartphones' camera system a worthy analog for professional multi-view systems. Finally, we demonstrate Android-app performance on the camera system built from Android smartphones quantitatively, showing less than 300 microseconds synchronization error, and qualitatively - on panorama stitching task.
    Collaborative Visual Navigation. (arXiv:2107.01151v1 [cs.CV])
    (2 min) As a fundamental problem for Artificial Intelligence, multi-agent system (MAS) is making rapid progress, mainly driven by multi-agent reinforcement learning (MARL) techniques. However, previous MARL methods largely focused on grid-world like or game environments; MAS in visually rich environments has remained less explored. To narrow this gap and emphasize the crucial role of perception in MAS, we propose a large-scale 3D dataset, CollaVN, for multi-agent visual navigation (MAVN). In CollaVN, multiple agents are entailed to cooperatively navigate across photo-realistic environments to reach target locations. Diverse MAVN variants are explored to make our problem more general. Moreover, a memory-augmented communication framework is proposed. Each agent is equipped with a private, external memory to persistently store communication information. This allows agents to make better use of their past communication information, enabling more efficient collaboration and robust long-term planning. In our experiments, several baselines and evaluation metrics are designed. We also empirically verify the efficacy of our proposed MARL approach across different MAVN task settings.
    HandVoxNet++: 3D Hand Shape and Pose Estimation using Voxel-Based Neural Networks. (arXiv:2107.01205v1 [cs.CV])
    (2 min) 3D hand shape and pose estimation from a single depth map is a new and challenging computer vision problem with many applications. Existing methods addressing it directly regress hand meshes via 2D convolutional neural networks, which leads to artifacts due to perspective distortions in the images. To address the limitations of the existing methods, we develop HandVoxNet++, i.e., a voxel-based deep network with 3D and graph convolutions trained in a fully supervised manner. The input to our network is a 3D voxelized-depth-map-based on the truncated signed distance function (TSDF). HandVoxNet++ relies on two hand shape representations. The first one is the 3D voxelized grid of hand shape, which does not preserve the mesh topology and which is the most accurate representation. The second representation is the hand surface that preserves the mesh topology. We combine the advantages of both representations by aligning the hand surface to the voxelized hand shape either with a new neural Graph-Convolutions-based Mesh Registration (GCN-MeshReg) or classical segment-wise Non-Rigid Gravitational Approach (NRGA++) which does not rely on training data. In extensive evaluations on three public benchmarks, i.e., SynHand5M, depth-based HANDS19 challenge and HO-3D, the proposed HandVoxNet++ achieves the state-of-the-art performance. In this journal extension of our previous approach presented at CVPR 2020, we gain 41.09% and 13.7% higher shape alignment accuracy on SynHand5M and HANDS19 datasets, respectively. Our method is ranked first on the HANDS19 challenge dataset (Task 1: Depth-Based 3D Hand Pose Estimation) at the moment of the submission of our results to the portal in August 2020.
    How Incomplete is Contrastive Learning? AnInter-intra Variant Dual Representation Method forSelf-supervised Video Recognition. (arXiv:2107.01194v1 [cs.CV])
    (2 min) Contrastive learning applied to self-supervised representation learning has seen a resurgence in deep models. In this paper, we find that existing contrastive learning based solutions for self-supervised video recognition focus on inter-variance encoding but ignore the intra-variance existing in clips within the same video. We thus propose to learn dual representations for each clip which (\romannumeral 1) encode intra-variance through a shuffle-rank pretext task; (\romannumeral 2) encode inter-variance through a temporal coherent contrastive loss. Experiment results show that our method plays an essential role in balancing inter and intra variances and brings consistent performance gains on multiple backbones and contrastive learning frameworks. Integrated with SimCLR and pretrained on Kinetics-400, our method achieves $\textbf{82.0\%}$ and $\textbf{51.2\%}$ downstream classification accuracy on UCF101 and HMDB51 test sets respectively and $\textbf{46.1\%}$ video retrieval accuracy on UCF101, outperforming both pretext-task based and contrastive learning based counterparts.
    Outlier-Robust Estimation: Hardness, Minimally Tuned Algorithms, and Applications. (arXiv:2007.15109v3 [cs.CV] UPDATED)
    (2 min) Nonlinear estimation in robotics and vision is typically plagued with outliers due to wrong data association, or to incorrect detections from signal processing and machine learning methods. This paper introduces two unifying formulations for outlier-robust estimation, Generalized Maximum Consensus (G-MC) and Generalized Truncated Least Squares (G-TLS), and investigates fundamental limits, practical algorithms, and applications. Our first contribution is a proof that outlier-robust estimation is inapproximable: in the worst case, it is impossible to (even approximately) find the set of outliers, even with slower-than-polynomial-time algorithms (particularly, algorithms running in quasi-polynomial time). As a second contribution, we review and extend two general-purpose algorithms. The first, Adaptive Trimming (ADAPT), is combinatorial, and is suitable for G-MC; the second, Graduated Non-Convexity (GNC), is based on homotopy methods, and is suitable for G-TLS. We extend ADAPT and GNC to the case where the user does not have prior knowledge of the inlier-noise statistics (or the statistics may vary over time) and is unable to guess a reasonable threshold to separate inliers from outliers (as the one commonly used in RANSAC). We propose the first minimally tuned algorithms for outlier rejection, that dynamically decide how to separate inliers from outliers. Our third contribution is an evaluation of the proposed algorithms on robot perception problems: mesh registration, image-based object detection (shape alignment), and pose graph optimization. ADAPT and GNC execute in real-time, are deterministic, outperform RANSAC, and are robust up to 80-90% outliers. Their minimally tuned versions also compare favorably with the state of the art, even though they do not rely on a noise bound for the inliers.
    Evaluating the Usefulness of Unsupervised monitoring in Cultural Heritage Monuments. (arXiv:2107.00964v1 [cs.CV])
    (2 min) In this paper, we scrutinize the effectiveness of various clustering techniques, investigating their applicability in Cultural Heritage monitoring applications. In the context of this paper, we detect the level of decomposition and corrosion on the walls of Saint Nicholas fort in Rhodes utilizing hyperspectral images. A total of 6 different clustering approaches have been evaluated over a set of 14 different orthorectified hyperspectral images. Experimental setup in this study involves K-means, Spectral, Meanshift, DBSCAN, Birch and Optics algorithms. For each of these techniques we evaluate its performance by the use of performance metrics such as Calinski-Harabasz, Davies-Bouldin indexes and Silhouette value. In this approach, we evaluate the outcomes of the clustering methods by comparing them with a set of annotated images which denotes the ground truth regarding the decomposition and/or corrosion area of the original images. The results depict that a few clustering techniques applied on the given dataset succeeded decent accuracy, precision, recall and f1 scores. Eventually, it was observed that the deterioration was detected quite accurately.
    Magnification-independent Histopathological Image Classification with Similarity-based Multi-scale Embeddings. (arXiv:2107.01063v1 [cs.CV])
    (2 min) The classification of histopathological images is of great value in both cancer diagnosis and pathological studies. However, multiple reasons, such as variations caused by magnification factors and class imbalance, make it a challenging task where conventional methods that learn from image-label datasets perform unsatisfactorily in many cases. We observe that tumours of the same class often share common morphological patterns. To exploit this fact, we propose an approach that learns similarity-based multi-scale embeddings (SMSE) for magnification-independent histopathological image classification. In particular, a pair loss and a triplet loss are leveraged to learn similarity-based embeddings from image pairs or image triplets. The learned embeddings provide accurate measurements of similarities between images, which are regarded as a more effective form of representation for histopathological morphology than normal image features. Furthermore, in order to ensure the generated models are magnification-independent, images acquired at different magnification factors are simultaneously fed to networks during training for learning multi-scale embeddings. In addition to the SMSE, to eliminate the impact of class imbalance, instead of using the hard sample mining strategy that intuitively discards some easy samples, we introduce a new reinforced focal loss to simultaneously punish hard misclassified samples while suppressing easy well-classified samples. Experimental results show that the SMSE improves the performance for histopathological image classification tasks for both breast and liver cancers by a large margin compared to previous methods. In particular, the SMSE achieves the best performance on the BreakHis benchmark with an improvement ranging from 5% to 18% compared to previous methods using traditional features.
    A Survey on Deep Learning Technique for Video Segmentation. (arXiv:2107.01153v1 [cs.CV])
    (2 min) Video segmentation, i.e., partitioning video frames into multiple segments or objects, plays a critical role in a broad range of practical applications, e.g., visual effect assistance in movie, scene understanding in autonomous driving, and virtual background creation in video conferencing, to name a few. Recently, due to the renaissance of connectionism in computer vision, there has been an influx of numerous deep learning based approaches that have been dedicated to video segmentation and delivered compelling performance. In this survey, we comprehensively review two basic lines of research in this area, i.e., generic object segmentation (of unknown categories) in videos and video semantic segmentation, by introducing their respective task settings, background concepts, perceived need, development history, and main challenges. We also provide a detailed overview of representative literature on both methods and datasets. Additionally, we present quantitative performance comparisons of the reviewed methods on benchmark datasets. At last, we point out a set of unsolved open issues in this field, and suggest possible opportunities for further research.
    Active Fire Detection in Landsat-8 Imagery: a Large-Scale Dataset and a Deep-Learning Study. (arXiv:2101.03409v2 [cs.CV] UPDATED)
    (2 min) Active fire detection in satellite imagery is of critical importance to the management of environmental conservation policies, supporting decision-making and law enforcement. This is a well established field, with many techniques being proposed over the years, usually based on pixel or region-level comparisons involving sensor-specific thresholds and neighborhood statistics. In this paper, we address the problem of active fire detection using deep learning techniques. In recent years, deep learning techniques have been enjoying an enormous success in many fields, but their use for active fire detection is relatively new, with open questions and demand for datasets and architectures for evaluation. This paper addresses these issues by introducing a new large-scale dataset for active fire detection, with over 150,000 image patches (more than 200 GB of data) extracted from Landsat-8 images captured around the world in August and September 2020, containing wildfires in several locations. The dataset was split in two parts, and contains 10-band spectral images with associated outputs, produced by three well known handcrafted algorithms for active fire detection in the first part, and manually annotated masks in the second part. We also present a study on how different convolutional neural network architectures can be used to approximate these handcrafted algorithms, and how models trained on automatically segmented patches can be combined to achieve better performance than the original algorithms - with the best combination having 87.2% precision and 92.4% recall on our manually annotated dataset. The proposed dataset, source codes and trained models are available on Github (https://github.com/pereira-gha/activefire), creating opportunities for further advances in the field
    Cross-view Geo-localization with Evolving Transformer. (arXiv:2107.00842v1 [cs.CV])
    (2 min) In this work, we address the problem of cross-view geo-localization, which estimates the geospatial location of a street view image by matching it with a database of geo-tagged aerial images. The cross-view matching task is extremely challenging due to drastic appearance and geometry differences across views. Unlike existing methods that predominantly fall back on CNN, here we devise a novel evolving geo-localization Transformer (EgoTR) that utilizes the properties of self-attention in Transformer to model global dependencies, thus significantly decreasing visual ambiguities in cross-view geo-localization. We also exploit the positional encoding of Transformer to help the EgoTR understand and correspond geometric configurations between ground and aerial images. Compared to state-of-the-art methods that impose strong assumption on geometry knowledge, the EgoTR flexibly learns the positional embeddings through the training objective and hence becomes more practical in many real-world scenarios. Although Transformer is well suited to our task, its vanilla self-attention mechanism independently interacts within image patches in each layer, which overlooks correlations between layers. Instead, this paper propose a simple yet effective self-cross attention mechanism to improve the quality of learned representations. The self-cross attention models global dependencies between adjacent layers, which relates between image patches while modeling how features evolve in the previous layer. As a result, the proposed self-cross attention leads to more stable training, improves the generalization ability and encourages representations to keep evolving as the network goes deeper. Extensive experiments demonstrate that our EgoTR performs favorably against state-of-the-art methods on standard, fine-grained and cross-dataset cross-view geo-localization tasks.
    VMAF And Variants: Towards A Unified VQA. (arXiv:2103.07770v3 [eess.IV] UPDATED)
    (2 min) Video quality assessment (VQA) is now a fastgrowing subject, beginning to mature in the full reference (FR) case, while the burgeoning no reference (NR) case remains challenging. We investigate variants of the popular VMAF video quality assessment algorithm for the FR case, using support vector regression and feedforward neural networks, and extend it to the NR case, using the same learning architectures, to develop a partially unified framework for VQA. When heavily trained, algorithms such as VMAF perform well on test datasets, with 90%+ match; but predicting performance in the wild is better done by training/testing from scratch, as we do. Even from scratch, we achieve 90%+ performance in FR, with gains over VMAF. And we greatly reduce complexity vs. leading recent NR algorithms, VIDEVAL, RAPIQUE, yet exceed 80% in SRCC. In our preliminary testing, we find the improvements in trainability, while also constraining computational complexity, as quite encouraging, suggesting further study and analysis.
    WiCluster: Passive Indoor 2D/3D Positioning using WiFi without Precise Labels. (arXiv:2107.01002v1 [cs.NI])
    (2 min) We introduce WiCluster, a new machine learning (ML) approach for passive indoor positioning using radio frequency (RF) channel state information (CSI). WiCluster can predict both a zone-level position and a precise 2D or 3D position, without using any precise position labels during training. Prior CSI-based indoor positioning work has relied on non-parametric approaches using digital signal-processing (DSP) and, more recently, parametric approaches (e.g., fully supervised ML methods). However these do not handle the complexity of real-world environments well and do not meet requirements for large-scale commercial deployments: the accuracy of DSP-based method deteriorates significantly in non-line-of-sight conditions, while supervised ML methods need large amounts of hard-to-acquire centimeter accuracy position labels. In contrast, WiCluster is both precise and requires weaker label-information that can be easily collected. Our first contribution is a novel dimensionality reduction method for charting. It combines a triplet-loss with a multi-scale clustering-loss to map the high-dimensional CSI representation to a 2D/3D latent space. Our second contribution is two weakly supervised losses that map this latent space into a Cartesian map, resulting in meter-accuracy position results. These losses only require simple to acquire priors: a sketch of the floorplan, approximate location of access-point locations and a few CSI packets that are labeled with the corresponding zone in the floorplan. Thirdly, we report results and a robustness study for 2D positioning in a single-floor office building and 3D positioning in a two-floor home to show the robustness of our method.
    Parasitic Egg Detection and Classification in Low-cost Microscopic Images using Transfer Learning. (arXiv:2107.00968v1 [cs.CV])
    (2 min) Intestinal parasitic infection leads to several morbidities to humans worldwide, especially in tropical countries. The traditional diagnosis usually relies on manual analysis from microscopic images which is prone to human error due to morphological similarity of different parasitic eggs and abundance of impurities in a sample. Many studies have developed automatic systems for parasite egg detection to reduce human workload. However, they work with high quality microscopes, which unfortunately remain unaffordable in some rural areas. Our work thus exploits a benefit of a low-cost USB microscope. This instrument however provides poor quality of images due to limitation of magnification (10x), causing difficulty in parasite detection and species classification. In this paper, we propose a CNN-based technique using transfer learning strategy to enhance the efficiency of automatic parasite classification in poor-quality microscopic images. The patch-based technique with sliding window is employed to search for location of the eggs. Two networks, AlexNet and ResNet50, are examined with a trade-off between architecture size and classification performance. The results show that our proposed framework outperforms the state-of-the-art object recognition methods. Our system combined with final decision from an expert may improve the real faecal examination with low-cost microscopes.
    Ultrasound Video Transformers for Cardiac Ejection Fraction Estimation. (arXiv:2107.00977v1 [cs.CV])
    (2 min) Cardiac ultrasound imaging is used to diagnose various heart diseases. Common analysis pipelines involve manual processing of the video frames by expert clinicians. This suffers from intra- and inter-observer variability. We propose a novel approach to ultrasound video analysis using a transformer architecture based on a Residual Auto-Encoder Network and a BERT model adapted for token classification. This enables videos of any length to be processed. We apply our model to the task of End-Systolic (ES) and End-Diastolic (ED) frame detection and the automated computation of the left ventricular ejection fraction. We achieve an average frame distance of 3.36 frames for the ES and 7.17 frames for the ED on videos of arbitrary length. Our end-to-end learnable approach can estimate the ejection fraction with a MAE of 5.95 and $R^2$ of 0.52 in 0.15s per video, showing that segmentation is not the only way to predict ejection fraction. Code and models are available at https://github.com/HReynaud/UVT.
    ResIST: Layer-Wise Decomposition of ResNets for Distributed Training. (arXiv:2107.00961v1 [cs.LG])
    (2 min) We propose {\rm \texttt{ResIST}}, a novel distributed training protocol for Residual Networks (ResNets). {\rm \texttt{ResIST}} randomly decomposes a global ResNet into several shallow sub-ResNets that are trained independently in a distributed manner for several local iterations, before having their updates synchronized and aggregated into the global model. In the next round, new sub-ResNets are randomly generated and the process repeats. By construction, per iteration, {\rm \texttt{ResIST}} communicates only a small portion of network parameters to each machine and never uses the full model during training. Thus, {\rm \texttt{ResIST}} reduces the communication, memory, and time requirements of ResNet training to only a fraction of the requirements of previous methods. In comparison to common protocols like data-parallel training and data-parallel training with local SGD, {\rm \texttt{ResIST}} yields a decrease in wall-clock training time, while being competitive with respect to model performance.
    Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets. (arXiv:2107.00860v1 [cs.LG])
    (2 min) Despite the success of recent Neural Architecture Search (NAS) methods on various tasks which have shown to output networks that largely outperform human-designed networks, conventional NAS methods have mostly tackled the optimization of searching for the network architecture for a single task (dataset), which does not generalize well across multiple tasks (datasets). Moreover, since such task-specific methods search for a neural architecture from scratch for every given task, they incur a large computational cost, which is problematic when the time and monetary budget are limited. In this paper, we propose an efficient NAS framework that is trained once on a database consisting of datasets and pretrained networks and can rapidly search for a neural architecture for a novel dataset. The proposed MetaD2A (Meta Dataset-to-Architecture) model can stochastically generate graphs (architectures) from a given set (dataset) via a cross-modal latent space learned with amortized meta-learning. Moreover, we also propose a meta-performance predictor to estimate and select the best architecture without direct training on target datasets. The experimental results demonstrate that our model meta-learned on subsets of ImageNet-1K and architectures from NAS-Bench 201 search space successfully generalizes to multiple unseen datasets including CIFAR-10 and CIFAR-100, with an average search time of 33 GPU seconds. Even under MobileNetV3 search space, MetaD2A is 5.5K times faster than NSGANetV2, a transferable NAS method, with comparable performance. We believe that the MetaD2A proposes a new research direction for rapid NAS as well as ways to utilize the knowledge from rich databases of datasets and architectures accumulated over the past years. Code is available at https://github.com/HayeonLee/MetaD2A.
    MSN: Multi-Style Network for Trajectory Prediction. (arXiv:2107.00932v1 [cs.CV])
    (2 min) It is essential but challenging to predict future trajectories of various agents in complex scenes. Whether it is internal personality factors of agents, interactive behavior of the neighborhood, or the influence of surroundings, it will have an impact on their future behavior styles. It means that even for the same physical type of agents, there are huge differences in their behavior preferences. Although recent works have made significant progress in studying agents' multi-modal plannings, most of them still apply the same prediction strategy to all agents, which makes them difficult to fully show the multiple styles of vast agents. In this paper, we propose the Multi-Style Network (MSN) to focus on this problem by divide agents' preference styles into several hidden behavior categories adaptively and train each category's prediction network separately, therefore giving agents all styles of predictions simultaneously. Experiments demonstrate that our deterministic MSN-D and generative MSN-G outperform many recent state-of-the-art methods and show better multi-style characteristics in the visualized results.
    Target-dependent UNITER: A Transformer-Based Multimodal Language Comprehension Model for Domestic Service Robots. (arXiv:2107.00811v1 [cs.RO])
    (2 min) Currently, domestic service robots have an insufficient ability to interact naturally through language. This is because understanding human instructions is complicated by various ambiguities and missing information. In existing methods, the referring expressions that specify the relationships between objects are insufficiently modeled. In this paper, we propose Target-dependent UNITER, which learns the relationship between the target object and other objects directly by focusing on the relevant regions within an image, rather than the whole image. Our method is an extension of the UNITER-based Transformer that can be pretrained on general-purpose datasets. We extend the UNITER approach by introducing a new architecture for handling the target candidates. Our model is validated on two standard datasets, and the results show that Target-dependent UNITER outperforms the baseline method in terms of classification accuracy.
    HO-3D_v3: Improving the Accuracy of Hand-Object Annotations of the HO-3D Dataset. (arXiv:2107.00887v1 [cs.CV])
    (2 min) HO-3D is a dataset providing image sequences of various hand-object interaction scenarios annotated with the 3D pose of the hand and the object and was originally introduced as HO-3D_v2. The annotations were obtained automatically using an optimization method, 'HOnnotate', introduced in the original paper. HO-3D_v3 provides more accurate annotations for both the hand and object poses thus resulting in better estimates of contact regions between the hand and the object. In this report, we elaborate on the improvements to the HOnnotate method and provide evaluations to compare the accuracy of HO-3D_v2 and HO-3D_v3. HO-3D_v3 results in 4mm higher accuracy compared to HO-3D_v2 for hand poses while exhibiting higher contact regions with the object surface.
    Passing a Non-verbal Turing Test: Evaluating Gesture Animations Generated from Speech. (arXiv:2107.00712v1 [cs.CV])
    (2 min) In real life, people communicate using both speech and non-verbal signals such as gestures, face expression or body pose. Non-verbal signals impact the meaning of the spoken utterance in an abundance of ways. An absence of non-verbal signals impoverishes the process of communication. Yet, when users are represented as avatars, it is difficult to translate non-verbal signals along with the speech into the virtual world without specialized motion-capture hardware. In this paper, we propose a novel, data-driven technique for generating gestures directly from speech. Our approach is based on the application of Generative Adversarial Neural Networks (GANs) to model the correlation rather than causation between speech and gestures. This approach approximates neuroscience findings on how non-verbal communication and speech are correlated. We create a large dataset which consists of speech and corresponding gestures in a 3D human pose format from which our model learns the speaker-specific correlation. We evaluate the proposed technique in a user study that is inspired by the Turing test. For the study, we animate the generated gestures on a virtual character. We find that users are not able to distinguish between the generated and the recorded gestures. Moreover, users are able to identify our synthesized gestures as related or not related to a given utterance.
    Aerial Map-Based Navigation Using Semantic Segmentation and Pattern Matching. (arXiv:2107.00689v1 [cs.CV])
    (2 min) This paper proposes a novel approach to map-based navigation system for unmanned aircraft. The proposed system attempts label-to-label matching, not image-to-image matching between aerial images and a map database. By using semantic segmentation, the ground objects are labelled and the configuration of the objects is used to find the corresponding location in the map database. The use of the deep learning technique as a tool for extracting high-level features reduces the image-based localization problem to a pattern matching problem. This paper proposes a pattern matching algorithm which does not require altitude information or a camera model to estimate the absolute horizontal position. The feasibility analysis with simulated images shows the proposed map-based navigation can be realized with the proposed pattern matching algorithm and it is able to provide positions given the labelled objects.
    Unsupervised Image Segmentation by Mutual Information Maximization and Adversarial Regularization. (arXiv:2107.00691v1 [cs.CV])
    (2 min) Semantic segmentation is one of the basic, yet essential scene understanding tasks for an autonomous agent. The recent developments in supervised machine learning and neural networks have enjoyed great success in enhancing the performance of the state-of-the-art techniques for this task. However, their superior performance is highly reliant on the availability of a large-scale annotated dataset. In this paper, we propose a novel fully unsupervised semantic segmentation method, the so-called Information Maximization and Adversarial Regularization Segmentation (InMARS). Inspired by human perception which parses a scene into perceptual groups, rather than analyzing each pixel individually, our proposed approach first partitions an input image into meaningful regions (also known as superpixels). Next, it utilizes Mutual-Information-Maximization followed by an adversarial training strategy to cluster these regions into semantically meaningful classes. To customize an adversarial training scheme for the problem, we incorporate adversarial pixel noise along with spatial perturbations to impose photometrical and geometrical invariance on the deep neural network. Our experiments demonstrate that our method achieves the state-of-the-art performance on two commonly used unsupervised semantic segmentation datasets, COCO-Stuff, and Potsdam.
    SIMILAR: Submodular Information Measures Based Active Learning In Realistic Scenarios. (arXiv:2107.00717v1 [cs.LG])
    (2 min) Active learning has proven to be useful for minimizing labeling costs by selecting the most informative samples. However, existing active learning methods do not work well in realistic scenarios such as imbalance or rare classes, out-of-distribution data in the unlabeled set, and redundancy. In this work, we propose SIMILAR (Submodular Information Measures based actIve LeARning), a unified active learning framework using recently proposed submodular information measures (SIM) as acquisition functions. We argue that SIMILAR not only works in standard active learning, but also easily extends to the realistic settings considered above and acts as a one-stop solution for active learning that is scalable to large real-world datasets. Empirically, we show that SIMILAR significantly outperforms existing active learning algorithms by as much as ~5% - 18% in the case of rare classes and ~5% - 10% in the case of out-of-distribution data on several image classification tasks like CIFAR-10, MNIST, and ImageNet.
    Case Relation Transformer: A Crossmodal Language Generation Model for Fetching Instructions. (arXiv:2107.00789v1 [cs.RO])
    (2 min) There have been many studies in robotics to improve the communication skills of domestic service robots. Most studies, however, have not fully benefited from recent advances in deep neural networks because the training datasets are not large enough. In this paper, our aim is to augment the datasets based on a crossmodal language generation model. We propose the Case Relation Transformer (CRT), which generates a fetching instruction sentence from an image, such as "Move the blue flip-flop to the lower left box." Unlike existing methods, the CRT uses the Transformer to integrate the visual features and geometry features of objects in the image. The CRT can handle the objects because of the Case Relation Block. We conducted comparison experiments and a human evaluation. The experimental results show the CRT outperforms baseline methods.
    MMF: Multi-Task Multi-Structure Fusion for Hierarchical Image Classification. (arXiv:2107.00808v1 [cs.CV])
    (2 min) Hierarchical classification is significant for complex tasks by providing multi-granular predictions and encouraging better mistakes. As the label structure decides its performance, many existing approaches attempt to construct an excellent label structure for promoting the classification results. In this paper, we consider that different label structures provide a variety of prior knowledge for category recognition, thus fusing them is helpful to achieve better hierarchical classification results. Furthermore, we propose a multi-task multi-structure fusion model to integrate different label structures. It contains two kinds of branches: one is the traditional classification branch to classify the common subclasses, the other is responsible for identifying the heterogeneous superclasses defined by different label structures. Besides the effect of multiple label structures, we also explore the architecture of the deep model for better hierachical classification and adjust the hierarchical evaluation metrics for multiple label structures. Experimental results on CIFAR100 and Car196 show that our method obtains significantly better results than using a flat classifier or a hierarchical classifier with any single label structure.
    Mitigating Uncertainty of Classifier for Unsupervised Domain Adaptation. (arXiv:2107.00727v1 [cs.LG])
    (2 min) Understanding unsupervised domain adaptation has been an important task that has been well explored. However, the wide variety of methods have not analyzed the role of a classifier's performance in detail. In this paper, we thoroughly examine the role of a classifier in terms of matching source and target distributions. We specifically investigate the classifier ability by matching a) the distribution of features, b) probabilistic uncertainty for samples and c) certainty activation mappings. Our analysis suggests that using these three distributions does result in a consistently improved performance on all the datasets. Our work thus extends present knowledge on the role of the various distributions obtained from the classifier towards solving unsupervised domain adaptation.
    UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation. (arXiv:2107.00781v1 [cs.CV])
    (2 min) Transformer architecture has emerged to be successful in a number of natural language processing tasks. However, its applications to medical vision remain largely unexplored. In this study, we present UTNet, a simple yet powerful hybrid Transformer architecture that integrates self-attention into a convolutional neural network for enhancing medical image segmentation. UTNet applies self-attention modules in both encoder and decoder for capturing long-range dependency at different scales with minimal overhead. To this end, we propose an efficient self-attention mechanism along with relative position encoding that reduces the complexity of self-attention operation significantly from $O(n^2)$ to approximate $O(n)$. A new self-attention decoder is also proposed to recover fine-grained details from the skipped connections in the encoder. Our approach addresses the dilemma that Transformer requires huge amounts of data to learn vision inductive bias. Our hybrid layer design allows the initialization of Transformer into convolutional networks without a need of pre-training. We have evaluated UTNet on the multi-label, multi-vendor cardiac magnetic resonance imaging cohort. UTNet demonstrates superior segmentation performance and robustness against the state-of-the-art approaches, holding the promise to generalize well on other medical image segmentations.
    Blind Image Super-Resolution via Contrastive Representation Learning. (arXiv:2107.00708v1 [cs.CV])
    (2 min) Image super-resolution (SR) research has witnessed impressive progress thanks to the advance of convolutional neural networks (CNNs) in recent years. However, most existing SR methods are non-blind and assume that degradation has a single fixed and known distribution (e.g., bicubic) which struggle while handling degradation in real-world data that usually follows a multi-modal, spatially variant, and unknown distribution. The recent blind SR studies address this issue via degradation estimation, but they do not generalize well to multi-source degradation and cannot handle spatially variant degradation. We design CRL-SR, a contrastive representation learning network that focuses on blind SR of images with multi-modal and spatially variant distributions. CRL-SR addresses the blind SR challenges from two perspectives. The first is contrastive decoupling encoding which introduces contrastive learning to extract resolution-invariant embedding and discard resolution-variant embedding under the guidance of a bidirectional contrastive loss. The second is contrastive feature refinement which generates lost or corrupted high-frequency details under the guidance of a conditional contrastive loss. Extensive experiments on synthetic datasets and real images show that the proposed CRL-SR can handle multi-modal and spatially variant degradation effectively under blind settings and it also outperforms state-of-the-art SR methods qualitatively and quantitatively.
    Enhancing Multi-Robot Perception via Learned Data Association. (arXiv:2107.00769v1 [cs.RO])
    (2 min) In this paper, we address the multi-robot collaborative perception problem, specifically in the context of multi-view infilling for distributed semantic segmentation. This setting entails several real-world challenges, especially those relating to unregistered multi-agent image data. Solutions must effectively leverage multiple, non-static, and intermittently-overlapping RGB perspectives. To this end, we propose the Multi-Agent Infilling Network: an extensible neural architecture that can be deployed (in a distributed manner) to each agent in a robotic swarm. Specifically, each robot is in charge of locally encoding and decoding visual information, and an extensible neural mechanism allows for an uncertainty-aware and context-based exchange of intermediate features. We demonstrate improved performance on a realistic multi-robot AirSim dataset.
    Overcoming Obstructions via Bandwidth-Limited Multi-Agent Spatial Handshaking. (arXiv:2107.00771v1 [cs.RO])
    (2 min) In this paper, we address bandwidth-limited and obstruction-prone collaborative perception, specifically in the context of multi-agent semantic segmentation. This setting presents several key challenges, including processing and exchanging unregistered robotic swarm imagery. To be successful, solutions must effectively leverage multiple non-static and intermittently-overlapping RGB perspectives, while heeding bandwidth constraints and overcoming unwanted foreground obstructions. As such, we propose an end-to-end learn-able Multi-Agent Spatial Handshaking network (MASH) to process, compress, and propagate visual information across a robotic swarm. Our distributed communication module operates directly (and exclusively) on raw image data, without additional input requirements such as pose, depth, or warping data. We demonstrate superior performance of our model compared against several baselines in a photo-realistic multi-robot AirSim environment, especially in the presence of image occlusions. Our method achieves an absolute 11% IoU improvement over strong baselines.
    Intrinsic Image Transfer for Illumination Manipulation. (arXiv:2107.00704v1 [cs.CV])
    (2 min) This paper presents a novel intrinsic image transfer (IIT) algorithm for illumination manipulation, which creates a local image translation between two illumination surfaces. This model is built on an optimization-based framework consisting of three photo-realistic losses defined on the sub-layers factorized by an intrinsic image decomposition. We illustrate that all losses can be reduced without the necessity of taking an intrinsic image decomposition under the well-known spatial-varying illumination illumination-invariant reflectance prior knowledge. Moreover, with a series of relaxations, all of them can be directly defined on images, giving a closed-form solution for image illumination manipulation. This new paradigm differs from the prevailing Retinex-based algorithms, as it provides an implicit way to deal with the per-pixel image illumination. We finally demonstrate its versatility and benefits to the illumination-related tasks such as illumination compensation, image enhancement, and high dynamic range (HDR) image compression, and show the high-quality results on natural image datasets.
  • cs.IR updates on arXiv.org

    Quantifying Availability and Discovery in Recommender Systems via Stochastic Reachability. (arXiv:2107.00833v1 [cs.IR])
    (2 min) In this work, we consider how preference models in interactive recommendation systems determine the availability of content and users' opportunities for discovery. We propose an evaluation procedure based on stochastic reachability to quantify the maximum probability of recommending a target piece of content to an user for a set of allowable strategic modifications. This framework allows us to compute an upper bound on the likelihood of recommendation with minimal assumptions about user behavior. Stochastic reachability can be used to detect biases in the availability of content and diagnose limitations in the opportunities for discovery granted to users. We show that this metric can be computed efficiently as a convex program for a variety of practical settings, and further argue that reachability is not inherently at odds with accuracy. We demonstrate evaluations of recommendation algorithms trained on large datasets of explicit and implicit ratings. Our results illustrate how preference models, selection rules, and user interventions impact reachability and how these effects can be distributed unevenly.
    Exploiting Cross-Session Information for Session-based Recommendation with Graph Neural Networks. (arXiv:2107.00852v1 [cs.IR])
    (2 min) Different from the traditional recommender system, the session-based recommender system introduces the concept of the session, i.e., a sequence of interactions between a user and multiple items within a period, to preserve the user's recent interest. The existing work on the session-based recommender system mainly relies on mining sequential patterns within individual sessions, which are not expressive enough to capture more complicated dependency relationships among items. In addition, it does not consider the cross-session information due to the anonymity of the session data, where the linkage between different sessions is prevented. In this paper, we solve these problems with the graph neural networks technique. First, each session is represented as a graph rather than a linear sequence structure, based on which a novel Full Graph Neural Network (FGNN) is proposed to learn complicated item dependency. To exploit and incorporate cross-session information in the individual session's representation learning, we further construct a Broadly Connected Session (BCS) graph to link different sessions and a novel Mask-Readout function to improve session embedding based on the BCS graph. Extensive experiments have been conducted on two e-commerce benchmark datasets, i.e., Yoochoose and Diginetica, and the experimental results demonstrate the superiority of our proposal through comparisons with state-of-the-art session-based recommender models.
    Exploiting Positional Information for Session-based Recommendation. (arXiv:2107.00846v1 [cs.IR])
    (2 min) For present e-commerce platforms, session-based recommender systems are developed to predict users' preference for next-item recommendation. Although a session can usually reflect a user's current preference, a local shift of the user's intention within the session may still exist. Specifically, the interactions that take place in the early positions within a session generally indicate the user's initial intention, while later interactions are more likely to represent the latest intention. Such positional information has been rarely considered in existing methods, which restricts their ability to capture the significance of interactions at different positions. To thoroughly exploit the positional information within a session, a theoretical framework is developed in this paper to provide an in-depth analysis of the positional information. We formally define the properties of forward-awareness and backward-awareness to evaluate the ability of positional encoding schemes in capturing the initial and the latest intention. According to our analysis, existing positional encoding schemes are generally forward-aware only, which can hardly represent the dynamics of the intention in a session. To enhance the positional encoding scheme for the session-based recommendation, a dual positional encoding (DPE) is proposed to account for both forward-awareness and backward-awareness. Based on DPE, we propose a novel Positional Recommender (PosRec) model with a well-designed Position-aware Gated Graph Neural Network module to fully exploit the positional information for session-based recommendation tasks. Extensive experiments are conducted on two e-commerce benchmark datasets, Yoochoose and Diginetica and the experimental results show the superiority of the PosRec by comparing it with the state-of-the-art session-based recommender models.
    On-Demand and Lightweight Knowledge Graph Generation -- a Demonstration with DBpedia. (arXiv:2107.00873v1 [cs.IR])
    (2 min) Modern large-scale knowledge graphs, such as DBpedia, are datasets which require large computational resources to serve and process. Moreover, they often have longer release cycles, which leads to outdated information in those graphs. In this paper, we present DBpedia on Demand -- a system which serves DBpedia resources on demand without the need to materialize and store the entire graph, and which even provides limited querying functionality.
  • cs.LG updates on arXiv.org

    Feature Encoding with AutoEncoders for Weakly-supervised Anomaly Detection. (arXiv:2105.10500v3 [cs.LG] UPDATED)
    (2 min) Weakly-supervised anomaly detection aims at learning an anomaly detector from a limited amount of labeled data and abundant unlabeled data. Recent works build deep neural networks for anomaly detection by discriminatively mapping the normal samples and abnormal samples to different regions in the feature space or fitting different distributions. However, due to the limited number of annotated anomaly samples, directly training networks with the discriminative loss may not be sufficient. To overcome this issue, this paper proposes a novel strategy to transform the input data into a more meaningful representation that could be used for anomaly detection. Specifically, we leverage an autoencoder to encode the input data and utilize three factors, hidden representation, reconstruction residual vector, and reconstruction error, as the new representation for the input data. This representation amounts to encode a test sample with its projection on the training data manifold, its direction to its projection and its distance to its projection. In addition to this encoding, we also propose a novel network architecture to seamlessly incorporate those three factors. From our extensive experiments, the benefits of the proposed strategy are clearly demonstrated by its superior performance over the competitive methods.
    Consequence-aware Sequential Counterfactual Generation. (arXiv:2104.05592v2 [cs.LG] UPDATED)
    (2 min) Counterfactuals have become a popular technique nowadays for interacting with black-box machine learning models and understanding how to change a particular instance to obtain a desired outcome from the model. However, most existing approaches assume instant materialization of these changes, ignoring that they may require effort and a specific order of application. Recently, methods have been proposed that also consider the order in which actions are applied, leading to the so-called sequential counterfactual generation problem. In this work, we propose a model-agnostic method for sequential counterfactual generation. We formulate the task as a multi-objective optimization problem and present a genetic algorithm approach to find optimal sequences of actions leading to the counterfactuals. Our cost model considers not only the direct effect of an action, but also its consequences. Experimental results show that compared to state-of-the-art, our approach generates less costly solutions, is more efficient and provides the user with a diverse set of solutions to choose from.
    Discretization Drift in Two-Player Games. (arXiv:2105.13922v2 [stat.ML] UPDATED)
    (2 min) Gradient-based methods for two-player games produce rich dynamics that can solve challenging problems, yet can be difficult to stabilize and understand. Part of this complexity originates from the discrete update steps given by simultaneous or alternating gradient descent, which causes each player to drift away from the continuous gradient flow -- a phenomenon we call discretization drift. Using backward error analysis, we derive modified continuous dynamical systems that closely follow the discrete dynamics. These modified dynamics provide an insight into the notorious challenges associated with zero-sum games, including Generative Adversarial Networks. In particular, we identify distinct components of the discretization drift that can alter performance and in some cases destabilize the game. Finally, quantifying discretization drift allows us to identify regularizers that explicitly cancel harmful forms of drift or strengthen beneficial forms of drift, and thus improve performance of GAN training.
    ExplainaBoard: An Explainable Leaderboard for NLP. (arXiv:2104.06387v2 [cs.CL] UPDATED)
    (2 min) With the rapid development of NLP research, leaderboards have emerged as one tool to track the performance of various systems on various NLP tasks. They are effective in this goal to some extent, but generally present a rather simplistic one-dimensional view of the submitted systems, communicated only through holistic accuracy numbers. In this paper, we present a new conceptualization and implementation of NLP evaluation: the ExplainaBoard, which in addition to inheriting the functionality of the standard leaderboard, also allows researchers to (i) diagnose strengths and weaknesses of a single system (e.g.~what is the best-performing system bad at?) (ii) interpret relationships between multiple systems. (e.g.~where does system A outperform system B? What if we combine systems A, B, and C?) and (iii) examine prediction results closely (e.g.~what are common errors made by multiple systems, or in what contexts do particular errors occur?). So far, ExplainaBoard covers more than 400 systems, 50 datasets, 40 languages, and 12 tasks. ExplainaBoard keeps updated and is recently upgraded by supporting (1) multilingual multi-task benchmark, (2) meta-evaluation, and (3) more complicated task: machine translation, which reviewers also suggested.} We not only released an online platform on the website \url{this http URL} but also make our evaluation tool an API with MIT Licence at Github \url{https://github.com/neulab/explainaBoard} and PyPi \url{https://pypi.org/project/interpret-eval/} that allows users to conveniently assess their models offline. We additionally release all output files from systems that we have run or collected to motivate "output-driven" research in the future.
    Neural Marching Cubes. (arXiv:2106.11272v2 [cs.CV] UPDATED)
    (2 min) We introduce Neural Marching Cubes (NMC), a data-driven approach for extracting a triangle mesh from a discretized implicit field. Classical MC is defined by coarse tessellation templates isolated to individual cubes. While more refined tessellations have been proposed, they all make heuristic assumptions, such as trilinearity, when determining the vertex positions and local mesh topologies in each cube. In principle, none of these approaches can reconstruct geometric features that reveal coherence or dependencies between nearby cubes (e.g., a sharp edge), as such information is unaccounted for, resulting in poor estimates of the true underlying implicit field. To tackle these challenges, we re-cast MC from a deep learning perspective, by designing tessellation templates more apt at preserving geometric features, and learning the vertex positions and mesh topologies from training meshes, to account for contextual information from nearby cubes. We develop a compact per-cube parameterization to represent the output triangle mesh, while being compatible with neural processing, so that a simple 3D convolutional network can be employed for the training. We show that all topological cases in each cube that are applicable to our design can be easily derived using our representation, and the resulting tessellations can also be obtained naturally and efficiently by following a few design guidelines. In addition, our network learns local features with limited receptive fields, hence it generalizes well to new shapes and new datasets. We evaluate our neural MC approach by quantitative and qualitative comparisons to all well-known MC variants. In particular, we demonstrate the ability of our network to recover sharp features such as edges and corners, a long-standing issue of MC and its variants. Our network also reconstructs local mesh topologies more accurately than previous approaches.
    NTIRE 2021 Multi-modal Aerial View Object Classification Challenge. (arXiv:2107.01189v1 [cs.CV])
    (2 min) In this paper, we introduce the first Challenge on Multi-modal Aerial View Object Classification (MAVOC) in conjunction with the NTIRE 2021 workshop at CVPR. This challenge is composed of two different tracks using EO andSAR imagery. Both EO and SAR sensors possess different advantages and drawbacks. The purpose of this competition is to analyze how to use both sets of sensory information in complementary ways. We discuss the top methods submitted for this competition and evaluate their results on our blind test set. Our challenge results show significant improvement of more than 15% accuracy from our current baselines for each track of the competition
    MegazordNet: combining statistical and machine learning standpoints for time series forecasting. (arXiv:2107.01017v1 [q-fin.ST])
    (2 min) Forecasting financial time series is considered to be a difficult task due to the chaotic feature of the series. Statistical approaches have shown solid results in some specific problems such as predicting market direction and single-price of stocks; however, with the recent advances in deep learning and big data techniques, new promising options have arises to tackle financial time series forecasting. Moreover, recent literature has shown that employing a combination of statistics and machine learning may improve accuracy in the forecasts in comparison to single solutions. Taking into consideration the mentioned aspects, in this work, we proposed the MegazordNet, a framework that explores statistical features within a financial series combined with a structured deep learning model for time series forecasting. We evaluated our approach predicting the closing price of stocks in the S&P 500 using different metrics, and we were able to beat single statistical and machine learning methods.
    Multi-user VoiceFilter-Lite via Attentive Speaker Embedding. (arXiv:2107.01201v1 [eess.AS])
    (2 min) In this paper, we propose a solution to allow speaker conditioned speech models, such as VoiceFilter-Lite, to support an arbitrary number of enrolled users in a single pass. This is achieved by using an attention mechanism on multiple speaker embeddings to compute a single attentive embedding, which is then used as a side input to the model. We implemented multi-user VoiceFilter-Lite and evaluated it for three tasks: (1) a streaming automatic speech recognition (ASR) task; (2) a text-independent speaker verification task; and (3) a personalized keyphrase detection task, where ASR has to detect keyphrases from multiple enrolled users in a noisy environment. Our experiments show that, with up to four enrolled users, multi-user VoiceFilter-Lite is able to significantly reduce speech recognition and speaker verification errors when there is overlapping speech, without affecting performance under other acoustic conditions. This attentive speaker embedding approach can also be easily applied to other speaker-conditioned models such as personal VAD and personalized ASR.
    Estimating the electrical power output of industrial devices with end-to-end time-series classification in the presence of label noise. (arXiv:2105.00349v2 [cs.LG] UPDATED)
    (2 min) In complex industrial settings, it is common practice to monitor the operation of machines in order to detect undesired states, adjust maintenance schedules, optimize system performance or collect usage statistics of individual machines. In this work, we focus on estimating the power output of a Combined Heat and Power (CHP) machine of a medium-sized company facility by analyzing the total facility power consumption. We formulate the problem as a time-series classification problem where the class label represents the CHP power output. As the facility is fully instrumented and sensor measurements from the CHP are available, we generate the training labels in an automated fashion from the CHP sensor readings. However, sensor failures result in mislabeled training data samples which are hard to detect and remove from the dataset. Therefore, we propose a novel multi-task deep learning approach that jointly trains a classifier and an autoencoder with a shared embedding representation. The proposed approach targets to gradually correct the mislabelled data samples during training in a self-supervised fashion, without any prior assumption on the amount of label noise. We benchmark our approach on several time-series classification datasets and find it to be comparable and sometimes better than state-of-the-art methods. On the real-world use-case of predicting the CHP power output, we thoroughly evaluate the architectural design choices and show that the final architecture considerably increases the robustness of the learning process and consistently beats other recent state-of-the-art algorithms in the presence of unstructured as well as structured label noise.
    Application of neural networks to classification of data of the TUS orbital telescope. (arXiv:2106.03361v2 [astro-ph.IM] UPDATED)
    (2 min) We employ neural networks for classification of data of the TUS fluorescence telescope, the world's first orbital detector of ultra-high energy cosmic rays. We focus on two particular types of signals in the TUS data: track-like flashes produced by cosmic ray hits of the photodetector and flashes that originated from distant lightnings. We demonstrate that even simple neural networks combined with certain conventional methods of data analysis can be highly effective in tasks of classification of data of fluorescence telescopes.
    Conditional Neural Relational Inference for Interacting Systems. (arXiv:2106.11083v2 [cs.LG] UPDATED)
    (2 min) In this work, we want to learn to model the dynamics of similar yet distinct groups of interacting objects. These groups follow some common physical laws that exhibit specificities that are captured through some vectorial description. We develop a model that allows us to do conditional generation from any such group given its vectorial description. Unlike previous work on learning dynamical systems that can only do trajectory completion and require a part of the trajectory dynamics to be provided as input in generation time, we do generation using only the conditioning vector with no access to generation time's trajectories. We evaluate our model in the setting of modeling human gait and, in particular pathological human gait.
    Fast Tucker Rank Reduction for Non-Negative Tensors Using Mean-Field Approximation. (arXiv:2103.02898v2 [stat.ML] UPDATED)
    (2 min) We present an efficient low-rank approximation algorithm for non-negative tensors. The algorithm is derived from our two findings: First, we show that rank-1 approximation for tensors can be viewed as a mean-field approximation by treating each tensor as a probability distribution. Second, we theoretically provide a sufficient condition for distribution parameters to reduce Tucker ranks of tensors and, interestingly, this sufficient condition can be achieved by iterative application of the mean-field approximation. Since the mean-field approximation is always given as a closed formula, our findings lead to a fast low-rank approximation algorithm without using a gradient method. We empirically demonstrate that our algorithm is faster than the existing non-negative Tucker rank reduction methods with achieving competitive or better approximation of given tensors.
    SparseDNN: Fast Sparse Deep Learning Inference on CPUs. (arXiv:2101.07948v3 [cs.LG] UPDATED)
    (2 min) The last few years have seen gigantic leaps in algorithms and systems to support efficient deep learning inference. Pruning and quantization algorithms can now consistently compress neural networks by an order of magnitude. For a compressed neural network, a multitude of inference frameworks have been designed to maximize the performance of the target hardware. While we find mature support for quantized neural networks in production frameworks such as OpenVINO and MNN, support for pruned sparse neural networks is still lacking. To tackle this challenge, we present SparseDNN, a sparse deep learning inference engine targeting CPUs. We present both kernel-level optimizations with a sparse code generator to accelerate sparse operators and novel network-level optimizations catering to sparse networks. We show that our sparse code generator can achieve significant speedups over state-of-the-art sparse and dense libraries. On end-to-end benchmarks such as Huggingface pruneBERT, SparseDNN achieves up to 5x throughput improvement over dense inference with state-of-the-art OpenVINO.
    Fine-Grained Gap-Dependent Bounds for Tabular MDPs via Adaptive Multi-Step Bootstrap. (arXiv:2102.04692v2 [cs.LG] UPDATED)
    (2 min) This paper presents a new model-free algorithm for episodic finite-horizon Markov Decision Processes (MDP), Adaptive Multi-step Bootstrap (AMB), which enjoys a stronger gap-dependent regret bound. The first innovation is to estimate the optimal $Q$-function by combining an optimistic bootstrap with an adaptive multi-step Monte Carlo rollout. The second innovation is to select the action with the largest confidence interval length among admissible actions that are not dominated by any other actions. We show when each state has a unique optimal action, AMB achieves a gap-dependent regret bound that only scales with the sum of the inverse of the sub-optimality gaps. In contrast, Simchowitz and Jamieson (2019) showed all upper-confidence-bound (UCB) algorithms suffer an additional $\Omega\left(\frac{S}{\Delta_{min}}\right)$ regret due to over-exploration where $\Delta_{min}$ is the minimum sub-optimality gap and $S$ is the number of states. We further show that for general MDPs, AMB suffers an additional $\frac{|Z_{mul}|}{\Delta_{min}}$ regret, where $Z_{mul}$ is the set of state-action pairs $(s,a)$'s satisfying $a$ is a non-unique optimal action for $s$. We complement our upper bound with a lower bound showing the dependency on $\frac{|Z_{mul}|}{\Delta_{min}}$ is unavoidable for any consistent algorithm. This lower bound also implies a separation between reinforcement learning and contextual bandits.
    Simple yet Sharp Sensitivity Analysis for Unmeasured Confounding. (arXiv:2104.13020v2 [stat.ME] UPDATED)
    (2 min) We present a method for assessing the sensitivity of the true causal effect to unmeasured confounding. The method requires the analyst to specify two intuitive parameters. Otherwise, the method is assumption-free. The method returns an interval that contains the true causal effect. Moreover, the bounds of the interval are sharp, i.e. attainable. We show experimentally that our bounds can be sharper than those obtained by the method of Ding and VanderWeele (2016). Finally, we extend our method to bound the natural direct and indirect effects when there are measured mediators and unmeasured exposure-outcome confounding.
    Many-to-English Machine Translation Tools, Data, and Pretrained Models. (arXiv:2104.00290v2 [cs.CL] UPDATED)
    (2 min) While there are more than 7000 languages in the world, most translation research efforts have targeted a few high-resource languages. Commercial translation systems support only one hundred languages or fewer, and do not make these models available for transfer to low resource languages. In this work, we present useful tools for machine translation research: MTData, NLCodec, and RTG. We demonstrate their usefulness by creating a multilingual neural machine translation model capable of translating from 500 source languages to English. We make this multilingual model readily downloadable and usable as a service, or as a parent model for transfer-learning to even lower-resource languages.
    A Systems Theory of Transfer Learning. (arXiv:2107.01196v1 [cs.LG])
    (2 min) Existing frameworks for transfer learning are incomplete from a systems theoretic perspective. They place emphasis on notions of domain and task, and neglect notions of structure and behavior. In doing so, they limit the extent to which formalism can be carried through into the elaboration of their frameworks. Herein, we use Mesarovician systems theory to define transfer learning as a relation on sets and subsequently characterize the general nature of transfer learning as a mathematical construct. We interpret existing frameworks in terms of ours and go beyond existing frameworks to define notions of transferability, transfer roughness, and transfer distance. Importantly, despite its formalism, our framework avoids the detailed mathematics of learning theory or machine learning solution methods without excluding their consideration. As such, we provide a formal, general systems framework for modeling transfer learning that offers a rigorous foundation for system design and analysis.
    Generative Max-Mahalanobis Classifiers for Image Classification, Generation and More. (arXiv:2101.00122v4 [cs.CV] UPDATED)
    (2 min) Joint Energy-based Model (JEM) of Grathwohl et al. shows that a standard softmax classifier can be reinterpreted as an energy-based model (EBM) for the joint distribution p(x,y); the resulting model can be optimized to improve calibration, robustness, and out-of-distribution detection, while generating samples rivaling the quality of recent GAN-based approaches. However, the softmax classifier that JEM exploits is inherently discriminative and its latent feature space is not well formulated as probabilistic distributions, which may hinder its potential for image generation and incur training instability. We hypothesize that generative classifiers, such as Linear Discriminant Analysis (LDA), might be more suitable for image generation since generative classifiers model the data generation process explicitly. This paper therefore investigates an LDA classifier for image classification and generation. In particular, the Max-Mahalanobis Classifier (MMC), a special case of LDA, fits our goal very well. We show that our Generative MMC (GMMC) can be trained discriminatively, generatively, or jointly for image classification and generation. Extensive experiments on multiple datasets show that GMMC achieves state-of-the-art discriminative and generative performances, while outperforming JEM in calibration, adversarial robustness, and out-of-distribution detection by a significant margin. Our source code is available at https://github.com/sndnyang/GMMC.
    Kernel Thinning. (arXiv:2105.05842v3 [stat.ML] UPDATED)
    (2 min) We introduce kernel thinning, a new procedure for compressing a distribution $\mathbb{P}$ more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel $\mathbf{k}$ and $\mathcal{O}(n^2)$ time, kernel thinning compresses an $n$-point approximation to $\mathbb{P}$ into a $\sqrt{n}$-point approximation with comparable worst-case integration error in the associated reproducing kernel Hilbert space. With high probability, the maximum discrepancy in integration error is $\mathcal{O}_d(n^{-\frac{1}{2}}\sqrt{\log n})$ for compactly supported $\mathbb{P}$ and $\mathcal{O}_d(n^{-\frac{1}{2}} \sqrt{(\log n)^{d+1}\log\log n})$ for sub-exponential $\mathbb{P}$ on $\mathbb{R}^d$. In contrast, an equal-sized i.i.d. sample from $\mathbb{P}$ suffers $\Omega(n^{-\frac14})$ integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform $\mathbb{P}$ on $[0,1]^d$ but apply to general distributions on $\mathbb{R}^d$ and a wide range of common kernels. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Mat\'ern, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning.
    Towards Real-World BCI: CCSPNet, A Compact Subject-Independent Motor Imagery Framework. (arXiv:2012.13567v4 [cs.LG] UPDATED)
    (3 min) A conventional subject-dependent (SD) brain-computer interface (BCI) requires a complete data-gathering, training, and calibration phase for each user before it can be used. In recent years, a number of subject-independent (SI) BCIs have been developed. However, there are many problems preventing them from being used in real-world BCI applications. A weaker performance compared to the subject-dependent (SD) approach, and a relatively large model requiring high computational power are the most important ones. Therefore, a potential real-world BCI would greatly benefit from a compact low-power subject-independent BCI framework, ready to be used immediately after the user puts it on. To move towards this goal, we propose a novel subject-independent BCI framework named CCSPNet (Convolutional Common Spatial Pattern Network) trained on the motor imagery (MI) paradigm of a large-scale electroencephalography (EEG) signals database consisting of 21600 trials for 54 subjects performing two-class hand-movement MI tasks. The proposed framework applies a wavelet kernel convolutional neural network (WKCNN) and a temporal convolutional neural network (TCNN) in order to represent and extract the diverse spectral features of EEG signals. The outputs of the convolutional layers go through a common spatial pattern (CSP) algorithm for spatial feature extraction. The number of CSP features is reduced by a dense neural network, and the final class label is determined by a linear discriminative analysis (LDA) classifier. The CCSPNet framework evaluation results show that it is possible to have a low-power compact BCI that achieves both SD and SI performance comparable to complex and computationally expensive.
    Learning to Optimize: A Primer and A Benchmark. (arXiv:2103.12828v2 [math.OC] UPDATED)
    (2 min) Learning to optimize (L2O) is an emerging approach that leverages machine learning to develop optimization methods, aiming at reducing the laborious iterations of hand engineering. It automates the design of an optimization method based on its performance on a set of training problems. This data-driven procedure generates methods that can efficiently solve problems similar to those in the training. In sharp contrast, the typical and traditional designs of optimization methods are theory-driven, so they obtain performance guarantees over the classes of problems specified by the theory. The difference makes L2O suitable for repeatedly solving a certain type of optimization problems over a specific distribution of data, while it typically fails on out-of-distribution problems. The practicality of L2O depends on the type of target optimization, the chosen architecture of the method to learn, and the training procedure. This new paradigm has motivated a community of researchers to explore L2O and report their findings. This article is poised to be the first comprehensive survey and benchmark of L2O for continuous optimization. We set up taxonomies, categorize existing works and research directions, present insights, and identify open challenges. We also benchmarked many existing L2O approaches on a few but representative optimization problems. For reproducible research and fair benchmarking purposes, we released our software implementation and data in the package Open-L2O at https://github.com/VITA-Group/Open-L2O.
    Multimodal Representation for Neural Code Search. (arXiv:2107.00992v1 [cs.SE])
    (2 min) Semantic code search is about finding semantically relevant code snippets for a given natural language query. In the state-of-the-art approaches, the semantic similarity between code and query is quantified as the distance of their representation in the shared vector space. In this paper, to improve the vector space, we introduce tree-serialization methods on a simplified form of AST and build the multimodal representation for the code data. We conduct extensive experiments using a single corpus that is large-scale and multi-language: CodeSearchNet. Our results show that both our tree-serialized representations and multimodal learning model improve the performance of neural code search. Last, we define two intuitive quantification metrics oriented to the completeness of semantic and syntactic information of the code data.
    On the Oracle Complexity of Higher-Order Smooth Non-Convex Finite-Sum Optimization. (arXiv:2103.05138v2 [math.OC] UPDATED)
    (2 min) We prove lower bounds for higher-order methods in smooth non-convex finite-sum optimization. Our contribution is threefold: We first show that a deterministic algorithm cannot profit from the finite-sum structure of the objective, and that simulating a pth-order regularized method on the whole function by constructing exact gradient information is optimal up to constant factors. We further show lower bounds for randomized algorithms and compare them with the best known upper bounds. To address some gaps between the bounds, we propose a new second-order smoothness assumption that can be seen as an analogue of the first-order mean-squared smoothness assumption. We prove that it is sufficient to ensure state-of-the-art convergence guarantees, while allowing for a sharper lower bound.
    Parasitic Egg Detection and Classification in Low-cost Microscopic Images using Transfer Learning. (arXiv:2107.00968v1 [cs.CV])
    (2 min) Intestinal parasitic infection leads to several morbidities to humans worldwide, especially in tropical countries. The traditional diagnosis usually relies on manual analysis from microscopic images which is prone to human error due to morphological similarity of different parasitic eggs and abundance of impurities in a sample. Many studies have developed automatic systems for parasite egg detection to reduce human workload. However, they work with high quality microscopes, which unfortunately remain unaffordable in some rural areas. Our work thus exploits a benefit of a low-cost USB microscope. This instrument however provides poor quality of images due to limitation of magnification (10x), causing difficulty in parasite detection and species classification. In this paper, we propose a CNN-based technique using transfer learning strategy to enhance the efficiency of automatic parasite classification in poor-quality microscopic images. The patch-based technique with sliding window is employed to search for location of the eggs. Two networks, AlexNet and ResNet50, are examined with a trade-off between architecture size and classification performance. The results show that our proposed framework outperforms the state-of-the-art object recognition methods. Our system combined with final decision from an expert may improve the real faecal examination with low-cost microscopes.
    Quantum machine learning with adaptive linear optics. (arXiv:2102.04579v2 [quant-ph] UPDATED)
    (2 min) We study supervised learning algorithms in which a quantum device is used to perform a computational subroutine - either for prediction via probability estimation, or to compute a kernel via estimation of quantum states overlap. We design implementations of these quantum subroutines using Boson Sampling architectures in linear optics, supplemented by adaptive measurements. We then challenge these quantum algorithms by deriving classical simulation algorithms for the tasks of output probability estimation and overlap estimation. We obtain different classical simulability regimes for these two computational tasks in terms of the number of adaptive measurements and input photons. In both cases, our results set explicit limits to the range of parameters for which a quantum advantage can be envisaged with adaptive linear optics compared to classical machine learning algorithms: we show that the number of input photons and the number of adaptive measurements cannot be simultaneously small compared to the number of modes. Interestingly, our analysis leaves open the possibility of a near-term quantum advantage with a single adaptive measurement.
    Beyond Low-Pass Filters: Adaptive Feature Propagation on Graphs. (arXiv:2103.14187v4 [cs.LG] UPDATED)
    (2 min) Graph neural networks (GNNs) have been extensively studied for prediction tasks on graphs. As pointed out by recent studies, most GNNs assume local homophily, i.e., strong similarities in local neighborhoods. This assumption however limits the generalizability power of GNNs. To address this limitation, we propose a flexible GNN model, which is capable of handling any graphs without being restricted by their underlying homophily. At its core, this model adopts a node attention mechanism based on multiple learnable spectral filters; therefore, the aggregation scheme is learned adaptively for each graph in the spectral domain. We evaluated the proposed model on node classification tasks over eight benchmark datasets. The proposed model is shown to generalize well to both homophilic and heterophilic graphs. Further, it outperforms all state-of-the-art baselines on heterophilic graphs and performs comparably with them on homophilic graphs.
    Active Fire Detection in Landsat-8 Imagery: a Large-Scale Dataset and a Deep-Learning Study. (arXiv:2101.03409v2 [cs.CV] UPDATED)
    (2 min) Active fire detection in satellite imagery is of critical importance to the management of environmental conservation policies, supporting decision-making and law enforcement. This is a well established field, with many techniques being proposed over the years, usually based on pixel or region-level comparisons involving sensor-specific thresholds and neighborhood statistics. In this paper, we address the problem of active fire detection using deep learning techniques. In recent years, deep learning techniques have been enjoying an enormous success in many fields, but their use for active fire detection is relatively new, with open questions and demand for datasets and architectures for evaluation. This paper addresses these issues by introducing a new large-scale dataset for active fire detection, with over 150,000 image patches (more than 200 GB of data) extracted from Landsat-8 images captured around the world in August and September 2020, containing wildfires in several locations. The dataset was split in two parts, and contains 10-band spectral images with associated outputs, produced by three well known handcrafted algorithms for active fire detection in the first part, and manually annotated masks in the second part. We also present a study on how different convolutional neural network architectures can be used to approximate these handcrafted algorithms, and how models trained on automatically segmented patches can be combined to achieve better performance than the original algorithms - with the best combination having 87.2% precision and 92.4% recall on our manually annotated dataset. The proposed dataset, source codes and trained models are available on Github (https://github.com/pereira-gha/activefire), creating opportunities for further advances in the field
    Decision tree heuristics can fail, even in the smoothed setting. (arXiv:2107.00819v1 [cs.LG])
    (2 min) Greedy decision tree learning heuristics are mainstays of machine learning practice, but theoretical justification for their empirical success remains elusive. In fact, it has long been known that there are simple target functions for which they fail badly (Kearns and Mansour, STOC 1996). Recent work of Brutzkus, Daniely, and Malach (COLT 2020) considered the smoothed analysis model as a possible avenue towards resolving this disconnect. Within the smoothed setting and for targets $f$ that are $k$-juntas, they showed that these heuristics successfully learn $f$ with depth-$k$ decision tree hypotheses. They conjectured that the same guarantee holds more generally for targets that are depth-$k$ decision trees. We provide a counterexample to this conjecture: we construct targets that are depth-$k$ decision trees and show that even in the smoothed setting, these heuristics build trees of depth $2^{\Omega(k)}$ before achieving high accuracy. We also show that the guarantees of Brutzkus et al. cannot extend to the agnostic setting: there are targets that are very close to $k$-juntas, for which these heuristics build trees of depth $2^{\Omega(k)}$ before achieving high accuracy.
    Data Dependent Randomized Smoothing. (arXiv:2012.04351v2 [cs.LG] UPDATED)
    (2 min) Randomized smoothing is a recent technique that achieves state-of-art performance in training certifiably robust deep neural networks. While the smoothing family of distributions is often connected to the choice of the norm used for certification, the parameters of these distributions are always set as global hyper parameters independent of the input data on which a network is certified. In this work, we revisit Gaussian randomized smoothing and show that the variance of the Gaussian distribution can be optimized at each input so as to maximize the certification radius for the construction of the smoothed classifier. This new approach is generic, parameter-free, and easy to implement. In fact, we show that our data dependent framework can be seamlessly incorporated into 3 randomized smoothing approaches, leading to consistent improved certified accuracy. When this framework is used in the training routine of these approaches followed by a data dependent certification, we achieve 9\% and 6\% improvement over the certified accuracy of the strongest baseline for a radius of 0.5 on CIFAR10 and ImageNet.
    Momentum Accelerates the Convergence of Stochastic AUPRC Maximization. (arXiv:2107.01173v1 [cs.LG])
    (2 min) In this paper, we study stochastic optimization of areas under precision-recall curves (AUPRC), which is widely used for combating imbalanced classification tasks. Although a few methods have been proposed for maximizing AUPRC, stochastic optimization of AUPRC with convergence guarantee remains an undeveloped territory. A recent work [42] has proposed a promising approach towards AUPRC based on maximizing a surrogate loss for the average precision, and proved an $O(1/\epsilon^5)$ complexity for finding an $\epsilon$-stationary solution of the non-convex objective. In this paper, we further improve the stochastic optimization of AURPC by (i) developing novel stochastic momentum methods with a better iteration complexity of $O(1/\epsilon^4)$ for finding an $\epsilon$-stationary solution; and (ii) designing a novel family of stochastic adaptive methods with the same iteration complexity of $O(1/\epsilon^4)$, which enjoy faster convergence in practice. To this end, we propose two innovative techniques that are critical for improving the convergence: (i) the biased estimators for tracking individual ranking scores are updated in a randomized coordinate-wise manner; and (ii) a momentum update is used on top of the stochastic gradient estimator for tracking the gradient of the objective. Extensive experiments on various data sets demonstrate the effectiveness of the proposed algorithms. Of independent interest, the proposed stochastic momentum and adaptive algorithms are also applicable to a class of two-level stochastic dependent compositional optimization problems.
    SE(3)-Equivariant Graph Neural Networks for Data-Efficient and Accurate Interatomic Potentials. (arXiv:2101.03164v2 [physics.comp-ph] UPDATED)
    (2 min) This work presents Neural Equivariant Interatomic Potentials (NequIP), a SE(3)-equivariant neural network approach for learning interatomic potentials from ab-initio calculations for molecular dynamics simulations. While most contemporary symmetry-aware models use invariant convolutions and only act on scalars, NequIP employs SE(3)-equivariant convolutions for interactions of geometric tensors, resulting in a more information-rich and faithful representation of atomic environments. The method achieves state-of-the-art accuracy on a challenging set of diverse molecules and materials while exhibiting remarkable data efficiency. NequIP outperforms existing models with up to three orders of magnitude fewer training data, challenging the widely held belief that deep neural networks require massive training sets. The high data efficiency of the method allows for the construction of accurate potentials using high-order quantum chemical level of theory as reference and enables high-fidelity molecular dynamics simulations over long time scales.
    An Investigation of the (In)effectiveness of Counterfactually Augmented Data. (arXiv:2107.00753v1 [cs.CL])
    (2 min) While pretrained language models achieve excellent performance on natural language understanding benchmarks, they tend to rely on spurious correlations and generalize poorly to out-of-distribution (OOD) data. Recent work has explored using counterfactually-augmented data (CAD) -- data generated by minimally perturbing examples to flip the ground-truth label -- to identify robust features that are invariant under distribution shift. However, empirical results using CAD for OOD generalization have been mixed. To explain this discrepancy, we draw insights from a linear Gaussian model and demonstrate the pitfalls of CAD. Specifically, we show that (a) while CAD is effective at identifying robust features, it may prevent the model from learning unperturbed robust features, and (b) CAD may exacerbate existing spurious correlations in the data. Our results show that the lack of perturbation diversity in current CAD datasets limits its effectiveness on OOD generalization, calling for innovative crowdsourcing procedures to elicit diverse perturbation of examples.
    Unintended Effects on Adaptive Learning Rate for Training Neural Network with Output Scale Change. (arXiv:2103.03466v2 [cs.LG] UPDATED)
    (2 min) A multiplicative constant scaling factor is often applied to the model output to adjust the dynamics of neural network parameters. This has been used as one of the key interventions in an empirical study of lazy and active behavior. However, we show that the combination of such scaling and a commonly used adaptive learning rate optimizer strongly affects the training behavior of the neural network. This is problematic as it can cause \emph{unintended behavior} of neural networks, resulting in the misinterpretation of experimental results. Specifically, for some scaling settings, the effect of the adaptive learning rate disappears or is strongly influenced by the scaling factor. To avoid the unintended effect, we present a modification of an optimization algorithm and demonstrate remarkable differences between adaptive learning rate optimization and simple gradient descent, especially with a small ($<1.0$) scaling factor.
    Segmented Federated Learning for Adaptive Intrusion Detection System. (arXiv:2107.00881v1 [cs.CR])
    (2 min) Cyberattacks are a major issues and it causes organizations great financial, and reputation harm. However, due to various factors, the current network intrusion detection systems (NIDS) seem to be insufficent. Predominant NIDS identifies Cyberattacks through a handcrafted dataset of rules. Although the recent applications of machine learning and deep learning have alleviated the enormous effort in NIDS, the security of network data has always been a prime concern. However, to encounter the security problem and enable sharing among organizations, Federated Learning (FL) scheme is employed. Although the current FL systems have been successful, a network's data distribution does not always fit into a single global model as in FL. Thus, in such cases, having a single global model in FL is no feasible. In this paper, we propose a Segmented-Federated Learning (Segmented-FL) learning scheme for a more efficient NIDS. The Segmented-FL approach employs periodic local model evaluation based on which the segmentation occurs. We aim to bring similar network environments to the same group. Further, the Segmented-FL system is coupled with a weighted aggregation of local model parameters based on the number of data samples a worker possesses to further augment the performance. The improved performance by our system as compared to the FL and centralized systems on standard dataset further validates our system and makes a strong case for extending our technique across various tasks. The solution finds its application in organizations that want to collaboratively learn on diverse network environments and protect the privacy of individual datasets.
    Neural networks for Anatomical Therapeutic Chemical (ATC). (arXiv:2101.11713v2 [q-bio.QM] UPDATED)
    (2 min) Motivation: Automatic Anatomical Therapeutic Chemical (ATC) classification is a critical and highly competitive area of research in bioinformatics because of its potential for expediting drug develop-ment and research. Predicting an unknown compound's therapeutic and chemical characteristics ac-cording to how these characteristics affect multiple organs/systems makes automatic ATC classifica-tion a challenging multi-label problem. Results: In this work, we propose combining multiple multi-label classifiers trained on distinct sets of features, including sets extracted from a Bidirectional Long Short-Term Memory Network (BiLSTM). Experiments demonstrate the power of this approach, which is shown to outperform the best methods reported in the literature, including the state-of-the-art developed by the fast.ai research group. Availability: All source code developed for this study is available at https://github.com/LorisNanni. Contact: loris.nanni@unipd.it
    PointGuard: Provably Robust 3D Point Cloud Classification. (arXiv:2103.03046v2 [cs.CR] UPDATED)
    (2 min) 3D point cloud classification has many safety-critical applications such as autonomous driving and robotic grasping. However, several studies showed that it is vulnerable to adversarial attacks. In particular, an attacker can make a classifier predict an incorrect label for a 3D point cloud via carefully modifying, adding, and/or deleting a small number of its points. Randomized smoothing is state-of-the-art technique to build certifiably robust 2D image classifiers. However, when applied to 3D point cloud classification, randomized smoothing can only certify robustness against adversarially modified points. In this work, we propose PointGuard, the first defense that has provable robustness guarantees against adversarially modified, added, and/or deleted points. Specifically, given a 3D point cloud and an arbitrary point cloud classifier, our PointGuard first creates multiple subsampled point clouds, each of which contains a random subset of the points in the original point cloud; then our PointGuard predicts the label of the original point cloud as the majority vote among the labels of the subsampled point clouds predicted by the point cloud classifier. Our first major theoretical contribution is that we show PointGuard provably predicts the same label for a 3D point cloud when the number of adversarially modified, added, and/or deleted points is bounded. Our second major theoretical contribution is that we prove the tightness of our derived bound when no assumptions on the point cloud classifier are made. Moreover, we design an efficient algorithm to compute our certified robustness guarantees. We also empirically evaluate PointGuard on ModelNet40 and ScanNet benchmark datasets.
    Weather-based forecasting of energy generation, consumption and price for electrical microgrids management. (arXiv:2107.01034v1 [eess.SY])
    (2 min) The Intergovernmental Panel on Climate Change proposes different mitigation strategies to achieve the net emissions reductions that would be required to follow a pathway that limits global warming to 1.5{\deg}C with no or limited overshoot. The transition towards a carbon-free society goes through an inevitable increase of the share of renewable generation in the energy mix and a drastic decrease in terms of the total consumption of fossil fuels. Therefore, this thesis studies the integration of renewables in power systems by investigating forecasting and decision-making tools. Indeed, in contrast to conventional power plants, renewable energy is subject to uncertainty. Most of the generation technologies based on renewable sources are non-dispatchable, and their production is stochastic and hard to predict in advance. A high share of renewables is a great challenge for power systems that have been designed and sized for dispatchable units. In this context, probabilistic forecasts, which aim at modeling the distribution of all possible future realizations, have become an important tool to equip decision-makers, hopefully leading to better decisions in energy applications. This thesis focus on two main research questions: (1) How to produce reliable probabilistic forecasts of renewable generation, consumption, and electricity prices? (2) How to make decisions with uncertainty using probabilistic forecasts? The thesis perimeter is the energy management of "small" systems such as microgrids at a residential scale on a day-ahead basis. It is divided into two main parts to propose directions to address both research questions (1) a forecasting part; (2) a planning and control part.
    Systematic Evaluation of Causal Discovery in Visual Model Based Reinforcement Learning. (arXiv:2107.00848v1 [stat.ML])
    (2 min) Inducing causal relationships from observations is a classic problem in machine learning. Most work in causality starts from the premise that the causal variables themselves are observed. However, for AI agents such as robots trying to make sense of their environment, the only observables are low-level variables like pixels in images. To generalize well, an agent must induce high-level variables, particularly those which are causal or are affected by causal variables. A central goal for AI and causality is thus the joint discovery of abstract representations and causal structure. However, we note that existing environments for studying causal induction are poorly suited for this objective because they have complicated task-specific causal graphs which are impossible to manipulate parametrically (e.g., number of nodes, sparsity, causal chain length, etc.). In this work, our goal is to facilitate research in learning representations of high-level variables as well as causal structures among them. In order to systematically probe the ability of methods to identify these variables and structures, we design a suite of benchmarking RL environments. We evaluate various representation learning algorithms from the literature and find that explicitly incorporating structure and modularity in models can help causal induction in model-based reinforcement learning.
    Exploration noise for learning linear-quadratic mean field games. (arXiv:2107.00839v1 [math.OC])
    (2 min) The goal of this paper is to demonstrate that common noise may serve as an exploration noise for learning the solution of a mean field game. This concept is here exemplified through a toy linear-quadratic model, for which a suitable form of common noise has already been proven to restore existence and uniqueness. We here go one step further and prove that the same form of common noise may force the convergence of the learning algorithm called `fictitious play', and this without any further potential or monotone structure. Several numerical examples are provided in order to support our theoretical analysis.
    Mitigating Uncertainty of Classifier for Unsupervised Domain Adaptation. (arXiv:2107.00727v1 [cs.LG])
    (2 min) Understanding unsupervised domain adaptation has been an important task that has been well explored. However, the wide variety of methods have not analyzed the role of a classifier's performance in detail. In this paper, we thoroughly examine the role of a classifier in terms of matching source and target distributions. We specifically investigate the classifier ability by matching a) the distribution of features, b) probabilistic uncertainty for samples and c) certainty activation mappings. Our analysis suggests that using these three distributions does result in a consistently improved performance on all the datasets. Our work thus extends present knowledge on the role of the various distributions obtained from the classifier towards solving unsupervised domain adaptation.
    Inter-Beat Interval Estimation with Tiramisu Model: A Novel Approach with Reduced Error. (arXiv:2107.00693v1 [eess.SP])
    (2 min) Inter-beat interval (IBI) measurement enables estimation of heart-rate variability (HRV) which, in turns, can provide early indication of potential cardiovascular diseases. However, extracting IBIs from noisy signals is challenging since the morphology of the signal is distorted in the presence of the noise. Electrocardiogram (ECG) of a person in heavy motion is highly corrupted with noise, known as motion-artifact, and IBI extracted from it is inaccurate. As a part of remote health monitoring and wearable system development, denoising ECG signals and estimating IBIs correctly from them have become an emerging topic among signal-processing researchers. Apart from conventional methods, deep-learning techniques have been successfully used in signal denoising recently, and diagnosis process has become easier, leading to accuracy levels that were previously unachievable. We propose a deep-learning approach leveraging tiramisu autoencoder model to suppress motion-artifact noise and make the R-peaks of the ECG signal prominent even in the presence of high-intensity motion. After denoising, IBIs are estimated more accurately expediting diagnosis tasks. Results illustrate that our method enables IBI estimation from noisy ECG signals with SNR up to -30dB with average root mean square error (RMSE) of 13 milliseconds for estimated IBIs. At this noise level, our error percentage remains below 8% and outperforms other state of the art techniques.
    Gap-Dependent Bounds for Two-Player Markov Games. (arXiv:2107.00685v1 [cs.LG])
    (2 min) As one of the most popular methods in the field of reinforcement learning, Q-learning has received increasing attention. Recently, there have been more theoretical works on the regret bound of algorithms that belong to the Q-learning class in different settings. In this paper, we analyze the cumulative regret when conducting Nash Q-learning algorithm on 2-player turn-based stochastic Markov games (2-TBSG), and propose the very first gap dependent logarithmic upper bounds in the episodic tabular setting. This bound matches the theoretical lower bound only up to a logarithmic term. Furthermore, we extend the conclusion to the discounted game setting with infinite horizon and propose a similar gap dependent logarithmic regret bound. Also, under the linear MDP assumption, we obtain another logarithmic regret for 2-TBSG, in both centralized and independent settings.
    Generalized Multivariate Signs for Nonparametric Hypothesis Testing in High Dimensions. (arXiv:2107.01103v1 [stat.ME])
    (2 min) High-dimensional data, where the dimension of the feature space is much larger than sample size, arise in a number of statistical applications. In this context, we construct the generalized multivariate sign transformation, defined as a vector divided by its norm. For different choices of the norm function, the resulting transformed vector adapts to certain geometrical features of the data distribution. Building up on this idea, we obtain one-sample and two-sample testing procedures for mean vectors of high-dimensional data using these generalized sign vectors. These tests are based on U-statistics using kernel inner products, do not require prohibitive assumptions, and are amenable to a fast randomization-based implementation. Through experiments in a number of data settings, we show that tests using generalized signs display higher power than existing tests, while maintaining nominal type-I error rates. Finally, we provide example applications on the MNIST and Minnesota Twin Studies genomic data.
    Textual Echo Cancellation. (arXiv:2008.06006v3 [eess.AS] UPDATED)
    (2 min) In this paper, we propose Textual Echo Cancellation (TEC) - a framework for cancelling the text-to-speech (TTS) playback echo from overlapping speech recordings. Such a system can largely improve speech recognition performance and user experience for intelligent devices such as smart speakers, as the user can talk to the device while the device is still playing the TTS signal responding to the previous query. We implement this system by using a novel sequence-to-sequence model with multi-source attention that takes both the microphone mixture signal and source text of the TTS playback as inputs, and predicts the enhanced audio. Experiments show that the textual information of the TTS playback is critical to enhancement performance. Besides, the text sequence is much smaller in size compared with the raw acoustic signal of the TTS playback, and can be immediately transmitted to the device or ASR server even before the playback is synthesized. Therefore, our proposed approach effectively reduces Internet communication and latency compared with alternative approaches such as acoustic echo cancellation (AEC).
    Deep Model Compression Via Two-Stage Deep Reinforcement Learning. (arXiv:1912.02254v2 [cs.LG] UPDATED)
    (2 min) Besides accuracy, the model size of convolutional neural networks (CNN) models is another important factor considering limited hardware resources in practical applications. For example, employing deep neural networks on mobile systems requires the design of accurate yet fast CNN for low latency in classification and object detection. To fulfill the need, we aim at obtaining CNN models with both high testing accuracy and small size to address resource constraints in many embedded devices. In particular, this paper focuses on proposing a generic reinforcement learning-based model compression approach in a two-stage compression pipeline: pruning and quantization. The first stage of compression, i.e., pruning, is achieved via exploiting deep reinforcement learning (DRL) to co-learn the accuracy and the FLOPs updated after layer-wise channel pruning and element-wise variational pruning via information dropout. The second stage, i.e., quantization, is achieved via a similar DRL approach but focuses on obtaining the optimal bits representation for individual layers. We further conduct experimental results on CIFAR-10 and ImageNet datasets. For the CIFAR-10 dataset, the proposed method can reduce the size of VGGNet by 9x from 20.04MB to 2.2MB with a slight accuracy increase. For the ImageNet dataset, the proposed method can reduce the size of VGG-16 by 33x from 138MB to 4.14MB with no accuracy loss.
    Bayesian Hyperparameter Optimization with BoTorch, GPyTorch and Ax. (arXiv:1912.05686v2 [cs.LG] UPDATED)
    (2 min) Deep learning models are full of hyperparameters, which are set manually before the learning process can start. To find the best configuration for these hyperparameters in such a high dimensional space, with time-consuming and expensive model training / validation, is not a trivial challenge. Bayesian optimization is a powerful tool for the joint optimization of hyperparameters, efficiently trading off exploration and exploitation of the hyperparameter space. In this paper, we discuss Bayesian hyperparameter optimization, including hyperparameter optimization, Bayesian optimization, and Gaussian processes. We also review BoTorch, GPyTorch and Ax, the new open-source frameworks that we use for Bayesian optimization, Gaussian process inference and adaptive experimentation, respectively. For experimentation, we apply Bayesian hyperparameter optimization, for optimizing group weights, to weighted group pooling, which couples unsupervised tiered graph autoencoders learning and supervised graph prediction learning for molecular graphs. We find that Ax, BoTorch and GPyTorch together provide a simple-to-use but powerful framework for Bayesian hyperparameter optimization, using Ax's high-level API that constructs and runs a full optimization loop and returns the best hyperparameter configuration.
    Mirrorless Mirror Descent: A Natural Derivation of Mirror Descent. (arXiv:2004.01025v3 [cs.LG] UPDATED)
    (2 min) We present a primal only derivation of Mirror Descent as a "partial" discretization of gradient flow on a Riemannian manifold where the metric tensor is the Hessian of the Mirror Descent potential. We contrast this discretization to Natural Gradient Descent, which is obtained by a "full" forward Euler discretization. This view helps shed light on the relationship between the methods and allows generalizing Mirror Descent to general Riemannian geometries, even when the metric tensor is {\em not} a Hessian, and thus there is no "dual."
    SHADOWCAST: Controllable Graph Generation. (arXiv:2006.03774v4 [cs.LG] UPDATED)
    (2 min) We introduce the controllable graph generation problem, formulated as controlling graph attributes during the generative process to produce desired graphs with understandable structures. Using a transparent and straightforward Markov model to guide this generative process, practitioners can shape and understand the generated graphs. We propose ${\rm S{\small HADOW}C{\small AST}}$, a generative model capable of controlling graph generation while retaining the original graph's intrinsic properties. The proposed model is based on a conditional generative adversarial network. Given an observed graph and some user-specified Markov model parameters, ${\rm S{\small HADOW}C{\small AST}}$ controls the conditions to generate desired graphs. Comprehensive experiments on three real-world network datasets demonstrate our model's competitive performance in the graph generation task. Furthermore, we show its effective controllability by directing ${\rm S{\small HADOW}C{\small AST}}$ to generate hypothetical scenarios with different graph structures.
    An Empirical Survey of Data Augmentation for Time Series Classification with Neural Networks. (arXiv:2007.15951v4 [cs.LG] UPDATED)
    (2 min) In recent times, deep artificial neural networks have achieved many successes in pattern recognition. Part of this success can be attributed to the reliance on big data to increase generalization. However, in the field of time series recognition, many datasets are often very small. One method of addressing this problem is through the use of data augmentation. In this paper, we survey data augmentation techniques for time series and their application to time series classification with neural networks. We propose a taxonomy and outline the four families in time series data augmentation, including transformation-based methods, pattern mixing, generative models, and decomposition methods. Furthermore, we empirically evaluate 12 time series data augmentation methods on 128 time series classification datasets with six different types of neural networks. Through the results, we are able to analyze the characteristics, advantages and disadvantages, and recommendations of each data augmentation method. This survey aims to help in the selection of time series data augmentation for neural network applications.
    Combinatorial Optimization with Physics-Inspired Graph Neural Networks. (arXiv:2107.01188v1 [cs.LG])
    (2 min) We demonstrate how graph neural networks can be used to solve combinatorial optimization problems. Our approach is broadly applicable to canonical NP-hard problems in the form of quadratic unconstrained binary optimization problems, such as maximum cut, minimum vertex cover, maximum independent set, as well as Ising spin glasses and higher-order generalizations thereof in the form of polynomial unconstrained binary optimization problems. We apply a relaxation strategy to the problem Hamiltonian to generate a differentiable loss function with which we train the graph neural network and apply a simple projection to integer variables once the unsupervised training process has completed. We showcase our approach with numerical results for the canonical maximum cut and maximum independent set problems. We find that the graph neural network optimizer performs on par or outperforms existing solvers, with the ability to scale beyond the state of the art to problems with millions of variables.
    Artificial Neural Network for Cybersecurity: A Comprehensive Review. (arXiv:2107.01185v1 [cs.CR])
    (2 min) Cybersecurity is a very emerging field that protects systems, networks, and data from digital attacks. With the increase in the scale of the Internet and the evolution of cyber attacks, developing novel cybersecurity tools has become important, particularly for Internet of things (IoT) networks. This paper provides a systematic review of the application of deep learning (DL) approaches for cybersecurity. This paper provides a short description of DL methods which is used in cybersecurity, including deep belief networks, generative adversarial networks, recurrent neural networks, and others. Next, we illustrate the differences between shallow learning and DL. Moreover, a discussion is provided on the currently prevailing cyber-attacks in IoT and other networks, and the effectiveness of DL methods to manage these attacks. Besides, this paper describes studies that highlight the DL technique, cybersecurity applications, and the source of datasets. Next, a discussion is provided on the feasibility of DL systems for malware detection and classification, intrusion detection, and other frequent cyber-attacks, including identifying file type, spam, and network traffic. Our review indicates that high classification accuracy of 99.72% is obtained by restricted Boltzmann machine (RBM) when applied to a custom dataset, while long short-term memory (LSTM) achieves an accuracy of 99.80% for KDD Cup 99 dataset. Finally, this article discusses the importance of cybersecurity for reliable and practicable IoT-driven healthcare systems.
    Gradient-based training of Gaussian Mixture Models for High-Dimensional Streaming Data. (arXiv:1912.09379v3 [cs.LG] UPDATED)
    (2 min) We present an approach for efficiently training Gaussian Mixture Model (GMM) by Stochastic Gradient Descent (SGD) with non-stationary, high-dimensional streaming data. Our training scheme does not require data-driven parameter initialization (e.g., k-means) and can thus be trained based on a random initialization. Furthermore, the approach allows mini-batch sizes as low as 1, which are typical for streaming-data settings. Major problems in such settings are undesirable local optima during early training phases and numerical instabilities due to high data dimensionalities. We introduce an adaptive annealing procedure to address the first problem, whereas numerical instabilities are eliminated by using an exponential-free approximation to the standard GMM log-likelihood. Experiments on a variety of visual and non-visual benchmarks show that our SGD approach can be trained completely without, for instance, k-means based centroid initialization. It also compares favorably to an online variant of Expectation-Maximization (EM) - stochastic EM (sEM), which it outperforms by a large margin for very high-dimensional data.
    Language Identification of Hindi-English tweets using code-mixed BERT. (arXiv:2107.01202v1 [cs.CL])
    (2 min) Language identification of social media text has been an interesting problem of study in recent years. Social media messages are predominantly in code mixed in non-English speaking states. Prior knowledge by pre-training contextual embeddings have shown state of the art results for a range of downstream tasks. Recently, models such as BERT have shown that using a large amount of unlabeled data, the pretrained language models are even more beneficial for learning common language representations. Extensive experiments exploiting transfer learning and fine-tuning BERT models to identify language on Twitter are presented in this paper. The work utilizes a data collection of Hindi-English-Urdu codemixed text for language pre-training and Hindi-English codemixed for subsequent word-level language classification. The results show that the representations pre-trained over codemixed data produce better results by their monolingual counterpart.
    Simpler, Faster, Stronger: Breaking The log-K Curse On Contrastive Learners With FlatNCE. (arXiv:2107.01152v1 [stat.ML])
    (2 min) InfoNCE-based contrastive representation learners, such as SimCLR, have been tremendously successful in recent years. However, these contrastive schemes are notoriously resource demanding, as their effectiveness breaks down with small-batch training (i.e., the log-K curse, whereas K is the batch-size). In this work, we reveal mathematically why contrastive learners fail in the small-batch-size regime, and present a novel simple, non-trivial contrastive objective named FlatNCE, which fixes this issue. Unlike InfoNCE, our FlatNCE no longer explicitly appeals to a discriminative classification goal for contrastive learning. Theoretically, we show FlatNCE is the mathematical dual formulation of InfoNCE, thus bridging the classical literature on energy modeling; and empirically, we demonstrate that, with minimal modification of code, FlatNCE enables immediate performance boost independent of the subject-matter engineering efforts. The significance of this work is furthered by the powerful generalization of contrastive learning techniques, and the introduction of new tools to monitor and diagnose contrastive training. We substantiate our claims with empirical evidence on CIFAR10, ImageNet, and other datasets, where FlatNCE consistently outperforms InfoNCE.
    Unveiling the structure of wide flat minima in neural networks. (arXiv:2107.01163v1 [cond-mat.dis-nn])
    (2 min) The success of deep learning has revealed the application potential of neural networks across the sciences and opened up fundamental theoretical problems. In particular, the fact that learning algorithms based on simple variants of gradient methods are able to find near-optimal minima of highly nonconvex loss functions is an unexpected feature of neural networks which needs to be understood in depth. Such algorithms are able to fit the data almost perfectly, even in the presence of noise, and yet they have excellent predictive capabilities. Several empirical results have shown a reproducible correlation between the so-called flatness of the minima achieved by the algorithms and the generalization performance. At the same time, statistical physics results have shown that in nonconvex networks a multitude of narrow minima may coexist with a much smaller number of wide flat minima, which generalize well. Here we show that wide flat minima arise from the coalescence of minima that correspond to high-margin classifications. Despite being exponentially rare compared to zero-margin solutions, high-margin minima tend to concentrate in particular regions. These minima are in turn surrounded by other solutions of smaller and smaller margin, leading to dense regions of solutions over long distances. Our analysis also provides an alternative analytical method for estimating when flat minima appear and when algorithms begin to find solutions, as the number of model parameters varies.
    Empirically Measuring Transfer Distance for System Design and Operation. (arXiv:2107.01184v1 [cs.LG])
    (2 min) Classical machine learning approaches are sensitive to non-stationarity. Transfer learning can address non-stationarity by sharing knowledge from one system to another, however, in areas like machine prognostics and defense, data is fundamentally limited. Therefore, transfer learning algorithms have little, if any, examples from which to learn. Herein, we suggest that these constraints on algorithmic learning can be addressed by systems engineering. We formally define transfer distance in general terms and demonstrate its use in empirically quantifying the transferability of models. We consider the use of transfer distance in the design of machine rebuild procedures to allow for transferable prognostic models. We also consider the use of transfer distance in predicting operational performance in computer vision. Practitioners can use the presented methodology to design and operate systems with consideration for the learning theoretic challenges faced by component learning systems.
    Enabling Machine Learning-Ready HPC Ensembles with Merlin. (arXiv:1912.02892v2 [cs.DC] UPDATED)
    (2 min) With the growing complexity of computational and experimental facilities, many scientific researchers are turning to machine learning (ML) techniques to analyze large scale ensemble data. With complexities such as multi-component workflows, heterogeneous machine architectures, parallel file systems, and batch scheduling, care must be taken to facilitate this analysis in a high performance computing (HPC) environment. In this paper, we present Merlin, a workflow framework to enable large ML-friendly ensembles of scientific HPC simulations. By augmenting traditional HPC with distributed compute technologies, Merlin aims to lower the barrier for scientific subject matter experts to incorporate ML into their analysis. In addition to its design, we describe some example applications that Merlin has enabled on leadership-class HPC resources, such as the ML-augmented optimization of nuclear fusion experiments and the calibration of infectious disease models to study the progression of and possible mitigation strategies for COVID-19.
    ROOTS: Object-Centric Representation and Rendering of 3D Scenes. (arXiv:2006.06130v3 [cs.LG] UPDATED)
    (2 min) A crucial ability of human intelligence is to build up models of individual 3D objects from partial scene observations. Recent works achieve object-centric generation but without the ability to infer the representation, or achieve 3D scene representation learning but without object-centric compositionality. Therefore, learning to represent and render 3D scenes with object-centric compositionality remains elusive. In this paper, we propose a probabilistic generative model for learning to build modular and compositional 3D object models from partial observations of a multi-object scene. The proposed model can (i) infer the 3D object representations by learning to search and group object areas and also (ii) render from an arbitrary viewpoint not only individual objects but also the full scene by compositing the objects. The entire learning process is unsupervised and end-to-end. In experiments, in addition to generation quality, we also demonstrate that the learned representation permits object-wise manipulation and novel scene generation, and generalizes to various settings. Results can be found on our project website: https://sites.google.com/view/roots3d
    On the Complexity of Symbolic Finite-State Automata. (arXiv:2011.05389v3 [cs.FL] UPDATED)
    (2 min) We revisit the complexity of procedures on SFAs (such as intersection, emptiness, etc.) and analyze them according to the measures we find suitable for symbolic automata: the number of states, the maximal number of transitions exiting a state, and the size of the most complex transition predicate. We pay attention to the special forms of SFAs: {normalized SFAs} and {neat SFAs}, as well as to SFAs over a {monotonic} effective Boolean algebra.
    Ensemble of Loss Functions to Improve Generalizability of Deep Metric Learning methods. (arXiv:2107.01130v1 [cs.CV])
    (2 min) Deep Metric Learning (DML) learns a non-linear semantic embedding from input data that brings similar pairs together while keeps dissimilar data away from each other. To this end, many different methods are proposed in the last decade with promising results in various applications. The success of a DML algorithm greatly depends on its loss function. However, no loss function is perfect, and it deals only with some aspects of an optimal similarity embedding. Besides, the generalizability of the DML on unseen categories during the test stage is an important matter that is not considered by existing loss functions. To address these challenges, we propose novel approaches to combine different losses built on top of a shared deep feature extractor. The proposed ensemble of losses enforces the deep model to extract features that are consistent with all losses. Since the selected losses are diverse and each emphasizes different aspects of an optimal semantic embedding, our effective combining methods yield a considerable improvement over any individual loss and generalize well on unseen categories. Here, there is no limitation in choosing loss functions, and our methods can work with any set of existing ones. Besides, they can optimize each loss function as well as its weight in an end-to-end paradigm with no need to adjust any hyper-parameter. We evaluate our methods on some popular datasets from the machine vision domain in conventional Zero-Shot-Learning (ZSL) settings. The results are very encouraging and show that our methods outperform all baseline losses by a large margin in all datasets.
    Road Roughness Estimation Using Machine Learning. (arXiv:2107.01199v1 [cs.LG])
    (2 min) Road roughness is a very important road condition for the infrastructure, as the roughness affects both the safety and ride comfort of passengers. The roads deteriorate over time which means the road roughness must be continuously monitored in order to have an accurate understand of the condition of the road infrastructure. In this paper, we propose a machine learning pipeline for road roughness prediction using the vertical acceleration of the car and the car speed. We compared well-known supervised machine learning models such as linear regression, naive Bayes, k-nearest neighbor, random forest, support vector machine, and the multi-layer perceptron neural network. The models are trained on an optimally selected set of features computed in the temporal and statistical domain. The results demonstrate that machine learning methods can accurately predict road roughness, using the recordings of the cost approachable in-vehicle sensors installed in conventional passenger cars. Our findings demonstrate that the technology is well suited to meet future pavement condition monitoring, by enabling continuous monitoring of a wide road network.
    How good is your explanation? Algorithmic stability measures to assess the qualityof explanations for deep neural networks. (arXiv:2009.04521v2 [cs.LG] UPDATED)
    (2 min) A plethora of methods have been proposed to explain howdeep neural networks reach a decision but comparativelylittle effort has been made to ensure that the explanationsproduced by these methods are objectively relevant. Whiledesirable properties for a good explanation are easy to come,objective measures have been harder to derive. Here, we pro-pose two new measures to evaluate explanations borrowedfrom the field of algorithmic stability: relative consistencyReCo and mean generalizability MeGe. We conduct severalexperiments on multiple image datasets and network archi-tectures to demonstrate the benefits of the proposed measuresover representative methods. We show that popular fidelitymeasures are not sufficient to guarantee good explanations.Finally, we show empirically that 1-Lipschitz networks pro-vide general and consistent explanations, regardless of theexplanation method used, making them a relevant directionfor explainability.
    Towards closing the gap between the theory and practice of SVRG. (arXiv:1908.02725v2 [math.OC] UPDATED)
    (2 min) Among the very first variance reduced stochastic methods for solving the empirical risk minimization problem was the SVRG method (Johnson & Zhang 2013). SVRG is an inner-outer loop based method, where in the outer loop a reference full gradient is evaluated, after which $m \in \mathbb{N}$ steps of an inner loop are executed where the reference gradient is used to build a variance reduced estimate of the current gradient. The simplicity of the SVRG method and its analysis have led to multiple extensions and variants for even non-convex optimization. We provide a more general analysis of SVRG than had been previously done by using arbitrary sampling, which allows us to analyse virtually all forms of mini-batching through a single theorem. Furthermore, our analysis is focused on more practical variants of SVRG including a new variant of the loopless SVRG (Hofman et al 2015, Kovalev et al 2019, Kulunchakov and Mairal 2019) and a variant of k-SVRG (Raj and Stich 2018) where $m=n$ and where $n$ is the number of data points. Since our setup and analysis reflect what is done in practice, we are able to set the parameters such as the mini-batch size and step size using our theory in such a way that produces a more efficient algorithm in practice, as we show in extensive numerical experiments.
    Structure Learning from Related Data Sets with a Hierarchical Bayesian Score. (arXiv:2008.01683v2 [stat.ML] UPDATED)
    (2 min) Score functions for learning the structure of Bayesian networks in the literature assume that data are a homogeneous set of observations; whereas it is often the case that they comprise different related, but not homogeneous, data sets collected in different ways. In this paper we propose a new Bayesian Dirichlet score, which we call Bayesian Hierarchical Dirichlet (BHD). The proposed score is based on a hierarchical model that pools information across data sets to learn a single encompassing network structure, while taking into account the differences in their probabilistic structures. We derive a closed-form expression for BHD using a variational approximation of the marginal likelihood and we study its performance using simulated data. We find that, when data comprise multiple related data sets, BHD outperforms the Bayesian Dirichlet equivalent uniform (BDeu) score in terms of reconstruction accuracy as measured by the Structural Hamming distance, and that it is as accurate as BDeu when data are homogeneous. Moreover, the estimated networks are sparser and therefore more interpretable than those obtained with BDeu, thanks to a lower number of false positive arcs.
    Gradient-Leakage Resilient Federated Learning. (arXiv:2107.01154v1 [cs.LG])
    (2 min) Federated learning(FL) is an emerging distributed learning paradigm with default client privacy because clients can keep sensitive data on their devices and only share local training parameter updates with the federated server. However, recent studies reveal that gradient leakages in FL may compromise the privacy of client training data. This paper presents a gradient leakage resilient approach to privacy-preserving federated learning with per training example-based client differential privacy, coined as Fed-CDP. It makes three original contributions. First, we identify three types of client gradient leakage threats in federated learning even with encrypted client-server communications. We articulate when and why the conventional server coordinated differential privacy approach, coined as Fed-SDP, is insufficient to protect the privacy of the training data. Second, we introduce Fed-CDP, the per example-based client differential privacy algorithm, and provide a formal analysis of Fed-CDP with the $(\epsilon, \delta)$ differential privacy guarantee, and a formal comparison between Fed-CDP and Fed-SDP in terms of privacy accounting. Third, we formally analyze the privacy-utility trade-off for providing differential privacy guarantee by Fed-CDP and present a dynamic decay noise-injection policy to further improve the accuracy and resiliency of Fed-CDP. We evaluate and compare Fed-CDP and Fed-CDP(decay) with Fed-SDP in terms of differential privacy guarantee and gradient leakage resilience over five benchmark datasets. The results show that the Fed-CDP approach outperforms conventional Fed-SDP in terms of resilience to client gradient leakages while offering competitive accuracy performance in federated learning.
    Predicting Clinical Outcomes in COVID-19 using Radiomics and Deep Learning on Chest Radiographs: A Multi-Institutional Study. (arXiv:2007.08028v2 [q-bio.QM] UPDATED)
    (3 min) We predict mechanical ventilation requirement and mortality using computational modeling of chest radiographs (CXRs) for coronavirus disease 2019 (COVID-19) patients. This two-center, retrospective study analyzed 530 deidentified CXRs from 515 COVID-19 patients treated at Stony Brook University Hospital and Newark Beth Israel Medical Center between March and August 2020. DL and machine learning classifiers to predict mechanical ventilation requirement and mortality were trained and evaluated using patient CXRs. A novel radiomic embedding framework was also explored for outcome prediction. All results are compared against radiologist grading of CXRs (zone-wise expert severity scores). Radiomic and DL classification models had mAUCs of 0.78+/-0.02 and 0.81+/-0.04, compared with expert scores mAUCs of 0.75+/-0.02 and 0.79+/-0.05 for mechanical ventilation requirement and mortality prediction, respectively. Combined classifiers using both radiomics and expert severity scores resulted in mAUCs of 0.79+/-0.04 and 0.83+/-0.04 for each prediction task, demonstrating improvement over either artificial intelligence or radiologist interpretation alone. Our results also suggest instances where inclusion of radiomic features in DL improves model predictions, something that might be explored in other pathologies. The models proposed in this study and the prognostic information they provide might aid physician decision making and resource allocation during the COVID-19 pandemic.
    CHISEL: Compression-Aware High-Accuracy Embedded Indoor Localization with Deep Learning. (arXiv:2107.01192v1 [cs.LG])
    (2 min) GPS technology has revolutionized the way we localize and navigate outdoors. However, the poor reception of GPS signals in buildings makes it unsuitable for indoor localization. WiFi fingerprinting-based indoor localization is one of the most promising ways to meet this demand. Unfortunately, most work in the domain fails to resolve challenges associated with deployability on resource-limited embedded devices. In this work, we propose a compression-aware and high-accuracy deep learning framework called CHISEL that outperforms the best-known works in the area while maintaining localization robustness on embedded devices.
    A Functional Perspective on Learning Symmetric Functions with Neural Networks. (arXiv:2008.06952v3 [cs.LG] UPDATED)
    (2 min) Symmetric functions, which take as input an unordered, fixed-size set, are known to be universally representable by neural networks that enforce permutation invariance. These architectures only give guarantees for fixed input sizes, yet in many practical applications, including point clouds and particle physics, a relevant notion of generalization should include varying the input size. In this work we treat symmetric functions (of any size) as functions over probability measures, and study the learning and representation of neural networks defined on measures. By focusing on shallow architectures, we establish approximation and generalization bounds under different choices of regularization (such as RKHS and variation norms), that capture a hierarchy of functional spaces with increasing degree of non-linear learning. The resulting models can be learned efficiently and enjoy generalization guarantees that extend across input sizes, as we verify empirically.
    Learnable and Instance-Robust Predictions for Online Matching, Flows and Load Balancing. (arXiv:2011.11743v2 [cs.LG] UPDATED)
    (2 min) We propose a new model for augmenting algorithms with predictions by requiring that they are formally learnable and instance robust. Learnability ensures that predictions can be efficiently constructed from a reasonable amount of past data. Instance robustness ensures that the prediction is robust to modest changes in the problem input, where the measure of the change may be problem specific. Instance robustness insists on a smooth degradation in performance as a function of the change. Ideally, the performance is never worse than worst-case bounds. This also allows predictions to be objectively compared. We design online algorithms with predictions for a network flow allocation problem and restricted assignment makespan minimization. For both problems, two key properties are established: high quality predictions can be learned from a small sample of prior instances and these predictions are robust to errors that smoothly degrade as the underlying problem instance changes.
    Quantum Algorithms for Structured Prediction. (arXiv:1809.04091v5 [cs.LG] UPDATED)
    (2 min) We introduce two quantum algorithms for solving structured prediction problems. We first show that a stochastic gradient descent that uses the quantum minimum finding algorithm and takes its probabilistic failure into account solves the structured prediction problem with a runtime that scales with the square root of the size of the label space, and in $\widetilde O\left(1/\epsilon\right)$ with respect to the precision, $\epsilon$, of the solution. Motivated by robust inference techniques in machine learning, we then introduce another quantum algorithm that solves a smooth approximation of the structured prediction problem with a similar quantum speedup in the size of the label space and a similar scaling in the precision parameter. In doing so, we analyze a variant of stochastic gradient descent for convex optimization in the presence of an additive error in the calculation of the gradients, and show that its convergence rate does not deteriorate if the additive errors are of the order $O(\sqrt\epsilon)$. This algorithm uses quantum Gibbs sampling at temperature $\Omega (\epsilon)$ as a subroutine. Based on these theoretical observations, we propose a method for using quantum Gibbs samplers to combine feedforward neural networks with probabilistic graphical models for quantum machine learning. Our numerical results using Monte Carlo simulations on an image tagging task demonstrate the benefit of the approach.
    Structure-aware reinforcement learning for node-overload protection in mobile edge computing. (arXiv:2107.01025v1 [cs.NI])
    (2 min) Mobile Edge Computing (MEC) refers to the concept of placing computational capability and applications at the edge of the network, providing benefits such as reduced latency in handling client requests, reduced network congestion, and improved performance of applications. The performance and reliability of MEC are degraded significantly when one or several edge servers in the cluster are overloaded. Especially when a server crashes due to the overload, it causes service failures in MEC. In this work, an adaptive admission control policy to prevent edge node from getting overloaded is presented. This approach is based on a recently-proposed low complexity RL (Reinforcement Learning) algorithm called SALMUT (Structure-Aware Learning for Multiple Thresholds), which exploits the structure of the optimal admission control policy in multi-class queues for an average-cost setting. We extend the framework to work for node overload-protection problem in a discounted-cost setting. The proposed solution is validated using several scenarios mimicking real-world deployments in two different settings - computer simulations and a docker testbed. Our empirical evaluations show that the total discounted cost incurred by SALMUT is similar to state-of-the-art deep RL algorithms such as PPO (Proximal Policy Optimization) and A2C (Advantage Actor Critic) but requires an order of magnitude less time to train, outputs easily interpretable policy, and can be deployed in an online manner.
    Vox Populi, Vox DIY: Benchmark Dataset for Crowdsourced Audio Transcription. (arXiv:2107.01091v1 [cs.SD])
    (2 min) Domain-specific data is the crux of the successful transfer of machine learning systems from benchmarks to real life. Crowdsourcing has become one of the standard tools for cheap and time-efficient data collection for simple problems such as image classification: thanks in large part to advances in research on aggregation methods. However, the applicability of crowdsourcing to more complex tasks (e.g., speech recognition) remains limited due to the lack of principled aggregation methods for these modalities. The main obstacle towards designing advanced aggregation methods is the absence of training data, and in this work, we focus on bridging this gap in speech recognition. For this, we collect and release CrowdSpeech -- the first publicly available large-scale dataset of crowdsourced audio transcriptions. Evaluation of existing aggregation methods on our data shows room for improvement, suggesting that our work may entail the design of better algorithms. At a higher level, we also contribute to the more general challenge of collecting high-quality datasets using crowdsourcing: we develop a principled pipeline for constructing datasets of crowdsourced audio transcriptions in any novel domain. We show its applicability on an under-resourced language by constructing VoxDIY -- a counterpart of CrowdSpeech for the Russian language. We also release the code that allows a full replication of our data collection pipeline and share various insights on best practices of data collection via crowdsourcing.
    Screening for a Reweighted Penalized Conditional Gradient Method. (arXiv:2107.01106v1 [math.OC])
    (2 min) The conditional gradient method (CGM) is widely used in large-scale sparse convex optimization, having a low per iteration computational cost for structured sparse regularizers and a greedy approach to collecting nonzeros. We explore the sparsity acquiring properties of a general penalized CGM (P-CGM) for convex regularizers and a reweighted penalized CGM (RP-CGM) for nonconvex regularizers, replacing the usual convex constraints with gauge-inspired penalties. This generalization does not increase the per-iteration complexity noticeably. Without assuming bounded iterates or using line search, we show $O(1/t)$ convergence of the gap of each subproblem, which measures distance to a stationary point. We couple this with a screening rule which is safe in the convex case, converging to the true support at a rate $O(1/(\delta^2))$ where $\delta \geq 0$ measures how close the problem is to degeneracy. In the nonconvex case the screening rule converges to the true support in a finite number of iterations, but is not necessarily safe in the intermediate iterates. In our experiments, we verify the consistency of the method and adjust the aggressiveness of the screening rule by tuning the concavity of the regularizer.
    Tight Mutual Information Estimation With Contrastive Fenchel-Legendre Optimization. (arXiv:2107.01131v1 [stat.ML])
    (2 min) Successful applications of InfoNCE and its variants have popularized the use of contrastive variational mutual information (MI) estimators in machine learning. While featuring superior stability, these estimators crucially depend on costly large-batch training, and they sacrifice bound tightness for variance reduction. To overcome these limitations, we revisit the mathematics of popular variational MI bounds from the lens of unnormalized statistical modeling and convex optimization. Our investigation not only yields a new unified theoretical framework encompassing popular variational MI bounds but also leads to a novel, simple, and powerful contrastive MI estimator named as FLO. Theoretically, we show that the FLO estimator is tight, and it provably converges under stochastic gradient descent. Empirically, our FLO estimator overcomes the limitations of its predecessors and learns more efficiently. The utility of FLO is verified using an extensive set of benchmarks, which also reveals the trade-offs in practical MI estimation.
    Interactive Causal Structure Discovery in Earth System Sciences. (arXiv:2107.01126v1 [physics.data-an])
    (2 min) Causal structure discovery (CSD) models are making inroads into several domains, including Earth system sciences. Their widespread adaptation is however hampered by the fact that the resulting models often do not take into account the domain knowledge of the experts and that it is often necessary to modify the resulting models iteratively. We present a workflow that is required to take this knowledge into account and to apply CSD algorithms in Earth system sciences. At the same time, we describe open research questions that still need to be addressed. We present a way to interactively modify the outputs of the CSD algorithms and argue that the user interaction can be modelled as a greedy finding of the local maximum-a-posteriori solution of the likelihood function, which is composed of the likelihood of the causal model and the prior distribution representing the knowledge of the expert user. We use a real-world data set for examples constructed in collaboration with our co-authors, who are the domain area experts. We show that finding maximally usable causal models in the Earth system sciences or other similar domains is a difficult task which contains many interesting open research questions. We argue that taking the domain knowledge into account has a substantial effect on the final causal models discovered.
    Neural Network Layer Algebra: A Framework to Measure Capacity and Compression in Deep Learning. (arXiv:2107.01081v1 [cs.LG])
    (2 min) We present a new framework to measure the intrinsic properties of (deep) neural networks. While we focus on convolutional networks, our framework can be extrapolated to any network architecture. In particular, we evaluate two network properties, namely, capacity (related to expressivity) and compression, both of which depend only on the network structure and are independent of the training and test data. To this end, we propose two metrics: the first one, called layer complexity, captures the architectural complexity of any network layer; and, the second one, called layer intrinsic power, encodes how data is compressed along the network. The metrics are based on the concept of layer algebra, which is also introduced in this paper. This concept is based on the idea that the global properties depend on the network topology, and the leaf nodes of any neural network can be approximated using local transfer functions, thereby, allowing a simple computation of the global metrics. We also compare the properties of the state-of-the art architectures using our metrics and use the properties to analyze the classification accuracy on benchmark datasets.
    Design and implementation of an islanded hybrid microgrid system for a large resort center for Penang Island with the proper application of excess energy. (arXiv:2107.01032v1 [eess.SY])
    (2 min) The energy demand is growing daily at an accelerated pace due to the internationalization and development of civilization. Yet proper economic utilization of additional energy generated by the Islanded Hybrid Microgrid System (IHMS) that was not consumed by the load is a major global challenge. To resolve the above-stated summons, this research focuses on a multi-optimal combination of IHMS for the Penang Hill Resort located on Penang Island, Malaysia, with effective use of redundant energy. To avail this excess energy efficiently, an electrical heater along with a storage tank has been designed concerning diversion load having proper energy management. Furthermore, the system design has adopted the HOMER Pro software for profitable and practical analysis. Alongside, MATLAB Simulink had stabilized the whole system by representing the values of 2068 and 19,072 kW that have been determined as the approximated peak and average load per day for the resort. Moreover, the optimized IHMS is comprehended of Photovoltaic (PV) cells, Diesel Generator, Wind Turbine, Battery, and Converter. Adjacent to this, the optimized system ensued in having a Net Present Cost (NPC) of $21.66 million, Renewable Fraction (RF) of 27.8%, Cost of Energy (COE) of $0.165/kWh, CO2 of 1,735,836 kg/year, and excess energy of 517.29MWh per annum. Since the diesel generator lead system was included in the scheme, a COE of $0.217/kWh, CO2 of 5,124,879 kg/year, and NPC of $23.25 million were attained. The amount of excess energy is effectively utilized with an electrical heater as a diversion load.
    Memory Efficient Meta-Learning with Large Images. (arXiv:2107.01105v1 [stat.ML])
    (2 min) Meta learning approaches to few-shot classification are computationally efficient at test time requiring just a few optimization steps or single forward pass to learn a new task, but they remain highly memory-intensive to train. This limitation arises because a task's entire support set, which can contain up to 1000 images, must be processed before an optimization step can be taken. Harnessing the performance gains offered by large images thus requires either parallelizing the meta-learner across multiple GPUs, which may not be available, or trade-offs between task and image size when memory constraints apply. We improve on both options by proposing LITE, a general and memory efficient episodic training scheme that enables meta-training on large tasks composed of large images on a single GPU. We achieve this by observing that the gradients for a task can be decomposed into a sum of gradients over the task's training images. This enables us to perform a forward pass on a task's entire training set but realize significant memory savings by back-propagating only a random subset of these images which we show is an unbiased approximation of the full gradient. We use LITE to train meta-learners and demonstrate new state-of-the-art accuracy on the real-world ORBIT benchmark and 3 of the 4 parts of the challenging VTAB+MD benchmark relative to leading meta-learners. LITE also enables meta-learners to be competitive with transfer learning approaches but at a fraction of the test-time computational cost, thus serving as a counterpoint to the recent narrative that transfer learning is all you need for few-shot classification.
    Cooperative Training and Latent Space Data Augmentation for Robust Medical Image Segmentation. (arXiv:2107.01079v1 [cs.CV])
    (2 min) Deep learning-based segmentation methods are vulnerable to unforeseen data distribution shifts during deployment, e.g. change of image appearances or contrasts caused by different scanners, unexpected imaging artifacts etc. In this paper, we present a cooperative framework for training image segmentation models and a latent space augmentation method for generating hard examples. Both contributions improve model generalization and robustness with limited data. The cooperative training framework consists of a fast-thinking network (FTN) and a slow-thinking network (STN). The FTN learns decoupled image features and shape features for image reconstruction and segmentation tasks. The STN learns shape priors for segmentation correction and refinement. The two networks are trained in a cooperative manner. The latent space augmentation generates challenging examples for training by masking the decoupled latent space in both channel-wise and spatial-wise manners. We performed extensive experiments on public cardiac imaging datasets. Using only 10 subjects from a single site for training, we demonstrated improved cross-site segmentation performance and increased robustness against various unforeseen imaging artifacts compared to strong baseline methods. Particularly, cooperative training with latent space data augmentation yields 15% improvement in terms of average Dice score when compared to a standard training method.
    WiCluster: Passive Indoor 2D/3D Positioning using WiFi without Precise Labels. (arXiv:2107.01002v1 [cs.NI])
    (2 min) We introduce WiCluster, a new machine learning (ML) approach for passive indoor positioning using radio frequency (RF) channel state information (CSI). WiCluster can predict both a zone-level position and a precise 2D or 3D position, without using any precise position labels during training. Prior CSI-based indoor positioning work has relied on non-parametric approaches using digital signal-processing (DSP) and, more recently, parametric approaches (e.g., fully supervised ML methods). However these do not handle the complexity of real-world environments well and do not meet requirements for large-scale commercial deployments: the accuracy of DSP-based method deteriorates significantly in non-line-of-sight conditions, while supervised ML methods need large amounts of hard-to-acquire centimeter accuracy position labels. In contrast, WiCluster is both precise and requires weaker label-information that can be easily collected. Our first contribution is a novel dimensionality reduction method for charting. It combines a triplet-loss with a multi-scale clustering-loss to map the high-dimensional CSI representation to a 2D/3D latent space. Our second contribution is two weakly supervised losses that map this latent space into a Cartesian map, resulting in meter-accuracy position results. These losses only require simple to acquire priors: a sketch of the floorplan, approximate location of access-point locations and a few CSI packets that are labeled with the corresponding zone in the floorplan. Thirdly, we report results and a robustness study for 2D positioning in a single-floor office building and 3D positioning in a two-floor home to show the robustness of our method.
    Backward-Compatible Prediction Updates: A Probabilistic Approach. (arXiv:2107.01057v1 [cs.LG])
    (2 min) When machine learning systems meet real world applications, accuracy is only one of several requirements. In this paper, we assay a complementary perspective originating from the increasing availability of pre-trained and regularly improving state-of-the-art models. While new improved models develop at a fast pace, downstream tasks vary more slowly or stay constant. Assume that we have a large unlabelled data set for which we want to maintain accurate predictions. Whenever a new and presumably better ML models becomes available, we encounter two problems: (i) given a limited budget, which data points should be re-evaluated using the new model?; and (ii) if the new predictions differ from the current ones, should we update? Problem (i) is about compute cost, which matters for very large data sets and models. Problem (ii) is about maintaining consistency of the predictions, which can be highly relevant for downstream applications; our demand is to avoid negative flips, i.e., changing correct to incorrect predictions. In this paper, we formalize the Prediction Update Problem and present an efficient probabilistic approach as answer to the above questions. In extensive experiments on standard classification benchmark data sets, we show that our method outperforms alternative strategies along key metrics for backward-compatible prediction updates.
    Gamers Private Network Performance Forecasting. From Raw Data to the Data Warehouse with Machine Learning and Neural Nets. (arXiv:2107.00998v1 [cs.NI])
    (2 min) Gamers Private Network (GPN) is a client/server technology that guarantees a connection for online video games that is more reliable and lower latency than a standard internet connection. Users of the GPN technology benefit from a stable and high-quality gaming experience for online games, which are hosted and played across the world. After transforming a massive volume of raw networking data collected by WTFast, we have structured the cleaned data into a special-purpose data warehouse and completed the extensive analysis using machine learning and neural nets technologies, and business intelligence tools. These analyses demonstrate the ability to predict and quantify changes in the network and demonstrate the benefits gained from the use of a GPN for users when connected to an online game session.
    R2D2: Recursive Transformer based on Differentiable Tree for Interpretable Hierarchical Language Modeling. (arXiv:2107.00967v1 [cs.CL])
    (2 min) Human language understanding operates at multiple levels of granularity (e.g., words, phrases, and sentences) with increasing levels of abstraction that can be hierarchically combined. However, existing deep models with stacked layers do not explicitly model any sort of hierarchical process. This paper proposes a recursive Transformer model based on differentiable CKY style binary trees to emulate the composition process. We extend the bidirectional language model pre-training objective to this architecture, attempting to predict each word given its left and right abstraction nodes. To scale up our approach, we also introduce an efficient pruned tree induction algorithm to enable encoding in just a linear number of composition steps. Experimental results on language modeling and unsupervised parsing show the effectiveness of our approach.
    DUKweb: Diachronic word representations from the UK Web Archive corpus. (arXiv:2107.01076v1 [cs.CL])
    (2 min) Lexical semantic change (detecting shifts in the meaning and usage of words) is an important task for social and cultural studies as well as for Natural Language Processing applications. Diachronic word embeddings (time-sensitive vector representations of words that preserve their meaning) have become the standard resource for this task. However, given the significant computational resources needed for their generation, very few resources exist that make diachronic word embeddings available to the scientific community. In this paper we present DUKweb, a set of large-scale resources designed for the diachronic analysis of contemporary English. DUKweb was created from the JISC UK Web Domain Dataset (1996-2013), a very large archive which collects resources from the Internet Archive that were hosted on domains ending in `.uk'. DUKweb consists of a series word co-occurrence matrices and two types of word embeddings for each year in the JISC UK Web Domain dataset. We show the reuse potential of DUKweb and its quality standards via a case study on word meaning change detection.
    Evaluating the Usefulness of Unsupervised monitoring in Cultural Heritage Monuments. (arXiv:2107.00964v1 [cs.CV])
    (2 min) In this paper, we scrutinize the effectiveness of various clustering techniques, investigating their applicability in Cultural Heritage monitoring applications. In the context of this paper, we detect the level of decomposition and corrosion on the walls of Saint Nicholas fort in Rhodes utilizing hyperspectral images. A total of 6 different clustering approaches have been evaluated over a set of 14 different orthorectified hyperspectral images. Experimental setup in this study involves K-means, Spectral, Meanshift, DBSCAN, Birch and Optics algorithms. For each of these techniques we evaluate its performance by the use of performance metrics such as Calinski-Harabasz, Davies-Bouldin indexes and Silhouette value. In this approach, we evaluate the outcomes of the clustering methods by comparing them with a set of annotated images which denotes the ground truth regarding the decomposition and/or corrosion area of the original images. The results depict that a few clustering techniques applied on the given dataset succeeded decent accuracy, precision, recall and f1 scores. Eventually, it was observed that the deterioration was detected quite accurately.
    Inverse-Dirichlet Weighting Enables Reliable Training of Physics Informed Neural Networks. (arXiv:2107.00940v1 [cs.LG])
    (2 min) We characterize and remedy a failure mode that may arise from multi-scale dynamics with scale imbalances during training of deep neural networks, such as Physics Informed Neural Networks (PINNs). PINNs are popular machine-learning templates that allow for seamless integration of physical equation models with data. Their training amounts to solving an optimization problem over a weighted sum of data-fidelity and equation-fidelity objectives. Conflicts between objectives can arise from scale imbalances, heteroscedasticity in the data, stiffness of the physical equation, or from catastrophic interference during sequential training. We explain the training pathology arising from this and propose a simple yet effective inverse-Dirichlet weighting strategy to alleviate the issue. We compare with Sobolev training of neural networks, providing the baseline of analytically $\boldsymbol{\epsilon}$-optimal training. We demonstrate the effectiveness of inverse-Dirichlet weighting in various applications, including a multi-scale model of active turbulence, where we show orders of magnitude improvement in accuracy and convergence over conventional PINN training. For inverse modeling using sequential training, we find that inverse-Dirichlet weighting protects a PINN against catastrophic forgetting.
    Feeling of Presence Maximization: mmWave-Enabled Virtual Reality Meets Deep Reinforcement Learning. (arXiv:2107.01001v1 [cs.NI])
    (2 min) This paper investigates the problem of providing ultra-reliable and energy-efficient virtual reality (VR) experiences for wireless mobile users. To ensure reliable ultra-high-definition (UHD) video frame delivery to mobile users and enhance their immersive visual experiences, a coordinated multipoint (CoMP) transmission technique and millimeter wave (mmWave) communications are exploited. Owing to user movement and time-varying wireless channels, the wireless VR experience enhancement problem is formulated as a sequence-dependent and mixed-integer problem with a goal of maximizing users' feeling of presence (FoP) in the virtual world, subject to power consumption constraints on access points (APs) and users' head-mounted displays (HMDs). The problem, however, is hard to be directly solved due to the lack of users' accurate tracking information and the sequence-dependent and mixed-integer characteristics. To overcome this challenge, we develop a parallel echo state network (ESN) learning method to predict users' tracking information by training fresh and historical tracking samples separately collected by APs. With the learnt results, we propose a deep reinforcement learning (DRL) based optimization algorithm to solve the formulated problem. In this algorithm, we implement deep neural networks (DNNs) as a scalable solution to produce integer decision variables and solving a continuous power control problem to criticize the integer decision variables. Finally, the performance of the proposed algorithm is compared with various benchmark algorithms, and the impact of different design parameters is also discussed. Simulation results demonstrate that the proposed algorithm is more 4.14% energy-efficient than the benchmark algorithms.
    Learning Primal Heuristics for Mixed Integer Programs. (arXiv:2107.00866v1 [cs.AI])
    (2 min) This paper proposes a novel primal heuristic for Mixed Integer Programs, by employing machine learning techniques. Mixed Integer Programming is a general technique for formulating combinatorial optimization problems. Inside a solver, primal heuristics play a critical role in finding good feasible solutions that enable one to tighten the duality gap from the outset of the Branch-and-Bound algorithm (B&B), greatly improving its performance by pruning the B&B tree aggressively. In this paper, we investigate whether effective primal heuristics can be automatically learned via machine learning. We propose a new method to represent an optimization problem as a graph, and train a Graph Convolutional Network on solved problem instances with known optimal solutions. This in turn can predict the values of decision variables in the optimal solution for an unseen problem instance of a similar type. The prediction of variable solutions is then leveraged by a novel configuration of the B&B method, Probabilistic Branching with guided Depth-first Search (PB-DFS) approach, aiming to find (near-)optimal solutions quickly. The experimental results show that this new heuristic can find better primal solutions at a much earlier stage of the solving process, compared to other state-of-the-art primal heuristics.
    The Causal Neural Connection: Expressiveness, Learnability, and Inference. (arXiv:2107.00793v1 [cs.LG])
    (2 min) One of the central elements of any causal inference is an object called structural causal model (SCM), which represents a collection of mechanisms and exogenous sources of random variation of the system under investigation (Pearl, 2000). An important property of many kinds of neural networks is universal approximability: the ability to approximate any function to arbitrary precision. Given this property, one may be tempted to surmise that a collection of neural nets is capable of learning any SCM by training on data generated by that SCM. In this paper, we show this is not the case by disentangling the notions of expressivity and learnability. Specifically, we show that the causal hierarchy theorem (Thm. 1, Bareinboim et al., 2020), which describes the limits of what can be learned from data, still holds for neural models. For instance, an arbitrarily complex and expressive neural net is unable to predict the effects of interventions given observational data alone. Given this result, we introduce a special type of SCM called a neural causal model (NCM), and formalize a new type of inductive bias to encode structural constraints necessary for performing causal inferences. Building on this new class of models, we focus on solving two canonical tasks found in the literature known as causal identification and estimation. Leveraging the neural toolbox, we develop an algorithm that is both sufficient and necessary to determine whether a causal effect can be learned from data (i.e., causal identifiability); it then estimates the effect whenever identifiability holds (causal estimation). Simulations corroborate the proposed approach.
    Supervised Contrastive Learning for Accented Speech Recognition. (arXiv:2107.00921v1 [cs.SD])
    (2 min) Neural network based speech recognition systems suffer from performance degradation due to accented speech, especially unfamiliar accents. In this paper, we study the supervised contrastive learning framework for accented speech recognition. To build different views (similar "positive" data samples) for contrastive learning, three data augmentation techniques including noise injection, spectrogram augmentation and TTS-same-sentence generation are further investigated. From the experiments on the Common Voice dataset, we have shown that contrastive learning helps to build data-augmentation invariant and pronunciation invariant representations, which significantly outperforms traditional joint training methods in both zero-shot and full-shot settings. Experiments show that contrastive learning can improve accuracy by 3.66% (zero-shot) and 3.78% (full-shot) on average, comparing to the joint training method.
    Theory of Deep Convolutional Neural Networks III: Approximating Radial Functions. (arXiv:2107.00896v1 [cs.LG])
    (2 min) We consider a family of deep neural networks consisting of two groups of convolutional layers, a downsampling operator, and a fully connected layer. The network structure depends on two structural parameters which determine the numbers of convolutional layers and the width of the fully connected layer. We establish an approximation theory with explicit approximation rates when the approximated function takes a composite form $f\circ Q$ with a feature polynomial $Q$ and a univariate function $f$. In particular, we prove that such a network can outperform fully connected shallow networks in approximating radial functions with $Q(x) =|x|^2$, when the dimension $d$ of data from $\mathbb{R}^d$ is large. This gives the first rigorous proof for the superiority of deep convolutional neural networks in approximating functions with special structures. Then we carry out generalization analysis for empirical risk minimization with such a deep network in a regression framework with the regression function of the form $f\circ Q$. Our network structure which does not use any composite information or the functions $Q$ and $f$ can automatically extract features and make use of the composite nature of the regression function via tuning the structural parameters. Our analysis provides an error bound which decreases with the network depth to a minimum and then increases, verifying theoretically a trade-off phenomenon observed for network depths in many practical applications.
    DeformRS: Certifying Input Deformations with Randomized Smoothing. (arXiv:2107.00996v1 [cs.LG])
    (2 min) Deep neural networks are vulnerable to input deformations in the form of vector fields of pixel displacements and to other parameterized geometric deformations e.g. translations, rotations, etc. Current input deformation certification methods either (i) do not scale to deep networks on large input datasets, or (ii) can only certify a specific class of deformations, e.g. only rotations. We reformulate certification in randomized smoothing setting for both general vector field and parameterized deformations and propose DeformRS-VF and DeformRS-Par, respectively. Our new formulation scales to large networks on large input datasets. For instance, DeformRS-Par certifies rich deformations, covering translations, rotations, scaling, affine deformations, and other visually aligned deformations such as ones parameterized by Discrete-Cosine-Transform basis. Extensive experiments on MNIST, CIFAR10 and ImageNet show that DeformRS-Par outperforms existing state-of-the-art in certified accuracy, e.g. improved certified accuracy of 6% against perturbed rotations in the set [-10,10] degrees on ImageNet.
    Misinformation Detection on YouTube Using Video Captions. (arXiv:2107.00941v1 [cs.LG])
    (2 min) Millions of people use platforms such as YouTube, Facebook, Twitter, and other mass media. Due to the accessibility of these platforms, they are often used to establish a narrative, conduct propaganda, and disseminate misinformation. This work proposes an approach that uses state-of-the-art NLP techniques to extract features from video captions (subtitles). To evaluate our approach, we utilize a publicly accessible and labeled dataset for classifying videos as misinformation or not. The motivation behind exploring video captions stems from our analysis of videos metadata. Attributes such as the number of views, likes, dislikes, and comments are ineffective as videos are hard to differentiate using this information. Using caption dataset, the proposed models can classify videos among three classes (Misinformation, Debunking Misinformation, and Neutral) with 0.85 to 0.90 F1-score. To emphasize the relevance of the misinformation class, we re-formulate our classification problem as a two-class classification - Misinformation vs. others (Debunking Misinformation and Neutral). In our experiments, the proposed models can classify videos with 0.92 to 0.95 F1-score and 0.78 to 0.90 AUC ROC.
    Online Metro Origin-Destination Prediction via Heterogeneous Information Aggregation. (arXiv:2107.00946v1 [cs.LG])
    (2 min) Metro origin-destination prediction is a crucial yet challenging task for intelligent transportation management, which aims to accurately forecast two specific types of cross-station ridership, i.e., Origin-Destination (OD) one and Destination-Origin (DO) one. However, complete OD matrices of previous time intervals can not be obtained immediately in online metro systems, and conventional methods only used limited information to forecast the future OD and DO ridership separately.In this work, we proposed a novel neural network module termed Heterogeneous Information Aggregation Machine (HIAM), which fully exploits heterogeneous information of historical data (e.g., incomplete OD matrices, unfinished order vectors, and DO matrices) to jointly learn the evolutionary patterns of OD and DO ridership. Specifically, an OD modeling branch estimates the potential destinations of unfinished orders explicitly to complement the information of incomplete OD matrices, while a DO modeling branch takes DO matrices as input to capture the spatial-temporal distribution of DO ridership. Moreover, a Dual Information Transformer is introduced to propagate the mutual information among OD features and DO features for modeling the OD-DO causality and correlation. Based on the proposed HIAM, we develop a unified Seq2Seq network to forecast the future OD and DO ridership simultaneously. Extensive experiments conducted on two large-scale benchmarks demonstrate the effectiveness of our method for online metro origin-destination prediction.
    From Personalized Medicine to Population Health: A Survey of mHealth Sensing Techniques. (arXiv:2107.00948v1 [cs.LG])
    (2 min) Mobile Sensing Apps have been widely used as a practical approach to collect behavioral and health-related information from individuals and provide timely intervention to promote health and well-beings, such as mental health and chronic cares. As the objectives of mobile sensing could be either \emph{(a) personalized medicine for individuals} or \emph{(b) public health for populations}, in this work we review the design of these mobile sensing apps, and propose to categorize the design of these apps/systems in two paradigms -- \emph{(i) Personal Sensing} and \emph{(ii) Crowd Sensing} paradigms. While both sensing paradigms might incorporate with common ubiquitous sensing technologies, such as wearable sensors, mobility monitoring, mobile data offloading, and/or cloud-based data analytics to collect and process sensing data from individuals, we present a novel taxonomy system with two major components that can specify and classify apps/systems from aspects of the life-cycle of mHealth Sensing: \emph{(1) Sensing Task Creation \& Participation}, \emph{(2) Health Surveillance \& Data Collection}, and \emph{(3) Data Analysis \& Knowledge Discovery}. With respect to different goals of the two paradigms, this work systematically reviews this field, and summarizes the design of typical apps/systems in the view of the configurations and interactions between these two components. In addition to summarization, the proposed taxonomy system also helps figure out the potential directions of mobile sensing for health from both personalized medicines and population health perspectives.
    A\c{C}AI: Ascent Similarity Caching with Approximate Indexes. (arXiv:2107.00957v1 [cs.NI])
    (2 min) Similarity search is a key operation in multimedia retrieval systems and recommender systems, and it will play an important role also for future machine learning and augmented reality applications. When these systems need to serve large objects with tight delay constraints, edge servers close to the end-user can operate as similarity caches to speed up the retrieval. In this paper we present A\c{C}AI, a new similarity caching policy which improves on the state of the art by using (i) an (approximate) index for the whole catalog to decide which objects to serve locally and which to retrieve from the remote server, and (ii) a mirror ascent algorithm to update the set of local objects with strong guarantees even when the request process does not exhibit any statistical regularity.
    ResIST: Layer-Wise Decomposition of ResNets for Distributed Training. (arXiv:2107.00961v1 [cs.LG])
    (2 min) We propose {\rm \texttt{ResIST}}, a novel distributed training protocol for Residual Networks (ResNets). {\rm \texttt{ResIST}} randomly decomposes a global ResNet into several shallow sub-ResNets that are trained independently in a distributed manner for several local iterations, before having their updates synchronized and aggregated into the global model. In the next round, new sub-ResNets are randomly generated and the process repeats. By construction, per iteration, {\rm \texttt{ResIST}} communicates only a small portion of network parameters to each machine and never uses the full model during training. Thus, {\rm \texttt{ResIST}} reduces the communication, memory, and time requirements of ResNet training to only a fraction of the requirements of previous methods. In comparison to common protocols like data-parallel training and data-parallel training with local SGD, {\rm \texttt{ResIST}} yields a decrease in wall-clock training time, while being competitive with respect to model performance.
    Rapid Neural Architecture Search by Learning to Generate Graphs from Datasets. (arXiv:2107.00860v1 [cs.LG])
    (2 min) Despite the success of recent Neural Architecture Search (NAS) methods on various tasks which have shown to output networks that largely outperform human-designed networks, conventional NAS methods have mostly tackled the optimization of searching for the network architecture for a single task (dataset), which does not generalize well across multiple tasks (datasets). Moreover, since such task-specific methods search for a neural architecture from scratch for every given task, they incur a large computational cost, which is problematic when the time and monetary budget are limited. In this paper, we propose an efficient NAS framework that is trained once on a database consisting of datasets and pretrained networks and can rapidly search for a neural architecture for a novel dataset. The proposed MetaD2A (Meta Dataset-to-Architecture) model can stochastically generate graphs (architectures) from a given set (dataset) via a cross-modal latent space learned with amortized meta-learning. Moreover, we also propose a meta-performance predictor to estimate and select the best architecture without direct training on target datasets. The experimental results demonstrate that our model meta-learned on subsets of ImageNet-1K and architectures from NAS-Bench 201 search space successfully generalizes to multiple unseen datasets including CIFAR-10 and CIFAR-100, with an average search time of 33 GPU seconds. Even under MobileNetV3 search space, MetaD2A is 5.5K times faster than NSGANetV2, a transferable NAS method, with comparable performance. We believe that the MetaD2A proposes a new research direction for rapid NAS as well as ways to utilize the knowledge from rich databases of datasets and architectures accumulated over the past years. Code is available at https://github.com/HayeonLee/MetaD2A.
    SocialAI: Benchmarking Socio-Cognitive Abilities in Deep Reinforcement Learning Agents. (arXiv:2107.00956v1 [cs.LG])
    (2 min) Building embodied autonomous agents capable of participating in social interactions with humans is one of the main challenges in AI. Within the Deep Reinforcement Learning (DRL) field, this objective motivated multiple works on embodied language use. However, current approaches focus on language as a communication tool in very simplified and non-diverse social situations: the "naturalness" of language is reduced to the concept of high vocabulary size and variability. In this paper, we argue that aiming towards human-level AI requires a broader set of key social skills: 1) language use in complex and variable social contexts; 2) beyond language, complex embodied communication in multimodal settings within constantly evolving social worlds. We explain how concepts from cognitive sciences could help AI to draw a roadmap towards human-like intelligence, with a focus on its social dimensions. As a first step, we propose to expand current research to a broader set of core social skills. To do this, we present SocialAI, a benchmark to assess the acquisition of social skills of DRL agents using multiple grid-world environments featuring other (scripted) social agents. We then study the limits of a recent SOTA DRL approach when tested on SocialAI and discuss important next steps towards proficient social agents. Videos and code are available at https://sites.google.com/view/socialai.
    An Experience Report on Machine Learning Reproducibility: Guidance for Practitioners and TensorFlow Model Garden Contributors. (arXiv:2107.00821v1 [cs.SE])
    (2 min) Machine learning techniques are becoming a fundamental tool for scientific and engineering progress. These techniques are applied in contexts as diverse as astronomy and spam filtering. However, correctly applying these techniques requires careful engineering. Much attention has been paid to the technical potential; relatively little attention has been paid to the software engineering process required to bring research-based machine learning techniques into practical utility. Technology companies have supported the engineering community through machine learning frameworks such as TensorFLow and PyTorch, but the details of how to engineer complex machine learning models in these frameworks have remained hidden. To promote best practices within the engineering community, academic institutions and Google have partnered to launch a Special Interest Group on Machine Learning Models (SIGMODELS) whose goal is to develop exemplary implementations of prominent machine learning models in community locations such as the TensorFlow Model Garden (TFMG). The purpose of this report is to define a process for reproducing a state-of-the-art machine learning model at a level of quality suitable for inclusion in the TFMG. We define the engineering process and elaborate on each step, from paper analysis to model release. We report on our experiences implementing the YOLO model family with a team of 26 student researchers, share the tools we developed, and describe the lessons we learned along the way.
    Quantifying Availability and Discovery in Recommender Systems via Stochastic Reachability. (arXiv:2107.00833v1 [cs.IR])
    (2 min) In this work, we consider how preference models in interactive recommendation systems determine the availability of content and users' opportunities for discovery. We propose an evaluation procedure based on stochastic reachability to quantify the maximum probability of recommending a target piece of content to an user for a set of allowable strategic modifications. This framework allows us to compute an upper bound on the likelihood of recommendation with minimal assumptions about user behavior. Stochastic reachability can be used to detect biases in the availability of content and diagnose limitations in the opportunities for discovery granted to users. We show that this metric can be computed efficiently as a convex program for a variety of practical settings, and further argue that reachability is not inherently at odds with accuracy. We demonstrate evaluations of recommendation algorithms trained on large datasets of explicit and implicit ratings. Our results illustrate how preference models, selection rules, and user interventions impact reachability and how these effects can be distributed unevenly.
    Conflict-free collective stochastic decision making by orbital angular momentum entangled photons. (arXiv:2107.00877v1 [quant-ph])
    (2 min) In recent cross-disciplinary studies involving both optics and computing, single-photon-based decision-making has been demonstrated by utilizing the wave-particle duality of light to solve multi-armed bandit problems. Furthermore, entangled-photon-based decision-making has managed to solve a competitive multi-armed bandit problem in such a way that conflicts of decisions among players are avoided while ensuring equality. However, as these studies are based on the polarization of light, the number of available choices is limited to two, corresponding to two orthogonal polarization states. Here we propose a scalable principle to solve competitive decision-making situations by using the orbital angular momentum as the tunable degree of freedom of photons, which theoretically allows an unlimited number of arms. Moreover, by extending the Hong-Ou-Mandel effect to more than two states, we theoretically establish an experimental configuration able to generate entangled photon states with orbital angular momentum and conditions that provide conflict-free selections at every turn. We numerically examine total rewards regarding three-armed bandit problems, for which the proposed strategy accomplishes almost the theoretical maximum, which is greater than a conventional mixed strategy intending to realize Nash equilibrium. This is thanks to the entanglement property that achieves no-conflict selections, even in the exploring phase to find the best arms.
    Data Centric Domain Adaptation for Historical Text with OCR Errors. (arXiv:2107.00927v1 [cs.CL])
    (2 min) We propose new methods for in-domain and cross-domain Named Entity Recognition (NER) on historical data for Dutch and French. For the cross-domain case, we address domain shift by integrating unsupervised in-domain data via contextualized string embeddings; and OCR errors by injecting synthetic OCR errors into the source domain and address data centric domain adaptation. We propose a general approach to imitate OCR errors in arbitrary input data. Our cross-domain as well as our in-domain results outperform several strong baselines and establish state-of-the-art results. We publish preprocessed versions of the French and Dutch Europeana NER corpora.
    A Novel Deep Reinforcement Learning Based Stock Direction Prediction using Knowledge Graph and Community Aware Sentiments. (arXiv:2107.00931v1 [cs.AI])
    (2 min) Stock market prediction has been an important topic for investors, researchers, and analysts. Because it is affected by too many factors, stock market prediction is a difficult task to handle. In this study, we propose a novel method that is based on deep reinforcement learning methodologies for the direction prediction of stocks using sentiments of community and knowledge graph. For this purpose, we firstly construct a social knowledge graph of users by analyzing relations between connections. After that, time series analysis of related stock and sentiment analysis is blended with deep reinforcement methodology. Turkish version of Bidirectional Encoder Representations from Transformers (BerTurk) is employed to analyze the sentiments of the users while deep Q-learning methodology is used for the deep reinforcement learning side of the proposed model to construct the deep Q network. In order to demonstrate the effectiveness of the proposed model, Garanti Bank (GARAN), Akbank (AKBNK), T\"urkiye \.I\c{s} Bankas{\i} (ISCTR) stocks in Istanbul Stock Exchange are used as a case study. Experiment results show that the proposed novel model achieves remarkable results for stock market prediction task.
    The Spotlight: A General Method for Discovering Systematic Errors in Deep Learning Models. (arXiv:2107.00758v1 [cs.LG])
    (2 min) Supervised learning models often make systematic errors on rare subsets of the data. However, such systematic errors can be difficult to identify, as model performance can only be broken down across sensitive groups when these groups are known and explicitly labelled. This paper introduces a method for discovering systematic errors, which we call the spotlight. The key idea is that similar inputs tend to have similar representations in the final hidden layer of a neural network. We leverage this structure by "shining a spotlight" on this representation space to find contiguous regions where the model performs poorly. We show that the spotlight surfaces semantically meaningful areas of weakness in a wide variety of model architectures, including image classifiers, language models, and recommender systems.
    Almost Tight Approximation Algorithms for Explainable Clustering. (arXiv:2107.00774v1 [cs.LG])
    (2 min) Recently, due to an increasing interest for transparency in artificial intelligence, several methods of explainable machine learning have been developed with the simultaneous goal of accuracy and interpretability by humans. In this paper, we study a recent framework of explainable clustering first suggested by Dasgupta et al.~\cite{dasgupta2020explainable}. Specifically, we focus on the $k$-means and $k$-medians problems and provide nearly tight upper and lower bounds. First, we provide an $O(\log k \log \log k)$-approximation algorithm for explainable $k$-medians, improving on the best known algorithm of $O(k)$~\cite{dasgupta2020explainable} and nearly matching the known $\Omega(\log k)$ lower bound~\cite{dasgupta2020explainable}. In addition, in low-dimensional spaces $d \ll \log k$, we show that our algorithm also provides an $O(d \log^2 d)$-approximate solution for explainable $k$-medians. This improves over the best known bound of $O(d \log k)$ for low dimensions~\cite{laber2021explainable}, and is a constant for constant dimensional spaces. To complement this, we show a nearly matching $\Omega(d)$ lower bound. Next, we study the $k$-means problem in this context and provide an $O(k \log k)$-approximation algorithm for explainable $k$-means, improving over the $O(k^2)$ bound of Dasgupta et al. and the $O(d k \log k)$ bound of \cite{laber2021explainable}. To complement this we provide an almost tight $\Omega(k)$ lower bound, improving over the $\Omega(\log k)$ lower bound of Dasgupta et al. All our algorithms run in near linear time in the number of points and the dimension.
    Deep learning-based statistical noise reduction for multidimensional spectral data. (arXiv:2107.00844v1 [cs.LG])
    (2 min) In spectroscopic experiments, data acquisition in multi-dimensional phase space may require long acquisition time, owing to the large phase space volume to be covered. In such case, the limited time available for data acquisition can be a serious constraint for experiments in which multidimensional spectral data are acquired. Here, taking angle-resolved photoemission spectroscopy (ARPES) as an example, we demonstrate a denoising method that utilizes deep learning as an intelligent way to overcome the constraint. With readily available ARPES data and random generation of training data set, we successfully trained the denoising neural network without overfitting. The denoising neural network can remove the noise in the data while preserving its intrinsic information. We show that the denoising neural network allows us to perform similar level of second-derivative and line shape analysis on data taken with two orders of magnitude less acquisition time. The importance of our method lies in its applicability to any multidimensional spectral data that are susceptible to statistical noise.
    Near-optimal Algorithms for Explainable k-Medians and k-Means. (arXiv:2107.00798v1 [cs.DS])
    (2 min) We consider the problem of explainable $k$-medians and $k$-means introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian~(ICML 2020). In this problem, our goal is to find a \emph{threshold decision tree} that partitions data into $k$ clusters and minimizes the $k$-medians or $k$-means objective. The obtained clustering is easy to interpret because every decision node of a threshold tree splits data based on a single feature into two groups. We propose a new algorithm for this problem which is $\tilde O(\log k)$ competitive with $k$-medians with $\ell_1$ norm and $\tilde O(k)$ competitive with $k$-means. This is an improvement over the previous guarantees of $O(k)$ and $O(k^2)$ by Dasgupta et al (2020). We also provide a new algorithm which is $O(\log^{3/2} k)$ competitive for $k$-medians with $\ell_2$ norm. Our first algorithm is near-optimal: Dasgupta et al (2020) showed a lower bound of $\Omega(\log k)$ for $k$-medians; in this work, we prove a lower bound of $\tilde\Omega(k)$ for $k$-means. We also provide a lower bound of $\Omega(\log k)$ for $k$-medians with $\ell_2$ norm.
    Few-shot Learning for Unsupervised Feature Selection. (arXiv:2107.00816v1 [cs.LG])
    (2 min) We propose a few-shot learning method for unsupervised feature selection, which is a task to select a subset of relevant features in unlabeled data. Existing methods usually require many instances for feature selection. However, sufficient instances are often unavailable in practice. The proposed method can select a subset of relevant features in a target task given a few unlabeled target instances by training with unlabeled instances in multiple source tasks. Our model consists of a feature selector and decoder. The feature selector outputs a subset of relevant features taking a few unlabeled instances as input such that the decoder can reconstruct the original features of unseen instances from the selected ones. The feature selector uses the Concrete random variables to select features via gradient descent. To encode task-specific properties from a few unlabeled instances to the model, the Concrete random variables and decoder are modeled using permutation-invariant neural networks that take a few unlabeled instances as input. Our model is trained by minimizing the expected test reconstruction error given a few unlabeled instances that is calculated with datasets in source tasks. We experimentally demonstrate that the proposed method outperforms existing feature selection methods.
    Mitigating deep double descent by concatenating inputs. (arXiv:2107.00797v1 [cs.LG])
    (2 min) The double descent curve is one of the most intriguing properties of deep neural networks. It contrasts the classical bias-variance curve with the behavior of modern neural networks, occurring where the number of samples nears the number of parameters. In this work, we explore the connection between the double descent phenomena and the number of samples in the deep neural network setting. In particular, we propose a construction which augments the existing dataset by artificially increasing the number of samples. This construction empirically mitigates the double descent curve in this setting. We reproduce existing work on deep double descent, and observe a smooth descent into the overparameterized region for our construction. This occurs both with respect to the model size, and with respect to the number epochs.
    Flow-based sampling for multimodal distributions in lattice field theory. (arXiv:2107.00734v1 [hep-lat])
    (2 min) Recent results have demonstrated that samplers constructed with flow-based generative models are a promising new approach for configuration generation in lattice field theory. In this paper, we present a set of methods to construct flow models for targets with multiple separated modes (i.e. theories with multiple vacua). We demonstrate the application of these methods to modeling two-dimensional real scalar field theory in its symmetry-broken phase. In this context we investigate the performance of different flow-based sampling algorithms, including a composite sampling algorithm where flow-based proposals are occasionally augmented by applying updates using traditional algorithms like HMC.
    Reconsidering Dependency Networks from an Information Geometry Perspective. (arXiv:2107.00871v1 [cs.LG])
    (2 min) Dependency networks (Heckerman et al., 2000) are potential probabilistic graphical models for systems comprising a large number of variables. Like Bayesian networks, the structure of a dependency network is represented by a directed graph, and each node has a conditional probability table. Learning and inference are realized locally on individual nodes; therefore, computation remains tractable even with a large number of variables. However, the dependency network's learned distribution is the stationary distribution of a Markov chain called pseudo-Gibbs sampling and has no closed-form expressions. This technical disadvantage has impeded the development of dependency networks. In this paper, we consider a certain manifold for each node. Then, we can interpret pseudo-Gibbs sampling as iterative m-projections onto these manifolds. This interpretation provides a theoretical bound for the location where the stationary distribution of pseudo-Gibbs sampling exists in distribution space. Furthermore, this interpretation involves structure and parameter learning algorithms as optimization problems. In addition, we compare dependency and Bayesian networks experimentally. The results demonstrate that the dependency network and the Bayesian network have roughly the same performance in terms of the accuracy of their learned distributions. The results also show that the dependency network can learn much faster than the Bayesian network.
    On Bridging Generic and Personalized Federated Learning. (arXiv:2107.00778v1 [cs.LG])
    (2 min) Federated learning is promising for its ability to collaboratively train models with multiple clients without accessing their data, but vulnerable when clients' data distributions diverge from each other. This divergence further leads to a dilemma: "Should we prioritize the learned model's generic performance (for future use at the server) or its personalized performance (for each client)?" These two, seemingly competing goals have divided the community to focus on one or the other, yet in this paper we show that it is possible to approach both at the same time. Concretely, we propose a novel federated learning framework that explicitly decouples a model's dual duties with two prediction tasks. On the one hand, we introduce a family of losses that are robust to non-identical class distributions, enabling clients to train a generic predictor with a consistent objective across them. On the other hand, we formulate the personalized predictor as a lightweight adaptive module that is learned to minimize each client's empirical risk on top of the generic predictor. With this two-loss, two-predictor framework which we name Federated Robust Decoupling Fed-RoD, the learned model can simultaneously achieve state-of-the-art generic and personalized performance, essentially bridging the two tasks.
    RL-NCS: Reinforcement learning based data-driven approach for nonuniform compressed sensing. (arXiv:2107.00838v1 [cs.LG])
    (2 min) A reinforcement-learning-based non-uniform compressed sensing (NCS) framework for time-varying signals is introduced. The proposed scheme, referred to as RL-NCS, aims to boost the performance of signal recovery through an optimal and adaptive distribution of sensing energy among two groups of coefficients of the signal, referred to as the region of interest (ROI) coefficients and non-ROI coefficients. The coefficients in ROI usually have greater importance and need to be reconstructed with higher accuracy compared to non-ROI coefficients. In order to accomplish this task, the ROI is predicted at each time step using two specific approaches. One of these approaches incorporates a long short-term memory (LSTM) network for the prediction. The other approach employs the previous ROI information for predicting the next step ROI. Using the exploration-exploitation technique, a Q-network learns to choose the best approach for designing the measurement matrix. Furthermore, a joint loss function is introduced for the efficient training of the Q-network as well as the LSTM network. The result indicates a significant performance gain for our proposed method, even for rapidly varying signals and a reduced number of measurements.
    Meta-Learning for Relative Density-Ratio Estimation. (arXiv:2107.00801v1 [stat.ML])
    (2 min) The ratio of two probability densities, called a density-ratio, is a vital quantity in machine learning. In particular, a relative density-ratio, which is a bounded extension of the density-ratio, has received much attention due to its stability and has been used in various applications such as outlier detection and dataset comparison. Existing methods for (relative) density-ratio estimation (DRE) require many instances from both densities. However, sufficient instances are often unavailable in practice. In this paper, we propose a meta-learning method for relative DRE, which estimates the relative density-ratio from a few instances by using knowledge in related datasets. Specifically, given two datasets that consist of a few instances, our model extracts the datasets' information by using neural networks and uses it to obtain instance embeddings appropriate for the relative DRE. We model the relative density-ratio by a linear model on the embedded space, whose global optimum solution can be obtained as a closed-form solution. The closed-form solution enables fast and effective adaptation to a few instances, and its differentiability enables us to train our model such that the expected test error for relative DRE can be explicitly minimized after adapting to a few instances. We empirically demonstrate the effectiveness of the proposed method by using three problems: relative DRE, dataset comparison, and outlier detection.
    Cell-average based neural network method for hyperbolic and parabolic partial differential equations. (arXiv:2107.00813v1 [math.NA])
    (2 min) Motivated by finite volume scheme, a cell-average based neural network method is proposed. The method is based on the integral or weak formulation of partial differential equations. A simple feed forward network is forced to learn the solution average evolution between two neighboring time steps. Offline supervised training is carried out to obtain the optimal network parameter set, which uniquely identifies one finite volume like neural network method. Once well trained, the network method is implemented as a finite volume scheme, thus is mesh dependent. Different to traditional numerical methods, our method can be relieved from the explicit scheme CFL restriction and can adapt to any time step size for solution evolution. For Heat equation, first order of convergence is observed and the errors are related to the spatial mesh size but are observed independent of the mesh size in time. The cell-average based neural network method can sharply evolve contact discontinuity with almost zero numerical diffusion introduced. Shock and rarefaction waves are well captured for nonlinear hyperbolic conservation laws.
    Reinforcement Learning for Feedback-Enabled Cyber Resilience. (arXiv:2107.00783v1 [cs.CR])
    (2 min) The rapid growth in the number of devices and their connectivity has enlarged the attack surface and weakened cyber systems. As attackers become increasingly sophisticated and resourceful, mere reliance on traditional cyber protection, such as intrusion detection, firewalls, and encryption, is insufficient to secure cyber systems. Cyber resilience provides a new security paradigm that complements inadequate protection with resilience mechanisms. A Cyber-Resilient Mechanism (CRM) adapts to the known or zero-day threats and uncertainties in real-time and strategically responds to them to maintain the critical functions of the cyber systems. Feedback architectures play a pivotal role in enabling the online sensing, reasoning, and actuation of the CRM. Reinforcement Learning (RL) is an important class of algorithms that epitomize the feedback architectures for cyber resiliency, allowing the CRM to provide dynamic and sequential responses to attacks with limited prior knowledge of the attacker. In this work, we review the literature on RL for cyber resiliency and discuss the cyber-resilient defenses against three major types of vulnerabilities, i.e., posture-related, information-related, and human-related vulnerabilities. We introduce moving target defense, defensive cyber deception, and assistive human security technologies as three application domains of CRMs to elaborate on their designs. The RL technique also has vulnerabilities itself. We explain the major vulnerabilities of RL and present several attack models in which the attacks target the rewards, the measurements, and the actuators. We show that the attacker can trick the RL agent into learning a nefarious policy with minimum attacking effort, which shows serious security concerns for RL-enabled systems. Finally, we discuss the future challenges of RL for cyber security and resiliency and emerging applications of RL-based CRMs.
    Toward Robust Drug-Target Interaction Prediction via Ensemble Modeling and Transfer Learning. (arXiv:2107.00719v1 [q-bio.BM])
    (2 min) Drug-target interaction (DTI) prediction plays a crucial role in drug discovery, and deep learning approaches have achieved state-of-the-art performance in this field. We introduce an ensemble of deep learning models (EnsembleDLM) for robust DTI prediction. EnsembleDLM only uses the sequence information of chemical compounds and proteins, and it aggregates the predictions from multiple deep neural networks. This approach reduces the chance of overfitting, yields an unbiased prediction, and achieves state-of-the-art performance in Davis and KIBA datasets. EnsembleDLM also reaches state-of-the-art performance in cross-domain applications and decent cross-domain performance (Pearson correlation coefficient and concordance index > 0.8) with transfer learning using approximately twice the amount of test data in the new domain.
    Normalizing Flow based Hidden Markov Models for Classification of Speech Phones with Explainability. (arXiv:2107.00730v1 [cs.LG])
    (2 min) In pursuit of explainability, we develop generative models for sequential data. The proposed models provide state-of-the-art classification results and robust performance for speech phone classification. We combine modern neural networks (normalizing flows) and traditional generative models (hidden Markov models - HMMs). Normalizing flow-based mixture models (NMMs) are used to model the conditional probability distribution given the hidden state in the HMMs. Model parameters are learned through judicious combinations of time-tested Bayesian learning methods and contemporary neural network learning methods. We mainly combine expectation-maximization (EM) and mini-batch gradient descent. The proposed generative models can compute likelihood of a data and hence directly suitable for maximum-likelihood (ML) classification approach. Due to structural flexibility of HMMs, we can use different normalizing flow models. This leads to different types of HMMs providing diversity in data modeling capacity. The diversity provides an opportunity for easy decision fusion from different models. For a standard speech phone classification setup involving 39 phones (classes) and the TIMIT dataset, we show that the use of standard features called mel-frequency-cepstral-coeffcients (MFCCs), the proposed generative models, and the decision fusion together can achieve $86.6\%$ accuracy by generative training only. This result is close to state-of-the-art results, for examples, $86.2\%$ accuracy of PyTorch-Kaldi toolkit [1], and $85.1\%$ accuracy using light gated recurrent units [2]. We do not use any discriminative learning approach and related sophisticated features in this article.
    On the Bike Spreading Problem. (arXiv:2107.00761v1 [cs.DS])
    (2 min) A free-floating bike-sharing system (FFBSS) is a dockless rental system where an individual can borrow a bike and returns it everywhere, within the service area. To improve the rental service, available bikes should be distributed over the entire service area: a customer leaving from any position is then more likely to find a near bike and then to use the service. Moreover, spreading bikes among the entire service area increases urban spatial equity since the benefits of FFBSS are not a prerogative of just a few zones. For guaranteeing such distribution, the FFBSS operator can use vans to manually relocate bikes, but it incurs high economic and environmental costs. We propose a novel approach that exploits the existing bike flows generated by customers to distribute bikes. More specifically, by envisioning the problem as an Influence Maximization problem, we show that it is possible to position batches of bikes on a small number of zones, and then the daily use of FFBSS will efficiently spread these bikes on a large area. We show that detecting these areas is NP-complete, but there exists a simple and efficient $1-1/e$ approximation algorithm; our approach is then evaluated on a dataset of rides from the free-floating bike-sharing system of the city of Padova.
    q-Paths: Generalizing the Geometric Annealing Path using Power Means. (arXiv:2107.00745v1 [cs.LG])
    (2 min) Many common machine learning methods involve the geometric annealing path, a sequence of intermediate densities between two distributions of interest constructed using the geometric average. While alternatives such as the moment-averaging path have demonstrated performance gains in some settings, their practical applicability remains limited by exponential family endpoint assumptions and a lack of closed form energy function. In this work, we introduce $q$-paths, a family of paths which is derived from a generalized notion of the mean, includes the geometric and arithmetic mixtures as special cases, and admits a simple closed form involving the deformed logarithm function from nonextensive thermodynamics. Following previous analysis of the geometric path, we interpret our $q$-paths as corresponding to a $q$-exponential family of distributions, and provide a variational representation of intermediate densities as minimizing a mixture of $\alpha$-divergences to the endpoints. We show that small deviations away from the geometric path yield empirical gains for Bayesian inference using Sequential Monte Carlo and generative model evaluation using Annealed Importance Sampling.
    Neural Task Success Classifiers for Robotic Manipulation from Few Real Demonstrations. (arXiv:2107.00722v1 [cs.RO])
    (2 min) Robots learning a new manipulation task from a small amount of demonstrations are increasingly demanded in different workspaces. A classifier model assessing the quality of actions can predict the successful completion of a task, which can be used by intelligent agents for action-selection. This paper presents a novel classifier that learns to classify task completion only from a few demonstrations. We carry out a comprehensive comparison of different neural classifiers, e.g. fully connected-based, fully convolutional-based, sequence2sequence-based, and domain adaptation-based classification. We also present a new dataset including five robot manipulation tasks, which is publicly available. We compared the performances of our novel classifier and the existing models using our dataset and the MIME dataset. The results suggest domain adaptation and timing-based features improve success prediction. Our novel model, i.e. fully convolutional neural network with domain adaptation and timing features, achieves an average classification accuracy of 97.3\% and 95.5\% across tasks in both datasets whereas state-of-the-art classifiers without domain adaptation and timing-features only achieve 82.4\% and 90.3\%, respectively.
    A Map of Bandits for E-commerce. (arXiv:2107.00680v1 [cs.LG])
    (2 min) The rich body of Bandit literature not only offers a diverse toolbox of algorithms, but also makes it hard for a practitioner to find the right solution to solve the problem at hand. Typical textbooks on Bandits focus on designing and analyzing algorithms, and surveys on applications often present a list of individual applications. While these are valuable resources, there exists a gap in mapping applications to appropriate Bandit algorithms. In this paper, we aim to reduce this gap with a structured map of Bandits to help practitioners navigate to find relevant and practical Bandit algorithms. Instead of providing a comprehensive overview, we focus on a small number of key decision points related to reward, action, and features, which often affect how Bandit algorithms are chosen in practice.
    Distilling Reinforcement Learning Tricks for Video Games. (arXiv:2107.00703v1 [cs.LG])
    (2 min) Reinforcement learning (RL) research focuses on general solutions that can be applied across different domains. This results in methods that RL practitioners can use in almost any domain. However, recent studies often lack the engineering steps ("tricks") which may be needed to effectively use RL, such as reward shaping, curriculum learning, and splitting a large task into smaller chunks. Such tricks are common, if not necessary, to achieve state-of-the-art results and win RL competitions. To ease the engineering efforts, we distill descriptions of tricks from state-of-the-art results and study how well these tricks can improve a standard deep Q-learning agent. The long-term goal of this work is to enable combining proven RL methods with domain-specific tricks by providing a unified software framework and accompanying insights in multiple domains.
    SIMILAR: Submodular Information Measures Based Active Learning In Realistic Scenarios. (arXiv:2107.00717v1 [cs.LG])
    (2 min) Active learning has proven to be useful for minimizing labeling costs by selecting the most informative samples. However, existing active learning methods do not work well in realistic scenarios such as imbalance or rare classes, out-of-distribution data in the unlabeled set, and redundancy. In this work, we propose SIMILAR (Submodular Information Measures based actIve LeARning), a unified active learning framework using recently proposed submodular information measures (SIM) as acquisition functions. We argue that SIMILAR not only works in standard active learning, but also easily extends to the realistic settings considered above and acts as a one-stop solution for active learning that is scalable to large real-world datasets. Empirically, we show that SIMILAR significantly outperforms existing active learning algorithms by as much as ~5% - 18% in the case of rare classes and ~5% - 10% in the case of out-of-distribution data on several image classification tasks like CIFAR-10, MNIST, and ImageNet.
    Long-Short Ensemble Network for Bipolar Manic-Euthymic State Recognition Based on Wrist-worn Sensors. (arXiv:2107.00710v1 [cs.LG])
    (2 min) Manic episodes of bipolar disorder can lead to uncritical behaviour and delusional psychosis, often with destructive consequences for those affected and their surroundings. Early detection and intervention of a manic episode are crucial to prevent escalation, hospital admission and premature death. However, people with bipolar disorder may not recognize that they are experiencing a manic episode and symptoms such as euphoria and increased productivity can also deter affected individuals from seeking help. This work proposes to perform user-independent, automatic mood-state detection based on actigraphy and electrodermal activity acquired from a wrist-worn device during mania and after recovery (euthymia). This paper proposes a new deep learning-based ensemble method leveraging long (20h) and short (5 minutes) time-intervals to discriminate between the mood-states. When tested on 47 bipolar patients, the proposed classification scheme achieves an average accuracy of 91.59% in euthymic/manic mood-state recognition.
    Transformer-F: A Transformer network with effective methods for learning universal sentence representation. (arXiv:2107.00653v1 [cs.CL])
    (2 min) The Transformer model is widely used in natural language processing for sentence representation. However, the previous Transformer-based models focus on function words that have limited meaning in most cases and could merely extract high-level semantic abstraction features. In this paper, two approaches are introduced to improve the performance of Transformers. We calculated the attention score by multiplying the part-of-speech weight vector with the correlation coefficient, which helps extract the words with more practical meaning. The weight vector is obtained by the input text sequence based on the importance of the part-of-speech. Furthermore, we fuse the features of each layer to make the sentence representation results more comprehensive and accurate. In experiments, we demonstrate the effectiveness of our model Transformer-F on three standard text classification datasets. Experimental results show that our proposed model significantly boosts the performance of text classification as compared to the baseline model. Specifically, we obtain a 5.28% relative improvement over the vanilla Transformer on the simple tasks.
    Shared Data and Algorithms for Deep Learning in Fundamental Physics. (arXiv:2107.00656v1 [cs.LG])
    (2 min) We introduce a collection of datasets from fundamental physics research -- including particle physics, astroparticle physics, and hadron- and nuclear physics -- for supervised machine learning studies. These datasets, containing hadronic top quarks, cosmic-ray induced air showers, phase transitions in hadronic matter, and generator-level histories, are made public to simplify future work on cross-disciplinary machine learning and transfer learning in fundamental physics. Based on these data, we present a simple yet flexible graph-based neural network architecture that can easily be applied to a wide range of supervised learning tasks in these domains. We show that our approach reaches performance close to state-of-the-art dedicated methods on all datasets. To simplify adaptation for various problems, we provide easy-to-follow instructions on how graph-based representations of data structures, relevant for fundamental physics, can be constructed and provide code implementations for several of them. Implementations are also provided for our proposed method and all reference algorithms.

2021-07-04

  • cs.LG updates on arXiv.org

    Semi-supervised Learning with Missing Values Imputation. (arXiv:2106.01708v2 [cs.LG] UPDATED)
    (2 min) Incomplete instances with various missing attributes in many real-world applications have brought challenges to the classification tasks. Missing values imputation methods are often employed to replace the missing values with substitute values. However, this process often separates the imputation and classification, which may lead to inferior performance since label information are often ignored during imputation. Moreover, traditional methods may rely on improper assumptions to initialize the missing values, whereas the unreliability of such initialization might lead to inferior performance. To address these problems, a novel semi-supervised conditional normalizing flow (SSCFlow) is proposed in this paper. SSCFlow explicitly utilizes the label information to facilitate the imputation and classification simultaneously by estimating the conditional distribution of incomplete instances with a novel semi-supervised normalizing flow. Moreover, SSCFlow treats the initialized missing values as corrupted initial imputation and iteratively reconstructs their latent representations with an overcomplete denoising autoencoder to approximate their true conditional distribution. Experiments on real-world datasets demonstrate the robustness and effectiveness of the proposed algorithm.

2021-07-02

  • cs.CL updates on arXiv.org

    Word-Free Spoken Language Understanding for Mandarin-Chinese. (arXiv:2107.00186v1 [cs.CL])
    (2 min) Spoken dialogue systems such as Siri and Alexa provide great convenience to people's everyday life. However, current spoken language understanding (SLU) pipelines largely depend on automatic speech recognition (ASR) modules, which require a large amount of language-specific training data. In this paper, we propose a Transformer-based SLU system that works directly on phones. This acoustic-based SLU system consists of only two blocks and does not require the presence of ASR module. The first block is a universal phone recognition system, and the second block is a Transformer-based language model for phones. We verify the effectiveness of the system on an intent classification dataset in Mandarin Chinese.
    Identification of COVID-19 related Fake News via Neural Stacking. (arXiv:2101.03988v2 [cs.CL] UPDATED)
    (2 min) Identification of Fake News plays a prominent role in the ongoing pandemic, impacting multiple aspects of day-to-day life. In this work we present a solution to the shared task titled COVID19 Fake News Detection in English, scoring the 50th place amongst 168 submissions. The solution was within 1.5% of the best performing solution. The proposed solution employs a heterogeneous representation ensemble, adapted for the classification task via an additional neural classification head comprised of multiple hidden layers. The paper consists of detailed ablation studies further displaying the proposed method's behavior and possible implications. The solution is freely available. \url{https://gitlab.com/boshko.koloski/covid19-fake-news}
    The USTC-NELSLIP Systems for Simultaneous Speech Translation Task at IWSLT 2021. (arXiv:2107.00279v1 [cs.CL])
    (2 min) This paper describes USTC-NELSLIP's submissions to the IWSLT2021 Simultaneous Speech Translation task. We proposed a novel simultaneous translation model, Cross Attention Augmented Transducer (CAAT), which extends conventional RNN-T to sequence-to-sequence tasks without monotonic constraints, e.g., simultaneous translation. Experiments on speech-to-text (S2T) and text-to-text (T2T) simultaneous translation tasks shows CAAT achieves better quality-latency trade-offs compared to \textit{wait-k}, one of the previous state-of-the-art approaches. Based on CAAT architecture and data augmentation, we build S2T and T2T simultaneous translation systems in this evaluation campaign. Compared to last year's optimal systems, our S2T simultaneous translation system improves by an average of 11.3 BLEU for all latency regimes, and our T2T simultaneous translation system improves by an average of 4.6 BLEU.
    What do End-to-End Speech Models Learn about Speaker, Language and Channel Information? A Layer-wise and Neuron-level Analysis. (arXiv:2107.00439v1 [cs.CL])
    (3 min) End-to-end DNN architectures have pushed the state-of-the-art in speech technologies, as well as in other spheres of AI, leading researchers to train more complex and deeper models. These improvements came at the cost of transparency. DNNs are innately opaque and difficult to interpret. We no longer understand what features are learned, where they are preserved, and how they inter-operate. Such an analysis is important for better model understanding, debugging and to ensure fairness in ethical decision making. In this work, we analyze the representations trained within deep speech models, towards the task of speaker recognition, dialect identification and reconstruction of masked signals. We carry a layer- and neuron-level analysis on the utterance-level representations captured within pretrained speech models for speaker, language and channel properties. We study: is this information captured in the learned representations? where is it preserved? how is it distributed? and can we identify a minimal subset of network that posses this information. Using diagnostic classifiers, we answered these questions. Our results reveal: (i) channel and gender information is omnipresent and is redundantly distributed (ii) complex properties such as dialectal information is encoded only in the task-oriented pretrained network and is localised in the upper layers (iii) a minimal subset of neurons can be extracted to encode the predefined property (iv) salient neurons are sometimes shared between properties and can highlights presence of biases in the network. Our cross-architectural comparison indicates that (v) the pretrained models captures speaker-invariant information and (vi) the pretrained CNNs models are competitive to the Transformers for encoding information for the studied properties. To the best of our knowledge, this is the first study to investigate neuron analysis on the speech models.
    StableEmit: Selection Probability Discount for Reducing Emission Latency of Streaming Monotonic Attention ASR. (arXiv:2107.00635v1 [eess.AS])
    (2 min) While attention-based encoder-decoder (AED) models have been successfully extended to the online variants for streaming automatic speech recognition (ASR), such as monotonic chunkwise attention (MoChA), the models still have a large label emission latency because of the unconstrained end-to-end training objective. Previous works tackled this problem by leveraging alignment information to control the timing to emit tokens during training. In this work, we propose a simple alignment-free regularization method, StableEmit, to encourage MoChA to emit tokens earlier. StableEmit discounts the selection probabilities in hard monotonic attention for token boundary detection by a constant factor and regularizes them to recover the total attention mass during training. As a result, the scale of the selection probabilities is increased, and the values can reach a threshold for token emission earlier, leading to a reduction of emission latency and deletion errors. Moreover, StableEmit can be combined with methods that constraint alignments to further improve the accuracy and latency. Experimental evaluations with LSTM and Conformer encoders demonstrate that StableEmit significantly reduces the recognition errors and the emission latency simultaneously. We also show that the use of alignment information is complementary in both metrics.
    Scientia Potentia Est -- On the Role of Knowledge in Computational Argumentation. (arXiv:2107.00281v1 [cs.CL])
    (0 min) Despite extensive research in the past years, the computational modeling of argumentation remains challenging. The primary reason lies in the inherent complexity of the human processes behind, which commonly requires the integration of extensive knowledge far beyond what is needed for many other natural language understanding tasks. Existing work on the mining, assessment, reasoning, and generation of arguments acknowledges this issue, calling for more research on the integration of common sense and world knowledge into computational models. However, a systematic effort to collect and organize the types of knowledge needed is still missing, hindering targeted progress in the field. In this opinionated survey paper, we address the issue by (1) proposing a pyramid of types of knowledge required in computational argumentation, (2) briefly discussing the state of the art on the role and integration of these types in the field, and (3) outlining the main challenges for future work.
    Conditional Generation of Temporally-ordered Event Sequences. (arXiv:2012.15786v2 [cs.CL] UPDATED)
    (0 min) Models of narrative schema knowledge have proven useful for a range of event-related tasks, but they typically do not capture the temporal relationships between events. We propose a single model that addresses both temporal ordering, sorting given events into the order they occurred, and event infilling, predicting new events which fit into an existing temporally-ordered sequence. We use a BART-based conditional generation model that can capture both temporality and common event co-occurrence, meaning it can be flexibly applied to different tasks in this space. Our model is trained as a denoising autoencoder: we take temporally-ordered event sequences, shuffle them, delete some events, and then attempt to recover the original event sequence. This task teaches the model to make inferences given incomplete knowledge about the events in an underlying scenario. On the temporal ordering task, we show that our model is able to unscramble event sequences from existing datasets without access to explicitly labeled temporal training data, outperforming both a BERT-based pairwise model and a BERT-based pointer network. On event infilling, human evaluation shows that our model is able to generate events that fit better temporally into the input events when compared to GPT-2 story completion models.
    Multilingual Central Repository: a Cross-lingual Framework for Developing Wordnets. (arXiv:2107.00333v1 [cs.CL])
    (0 min) Language resources are necessary for language processing,but building them is costly, involves many researches from different areas and needs constant updating. In this paper, we describe the crosslingual framework used for developing the Multilingual Central Repository (MCR), a multilingual knowledge base that includes wordnets of Basque, Catalan, English, Galician, Portuguese, Spanish and the following ontologies: Base Concepts, Top Ontology, WordNet Domains and Suggested Upper Merged Ontology. We present the story of MCR, its state in 2017 and the developed tools.
    MultiCite: Modeling realistic citations requires moving beyond the single-sentence single-label setting. (arXiv:2107.00414v1 [cs.CL])
    (0 min) Citation context analysis (CCA) is an important task in natural language processing that studies how and why scholars discuss each others' work. Despite being studied for decades, traditional frameworks for CCA have largely relied on overly-simplistic assumptions of how authors cite, which ignore several important phenomena. For instance, scholarly papers often contain rich discussions of cited work that span multiple sentences and express multiple intents concurrently. Yet, CCA is typically approached as a single-sentence, single-label classification task, and thus existing datasets fail to capture this interesting discourse. In our work, we address this research gap by proposing a novel framework for CCA as a document-level context extraction and labeling task. We release MultiCite, a new dataset of 12,653 citation contexts from over 1,200 computational linguistics papers. Not only is it the largest collection of expert-annotated citation contexts to-date, MultiCite contains multi-sentence, multi-label citation contexts within full paper texts. Finally, we demonstrate how our dataset, while still usable for training classic CCA models, also supports the development of new types of models for CCA beyond fixed-width text classification. We release our code and dataset at https://github.com/allenai/multicite.
    Attention Meets Perturbations: Robust and Interpretable Attention with Adversarial Training. (arXiv:2009.12064v2 [cs.CL] UPDATED)
    (0 min) Although attention mechanisms have been applied to a variety of deep learning models and have been shown to improve the prediction performance, it has been reported to be vulnerable to perturbations to the mechanism. To overcome the vulnerability to perturbations in the mechanism, we are inspired by adversarial training (AT), which is a powerful regularization technique for enhancing the robustness of the models. In this paper, we propose a general training technique for natural language processing tasks, including AT for attention (Attention AT) and more interpretable AT for attention (Attention iAT). The proposed techniques improved the prediction performance and the model interpretability by exploiting the mechanisms with AT. In particular, Attention iAT boosts those advantages by introducing adversarial perturbation, which enhances the difference in the attention of the sentences. Evaluation experiments with ten open datasets revealed that AT for attention mechanisms, especially Attention iAT, demonstrated (1) the best performance in nine out of ten tasks and (2) more interpretable attention (i.e., the resulting attention correlated more strongly with gradient-based word importance) for all tasks. Additionally, the proposed techniques are (3) much less dependent on perturbation size in AT. Our code is available at https://github.com/shunk031/attention-meets-perturbation
    Keyboards as a new model of computation. (arXiv:2102.10182v3 [cs.FL] UPDATED)
    (0 min) We introduce a new formalisation of languages, called keyboards. We consider a set of elementary operations (writing/erasing a letter, going to the right or to the left,...) and we define a keyboard as a set of finite sequences of such operations, called keys. The corresponding language is the set of words obtained by applying some sequence of those keys. Unlike classical models of computation, every key can be applied anytime. We define various classes of languages based on different sets of elementary operations, and compare their expressive powers. We also compare them to well-known classes of languages (Chomsky hierarchy). We obtain a strict hierarchy of languages, whose expressivity is orthogonal to the one of the aforementionned classical models. -- Nous introduisons une nouvelle repr\'esentation de langages, les claviers. On se munit d'un ensemble d'op\'erations \'el\'ementaires (ajout, effacement d'une lettre, d\'eplacement \`a droite, \`a gauche, ...), et on d\'efinit un clavier comme un ensemble de suites finies d'op\'erations \'el\'ementaires, appel\'ees touches. Son langage sera l'ensemble des mots obtenus en appliquant une suite quelconque de touches. Contrairement \`a des mod\`eles de calcul classiques, toutes les touches peuvent \^etre appliqu\'ees \`a tout moment. En premier lieu nous d\'efinissons diff\'erentes classes de claviers en faisant varier l'ensemble des op\'erations \'el\'ementaires autoris\'ees, et nous comparons l'expressivit\'e des classes de langages obtenues. Nous comparons \'egalement ces classes \`a la hi\'erarchie de Chomsky. Nous obtenons que toutes les classes \'etudi\'ees sont diff\'erentes, et nous caract\'erisons les classes inclues dans les rationnels et les alg\'ebriques. L'expressivit\'e des claviers semble orthogonale \`a celle des mod\`eles \'evoqu\'es pr\'ec\'edemment.
    Modeling Target-side Inflection in Placeholder Translation. (arXiv:2107.00334v1 [cs.CL])
    (0 min) Placeholder translation systems enable the users to specify how a specific phrase is translated in the output sentence. The system is trained to output special placeholder tokens, and the user-specified term is injected into the output through the context-free replacement of the placeholder token. However, this approach could result in ungrammatical sentences because it is often the case that the specified term needs to be inflected according to the context of the output, which is unknown before the translation. To address this problem, we propose a novel method of placeholder translation that can inflect specified terms according to the grammatical construction of the output sentence. We extend the sequence-to-sequence architecture with a character-level decoder that takes the lemma of a user-specified term and the words generated from the word-level decoder to output the correct inflected form of the lemma. We evaluate our approach with a Japanese-to-English translation task in the scientific writing domain, and show that our model can incorporate specified terms in the correct form more successfully than other comparable models.
    Improving Zero-Shot Translation by Disentangling Positional Information. (arXiv:2012.15127v2 [cs.CL] UPDATED)
    (0 min) Multilingual neural machine translation has shown the capability of directly translating between language pairs unseen in training, i.e. zero-shot translation. Despite being conceptually attractive, it often suffers from low output quality. The difficulty of generalizing to new translation directions suggests the model representations are highly specific to those language pairs seen in training. We demonstrate that a main factor causing the language-specific representations is the positional correspondence to input tokens. We show that this can be easily alleviated by removing residual connections in an encoder layer. With this modification, we gain up to 18.5 BLEU points on zero-shot translation while retaining quality on supervised directions. The improvements are particularly prominent between related languages, where our proposed model outperforms pivot-based translation. Moreover, our approach allows easy integration of new languages, which substantially expands translation coverage. By thorough inspections of the hidden layer outputs, we show that our approach indeed leads to more language-independent representations.
    Generative Adversarial Transformers. (arXiv:2103.01209v3 [cs.CV] UPDATED)
    (0 min) We introduce the GANformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linear efficiency, that can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network. We demonstrate the model's strength and robustness through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency. Further qualitative and quantitative experiments offer us an insight into the model's inner workings, revealing improved interpretability and stronger disentanglement, and illustrating the benefits and efficacy of our approach. An implementation of the model is available at https://github.com/dorarad/gansformer.
    POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis. (arXiv:2005.06605v2 [cs.CL] UPDATED)
    (2 min) Authorship verification (AV) is a fundamental research task in digital text forensics, which addresses the problem of whether two texts were written by the same person. In recent years, a variety of AV methods have been proposed that focus on this problem and can be divided into two categories: The first category refers to such methods that are based on explicitly defined features, where one has full control over which features are considered and what they actually represent. The second category, on the other hand, relates to such AV methods that are based on implicitly defined features, where no control mechanism is involved, so that any character sequence in a text can serve as a potential feature. However, AV methods belonging to the second category bear the risk that the topic of the texts may bias their classification predictions, which in turn may lead to misleading conclusions regarding their results. To tackle this problem, we propose a preprocessing technique called POSNoise, which effectively masks topic-related content in a given text. In this way, AV methods are forced to focus on such text units that are more related to the writing style. Our empirical evaluation based on six AV methods (falling into the second category) and seven corpora shows that POSNoise leads to better results compared to a well-known topic masking approach in 34 out of 42 cases, with an increase in accuracy of up to 10%.
    Spatial Dependency Parsing for Semi-Structured Document Information Extraction. (arXiv:2005.00642v3 [cs.CL] UPDATED)
    (2 min) Information Extraction (IE) for semi-structured document images is often approached as a sequence tagging problem by classifying each recognized input token into one of the IOB (Inside, Outside, and Beginning) categories. However, such problem setup has two inherent limitations that (1) it cannot easily handle complex spatial relationships and (2) it is not suitable for highly structured information, which are nevertheless frequently observed in real-world document images. To tackle these issues, we first formulate the IE task as spatial dependency parsing problem that focuses on the relationship among text tokens in the documents. Under this setup, we then propose SPADE (SPAtial DEpendency parser) that models highly complex spatial relationships and an arbitrary number of information layers in the documents in an end-to-end manner. We evaluate it on various kinds of documents such as receipts, name cards, forms, and invoices, and show that it achieves a similar or better performance compared to strong baselines including BERT-based IOB taggger.
    Combining Feature and Instance Attribution to Detect Artifacts. (arXiv:2107.00323v1 [cs.CL])
    (2 min) Training the large deep neural networks that dominate NLP requires large datasets. Many of these are collected automatically or via crowdsourcing, and may exhibit systematic biases or annotation artifacts. By the latter, we mean correlations between inputs and outputs that are spurious, insofar as they do not represent a generally held causal relationship between features and classes; models that exploit such correlations may appear to perform a given task well, but fail on out of sample data. In this paper we propose methods to facilitate identification of training data artifacts, using new hybrid approaches that combine saliency maps (which highlight important input features) with instance attribution methods (which retrieve training samples influential to a given prediction). We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data, and use it to identify previously unreported artifacts in a few standard NLP datasets. We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice, with promising results. We make code for all methods and experiments in this paper available.
    Knowledge Distillation for Quality Estimation. (arXiv:2107.00411v1 [cs.CL])
    (2 min) Quality Estimation (QE) is the task of automatically predicting Machine Translation quality in the absence of reference translations, making it applicable in real-time settings, such as translating online social media conversations. Recent success in QE stems from the use of multilingual pre-trained representations, where very large models lead to impressive results. However, the inference time, disk and memory requirements of such models do not allow for wide usage in the real world. Models trained on distilled pre-trained representations remain prohibitively large for many usage scenarios. We instead propose to directly transfer knowledge from a strong QE teacher model to a much smaller model with a different, shallower architecture. We show that this approach, in combination with data augmentation, leads to light-weight QE models that perform competitively with distilled pre-trained representations with 8x fewer parameters.
    CLINE: Contrastive Learning with Semantic Negative Examples for Natural Language Understanding. (arXiv:2107.00440v1 [cs.CL])
    (2 min) Despite pre-trained language models have proven useful for learning high-quality semantic representations, these models are still vulnerable to simple perturbations. Recent works aimed to improve the robustness of pre-trained models mainly focus on adversarial training from perturbed examples with similar semantics, neglecting the utilization of different or even opposite semantics. Different from the image processing field, the text is discrete and few word substitutions can cause significant semantic changes. To study the impact of semantics caused by small perturbations, we conduct a series of pilot experiments and surprisingly find that adversarial training is useless or even harmful for the model to detect these semantic changes. To address this problem, we propose Contrastive Learning with semantIc Negative Examples (CLINE), which constructs semantic negative examples unsupervised to improve the robustness under semantically adversarial attacking. By comparing with similar and opposite semantic examples, the model can effectively perceive the semantic changes caused by small perturbations. Empirical results show that our approach yields substantial improvements on a range of sentiment analysis, reasoning, and reading comprehension tasks. And CLINE also ensures the compactness within the same semantics and separability across different semantics in sentence-level.
    Ensemble Learning-Based Approach for Improving Generalization Capability of Machine Reading Comprehension Systems. (arXiv:2107.00368v1 [cs.CL])
    (2 min) Machine Reading Comprehension (MRC) is an active field in natural language processing with many successful developed models in recent years. Despite their high in-distribution accuracy, these models suffer from two issues: high training cost and low out-of-distribution accuracy. Even though some approaches have been presented to tackle the generalization problem, they have high, intolerable training costs. In this paper, we investigate the effect of ensemble learning approach to improve generalization of MRC systems without retraining a big model. After separately training the base models with different structures on different datasets, they are ensembled using weighting and stacking approaches in probabilistic and non-probabilistic settings. Three configurations are investigated including heterogeneous, homogeneous, and hybrid on eight datasets and six state-of-the-art models. We identify the important factors in the effectiveness of ensemble methods. Also, we compare the robustness of ensemble and fine-tuned models against data distribution shifts. The experimental results show the effectiveness and robustness of the ensemble approach in improving the out-of-distribution accuracy of MRC systems, especially when the base models are similar in accuracies.
    ESPnet-ST IWSLT 2021 Offline Speech Translation System. (arXiv:2107.00636v1 [eess.AS])
    (2 min) This paper describes the ESPnet-ST group's IWSLT 2021 submission in the offline speech translation track. This year we made various efforts on training data, architecture, and audio segmentation. On the data side, we investigated sequence-level knowledge distillation (SeqKD) for end-to-end (E2E) speech translation. Specifically, we used multi-referenced SeqKD from multiple teachers trained on different amounts of bitext. On the architecture side, we adopted the Conformer encoder and the Multi-Decoder architecture, which equips dedicated decoders for speech recognition and translation tasks in a unified encoder-decoder model and enables search in both source and target language spaces during inference. We also significantly improved audio segmentation by using the pyannote.audio toolkit and merging multiple short segments for long context modeling. Experimental evaluations showed that each of them contributed to large improvements in translation performance. Our best E2E system combined all the above techniques with model ensembling and achieved 31.4 BLEU on the 2-ref of tst2021 and 21.2 BLEU and 19.3 BLEU on the two single references of tst2021.
    Towards the evaluation of automatic simultaneous speech translation from a communicative perspective. (arXiv:2103.08364v2 [cs.CL] UPDATED)
    (2 min) In recent years, automatic speech-to-speech and speech-to-text translation has gained momentum thanks to advances in artificial intelligence, especially in the domains of speech recognition and machine translation. The quality of such applications is commonly tested with automatic metrics, such as BLEU, primarily with the goal of assessing improvements of releases or in the context of evaluation campaigns. However, little is known about how the output of such systems is perceived by end users or how they compare to human performances in similar communicative tasks. In this paper, we present the results of an experiment aimed at evaluating the quality of a real-time speech translation engine by comparing it to the performance of professional simultaneous interpreters. To do so, we adopt a framework developed for the assessment of human interpreters and use it to perform a manual evaluation on both human and machine performances. In our sample, we found better performance for the human interpreters in terms of intelligibility, while the machine performs slightly better in terms of informativeness. The limitations of the study and the possible enhancements of the chosen framework are discussed. Despite its intrinsic limitations, the use of this framework represents a first step towards a user-centric and communication-oriented methodology for evaluating real-time automatic speech translation.
    Interviewer-Candidate Role Play: Towards Developing Real-World NLP Systems. (arXiv:2107.00315v1 [cs.CL])
    (2 min) Standard NLP tasks do not incorporate several common real-world scenarios such as seeking clarifications about the question, taking advantage of clues, abstaining in order to avoid incorrect answers, etc. This difference in task formulation hinders the adoption of NLP systems in real-world settings. In this work, we take a step towards bridging this gap and present a multi-stage task that simulates a typical human-human questioner-responder interaction such as an interview. Specifically, the system is provided with question simplifications, knowledge statements, examples, etc. at various stages to improve its prediction when it is not sufficiently confident. We instantiate the proposed task in Natural Language Inference setting where a system is evaluated on both in-domain and out-of-domain (OOD) inputs. We conduct comprehensive experiments and find that the multi-stage formulation of our task leads to OOD generalization performance improvement up to 2.29% in Stage 1, 1.91% in Stage 2, 54.88% in Stage 3, and 72.02% in Stage 4 over the standard unguided prediction. However, our task leaves a significant challenge for NLP researchers to further improve OOD performance at each stage.
    GlyphCRM: Bidirectional Encoder Representation for Chinese Character with its Glyph. (arXiv:2107.00395v1 [cs.AI])
    (2 min) Previous works indicate that the glyph of Chinese characters contains rich semantic information and has the potential to enhance the representation of Chinese characters. The typical method to utilize the glyph features is by incorporating them into the character embedding space. Inspired by previous methods, we innovatively propose a Chinese pre-trained representation model named as GlyphCRM, which abandons the ID-based character embedding method yet solely based on sequential character images. We render each character into a binary grayscale image and design two-channel position feature maps for it. Formally, we first design a two-layer residual convolutional neural network, namely HanGlyph to generate the initial glyph representation of Chinese characters, and subsequently adopt multiple bidirectional encoder Transformer blocks as the superstructure to capture the context-sensitive information. Meanwhile, we feed the glyph features extracted from each layer of the HanGlyph module into the underlying Transformer blocks by skip-connection method to fully exploit the glyph features of Chinese characters. As the HanGlyph module can obtain a sufficient glyph representation of any Chinese character, the long-standing out-of-vocabulary problem could be effectively solved. Extensive experimental results indicate that GlyphCRM substantially outperforms the previous BERT-based state-of-the-art model on 9 fine-tuning tasks, and it has strong transferability and generalization on specialized fields and low-resource tasks. We hope this work could spark further research beyond the realms of well-established representation of Chinese texts.
    Capturing Event Argument Interaction via A Bi-Directional Entity-Level Recurrent Decoder. (arXiv:2107.00189v1 [cs.CL])
    (2 min) Capturing interactions among event arguments is an essential step towards robust event argument extraction (EAE). However, existing efforts in this direction suffer from two limitations: 1) The argument role type information of contextual entities is mainly utilized as training signals, ignoring the potential merits of directly adopting it as semantically rich input features; 2) The argument-level sequential semantics, which implies the overall distribution pattern of argument roles over an event mention, is not well characterized. To tackle the above two bottlenecks, we formalize EAE as a Seq2Seq-like learning problem for the first time, where a sentence with a specific event trigger is mapped to a sequence of event argument roles. A neural architecture with a novel Bi-directional Entity-level Recurrent Decoder (BERD) is proposed to generate argument roles by incorporating contextual entities' argument role predictions, like a word-by-word text generation process, thereby distinguishing implicit argument distribution patterns within an event more accurately.
    Tweet Sentiment Quantification: An Experimental Re-Evaluation. (arXiv:2011.08091v2 [cs.CL] UPDATED)
    (2 min) Sentiment quantification is the task of estimating the relative frequency (or "prevalence") of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts; this is especially important when these texts are tweets, since most sentiment classification endeavours carried out on Twitter data actually have quantification (and not the classification of individual tweets) as their ultimate goal. It is well-known that solving quantification via "classify and count" (i.e., by classifying all unlabelled items via a standard classifier and counting the items that have been assigned to a given class) is suboptimal in terms of accuracy, and that more accurate quantification methods exist. In 2016, Gao and Sebastiani carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimental protocol followed in that work is flawed, and that its results are thus unreliable. We now re-evaluate those quantification methods on the very same datasets, this time following a now consolidated and much more robust experimental protocol, that involves 5775 as many experiments as run in the original study. Our experimentation yields results dramatically different from those obtained by Gao and Sebastiani, and thus provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.
    Transformer-based end-to-end speech recognition with residual Gaussian-based self-attention. (arXiv:2103.15722v3 [cs.SD] UPDATED)
    (2 min) Self-attention (SA), which encodes vector sequences according to their pairwise similarity, is widely used in speech recognition due to its strong context modeling ability. However, when applied to long sequence data, its accuracy is reduced. This is caused by the fact that its weighted average operator may lead to the dispersion of the attention distribution, which results in the relationship between adjacent signals ignored. To address this issue, in this paper, we introduce relative-position-awareness self-attention (RPSA). It not only maintains the global-range dependency modeling ability of self-attention, but also improves the localness modeling ability. Because the local window length of the original RPSA is fixed and sensitive to different test data, here we propose Gaussian-based self-attention (GSA) whose window length is learnable and adaptive to the test data automatically. We further generalize GSA to a new residual Gaussian self-attention (resGSA) for the performance improvement. We apply RPSA, GSA, and resGSA to Transformer-based speech recognition respectively. Experimental results on the AISHELL-1 Mandarin speech recognition corpus demonstrate the effectiveness of the proposed methods. For example, the resGSA-Transformer achieves a character error rate (CER) of 5.86% on the test set, which is relative 7.8% lower than that of the SA-Transformer. Although the performance of the proposed resGSA-Transformer is only slightly better than that of the RPSA-Transformer, it does not have to tune the window length manually.
    An Objective Evaluation Framework for Pathological Speech Synthesis. (arXiv:2107.00308v1 [cs.SD])
    (2 min) The development of pathological speech systems is currently hindered by the lack of a standardised objective evaluation framework. In this work, (1) we utilise existing detection and analysis techniques to propose a general framework for the consistent evaluation of synthetic pathological speech. This framework evaluates the voice quality and the intelligibility aspects of speech and is shown to be complementary using our experiments. (2) Using our proposed evaluation framework, we develop and test a dysarthric voice conversion system (VC) using CycleGAN-VC and a PSOLA-based speech rate modification technique. We show that the developed system is able to synthesise dysarthric speech with different levels of speech intelligibility.
    Zero-pronoun Data Augmentation for Japanese-to-English Translation. (arXiv:2107.00318v1 [cs.CL])
    (2 min) For Japanese-to-English translation, zero pronouns in Japanese pose a challenge, since the model needs to infer and produce the corresponding pronoun in the target side of the English sentence. However, although fully resolving zero pronouns often needs discourse context, in some cases, the local context within a sentence gives clues to the inference of the zero pronoun. In this study, we propose a data augmentation method that provides additional training signals for the translation model to learn correlations between local context and zero pronouns. We show that the proposed method significantly improves the accuracy of zero pronoun translation with machine translation experiments in the conversational domain.
    Multimodal Graph-based Transformer Framework for Biomedical Relation Extraction. (arXiv:2107.00596v1 [cs.CL])
    (2 min) The recent advancement of pre-trained Transformer models has propelled the development of effective text mining models across various biomedical tasks. However, these models are primarily learned on the textual data and often lack the domain knowledge of the entities to capture the context beyond the sentence. In this study, we introduced a novel framework that enables the model to learn multi-omnics biological information about entities (proteins) with the help of additional multi-modal cues like molecular structure. Towards this, rather developing modality-specific architectures, we devise a generalized and optimized graph based multi-modal learning mechanism that utilizes the GraphBERT model to encode the textual and molecular structure information and exploit the underlying features of various modalities to enable end-to-end learning. We evaluated our proposed method on ProteinProtein Interaction task from the biomedical corpus, where our proposed generalized approach is observed to be benefited by the additional domain-specific modality.
    Reinforcement Learning for Abstractive Question Summarization with Question-aware Semantic Rewards. (arXiv:2107.00176v1 [cs.CL])
    (2 min) The growth of online consumer health questions has led to the necessity for reliable and accurate question answering systems. A recent study showed that manual summarization of consumer health questions brings significant improvement in retrieving relevant answers. However, the automatic summarization of long questions is a challenging task due to the lack of training data and the complexity of the related subtasks, such as the question focus and type recognition. In this paper, we introduce a reinforcement learning-based framework for abstractive question summarization. We propose two novel rewards obtained from the downstream tasks of (i) question-type identification and (ii) question-focus recognition to regularize the question generation model. These rewards ensure the generation of semantically valid questions and encourage the inclusion of key medical entities/foci in the question summary. We evaluated our proposed method on two benchmark datasets and achieved higher performance over state-of-the-art models. The manual evaluation of the summaries reveals that the generated questions are more diverse and have fewer factual inconsistencies than the baseline summaries
    Elbert: Fast Albert with Confidence-Window Based Early Exit. (arXiv:2107.00175v1 [cs.CL])
    (2 min) Despite the great success in Natural Language Processing (NLP) area, large pre-trained language models like BERT are not well-suited for resource-constrained or real-time applications owing to the large number of parameters and slow inference speed. Recently, compressing and accelerating BERT have become important topics. By incorporating a parameter-sharing strategy, ALBERT greatly reduces the number of parameters while achieving competitive performance. Nevertheless, ALBERT still suffers from a long inference time. In this work, we propose the ELBERT, which significantly improves the average inference speed compared to ALBERT due to the proposed confidence-window based early exit mechanism, without introducing additional parameters or extra training overhead. Experimental results show that ELBERT achieves an adaptive inference speedup varying from 2$\times$ to 10$\times$ with negligible accuracy degradation compared to ALBERT on various datasets. Besides, ELBERT achieves higher accuracy than existing early exit methods used for accelerating BERT under the same computation cost. Furthermore, to understand the principle of the early exit mechanism, we also visualize the decision-making process of it in ELBERT.
    Learning a Reversible Embedding Mapping using Bi-Directional Manifold Alignment. (arXiv:2107.00124v1 [cs.CL])
    (2 min) We propose a Bi-Directional Manifold Alignment (BDMA) that learns a non-linear mapping between two manifolds by explicitly training it to be bijective. We demonstrate BDMA by training a model for a pair of languages rather than individual, directed source and target combinations, reducing the number of models by 50%. We show that models trained with BDMA in the "forward" (source to target) direction can successfully map words in the "reverse" (target to source) direction, yielding equivalent (or better) performance to standard unidirectional translation models where the source and target language is flipped. We also show how BDMA reduces the overall size of the model.
    Regressing Location on Text for Probabilistic Geocoding. (arXiv:2107.00080v1 [cs.CL])
    (2 min) Text data are an important source of detailed information about social and political events. Automated systems parse large volumes of text data to infer or extract structured information that describes actors, actions, dates, times, and locations. One of these sub-tasks is geocoding: predicting the geographic coordinates associated with events or locations described by a given text. We present an end-to-end probabilistic model for geocoding text data. Additionally, we collect a novel data set for evaluating the performance of geocoding systems. We compare the model-based solution, called ELECTRo-map, to the current state-of-the-art open source system for geocoding texts for event data. Finally, we discuss the benefits of end-to-end model-based geocoding, including principled uncertainty estimation and the ability of these models to leverage contextual information.
    Controllable Open-ended Question Generation with A New Question Type Ontology. (arXiv:2107.00152v1 [cs.CL])
    (2 min) We investigate the less-explored task of generating open-ended questions that are typically answered by multiple sentences. We first define a new question type ontology which differentiates the nuanced nature of questions better than widely used question words. A new dataset with 4,959 questions is labeled based on the new ontology. We then propose a novel question type-aware question generation framework, augmented by a semantic graph representation, to jointly predict question focuses and produce the question. Based on this framework, we further use both exemplars and automatically generated templates to improve controllability and diversity. Experiments on two newly collected large-scale datasets show that our model improves question quality over competitive comparisons based on automatic metrics. Human judges also rate our model outputs highly in answerability, coverage of scope, and overall quality. Finally, our model variants with templates can produce questions with enhanced controllability and diversity.
    All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text. (arXiv:2107.00061v1 [cs.CL])
    (2 min) Human evaluations are typically considered the gold standard in natural language generation, but as models' fluency improves, how well can evaluators detect and judge machine-generated text? We run a study assessing non-experts' ability to distinguish between human- and machine-authored text (GPT2 and GPT3) in three domains (stories, news articles, and recipes). We find that, without training, evaluators distinguished between GPT3- and human-authored text at random chance level. We explore three approaches for quickly training evaluators to better identify GPT3-authored text (detailed instructions, annotated examples, and paired examples) and find that while evaluators' accuracy improved up to 55%, it did not significantly improve across the three domains. Given the inconsistent results across text domains and the often contradictory reasons evaluators gave for their judgments, we examine the role untrained human evaluations play in NLG evaluation and provide recommendations to NLG researchers for improving human evaluations of text generated from state-of-the-art models.
    Zipf's laws of meaning in Catalan. (arXiv:2107.00042v1 [cs.CL])
    (2 min) In his pioneering research, G. K. Zipf formulated a couple of statistical laws on the relationship between the frequency of a word with its number of meanings: the law of meaning distribution, relating the frequency of a word and its frequency rank, and the meaning-frequency law, relating the frequency of a word with its number of meanings. Although these laws were formulated more than half a century ago, they have been only investigated in a few languages. Here we present the first study of these laws in Catalan. We verify these laws in Catalan via the relationship among their exponents and that of the rank-frequency law. We present a new protocol for the analysis of these Zipfian laws that can be extended to other languages. We report the first evidence of two marked regimes for these laws in written language and speech, paralleling the two regimes in Zipf's rank-frequency law in large multi-author corpora discovered in early 2000s. Finally, the implications of these two regimes will be discussed.
    Learning to communicate about shared procedural abstractions. (arXiv:2107.00077v1 [cs.CL])
    (2 min) Many real-world tasks require agents to coordinate their behavior to achieve shared goals. Successful collaboration requires not only adopting the same communicative conventions, but also grounding these conventions in the same task-appropriate conceptual abstractions. We investigate how humans use natural language to collaboratively solve physical assembly problems more effectively over time. Human participants were paired up in an online environment to reconstruct scenes containing two block towers. One participant could see the target towers, and sent assembly instructions for the other participant to reconstruct. Participants provided increasingly concise instructions across repeated attempts on each pair of towers, using higher-level referring expressions that captured each scene's hierarchical structure. To explain these findings, we extend recent probabilistic models of ad-hoc convention formation with an explicit perceptual learning mechanism. These results shed light on the inductive biases that enable intelligent agents to coordinate upon shared procedural abstractions.
  • cs.CV updates on arXiv.org

    Orthonormal Product Quantization Network for Scalable Face Image Retrieval. (arXiv:2107.00327v1 [cs.CV])
    (2 min) Recently, deep hashing with Hamming distance metric has drawn increasing attention for face image retrieval tasks. However, its counterpart deep quantization methods, which learn binary code representations with dictionary-related distance metrics, have seldom been explored for the task. This paper makes the first attempt to integrate product quantization into an end-to-end deep learning framework for face image retrieval. Unlike prior deep quantization methods where the codewords for quantization are learned from data, we propose a novel scheme using predefined orthonormal vectors as codewords, which aims to enhance the quantization informativeness and reduce the codewords' redundancy. To make the most of the discriminative information, we design a tailored loss function that maximizes the identity discriminability in each quantization subspace for both the quantized and the original features. Furthermore, an entropy-based regularization term is imposed to reduce the quantization error. We conduct experiments on three commonly-used datasets under the settings of both single-domain and cross-domain retrieval. It shows that the proposed method outperforms all the compared deep hashing/quantization methods under both settings with significant superiority. The proposed codewords scheme consistently improves both regular model performance and model generalization ability, verifying the importance of codewords' distribution for the quantization quality. Besides, our model's better generalization ability than deep hashing models indicates that it is more suitable for scalable face image retrieval tasks.
    High Resolution Face Editing with Masked GAN Latent Code Optimization. (arXiv:2103.11135v2 [cs.CV] UPDATED)
    (2 min) Face editing represents a popular research topic within the computer vision and image processing communities. While significant progress has been made recently in this area, existing solutions: (i) are still largely focused on low-resolution images, (ii) often generate editing results with visual artefacts, or (iii) lack fine-grained control and alter multiple (entangled) attributes at once, when trying to generate the desired facial semantics. In this paper, we aim to address these issues though a novel attribute editing approach called MaskFaceGAN. The proposed approach is based on an optimization procedure that directly optimizes the latent code of a pre-trained (state-of-the-art) Generative Adversarial Network (i.e., StyleGAN2) with respect to several constraints that ensure: (i) preservation of relevant image content, (ii) generation of the targeted facial attributes, and (iii) spatially--selective treatment of local image areas. The constraints are enforced with the help of an (differentiable) attribute classifier and face parser that provide the necessary reference information for the optimization procedure. MaskFaceGAN is evaluated in extensive experiments on the CelebA-HQ, Helen and SiblingsDB-HQf datasets and in comparison with several state-of-the-art techniques from the literature, i.e., StarGAN, AttGAN, STGAN, and two versions of InterFaceGAN. Our experimental results show that the proposed approach is able to edit face images with respect to several facial attributes with unprecedented image quality and at high-resolutions (1024x1024), while exhibiting considerably less problems with attribute entanglement than competing solutions. The source code is made freely available from: https://github.com/MartinPernus/MaskFaceGAN.
    Overhead-MNIST: Machine Learning Baselines for Image Classification. (arXiv:2107.00436v1 [cs.CV])
    (2 min) Twenty-three machine learning algorithms were trained then scored to establish baseline comparison metrics and to select an image classification algorithm worthy of embedding into mission-critical satellite imaging systems. The Overhead-MNIST dataset is a collection of satellite images similar in style to the ubiquitous MNIST hand-written digits found in the machine learning literature. The CatBoost classifier, Light Gradient Boosting Machine, and Extreme Gradient Boosting models produced the highest accuracies, Areas Under the Curve (AUC), and F1 scores in a PyCaret general comparison. Separate evaluations showed that a deep convolutional architecture was the most promising. We present results for the overall best performing algorithm as a baseline for edge deployability and future performance improvement: a convolutional neural network (CNN) scoring 0.965 categorical accuracy on unseen test data.
    Generative Adversarial Transformers. (arXiv:2103.01209v3 [cs.CV] UPDATED)
    (2 min) We introduce the GANformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linear efficiency, that can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network. We demonstrate the model's strength and robustness through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency. Further qualitative and quantitative experiments offer us an insight into the model's inner workings, revealing improved interpretability and stronger disentanglement, and illustrating the benefits and efficacy of our approach. An implementation of the model is available at https://github.com/dorarad/gansformer.
    CLIP-It! Language-Guided Video Summarization. (arXiv:2107.00650v1 [cs.CV])
    (2 min) A generic video summary is an abridged version of a video that conveys the whole story and features the most important scenes. Yet the importance of scenes in a video is often subjective, and users should have the option of customizing the summary by using natural language to specify what is important to them. Further, existing models for fully automatic generic summarization have not exploited available language models, which can serve as an effective prior for saliency. This work introduces CLIP-It, a single framework for addressing both generic and query-focused video summarization, typically approached separately in the literature. We propose a language-guided multimodal transformer that learns to score frames in a video based on their importance relative to one another and their correlation with a user-defined query (for query-focused summarization) or an automatically generated dense video caption (for generic video summarization). Our model can be extended to the unsupervised setting by training without ground-truth supervision. We outperform baselines and prior work by a significant margin on both standard video summarization datasets (TVSum and SumMe) and a query-focused video summarization dataset (QFVS). Particularly, we achieve large improvements in the transfer setting, attesting to our method's strong generalization capabilities.
    AutoFormer: Searching Transformers for Visual Recognition. (arXiv:2107.00651v1 [cs.CV])
    (2 min) Recently, pure transformer-based models have shown great potentials for vision tasks such as image classification and detection. However, the design of transformer networks is challenging. It has been observed that the depth, embedding dimension, and number of heads can largely affect the performance of vision transformers. Previous models configure these dimensions based upon manual crafting. In this work, we propose a new one-shot architecture search framework, namely AutoFormer, dedicated to vision transformer search. AutoFormer entangles the weights of different blocks in the same layers during supernet training. Benefiting from the strategy, the trained supernet allows thousands of subnets to be very well-trained. Specifically, the performance of these subnets with weights inherited from the supernet is comparable to those retrained from scratch. Besides, the searched models, which we refer to AutoFormers, surpass the recent state-of-the-arts such as ViT and DeiT. In particular, AutoFormer-tiny/small/base achieve 74.7%/81.7%/82.4% top-1 accuracy on ImageNet with 5.7M/22.9M/53.7M parameters, respectively. Lastly, we verify the transferability of AutoFormer by providing the performance on downstream benchmarks and distillation experiments. Code and models are available at https://github.com/microsoft/AutoML.
    A Survey on Graph-Based Deep Learning for Computational Histopathology. (arXiv:2107.00272v1 [cs.LG])
    (2 min) With the remarkable success of representation learning for prediction problems, we have witnessed a rapid expansion of the use of machine learning and deep learning for the analysis of digital pathology and biopsy image patches. However, traditional learning over patch-wise features using convolutional neural networks limits the model when attempting to capture global contextual information. The phenotypical and topological distribution of constituent histological entities play a critical role in tissue diagnosis. As such, graph data representations and deep learning have attracted significant attention for encoding tissue representations, and capturing intra- and inter- entity level interactions. In this review, we provide a conceptual grounding of graph-based deep learning and discuss its current success for tumor localization and classification, tumor invasion and staging, image retrieval, and survival prediction. We provide an overview of these methods in a systematic manner organized by the graph representation of the input image including whole slide images and tissue microarrays. We also outline the limitations of existing techniques, and suggest potential future advances in this domain.
    Dep-$L_0$: Improving $L_0$-based Network Sparsification via Dependency Modeling. (arXiv:2107.00070v1 [cs.LG])
    (2 min) Training deep neural networks with an $L_0$ regularization is one of the prominent approaches for network pruning or sparsification. The method prunes the network during training by encouraging weights to become exactly zero. However, recent work of Gale et al. reveals that although this method yields high compression rates on smaller datasets, it performs inconsistently on large-scale learning tasks, such as ResNet50 on ImageNet. We analyze this phenomenon through the lens of variational inference and find that it is likely due to the independent modeling of binary gates, the mean-field approximation, which is known in Bayesian statistics for its poor performance due to the crude approximation. To mitigate this deficiency, we propose a dependency modeling of binary gates, which can be modeled effectively as a multi-layer perceptron (MLP). We term our algorithm Dep-$L_0$ as it prunes networks via a dependency-enabled $L_0$ regularization. Extensive experiments on CIFAR10, CIFAR100 and ImageNet with VGG16, ResNet50, ResNet56 show that our Dep-$L_0$ outperforms the original $L_0$-HC algorithm of Louizos et al. by a significant margin, especially on ImageNet. Compared with the state-of-the-arts network sparsification algorithms, our dependency modeling makes the $L_0$-based sparsification once again very competitive on large-scale learning tasks. Our source code is available at https://github.com/leo-yangli/dep-l0.
    One-class Steel Detector Using Patch GAN Discriminator for Visualising Anomalous Feature Map. (arXiv:2107.00143v1 [cs.CV])
    (2 min) For steel product manufacturing in indoor factories, steel defect detection is important for quality control. For example, a steel sheet is extremely delicate, and must be accurately inspected. However, to maintain the painted steel parts of the infrastructure around a severe outdoor environment, corrosion detection is critical for predictive maintenance. In this paper, we propose a general-purpose application for steel anomaly detection that consists of the following four components. The first, a learner, is a unit image classification network to determine whether the region of interest or background has been recognised, after dividing the original large sized image into 256 square unit images. The second, an extractor, is a discriminator feature encoder based on a pre-trained steel generator with a patch generative adversarial network discriminator(GAN). The third, an anomaly detector, is a one-class support vector machine(SVM) to predict the anomaly score using the discriminator feature. The fourth, an indicator, is an anomalous probability map used to visually explain the anomalous features. Furthermore, we demonstrated our method through the inspection of steel sheet defects with 13,774 unit images using high-speed cameras, and painted steel corrosion with 19,766 unit images based on an eye inspection of the photographs. Finally, we visualise anomalous feature maps of steel using a strip and painted steel inspection dataset
    Semi-Sparsity for Smoothing Filters. (arXiv:2107.00627v1 [cs.CV])
    (0 min) In this paper, we propose an interesting semi-sparsity smoothing algorithm based on a novel sparsity-inducing optimization framework. This method is derived from the multiple observations, that is, semi-sparsity prior knowledge is more universally applicable, especially in areas where sparsity is not fully admitted, such as polynomial-smoothing surfaces. We illustrate that this semi-sparsity can be identified into a generalized $L_0$-norm minimization in higher-order gradient domains, thereby giving rise to a new ``feature-aware'' filtering method with a powerful simultaneous-fitting ability in both sparse features (singularities and sharpening edges) and non-sparse regions (polynomial-smoothing surfaces). Notice that a direct solver is always unavailable due to the non-convexity and combinatorial nature of $L_0$-norm minimization. Instead, we solve the model based on an efficient half-quadratic splitting minimization with fast Fourier transforms (FFTs) for acceleration. We finally demonstrate its versatility and many benefits to a series of signal/image processing and computer vision applications.
    Deep Hierarchical Super-Resolution for Scientific Data Reduction and Visualization. (arXiv:2107.00462v1 [eess.IV])
    (0 min) We present an approach for hierarchical super resolution (SR) using neural networks on an octree data representation. We train a hierarchy of neural networks, each capable of 2x upscaling in each spatial dimension between two levels of detail, and use these networks in tandem to facilitate large scale factor super resolution, scaling with the number of trained networks. We utilize these networks in a hierarchical super resolution algorithm that upscales multiresolution data to a uniform high resolution without introducing seam artifacts on octree node boundaries. We evaluate application of this algorithm in a data reduction framework by dynamically downscaling input data to an octree-based data structure to represent the multiresolution data before compressing for additional storage reduction. We demonstrate that our approach avoids seam artifacts common to multiresolution data formats, and show how neural network super resolution assisted data reduction can preserve global features better than compressors alone at the same compression ratios.
    Graph Self Supervised Learning: the BT, the HSIC, and the VICReg. (arXiv:2105.12247v3 [cs.LG] UPDATED)
    (2 min) Self-supervised learning and pre-training strategies have developed over the last few years especially for Convolutional Neural Networks (CNNs). Recently application of such methods can also be noticed for Graph Neural Networks (GNNs) . In this paper, we have used a graph based self-supervised learning strategy with different loss functions (Barlow Twins[Zbontar et al., 2021], HSIC[Tsai et al., 2021], VICReg[Bardes et al., 2021]) which have shown promising results when applied with CNNs previously. We have also proposed a hybrid loss function combining the advantages of VICReg and HSIC and called it as VICRegHSIC. The performance of these aforementioned methods have been compared when applied to different datasets such as MUTAG, PROTEINS and IMDB-Binary. Moreover, the impact of different batch sizes, projector dimensions and data augmentation strategies have also been explored
    Crowdsourcing Evaluation of Saliency-based XAI Methods. (arXiv:2107.00456v1 [cs.HC])
    (2 min) Understanding the reasons behind the predictions made by deep neural networks is critical for gaining human trust in many important applications, which is reflected in the increasing demand for explainability in AI (XAI) in recent years. Saliency-based feature attribution methods, which highlight important parts of images that contribute to decisions by classifiers, are often used as XAI methods, especially in the field of computer vision. In order to compare various saliency-based XAI methods quantitatively, several approaches for automated evaluation schemes have been proposed; however, there is no guarantee that such automated evaluation metrics correctly evaluate explainability, and a high rating by an automated evaluation scheme does not necessarily mean a high explainability for humans. In this study, instead of the automated evaluation, we propose a new human-based evaluation scheme using crowdsourcing to evaluate XAI methods. Our method is inspired by a human computation game, "Peek-a-boom", and can efficiently compare different XAI methods by exploiting the power of crowds. We evaluate the saliency maps of various XAI methods on two datasets with automated and crowd-based evaluation schemes. Our experiments show that the result of our crowd-based evaluation scheme is different from those of automated evaluation schemes. In addition, we regard the crowd-based evaluation results as ground truths and provide a quantitative performance measure to compare different automated evaluation schemes. We also discuss the impact of crowd workers on the results and show that the varying ability of crowd workers does not significantly impact the results.
    Deep Feature Space: A Geometrical Perspective. (arXiv:2007.00062v2 [cs.CV] UPDATED)
    (2 min) One of the most prominent attributes of Neural Networks (NNs) constitutes their capability of learning to extract robust and descriptive features from high dimensional data, like images. Hence, such an ability renders their exploitation as feature extractors particularly frequent in an abundant of modern reasoning systems. Their application scope mainly includes complex cascade tasks, like multi-modal recognition and deep Reinforcement Learning (RL). However, NNs induce implicit biases that are difficult to avoid or to deal with and are not met in traditional image descriptors. Moreover, the lack of knowledge for describing the intra-layer properties -- and thus their general behavior -- restricts the further applicability of the extracted features. With the paper at hand, a novel way of visualizing and understanding the vector space before the NNs' output layer is presented, aiming to enlighten the deep feature vectors' properties under classification tasks. Main attention is paid to the nature of overfitting in the feature space and its adverse effect on further exploitation. We present the findings that can be derived from our model's formulation, and we evaluate them on realistic recognition scenarios, proving its prominence by improving the obtained results.
    AdaXpert: Adapting Neural Architecture for Growing Data. (arXiv:2107.00254v1 [cs.LG])
    (2 min) In real-world applications, data often come in a growing manner, where the data volume and the number of classes may increase dynamically. This will bring a critical challenge for learning: given the increasing data volume or the number of classes, one has to instantaneously adjust the neural model capacity to obtain promising performance. Existing methods either ignore the growing nature of data or seek to independently search an optimal architecture for a given dataset, and thus are incapable of promptly adjusting the architectures for the changed data. To address this, we present a neural architecture adaptation method, namely Adaptation eXpert (AdaXpert), to efficiently adjust previous architectures on the growing data. Specifically, we introduce an architecture adjuster to generate a suitable architecture for each data snapshot, based on the previous architecture and the different extent between current and previous data distributions. Furthermore, we propose an adaptation condition to determine the necessity of adjustment, thereby avoiding unnecessary and time-consuming adjustments. Extensive experiments on two growth scenarios (increasing data volume and number of classes) demonstrate the effectiveness of the proposed method.
    Supervised Segmentation with Domain Adaptation for Small Sampled Orbital CT Images. (arXiv:2107.00418v1 [eess.IV])
    (0 min) Deep neural networks (DNNs) have been widely used for medical image analysis. However, the lack of access a to large-scale annotated dataset poses a great challenge, especially in the case of rare diseases, or new domains for the research society. Transfer of pre-trained features, from the relatively large dataset is a considerable solution. In this paper, we have explored supervised segmentation using domain adaptation for optic nerve and orbital tumor, when only small sampled CT images are given. Even the lung image database consortium image collection (LIDC-IDRI) is a cross-domain to orbital CT, but the proposed domain adaptation method improved the performance of attention U-Net for the segmentation in public optic nerve dataset and our clinical orbital tumor dataset. The code and dataset are available at https://github.com/cmcbigdata.
    Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition. (arXiv:2107.00606v1 [cs.CV])
    (2 min) Deep neural networks based purely on attention have been successful across several domains, relying on minimal architectural priors from the designer. In Human Action Recognition (HAR), attention mechanisms have been primarily adopted on top of standard convolutional or recurrent layers, improving the overall generalization capability. In this work, we introduce Action Transformer (AcT), a simple, fully self-attentional architecture that consistently outperforms more elaborated networks that mix convolutional, recurrent, and attentive layers. In order to limit computational and energy requests, building on previous human action recognition research, the proposed approach exploits 2D pose representations over small temporal windows, providing a low latency solution for accurate and effective real-time performance. Moreover, we open-source MPOSE2021, a new large-scale dataset, as an attempt to build a formal training and evaluation benchmark for real-time short-time human action recognition. Extensive experimentation on MPOSE2021 with our proposed methodology and several previous architectural solutions proves the effectiveness of the AcT model and poses the base for future work on HAR.
    Color Variants Identification in Fashion e-commerce via Contrastive Self-Supervised Representation Learning. (arXiv:2104.08581v2 [cs.CV] UPDATED)
    (2 min) In this paper, we utilize deep visual Representation Learning to address an important problem in fashion e-commerce: color variants identification, i.e., identifying fashion products that match exactly in their design (or style), but only to differ in their color. At first we attempt to tackle the problem by obtaining manual annotations (depicting whether two products are color variants), and train a supervised triplet loss based neural network model to learn representations of fashion products. However, for large scale real-world industrial datasets such as addressed in our paper, it is infeasible to obtain annotations for the entire dataset, while capturing all the difficult corner cases. Interestingly, we observed that color variants are essentially manifestations of color jitter based augmentations. Thus, we instead explore Self-Supervised Learning (SSL) to solve this problem. We observed that existing state-of-the-art SSL methods perform poor, for our problem. To address this, we propose a novel SSL based color variants model that simultaneously focuses on different parts of an apparel. Quantitative and qualitative evaluation shows that our method outperforms existing SSL methods, and at times, the supervised model.
    From Recognition to Prediction: Analysis of Human Action and Trajectory Prediction in Video. (arXiv:2011.10670v2 [cs.CV] UPDATED)
    (2 min) With the advancement in computer vision deep learning, systems now are able to analyze an unprecedented amount of rich visual information from videos to enable applications such as autonomous driving, socially-aware robot assistant and public safety monitoring. Deciphering human behaviors to predict their future paths/trajectories and what they would do from videos is important in these applications. However, human trajectory prediction still remains a challenging task, as scene semantics and human intent are difficult to model. Many systems do not provide high-level semantic attributes to reason about pedestrian future. This design hinders prediction performance in video data from diverse domains and unseen scenarios. To enable optimal future human behavioral forecasting, it is crucial for the system to be able to detect and analyze human activities as well as scene semantics, passing informative features to the subsequent prediction module for context understanding.
    Interval-valued aggregation functions based on moderate deviations applied to Motor-Imagery-Based Brain Computer Interface. (arXiv:2011.09831v2 [cs.HC] UPDATED)
    (2 min) In this work we study the use of moderate deviation functions to measure similarity and dissimilarity among a set of given interval-valued data. To do so, we introduce the notion of interval-valued moderate deviation function and we study in particular those interval-valued moderate deviation functions which preserve the width of the input intervals. Then, we study how to apply these functions to construct interval-valued aggregation functions. We have applied them in the decision making phase of two Motor-Imagery Brain Computer Interface frameworks, obtaining better results than those obtained using other numerical and intervalar aggregations.
    Explicit Clothing Modeling for an Animatable Full-Body Avatar. (arXiv:2106.14879v2 [cs.CV] UPDATED)
    (2 min) Recent work has shown great progress in building photorealistic animatable full-body codec avatars, but these avatars still face difficulties in generating high-fidelity animation of clothing. To address the difficulties, we propose a method to build an animatable clothed body avatar with an explicit representation of the clothing on the upper body from multi-view captured videos. We use a two-layer mesh representation to separately register the 3D scans with templates. In order to improve the photometric correspondence across different frames, texture alignment is then performed through inverse rendering of the clothing geometry and texture predicted by a variational autoencoder. We then train a new two-layer codec avatar with separate modeling of the upper clothing and the inner body layer. To learn the interaction between the body dynamics and clothing states, we use a temporal convolution network to predict the clothing latent code based on a sequence of input skeletal poses. We show photorealistic animation output for three different actors, and demonstrate the advantage of our clothed-body avatars over single-layer avatars in the previous work. We also show the benefit of an explicit clothing model which allows the clothing texture to be edited in the animation output.
    Learning to See before Learning to Act: Visual Pre-training for Manipulation. (arXiv:2107.00646v1 [cs.RO])
    (2 min) Does having visual priors (e.g. the ability to detect objects) facilitate learning to perform vision-based manipulation (e.g. picking up objects)? We study this problem under the framework of transfer learning, where the model is first trained on a passive vision task, and adapted to perform an active manipulation task. We find that pre-training on vision tasks significantly improves generalization and sample efficiency for learning to manipulate objects. However, realizing these gains requires careful selection of which parts of the model to transfer. Our key insight is that outputs of standard vision models highly correlate with affordance maps commonly used in manipulation. Therefore, we explore directly transferring model parameters from vision networks to affordance prediction networks, and show that this can result in successful zero-shot adaptation, where a robot can pick up certain objects with zero robotic experience. With just a small amount of robotic experience, we can further fine-tune the affordance model to achieve better results. With just 10 minutes of suction experience or 1 hour of grasping experience, our method achieves ~80% success rate at picking up novel objects.
    Deep Orthogonal Fusion: Multimodal Prognostic Biomarker Discovery Integrating Radiology, Pathology, Genomic, and Clinical Data. (arXiv:2107.00648v1 [cs.CV])
    (2 min) Clinical decision-making in oncology involves multimodal data such as radiology scans, molecular profiling, histopathology slides, and clinical factors. Despite the importance of these modalities individually, no deep learning framework to date has combined them all to predict patient prognosis. Here, we predict the overall survival (OS) of glioma patients from diverse multimodal data with a Deep Orthogonal Fusion (DOF) model. The model learns to combine information from multiparametric MRI exams, biopsy-based modalities (such as H&E slide images and/or DNA sequencing), and clinical variables into a comprehensive multimodal risk score. Prognostic embeddings from each modality are learned and combined via attention-gated tensor fusion. To maximize the information gleaned from each modality, we introduce a multimodal orthogonalization (MMO) loss term that increases model performance by incentivizing constituent embeddings to be more complementary. DOF predicts OS in glioma patients with a median C-index of 0.788 +/- 0.067, significantly outperforming (p=0.023) the best performing unimodal model with a median C-index of 0.718 +/- 0.064. The prognostic model significantly stratifies glioma patients by OS within clinical subsets, adding further granularity to prognostic clinical grading and molecular subtyping.
    Fusing Higher-Order Features in Graph Neural Networks for Skeleton-Based Action Recognition. (arXiv:2105.01563v3 [cs.CV] UPDATED)
    (2 min) Skeleton sequences are lightweight and compact, thus are ideal candidates for action recognition on edge devices. Recent skeleton-based action recognition methods extract features from 3D joint coordinates as spatial-temporal cues, using these representations in a graph neural network for feature fusion to boost recognition performance. The use of first- and second-order features, \ie{} joint and bone representations, has led to high accuracy. Nonetheless, many models are still confused by actions that have similar motion trajectories. To address these issues, we propose fusing third-order features in the form of angular encoding into modern architectures to robustly capture the relationships between joints and body parts. This simple fusion with popular spatial-temporal graph neural networks achieves new state-of-the-art accuracy in two large benchmarks, including NTU60 and NTU120, while employing fewer parameters and reduced run time. Our source code is publicly available at: https://github.com/ZhenyueQin/Angular-Skeleton-Encoding.
    A Multi-task Two-stream Spatiotemporal Convolutional Neural Network for Convective Storm Nowcasting. (arXiv:2010.14100v2 [cs.CV] UPDATED)
    (0 min) The goal of convective storm nowcasting is local prediction of severe and imminent convective storms. Here, we consider the convective storm nowcasting problem from the perspective of machine learning. First, we use a pixel-wise sampling method to construct spatiotemporal features for nowcasting, and flexibly adjust the proportions of positive and negative samples in the training set to mitigate class-imbalance issues. Second, we employ a concise two-stream convolutional neural network to extract spatial and temporal cues for nowcasting. This simplifies the network structure, reduces the training time requirement, and improves classification accuracy. The two-stream network used both radar and satellite data. In the resulting two-stream, fused convolutional neural network, some of the parameters are entered into a single-stream convolutional neural network, but it can learn the features of many data. Further, considering the relevance of classification and regression tasks, we develop a multi-task learning strategy that predicts the labels used in such tasks. We integrate two-stream multi-task learning into a single convolutional neural network. Given the compact architecture, this network is more efficient and easier to optimize than existing recurrent neural networks.
    PointINS: Point-based Instance Segmentation. (arXiv:2003.06148v2 [cs.CV] UPDATED)
    (2 min) In this paper, we explore the mask representation in instance segmentation with Point-of-Interest (PoI) features. Differentiating multiple potential instances within a single PoI feature is challenging because learning a high-dimensional mask feature for each instance using vanilla convolution demands a heavy computing burden. To address this challenge, we propose an instance-aware convolution. It decomposes this mask representation learning task into two tractable modules as instance-aware weights and instance-agnostic features. The former is to parametrize convolution for producing mask features corresponding to different instances, improving mask learning efficiency by avoiding employing several independent convolutions. Meanwhile, the latter serves as mask templates in a single point. Together, instance-aware mask features are computed by convolving the template with dynamic weights, used for the mask prediction. Along with instance-aware convolution, we propose PointINS, a simple and practical instance segmentation approach, building upon dense one-stage detectors. Through extensive experiments, we evaluated the effectiveness of our framework built upon RetinaNet and FCOS. PointINS in ResNet101 backbone achieves a 38.3 mask mean average precision (mAP) on COCO dataset, outperforming existing point-based methods by a large margin. It gives a comparable performance to the region-based Mask R-CNN with faster inference.
    Goal-Auxiliary Actor-Critic for 6D Robotic Grasping with Point Clouds. (arXiv:2010.00824v4 [cs.RO] UPDATED)
    (2 min) 6D robotic grasping beyond top-down bin-picking scenarios is a challenging task. Previous solutions based on 6D grasp synthesis with robot motion planning usually operate in an open-loop setting, which are sensitive to grasp synthesis errors. In this work, we propose a new method for learning closed-loop control policies for 6D grasping. Our policy takes a segmented point cloud of an object from an egocentric camera as input, and outputs continuous 6D control actions of the robot gripper for grasping the object. We combine imitation learning and reinforcement learning and introduce a goal-auxiliary actor-critic algorithm for policy learning. We demonstrate that our learned policy can be integrated into a tabletop 6D grasping system and a human-robot handover system to improve the grasping performance of unseen objects. Our videos and code can be found at https://sites.google.com/view/gaddpg .
    A Unified Framework of Bundle Adjustment and Feature Matching for High-Resolution Satellite Images. (arXiv:2107.00598v1 [cs.CV])
    (2 min) Bundle adjustment (BA) is a technique for refining sensor orientations of satellite images, while adjustment accuracy is correlated with feature matching results. Feature match-ing often contains high uncertainties in weak/repeat textures, while BA results are helpful in reducing these uncertainties. To compute more accurate orientations, this article incorpo-rates BA and feature matching in a unified framework and formulates the union as the optimization of a global energy function so that the solutions of the BA and feature matching are constrained with each other. To avoid a degeneracy in the optimization, we propose a comprised solution by breaking the optimization of the global energy function into two-step suboptimizations and compute the local minimums of each suboptimization in an incremental manner. Experiments on multi-view high-resolution satellite images show that our proposed method outperforms state-of-the-art orientation techniques with or without accurate least-squares matching.
    3D Iterative Spatiotemporal Filtering for Classification of Multitemporal Satellite Data Sets. (arXiv:2107.00590v1 [cs.CV])
    (2 min) The current practice in land cover/land use change analysis relies heavily on the individually classified maps of the multitemporal data set. Due to varying acquisition conditions (e.g., illumination, sensors, seasonal differences), the classification maps yielded are often inconsistent through time for robust statistical analysis. 3D geometric features have been shown to be stable for assessing differences across the temporal data set. Therefore, in this article we investigate he use of a multitemporal orthophoto and digital surface model derived from satellite data for spatiotemporal classification. Our approach consists of two major steps: generating per-class probability distribution maps using the random-forest classifier with limited training samples, and making spatiotemporal inferences using an iterative 3D spatiotemporal filter operating on per-class probability maps. Our experimental results demonstrate that the proposed methods can consistently improve the individual classification results by 2%-6% and thus can be an important postclassification refinement approach.
    Fair Visual Recognition in Limited Data Regime using Self-Supervision and Self-Distillation. (arXiv:2107.00067v1 [cs.CV])
    (2 min) Deep learning models generally learn the biases present in the training data. Researchers have proposed several approaches to mitigate such biases and make the model fair. Bias mitigation techniques assume that a sufficiently large number of training examples are present. However, we observe that if the training data is limited, then the effectiveness of bias mitigation methods is severely degraded. In this paper, we propose a novel approach to address this problem. Specifically, we adapt self-supervision and self-distillation to reduce the impact of biases on the model in this setting. Self-supervision and self-distillation are not used for bias mitigation. However, through this work, we demonstrate for the first time that these techniques are very effective in bias mitigation. We empirically show that our approach can significantly reduce the biases learned by the model. Further, we experimentally demonstrate that our approach is complementary to other bias mitigation strategies. Our approach significantly improves their performance and further reduces the model biases in the limited data regime. Specifically, on the L-CIFAR-10S skewed dataset, our approach significantly reduces the bias score of the baseline model by 78.22% and outperforms it in terms of accuracy by a significant absolute margin of 8.89%. It also significantly reduces the bias score for the state-of-the-art domain independent bias mitigation method by 59.26% and improves its performance by a significant absolute margin of 7.08%.
    Unsupervised Neural Domain Adaptation for Document Image Binarization. (arXiv:2012.01204v2 [cs.CV] UPDATED)
    (2 min) Binarization is a well-known image processing task, whose objective is to separate the foreground of an image from the background. One of the many tasks for which it is useful is that of preprocessing document images in order to identify relevant information, such as text or symbols. The wide variety of document types, alphabets, and formats makes binarization challenging. There are multiple proposals with which to solve this problem, from classical manually-adjusted methods, to more recent approaches based on machine learning. The latter techniques require a large amount of training data in order to obtain good results; however, labeling a portion of each existing collection of documents is not feasible in practice. This is a common problem in supervised learning, which can be addressed by using the so-called Domain Adaptation (DA) techniques. These techniques take advantage of the knowledge learned in one domain, for which labeled data are available, to apply it to other domains for which there are no labeled data. This paper proposes a method that combines neural networks and DA in order to carry out unsupervised document binarization. However, when both the source and target domains are very similar, this adaptation could be detrimental. Our methodology, therefore, first measures the similarity between domains in an innovative manner in order to determine whether or not it is appropriate to apply the adaptation process. The results reported in the experimentation, when evaluating up to 20 possible combinations among five different domains, show that our proposal successfully deals with the binarization of new document domains without the need for labeled data.
    Segmenting two-dimensional structures with strided tensor networks. (arXiv:2102.06900v2 [cs.CV] UPDATED)
    (0 min) Tensor networks provide an efficient approximation of operations involving high dimensional tensors and have been extensively used in modelling quantum many-body systems. More recently, supervised learning has been attempted with tensor networks, primarily focused on tasks such as image classification. In this work, we propose a novel formulation of tensor networks for supervised image segmentation which allows them to operate on high resolution medical images. We use the matrix product state (MPS) tensor network on non-overlapping patches of a given input image to predict the segmentation mask by learning a pixel-wise linear classification rule in a high dimensional space. The proposed model is end-to-end trainable using backpropagation. It is implemented as a Strided Tensor Network to reduce the parameter complexity. The performance of the proposed method is evaluated on two public medical imaging datasets and compared to relevant baselines. The evaluation shows that the strided tensor network yields competitive performance compared to CNN-based models while using fewer resources. Additionally, based on the experiments we discuss the feasibility of using fully linear models for segmentation tasks.
    Learning to Disambiguate Strongly Interacting Hands via Probabilistic Per-pixel Part Segmentation. (arXiv:2107.00434v1 [cs.CV])
    (0 min) In natural conversation and interaction, our hands often overlap or are in contact with each other. Due to the homogeneous appearance of hands, this makes estimating the 3D pose of interacting hands from images difficult. In this paper we demonstrate that self-similarity, and the resulting ambiguities in assigning pixel observations to the respective hands and their parts, is a major cause of the final 3D pose error. Motivated by this insight, we propose DIGIT, a novel method for estimating the 3D poses of two interacting hands from a single monocular image. The method consists of two interwoven branches that process the input imagery into a per-pixel semantic part segmentation mask and a visual feature volume. In contrast to prior work, we do not decouple the segmentation from the pose estimation stage, but rather leverage the per-pixel probabilities directly in the downstream pose estimation task. To do so, the part probabilities are merged with the visual features and processed via fully-convolutional layers. We experimentally show that the proposed approach achieves new state-of-the-art performance on the InterHand2.6M dataset for both single and interacting hands across all metrics. We provide detailed ablation studies to demonstrate the efficacy of our method and to provide insights into how the modelling of pixel ownership affects single and interacting hand pose estimation. Our code will be released for research purposes.
    Individual Tree Detection and Crown Delineation with 3D Information from Multi-view Satellite Images. (arXiv:2107.00592v1 [cs.CV])
    (2 min) Individual tree detection and crown delineation (ITDD) are critical in forest inventory management and remote sensing based forest surveys are largely carried out through satellite images. However, most of these surveys only use 2D spectral information which normally has not enough clues for ITDD. To fully explore the satellite images, we propose a ITDD method using the orthophoto and digital surface model (DSM) derived from the multi-view satellite data. Our algorithm utilizes the top-hat morphological operation to efficiently extract the local maxima from DSM as treetops, and then feed them to a modi-fied superpixel segmentation that combines both 2D and 3D information for tree crown delineation. In subsequent steps, our method incorporates the biological characteristics of the crowns through plant allometric equation to falsify potential outliers. Experiments against manually marked tree plots on three representative regions have demonstrated promising results - the best overall detection accuracy can be 89%.
    Lattice Fusion Networks for Image Denoising. (arXiv:2011.14196v3 [eess.IV] UPDATED)
    (2 min) A novel method for feature fusion in convolutional neural networks is proposed in this paper. Different feature fusion techniques are suggested to facilitate the flow of information and improve the training of deep neural networks. Some of these techniques as well as the proposed network can be considered a type of Directed Acyclic Graph (DAG) Network, where a layer can receive inputs from other layers and have outputs to other layers. In the proposed general framework of Lattice Fusion Network (LFNet), feature maps of each convolutional layer are passed to other layers based on a lattice graph structure, where nodes are convolutional layers. To evaluate the performance of the proposed architecture, different designs based on the general framework of LFNet are implemented for the task of image denoising. This task is used as an example where training deep convolutional networks is needed. Results are compared with state of the art methods. The proposed network is able to achieve better results with far fewer learnable parameters, which shows the effectiveness of LFNets for training of deep neural networks.
    Global Filter Networks for Image Classification. (arXiv:2107.00645v1 [cs.CV])
    (2 min) Recent advances in self-attention and pure multi-layer perceptrons (MLP) models for vision have shown great potential in achieving promising performance with fewer inductive biases. These models are generally based on learning interaction among spatial locations from raw data. The complexity of self-attention and MLP grows quadratically as the image size increases, which makes these models hard to scale up when high-resolution features are required. In this paper, we present the Global Filter Network (GFNet), a conceptually simple yet computationally efficient architecture, that learns long-term spatial dependencies in the frequency domain with log-linear complexity. Our architecture replaces the self-attention layer in vision transformers with three key operations: a 2D discrete Fourier transform, an element-wise multiplication between frequency-domain features and learnable global filters, and a 2D inverse Fourier transform. We exhibit favorable accuracy/complexity trade-offs of our models on both ImageNet and downstream tasks. Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness. Code is available at https://github.com/raoyongming/GFNet
    Fast Gravitational Approach for Rigid Point Set Registration with Ordinary Differential Equations. (arXiv:2009.14005v2 [cs.CV] UPDATED)
    (2 min) This article introduces a new physics-based method for rigid point set alignment called Fast Gravitational Approach (FGA). In FGA, the source and target point sets are interpreted as rigid particle swarms with masses interacting in a globally multiply-linked manner while moving in a simulated gravitational force field. The optimal alignment is obtained by explicit modeling of forces acting on the particles as well as their velocities and displacements with second-order ordinary differential equations of motion. Additional alignment cues (point-based or geometric features, and other boundary conditions) can be integrated into FGA through particle masses. We propose a smooth-particle mass function for point mass initialization, which improves robustness to noise and structural discontinuities. To avoid prohibitive quadratic complexity of all-to-all point interactions, we adapt a Barnes-Hut tree for accelerated force computation and achieve quasilinear computational complexity. We show that the new method class has characteristics not found in previous alignment methods such as efficient handling of partial overlaps, inhomogeneous point sampling densities, and coping with large point clouds with reduced runtime compared to the state of the art. Experiments show that our method performs on par with or outperforms all compared competing non-deep-learning-based and general-purpose techniques (which do not assume the availability of training data and a scene prior) in resolving transformations for LiDAR data and gains state-of-the-art accuracy and speed when coping with different types of data disturbances.
    Generalization and Robustness Implications in Object-Centric Learning. (arXiv:2107.00637v1 [cs.LG])
    (2 min) The idea behind object-centric representation learning is that natural scenes can better be modeled as compositions of objects and their relations as opposed to distributed representations. This inductive bias can be injected into neural networks to potentially improve systematic generalization and learning efficiency of downstream tasks in scenes with multiple objects. In this paper, we train state-of-the-art unsupervised models on five common multi-object datasets and evaluate segmentation accuracy and downstream object property prediction. In addition, we study systematic generalization and robustness by investigating the settings where either single objects are out-of-distribution -- e.g., having unseen colors, textures, and shapes -- or global properties of the scene are altered -- e.g., by occlusions, cropping, or increasing the number of objects. From our experimental study, we find object-centric representations to be generally useful for downstream tasks and robust to shifts in the data distribution, especially if shifts affect single objects.
    On the Practicality of Deterministic Epistemic Uncertainty. (arXiv:2107.00649v1 [cs.CV])
    (2 min) A set of novel approaches for estimating epistemic uncertainty in deep neural networks with a single forward pass has recently emerged as a valid alternative to Bayesian Neural Networks. On the premise of informative representations, these deterministic uncertainty methods (DUMs) achieve strong performance on detecting out-of-distribution (OOD) data while adding negligible computational costs at inference time. However, it remains unclear whether DUMs are well calibrated and can seamlessly scale to real-world applications - both prerequisites for their practical deployment. To this end, we first provide a taxonomy of DUMs, evaluate their calibration under continuous distributional shifts and their performance on OOD detection for image classification tasks. Then, we extend the most promising approaches to semantic segmentation. We find that, while DUMs scale to realistic vision tasks and perform well on OOD detection, the practicality of current methods is undermined by poor calibration under realistic distributional shifts.
    Interviewer-Candidate Role Play: Towards Developing Real-World NLP Systems. (arXiv:2107.00315v1 [cs.CL])
    (2 min) Standard NLP tasks do not incorporate several common real-world scenarios such as seeking clarifications about the question, taking advantage of clues, abstaining in order to avoid incorrect answers, etc. This difference in task formulation hinders the adoption of NLP systems in real-world settings. In this work, we take a step towards bridging this gap and present a multi-stage task that simulates a typical human-human questioner-responder interaction such as an interview. Specifically, the system is provided with question simplifications, knowledge statements, examples, etc. at various stages to improve its prediction when it is not sufficiently confident. We instantiate the proposed task in Natural Language Inference setting where a system is evaluated on both in-domain and out-of-domain (OOD) inputs. We conduct comprehensive experiments and find that the multi-stage formulation of our task leads to OOD generalization performance improvement up to 2.29% in Stage 1, 1.91% in Stage 2, 54.88% in Stage 3, and 72.02% in Stage 4 over the standard unguided prediction. However, our task leaves a significant challenge for NLP researchers to further improve OOD performance at each stage.
    CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. (arXiv:2107.00652v1 [cs.CV])
    (0 min) We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism for computing self-attention in the horizontal and vertical stripes in parallel that form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width. We provide a detailed mathematical analysis of the effect of the stripe width and vary the stripe width for different layers of the Transformer network which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles the local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks. Incorporated with these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or label, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 51.7 mIOU on the ADE20K semantic segmentation task, surpassing previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under the similar FLOPs setting. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and state-of-the-art segmentation performance on ADE20K with 55.2 mIoU. The code and models will be available at https://github.com/microsoft/CSWin-Transformer.
    Improving Human Motion Prediction Through Continual Learning. (arXiv:2107.00544v1 [cs.RO])
    (0 min) Human motion prediction is an essential component for enabling closer human-robot collaboration. The task of accurately predicting human motion is non-trivial. It is compounded by the variability of human motion, both at a skeletal level due to the varying size of humans and at a motion level due to individual movement's idiosyncrasies. These variables make it challenging for learning algorithms to obtain a general representation that is robust to the diverse spatio-temporal patterns of human motion. In this work, we propose a modular sequence learning approach that allows end-to-end training while also having the flexibility of being fine-tuned. Our approach relies on the diversity of training samples to first learn a robust representation, which can then be fine-tuned in a continual learning setup to predict the motion of new subjects. We evaluated the proposed approach by comparing its performance against state-of-the-art baselines. The results suggest that our approach outperforms other methods over all the evaluated temporal horizons, using a small amount of data for fine-tuning. The improved performance of our approach opens up the possibility of using continual learning for personalized and reliable motion prediction.
    Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation. (arXiv:2107.00644v1 [cs.LG])
    (0 min) While agents trained by Reinforcement Learning (RL) can solve increasingly challenging tasks directly from visual observations, generalizing learned skills to novel environments remains very challenging. Extensive use of data augmentation is a promising technique for improving generalization in RL, but it is often found to decrease sample efficiency and can even lead to divergence. In this paper, we investigate causes of instability when using data augmentation in common off-policy RL algorithms. We identify two problems, both rooted in high-variance Q-targets. Based on our findings, we propose a simple yet effective technique for stabilizing this class of algorithms under augmentation. We perform extensive empirical evaluation of image-based RL using both ConvNets and Vision Transformers (ViT) on a family of benchmarks based on DeepMind Control Suite, as well as in robotic manipulation tasks. Our method greatly improves stability and sample efficiency of ConvNets under augmentation, and achieves generalization results competitive with state-of-the-art methods for image-based RL. We further show that our method scales to RL with ViT-based architectures, and that data augmentation may be especially important in this setting.
    Egocentric Image Captioning for Privacy-Preserved Passive Dietary Intake Monitoring. (arXiv:2107.00372v1 [cs.CV])
    (0 min) Camera-based passive dietary intake monitoring is able to continuously capture the eating episodes of a subject, recording rich visual information, such as the type and volume of food being consumed, as well as the eating behaviours of the subject. However, there currently is no method that is able to incorporate these visual clues and provide a comprehensive context of dietary intake from passive recording (e.g., is the subject sharing food with others, what food the subject is eating, and how much food is left in the bowl). On the other hand, privacy is a major concern while egocentric wearable cameras are used for capturing. In this paper, we propose a privacy-preserved secure solution (i.e., egocentric image captioning) for dietary assessment with passive monitoring, which unifies food recognition, volume estimation, and scene understanding. By converting images into rich text descriptions, nutritionists can assess individual dietary intake based on the captions instead of the original images, reducing the risk of privacy leakage from images. To this end, an egocentric dietary image captioning dataset has been built, which consists of in-the-wild images captured by head-worn and chest-worn cameras in field studies in Ghana. A novel transformer-based architecture is designed to caption egocentric dietary images. Comprehensive experiments have been conducted to evaluate the effectiveness and to justify the design of the proposed architecture for egocentric dietary image captioning. To the best of our knowledge, this is the first work that applies image captioning to dietary intake assessment in real life settings.
    Lossless Coding of Point Cloud Geometry using a Deep Generative Model. (arXiv:2107.00400v1 [eess.IV])
    (0 min) This paper proposes a lossless point cloud (PC) geometry compression method that uses neural networks to estimate the probability distribution of voxel occupancy. First, to take into account the PC sparsity, our method adaptively partitions a point cloud into multiple voxel block sizes. This partitioning is signalled via an octree. Second, we employ a deep auto-regressive generative model to estimate the occupancy probability of each voxel given the previously encoded ones. We then employ the estimated probabilities to code efficiently a block using a context-based arithmetic coder. Our context has variable size and can expand beyond the current block to learn more accurate probabilities. We also consider using data augmentation techniques to increase the generalization capability of the learned probability models, in particular in the presence of noise and lower-density point clouds. Experimental evaluation, performed on a variety of point clouds from four different datasets and with diverse characteristics, demonstrates that our method reduces significantly (by up to 30%) the rate for lossless coding compared to the state-of-the-art MPEG codec.
    SinGAN-Seg: Synthetic Training Data Generation for Medical Image Segmentation. (arXiv:2107.00471v1 [eess.IV])
    (0 min) Processing medical data to find abnormalities is a time-consuming and costly task, requiring tremendous efforts from medical experts. Therefore, Ai has become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. AI tools highly depend on data for training the models. However, there are several constraints to access to large amounts of medical data to train machine learning algorithms in the medical domain, e.g., due to privacy concerns and the costly, time-consuming medical data annotation process. To address this, in this paper we present a novel synthetic data generation pipeline called SinGAN-Seg to produce synthetic medical data with the corresponding annotated ground truth masks. We show that these synthetic data generation pipelines can be used as an alternative to bypass privacy concerns and as an alternative way to produce artificial segmentation datasets with corresponding ground truth masks to avoid the tedious medical data annotation process. As a proof of concept, we used an open polyp segmentation dataset. By training UNet++ using both the real polyp segmentation dataset and the corresponding synthetic dataset generated from the SinGAN-Seg pipeline, we show that the synthetic data can achieve a very close performance to the real data when the real segmentation datasets are large enough. In addition, we show that synthetic data generated from the SinGAN-Seg pipeline improving the performance of segmentation algorithms when the training dataset is very small. Since our SinGAN-Seg pipeline is applicable for any medical dataset, this pipeline can be used with any other segmentation datasets.
    DivergentNets: Medical Image Segmentation by Network Ensemble. (arXiv:2107.00283v1 [eess.IV])
    (0 min) Detection of colon polyps has become a trending topic in the intersecting fields of machine learning and gastrointestinal endoscopy. The focus has mainly been on per-frame classification. More recently, polyp segmentation has gained attention in the medical community. Segmentation has the advantage of being more accurate than per-frame classification or object detection as it can show the affected area in greater detail. For our contribution to the EndoCV 2021 segmentation challenge, we propose two separate approaches. First, a segmentation model named TriUNet composed of three separate UNet models. Second, we combine TriUNet with an ensemble of well-known segmentation models, namely UNet++, FPN, DeepLabv3, and DeepLabv3+, into a model called DivergentNets to produce more generalizable medical image segmentation masks. In addition, we propose a modified Dice loss that calculates loss only for a single class when performing multiclass segmentation, forcing the model to focus on what is most important. Overall, the proposed methods achieved the best average scores for each respective round in the challenge, with TriUNet being the winning model in Round I and DivergentNets being the winning model in Round II of the segmentation generalization challenge at EndoCV 2021. The implementation of our approach is made publicly available on GitHub.
    PoliTO-IIT Submission to the EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition. (arXiv:2107.00337v1 [cs.CV])
    (0 min) In this report, we describe the technical details of our submission to the EPIC-Kitchens-100 Unsupervised Domain Adaptation (UDA) Challenge in Action Recognition. To tackle the domain-shift which exists under the UDA setting, we first exploited a recent Domain Generalization (DG) technique, called Relative Norm Alignment (RNA). It consists in designing a model able to generalize well to any unseen domain, regardless of the possibility to access target data at training time. Then, in a second phase, we extended the approach to work on unlabelled target data, allowing the model to adapt to the target distribution in an unsupervised fashion. For this purpose, we included in our framework existing UDA algorithms, such as Temporal Attentive Adversarial Adaptation Network (TA3N), jointly with new multi-stream consistency losses, namely Temporal Hard Norm Alignment (T-HNA) and Min-Entropy Consistency (MEC). Our submission (entry 'plnet') is visible on the leaderboard and it achieved the 1st position for 'verb', and the 3rd position for both 'noun' and 'action'.
    Training Interpretable Convolutional Neural Networks by Differentiating Class-specific Filters. (arXiv:2007.08194v3 [cs.CV] UPDATED)
    (0 min) Convolutional neural networks (CNNs) have been successfully used in a range of tasks. However, CNNs are often viewed as "black-box" and lack of interpretability. One main reason is due to the filter-class entanglement -- an intricate many-to-many correspondence between filters and classes. Most existing works attempt post-hoc interpretation on a pre-trained model, while neglecting to reduce the entanglement underlying the model. In contrast, we focus on alleviating filter-class entanglement during training. Inspired by cellular differentiation, we propose a novel strategy to train interpretable CNNs by encouraging class-specific filters, among which each filter responds to only one (or few) class. Concretely, we design a learnable sparse Class-Specific Gate (CSG) structure to assign each filter with one (or few) class in a flexible way. The gate allows a filter's activation to pass only when the input samples come from the specific class. Extensive experiments demonstrate the fabulous performance of our method in generating a sparse and highly class-related representation of the input, which leads to stronger interpretability. Moreover, comparing with the standard training strategy, our model displays benefits in applications like object localization and adversarial sample detection. Code link: https://github.com/hyliang96/CSGCNN.
    Circuit Complexity of Visual Search. (arXiv:2107.00223v1 [cs.CC])
    (0 min) We study computational hardness of feature and conjunction search through the lens of circuit complexity. Let $x = (x_1, ... , x_n)$ (resp., $y = (y_1, ... , y_n)$) be Boolean variables each of which takes the value one if and only if a neuron at place $i$ detects a feature (resp., another feature). We then simply formulate the feature and conjunction search as Boolean functions ${\rm FTR}_n(x) = \bigvee_{i=1}^n x_i$ and ${\rm CONJ}_n(x, y) = \bigvee_{i=1}^n x_i \wedge y_i$, respectively. We employ a threshold circuit or a discretized circuit (such as a sigmoid circuit or a ReLU circuit with discretization) as our models of neural networks, and consider the following four computational resources: [i] the number of neurons (size), [ii] the number of levels (depth), [iii] the number of active neurons outputting non-zero values (energy), and [iv] synaptic weight resolution (weight). We first prove that any threshold circuit $C$ of size $s$, depth $d$, energy $e$ and weight $w$ satisfies $\log rk(M_C) \le ed (\log s + \log w + \log n)$, where $rk(M_C)$ is the rank of the communication matrix $M_C$ of a $2n$-variable Boolean function that $C$ computes. Since ${\rm CONJ}_n$ has rank $2^n$, we have $n \le ed (\log s + \log w + \log n)$. Thus, an exponential lower bound on the size of even sublinear-depth threshold circuits exists if the energy and weight are sufficiently small. Since ${\rm FTR}_n$ is computable independently of $n$, our result suggests that computational capacity for the feature and conjunction search are different. We also show that the inequality is tight up to a constant factor if $ed = o(n/ \log n)$. We next show that a similar inequality holds for any discretized circuit. Thus, if we regard the number of gates outputting non-zero values as a measure for sparse activity, our results suggest that larger depth helps neural networks to acquire sparse activity.
    Few-Shot Learning with a Strong Teacher. (arXiv:2107.00197v1 [cs.CV])
    (0 min) Few-shot learning (FSL) aims to train a strong classifier using limited labeled examples. Many existing works take the meta-learning approach, sampling few-shot tasks in turn and optimizing the few-shot learner's performance on classifying the query examples. In this paper, we point out two potential weaknesses of this approach. First, the sampled query examples may not provide sufficient supervision for the few-shot learner. Second, the effectiveness of meta-learning diminishes sharply with increasing shots (i.e., the number of training examples per class). To resolve these issues, we propose a novel objective to directly train the few-shot learner to perform like a strong classifier. Concretely, we associate each sampled few-shot task with a strong classifier, which is learned with ample labeled examples. The strong classifier has a better generalization ability and we use it to supervise the few-shot learner. We present an efficient way to construct the strong classifier, making our proposed objective an easily plug-and-play term to existing meta-learning based FSL methods. We validate our approach in combinations with many representative meta-learning methods. On several benchmark datasets including miniImageNet and tiredImageNet, our approach leads to a notable improvement across a variety of tasks. More importantly, with our approach, meta-learning based FSL methods can consistently outperform non-meta-learning based ones, even in a many-shot setting, greatly strengthening their applicability.
    GlyphCRM: Bidirectional Encoder Representation for Chinese Character with its Glyph. (arXiv:2107.00395v1 [cs.AI])
    (0 min) Previous works indicate that the glyph of Chinese characters contains rich semantic information and has the potential to enhance the representation of Chinese characters. The typical method to utilize the glyph features is by incorporating them into the character embedding space. Inspired by previous methods, we innovatively propose a Chinese pre-trained representation model named as GlyphCRM, which abandons the ID-based character embedding method yet solely based on sequential character images. We render each character into a binary grayscale image and design two-channel position feature maps for it. Formally, we first design a two-layer residual convolutional neural network, namely HanGlyph to generate the initial glyph representation of Chinese characters, and subsequently adopt multiple bidirectional encoder Transformer blocks as the superstructure to capture the context-sensitive information. Meanwhile, we feed the glyph features extracted from each layer of the HanGlyph module into the underlying Transformer blocks by skip-connection method to fully exploit the glyph features of Chinese characters. As the HanGlyph module can obtain a sufficient glyph representation of any Chinese character, the long-standing out-of-vocabulary problem could be effectively solved. Extensive experimental results indicate that GlyphCRM substantially outperforms the previous BERT-based state-of-the-art model on 9 fine-tuning tasks, and it has strong transferability and generalization on specialized fields and low-resource tasks. We hope this work could spark further research beyond the realms of well-established representation of Chinese texts.
    Segmenting 3D Hybrid Scenes via Zero-Shot Learning. (arXiv:2107.00430v1 [cs.CV])
    (0 min) This work is to tackle the problem of point cloud semantic segmentation for 3D hybrid scenes under the framework of zero-shot learning. Here by hybrid, we mean the scene consists of both seen-class and unseen-class 3D objects, a more general and realistic setting in application. To our knowledge, this problem has not been explored in the literature. To this end, we propose a network to synthesize point features for various classes of objects by leveraging the semantic features of both seen and unseen object classes, called PFNet. The proposed PFNet employs a GAN architecture to synthesize point features, where the semantic relationship between seen-class and unseen-class features is consolidated by adapting a new semantic regularizer, and the synthesized features are used to train a classifier for predicting the labels of the testing 3D scene points. Besides we also introduce two benchmarks for algorithmic evaluation by re-organizing the public S3DIS and ScanNet datasets under six different data splits. Experimental results on the two benchmarks validate our proposed method, and we hope our introduced two benchmarks and methodology could be of help for more research on this new direction.
    FedMix: Approximation of Mixup under Mean Augmented Federated Learning. (arXiv:2107.00233v1 [cs.LG])
    (0 min) Federated learning (FL) allows edge devices to collectively learn a model without directly sharing data within each device, thus preserving privacy and eliminating the need to store data globally. While there are promising results under the assumption of independent and identically distributed (iid) local data, current state-of-the-art algorithms suffer from performance degradation as the heterogeneity of local data across clients increases. To resolve this issue, we propose a simple framework, Mean Augmented Federated Learning (MAFL), where clients send and receive averaged local data, subject to the privacy requirements of target applications. Under our framework, we propose a new augmentation algorithm, named FedMix, which is inspired by a phenomenal yet simple data augmentation method, Mixup, but does not require local raw data to be directly shared among devices. Our method shows greatly improved performance in the standard benchmark datasets of FL, under highly non-iid federated settings, compared to conventional algorithms.
    Extraction of Key-frames of Endoscopic Videos by using Depth Information. (arXiv:2107.00005v1 [cs.CV])
    (0 min) A deep learning-based monocular depth estimation (MDE) technique is proposed for selection of most informative frames (key frames) of an endoscopic video. In most of the cases, ground truth depth maps of polyps are not readily available and that is why the transfer learning approach is adopted in our method. An endoscopic modalities generally capture thousands of frames. In this scenario, it is quite important to discard low-quality and clinically irrelevant frames of an endoscopic video while the most informative frames should be retained for clinical diagnosis. In this view, a key-frame selection strategy is proposed by utilizing the depth information of polyps. In our method, image moment, edge magnitude, and key-points are considered for adaptively selecting the key frames. One important application of our proposed method could be the 3D reconstruction of polyps with the help of extracted key frames. Also, polyps are localized with the help of extracted depth maps.
    Multi-modal Graph Learning for Disease Prediction. (arXiv:2107.00206v1 [cs.LG])
    (0 min) Benefiting from the powerful expressive capability of graphs, graph-based approaches have achieved impressive performance in various biomedical applications. Most existing methods tend to define the adjacency matrix among samples manually based on meta-features, and then obtain the node embeddings for downstream tasks by Graph Representation Learning (GRL). However, it is not easy for these approaches to generalize to unseen samples. Meanwhile, the complex correlation between modalities is also ignored. As a result, these factors inevitably yield the inadequacy of providing valid information about the patient's condition for a reliable diagnosis. In this paper, we propose an end-to-end Multimodal Graph Learning framework (MMGL) for disease prediction. To effectively exploit the rich information across multi-modality associated with diseases, amodal-attentional multi-modal fusion is proposed to integrate the features of each modality by leveraging the correlation and complementarity between the modalities. Furthermore, instead of defining the adjacency matrix manually as existing methods, the latent graph structure can be captured through a novel way of adaptive graph learning. It could be jointly optimized with the prediction model, thus revealing the intrinsic connections among samples. Unlike the previous transductive methods, our model is also applicable to the scenario of inductive learning for those unseen data. An extensive group of experiments on two disease prediction problems is then carefully designed and presented, demonstrating that MMGL obtains more favorable performances. In addition, we also visualize and analyze the learned graph structure to provide more reliable decision support for doctors in real medical applications and inspiration for disease research.
    Automated Detection and Diagnosis of Diabetic Retinopathy: A Comprehensive Survey. (arXiv:2107.00115v1 [eess.IV])
    (0 min) Diabetic Retinopathy (DR) is a leading cause of vision loss in the world,. In the past few Diabetic Retinopathy (DR) is a leading cause of vision loss in the world. In the past few years, Artificial Intelligence (AI) based approaches have been used to detect and grade DR. Early detection enables appropriate treatment and thus prevents vision loss, Both fundus and optical coherence tomography (OCT) images are used to image the retina. With deep learning/machine learning apprroaches it is possible to extract features from the images and detect the presence of DR. Multiple strategies are implemented to detect and grade the presence of DR using classification, segmentation, and hybrid techniques. This review covers the literature dealing with AI approaches to DR that have been published in the open literature over a five year span (2016-2021). In addition a comprehensive list of available DR datasets is reported. Both the PICO (P-patient, I-intervention, C-control O-outcome) and Preferred Reporting Items for Systematic Review and Meta-Analysis (PRISMA)2009 search strategies were employed. We summarize a total of 114 published articles which conformed to the scope of the review. In addition a list of 43 major datasets is presented.
    Towards Measuring Bias in Image Classification. (arXiv:2107.00360v1 [cs.CV])
    (0 min) Convolutional Neural Networks (CNN) have become de fact state-of-the-art for the main computer vision tasks. However, due to the complex underlying structure their decisions are hard to understand which limits their use in some context of the industrial world. A common and hard to detect challenge in machine learning (ML) tasks is data bias. In this work, we present a systematic approach to uncover data bias by means of attribution maps. For this purpose, first an artificial dataset with a known bias is created and used to train intentionally biased CNNs. The networks' decisions are then inspected using attribution maps. Finally, meaningful metrics are used to measure the attribution maps' representativeness with respect to the known bias. The proposed study shows that some attribution map techniques highlight the presence of bias in the data better than others and metrics can support the identification of bias.
    MASS: Multi-Attentional Semantic Segmentation of LiDAR Data for Dense Top-View Understanding. (arXiv:2107.00346v1 [cs.CV])
    (0 min) At the heart of all automated driving systems is the ability to sense the surroundings, e.g., through semantic segmentation of LiDAR sequences, which experienced a remarkable progress due to the release of large datasets such as SemanticKITTI and nuScenes-LidarSeg. While most previous works focus on sparse segmentation of the LiDAR input, dense output masks provide self-driving cars with almost complete environment information. In this paper, we introduce MASS - a Multi-Attentional Semantic Segmentation model specifically built for dense top-view understanding of the driving scenes. Our framework operates on pillar- and occupancy features and comprises three attention-based building blocks: (1) a keypoint-driven graph attention, (2) an LSTM-based attention computed from a vector embedding of the spatial input, and (3) a pillar-based attention, resulting in a dense 360-degree segmentation mask. With extensive experiments on both, SemanticKITTI and nuScenes-LidarSeg, we quantitatively demonstrate the effectiveness of our model, outperforming the state of the art by 19.0% on SemanticKITTI and reaching 32.7% in mIoU on nuScenes-LidarSeg, where MASS is the first work addressing the dense segmentation task. Furthermore, our multi-attention model is shown to be very effective for 3D object detection validated on the KITTI-3D dataset, showcasing its high generalizability to other tasks related to 3D vision.
    CBNetV2: A Composite Backbone Network Architecture for Object Detection. (arXiv:2107.00420v1 [cs.CV])
    (0 min) Modern top-performing object detectors depend heavily on backbone networks, whose advances bring consistent performance gains through exploring more effective network structures. However, designing or searching for a new backbone and pre-training it on ImageNet may require a large number of computational resources, making it costly to obtain better detection performance. In this paper, we propose a novel backbone network, namely CBNetV2, by constructing compositions of existing open-sourced pre-trained backbones. In particular, CBNetV2 architecture groups multiple identical backbones, which are connected through composite connections. We also propose a better training strategy with the Assistant Supervision for CBNet-based detectors. Without additional pre-training, CBNetV2 can be integrated into mainstream detectors, including one-stage and two-stage detectors, as well as anchor-based and anchor-free-based ones, and significantly improve their performance by more than 3.0% AP over the baseline on COCO. Also, experiments provide strong evidence showing that composite backbones are more efficient and resource-friendly than pre-trained wider and deeper networks, including manual-based and NAS-based, as well as CNN-based and Transformer-based ones. Particularly, with single-model and single-scale testing, our HTC Dual-Swin-B achieves 58.6% box AP and 51.1% mask AP on COCO test-dev, which is significantly better than the state-of-the-art result (i.e., 57.7% box AP and 50.2% mask AP) achieved by a stronger baseline HTC++ with a larger backbone Swin-L. Code will be released at https://github.com/VDIGPKU/CBNetV2.
    Are Convolutional Neural Networks or Transformers more like human vision?. (arXiv:2105.07197v2 [cs.CV] UPDATED)
    (0 min) Modern machine learning models for computer vision exceed humans in accuracy on specific visual recognition tasks, notably on datasets like ImageNet. However, high accuracy can be achieved in many ways. The particular decision function found by a machine learning system is determined not only by the data to which the system is exposed, but also the inductive biases of the model, which are typically harder to characterize. In this work, we follow a recent trend of in-depth behavioral analyses of neural network models that go beyond accuracy as an evaluation metric by looking at patterns of errors. Our focus is on comparing a suite of standard Convolutional Neural Networks (CNNs) and a recently-proposed attention-based network, the Vision Transformer (ViT), which relaxes the translation-invariance constraint of CNNs and therefore represents a model with a weaker set of inductive biases. Attention-based networks have previously been shown to achieve higher accuracy than CNNs on vision tasks, and we demonstrate, using new metrics for examining error consistency with more granularity, that their errors are also more consistent with those of humans. These results have implications both for building more human-like vision models, as well as for understanding visual object recognition in humans.
    Focal Self-attention for Local-Global Interactions in Vision Transformers. (arXiv:2107.00641v1 [cs.CV])
    (0 min) Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability of capturing short- and long-range visual dependencies through self-attention is arguably the main source for the success. But it also brings challenges due to quadratic computational overhead, especially for the high-resolution vision tasks (e.g., object detection). In this paper, we present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions. Using this new mechanism, each token attends the closest surrounding tokens at fine granularity but the tokens far away at coarse granularity, and thus can capture both short- and long-range visual dependencies efficiently and effectively. With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers on a range of public image classification and object detection benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M and a larger size of 89.8M achieve 83.5 and 83.8 Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state-of-the-art Swin Transformers for 6 different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation, creating new SoTA on three of the most challenging computer vision tasks.
    VideoLightFormer: Lightweight Action Recognition using Transformers. (arXiv:2107.00451v1 [cs.CV])
    (0 min) Efficient video action recognition remains a challenging problem. One large model after another takes the place of the state-of-the-art on the Kinetics dataset, but real-world efficiency evaluations are often lacking. In this work, we fill this gap and investigate the use of transformers for efficient action recognition. We propose a novel, lightweight action recognition architecture, VideoLightFormer. In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers, while maintaining spatial and temporal video structure throughout the entire model. Existing methods often resort to one of the two extremes, where they either apply huge transformers to video features, or minimal transformers on highly pooled video features. Our method differs from them by keeping the transformer models small, but leveraging full spatiotemporal feature structure. We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 (SSV2) datasets and find that it achieves a better mix of efficiency and accuracy than existing state-of-the-art models, apart from the Temporal Shift Module on SSV2.
    SALYPATH: A Deep-Based Architecture for visual attention prediction. (arXiv:2107.00559v1 [cs.CV])
    (0 min) Human vision is naturally more attracted by some regions within their field of view than others. This intrinsic selectivity mechanism, so-called visual attention, is influenced by both high- and low-level factors; such as the global environment (illumination, background texture, etc.), stimulus characteristics (color, intensity, orientation, etc.), and some prior visual information. Visual attention is useful for many computer vision applications such as image compression, recognition, and captioning. In this paper, we propose an end-to-end deep-based method, so-called SALYPATH (SALiencY and scanPATH), that efficiently predicts the scanpath of an image through features of a saliency model. The idea is predict the scanpath by exploiting the capacity of a deep-based model to predict the saliency. The proposed method was evaluated through 2 well-known datasets. The results obtained showed the relevance of the proposed framework comparing to state-of-the-art models.
    Inter Extreme Points Geodesics for Weakly Supervised Segmentation. (arXiv:2107.00583v1 [cs.CV])
    (0 min) We introduce $\textit{InExtremIS}$, a weakly supervised 3D approach to train a deep image segmentation network using particularly weak train-time annotations: only 6 extreme clicks at the boundary of the objects of interest. Our fully-automatic method is trained end-to-end and does not require any test-time annotations. From the extreme points, 3D bounding boxes are extracted around objects of interest. Then, deep geodesics connecting extreme points are generated to increase the amount of "annotated" voxels within the bounding boxes. Finally, a weakly supervised regularised loss derived from a Conditional Random Field formulation is used to encourage prediction consistency over homogeneous regions. Extensive experiments are performed on a large open dataset for Vestibular Schwannoma segmentation. $\textit{InExtremIS}$ obtained competitive performance, approaching full supervision and outperforming significantly other weakly supervised techniques based on bounding boxes. Moreover, given a fixed annotation time budget, $\textit{InExtremIS}$ outperforms full supervision. Our code and data are available online.
    SSC: Semantic Scan Context for Large-Scale Place Recognition. (arXiv:2107.00382v1 [cs.CV])
    (0 min) Place recognition gives a SLAM system the ability to correct cumulative errors. Unlike images that contain rich texture features, point clouds are almost pure geometric information which makes place recognition based on point clouds challenging. Existing works usually encode low-level features such as coordinate, normal, reflection intensity, etc., as local or global descriptors to represent scenes. Besides, they often ignore the translation between point clouds when matching descriptors. Different from most existing methods, we explore the use of high-level features, namely semantics, to improve the descriptor's representation ability. Also, when matching descriptors, we try to correct the translation between point clouds to improve accuracy. Concretely, we propose a novel global descriptor, Semantic Scan Context, which explores semantic information to represent scenes more effectively. We also present a two-step global semantic ICP to obtain the 3D pose (x, y, yaw) used to align the point cloud to improve matching performance. Our experiments on the KITTI dataset show that our approach outperforms the state-of-the-art methods with a large margin. Our code is available at: https://github.com/lilin-hitcrt/SSC.
    DVS-Attacks: Adversarial Attacks on Dynamic Vision Sensors for Spiking Neural Networks. (arXiv:2107.00415v1 [cs.CV])
    (0 min) Spiking Neural Networks (SNNs), despite being energy-efficient when implemented on neuromorphic hardware and coupled with event-based Dynamic Vision Sensors (DVS), are vulnerable to security threats, such as adversarial attacks, i.e., small perturbations added to the input for inducing a misclassification. Toward this, we propose DVS-Attacks, a set of stealthy yet efficient adversarial attack methodologies targeted to perturb the event sequences that compose the input of the SNNs. First, we show that noise filters for DVS can be used as defense mechanisms against adversarial attacks. Afterwards, we implement several attacks and test them in the presence of two types of noise filters for DVS cameras. The experimental results show that the filters can only partially defend the SNNs against our proposed DVS-Attacks. Using the best settings for the noise filters, our proposed Mask Filter-Aware Dash Attack reduces the accuracy by more than 20% on the DVS-Gesture dataset and by more than 65% on the MNIST dataset, compared to the original clean frames. The source code of all the proposed DVS-Attacks and noise filters is released at https://github.com/albertomarchisio/DVS-Attacks.
    Generating Synthetic Training Data for Deep Learning-Based UAV Trajectory Prediction. (arXiv:2107.00422v1 [cs.CV])
    (0 min) Deep learning-based models, such as recurrent neural networks (RNNs), have been applied to various sequence learning tasks with great success. Following this, these models are increasingly replacing classic approaches in object tracking applications for motion prediction. On the one hand, these models can capture complex object dynamics with less modeling required, but on the other hand, they depend on a large amount of training data for parameter tuning. Towards this end, we present an approach for generating synthetic trajectory data of unmanned-aerial-vehicles (UAVs) in image space. Since UAVs, or rather quadrotors are dynamical systems, they can not follow arbitrary trajectories. With the prerequisite that UAV trajectories fulfill a smoothness criterion corresponding to a minimal change of higher-order motion, methods for planning aggressive quadrotors flights can be utilized to generate optimal trajectories through a sequence of 3D waypoints. By projecting these maneuver trajectories, which are suitable for controlling quadrotors, to image space, a versatile trajectory data set is realized. To demonstrate the applicability of the synthetic trajectory data, we show that an RNN-based prediction model solely trained on the generated data can outperform classic reference models on a real-world UAV tracking dataset. The evaluation is done on the publicly available ANTI-UAV dataset.
    On the detection-to-track association for online multi-object tracking. (arXiv:2107.00500v1 [cs.CV])
    (2 min) Driven by recent advances in object detection with deep neural networks, the tracking-by-detection paradigm has gained increasing prevalence in the research community of multi-object tracking (MOT). It has long been known that appearance information plays an essential role in the detection-to-track association, which lies at the core of the tracking-by-detection paradigm. While most existing works consider the appearance distances between the detections and the tracks, they ignore the statistical information implied by the historical appearance distance records in the tracks, which can be particularly useful when a detection has similar distances with two or more tracks. In this work, we propose a hybrid track association (HTA) algorithm that models the historical appearance distances of a track with an incremental Gaussian mixture model (IGMM) and incorporates the derived statistical information into the calculation of the detection-to-track association cost. Experimental results on three MOT benchmarks confirm that HTA effectively improves the target identification performance with a small compromise to the tracking speed. Additionally, compared to many state-of-the-art trackers, the DeepSORT tracker equipped with HTA achieves better or comparable performance in terms of the balance of tracking quality and speed.
    MIDV-2020: A Comprehensive Benchmark Dataset for Identity Document Analysis. (arXiv:2107.00396v1 [cs.CV])
    (2 min) Identity documents recognition is an important sub-field of document analysis, which deals with tasks of robust document detection, type identification, text fields recognition, as well as identity fraud prevention and document authenticity validation given photos, scans, or video frames of an identity document capture. Significant amount of research has been published on this topic in recent years, however a chief difficulty for such research is scarcity of datasets, due to the subject matter being protected by security requirements. A few datasets of identity documents which are available lack diversity of document types, capturing conditions, or variability of document field values. In addition, the published datasets were typically designed only for a subset of document recognition problems, not for a complex identity document analysis. In this paper, we present a dataset MIDV-2020 which consists of 1000 video clips, 2000 scanned images, and 1000 photos of 1000 unique mock identity documents, each with unique text field values and unique artificially generated faces, with rich annotation. For the presented benchmark dataset baselines are provided for such tasks as document location and identification, text fields recognition, and face detection. With 72409 annotated images in total, to the date of publication the proposed dataset is the largest publicly available identity documents dataset with variable artificially generated data, and we believe that it will prove invaluable for advancement of the field of document analysis and recognition. The dataset is available for download at this ftp URL and this http URL .
    End-to-end Compression Towards Machine Vision: Network Architecture Design and Optimization. (arXiv:2107.00328v1 [cs.CV])
    (2 min) The research of visual signal compression has a long history. Fueled by deep learning, exciting progress has been made recently. Despite achieving better compression performance, existing end-to-end compression algorithms are still designed towards better signal quality in terms of rate-distortion optimization. In this paper, we show that the design and optimization of network architecture could be further improved for compression towards machine vision. We propose an inverted bottleneck structure for end-to-end compression towards machine vision, which specifically accounts for efficient representation of the semantic information. Moreover, we quest the capability of optimization by incorporating the analytics accuracy into the optimization process, and the optimality is further explored with generalized rate-accuracy optimization in an iterative manner. We use object detection as a showcase for end-to-end compression towards machine vision, and extensive experiments show that the proposed scheme achieves significant BD-rate savings in terms of analysis performance. Moreover, the promise of the scheme is also demonstrated with strong generalization capability towards other machine vision tasks, due to the enabling of signal-level reconstruction.
    OPT: Omni-Perception Pre-Trainer for Cross-Modal Understanding and Generation. (arXiv:2107.00249v1 [cs.CV])
    (2 min) In this paper, we propose an Omni-perception Pre-Trainer (OPT) for cross-modal understanding and generation, by jointly modeling visual, text and audio resources. OPT is constructed in an encoder-decoder framework, including three single-modal encoders to generate token-based embeddings for each modality, a cross-modal encoder to encode the correlations among the three modalities, and two cross-modal decoders to generate text and image respectively. For the OPT's pre-training, we design a multi-task pretext learning scheme to model multi-modal resources from three different data granularities, \ie, token-, modality-, and sample-level modeling, through which OPT learns to align and translate among different modalities. The pre-training task is carried out on a large amount of image-text-audio triplets from Open Images. Experimental results show that OPT can learn strong image-text-audio multi-modal representations and achieve promising results on a variety of cross-modal understanding and generation tasks.
    Drone swarm patrolling with uneven coverage requirements. (arXiv:2107.00362v1 [cs.CV])
    (2 min) Swarms of drones are being more and more used in many practical scenarios, such as surveillance, environmental monitoring, search and rescue in hardly-accessible areas, etc.. While a single drone can be guided by a human operator, the deployment of a swarm of multiple drones requires proper algorithms for automatic task-oriented control. In this paper, we focus on visual coverage optimization with drone-mounted camera sensors. In particular, we consider the specific case in which the coverage requirements are uneven, meaning that different parts of the environment have different coverage priorities. We model these coverage requirements with relevance maps and propose a deep reinforcement learning algorithm to guide the swarm. The paper first defines a proper learning model for a single drone, and then extends it to the case of multiple drones both with greedy and cooperative strategies. Experimental results show the performance of the proposed method, also compared with a standard patrolling algorithm.
    iMiGUE: An Identity-free Video Dataset for Micro-Gesture Understanding and Emotion Analysis. (arXiv:2107.00285v1 [cs.CV])
    (2 min) We introduce a new dataset for the emotional artificial intelligence research: identity-free video dataset for Micro-Gesture Understanding and Emotion analysis (iMiGUE). Different from existing public datasets, iMiGUE focuses on nonverbal body gestures without using any identity information, while the predominant researches of emotion analysis concern sensitive biometric data, like face and speech. Most importantly, iMiGUE focuses on micro-gestures, i.e., unintentional behaviors driven by inner feelings, which are different from ordinary scope of gestures from other gesture datasets which are mostly intentionally performed for illustrative purposes. Furthermore, iMiGUE is designed to evaluate the ability of models to analyze the emotional states by integrating information of recognized micro-gesture, rather than just recognizing prototypes in the sequences separately (or isolatedly). This is because the real need for emotion AI is to understand the emotional states behind gestures in a holistic way. Moreover, to counter for the challenge of imbalanced sample distribution of this dataset, an unsupervised learning method is proposed to capture latent representations from the micro-gesture sequences themselves. We systematically investigate representative methods on this dataset, and comprehensive experimental results reveal several interesting insights from the iMiGUE, e.g., micro-gesture-based analysis can promote emotion understanding. We confirm that the new iMiGUE dataset could advance studies of micro-gesture and emotion AI.
    Generic Event Boundary Detection Challenge at CVPR 2021 Technical Report: Cascaded Temporal Attention Network (CASTANET). (arXiv:2107.00239v1 [cs.CV])
    (2 min) This report presents the approach used in the submission of Generic Event Boundary Detection (GEBD) Challenge at CVPR21. In this work, we design a Cascaded Temporal Attention Network (CASTANET) for GEBD, which is formed by three parts, the backbone network, the temporal attention module, and the classification module. Specifically, the Channel-Separated Convolutional Network (CSN) is used as the backbone network to extract features, and the temporal attention module is designed to enforce the network to focus on the discriminative features. After that, the cascaded architecture is used in the classification module to generate more accurate boundaries. In addition, the ensemble strategy is used to further improve the performance of the proposed method. The proposed method achieves 83.30% F1 score on Kinetics-GEBD test set, which improves 20.5% F1 score compared to the baseline method. Code is available at https://github.com/DexiangHong/Cascade-PC.
    Feasibility of Haralick's Texture Features for the Classification of Chromogenic In-situ Hybridization Images. (arXiv:2107.00235v1 [eess.IV])
    (2 min) This paper presents a proof of concept for the usefulness of second-order texture features for the qualitative analysis and classification of chromogenic in-situ hybridization whole slide images in high-throughput imaging experiments. The challenge is that currently, the gold standard for gene expression grading in such images is expert assessment. The idea of the research team is to use different approaches in the analysis of these images that will be used for structural segmentation and functional analysis in gene expression. The article presents such perspective idea to select a number of textural features that are going to be used for classification. In our experiment, natural grouping of image samples (tiles) depending on their local texture properties was explored in an unsupervised classification procedure. The features are reduced to two dimensions with fuzzy c-means clustering. The overall conclusion of this experiment is that Haralick features are a viable choice for classification and analysis of chromogenic in-situ hybridization image data. The principal component analysis approach produced slightly more "understandable" from an annotator's point of view classes.
    Deep auxiliary learning for visual localization using colorization task. (arXiv:2107.00222v1 [cs.CV])
    (2 min) Visual localization is one of the most important components for robotics and autonomous driving. Recently, inspiring results have been shown with CNN-based methods which provide a direct formulation to end-to-end regress 6-DoF absolute pose. Additional information like geometric or semantic constraints is generally introduced to improve performance. Especially, the latter can aggregate high-level semantic information into localization task, but it usually requires enormous manual annotations. To this end, we propose a novel auxiliary learning strategy for camera localization by introducing scene-specific high-level semantics from self-supervised representation learning task. Viewed as a powerful proxy task, image colorization task is chosen as complementary task that outputs pixel-wise color version of grayscale photograph without extra annotations. In our work, feature representations from colorization network are embedded into localization network by design to produce discriminative features for pose regression. Meanwhile an attention mechanism is introduced for the benefit of localization performance. Extensive experiments show that our model significantly improve localization accuracy over state-of-the-arts on both indoor and outdoor datasets.
    Explainable Diabetic Retinopathy Detection and Retinal Image Generation. (arXiv:2107.00296v1 [eess.IV])
    (2 min) Though deep learning has shown successful performance in classifying the label and severity stage of certain diseases, most of them give few explanations on how to make predictions. Inspired by Koch's Postulates, the foundation in evidence-based medicine (EBM) to identify the pathogen, we propose to exploit the interpretability of deep learning application in medical diagnosis. By determining and isolating the neuron activation patterns on which diabetic retinopathy (DR) detector relies to make decisions, we demonstrate the direct relation between the isolated neuron activation and lesions for a pathological explanation. To be specific, we first define novel pathological descriptors using activated neurons of the DR detector to encode both spatial and appearance information of lesions. Then, to visualize the symptom encoded in the descriptor, we propose Patho-GAN, a new network to synthesize medically plausible retinal images. By manipulating these descriptors, we could even arbitrarily control the position, quantity, and categories of generated lesions. We also show that our synthesized images carry the symptoms directly related to diabetic retinopathy diagnosis. Our generated images are both qualitatively and quantitatively superior to the ones by previous methods. Besides, compared to existing methods that take hours to generate an image, our second level speed endows the potential to be an effective solution for data augmentation.
    CLDA: Contrastive Learning for Semi-Supervised Domain Adaptation. (arXiv:2107.00085v1 [cs.CV])
    (2 min) Unsupervised Domain Adaptation (UDA) aims to align the labeled source distribution with the unlabeled target distribution to obtain domain invariant predictive models. However, the application of well-known UDA approaches does not generalize well in Semi-Supervised Domain Adaptation (SSDA) scenarios where few labeled samples from the target domain are available. In this paper, we propose a simple Contrastive Learning framework for semi-supervised Domain Adaptation (CLDA) that attempts to bridge the intra-domain gap between the labeled and unlabeled target distributions and inter-domain gap between source and unlabeled target distribution in SSDA. We suggest employing class-wise contrastive learning to reduce the inter-domain gap and instance-level contrastive alignment between the original (input image) and strongly augmented unlabeled target images to minimize the intra-domain discrepancy. We have shown empirically that both of these modules complement each other to achieve superior performance. Experiments on three well-known domain adaptation benchmark datasets namely DomainNet, Office-Home, and Office31 demonstrate the effectiveness of our approach. CLDA achieves state-of-the-art results on all the above datasets.
    Sanity Checks for Lottery Tickets: Does Your Winning Ticket Really Win the Jackpot?. (arXiv:2107.00166v1 [cs.LG])
    (2 min) There have been long-standing controversies and inconsistencies over the experiment setup and criteria for identifying the "winning ticket" in literature. To reconcile such, we revisit the definition of lottery ticket hypothesis, with comprehensive and more rigorous conditions. Under our new definition, we show concrete evidence to clarify whether the winning ticket exists across the major DNN architectures and/or applications. Through extensive experiments, we perform quantitative analysis on the correlations between winning tickets and various experimental factors, and empirically study the patterns of our observations. We find that the key training hyperparameters, such as learning rate and training epochs, as well as the architecture characteristics such as capacities and residual connections, are all highly correlated with whether and when the winning tickets can be identified. Based on our analysis, we summarize a guideline for parameter settings in regards of specific architecture characteristics, which we hope to catalyze the research progress on the topic of lottery ticket hypothesis.
    E-DSSR: Efficient Dynamic Surgical Scene Reconstruction with Transformer-based Stereoscopic Depth Perception. (arXiv:2107.00229v1 [cs.CV])
    (2 min) Reconstructing the scene of robotic surgery from the stereo endoscopic video is an important and promising topic in surgical data science, which potentially supports many applications such as surgical visual perception, robotic surgery education and intra-operative context awareness. However, current methods are mostly restricted to reconstructing static anatomy assuming no tissue deformation, tool occlusion and de-occlusion, and camera movement. However, these assumptions are not always satisfied in minimal invasive robotic surgeries. In this work, we present an efficient reconstruction pipeline for highly dynamic surgical scenes that runs at 28 fps. Specifically, we design a transformer-based stereoscopic depth perception for efficient depth estimation and a light-weight tool segmentor to handle tool occlusion. After that, a dynamic reconstruction algorithm which can estimate the tissue deformation and camera movement, and aggregate the information over time is proposed for surgical scene reconstruction. We evaluate the proposed pipeline on two datasets, the public Hamlyn Centre Endoscopic Video Dataset and our in-house DaVinci robotic surgery dataset. The results demonstrate that our method can recover the scene obstructed by the surgical tool and handle the movement of camera in realistic surgical scenarios effectively at real-time speed.
    Scalable Certified Segmentation via Randomized Smoothing. (arXiv:2107.00228v1 [cs.LG])
    (2 min) We present a new certification method for image and point cloud segmentation based on randomized smoothing. The method leverages a novel scalable algorithm for prediction and certification that correctly accounts for multiple testing, necessary for ensuring statistical guarantees. The key to our approach is reliance on established multiple-testing correction mechanisms as well as the ability to abstain from classifying single pixels or points while still robustly segmenting the overall input. Our experimental evaluation on synthetic data and challenging datasets, such as Pascal Context, Cityscapes, and ShapeNet, shows that our algorithm can achieve, for the first time, competitive accuracy and certification guarantees on real-world segmentation tasks. We provide an implementation at https://github.com/eth-sri/segmentation-smoothing.
    Simple Training Strategies and Model Scaling for Object Detection. (arXiv:2107.00057v1 [cs.CV])
    (2 min) The speed-accuracy Pareto curve of object detection systems have advanced through a combination of better model architectures, training and inference methods. In this paper, we methodically evaluate a variety of these techniques to understand where most of the improvements in modern detection systems come from. We benchmark these improvements on the vanilla ResNet-FPN backbone with RetinaNet and RCNN detectors. The vanilla detectors are improved by 7.7% in accuracy while being 30% faster in speed. We further provide simple scaling strategies to generate family of models that form two Pareto curves, named RetinaNet-RS and Cascade RCNN-RS. These simple rescaled detectors explore the speed-accuracy trade-off between the one-stage RetinaNet detectors and two-stage RCNN detectors. Our largest Cascade RCNN-RS models achieve 52.9% AP with a ResNet152-FPN backbone and 53.6% with a SpineNet143L backbone. Finally, we show the ResNet architecture, with three minor architectural changes, outperforms EfficientNet as the backbone for object detection and instance segmentation systems.
    Improving Task Adaptation for Cross-domain Few-shot Learning. (arXiv:2107.00358v1 [cs.CV])
    (2 min) In this paper, we look at the problem of cross-domain few-shot classification that aims to learn a classifier from previously unseen classes and domains with few labeled samples. We study several strategies including various adapter topologies and operations in terms of their performance and efficiency that can be easily attached to existing methods with different meta-training strategies and adapt them for a given task during meta-test phase. We show that parametric adapters attached to convolutional layers with residual connections performs the best, and significantly improves the performance of the state-of-the-art models in the Meta-Dataset benchmark with minor additional cost. Our code will be available at https://github.com/VICO-UoE/URL.
    Revisiting Knowledge Distillation: An Inheritance and Exploration Framework. (arXiv:2107.00181v1 [cs.LG])
    (2 min) Knowledge Distillation (KD) is a popular technique to transfer knowledge from a teacher model or ensemble to a student model. Its success is generally attributed to the privileged information on similarities/consistency between the class distributions or intermediate feature representations of the teacher model and the student model. However, directly pushing the student model to mimic the probabilities/features of the teacher model to a large extent limits the student model in learning undiscovered knowledge/features. In this paper, we propose a novel inheritance and exploration knowledge distillation framework (IE-KD), in which a student model is split into two parts - inheritance and exploration. The inheritance part is learned with a similarity loss to transfer the existing learned knowledge from the teacher model to the student model, while the exploration part is encouraged to learn representations different from the inherited ones with a dis-similarity loss. Our IE-KD framework is generic and can be easily combined with existing distillation or mutual learning methods for training deep neural networks. Extensive experiments demonstrate that these two parts can jointly push the student model to learn more diversified and effective representations, and our IE-KD can be a general technique to improve the student network to achieve SOTA performance. Furthermore, by applying our IE-KD to the training of two networks, the performance of both can be improved w.r.t. deep mutual learning. The code and models of IE-KD will be make publicly available at https://github.com/yellowtownhz/IE-KD.
    Attention Bottlenecks for Multimodal Fusion. (arXiv:2107.00135v1 [cs.CV])
    (2 min) Humans perceive the world by concurrently processing and fusing high-dimensional inputs from multiple modalities such as vision and audio. Machine perception models, in stark contrast, are typically modality-specific and optimised for unimodal benchmarks, and hence late-stage fusion of final representations or predictions from each modality (`late-fusion') is still a dominant paradigm for multimodal video classification. Instead, we introduce a novel transformer based architecture that uses `fusion bottlenecks' for modality fusion at multiple layers. Compared to traditional pairwise self-attention, our model forces information between different modalities to pass through a small number of bottleneck latents, requiring the model to collate and condense the most relevant information in each modality and only share what is necessary. We find that such a strategy improves fusion performance, at the same time reducing computational cost. We conduct thorough ablation studies, and achieve state-of-the-art results on multiple audio-visual classification benchmarks including Audioset, Epic-Kitchens and VGGSound. All code and models will be released.
    Unsupervised Model Drift Estimation with Batch Normalization Statistics for Dataset Shift Detection and Model Selection. (arXiv:2107.00191v1 [cs.LG])
    (2 min) While many real-world data streams imply that they change frequently in a nonstationary way, most of deep learning methods optimize neural networks on training data, and this leads to severe performance degradation when dataset shift happens. However, it is less possible to annotate or inspect newly streamed data by humans, and thus it is desired to measure model drift at inference time in an unsupervised manner. In this paper, we propose a novel method of model drift estimation by exploiting statistics of batch normalization layer on unlabeled test data. To remedy possible sampling error of streamed input data, we adopt low-rank approximation to each representational layer. We show the effectiveness of our method not only on dataset shift detection but also on model selection when there are multiple candidate models among model zoo or training trajectories in an unsupervised way. We further demonstrate the consistency of our method by comparing model drift scores between different network architectures.
  • cs.IR updates on arXiv.org

    GraphHINGE: Learning Interaction Models of Structured Neighborhood on Heterogeneous Information Network. (arXiv:2011.12683v2 [cs.IR] UPDATED)
    (2 min) Heterogeneous information network (HIN) has been widely used to characterize entities of various types and their complex relations. Recent attempts either rely on explicit path reachability to leverage path-based semantic relatedness or graph neighborhood to learn heterogeneous network representations before predictions. These weakly coupled manners overlook the rich interactions among neighbor nodes, which introduces an early summarization issue. In this paper, we propose GraphHINGE (Heterogeneous INteract and aggreGatE), which captures and aggregates the interactive patterns between each pair of nodes through their structured neighborhoods. Specifically, we first introduce Neighborhood-based Interaction (NI) module to model the interactive patterns under the same metapaths, and then extend it to Cross Neighborhood-based Interaction (CNI) module to deal with different metapaths. Next, in order to address the complexity issue on large-scale networks, we formulate the interaction modules via a convolutional framework and learn the parameters efficiently with fast Fourier transform. Furthermore, we design a novel neighborhood-based selection (NS) mechanism, a sampling strategy, to filter high-order neighborhood information based on their low-order performance. The extensive experiments on six different types of heterogeneous graphs demonstrate the performance gains by comparing with state-of-the-arts in both click-through rate prediction and top-N recommendation tasks.
    Proof of Reference(PoR): A unified informetrics based consensus mechanism. (arXiv:2107.00214v1 [cs.DL])
    (2 min) Bibliometrics is useful to analyze the research impact for measuring the research quality. Different bibliographic databases like Scopus, Web of Science, Google Scholar etc. are accessed for evaluating the trend of publications and citations from time to time. Some of these databases are free and some are subscription based. Its always debatable that which bibliographic database is better and in what terms. To provide an optimal solution to availability of multiple bibliographic databases, we have implemented a single authentic database named as ``conflate'' which can be used for fetching publication and citation trend of an author. To further strengthen the generated database and to provide the transparent system to the stakeholders, a consensus mechanism ``proof of reference (PoR)'' is proposed. Due to three consent based checks implemented in PoR, we feel that it could be considered as a authentic and honest citation data source for the calculation of unified informetrics for an author.
    Pseudo-Relevance Feedback for Multiple Representation Dense Retrieval. (arXiv:2106.11251v2 [cs.IR] UPDATED)
    (2 min) Pseudo-relevance feedback mechanisms, from Rocchio to the relevance models, have shown the usefulness of expanding and reweighting the users' initial queries using information occurring in an initial set of retrieved documents, known as the pseudo-relevant set. Recently, dense retrieval -- through the use of neural contextual language models such as BERT for analysing the documents' and queries' contents and computing their relevance scores -- has shown a promising performance on several information retrieval tasks still relying on the traditional inverted index for identifying documents relevant to a query. Two different dense retrieval families have emerged: the use of single embedded representations for each passage and query (e.g. using BERT's [CLS] token), or via multiple representations (e.g. using an embedding for each token of the query and document). In this work, we conduct the first study into the potential for multiple representation dense retrieval to be enhanced using pseudo-relevance feedback. In particular, based on the pseudo-relevant set of documents identified using a first-pass dense retrieval, we extract representative feedback embeddings (using KMeans clustering) -- while ensuring that these embeddings discriminate among passages (based on IDF) -- which are then added to the query representation. These additional feedback embeddings are shown to both enhance the effectiveness of a reranking as well as an additional dense retrieval operation. Indeed, experiments on the MSMARCO passage ranking dataset show that MAP can be improved by upto 26% on the TREC 2019 query set and 10% on the TREC 2020 query set by the application of our proposed ColBERT-PRF method on a ColBERT dense retrieval approach.
    A Search Engine for Scientific Publications: a Cybersecurity Case Study. (arXiv:2107.00082v1 [cs.AI])
    (2 min) Cybersecurity is a very challenging topic of research nowadays, as digitalization increases the interaction of people, software and services on the Internet by means of technology devices and networks connected to it. The field is broad and has a lot of unexplored ground under numerous disciplines such as management, psychology, and data science. Its large disciplinary spectrum and many significant research topics generate a considerable amount of information, making it hard for us to find what we are looking for when researching a particular subject. This work proposes a new search engine for scientific publications which combines both information retrieval and reading comprehension algorithms to extract answers from a collection of domain-specific documents. The proposed solution although being applied to the context of cybersecurity exhibited great generalization capabilities and can be easily adapted to perform under other distinct knowledge domains.
    Tweet Sentiment Quantification: An Experimental Re-Evaluation. (arXiv:2011.08091v2 [cs.CL] UPDATED)
    (2 min) Sentiment quantification is the task of estimating the relative frequency (or "prevalence") of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts; this is especially important when these texts are tweets, since most sentiment classification endeavours carried out on Twitter data actually have quantification (and not the classification of individual tweets) as their ultimate goal. It is well-known that solving quantification via "classify and count" (i.e., by classifying all unlabelled items via a standard classifier and counting the items that have been assigned to a given class) is suboptimal in terms of accuracy, and that more accurate quantification methods exist. In 2016, Gao and Sebastiani carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimental protocol followed in that work is flawed, and that its results are thus unreliable. We now re-evaluate those quantification methods on the very same datasets, this time following a now consolidated and much more robust experimental protocol, that involves 5775 as many experiments as run in the original study. Our experimentation yields results dramatically different from those obtained by Gao and Sebastiani, and thus provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.
    The Use of Bandit Algorithms in Intelligent Interactive Recommender Systems. (arXiv:2107.00161v1 [cs.IR])
    (2 min) In today's business marketplace, many high-tech Internet enterprises constantly explore innovative ways to provide optimal online user experiences for gaining competitive advantages. The great needs of developing intelligent interactive recommendation systems are indicated, which could sequentially suggest users the most proper items by accurately predicting their preferences, while receiving the up-to-date feedback to refine the recommendation results, continuously. Multi-armed bandit algorithms, which have been widely applied into various online systems, are quite capable of delivering such efficient recommendation services. However, few existing bandit models are able to adapt to new changes introduced by the modern recommender systems.
    SearchGCN: Powering Embedding Retrieval by Graph Convolution Networks for E-Commerce Search. (arXiv:2107.00525v1 [cs.IR])
    (2 min) Graph convolution networks (GCN), which recently becomes new state-of-the-art method for graph node classification, recommendation and other applications, has not been successfully applied to industrial-scale search engine yet. In this proposal, we introduce our approach, namely SearchGCN, for embedding-based candidate retrieval in one of the largest e-commerce search engine in the world. Empirical studies demonstrate that SearchGCN learns better embedding representations than existing methods, especially for long tail queries and items. Thus, SearchGCN has been deployed into JD.com's search production since July 2020.
    Embedding-based Recommender System for Job to Candidate Matching on Scale. (arXiv:2107.00221v1 [cs.IR])
    (2 min) The online recruitment matching system has been the core technology and service platform in CareerBuilder. One of the major challenges in an online recruitment scenario is to provide good matches between job posts and candidates using a recommender system on the scale. In this paper, we discussed the techniques for applying an embedding-based recommender system for the large scale of job to candidates matching. To learn the comprehensive and effective embedding for job posts and candidates, we have constructed a fused-embedding via different levels of representation learning from raw text, semantic entities and location information. The clusters of fused-embedding of job and candidates are then used to build and train the Faiss index that supports runtime approximate nearest neighbor search for candidate retrieval. After the first stage of candidate retrieval, a second stage reranking model that utilizes other contextual information was used to generate the final matching result. Both offline and online evaluation results indicate a significant improvement of our proposed two-staged embedding-based system in terms of click-through rate (CTR), quality and normalized discounted accumulated gain (nDCG), compared to those obtained from our baseline system. We further described the deployment of the system that supports the million-scale job and candidate matching process at CareerBuilder. The overall improvement of our job to candidate matching system has demonstrated its feasibility and scalability at a major online recruitment site.
  • cs.LG updates on arXiv.org

    Gym-$\mu$RTS: Toward Affordable Full Game Real-time Strategy Games Research with Deep Reinforcement Learning. (arXiv:2105.13807v2 [cs.LG] UPDATED)
    (2 min) In recent years, researchers have achieved great success in applying Deep Reinforcement Learning (DRL) algorithms to Real-time Strategy (RTS) games, creating strong autonomous agents that could defeat professional players in StarCraft~II. However, existing approaches to tackle full games have high computational costs, usually requiring the use of thousands of GPUs and CPUs for weeks. This paper has two main contributions to address this issue: 1) We introduce Gym-$\mu$RTS (pronounced "gym-micro-RTS") as a fast-to-run RL environment for full-game RTS research and 2) we present a collection of techniques to scale DRL to play full-game $\mu$RTS as well as ablation studies to demonstrate their empirical importance. Our best-trained bot can defeat every $\mu$RTS bot we tested from the past $\mu$RTS competitions when working in a single-map setting, resulting in a state-of-the-art DRL agent while only taking about 60 hours of training using a single machine (one GPU, three vCPU, 16GB RAM).
    Deep Feature Space: A Geometrical Perspective. (arXiv:2007.00062v2 [cs.CV] UPDATED)
    (2 min) One of the most prominent attributes of Neural Networks (NNs) constitutes their capability of learning to extract robust and descriptive features from high dimensional data, like images. Hence, such an ability renders their exploitation as feature extractors particularly frequent in an abundant of modern reasoning systems. Their application scope mainly includes complex cascade tasks, like multi-modal recognition and deep Reinforcement Learning (RL). However, NNs induce implicit biases that are difficult to avoid or to deal with and are not met in traditional image descriptors. Moreover, the lack of knowledge for describing the intra-layer properties -- and thus their general behavior -- restricts the further applicability of the extracted features. With the paper at hand, a novel way of visualizing and understanding the vector space before the NNs' output layer is presented, aiming to enlighten the deep feature vectors' properties under classification tasks. Main attention is paid to the nature of overfitting in the feature space and its adverse effect on further exploitation. We present the findings that can be derived from our model's formulation, and we evaluate them on realistic recognition scenarios, proving its prominence by improving the obtained results.
    Attention Meets Perturbations: Robust and Interpretable Attention with Adversarial Training. (arXiv:2009.12064v2 [cs.CL] UPDATED)
    (2 min) Although attention mechanisms have been applied to a variety of deep learning models and have been shown to improve the prediction performance, it has been reported to be vulnerable to perturbations to the mechanism. To overcome the vulnerability to perturbations in the mechanism, we are inspired by adversarial training (AT), which is a powerful regularization technique for enhancing the robustness of the models. In this paper, we propose a general training technique for natural language processing tasks, including AT for attention (Attention AT) and more interpretable AT for attention (Attention iAT). The proposed techniques improved the prediction performance and the model interpretability by exploiting the mechanisms with AT. In particular, Attention iAT boosts those advantages by introducing adversarial perturbation, which enhances the difference in the attention of the sentences. Evaluation experiments with ten open datasets revealed that AT for attention mechanisms, especially Attention iAT, demonstrated (1) the best performance in nine out of ten tasks and (2) more interpretable attention (i.e., the resulting attention correlated more strongly with gradient-based word importance) for all tasks. Additionally, the proposed techniques are (3) much less dependent on perturbation size in AT. Our code is available at https://github.com/shunk031/attention-meets-perturbation
    Temporal Difference Uncertainties as a Signal for Exploration. (arXiv:2010.02255v2 [cs.AI] UPDATED)
    (2 min) An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve function approximators, obtaining accurate uncertainty estimates is almost as challenging a problem. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate uncertainty in value that is due to uncertainty over the agent's parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on this measure directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agent's temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a curriculum that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard exploration tasks, including Deep Sea and Atari 2600 environments and find that our proposed form of exploration facilitates both diverse and deep exploration.
    BASGD: Buffered Asynchronous SGD for Byzantine Learning. (arXiv:2003.00937v3 [cs.LG] UPDATED)
    (2 min) Distributed learning has become a hot research topic, due to its wide application in cluster-based large-scale learning, federated learning, edge computing and so on. Most distributed learning methods assume no error and attack on the workers. However, many unexpected cases, such as communication error and even malicious attack, may happen in real applications. Hence, Byzantine learning (BL), which refers to distributed learning with attack or error, has recently attracted much attention. Most existing BL methods are synchronous, which will result in slow convergence when there exist heterogeneous workers. Furthermore, in some applications like federated learning and edge computing, synchronization cannot even be performed most of the time due to the online workers (clients or edge servers). Hence, asynchronous BL (ABL) is more general and practical than synchronous BL (SBL). To the best of our knowledge, there exist only two ABL methods. One of them cannot resist malicious attack. The other needs to store some training instances on the server, which has the privacy leak problem. In this paper, we propose a novel method, called buffered asynchronous stochastic gradient descent (BASGD), for BL. BASGD is an asynchronous method. Furthermore, BASGD has no need to store any training instances on the server, and hence can preserve privacy in ABL. BASGD is theoretically proved to have the ability of resisting against error and malicious attack. Moreover, BASGD has a similar theoretical convergence rate to that of vanilla asynchronous SGD (ASGD), with an extra constant variance. Empirical results show that BASGD can significantly outperform vanilla ASGD and other ABL baselines, when there exists error or attack on workers.
    A Fully Bayesian Gradient-Free Supervised Dimension Reduction Method using Gaussian Processes. (arXiv:2008.03534v2 [stat.ML] UPDATED)
    (2 min) Modern day engineering problems are ubiquitously characterized by sophisticated computer codes that map parameters or inputs to an underlying physical process. In other situations, experimental setups are used to model the physical process in a laboratory, ensuring high precision while being costly in materials and logistics. In both scenarios, only limited amount of data can be generated by querying the expensive information source at a finite number of inputs or designs. This problem is compounded further in the presence of a high-dimensional input space. State-of-the-art parameter space dimension reduction methods, such as active subspace, aim to identify a subspace of the original input space that is sufficient to explain the output response. These methods are restricted by their reliance on gradient evaluations or copious data, making them inadequate to expensive problems without direct access to gradients. The proposed methodology is gradient-free and fully Bayesian, as it quantifies uncertainty in both the low-dimensional subspace and the surrogate model parameters. This enables a full quantification of epistemic uncertainty and robustness to limited data availability. It is validated on multiple datasets from engineering and science and compared to two other state-of-the-art methods based on four aspects: a) recovery of the active subspace, b) deterministic prediction accuracy, c) probabilistic prediction accuracy, and d) training time. The comparison shows that the proposed method improves the active subspace recovery and predictive accuracy, in both the deterministic and probabilistic sense, when only few model observations are available for training, at the cost of increased training time.
    A New Theoretical Framework for Fast and Accurate Online Decision-Making. (arXiv:1905.11797v5 [cs.LG] UPDATED)
    (2 min) We introduce a novel theoretical framework for Return On Investment (ROI) maximization in repeated decision-making. Our setting is motivated by the use case of companies that regularly receive proposals for technological innovations and want to quickly decide whether they are worth implementing. We design an algorithm for learning ROI-maximizing decision-making policies over a sequence of innovation proposals. Our algorithm provably converges to an optimal policy in class $\Pi$ at a rate of order $\min\big\{1/(N\Delta^2),N^{-1/3}\}$, where $N$ is the number of innovations and $\Delta$ is the suboptimality gap in $\Pi$. A significant hurdle of our formulation, which sets it aside from other online learning problems such as bandits, is that running a policy does not provide an unbiased estimate of its performance.
    Mandoline: Model Evaluation under Distribution Shift. (arXiv:2107.00643v1 [cs.LG])
    (2 min) Machine learning models are often deployed in different settings than they were trained and validated on, posing a challenge to practitioners who wish to predict how well the deployed model will perform on a target distribution. If an unlabeled sample from the target distribution is available, along with a labeled sample from a possibly different source distribution, standard approaches such as importance weighting can be applied to estimate performance on the target. However, importance weighting struggles when the source and target distributions have non-overlapping support or are high-dimensional. Taking inspiration from fields such as epidemiology and polling, we develop Mandoline, a new evaluation framework that mitigates these issues. Our key insight is that practitioners may have prior knowledge about the ways in which the distribution shifts, which we can use to better guide the importance weighting procedure. Specifically, users write simple "slicing functions" - noisy, potentially correlated binary functions intended to capture possible axes of distribution shift - to compute reweighted performance estimates. We further describe a density ratio estimation framework for the slices and show how its estimation error scales with slice quality and dataset size. Empirical validation on NLP and vision tasks shows that \name can estimate performance on the target distribution up to $3\times$ more accurately compared to standard baselines.
    Asymptotically Exact and Fast Gaussian Copula Models for Imputation of Mixed Data Types. (arXiv:2102.02642v2 [stat.ML] UPDATED)
    (2 min) Missing values with mixed data types is a common problem in a large number of machine learning applications such as processing of surveys and in different medical applications. Recently, Gaussian copula models have been suggested as a means of performing imputation of missing values using a probabilistic framework. While the present Gaussian copula models have shown to yield state of the art performance, they have two limitations: they are based on an approximation that is fast but may be imprecise and they do not support unordered multinomial variables. We address the first limitation using direct and arbitrarily precise approximations both for model estimation and imputation by using randomized quasi-Monte Carlo procedures. The method we provide has lower errors for the estimated model parameters and the imputed values, compared to previously proposed methods. We also extend the previous Gaussian copula models to include unordered multinomial variables in addition to the present support of ordinal, binary, and continuous variables.
    Classical Planning in Deep Latent Space. (arXiv:2107.00110v1 [cs.AI])
    (2 min) Current domain-independent, classical planners require symbolic models of the problem domain and instance as input, resulting in a knowledge acquisition bottleneck. Meanwhile, although deep learning has achieved significant success in many fields, the knowledge is encoded in a subsymbolic representation which is incompatible with symbolic systems such as planners. We propose Latplan, an unsupervised architecture combining deep learning and classical planning. Given only an unlabeled set of image pairs showing a subset of transitions allowed in the environment (training inputs), Latplan learns a complete propositional PDDL action model of the environment. Later, when a pair of images representing the initial and the goal states (planning inputs) is given, Latplan finds a plan to the goal state in a symbolic latent space and returns a visualized plan execution. We evaluate Latplan using image-based versions of 6 planning domains: 8-puzzle, 15-Puzzle, Blocksworld, Sokoban and Two variations of LightsOut.
    Audiovisual Singing Voice Separation. (arXiv:2107.00231v1 [cs.SD])
    (2 min) Separating a song into vocal and accompaniment components is an active research topic, and recent years witnessed an increased performance from supervised training using deep learning techniques. We propose to apply the visual information corresponding to the singers' vocal activities to further improve the quality of the separated vocal signals. The video frontend model takes the input of mouth movement and fuses it into the feature embeddings of an audio-based separation framework. To facilitate the network to learn audiovisual correlation of singing activities, we add extra vocal signals irrelevant to the mouth movement to the audio mixture during training. We create two audiovisual singing performance datasets for training and evaluation, respectively, one curated from audition recordings on the Internet, and the other recorded in house. The proposed method outperforms audio-based methods in terms of separation quality on most test recordings. This advantage is especially pronounced when there are backing vocals in the accompaniment, which poses a great challenge for audio-only methods.
    Well-calibrated prediction intervals for regression problems. (arXiv:2107.00363v1 [stat.ML])
    (2 min) Over the last few decades, various methods have been proposed for estimating prediction intervals in regression settings, including Bayesian methods, ensemble methods, direct interval estimation methods and conformal prediction methods. An important issue is the calibration of these methods: the generated prediction intervals should have a predefined coverage level, without being overly conservative. In this work, we review the above four classes of methods from a conceptual and experimental point of view. Results on benchmark data sets from various domains highlight large fluctuations in performance from one data set to another. These observations can be attributed to the violation of certain assumptions that are inherent to some classes of methods. We illustrate how conformal prediction can be used as a general calibration procedure for methods that deliver poor results without a calibration step.
    Inverse Design of Grating Couplers Using the Policy Gradient Method from Reinforcement Learning. (arXiv:2107.00088v1 [physics.comp-ph])
    (0 min) We present a proof-of-concept technique for the inverse design of electromagnetic devices motivated by the policy gradient method in reinforcement learning, named PHORCED (PHotonic Optimization using REINFORCE Criteria for Enhanced Design). This technique uses a probabilistic generative neural network interfaced with an electromagnetic solver to assist in the design of photonic devices, such as grating couplers. We show that PHORCED obtains better performing grating coupler designs than local gradient-based inverse design via the adjoint method, while potentially providing faster convergence over competing state-of-the-art generative methods. Furthermore, we implement transfer learning with PHORCED, demonstrating that a neural network trained to optimize 8$^\circ$ grating couplers can then be re-trained on grating couplers with alternate scattering angles while requiring >$10\times$ fewer simulations than control cases.
    Markov Decision Process modeled with Bandits for Sequential Decision Making in Linear-flow. (arXiv:2107.00204v1 [cs.LG])
    (2 min) In membership/subscriber acquisition and retention, we sometimes need to recommend marketing content for multiple pages in sequence. Different from general sequential decision making process, the use cases have a simpler flow where customers per seeing recommended content on each page can only return feedback as moving forward in the process or dropping from it until a termination state. We refer to this type of problems as sequential decision making in linear--flow. We propose to formulate the problem as an MDP with Bandits where Bandits are employed to model the transition probability matrix. At recommendation time, we use Thompson sampling (TS) to sample the transition probabilities and allocate the best series of actions with analytical solution through exact dynamic programming. The way that we formulate the problem allows us to leverage TS's efficiency in balancing exploration and exploitation and Bandit's convenience in modeling actions' incompatibility. In the simulation study, we observe the proposed MDP with Bandits algorithm outperforms Q-learning with $\epsilon$-greedy and decreasing $\epsilon$, independent Bandits, and interaction Bandits. We also find the proposed algorithm's performance is the most robust to changes in the across-page interdependence strength.
    A first look into the carbon footprint of federated learning. (arXiv:2102.07627v3 [cs.LG] UPDATED)
    (2 min) Despite impressive results, deep learning-based technologies also raise severe privacy and environmental concerns induced by the training procedure often conducted in datacenters. In response, alternatives to centralized training such as Federated Learning (FL) have emerged. Perhaps unexpectedly, FL, in particular, is starting to be deployed at a global scale by companies that must adhere to new legal demands and policies originating from governments and civil society for privacy protection. However, the potential environmental impact related to FL remains unclear and unexplored. This paper offers the first-ever systematic study of the carbon footprint of FL. First, we propose a rigorous model to quantify the carbon footprint, hence facilitating the investigation of the relationship between FL design and carbon emissions. Then, we compare the carbon footprint of FL to traditional centralized learning. Our findings show that FL, despite being slower to converge in some cases, may result in a comparatively greener impact than a centralized equivalent setup. We performed extensive experiments across different types of datasets, settings, and various deep learning models with FL. Finally, we highlight and connect the reported results to the future challenges and trends in FL to reduce its environmental impact, including algorithms efficiency, hardware capabilities, and stronger industry transparency.
    Attribute Selection using Contranominal Scales. (arXiv:2106.10978v2 [cs.AI] UPDATED)
    (0 min) Formal Concept Analysis (FCA) allows to analyze binary data by deriving concepts and ordering them in lattices. One of the main goals of FCA is to enable humans to comprehend the information that is encapsulated in the data; however, the large size of concept lattices is a limiting factor for the feasibility of understanding the underlying structural properties. The size of such a lattice depends on the number of subcontexts in the corresponding formal context that are isomorphic to a contranominal scale of high dimension. In this work, we propose the algorithm ContraFinder that enables the computation of all contranominal scales of a given formal context. Leveraging this algorithm, we introduce delta-adjusting, a novel approach in order to decrease the number of contranominal scales in a formal context by the selection of an appropriate attribute subset. We demonstrate that delta-adjusting a context reduces the size of the hereby emerging sub-semilattice and that the implication set is restricted to meaningful implications. This is evaluated with respect to its associated knowledge by means of a classification task. Hence, our proposed technique strongly improves understandability while preserving important conceptual structures.
    Robust Asymmetric Learning in POMDPs. (arXiv:2012.15566v3 [cs.LG] UPDATED)
    (0 min) Policies for partially observed Markov decision processes can be efficiently learned by imitating policies for the corresponding fully observed Markov decision processes. Unfortunately, existing approaches for this kind of imitation learning have a serious flaw: the expert does not know what the trainee cannot see, and so may encourage actions that are sub-optimal, even unsafe, under partial information. We derive an objective to instead train the expert to maximize the expected reward of the imitating agent policy, and use it to construct an efficient algorithm, adaptive asymmetric DAgger (A2D), that jointly trains the expert and the agent. We show that A2D produces an expert policy that the agent can safely imitate, in turn outperforming policies learned by imitating a fixed expert.
    Imitation Learning from Pixel-Level Demonstrations by HashReward. (arXiv:1909.03773v3 [cs.LG] UPDATED)
    (2 min) One of the key issues for imitation learning lies in making policy learned from limited samples to generalize well in the whole state-action space. This problem is much more severe in high-dimensional state environments, such as game playing with raw pixel inputs. Under this situation, even state-of-the-art adversary-based imitation learning algorithms fail. Through empirical studies, we find that the main cause lies in the failure of training a powerful discriminator to generate meaningful rewards in high-dimensional environments. Although it seems that dimensionality reduction can help, a straightforward application of off-the-shelf methods cannot achieve good performance. In this work, we show in theory that the balance between dimensionality reduction and discriminative training is essential for effective learning. To achieve this target, we propose HashReward, which utilizes the idea of supervised hashing to realize such an ideal balance. Experimental results show that HashReward could outperform state-of-the-art methods for a large gap under the challenging high-dimensional environments.
    Decoupled Exploration and Exploitation Policies for Sample-Efficient Reinforcement Learning. (arXiv:2101.09458v2 [cs.LG] UPDATED)
    (2 min) Despite the close connection between exploration and sample efficiency, most state of the art reinforcement learning algorithms include no considerations for exploration beyond maximizing the entropy of the policy. In this work we address this seeming missed opportunity. We observe that the most common formulation of directed exploration in deep RL, known as bonus-based exploration (BBE), suffers from bias and slow coverage in the few-sample regime. This causes BBE to be actively detrimental to policy learning in many control tasks. We show that by decoupling the task policy from the exploration policy, directed exploration can be highly effective for sample-efficient continuous control. Our method, Decoupled Exploration and Exploitation Policies (DEEP), can be combined with any off-policy RL algorithm without modification. When used in conjunction with soft actor-critic, DEEP incurs no performance penalty in densely-rewarding environments. On sparse environments, DEEP gives a several-fold improvement in data efficiency due to better exploration.
    Generalized Dirichlet-process-means for $f$-separable distortion measures. (arXiv:1901.11331v3 [cs.LG] UPDATED)
    (2 min) DP-means clustering was obtained as an extension of $K$-means clustering. While it is implemented with a simple and efficient algorithm, it can estimate the number of clusters simultaneously. However, DP-means is specifically designed for the average distortion measure. Therefore, it is vulnerable to outliers in data, and can cause large maximum distortion in clusters. In this work, we extend the objective function of the DP-means to $f$-separable distortion measures and propose a unified learning algorithm to overcome the above problems by selecting the function $f$. Further, the influence function of the estimated cluster center is analyzed to evaluate the robustness against outliers. We demonstrate the performance of the generalized method by numerical experiments using real datasets.
    Auto-encoding brain networks with applications to analyzing large-scale brain imaging datasets. (arXiv:1911.02728v2 [stat.ML] UPDATED)
    (2 min) There has been huge interest in studying human brain connectomes inferred from different imaging modalities and exploring their relationship with human traits, such as cognition. Brain connectomes are usually represented as networks, with nodes corresponding to different regions of interest (ROIs) and edges to connection strengths between ROIs. Due to the high-dimensionality and non-Euclidean nature of networks, it is challenging to depict their population distribution and relate them to human traits. Current approaches focus on summarizing the network using either pre-specified topological features or principal components analysis (PCA). In this paper, building on recent advances in deep learning, we develop a nonlinear latent factor model to characterize the population distribution of brain graphs and infer the relationships between brain structural connectomes and human traits. We refer to our method as Graph AuTo-Encoding (GATE). We applied GATE to two large-scale brain imaging datasets, the Adolescent Brain Cognitive Development (ABCD) study and the Human Connectome Project (HCP) for adults, to understand the structural brain connectome and its relationship with cognition. Numerical results demonstrate huge advantages of GATE over competitors in terms of prediction accuracy, statistical inference and computing efficiency. We found that structural connectomes have a stronger association with a wide range of human cognitive traits than was apparent using previous approaches.
    Augmented Sliced Wasserstein Distances. (arXiv:2006.08812v4 [cs.LG] UPDATED)
    (2 min) While theoretically appealing, the application of the Wasserstein distance to large-scale machine learning problems has been hampered by its prohibitive computational cost. The sliced Wasserstein distance and its variants improve the computational efficiency through the random projection, yet they suffer from low accuracy if the number of projections is not sufficiently large, because the majority of projections result in trivially small values. In this work, we propose a new family of distance metrics, called augmented sliced Wasserstein distances (ASWDs), constructed by first mapping samples to higher-dimensional hypersurfaces parameterized by neural networks. It is derived from a key observation that (random) linear projections of samples residing on these hypersurfaces would translate to much more flexible nonlinear projections in the original sample space, so they can capture complex structures of the data distribution. We show that the hypersurfaces can be optimized by gradient ascent efficiently. We provide the condition under which the ASWD is a valid metric and show that this can be obtained by an injective neural network architecture. Numerical results demonstrate that the ASWD significantly outperforms other Wasserstein variants for both synthetic and real-world problems.
    Decentralized Learning for Channel Allocation in IoT Networks over Unlicensed Bandwidth as a Contextual Multi-player Multi-armed Bandit Game. (arXiv:2003.13314v3 [cs.MA] UPDATED)
    (0 min) We study a decentralized channel allocation problem in an ad-hoc Internet of Things network underlaying on the spectrum licensed to a primary cellular network. In the considered network, the impoverished channel sensing/probing capability and computational resource on the IoT devices make them difficult to acquire the detailed Channel State Information (CSI) for the shared multiple channels. In practice, the unknown patterns of the primary users' transmission activities and the time-varying CSI (e.g., due to small-scale fading or device mobility) also cause stochastic changes in the channel quality. Decentralized IoT links are thus expected to learn channel conditions online based on partial observations, while acquiring no information about the channels that they are not operating on. They also have to reach an efficient, collision-free solution of channel allocation with limited coordination. Our study maps this problem into a contextual multi-player, multi-armed bandit game, and proposes a purely decentralized, three-stage policy learning algorithm through trial-and-error. Theoretical analyses shows that the proposed scheme guarantees the IoT links to jointly converge to the social optimal channel allocation with a sub-linear (i.e., polylogarithmic) regret with respect to the operational time. Simulations demonstrate that it strikes a good balance between efficiency and network scalability when compared with the other state-of-the-art decentralized bandit algorithms.
    To Split or Not to Split: The Impact of Disparate Treatment in Classification. (arXiv:2002.04788v3 [cs.LG] UPDATED)
    (0 min) Disparate treatment occurs when a machine learning model yields different decisions for individuals based on a sensitive attribute (e.g., age, sex). In domains where prediction accuracy is paramount, it could potentially be acceptable to fit a model which exhibits disparate treatment. To evaluate the effect of disparate treatment, we compare the performance of split classifiers (i.e., classifiers trained and deployed separately on each group) with group-blind classifiers (i.e., classifiers which do not use a sensitive attribute). We introduce the benefit-of-splitting for quantifying the performance improvement by splitting classifiers. Computing the benefit-of-splitting directly from its definition could be intractable since it involves solving optimization problems over an infinite-dimensional functional space. Under different performance measures, we (i) prove an equivalent expression for the benefit-of-splitting which can be efficiently computed by solving small-scale convex programs; (ii) provide sharp upper and lower bounds for the benefit-of-splitting which reveal precise conditions where a group-blind classifier will always suffer from a non-trivial performance gap from the split classifiers. In the finite sample regime, splitting is not necessarily beneficial and we provide data-dependent bounds to understand this effect. Finally, we validate our theoretical results through numerical experiments on both synthetic and real-world datasets.
    Forecasting directional movements of stock prices for intraday trading using LSTM and random forests. (arXiv:2004.10178v2 [cs.LG] UPDATED)
    (0 min) We employ both random forests and LSTM networks (more precisely CuDNNLSTM) as training methodologies to analyze their effectiveness in forecasting out-of-sample directional movements of constituent stocks of the S&P 500 from January 1993 till December 2018 for intraday trading. We introduce a multi-feature setting consisting not only of the returns with respect to the closing prices, but also with respect to the opening prices and intraday returns. As trading strategy, we use Krauss et al. (2017) and Fischer & Krauss (2018) as benchmark. On each trading day, we buy the 10 stocks with the highest probability and sell short the 10 stocks with the lowest probability to outperform the market in terms of intraday returns -- all with equal monetary weight. Our empirical results show that the multi-feature setting provides a daily return, prior to transaction costs, of 0.64% using LSTM networks, and 0.54% using random forests. Hence we outperform the single-feature setting in Fischer & Krauss (2018) and Krauss et al. (2017) consisting only of the daily returns with respect to the closing prices, having corresponding daily returns of 0.41% and of 0.39% with respect to LSTM and random forests, respectively.
    Sequential prediction under log-loss and misspecification. (arXiv:2102.00050v2 [cs.LG] UPDATED)
    (0 min) We consider the question of sequential prediction under the log-loss in terms of cumulative regret. Namely, given a hypothesis class of distributions, learner sequentially predicts the (distribution of the) next letter in sequence and its performance is compared to the baseline of the best constant predictor from the hypothesis class. The well-specified case corresponds to an additional assumption that the data-generating distribution belongs to the hypothesis class as well. Here we present results in the more general misspecified case. Due to special properties of the log-loss, the same problem arises in the context of competitive-optimality in density estimation, and model selection. For the $d$-dimensional Gaussian location hypothesis class, we show that cumulative regrets in the well-specified and misspecified cases asymptotically coincide. In other words, we provide an $o(1)$ characterization of the distribution-free (or PAC) regret in this case -- the first such result as far as we know. We recall that the worst-case (or individual-sequence) regret in this case is larger by an additive constant ${d\over 2} + o(1)$. Surprisingly, neither the traditional Bayesian estimators, nor the Shtarkov's normalized maximum likelihood achieve the PAC regret and our estimator requires special "robustification" against heavy-tailed data. In addition, we show two general results for misspecified regret: the existence and uniqueness of the optimal estimator, and the bound sandwiching the misspecified regret between well-specified regrets with (asymptotically) close hypotheses classes.
    Action Transformer: A Self-Attention Model for Short-Time Human Action Recognition. (arXiv:2107.00606v1 [cs.CV])
    (0 min) Deep neural networks based purely on attention have been successful across several domains, relying on minimal architectural priors from the designer. In Human Action Recognition (HAR), attention mechanisms have been primarily adopted on top of standard convolutional or recurrent layers, improving the overall generalization capability. In this work, we introduce Action Transformer (AcT), a simple, fully self-attentional architecture that consistently outperforms more elaborated networks that mix convolutional, recurrent, and attentive layers. In order to limit computational and energy requests, building on previous human action recognition research, the proposed approach exploits 2D pose representations over small temporal windows, providing a low latency solution for accurate and effective real-time performance. Moreover, we open-source MPOSE2021, a new large-scale dataset, as an attempt to build a formal training and evaluation benchmark for real-time short-time human action recognition. Extensive experimentation on MPOSE2021 with our proposed methodology and several previous architectural solutions proves the effectiveness of the AcT model and poses the base for future work on HAR.
    Graph Self Supervised Learning: the BT, the HSIC, and the VICReg. (arXiv:2105.12247v3 [cs.LG] UPDATED)
    (0 min) Self-supervised learning and pre-training strategies have developed over the last few years especially for Convolutional Neural Networks (CNNs). Recently application of such methods can also be noticed for Graph Neural Networks (GNNs) . In this paper, we have used a graph based self-supervised learning strategy with different loss functions (Barlow Twins[Zbontar et al., 2021], HSIC[Tsai et al., 2021], VICReg[Bardes et al., 2021]) which have shown promising results when applied with CNNs previously. We have also proposed a hybrid loss function combining the advantages of VICReg and HSIC and called it as VICRegHSIC. The performance of these aforementioned methods have been compared when applied to different datasets such as MUTAG, PROTEINS and IMDB-Binary. Moreover, the impact of different batch sizes, projector dimensions and data augmentation strategies have also been explored
    On the Convergence of Stochastic Extragradient for Bilinear Games with Restarted Iteration Averaging. (arXiv:2107.00464v1 [math.OC])
    (0 min) We study the stochastic bilinear minimax optimization problem, presenting an analysis of the Stochastic ExtraGradient (SEG) method with constant step size, and presenting variations of the method that yield favorable convergence. We first note that the last iterate of the basic SEG method only contracts to a fixed neighborhood of the Nash equilibrium, independent of the step size. This contrasts sharply with the standard setting of minimization where standard stochastic algorithms converge to a neighborhood that vanishes in proportion to the square-root (constant) step size. Under the same setting, however, we prove that when augmented with iteration averaging, SEG provably converges to the Nash equilibrium, and such a rate is provably accelerated by incorporating a scheduled restarting procedure. In the interpolation setting, we achieve an optimal convergence rate up to tight constants. We present numerical experiments that validate our theoretical findings and demonstrate the effectiveness of the SEG method when equipped with iteration averaging and restarting.
    Conditional independence for pretext task selection in Self-supervised speech representation learning. (arXiv:2104.07388v2 [eess.AS] UPDATED)
    (0 min) Through solving pretext tasks, self-supervised learning (SSL) leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. A common pretext task consists in pretraining a SSL model on pseudo-labels derived from the original signal. This technique is particularly relevant for speech data where various meaningful signal processing features may serve as pseudo-labels. However, the process of selecting pseudo-labels, for speech or other types of data, remains mostly unexplored and currently relies on observing the results on the final downstream task. Nevertheless, this methodology is not sustainable at scale due to substantial computational (hence carbon) costs. Thus, this paper introduces a practical and theoretical framework to select relevant pseudo-labels with respect to a given downstream task. More precisely, we propose a functional estimator of the pseudo-label utility grounded in the conditional independence theory, which does not require any training. The experiments conducted on speaker recognition and automatic speech recognition validate our estimator, showing a significant correlation between the performance observed on the downstream task and the utility estimates obtained with our approach, facilitating the prospection of relevant pseudo-labels for self-supervised speech representation learning.
    Learning deep autoregressive models for hierarchical data. (arXiv:2104.13853v3 [cs.LG] UPDATED)
    (0 min) We propose a model for hierarchical structured data as an extension to the stochastic temporal convolutional network. The proposed model combines an autoregressive model with a hierarchical variational autoencoder and downsampling to achieve superior computational complexity. We evaluate the proposed model on two different types of sequential data: speech and handwritten text. The results are promising with the proposed model achieving state-of-the-art performance.
    Neural Networks as Geometric Chaotic Maps. (arXiv:1912.05081v4 [cs.LG] UPDATED)
    (0 min) The use of artificial neural networks as models of chaotic dynamics has been rapidly expanding. Still, a theoretical understanding of how neural networks learn chaos is lacking. Here, we employ a geometric perspective to show that neural networks can efficiently model chaotic dynamics by becoming structurally chaotic themselves. We first confirm neural network's efficiency in emulating chaos by showing that a parsimonious neural network trained only on few data points can reconstruct strange attractors, extrapolate outside training data boundaries, and accurately predict local divergence rates. We then posit that the trained network's map comprises sequential geometric stretching, rotation, and compression operations. These geometric operations indicate topological mixing and chaos, explaining why neural networks are naturally suitable to emulate chaotic dynamics.
    Spotting adversarial samples for speaker verification by neural vocoders. (arXiv:2107.00309v1 [cs.SD])
    (2 min) Automatic speaker verification (ASV), one of the most important technology for biometric identification, has been widely adopted in security-critic applications, including transaction authentication and access control. However, previous works have shown ASV is seriously vulnerable to recently emerged adversarial attacks, yet effective countermeasures against them are limited. In this paper, we adopt neural vocoders to spot adversarial samples for ASV. We use neural vocoder to re-synthesize audio and find that the difference between the ASV scores for the original and re-synthesized audio is a good indicator to distinguish genuine and adversarial samples. As the very beginning work in this direction of detecting adversarial samples for ASV, there is no reliable baseline for comparison. So we first implement Griffin-Lim for detection and set it as our baseline. The proposed method accomplishes effective detection performance and outperforms all the baselines in all the settings. We also show the neural vocoder adopted in the detection framework is dataset independent. Our codes will be made open-source for future works to do comparison.
    ControlBurn: Feature Selection by Sparse Forests. (arXiv:2107.00219v1 [cs.LG])
    (2 min) Tree ensembles distribute feature importance evenly amongst groups of correlated features. The average feature ranking of the correlated group is suppressed, which reduces interpretability and complicates feature selection. In this paper we present ControlBurn, a feature selection algorithm that uses a weighted LASSO-based feature selection method to prune unnecessary features from tree ensembles, just as low-intensity fire reduces overgrown vegetation. Like the linear LASSO, ControlBurn assigns all the feature importance of a correlated group of features to a single feature. Moreover, the algorithm is efficient and only requires a single training iteration to run, unlike iterative wrapper-based feature selection methods. We show that ControlBurn performs substantially better than feature selection methods with comparable computational costs on datasets with correlated features.
    Algebraic Neural Networks: Stability to Deformations. (arXiv:2009.01433v5 [cs.LG] UPDATED)
    (0 min) We study algebraic neural networks (AlgNNs) with commutative algebras which unify diverse architectures such as Euclidean convolutional neural networks, graph neural networks, and group neural networks under the umbrella of algebraic signal processing. An AlgNN is a stacked layered information processing structure where each layer is conformed by an algebra, a vector space and a homomorphism between the algebra and the space of endomorphisms of the vector space. Signals are modeled as elements of the vector space and are processed by convolutional filters that are defined as the images of the elements of the algebra under the action of the homomorphism. We analyze stability of algebraic filters and AlgNNs to deformations of the homomorphism and derive conditions on filters that lead to Lipschitz stable operators. We conclude that stable algebraic filters have frequency responses -- defined as eigenvalue domain representations -- whose derivative is inversely proportional to the frequency -- defined as eigenvalue magnitudes. It follows that for a given level of discriminability, AlgNNs are more stable than algebraic filters, thereby explaining their better empirical performance. This same phenomenon has been proven for Euclidean convolutional neural networks and graph neural networks. Our analysis shows that this is a deep algebraic property shared by a number of architectures.
    A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection. (arXiv:1907.09693v6 [cs.LG] UPDATED)
    (2 min) Federated learning has been a hot research topic in enabling the collaborative training of machine learning models among different organizations under the privacy restrictions. As researchers try to support more machine learning models with different privacy-preserving approaches, there is a requirement in developing systems and infrastructures to ease the development of various federated learning algorithms. Similar to deep learning systems such as PyTorch and TensorFlow that boost the development of deep learning, federated learning systems (FLSs) are equivalently important, and face challenges from various aspects such as effectiveness, efficiency, and privacy. In this survey, we conduct a comprehensive review on federated learning systems. To achieve smooth flow and guide future research, we introduce the definition of federated learning systems and analyze the system components. Moreover, we provide a thorough categorization for federated learning systems according to six different aspects, including data distribution, machine learning model, privacy mechanism, communication architecture, scale of federation and motivation of federation. The categorization can help the design of federated learning systems as shown in our case studies. By systematically summarizing the existing federated learning systems, we present the design factors, case studies, and future research opportunities.
    Learning How to Search: Generating Effective Test Cases Through Adaptive Fitness Function Selection. (arXiv:2102.04822v2 [cs.SE] UPDATED)
    (0 min) Search-based test generation is guided by feedback from one or more fitness functions -- scoring functions that judge solution optimality. Choosing informative fitness functions is crucial to meeting the goals of a tester. Unfortunately, many goals - such as forcing the class-under-test to throw exceptions, increasing test suite diversity, and attaining Strong Mutation Coverage - do not have effective fitness function formulations. We propose that meeting such goals requires treating fitness function identification as a secondary optimization step. An adaptive algorithm that can vary the selection of fitness functions could adjust its selection throughout the generation process to maximize goal attainment, based on the current population of test suites. To test this hypothesis, we have implemented two reinforcement learning algorithms in the EvoSuite unit test generation framework, and used these algorithms to dynamically set the fitness functions used during generation for the three goals identified above. We have evaluated our framework, EvoSuiteFIT, on a set of Java case examples. EvoSuiteFIT techniques attain significant improvements for two of the three goals, and show limited improvements on the third when the number of generations of evolution is fixed. Additionally, for two of the three goals, EvoSuiteFIT detects faults missed by the other techniques. The ability to adjust fitness functions allows strategic choices that efficiently produce more effective test suites, and examining these choices offers insight into how to attain our testing goals. We find that adaptive fitness function selection is a powerful technique to apply when an effective fitness function does not already exist for achieving a testing goal.
    Differentiable Sparsification for Deep Neural Networks. (arXiv:1910.03201v5 [cs.LG] UPDATED)
    (2 min) Deep neural networks have relieved a great deal of burden on human experts in relation to feature engineering. However, comparable efforts are instead required to determine effective architectures. In addition, as the sizes of networks have grown overly large, a considerable amount of resources is also invested in reducing the sizes. The sparsification of an over-complete model addresses these problems as it removes redundant components and connections. In this study, we propose a fully differentiable sparsification method for deep neural networks which allows parameters to be zero during training via stochastic gradient descent. Thus, the proposed method can learn the sparsified structure and weights of a network in an end-to-end manner. The method is directly applicable to various modern deep neural networks and imposes minimum modification to existing models. To the best of our knowledge, this is the first fully [sub-]differentiable sparsification method that zeroes out parameters. It provides a foundation for future structure learning and model compression methods.
    Information-theoretic Task Selection for Meta-Reinforcement Learning. (arXiv:2011.01054v2 [cs.LG] UPDATED)
    (0 min) In Meta-Reinforcement Learning (meta-RL) an agent is trained on a set of tasks to prepare for and learn faster in new, unseen, but related tasks. The training tasks are usually hand-crafted to be representative of the expected distribution of test tasks and hence all used in training. We show that given a set of training tasks, learning can be both faster and more effective (leading to better performance in the test tasks), if the training tasks are appropriately selected. We propose a task selection algorithm, Information-Theoretic Task Selection (ITTS), based on information theory, which optimizes the set of tasks used for training in meta-RL, irrespectively of how they are generated. The algorithm establishes which training tasks are both sufficiently relevant for the test tasks, and different enough from one another. We reproduce different meta-RL experiments from the literature and show that ITTS improves the final performance in all of them.
    Spatial Dependency Parsing for Semi-Structured Document Information Extraction. (arXiv:2005.00642v3 [cs.CL] UPDATED)
    (0 min) Information Extraction (IE) for semi-structured document images is often approached as a sequence tagging problem by classifying each recognized input token into one of the IOB (Inside, Outside, and Beginning) categories. However, such problem setup has two inherent limitations that (1) it cannot easily handle complex spatial relationships and (2) it is not suitable for highly structured information, which are nevertheless frequently observed in real-world document images. To tackle these issues, we first formulate the IE task as spatial dependency parsing problem that focuses on the relationship among text tokens in the documents. Under this setup, we then propose SPADE (SPAtial DEpendency parser) that models highly complex spatial relationships and an arbitrary number of information layers in the documents in an end-to-end manner. We evaluate it on various kinds of documents such as receipts, name cards, forms, and invoices, and show that it achieves a similar or better performance compared to strong baselines including BERT-based IOB taggger.
    Dynamic Batch Learning in High-Dimensional Sparse Linear Contextual Bandits. (arXiv:2008.11918v3 [stat.ML] UPDATED)
    (0 min) We study the problem of dynamic batch learning in high-dimensional sparse linear contextual bandits, where a decision maker can only adapt decisions at a batch level. In particular, the decision maker, only observing rewards at the end of each batch, dynamically decides how many individuals to include in the next batch (at the current batch's end) and what personalized action-selection scheme to adopt within the batch. Such batch constraints are ubiquitous in a variety of practical contexts, including personalized product offerings in marketing and medical treatment selection in clinical trials. We characterize the fundamental learning limit in this problem via a novel lower bound analysis and provide a simple, exploration-free algorithm that uses the LASSO estimator, which achieves the minimax optimal performance characterized by the lower bound (up to log factors). To our best knowledge, our work provides the first inroad into a rigorous understanding of dynamic batch learning with high-dimensional covariates. We also demonstrate the efficacy of our algorithm on both synthetic data and the Warfarin medical dosing data. The empirical results show that with three batches (hence only two opportunities to adapt), our algorithm already performs comparably (in terms of statistical performance) to the state-of-the-art fully online high-dimensional linear contextual bandits algorithm. As an added bonus, since our algorithm operates in batches, it is orders of magnitudes faster than fully online learning algorithms. As such, our algorithm provides a desirable candidate for practical data-driven personalized decision making problems, where limited adaptivity is often a hard constraint.
    Training Interpretable Convolutional Neural Networks by Differentiating Class-specific Filters. (arXiv:2007.08194v3 [cs.CV] UPDATED)
    (0 min) Convolutional neural networks (CNNs) have been successfully used in a range of tasks. However, CNNs are often viewed as "black-box" and lack of interpretability. One main reason is due to the filter-class entanglement -- an intricate many-to-many correspondence between filters and classes. Most existing works attempt post-hoc interpretation on a pre-trained model, while neglecting to reduce the entanglement underlying the model. In contrast, we focus on alleviating filter-class entanglement during training. Inspired by cellular differentiation, we propose a novel strategy to train interpretable CNNs by encouraging class-specific filters, among which each filter responds to only one (or few) class. Concretely, we design a learnable sparse Class-Specific Gate (CSG) structure to assign each filter with one (or few) class in a flexible way. The gate allows a filter's activation to pass only when the input samples come from the specific class. Extensive experiments demonstrate the fabulous performance of our method in generating a sparse and highly class-related representation of the input, which leads to stronger interpretability. Moreover, comparing with the standard training strategy, our model displays benefits in applications like object localization and adversarial sample detection. Code link: https://github.com/hyliang96/CSGCNN.
    Stabilizing Deep Q-Learning with ConvNets and Vision Transformers under Data Augmentation. (arXiv:2107.00644v1 [cs.LG])
    (2 min) While agents trained by Reinforcement Learning (RL) can solve increasingly challenging tasks directly from visual observations, generalizing learned skills to novel environments remains very challenging. Extensive use of data augmentation is a promising technique for improving generalization in RL, but it is often found to decrease sample efficiency and can even lead to divergence. In this paper, we investigate causes of instability when using data augmentation in common off-policy RL algorithms. We identify two problems, both rooted in high-variance Q-targets. Based on our findings, we propose a simple yet effective technique for stabilizing this class of algorithms under augmentation. We perform extensive empirical evaluation of image-based RL using both ConvNets and Vision Transformers (ViT) on a family of benchmarks based on DeepMind Control Suite, as well as in robotic manipulation tasks. Our method greatly improves stability and sample efficiency of ConvNets under augmentation, and achieves generalization results competitive with state-of-the-art methods for image-based RL. We further show that our method scales to RL with ViT-based architectures, and that data augmentation may be especially important in this setting.
    Circuit Complexity of Visual Search. (arXiv:2107.00223v1 [cs.CC])
    (3 min) We study computational hardness of feature and conjunction search through the lens of circuit complexity. Let $x = (x_1, ... , x_n)$ (resp., $y = (y_1, ... , y_n)$) be Boolean variables each of which takes the value one if and only if a neuron at place $i$ detects a feature (resp., another feature). We then simply formulate the feature and conjunction search as Boolean functions ${\rm FTR}_n(x) = \bigvee_{i=1}^n x_i$ and ${\rm CONJ}_n(x, y) = \bigvee_{i=1}^n x_i \wedge y_i$, respectively. We employ a threshold circuit or a discretized circuit (such as a sigmoid circuit or a ReLU circuit with discretization) as our models of neural networks, and consider the following four computational resources: [i] the number of neurons (size), [ii] the number of levels (depth), [iii] the number of active neurons outputting non-zero values (energy), and [iv] synaptic weight resolution (weight). We first prove that any threshold circuit $C$ of size $s$, depth $d$, energy $e$ and weight $w$ satisfies $\log rk(M_C) \le ed (\log s + \log w + \log n)$, where $rk(M_C)$ is the rank of the communication matrix $M_C$ of a $2n$-variable Boolean function that $C$ computes. Since ${\rm CONJ}_n$ has rank $2^n$, we have $n \le ed (\log s + \log w + \log n)$. Thus, an exponential lower bound on the size of even sublinear-depth threshold circuits exists if the energy and weight are sufficiently small. Since ${\rm FTR}_n$ is computable independently of $n$, our result suggests that computational capacity for the feature and conjunction search are different. We also show that the inequality is tight up to a constant factor if $ed = o(n/ \log n)$. We next show that a similar inequality holds for any discretized circuit. Thus, if we regard the number of gates outputting non-zero values as a measure for sparse activity, our results suggest that larger depth helps neural networks to acquire sparse activity.
    Understanding Adversarial Examples Through Deep Neural Network's Response Surface and Uncertainty Regions. (arXiv:2107.00003v1 [cs.LG])
    (2 min) Deep neural network (DNN) is a popular model implemented in many systems to handle complex tasks such as image classification, object recognition, natural language processing etc. Consequently DNN structural vulnerabilities become part of the security vulnerabilities in those systems. In this paper we study the root cause of DNN adversarial examples. We examine the DNN response surface to understand its classification boundary. Our study reveals the structural problem of DNN classification boundary that leads to the adversarial examples. Existing attack algorithms can generate from a handful to a few hundred adversarial examples given one clean image. We show there are infinitely many adversarial images given one clean sample, all within a small neighborhood of the clean sample. We then define DNN uncertainty regions and show transferability of adversarial examples is not universal. We also argue that generalization error, the large sample theoretical guarantee established for DNN, cannot adequately capture the phenomenon of adversarial examples. We need new theory to measure DNN robustness.
    Sparse GCA and Thresholded Gradient Descent. (arXiv:2107.00371v1 [stat.ML])
    (2 min) Generalized correlation analysis (GCA) is concerned with uncovering linear relationships across multiple datasets. It generalizes canonical correlation analysis that is designed for two datasets. We study sparse GCA when there are potentially multiple generalized correlation tuples in data and the loading matrix has a small number of nonzero rows. It includes sparse CCA and sparse PCA of correlation matrices as special cases. We first formulate sparse GCA as generalized eigenvalue problems at both population and sample levels via a careful choice of normalization constraints. Based on a Lagrangian form of the sample optimization problem, we propose a thresholded gradient descent algorithm for estimating GCA loading vectors and matrices in high dimensions. We derive tight estimation error bounds for estimators generated by the algorithm with proper initialization. We also demonstrate the prowess of the algorithm on a number of synthetic datasets.
    Adaptive Stochastic ADMM for Decentralized Reinforcement Learning in Edge Industrial IoT. (arXiv:2107.00481v1 [cs.LG])
    (2 min) Edge computing provides a promising paradigm to support the implementation of Industrial Internet of Things (IIoT) by offloading tasks to nearby edge nodes. Meanwhile, the increasing network size makes it impractical for centralized data processing due to limited bandwidth, and consequently a decentralized learning scheme is preferable. Reinforcement learning (RL) has been widely investigated and shown to be a promising solution for decision-making and optimal control processes. For RL in a decentralized setup, edge nodes (agents) connected through a communication network aim to work collaboratively to find a policy to optimize the global reward as the sum of local rewards. However, communication costs, scalability and adaptation in complex environments with heterogeneous agents may significantly limit the performance of decentralized RL. Alternating direction method of multipliers (ADMM) has a structure that allows for decentralized implementation, and has shown faster convergence than gradient descent based methods. Therefore, we propose an adaptive stochastic incremental ADMM (asI-ADMM) algorithm and apply the asI-ADMM to decentralized RL with edge-computing-empowered IIoT networks. We provide convergence properties for proposed algorithms by designing a Lyapunov function and prove that the asI-ADMM has $O(\frac{1}{k}) +O(\frac{1}{M})$ convergence rate where $k$ and $ M$ are the number of iterations and batch samples, respectively. Then, we test our algorithm with two supervised learning problems. For performance evaluation, we simulate two applications in decentralized RL settings with homogeneous and heterogeneous agents. The experiment results show that our proposed algorithms outperform the state of the art in terms of communication costs and scalability, and can well adapt to complex IoT environments.
    Scalable Certified Segmentation via Randomized Smoothing. (arXiv:2107.00228v1 [cs.LG])
    (2 min) We present a new certification method for image and point cloud segmentation based on randomized smoothing. The method leverages a novel scalable algorithm for prediction and certification that correctly accounts for multiple testing, necessary for ensuring statistical guarantees. The key to our approach is reliance on established multiple-testing correction mechanisms as well as the ability to abstain from classifying single pixels or points while still robustly segmenting the overall input. Our experimental evaluation on synthetic data and challenging datasets, such as Pascal Context, Cityscapes, and ShapeNet, shows that our algorithm can achieve, for the first time, competitive accuracy and certification guarantees on real-world segmentation tasks. We provide an implementation at https://github.com/eth-sri/segmentation-smoothing.
    Automatic Metadata Extraction Incorporating Visual Features from Scanned Electronic Theses and Dissertations. (arXiv:2107.00516v1 [cs.DL])
    (0 min) Electronic Theses and Dissertations (ETDs) contain domain knowledge that can be used for many digital library tasks, such as analyzing citation networks and predicting research trends. Automatic metadata extraction is important to build scalable digital library search engines. Most existing methods are designed for born-digital documents, so they often fail to extract metadata from scanned documents such as for ETDs. Traditional sequence tagging methods mainly rely on text-based features. In this paper, we propose a conditional random field (CRF) model that combines text-based and visual features. To verify the robustness of our model, we extended an existing corpus and created a new ground truth corpus consisting of 500 ETD cover pages with human validated metadata. Our experiments show that CRF with visual features outperformed both a heuristic and a CRF model with only text-based features. The proposed model achieved 81.3%-96% F1 measure on seven metadata fields. The data and source code are publicly available on Google Drive (https://tinyurl.com/y8kxzwrp) and a GitHub repository (https://github.com/lamps-lab/ETDMiner/tree/master/etd_crf), respectively.
    Approximate Frank-Wolfe Algorithms over Graph-structured Support Sets. (arXiv:2107.00472v1 [math.OC])
    (0 min) In this paper, we propose approximate Frank-Wolfe (FW) algorithms to solve convex optimization problems over graph-structured support sets where the \textit{linear minimization oracle} (LMO) cannot be efficiently obtained in general. We first demonstrate that two popular approximation assumptions (\textit{additive} and \textit{multiplicative gap errors)}, are not valid for our problem, in that no cheap gap-approximate LMO oracle exists in general. Instead, a new \textit{approximate dual maximization oracle} (DMO) is proposed, which approximates the inner product rather than the gap. When the objective is $L$-smooth, we prove that the standard FW method using a $\delta$-approximate DMO converges as $\mathcal{O}(L / \delta t + (1-\delta)(\delta^{-1} + \delta^{-2}))$ in general, and as $\mathcal{O}(L/(\delta^2(t+2)))$ over a $\delta$-relaxation of the constraint set. Additionally, when the objective is $\mu$-strongly convex and the solution is unique, a variant of FW converges to $\mathcal{O}(L^2\log(t)/(\mu \delta^6 t^2))$ with the same per-iteration complexity. Our empirical results suggest that even these improved bounds are pessimistic, with significant improvement in recovering real-world images with graph-structured sparsity.
    Solving Inverse Problems with a Flow-based Noise Model. (arXiv:2003.08089v3 [cs.LG] UPDATED)
    (0 min) We study image inverse problems with a normalizing flow prior. Our formulation views the solution as the maximum a posteriori estimate of the image conditioned on the measurements. This formulation allows us to use noise models with arbitrary dependencies as well as non-linear forward operators. We empirically validate the efficacy of our method on various inverse problems, including compressed sensing with quantized measurements and denoising with highly structured noise patterns. We also present initial theoretical recovery guarantees for solving inverse problems with a flow prior.
    Generative Adversarial Transformers. (arXiv:2103.01209v3 [cs.CV] UPDATED)
    (0 min) We introduce the GANformer, a novel and efficient type of transformer, and explore it for the task of visual generative modeling. The network employs a bipartite structure that enables long-range interactions across the image, while maintaining computation of linear efficiency, that can readily scale to high-resolution synthesis. It iteratively propagates information from a set of latent variables to the evolving visual features and vice versa, to support the refinement of each in light of the other and encourage the emergence of compositional representations of objects and scenes. In contrast to the classic transformer architecture, it utilizes multiplicative integration that allows flexible region-based modulation, and can thus be seen as a generalization of the successful StyleGAN network. We demonstrate the model's strength and robustness through a careful evaluation over a range of datasets, from simulated multi-object environments to rich real-world indoor and outdoor scenes, showing it achieves state-of-the-art results in terms of image quality and diversity, while enjoying fast learning and better data-efficiency. Further qualitative and quantitative experiments offer us an insight into the model's inner workings, revealing improved interpretability and stronger disentanglement, and illustrating the benefits and efficacy of our approach. An implementation of the model is available at https://github.com/dorarad/gansformer.
    Non-parametric Active Learning and Rate Reduction in Many-body Hilbert Space with Rescaled Logarithmic Fidelity. (arXiv:2107.00195v1 [quant-ph])
    (0 min) In quantum and quantum-inspired machine learning, the very first step is to embed the data in quantum space known as Hilbert space. Developing quantum kernel function (QKF), which defines the distances among the samples in the Hilbert space, belongs to the fundamental topics for machine learning. In this work, we propose the rescaled logarithmic fidelity (RLF) and a non-parametric active learning in the quantum space, which we name as RLF-NAL. The rescaling takes advantage of the non-linearity of the kernel to tune the mutual distances of samples in the Hilbert space, and meanwhile avoids the exponentially-small fidelities between quantum many-qubit states. We compare RLF-NAL with several well-known non-parametric algorithms including naive Bayes classifiers, $k$-nearest neighbors, and spectral clustering. Our method exhibits excellent accuracy particularly for the unsupervised case with no labeled samples and the few-shot cases with small numbers of labeled samples. With the visualizations by t-SNE, our results imply that the machine learning in the Hilbert space complies with the principles of maximal coding rate reduction, where the low-dimensional data exhibit within-class compressibility, between-class discrimination, and overall diversity. Our proposals can be applied to other quantum and quantum-inspired machine learning, including the methods using the parametric models such as tensor networks, quantum circuits, and quantum neural networks.
    Transformer-based end-to-end speech recognition with residual Gaussian-based self-attention. (arXiv:2103.15722v3 [cs.SD] UPDATED)
    (0 min) Self-attention (SA), which encodes vector sequences according to their pairwise similarity, is widely used in speech recognition due to its strong context modeling ability. However, when applied to long sequence data, its accuracy is reduced. This is caused by the fact that its weighted average operator may lead to the dispersion of the attention distribution, which results in the relationship between adjacent signals ignored. To address this issue, in this paper, we introduce relative-position-awareness self-attention (RPSA). It not only maintains the global-range dependency modeling ability of self-attention, but also improves the localness modeling ability. Because the local window length of the original RPSA is fixed and sensitive to different test data, here we propose Gaussian-based self-attention (GSA) whose window length is learnable and adaptive to the test data automatically. We further generalize GSA to a new residual Gaussian self-attention (resGSA) for the performance improvement. We apply RPSA, GSA, and resGSA to Transformer-based speech recognition respectively. Experimental results on the AISHELL-1 Mandarin speech recognition corpus demonstrate the effectiveness of the proposed methods. For example, the resGSA-Transformer achieves a character error rate (CER) of 5.86% on the test set, which is relative 7.8% lower than that of the SA-Transformer. Although the performance of the proposed resGSA-Transformer is only slightly better than that of the RPSA-Transformer, it does not have to tune the window length manually.
    Adaptive Sequential Design for a Single Time-Series. (arXiv:2102.00102v2 [math.ST] UPDATED)
    (0 min) The current work is motivated by the need for robust statistical methods for precision medicine; as such, we address the need for statistical methods that provide actionable inference for a single unit at any point in time. We aim to learn an optimal, unknown choice of the controlled components of the design in order to optimize the expected outcome; with that, we adapt the randomization mechanism for future time-point experiments based on the data collected on the individual over time. Our results demonstrate that one can learn the optimal rule based on a single sample, and thereby adjust the design at any point t with valid inference for the mean target parameter. This work provides several contributions to the field of statistical precision medicine. First, we define a general class of averages of conditional causal parameters defined by the current context for the single unit time-series data. We define a nonparametric model for the probability distribution of the time-series under few assumptions, and aim to fully utilize the sequential randomization in the estimation procedure via the double robust structure of the efficient influence curve of the proposed target parameter. We present multiple exploration-exploitation strategies for assigning treatment, and methods for estimating the optimal rule. Lastly, we present the study of the data-adaptive inference on the mean under the optimal treatment rule, where the target parameter adapts over time in response to the observed context of the individual. Our target parameter is pathwise differentiable with an efficient influence function that is doubly robust - which makes it easier to estimate than previously proposed variations. We characterize the limit distribution of our estimator under a Donsker condition expressed in terms of a notion of bracketing entropy adapted to martingale settings.
    Towards Measuring Bias in Image Classification. (arXiv:2107.00360v1 [cs.CV])
    (0 min) Convolutional Neural Networks (CNN) have become de fact state-of-the-art for the main computer vision tasks. However, due to the complex underlying structure their decisions are hard to understand which limits their use in some context of the industrial world. A common and hard to detect challenge in machine learning (ML) tasks is data bias. In this work, we present a systematic approach to uncover data bias by means of attribution maps. For this purpose, first an artificial dataset with a known bias is created and used to train intentionally biased CNNs. The networks' decisions are then inspected using attribution maps. Finally, meaningful metrics are used to measure the attribution maps' representativeness with respect to the known bias. The proposed study shows that some attribution map techniques highlight the presence of bias in the data better than others and metrics can support the identification of bias.
    Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble. (arXiv:2107.00591v1 [cs.RO])
    (0 min) Recent advance in deep offline reinforcement learning (RL) has made it possible to train strong robotic agents from offline datasets. However, depending on the quality of the trained agents and the application being considered, it is often desirable to fine-tune such agents via further online interactions. In this paper, we observe that state-action distribution shift may lead to severe bootstrap error during fine-tuning, which destroys the good initial policy obtained via offline RL. To address this issue, we first propose a balanced replay scheme that prioritizes samples encountered online while also encouraging the use of near-on-policy samples from the offline dataset. Furthermore, we leverage multiple Q-functions trained pessimistically offline, thereby preventing overoptimism concerning unfamiliar actions at novel states during the initial training phase. We show that the proposed method improves sample-efficiency and final performance of the fine-tuned robotic agents on various locomotion and manipulation tasks. Our code is available at: https://github.com/shlee94/Off2OnRL.
    Color Variants Identification in Fashion e-commerce via Contrastive Self-Supervised Representation Learning. (arXiv:2104.08581v2 [cs.CV] UPDATED)
    (0 min) In this paper, we utilize deep visual Representation Learning to address an important problem in fashion e-commerce: color variants identification, i.e., identifying fashion products that match exactly in their design (or style), but only to differ in their color. At first we attempt to tackle the problem by obtaining manual annotations (depicting whether two products are color variants), and train a supervised triplet loss based neural network model to learn representations of fashion products. However, for large scale real-world industrial datasets such as addressed in our paper, it is infeasible to obtain annotations for the entire dataset, while capturing all the difficult corner cases. Interestingly, we observed that color variants are essentially manifestations of color jitter based augmentations. Thus, we instead explore Self-Supervised Learning (SSL) to solve this problem. We observed that existing state-of-the-art SSL methods perform poor, for our problem. To address this, we propose a novel SSL based color variants model that simultaneously focuses on different parts of an apparel. Quantitative and qualitative evaluation shows that our method outperforms existing SSL methods, and at times, the supervised model.
    Improving Sound Event Classification by Increasing Shift Invariance in Convolutional Neural Networks. (arXiv:2107.00623v1 [cs.SD])
    (0 min) Recent studies have put into question the commonly assumed shift invariance property of convolutional networks, showing that small shifts in the input can affect the output predictions substantially. In this paper, we ask whether lack of shift invariance is a problem in sound event classification, and whether there are benefits in addressing it. Specifically, we evaluate two pooling methods to improve shift invariance in CNNs, based on low-pass filtering and adaptive sampling of incoming feature maps. These methods are implemented via small architectural modifications inserted into the pooling layers of CNNs. We evaluate the effect of these architectural changes on the FSD50K dataset using models of different capacity and in presence of strong regularization. We show that these modifications consistently improve sound event classification in all cases considered, without adding any (or adding very few) trainable parameters, which makes them an appealing alternative to conventional pooling layers. The outcome is a new state-of-the-art mAP of 0.541 on the FSD50K classification benchmark.
    Intrinsic persistent homology via density-based metric learning. (arXiv:2012.07621v2 [stat.ML] UPDATED)
    (0 min) We address the problem of estimating intrinsic distances in a manifold from a finite sample. We prove that the metric space defined by the sample endowed with a computable metric known as sample Fermat distance converges a.s. in the sense of Gromov-Hausdorff. The limiting object is the manifold itself endowed with the population Fermat distance, an intrinsic metric that accounts for both the geometry of the manifold and the density that produces the sample. This result is applied to obtain intrinsic persistence diagrams, which are less sensitive to the particular embedding of the manifold in the Euclidean space. We show that this approach is robust to outliers and deduce a method for pattern recognition in signals, with applications in real data.
    When does gradient descent with logistic loss interpolate using deep networks with smoothed ReLU activations?. (arXiv:2102.04998v2 [stat.ML] UPDATED)
    (0 min) We establish conditions under which gradient descent applied to fixed-width deep networks drives the logistic loss to zero, and prove bounds on the rate of convergence. Our analysis applies for smoothed approximations to the ReLU, such as Swish and the Huberized ReLU, proposed in previous applied work. We provide two sufficient conditions for convergence. The first is simply a bound on the loss at initialization. The second is a data separation condition used in prior analyses.
    Demon: Improved Neural Network Training with Momentum Decay. (arXiv:1910.04952v4 [cs.LG] UPDATED)
    (0 min) Momentum is a widely used technique for gradient-based optimizers in deep learning. In this paper, we propose a decaying momentum (\textsc{Demon}) rule. We conduct the first large-scale empirical analysis of momentum decay methods for modern neural network optimization, in addition to the most popular learning rate decay schedules. Across 28 relevant combinations of models, epochs, datasets, and optimizers, \textsc{Demon} achieves the highest number of Top-1 and Top-3 finishes at 39\% and 85\% respectively, almost doubling the second-placed learning rate cosine schedule at 17\% and 60\%, respectively. \textsc{Demon} also outperforms other widely used schedulers including, but not limited to, the learning rate step schedule, linear schedule, OneCycle schedule, and exponential schedule. Compared with the widely used learning rate step schedule, \textsc{Demon} is observed to be less sensitive to parameter tuning, which is critical to training neural networks in practice. Results are demonstrated across a variety of settings and architectures, including image classification, generative models, and language models. \textsc{Demon} is easy to implement, requires no additional tuning, and incurs almost no extra computational overhead compared to the vanilla counterparts. Code is readily available.
    Generalization and Robustness Implications in Object-Centric Learning. (arXiv:2107.00637v1 [cs.LG])
    (0 min) The idea behind object-centric representation learning is that natural scenes can better be modeled as compositions of objects and their relations as opposed to distributed representations. This inductive bias can be injected into neural networks to potentially improve systematic generalization and learning efficiency of downstream tasks in scenes with multiple objects. In this paper, we train state-of-the-art unsupervised models on five common multi-object datasets and evaluate segmentation accuracy and downstream object property prediction. In addition, we study systematic generalization and robustness by investigating the settings where either single objects are out-of-distribution -- e.g., having unseen colors, textures, and shapes -- or global properties of the scene are altered -- e.g., by occlusions, cropping, or increasing the number of objects. From our experimental study, we find object-centric representations to be generally useful for downstream tasks and robust to shifts in the data distribution, especially if shifts affect single objects.
    Explainable Diabetic Retinopathy Detection and Retinal Image Generation. (arXiv:2107.00296v1 [eess.IV])
    (0 min) Though deep learning has shown successful performance in classifying the label and severity stage of certain diseases, most of them give few explanations on how to make predictions. Inspired by Koch's Postulates, the foundation in evidence-based medicine (EBM) to identify the pathogen, we propose to exploit the interpretability of deep learning application in medical diagnosis. By determining and isolating the neuron activation patterns on which diabetic retinopathy (DR) detector relies to make decisions, we demonstrate the direct relation between the isolated neuron activation and lesions for a pathological explanation. To be specific, we first define novel pathological descriptors using activated neurons of the DR detector to encode both spatial and appearance information of lesions. Then, to visualize the symptom encoded in the descriptor, we propose Patho-GAN, a new network to synthesize medically plausible retinal images. By manipulating these descriptors, we could even arbitrarily control the position, quantity, and categories of generated lesions. We also show that our synthesized images carry the symptoms directly related to diabetic retinopathy diagnosis. Our generated images are both qualitatively and quantitatively superior to the ones by previous methods. Besides, compared to existing methods that take hours to generate an image, our second level speed endows the potential to be an effective solution for data augmentation.
    DVS-Attacks: Adversarial Attacks on Dynamic Vision Sensors for Spiking Neural Networks. (arXiv:2107.00415v1 [cs.CV])
    (0 min) Spiking Neural Networks (SNNs), despite being energy-efficient when implemented on neuromorphic hardware and coupled with event-based Dynamic Vision Sensors (DVS), are vulnerable to security threats, such as adversarial attacks, i.e., small perturbations added to the input for inducing a misclassification. Toward this, we propose DVS-Attacks, a set of stealthy yet efficient adversarial attack methodologies targeted to perturb the event sequences that compose the input of the SNNs. First, we show that noise filters for DVS can be used as defense mechanisms against adversarial attacks. Afterwards, we implement several attacks and test them in the presence of two types of noise filters for DVS cameras. The experimental results show that the filters can only partially defend the SNNs against our proposed DVS-Attacks. Using the best settings for the noise filters, our proposed Mask Filter-Aware Dash Attack reduces the accuracy by more than 20% on the DVS-Gesture dataset and by more than 65% on the MNIST dataset, compared to the original clean frames. The source code of all the proposed DVS-Attacks and noise filters is released at https://github.com/albertomarchisio/DVS-Attacks.
    Random Hyperboxes. (arXiv:2006.00695v3 [cs.LG] UPDATED)
    (0 min) This paper proposes a simple yet powerful ensemble classifier, called Random Hyperboxes, constructed from individual hyperbox-based classifiers trained on the random subsets of sample and feature spaces of the training set. We also show a generalization error bound of the proposed classifier based on the strength of the individual hyperbox-based classifiers as well as the correlation among them. The effectiveness of the proposed classifier is analyzed using a carefully selected illustrative example and compared empirically with other popular single and ensemble classifiers via 20 datasets using statistical testing methods. The experimental results confirmed that our proposed method outperformed other fuzzy min-max neural networks, popular learning algorithms, and is competitive with other ensemble methods. Finally, we identify the existing issues related to the generalization error bounds of the real datasets and inform the potential research directions.
    Continual Distributed Learning for Crisis Management. (arXiv:2104.12876v2 [cs.LG] UPDATED)
    (0 min) Social media platforms such as Twitter, Facebook etc can be utilised as an important source of information during disaster events. This information can be used for disaster response and crisis management if processed accurately and quickly. However, the data present in such situations is ever-changing, and using considerable resources during such a crisis is not feasible. Therefore, we have to develop a low resource and continually learning system that incorporates text classification models which are robust against noisy and unordered data. We utilised Distributed learning which enabled us to learn on resource-constrained devices, then to alleviate catastrophic forgetting in our target neural networks we utilized regularization. We then applied federated averaging for distributed learning and to aggregate the central model for continual learning.
    Which Echo Chamber? Regions of Attraction in Learning with Decision-Dependent Distributions. (arXiv:2107.00055v1 [cs.LG])
    (0 min) As data-driven methods are deployed in real-world settings, the processes that generate the observed data will often react to the decisions of the learner. For example, a data source may have some incentive for the algorithm to provide a particular label (e.g. approve a bank loan), and manipulate their features accordingly. Work in strategic classification and decision-dependent distributions seeks to characterize the closed-loop behavior of deploying learning algorithms by explicitly considering the effect of the classifier on the underlying data distribution. More recently, works in performative prediction seek to classify the closed-loop behavior by considering general properties of the mapping from classifier to data distribution, rather than an explicit form. Building on this notion, we analyze repeated risk minimization as the perturbed trajectories of the gradient flows of performative risk minimization. We consider the case where there may be multiple local minimizers of performative risk, motivated by real world situations where the initial conditions may have significant impact on the long-term behavior of the system. As a motivating example, we consider a company whose current employee demographics affect the applicant pool they interview: the initial demographics of the company can affect the long-term hiring policies of the company. We provide sufficient conditions to characterize the region of attraction for the various equilibria in this settings. Additionally, we introduce the notion of performative alignment, which provides a geometric condition on the convergence of repeated risk minimization to performative risk minimizers.
    A Biased Graph Neural Network Sampler with Near-Optimal Regret. (arXiv:2103.01089v2 [cs.LG] UPDATED)
    (2 min) Graph neural networks (GNN) have recently emerged as a vehicle for applying deep network architectures to graph and relational data. However, given the increasing size of industrial datasets, in many practical situations the message passing computations required for sharing information across GNN layers are no longer scalable. Although various sampling methods have been introduced to approximate full-graph training within a tractable budget, there remain unresolved complications such as high variances and limited theoretical guarantees. To address these issues, we build upon existing work and treat GNN neighbor sampling as a multi-armed bandit problem but with a newly-designed reward function that introduces some degree of bias designed to reduce variance and avoid unstable, possibly-unbounded pay outs. And unlike prior bandit-GNN use cases, the resulting policy leads to near-optimal regret while accounting for the GNN training dynamics introduced by SGD. From a practical standpoint, this translates into lower variance estimates and competitive or superior test accuracy across several benchmarks.
    Identification of COVID-19 related Fake News via Neural Stacking. (arXiv:2101.03988v2 [cs.CL] UPDATED)
    (2 min) Identification of Fake News plays a prominent role in the ongoing pandemic, impacting multiple aspects of day-to-day life. In this work we present a solution to the shared task titled COVID19 Fake News Detection in English, scoring the 50th place amongst 168 submissions. The solution was within 1.5% of the best performing solution. The proposed solution employs a heterogeneous representation ensemble, adapted for the classification task via an additional neural classification head comprised of multiple hidden layers. The paper consists of detailed ablation studies further displaying the proposed method's behavior and possible implications. The solution is freely available. \url{https://gitlab.com/boshko.koloski/covid19-fake-news}
    An error analysis of generative adversarial networks for learning distributions. (arXiv:2105.13010v4 [cs.LG] UPDATED)
    (2 min) This paper studies how well generative adversarial networks (GANs) learn probability distributions from finite samples. Our main results establish the convergence rates of GANs under a collection of integral probability metrics defined through H\"older classes, including the Wasserstein distance as a special case. We also show that GANs are able to adaptively learn data distributions with low-dimensional structures or have H\"older densities, when the network architectures are chosen properly. In particular, for distributions concentrated around a low-dimensional set, we show that the learning rates of GANs do not depend on the high ambient dimension, but on the lower intrinsic dimension. Our analysis is based on a new oracle inequality decomposing the estimation error into the generator and discriminator approximation error and the statistical error, which may be of independent interest.
    Towards Tight Communication Lower Bounds for Distributed Optimisation. (arXiv:2010.08222v2 [cs.LG] UPDATED)
    (2 min) We consider a standard distributed optimisation setting where $N$ machines, each holding a $d$-dimensional function $f_i$, aim to jointly minimise the sum of the functions $\sum_{i = 1}^N f_i (x)$. This problem arises naturally in large-scale distributed optimisation, where a standard solution is to apply variants of (stochastic) gradient descent. We focus on the communication complexity of this problem: our main result provides the first fully unconditional bounds on total number of bits which need to be sent and received by the $N$ machines to solve this problem under point-to-point communication, within a given error-tolerance. Specifically, we show that $\Omega( Nd \log d / N\varepsilon)$ total bits need to be communicated between the machines to find an additive $\epsilon$-approximation to the minimum of $\sum_{i = 1}^N f_i (x)$. The result holds for both deterministic and randomised algorithms, and, importantly, requires no assumptions on the algorithm structure. The lower bound is tight under certain restrictions on parameter values, and is matched within constant factors for quadratic objectives by a new variant of quantised gradient descent, which we describe and analyse. Our results bring over tools from communication complexity to distributed optimisation, which has potential for further applications.
    Information Geometry and Classical Cram\'{e}r-Rao Type Inequalities. (arXiv:2104.01061v2 [cs.IT] UPDATED)
    (2 min) We examine the role of information geometry in the context of classical Cram\'er-Rao (CR) type inequalities. In particular, we focus on Eguchi's theory of obtaining dualistic geometric structures from a divergence function and then applying Amari-Nagoaka's theory to obtain a CR type inequality. The classical deterministic CR inequality is derived from Kullback-Leibler (KL)-divergence. We show that this framework could be generalized to other CR type inequalities through four examples: $\alpha$-version of CR inequality, generalized CR inequality, Bayesian CR inequality, and Bayesian $\alpha$-CR inequality. These are obtained from, respectively, $I_\alpha$-divergence (or relative $\alpha$-entropy), generalized Csisz\'ar divergence, Bayesian KL divergence, and Bayesian $I_\alpha$-divergence.
    Lattice Fusion Networks for Image Denoising. (arXiv:2011.14196v3 [eess.IV] UPDATED)
    (2 min) A novel method for feature fusion in convolutional neural networks is proposed in this paper. Different feature fusion techniques are suggested to facilitate the flow of information and improve the training of deep neural networks. Some of these techniques as well as the proposed network can be considered a type of Directed Acyclic Graph (DAG) Network, where a layer can receive inputs from other layers and have outputs to other layers. In the proposed general framework of Lattice Fusion Network (LFNet), feature maps of each convolutional layer are passed to other layers based on a lattice graph structure, where nodes are convolutional layers. To evaluate the performance of the proposed architecture, different designs based on the general framework of LFNet are implemented for the task of image denoising. This task is used as an example where training deep convolutional networks is needed. Results are compared with state of the art methods. The proposed network is able to achieve better results with far fewer learnable parameters, which shows the effectiveness of LFNets for training of deep neural networks.
    Parallel Predictive Entropy Search for Multi-objective Bayesian Optimization with Constraints. (arXiv:2004.00601v2 [stat.ML] UPDATED)
    (2 min) Real-world problems often involve the optimization of several objectives under multiple constraints. An example is the hyper-parameter tuning problem of machine learning algorithms. In particular, the minimization of the estimation of the generalization error of a deep neural network and at the same time the minimization of its prediction time. We may also consider as a constraint that the deep neural network must be implemented in a chip with an area below some size. Here, both the objectives and the constraint are black boxes, i.e., functions whose analytical expressions are unknown and are expensive to evaluate. Bayesian optimization (BO) methodologies have given state-of-the-art results for the optimization of black-boxes. Nevertheless, most BO methods are sequential and evaluate the objectives and the constraints at just one input location, iteratively. Sometimes, however, we may have resources to evaluate several configurations in parallel. Notwithstanding, no parallel BO method has been proposed to deal with the optimization of multiple objectives under several constraints. If the expensive evaluations can be carried out in parallel (as when a cluster of computers is available), sequential evaluations result in a waste of resources. This article introduces PPESMOC, Parallel Predictive Entropy Search for Multi-objective Bayesian Optimization with Constraints, an information-based batch method for the simultaneous optimization of multiple expensive-to-evaluate black-box functions under the presence of several constraints. Iteratively, PPESMOC selects a batch of input locations at which to evaluate the black-boxes so as to maximally reduce the entropy of the Pareto set of the optimization problem. We present empirical evidence in the form of synthetic, benchmark and real-world experiments that illustrate the effectiveness of PPESMOC.
    Byzantine-Robust Learning on Heterogeneous Datasets via Resampling. (arXiv:2006.09365v3 [cs.LG] UPDATED)
    (2 min) In Byzantine robust distributed or federated learning, a central server wants to train a machine learning model over data distributed across multiple workers. However, a fraction of these workers may deviate from the prescribed algorithm and send arbitrary messages. While this problem has received significant attention recently, most current defenses assume that the workers have identical data. For realistic cases when the data across workers are heterogeneous (non-iid), we design new attacks which circumvent current defenses, leading to significant loss of performance. We then propose a simple resampling scheme that adapts existing robust algorithms to heterogeneous datasets at a negligible computational cost. We also theoretically and experimentally validate our approach, showing that combining resampling with existing robust algorithms is effective against challenging attacks. Our work is the first to establish guaranteed convergence for the non-iid Byzantine robust problem under realistic assumptions.
    Deep Learning for Molecular Graphs with Tiered Graph Autoencoders and Graph Prediction. (arXiv:1910.11390v2 [cs.LG] UPDATED)
    (2 min) Tiered graph autoencoders provide the architecture and mechanisms for learning tiered latent representations and latent spaces for molecular graphs that explicitly represent and utilize groups (e.g., functional groups). This enables the utilization and exploration of tiered molecular latent spaces, either individually - the node (atom) tier, the group tier, or the graph (molecule) tier - or jointly, as well as navigation across the tiers. In this paper, we discuss the use of tiered graph autoencoders together with graph prediction for molecular graphs. We show features of molecular graphs used, and groups in molecular graphs identified for some sample molecules. We briefly review graph prediction and the QM9 dataset for background information, and discuss the use of tiered graph embeddings for graph prediction, particularly weighted group pooling. We find that functional groups and ring groups effectively capture and represent the chemical essence of molecular graphs (structures). Further, tiered graph autoencoders and graph prediction together provide effective, efficient and interpretable deep learning for molecular graphs, with the former providing unsupervised, transferable learning and the latter providing supervised, task-optimized learning.
    Model Mediated Teleoperation with a Hand-Arm Exoskeleton in Long Time Delays Using Reinforcement Learning. (arXiv:2107.00359v1 [cs.RO])
    (2 min) Telerobotic systems must adapt to new environmental conditions and deal with high uncertainty caused by long-time delays. As one of the best alternatives to human-level intelligence, Reinforcement Learning (RL) may offer a solution to cope with these issues. This paper proposes to integrate RL with the Model Mediated Teleoperation (MMT) concept. The teleoperator interacts with a simulated virtual environment, which provides instant feedback. Whereas feedback from the real environment is delayed, feedback from the model is instantaneous, leading to high transparency. The MMT is realized in combination with an intelligent system with two layers. The first layer utilizes Dynamic Movement Primitives (DMP) which accounts for certain changes in the avatar environment. And, the second layer addresses the problems caused by uncertainty in the model using RL methods. Augmented reality was also provided to fuse the avatar device and virtual environment models for the teleoperator. Implemented on DLR's Exodex Adam hand-arm haptic exoskeleton, the results show RL methods are able to find different solutions when changes are applied to the object position after the demonstration. The results also show DMPs to be effective at adapting to new conditions where there is no uncertainty involved.
    Impact Remediation: Optimal Interventions to Reduce Inequality. (arXiv:2107.00593v1 [cs.LG])
    (2 min) A significant body of research in the data sciences considers unfair discrimination against social categories such as race or gender that could occur or be amplified as a result of algorithmic decisions. Simultaneously, real-world disparities continue to exist, even before algorithmic decisions are made. In this work, we draw on insights from the social sciences and humanistic studies brought into the realm of causal modeling and constrained optimization, and develop a novel algorithmic framework for tackling pre-existing real-world disparities. The purpose of our framework, which we call the "impact remediation framework," is to measure real-world disparities and discover the optimal intervention policies that could help improve equity or access to opportunity for those who are underserved with respect to an outcome of interest. We develop a disaggregated approach to tackling pre-existing disparities that relaxes the typical set of assumptions required for the use of social categories in structural causal models. Our approach flexibly incorporates counterfactuals and is compatible with various ontological assumptions about the nature of social categories. We demonstrate impact remediation with a real-world case study and compare our disaggregated approach to an existing state-of-the-art approach, comparing its structure and resulting policy recommendations. In contrast to most work on optimal policy learning, we explore disparity reduction itself as an objective, explicitly focusing the power of algorithms on reducing inequality.
    Momentum-inspired Low-Rank Coordinate Descent for Diagonally Constrained SDPs. (arXiv:2106.08775v1 [math.OC] CROSS LISTED)
    (2 min) We present a novel, practical, and provable approach for solving diagonally constrained semi-definite programming (SDP) problems at scale using accelerated non-convex programming. Our algorithm non-trivially combines acceleration motions from convex optimization with coordinate power iteration and matrix factorization techniques. The algorithm is extremely simple to implement, and adds only a single extra hyperparameter -- momentum. We prove that our method admits local linear convergence in the neighborhood of the optimum and always converges to a first-order critical point. Experimentally, we showcase the merits of our method on three major application domains: MaxCut, MaxSAT, and MIMO signal detection. In all cases, our methodology provides significant speedups over non-convex and convex SDP solvers -- 5X faster than state-of-the-art non-convex solvers, and 9 to 10^3 X faster than convex SDP solvers -- with comparable or improved solution quality.
    Tweet Sentiment Quantification: An Experimental Re-Evaluation. (arXiv:2011.08091v2 [cs.CL] UPDATED)
    (2 min) Sentiment quantification is the task of estimating the relative frequency (or "prevalence") of sentiment-related classes (such as Positive, Neutral, Negative) in a sample of unlabelled texts; this is especially important when these texts are tweets, since most sentiment classification endeavours carried out on Twitter data actually have quantification (and not the classification of individual tweets) as their ultimate goal. It is well-known that solving quantification via "classify and count" (i.e., by classifying all unlabelled items via a standard classifier and counting the items that have been assigned to a given class) is suboptimal in terms of accuracy, and that more accurate quantification methods exist. In 2016, Gao and Sebastiani carried out a systematic comparison of quantification methods on the task of tweet sentiment quantification. In hindsight, we observe that the experimental protocol followed in that work is flawed, and that its results are thus unreliable. We now re-evaluate those quantification methods on the very same datasets, this time following a now consolidated and much more robust experimental protocol, that involves 5775 as many experiments as run in the original study. Our experimentation yields results dramatically different from those obtained by Gao and Sebastiani, and thus provide a different, much more solid understanding of the relative strengths and weaknesses of different sentiment quantification methods.
    Scalable and Adaptive Graph Neural Networks with Self-Label-Enhanced training. (arXiv:2104.09376v3 [cs.LG] UPDATED)
    (2 min) It is hard to directly implement Graph Neural Networks (GNNs) on large scaled graphs. Besides of existed neighbor sampling techniques, scalable methods decoupling graph convolutions and other learnable transformations into preprocessing and post classifier allow normal minibatch training. By replacing redundant concatenation operation with attention mechanism in SIGN, we propose Scalable and Adaptive Graph Neural Networks (SAGN). SAGN can adaptively gather neighborhood information among different hops. To further improve scalable models on semi-supervised learning tasks, we propose Self-Label-Enhance (SLE) framework combining self-training approach and label propagation in depth. We add base model with a scalable node label module. Then we iteratively train models and enhance train set in several stages. To generate input of node label module, we directly apply label propagation based on one-hot encoded label vectors without inner random masking. We find out that empirically the label leakage has been effectively alleviated after graph convolutions. The hard pseudo labels in enhanced train set participate in label propagation with true labels. Experiments on both inductive and transductive datasets demonstrate that, compared with other sampling-based and sampling-free methods, SAGN achieves better or comparable results and SLE can further improve performance.
    Background Knowledge in Schema Matching: Strategy vs. Data. (arXiv:2107.00001v1 [cs.DB])
    (2 min) The use of external background knowledge can be beneficial for the task of matching schemas or ontologies automatically. In this paper, we exploit six general-purpose knowledge graphs as sources of background knowledge for the matching task. The background sources are evaluated by applying three different exploitation strategies. We find that explicit strategies still outperform latent ones and that the choice of the strategy has a greater impact on the final alignment than the actual background dataset on which the strategy is applied. While we could not identify a universally superior resource, BabelNet achieved consistently good results. Our best matcher configuration with BabelNet performs very competitively when compared to other matching systems even though no dataset-specific optimizations were made.
    POSNoise: An Effective Countermeasure Against Topic Biases in Authorship Analysis. (arXiv:2005.06605v2 [cs.CL] UPDATED)
    (2 min) Authorship verification (AV) is a fundamental research task in digital text forensics, which addresses the problem of whether two texts were written by the same person. In recent years, a variety of AV methods have been proposed that focus on this problem and can be divided into two categories: The first category refers to such methods that are based on explicitly defined features, where one has full control over which features are considered and what they actually represent. The second category, on the other hand, relates to such AV methods that are based on implicitly defined features, where no control mechanism is involved, so that any character sequence in a text can serve as a potential feature. However, AV methods belonging to the second category bear the risk that the topic of the texts may bias their classification predictions, which in turn may lead to misleading conclusions regarding their results. To tackle this problem, we propose a preprocessing technique called POSNoise, which effectively masks topic-related content in a given text. In this way, AV methods are forced to focus on such text units that are more related to the writing style. Our empirical evaluation based on six AV methods (falling into the second category) and seven corpora shows that POSNoise leads to better results compared to a well-known topic masking approach in 34 out of 42 cases, with an increase in accuracy of up to 10%.
    Neural Network Training with Highly Incomplete Datasets. (arXiv:2107.00429v1 [cs.LG])
    (2 min) Neural network training and validation rely on the availability of large high-quality datasets. However, in many cases only incomplete datasets are available, particularly in health care applications, where each patient typically undergoes different clinical procedures or can drop out of a study. Since the data to train the neural networks need to be complete, most studies discard the incomplete datapoints, which reduces the size of the training data, or impute the missing features, which can lead to artefacts. Alas, both approaches are inadequate when a large portion of the data is missing. Here, we introduce GapNet, an alternative deep-learning training approach that can use highly incomplete datasets. First, the dataset is split into subsets of samples containing all values for a certain cluster of features. Then, these subsets are used to train individual neural networks. Finally, this ensemble of neural networks is combined into a single neural network whose training is fine-tuned using all complete datapoints. Using two highly incomplete real-world medical datasets, we show that GapNet improves the identification of patients with underlying Alzheimer's disease pathology and of patients at risk of hospitalization due to Covid-19. By distilling the information available in incomplete datasets without having to reduce their size or to impute missing values, GapNet will permit to extract valuable information from a wide range of datasets, benefiting diverse fields from medicine to engineering.
    Machine Learning and Deep Learning for Fixed-Text Keystroke Dynamics. (arXiv:2107.00507v1 [cs.LG])
    (2 min) Keystroke dynamics can be used to analyze the way that users type by measuring various aspects of keyboard input. Previous work has demonstrated the feasibility of user authentication and identification utilizing keystroke dynamics. In this research, we consider a wide variety of machine learning and deep learning techniques based on fixed-text keystroke-derived features, we optimize the resulting models, and we compare our results to those obtained in related research. We find that models based on extreme gradient boosting (XGBoost) and multi-layer perceptrons (MLP)perform well in our experiments. Our best models outperform previous comparable research.
    Sanity Checks for Lottery Tickets: Does Your Winning Ticket Really Win the Jackpot?. (arXiv:2107.00166v1 [cs.LG])
    (2 min) There have been long-standing controversies and inconsistencies over the experiment setup and criteria for identifying the "winning ticket" in literature. To reconcile such, we revisit the definition of lottery ticket hypothesis, with comprehensive and more rigorous conditions. Under our new definition, we show concrete evidence to clarify whether the winning ticket exists across the major DNN architectures and/or applications. Through extensive experiments, we perform quantitative analysis on the correlations between winning tickets and various experimental factors, and empirically study the patterns of our observations. We find that the key training hyperparameters, such as learning rate and training epochs, as well as the architecture characteristics such as capacities and residual connections, are all highly correlated with whether and when the winning tickets can be identified. Based on our analysis, we summarize a guideline for parameter settings in regards of specific architecture characteristics, which we hope to catalyze the research progress on the topic of lottery ticket hypothesis.
    The Limit Order Book Recreation Model (LOBRM): An Extended Analysis. (arXiv:2107.00534v1 [q-fin.TR])
    (2 min) The limit order book (LOB) depicts the fine-grained demand and supply relationship for financial assets and is widely used in market microstructure studies. Nevertheless, the availability and high cost of LOB data restrict its wider application. The LOB recreation model (LOBRM) was recently proposed to bridge this gap by synthesizing the LOB from trades and quotes (TAQ) data. However, in the original LOBRM study, there were two limitations: (1) experiments were conducted on a relatively small dataset containing only one day of LOB data; and (2) the training and testing were performed in a non-chronological fashion, which essentially re-frames the task as interpolation and potentially introduces lookahead bias. In this study, we extend the research on LOBRM and further validate its use in real-world application scenarios. We first advance the workflow of LOBRM by (1) adding a time-weighted z-score standardization for the LOB and (2) substituting the ordinary differential equation kernel with an exponential decay kernel to lower computation complexity. Experiments are conducted on the extended LOBSTER dataset in a chronological fashion, as it would be used in a real-world application. We find that (1) LOBRM with decay kernel is superior to traditional non-linear models, and module ensembling is effective; (2) prediction accuracy is negatively related to the volatility of order volumes resting in the LOB; (3) the proposed sparse encoding method for TAQ exhibits good generalization ability and can facilitate manifold tasks; and (4) the influence of stochastic drift on prediction accuracy can be alleviated by increasing historical samples.
    Deep Orthogonal Fusion: Multimodal Prognostic Biomarker Discovery Integrating Radiology, Pathology, Genomic, and Clinical Data. (arXiv:2107.00648v1 [cs.CV])
    (2 min) Clinical decision-making in oncology involves multimodal data such as radiology scans, molecular profiling, histopathology slides, and clinical factors. Despite the importance of these modalities individually, no deep learning framework to date has combined them all to predict patient prognosis. Here, we predict the overall survival (OS) of glioma patients from diverse multimodal data with a Deep Orthogonal Fusion (DOF) model. The model learns to combine information from multiparametric MRI exams, biopsy-based modalities (such as H&E slide images and/or DNA sequencing), and clinical variables into a comprehensive multimodal risk score. Prognostic embeddings from each modality are learned and combined via attention-gated tensor fusion. To maximize the information gleaned from each modality, we introduce a multimodal orthogonalization (MMO) loss term that increases model performance by incentivizing constituent embeddings to be more complementary. DOF predicts OS in glioma patients with a median C-index of 0.788 +/- 0.067, significantly outperforming (p=0.023) the best performing unimodal model with a median C-index of 0.718 +/- 0.064. The prognostic model significantly stratifies glioma patients by OS within clinical subsets, adding further granularity to prognostic clinical grading and molecular subtyping.
    CSWin Transformer: A General Vision Transformer Backbone with Cross-Shaped Windows. (arXiv:2107.00652v1 [cs.CV])
    (2 min) We present CSWin Transformer, an efficient and effective Transformer-based backbone for general-purpose vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute whereas local self-attention often limits the field of interactions of each token. To address this issue, we develop the Cross-Shaped Window self-attention mechanism for computing self-attention in the horizontal and vertical stripes in parallel that form a cross-shaped window, with each stripe obtained by splitting the input feature into stripes of equal width. We provide a detailed mathematical analysis of the effect of the stripe width and vary the stripe width for different layers of the Transformer network which achieves strong modeling capability while limiting the computation cost. We also introduce Locally-enhanced Positional Encoding (LePE), which handles the local positional information better than existing encoding schemes. LePE naturally supports arbitrary input resolutions, and is thus especially effective and friendly for downstream tasks. Incorporated with these designs and a hierarchical structure, CSWin Transformer demonstrates competitive performance on common vision tasks. Specifically, it achieves 85.4% Top-1 accuracy on ImageNet-1K without any extra training data or label, 53.9 box AP and 46.4 mask AP on the COCO detection task, and 51.7 mIOU on the ADE20K semantic segmentation task, surpassing previous state-of-the-art Swin Transformer backbone by +1.2, +2.0, +1.4, and +2.0 respectively under the similar FLOPs setting. By further pretraining on the larger dataset ImageNet-21K, we achieve 87.5% Top-1 accuracy on ImageNet-1K and state-of-the-art segmentation performance on ADE20K with 55.2 mIoU. The code and models will be available at https://github.com/microsoft/CSWin-Transformer.
    Fast Margin Maximization via Dual Acceleration. (arXiv:2107.00595v1 [cs.LG])
    (2 min) We present and analyze a momentum-based gradient method for training linear classifiers with an exponentially-tailed loss (e.g., the exponential or logistic loss), which maximizes the classification margin on separable data at a rate of $\widetilde{\mathcal{O}}(1/t^2)$. This contrasts with a rate of $\mathcal{O}(1/\log(t))$ for standard gradient descent, and $\mathcal{O}(1/t)$ for normalized gradient descent. This momentum-based method is derived via the convex dual of the maximum-margin problem, and specifically by applying Nesterov acceleration to this dual, which manages to result in a simple and intuitive method in the primal. This dual view can also be used to derive a stochastic variant, which performs adaptive non-uniform sampling via the dual variables.
    Physics-Informed Neural Networks for Minimising Worst-Case Violations in DC Optimal Power Flow. (arXiv:2107.00465v1 [eess.SY])
    (2 min) Physics-informed neural networks exploit the existing models of the underlying physical systems to generate higher accuracy results with fewer data. Such approaches can help drastically reduce the computation time and generate a good estimate of computationally intensive processes in power systems, such as dynamic security assessment or optimal power flow. Combined with the extraction of worst-case guarantees for the neural network performance, such neural networks can be applied in safety-critical applications in power systems and build a high level of trust among power system operators. This paper takes the first step and applies, for the first time to our knowledge, Physics-Informed Neural Networks with Worst-Case Guarantees for the DC Optimal Power Flow problem. We look for guarantees related to (i) maximum constraint violations, (ii) maximum distance between predicted and optimal decision variables, and (iii) maximum sub-optimality in the entire input domain. In a range of PGLib-OPF networks, we demonstrate how physics-informed neural networks can be supplied with worst-case guarantees and how they can lead to reduced worst-case violations compared with conventional neural networks.
    Global Filter Networks for Image Classification. (arXiv:2107.00645v1 [cs.CV])
    (2 min) Recent advances in self-attention and pure multi-layer perceptrons (MLP) models for vision have shown great potential in achieving promising performance with fewer inductive biases. These models are generally based on learning interaction among spatial locations from raw data. The complexity of self-attention and MLP grows quadratically as the image size increases, which makes these models hard to scale up when high-resolution features are required. In this paper, we present the Global Filter Network (GFNet), a conceptually simple yet computationally efficient architecture, that learns long-term spatial dependencies in the frequency domain with log-linear complexity. Our architecture replaces the self-attention layer in vision transformers with three key operations: a 2D discrete Fourier transform, an element-wise multiplication between frequency-domain features and learnable global filters, and a 2D inverse Fourier transform. We exhibit favorable accuracy/complexity trade-offs of our models on both ImageNet and downstream tasks. Our results demonstrate that GFNet can be a very competitive alternative to transformer-style models and CNNs in efficiency, generalization ability and robustness. Code is available at https://github.com/raoyongming/GFNet
    Learning Large DAGs by Combining Continuous Optimization and Feedback Arc Set Heuristics. (arXiv:2107.00571v1 [cs.LG])
    (2 min) Bayesian networks represent relations between variables using a directed acyclic graph (DAG). Learning the DAG is an NP-hard problem and exact learning algorithms are feasible only for small sets of variables. We propose two scalable heuristics for learning DAGs in the linear structural equation case. Our methods learn the DAG by alternating between unconstrained gradient descent-based step to optimize an objective function and solving a maximum acyclic subgraph problem to enforce acyclicity. Thanks to this decoupling, our methods scale up beyond thousands of variables.
    Goal-Conditioned Reinforcement Learning with Imagined Subgoals. (arXiv:2107.00541v1 [cs.LG])
    (2 min) Goal-conditioned reinforcement learning endows an agent with a large variety of skills, but it often struggles to solve tasks that require more temporally extended reasoning. In this work, we propose to incorporate imagined subgoals into policy learning to facilitate learning of complex tasks. Imagined subgoals are predicted by a separate high-level policy, which is trained simultaneously with the policy and its critic. This high-level policy predicts intermediate states halfway to the goal using the value function as a reachability metric. We don't require the policy to reach these subgoals explicitly. Instead, we use them to define a prior policy, and incorporate this prior into a KL-constrained policy iteration scheme to speed up and regularize learning. Imagined subgoals are used during policy learning, but not during test time, where we only apply the learned policy. We evaluate our approach on complex robotic navigation and manipulation tasks and show that it outperforms existing methods by a large margin.
    Dep-$L_0$: Improving $L_0$-based Network Sparsification via Dependency Modeling. (arXiv:2107.00070v1 [cs.LG])
    (2 min) Training deep neural networks with an $L_0$ regularization is one of the prominent approaches for network pruning or sparsification. The method prunes the network during training by encouraging weights to become exactly zero. However, recent work of Gale et al. reveals that although this method yields high compression rates on smaller datasets, it performs inconsistently on large-scale learning tasks, such as ResNet50 on ImageNet. We analyze this phenomenon through the lens of variational inference and find that it is likely due to the independent modeling of binary gates, the mean-field approximation, which is known in Bayesian statistics for its poor performance due to the crude approximation. To mitigate this deficiency, we propose a dependency modeling of binary gates, which can be modeled effectively as a multi-layer perceptron (MLP). We term our algorithm Dep-$L_0$ as it prunes networks via a dependency-enabled $L_0$ regularization. Extensive experiments on CIFAR10, CIFAR100 and ImageNet with VGG16, ResNet50, ResNet56 show that our Dep-$L_0$ outperforms the original $L_0$-HC algorithm of Louizos et al. by a significant margin, especially on ImageNet. Compared with the state-of-the-arts network sparsification algorithms, our dependency modeling makes the $L_0$-based sparsification once again very competitive on large-scale learning tasks. Our source code is available at https://github.com/leo-yangli/dep-l0.
    Latent Execution for Neural Program Synthesis Beyond Domain-Specific Languages. (arXiv:2107.00101v1 [cs.PL])
    (2 min) Program synthesis from input-output examples has been a long-standing challenge, and recent works have demonstrated some success in designing deep neural networks for program synthesis. However, existing efforts in input-output neural program synthesis have been focusing on domain-specific languages, thus the applicability of previous approaches to synthesize code in full-fledged popular programming languages, such as C, remains a question. The main challenges lie in two folds. On the one hand, the program search space grows exponentially when the syntax and semantics of the programming language become more complex, which poses higher requirements on the synthesis algorithm. On the other hand, increasing the complexity of the programming language also imposes more difficulties on data collection, since building a large-scale training set for input-output program synthesis require random program generators to sample programs and input-output examples. In this work, we take the first step to synthesize C programs from input-output examples. In particular, we propose LaSynth, which learns the latent representation to approximate the execution of partially generated programs, even if their semantics are not well-defined. We demonstrate the possibility of synthesizing elementary C code from input-output examples, and leveraging learned execution significantly improves the prediction performance over existing approaches. Meanwhile, compared to the randomly generated ground-truth programs, LaSynth synthesizes more concise programs that resemble human-written code. We show that training on these synthesized programs further improves the prediction performance for both Karel and C program synthesis, indicating the promise of leveraging the learned program synthesizer to improve the dataset quality for input-output program synthesis.
    Variational Diffusion Models. (arXiv:2107.00630v1 [cs.LG])
    (2 min) Diffusion-based generative models have demonstrated a capacity for perceptually impressive synthesis, but can they also be great likelihood-based models? We answer this in the affirmative, and introduce a family of diffusion-based generative models that obtain state-of-the-art likelihoods on standard image density estimation benchmarks. Unlike other diffusion-based models, our method allows for efficient optimization of the noise schedule jointly with the rest of the model. We show that the variational lower bound (VLB) simplifies to a remarkably short expression in terms of the signal-to-noise ratio of the diffused data, thereby improving our theoretical understanding of this model class. Using this insight, we prove an equivalence between several models proposed in the literature. In addition, we show that the continuous-time VLB is invariant to the noise schedule, except for the signal-to-noise ratio at its endpoints. This enables us to learn a noise schedule that minimizes the variance of the resulting VLB estimator, leading to faster optimization. Combining these advances with architectural improvements, we obtain state-of-the-art likelihoods on image density estimation benchmarks, outperforming autoregressive models that have dominated these benchmarks for many years, with often significantly faster optimization. In addition, we show how to turn the model into a bits-back compression scheme, and demonstrate lossless compression rates close to the theoretical optimum.
    Machine learning based iterative learning control for non-repetitive time-varying systems. (arXiv:2107.00421v1 [eess.SY])
    (2 min) The repetitive tracking task for time-varying systems (TVSs) with non-repetitive time-varying parameters, which is also called non-repetitive TVSs, is realized in this paper using iterative learning control (ILC). A machine learning (ML) based nominal model update mechanism, which utilizes the linear regression technique to update the nominal model at each ILC trial only using the current trial information, is proposed for non-repetitive TVSs in order to enhance the ILC performance. Given that the ML mechanism forces the model uncertainties to remain within the ILC robust tolerance, an ILC update law is proposed to deal with non-repetitive TVSs. How to tune parameters inside ML and ILC algorithms to achieve the desired aggregate performance is also provided. The robustness and reliability of the proposed method are verified by simulations. Comparison with current state-of-the-art demonstrates its superior control performance in terms of controlling precision. This paper broadens ILC applications from time-invariant systems to non-repetitive TVSs, adopts ML regression technique to estimate non-repetitive time-varying parameters between two ILC trials and proposes a detailed parameter tuning mechanism to achieve desired performance, which are the main contributions.
    Secure Quantized Training for Deep Learning. (arXiv:2107.00501v1 [cs.LG])
    (2 min) We have implemented training of neural networks in secure multi-party computation (MPC) using quantization commonly used in the said setting. To the best of our knowledge, we are the first to present an MNIST classifier purely trained in MPC that comes within 0.2 percent of the accuracy of the same convolutional neural network trained via plaintext computation. More concretely, we have trained a network with two convolution and two dense layers to 99.2% accuracy in 25 epochs. This took 3.5 hours in our MPC implementation (under one hour for 99% accuracy).
    Pretext Tasks selection for multitask self-supervised speech representation learning. (arXiv:2107.00594v1 [eess.AS])
    (2 min) Through solving pretext tasks, self-supervised learning leverages unlabeled data to extract useful latent representations replacing traditional input features in the downstream task. In various application domains, including computer vision, natural language processing and audio/speech signal processing, a wide range of features where engineered through decades of research efforts. As it turns out, learning to predict such features has proven to be a particularly relevant pretext task leading to building useful self-supervised representations that prove to be effective for downstream tasks. However, methods and common practices for combining such pretext tasks, where each task targets a different group of features for better performance on the downstream task have not been explored and understood properly. In fact, the process relies almost exclusively on a computationally heavy experimental procedure, which becomes intractable with the increase of the number of pretext tasks. This paper introduces a method to select a group of pretext tasks among a set of candidates. The method we propose estimates properly calibrated weights for the partial losses corresponding to the considered pretext tasks during the self-supervised training process. The experiments conducted on speaker recognition and automatic speech recognition validate our approach, as the groups selected and weighted with our method perform better than classic baselines, thus facilitating the selection and combination of relevant pseudo-labels for self-supervised representation learning.
    Differentiable Particle Filters through Conditional Normalizing Flow. (arXiv:2107.00488v1 [cs.AI])
    (2 min) Differentiable particle filters provide a flexible mechanism to adaptively train dynamic and measurement models by learning from observed data. However, most existing differentiable particle filters are within the bootstrap particle filtering framework and fail to incorporate the information from latest observations to construct better proposals. In this paper, we utilize conditional normalizing flows to construct proposal distributions for differentiable particle filters, enriching the distribution families that the proposal distributions can represent. In addition, normalizing flows are incorporated in the construction of the dynamic model, resulting in a more expressive dynamic model. We demonstrate the performance of the proposed conditional normalizing flow-based differentiable particle filters in a visual tracking task.
    Focal Self-attention for Local-Global Interactions in Vision Transformers. (arXiv:2107.00641v1 [cs.CV])
    (2 min) Recently, Vision Transformer and its variants have shown great promise on various computer vision tasks. The ability of capturing short- and long-range visual dependencies through self-attention is arguably the main source for the success. But it also brings challenges due to quadratic computational overhead, especially for the high-resolution vision tasks (e.g., object detection). In this paper, we present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions. Using this new mechanism, each token attends the closest surrounding tokens at fine granularity but the tokens far away at coarse granularity, and thus can capture both short- and long-range visual dependencies efficiently and effectively. With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers on a range of public image classification and object detection benchmarks. In particular, our Focal Transformer models with a moderate size of 51.1M and a larger size of 89.8M achieve 83.5 and 83.8 Top-1 accuracy, respectively, on ImageNet classification at 224x224 resolution. Using Focal Transformers as the backbones, we obtain consistent and substantial improvements over the current state-of-the-art Swin Transformers for 6 different object detection methods trained with standard 1x and 3x schedules. Our largest Focal Transformer yields 58.7/58.9 box mAPs and 50.9/51.3 mask mAPs on COCO mini-val/test-dev, and 55.4 mIoU on ADE20K for semantic segmentation, creating new SoTA on three of the most challenging computer vision tasks.
    Policy Transfer across Visual and Dynamics Domain Gaps via Iterative Grounding. (arXiv:2107.00339v1 [cs.RO])
    (2 min) The ability to transfer a policy from one environment to another is a promising avenue for efficient robot learning in realistic settings where task supervision is not available. This can allow us to take advantage of environments well suited for training, such as simulators or laboratories, to learn a policy for a real robot in a home or office. To succeed, such policy transfer must overcome both the visual domain gap (e.g. different illumination or background) and the dynamics domain gap (e.g. different robot calibration or modelling error) between source and target environments. However, prior policy transfer approaches either cannot handle a large domain gap or can only address one type of domain gap at a time. In this paper, we propose a novel policy transfer method with iterative "environment grounding", IDAPT, that alternates between (1) directly minimizing both visual and dynamics domain gaps by grounding the source environment in the target environment domains, and (2) training a policy on the grounded source environment. This iterative training progressively aligns the domains between the two environments and adapts the policy to the target environment. Once trained, the policy can be directly executed on the target environment. The empirical results on locomotion and robotic manipulation tasks demonstrate that our approach can effectively transfer a policy across visual and dynamics domain gaps with minimal supervision and interaction with the target environment. Videos and code are available at https://clvrai.com/idapt .
    Online learning of windmill time series using Long Short-term Cognitive Networks. (arXiv:2107.00425v1 [cs.LG])
    (2 min) Forecasting windmill time series is often the basis of other processes such as anomaly detection, health monitoring, or maintenance scheduling. The amount of data generated on windmill farms makes online learning the most viable strategy to follow. Such settings require retraining the model each time a new batch of data is available. However, update the model with the new information is often very expensive to perform using traditional Recurrent Neural Networks (RNNs). In this paper, we use Long Short-term Cognitive Networks (LSTCNs) to forecast windmill time series in online settings. These recently introduced neural systems consist of chained Short-term Cognitive Network blocks, each processing a temporal data chunk. The learning algorithm of these blocks is based on a very fast, deterministic learning rule that makes LSTCNs suitable for online learning tasks. The numerical simulations using a case study with four windmills showed that our approach reported the lowest forecasting errors with respect to a simple RNN, a Long Short-term Memory, a Gated Recurrent Unit, and a Hidden Markov Model. What is perhaps more important is that the LSTCN approach is significantly faster than these state-of-the-art models.
    Interviewer-Candidate Role Play: Towards Developing Real-World NLP Systems. (arXiv:2107.00315v1 [cs.CL])
    (2 min) Standard NLP tasks do not incorporate several common real-world scenarios such as seeking clarifications about the question, taking advantage of clues, abstaining in order to avoid incorrect answers, etc. This difference in task formulation hinders the adoption of NLP systems in real-world settings. In this work, we take a step towards bridging this gap and present a multi-stage task that simulates a typical human-human questioner-responder interaction such as an interview. Specifically, the system is provided with question simplifications, knowledge statements, examples, etc. at various stages to improve its prediction when it is not sufficiently confident. We instantiate the proposed task in Natural Language Inference setting where a system is evaluated on both in-domain and out-of-domain (OOD) inputs. We conduct comprehensive experiments and find that the multi-stage formulation of our task leads to OOD generalization performance improvement up to 2.29% in Stage 1, 1.91% in Stage 2, 54.88% in Stage 3, and 72.02% in Stage 4 over the standard unguided prediction. However, our task leaves a significant challenge for NLP researchers to further improve OOD performance at each stage.
    Deep Hierarchical Super-Resolution for Scientific Data Reduction and Visualization. (arXiv:2107.00462v1 [eess.IV])
    (2 min) We present an approach for hierarchical super resolution (SR) using neural networks on an octree data representation. We train a hierarchy of neural networks, each capable of 2x upscaling in each spatial dimension between two levels of detail, and use these networks in tandem to facilitate large scale factor super resolution, scaling with the number of trained networks. We utilize these networks in a hierarchical super resolution algorithm that upscales multiresolution data to a uniform high resolution without introducing seam artifacts on octree node boundaries. We evaluate application of this algorithm in a data reduction framework by dynamically downscaling input data to an octree-based data structure to represent the multiresolution data before compressing for additional storage reduction. We demonstrate that our approach avoids seam artifacts common to multiresolution data formats, and show how neural network super resolution assisted data reduction can preserve global features better than compressors alone at the same compression ratios.
    Predictive Modeling in the Presence of Nuisance-Induced Spurious Correlations. (arXiv:2107.00520v1 [cs.LG])
    (2 min) Deep predictive models often make use of spurious correlations between the label and the covariates that differ between training and test distributions. In many classification tasks, spurious correlations are induced by a changing relationship between the label and some nuisance variables correlated with the covariates. For example, in classifying animals in natural images, the background, which is the nuisance, can predict the type of animal, but this nuisance label relationship does not always hold. This nuisance-label relationship does not always hold. We formalize a family of distributions that only differ in the nuisance-label relationship and and introduce a distribution where this relationship is broken called the nuisance-randomized distribution. We introduce a set of predictive models built from the nuisance-randomized distribution with representations, that when conditioned on, do not correlate the label and the nuisance. For models in this set, we lower bound the performance for any member of the family with the mutual information between the representation and the label under the nuisance-randomized distribution. To build predictive models that maximize the performance lower bound, we develop Nuisance-Randomized Distillation (NURD). We evaluate NURD on a synthetic example, colored-MNIST, and classifying chest X-rays. When using non-lung patches as the nuisance in classifying chest X-rays, NURD produces models that predict pneumonia under strong spurious correlations.
    DivergentNets: Medical Image Segmentation by Network Ensemble. (arXiv:2107.00283v1 [eess.IV])
    (2 min) Detection of colon polyps has become a trending topic in the intersecting fields of machine learning and gastrointestinal endoscopy. The focus has mainly been on per-frame classification. More recently, polyp segmentation has gained attention in the medical community. Segmentation has the advantage of being more accurate than per-frame classification or object detection as it can show the affected area in greater detail. For our contribution to the EndoCV 2021 segmentation challenge, we propose two separate approaches. First, a segmentation model named TriUNet composed of three separate UNet models. Second, we combine TriUNet with an ensemble of well-known segmentation models, namely UNet++, FPN, DeepLabv3, and DeepLabv3+, into a model called DivergentNets to produce more generalizable medical image segmentation masks. In addition, we propose a modified Dice loss that calculates loss only for a single class when performing multiclass segmentation, forcing the model to focus on what is most important. Overall, the proposed methods achieved the best average scores for each respective round in the challenge, with TriUNet being the winning model in Round I and DivergentNets being the winning model in Round II of the segmentation generalization challenge at EndoCV 2021. The implementation of our approach is made publicly available on GitHub.
    A Consistency-Based Loss for Deep Odometry Through Uncertainty Propagation. (arXiv:2107.00366v1 [cs.LG])
    (2 min) The incremental poses computed through odometry can be integrated over time to calculate the pose of a device with respect to an initial location. The resulting global pose may be used to formulate a second, consistency based, loss term in a deep odometry setting. In such cases where multiple losses are imposed on a network, the uncertainty over each output can be derived to weigh the different loss terms in a maximum likelihood setting. However, when imposing a constraint on the integrated transformation, due to how only odometry is estimated at each iteration of the algorithm, there is no information about the uncertainty associated with the global pose to weigh the global loss term. In this paper, we associate uncertainties with the output poses of a deep odometry network and propagate the uncertainties through each iteration. Our goal is to use the estimated covariance matrix at each incremental step to weigh the loss at the corresponding step while weighting the global loss term using the compounded uncertainty. This formulation provides an adaptive method to weigh the incremental and integrated loss terms against each other, noting the increase in uncertainty as new estimates arrive. We provide quantitative and qualitative analysis of pose estimates and show that our method surpasses the accuracy of the state-of-the-art Visual Odometry approaches. Then, uncertainty estimates are evaluated and comparisons against fixed baselines are provided. Finally, the uncertainty values are used in a realistic example to show the effectiveness of uncertainty quantification for localization.
    Never Go Full Batch (in Stochastic Convex Optimization). (arXiv:2107.00469v1 [math.OC])
    (2 min) We study the generalization performance of $\text{full-batch}$ optimization algorithms for stochastic convex optimization: these are first-order methods that only access the exact gradient of the empirical risk (rather than gradients with respect to individual data points), that include a wide range of algorithms such as gradient descent, mirror descent, and their regularized and/or accelerated variants. We provide a new separation result showing that, while algorithms such as stochastic gradient descent can generalize and optimize the population risk to within $\epsilon$ after $O(1/\epsilon^2)$ iterations, full-batch methods either need at least $\Omega(1/\epsilon^4)$ iterations or exhibit a dimension-dependent sample complexity.
    Global Knowledge Distillation in Federated Learning. (arXiv:2107.00051v1 [cs.LG])
    (2 min) Knowledge distillation has caught a lot of attention in Federated Learning (FL) recently. It has the advantage for FL to train on heterogeneous clients which have different data size and data structure. However, data samples across all devices are usually not independent and identically distributed (non-i.i.d), posing additional challenges to the convergence and speed of federated learning. As FL randomly asks the clients to join the training process and each client only learns from local non-i.i.d data, which makes learning processing even slower. In order to solve this problem, an intuitive idea is using the global model to guide local training. In this paper, we propose a novel global knowledge distillation method, named FedGKD, which learns the knowledge from past global models to tackle down the local bias training problem. By learning from global knowledge and consistent with current local models, FedGKD learns a global knowledge model in FL. To demonstrate the effectiveness of the proposed method, we conduct extensive experiments on various CV datasets (CIFAR-10/100) and settings (non-i.i.d data). The evaluation results show that FedGKD outperforms previous state-of-the-art methods.
    Lossless Coding of Point Cloud Geometry using a Deep Generative Model. (arXiv:2107.00400v1 [eess.IV])
    (2 min) This paper proposes a lossless point cloud (PC) geometry compression method that uses neural networks to estimate the probability distribution of voxel occupancy. First, to take into account the PC sparsity, our method adaptively partitions a point cloud into multiple voxel block sizes. This partitioning is signalled via an octree. Second, we employ a deep auto-regressive generative model to estimate the occupancy probability of each voxel given the previously encoded ones. We then employ the estimated probabilities to code efficiently a block using a context-based arithmetic coder. Our context has variable size and can expand beyond the current block to learn more accurate probabilities. We also consider using data augmentation techniques to increase the generalization capability of the learned probability models, in particular in the presence of noise and lower-density point clouds. Experimental evaluation, performed on a variety of point clouds from four different datasets and with diverse characteristics, demonstrates that our method reduces significantly (by up to 30%) the rate for lossless coding compared to the state-of-the-art MPEG codec.
    Unsupervised Model Drift Estimation with Batch Normalization Statistics for Dataset Shift Detection and Model Selection. (arXiv:2107.00191v1 [cs.LG])
    (2 min) While many real-world data streams imply that they change frequently in a nonstationary way, most of deep learning methods optimize neural networks on training data, and this leads to severe performance degradation when dataset shift happens. However, it is less possible to annotate or inspect newly streamed data by humans, and thus it is desired to measure model drift at inference time in an unsupervised manner. In this paper, we propose a novel method of model drift estimation by exploiting statistics of batch normalization layer on unlabeled test data. To remedy possible sampling error of streamed input data, we adopt low-rank approximation to each representational layer. We show the effectiveness of our method not only on dataset shift detection but also on model selection when there are multiple candidate models among model zoo or training trajectories in an unsupervised way. We further demonstrate the consistency of our method by comparing model drift scores between different network architectures.
    Using Anomaly Feature Vectors for Detecting, Classifying and Warning of Outlier Adversarial Examples. (arXiv:2107.00561v1 [cs.LG])
    (2 min) We present DeClaW, a system for detecting, classifying, and warning of adversarial inputs presented to a classification neural network. In contrast to current state-of-the-art methods that, given an input, detect whether an input is clean or adversarial, we aim to also identify the types of adversarial attack (e.g., PGD, Carlini-Wagner or clean). To achieve this, we extract statistical profiles, which we term as anomaly feature vectors, from a set of latent features. Preliminary findings suggest that AFVs can help distinguish among several types of adversarial attacks (e.g., PGD versus Carlini-Wagner) with close to 93% accuracy on the CIFAR-10 dataset. The results open the door to using AFV-based methods for exploring not only adversarial attack detection but also classification of the attack type and then design of attack-specific mitigation strategies.
    CarSNN: An Efficient Spiking Neural Network for Event-Based Autonomous Cars on the Loihi Neuromorphic Research Processor. (arXiv:2107.00401v1 [cs.NE])
    (2 min) Autonomous Driving (AD) related features provide new forms of mobility that are also beneficial for other kind of intelligent and autonomous systems like robots, smart transportation, and smart industries. For these applications, the decisions need to be made fast and in real-time. Moreover, in the quest for electric mobility, this task must follow low power policy, without affecting much the autonomy of the mean of transport or the robot. These two challenges can be tackled using the emerging Spiking Neural Networks (SNNs). When deployed on a specialized neuromorphic hardware, SNNs can achieve high performance with low latency and low power consumption. In this paper, we use an SNN connected to an event-based camera for facing one of the key problems for AD, i.e., the classification between cars and other objects. To consume less power than traditional frame-based cameras, we use a Dynamic Vision Sensor (DVS). The experiments are made following an offline supervised learning rule, followed by mapping the learnt SNN model on the Intel Loihi Neuromorphic Research Chip. Our best experiment achieves an accuracy on offline implementation of 86%, that drops to 83% when it is ported onto the Loihi Chip. The Neuromorphic Hardware implementation has maximum 0.72 ms of latency for every sample, and consumes only 310 mW. To the best of our knowledge, this work is the first implementation of an event-based car classifier on a Neuromorphic Chip.
    Explainable nonlinear modelling of multiple time series with invertible neural networks. (arXiv:2107.00391v1 [eess.SP])
    (2 min) A method for nonlinear topology identification is proposed, based on the assumption that a collection of time series are generated in two steps: i) a vector autoregressive process in a latent space, and ii) a nonlinear, component-wise, monotonically increasing observation mapping. The latter mappings are assumed invertible, and are modelled as shallow neural networks, so that their inverse can be numerically evaluated, and their parameters can be learned using a technique inspired in deep learning. Due to the function inversion, the back-propagation step is not straightforward, and this paper explains the steps needed to calculate the gradients applying implicit differentiation. Whereas the model explainability is the same as that for linear VAR processes, preliminary numerical tests show that the prediction error becomes smaller.
    Stochastic Gradient Descent-Ascent and Consensus Optimization for Smooth Games: Convergence Analysis under Expected Co-coercivity. (arXiv:2107.00052v1 [cs.LG])
    (2 min) Two of the most prominent algorithms for solving unconstrained smooth games are the classical stochastic gradient descent-ascent (SGDA) and the recently introduced stochastic consensus optimization (SCO) (Mescheder et al., 2017). SGDA is known to converge to a stationary point for specific classes of games, but current convergence analyses require a bounded variance assumption. SCO is used successfully for solving large-scale adversarial problems, but its convergence guarantees are limited to its deterministic variant. In this work, we introduce the expected co-coercivity condition, explain its benefits, and provide the first last-iterate convergence guarantees of SGDA and SCO under this condition for solving a class of stochastic variational inequality problems that are potentially non-monotone. We prove linear convergence of both methods to a neighborhood of the solution when they use constant step-size, and we propose insightful stepsize-switching rules to guarantee convergence to the exact solution. In addition, our convergence guarantees hold under the arbitrary sampling paradigm, and as such, we give insights into the complexity of minibatching.
    VideoLightFormer: Lightweight Action Recognition using Transformers. (arXiv:2107.00451v1 [cs.CV])
    (2 min) Efficient video action recognition remains a challenging problem. One large model after another takes the place of the state-of-the-art on the Kinetics dataset, but real-world efficiency evaluations are often lacking. In this work, we fill this gap and investigate the use of transformers for efficient action recognition. We propose a novel, lightweight action recognition architecture, VideoLightFormer. In a factorized fashion, we carefully extend the 2D convolutional Temporal Segment Network with transformers, while maintaining spatial and temporal video structure throughout the entire model. Existing methods often resort to one of the two extremes, where they either apply huge transformers to video features, or minimal transformers on highly pooled video features. Our method differs from them by keeping the transformer models small, but leveraging full spatiotemporal feature structure. We evaluate VideoLightFormer in a high-efficiency setting on the temporally-demanding EPIC-KITCHENS-100 and Something-Something-V2 (SSV2) datasets and find that it achieves a better mix of efficiency and accuracy than existing state-of-the-art models, apart from the Temporal Shift Module on SSV2.
    A Survey on Graph-Based Deep Learning for Computational Histopathology. (arXiv:2107.00272v1 [cs.LG])
    (2 min) With the remarkable success of representation learning for prediction problems, we have witnessed a rapid expansion of the use of machine learning and deep learning for the analysis of digital pathology and biopsy image patches. However, traditional learning over patch-wise features using convolutional neural networks limits the model when attempting to capture global contextual information. The phenotypical and topological distribution of constituent histological entities play a critical role in tissue diagnosis. As such, graph data representations and deep learning have attracted significant attention for encoding tissue representations, and capturing intra- and inter- entity level interactions. In this review, we provide a conceptual grounding of graph-based deep learning and discuss its current success for tumor localization and classification, tumor invasion and staging, image retrieval, and survival prediction. We provide an overview of these methods in a systematic manner organized by the graph representation of the input image including whole slide images and tissue microarrays. We also outline the limitations of existing techniques, and suggest potential future advances in this domain.
    Cascade Decoders-Based Autoencoders for Image Reconstruction. (arXiv:2107.00002v1 [cs.LG])
    (2 min) Autoencoders are composed of coding and decoding units, hence they hold the inherent potential of high-performance data compression and signal compressed sensing. The main disadvantages of current autoencoders comprise the following several aspects: the research objective is not data reconstruction but feature representation; the performance evaluation of data recovery is neglected; it is hard to achieve lossless data reconstruction by pure autoencoders, even by pure deep learning. This paper aims for image reconstruction of autoencoders, employs cascade decoders-based autoencoders, perfects the performance of image reconstruction, approaches gradually lossless image recovery, and provides solid theory and application basis for autoencoders-based image compression and compressed sensing. The proposed serial decoders-based autoencoders include the architectures of multi-level decoders and the related optimization algorithms. The cascade decoders consist of general decoders, residual decoders, adversarial decoders and their combinations. It is evaluated by the experimental results that the proposed autoencoders outperform the classical autoencoders in the performance of image reconstruction.
    SinGAN-Seg: Synthetic Training Data Generation for Medical Image Segmentation. (arXiv:2107.00471v1 [eess.IV])
    (2 min) Processing medical data to find abnormalities is a time-consuming and costly task, requiring tremendous efforts from medical experts. Therefore, Ai has become a popular tool for the automatic processing of medical data, acting as a supportive tool for doctors. AI tools highly depend on data for training the models. However, there are several constraints to access to large amounts of medical data to train machine learning algorithms in the medical domain, e.g., due to privacy concerns and the costly, time-consuming medical data annotation process. To address this, in this paper we present a novel synthetic data generation pipeline called SinGAN-Seg to produce synthetic medical data with the corresponding annotated ground truth masks. We show that these synthetic data generation pipelines can be used as an alternative to bypass privacy concerns and as an alternative way to produce artificial segmentation datasets with corresponding ground truth masks to avoid the tedious medical data annotation process. As a proof of concept, we used an open polyp segmentation dataset. By training UNet++ using both the real polyp segmentation dataset and the corresponding synthetic dataset generated from the SinGAN-Seg pipeline, we show that the synthetic data can achieve a very close performance to the real data when the real segmentation datasets are large enough. In addition, we show that synthetic data generated from the SinGAN-Seg pipeline improving the performance of segmentation algorithms when the training dataset is very small. Since our SinGAN-Seg pipeline is applicable for any medical dataset, this pipeline can be used with any other segmentation datasets.
    Overhead-MNIST: Machine Learning Baselines for Image Classification. (arXiv:2107.00436v1 [cs.CV])
    (2 min) Twenty-three machine learning algorithms were trained then scored to establish baseline comparison metrics and to select an image classification algorithm worthy of embedding into mission-critical satellite imaging systems. The Overhead-MNIST dataset is a collection of satellite images similar in style to the ubiquitous MNIST hand-written digits found in the machine learning literature. The CatBoost classifier, Light Gradient Boosting Machine, and Extreme Gradient Boosting models produced the highest accuracies, Areas Under the Curve (AUC), and F1 scores in a PyCaret general comparison. Separate evaluations showed that a deep convolutional architecture was the most promising. We present results for the overall best performing algorithm as a baseline for edge deployability and future performance improvement: a convolutional neural network (CNN) scoring 0.965 categorical accuracy on unseen test data.
    AdaXpert: Adapting Neural Architecture for Growing Data. (arXiv:2107.00254v1 [cs.LG])
    (2 min) In real-world applications, data often come in a growing manner, where the data volume and the number of classes may increase dynamically. This will bring a critical challenge for learning: given the increasing data volume or the number of classes, one has to instantaneously adjust the neural model capacity to obtain promising performance. Existing methods either ignore the growing nature of data or seek to independently search an optimal architecture for a given dataset, and thus are incapable of promptly adjusting the architectures for the changed data. To address this, we present a neural architecture adaptation method, namely Adaptation eXpert (AdaXpert), to efficiently adjust previous architectures on the growing data. Specifically, we introduce an architecture adjuster to generate a suitable architecture for each data snapshot, based on the previous architecture and the different extent between current and previous data distributions. Furthermore, we propose an adaptation condition to determine the necessity of adjustment, thereby avoiding unnecessary and time-consuming adjustments. Extensive experiments on two growth scenarios (increasing data volume and number of classes) demonstrate the effectiveness of the proposed method.
    The Interplay between Distribution Parameters and the Accuracy-Robustness Tradeoff in Classification. (arXiv:2107.00247v1 [cs.LG])
    (2 min) Adversarial training tends to result in models that are less accurate on natural (unperturbed) examples compared to standard models. This can be attributed to either an algorithmic shortcoming or a fundamental property of the training data distribution, which admits different solutions for optimal standard and adversarial classifiers. In this work, we focus on the latter case under a binary Gaussian mixture classification problem. Unlike earlier work, we aim to derive the natural accuracy gap between the optimal Bayes and adversarial classifiers, and study the effect of different distributional parameters, namely separation between class centroids, class proportions, and the covariance matrix, on the derived gap. We show that under certain conditions, the natural error of the optimal adversarial classifier, as well as the gap, are locally minimized when classes are balanced, contradicting the performance of the Bayes classifier where perfect balance induces the worst accuracy. Moreover, we show that with an $\ell_\infty$ bounded perturbation and an adversarial budget of $\epsilon$, this gap is $\Theta(\epsilon^2)$ for the worst-case parameters, which for suitably small $\epsilon$ indicates the theoretical possibility of achieving robust classifiers with near-perfect accuracy, which is rarely reflected in practical algorithms.
    FedMix: Approximation of Mixup under Mean Augmented Federated Learning. (arXiv:2107.00233v1 [cs.LG])
    (2 min) Federated learning (FL) allows edge devices to collectively learn a model without directly sharing data within each device, thus preserving privacy and eliminating the need to store data globally. While there are promising results under the assumption of independent and identically distributed (iid) local data, current state-of-the-art algorithms suffer from performance degradation as the heterogeneity of local data across clients increases. To resolve this issue, we propose a simple framework, Mean Augmented Federated Learning (MAFL), where clients send and receive averaged local data, subject to the privacy requirements of target applications. Under our framework, we propose a new augmentation algorithm, named FedMix, which is inspired by a phenomenal yet simple data augmentation method, Mixup, but does not require local raw data to be directly shared among devices. Our method shows greatly improved performance in the standard benchmark datasets of FL, under highly non-iid federated settings, compared to conventional algorithms.
    FCMI: Feature Correlation based Missing Data Imputation. (arXiv:2107.00100v1 [cs.LG])
    (2 min) Processed data are insightful, and crude data are obtuse. A serious threat to data reliability is missing values. Such data leads to inaccurate analysis and wrong predictions. We propose an efficient technique to impute the missing value in the dataset based on correlation called FCMI (Feature Correlation based Missing Data Imputation). We have considered the correlation of the attributes of the dataset, and that is our central idea. Our proposed algorithm picks the highly correlated attributes of the dataset and uses these attributes to build a regression model whose parameters are optimized such that the correlation of the dataset is maintained. Experiments conducted on both classification and regression datasets show that the proposed imputation technique outperforms existing imputation algorithms.
    Reparameterized Sampling for Generative Adversarial Networks. (arXiv:2107.00352v1 [stat.ML])
    (2 min) Recently, sampling methods have been successfully applied to enhance the sample quality of Generative Adversarial Networks (GANs). However, in practice, they typically have poor sample efficiency because of the independent proposal sampling from the generator. In this work, we propose REP-GAN, a novel sampling method that allows general dependent proposals by REParameterizing the Markov chains into the latent space of the generator. Theoretically, we show that our reparameterized proposal admits a closed-form Metropolis-Hastings acceptance ratio. Empirically, extensive experiments on synthetic and real datasets demonstrate that our REP-GAN largely improves the sample efficiency and obtains better sample quality simultaneously.
    Morphological classification of compact and extended radio galaxies using convolutional neural networks and data augmentation techniques. (arXiv:2107.00385v1 [astro-ph.GA])
    (2 min) Machine learning techniques have been increasingly used in astronomical applications and have proven to successfully classify objects in image data with high accuracy. The current work uses archival data from the Faint Images of the Radio Sky at Twenty Centimeters (FIRST) to classify radio galaxies into four classes: Fanaroff-Riley Class I (FRI), Fanaroff-Riley Class II (FRII), Bent-Tailed (BENT), and Compact (COMPT). The model presented in this work is based on Convolutional Neural Networks (CNNs). The proposed architecture comprises three parallel blocks of convolutional layers combined and processed for final classification by two feed-forward layers. Our model classified selected classes of radio galaxy sources on an independent testing subset with an average of 96\% for precision, recall, and F1 score. The best selected augmentation techniques were rotations, horizontal or vertical flips, and increase of brightness. Shifts, zoom and decrease of brightness worsened the performance of the model. The current results show that model developed in this work is able to identify different morphological classes of radio galaxies with a high efficiency and performance
    Using AntiPatterns to avoid MLOps Mistakes. (arXiv:2107.00079v1 [cs.LG])
    (2 min) We describe lessons learned from developing and deploying machine learning models at scale across the enterprise in a range of financial analytics applications. These lessons are presented in the form of antipatterns. Just as design patterns codify best software engineering practices, antipatterns provide a vocabulary to describe defective practices and methodologies. Here we catalog and document numerous antipatterns in financial ML operations (MLOps). Some antipatterns are due to technical errors, while others are due to not having sufficient knowledge of the surrounding context in which ML results are used. By providing a common vocabulary to discuss these situations, our intent is that antipatterns will support better documentation of issues, rapid communication between stakeholders, and faster resolution of problems. In addition to cataloging antipatterns, we describe solutions, best practices, and future directions toward MLOps maturity.
    Learning a Reversible Embedding Mapping using Bi-Directional Manifold Alignment. (arXiv:2107.00124v1 [cs.CL])
    (2 min) We propose a Bi-Directional Manifold Alignment (BDMA) that learns a non-linear mapping between two manifolds by explicitly training it to be bijective. We demonstrate BDMA by training a model for a pair of languages rather than individual, directed source and target combinations, reducing the number of models by 50%. We show that models trained with BDMA in the "forward" (source to target) direction can successfully map words in the "reverse" (target to source) direction, yielding equivalent (or better) performance to standard unidirectional translation models where the source and target language is flipped. We also show how BDMA reduces the overall size of the model.
    Implicit Acceleration and Feature Learning inInfinitely Wide Neural Networks with Bottlenecks. (arXiv:2107.00364v1 [cs.LG])
    (2 min) We analyze the learning dynamics of infinitely wide neural networks with a finite sized bottle-neck. Unlike the neural tangent kernel limit, a bottleneck in an otherwise infinite width network al-lows data dependent feature learning in its bottle-neck representation. We empirically show that a single bottleneck in infinite networks dramatically accelerates training when compared to purely in-finite networks, with an improved overall performance. We discuss the acceleration phenomena by drawing similarities to infinitely wide deep linear models, where the acceleration effect of a bottleneck can be understood theoretically.
    Cross-Lingual Adaptation for Type Inference. (arXiv:2107.00157v1 [cs.AI])
    (2 min) Deep learning-based techniques have been widely applied to the program analysis tasks, in fields such as type inference, fault localization, and code summarization. Hitherto deep learning-based software engineering systems rely thoroughly on supervised learning approaches, which require laborious manual effort to collect and label a prohibitively large amount of data. However, most Turing-complete imperative languages share similar control- and data-flow structures, which make it possible to transfer knowledge learned from one language to another. In this paper, we propose cross-lingual adaptation of program analysis, which allows us to leverage prior knowledge learned from the labeled dataset of one language and transfer it to the others. Specifically, we implemented a cross-lingual adaptation framework, PLATO, to transfer a deep learning-based type inference procedure across weakly typed languages, e.g., Python to JavaScript and vice versa. PLATO incorporates a novel joint graph kernelized attention based on abstract syntax tree and control flow graph, and applies anchor word augmentation across different languages. Besides, by leveraging data from strongly typed languages, PLATO improves the perplexity of the backbone cross-programming-language model and the performance of downstream cross-lingual transfer for type inference. Experimental results illustrate that our framework significantly improves the transferability over the baseline method by a large margin.
    MHER: Model-based Hindsight Experience Replay. (arXiv:2107.00306v1 [cs.LG])
    (2 min) Solving multi-goal reinforcement learning (RL) problems with sparse rewards is generally challenging. Existing approaches have utilized goal relabeling on collected experiences to alleviate issues raised from sparse rewards. However, these methods are still limited in efficiency and cannot make full use of experiences. In this paper, we propose Model-based Hindsight Experience Replay (MHER), which exploits experiences more efficiently by leveraging environmental dynamics to generate virtual achieved goals. Replacing original goals with virtual goals generated from interaction with a trained dynamics model leads to a novel relabeling method, \emph{model-based relabeling} (MBR). Based on MBR, MHER performs both reinforcement learning and supervised learning for efficient policy improvement. Theoretically, we also prove the supervised part in MHER, i.e., goal-conditioned supervised learning with MBR data, optimizes a lower bound on the multi-goal RL objective. Experimental results in several point-based tasks and simulated robotics environments show that MHER achieves significantly higher sample efficiency than previous state-of-the-art methods.
    Regressing Location on Text for Probabilistic Geocoding. (arXiv:2107.00080v1 [cs.CL])
    (2 min) Text data are an important source of detailed information about social and political events. Automated systems parse large volumes of text data to infer or extract structured information that describes actors, actions, dates, times, and locations. One of these sub-tasks is geocoding: predicting the geographic coordinates associated with events or locations described by a given text. We present an end-to-end probabilistic model for geocoding text data. Additionally, we collect a novel data set for evaluating the performance of geocoding systems. We compare the model-based solution, called ELECTRo-map, to the current state-of-the-art open source system for geocoding texts for event data. Finally, we discuss the benefits of end-to-end model-based geocoding, including principled uncertainty estimation and the ability of these models to leverage contextual information.
    Boosting Certified $\ell_\infty$ Robustness with EMA Method and Ensemble Model. (arXiv:2107.00230v1 [cs.LG])
    (2 min) The neural network with $1$-Lipschitz property based on $\ell_\infty$-dist neuron has a theoretical guarantee in certified $\ell_\infty$ robustness. However, due to the inherent difficulties in the training of the network, the certified accuracy of previous work is limited. In this paper, we propose two approaches to deal with these difficuties. Aiming at the characteristics of the training process based on $\ell_\infty$-norm neural network, we introduce the EMA method to improve the training process. Considering the randomness of the training algorithm, we propose an ensemble method based on trained base models that have the $1$-Lipschitz property and gain significant improvement in the small parameter network. Moreover, we give the theoretical analysis of the ensemble method based on the $1$-Lipschitz property on the certified robustness, which ensures the effectiveness and stability of the algorithm. Our code is available at https://github.com/Theia-4869/EMA-and-Ensemble-Lip-Networks.
    Robust Coreset for Continuous-and-Bounded Learning (with Outliers). (arXiv:2107.00068v1 [cs.LG])
    (2 min) In this big data era, we often confront large-scale data in many machine learning tasks. A common approach for dealing with large-scale data is to build a small summary, {\em e.g.,} coreset, that can efficiently represent the original input. However, real-world datasets usually contain outliers and most existing coreset construction methods are not resilient against outliers (in particular, the outliers can be located arbitrarily in the space by an adversarial attacker). In this paper, we propose a novel robust coreset method for the {\em continuous-and-bounded learning} problem (with outliers) which includes a broad range of popular optimization objectives in machine learning, like logistic regression and $ k $-means clustering. Moreover, our robust coreset can be efficiently maintained in fully-dynamic environment. To the best of our knowledge, this is the first robust and fully-dynamic coreset construction method for these optimization problems. We also conduct the experiments to evaluate the effectiveness of our robust coreset in practice.
    On the Expected Complexity of Maxout Networks. (arXiv:2107.00379v1 [stat.ML])
    (2 min) Learning with neural networks relies on the complexity of the representable functions, but more importantly, the particular assignment of typical parameters to functions of different complexity. Taking the number of activation regions as a complexity measure, recent works have shown that the practical complexity of deep ReLU networks is often far from the theoretical maximum. In this work we show that this phenomenon also occurs in networks with maxout (multi-argument) activation functions and when considering the decision boundaries in classification tasks. We also show that the parameter space has a multitude of full-dimensional regions with widely different complexity, and obtain nontrivial lower bounds on the expected complexity. Finally, we investigate different parameter initialization procedures and show that they can increase the speed of convergence in training.
    Combining Feature and Instance Attribution to Detect Artifacts. (arXiv:2107.00323v1 [cs.CL])
    (2 min) Training the large deep neural networks that dominate NLP requires large datasets. Many of these are collected automatically or via crowdsourcing, and may exhibit systematic biases or annotation artifacts. By the latter, we mean correlations between inputs and outputs that are spurious, insofar as they do not represent a generally held causal relationship between features and classes; models that exploit such correlations may appear to perform a given task well, but fail on out of sample data. In this paper we propose methods to facilitate identification of training data artifacts, using new hybrid approaches that combine saliency maps (which highlight important input features) with instance attribution methods (which retrieve training samples influential to a given prediction). We show that this proposed training-feature attribution approach can be used to uncover artifacts in training data, and use it to identify previously unreported artifacts in a few standard NLP datasets. We execute a small user study to evaluate whether these methods are useful to NLP researchers in practice, with promising results. We make code for all methods and experiments in this paper available.
    SA-MATD3:Self-attention-based multi-agent continuous control method in cooperative environments. (arXiv:2107.00284v1 [cs.MA])
    (2 min) Cooperative problems under continuous control have always been the focus of multi-agent reinforcement learning. Existing algorithms suffer from the problem of uneven learning degree with the increase of the number of agents. In this paper, a new structure for a multi-agent actor critic is proposed, and the self-attention mechanism is applied in the critic network and the value decomposition method used to solve the uneven problem. The proposed algorithm makes full use of the samples in the replay memory buffer to learn the behavior of a class of agents. First, a new update method is proposed for policy networks that promotes learning efficiency. Second, the utilization of samples is improved, at the same time reflecting the ability of perspective-taking among groups. Finally, the "deceptive signal" in training is eliminated and the learning degree among agents is more uniform than in the existing methods. Multiple experiments were conducted in two typical scenarios of a multi-agent particle environment. Experimental results show that the proposed algorithm can perform better than the state-of-the-art ones, and that it exhibits higher learning efficiency with an increasing number of agents.
    Few-Shot Learning with a Strong Teacher. (arXiv:2107.00197v1 [cs.CV])
    (2 min) Few-shot learning (FSL) aims to train a strong classifier using limited labeled examples. Many existing works take the meta-learning approach, sampling few-shot tasks in turn and optimizing the few-shot learner's performance on classifying the query examples. In this paper, we point out two potential weaknesses of this approach. First, the sampled query examples may not provide sufficient supervision for the few-shot learner. Second, the effectiveness of meta-learning diminishes sharply with increasing shots (i.e., the number of training examples per class). To resolve these issues, we propose a novel objective to directly train the few-shot learner to perform like a strong classifier. Concretely, we associate each sampled few-shot task with a strong classifier, which is learned with ample labeled examples. The strong classifier has a better generalization ability and we use it to supervise the few-shot learner. We present an efficient way to construct the strong classifier, making our proposed objective an easily plug-and-play term to existing meta-learning based FSL methods. We validate our approach in combinations with many representative meta-learning methods. On several benchmark datasets including miniImageNet and tiredImageNet, our approach leads to a notable improvement across a variety of tasks. More importantly, with our approach, meta-learning based FSL methods can consistently outperform non-meta-learning based ones, even in a many-shot setting, greatly strengthening their applicability.
    Reducing the Variance of Gaussian Process Hyperparameter Optimization with Preconditioning. (arXiv:2107.00243v1 [cs.LG])
    (2 min) Gaussian processes remain popular as a flexible and expressive model class, but the computational cost of kernel hyperparameter optimization stands as a major limiting factor to their scaling and broader adoption. Recent work has made great strides combining stochastic estimation with iterative numerical techniques, essentially boiling down GP inference to the cost of (many) matrix-vector multiplies. Preconditioning -- a highly effective step for any iterative method involving matrix-vector multiplication -- can be used to accelerate convergence and thus reduce bias in hyperparameter optimization. Here, we prove that preconditioning has an additional benefit that has been previously unexplored. It not only reduces the bias of the $\log$-marginal likelihood estimator and its derivatives, but it also simultaneously can reduce variance at essentially negligible cost. We leverage this result to derive sample-efficient algorithms for GP hyperparameter optimization requiring as few as $\mathcal{O}(\log(\varepsilon^{-1}))$ instead of $\mathcal{O}(\varepsilon^{-2})$ samples to achieve error $\varepsilon$. Our theoretical results enable provably efficient and scalable optimization of kernel hyperparameters, which we validate empirically on a set of large-scale benchmark problems. There, variance reduction via preconditioning results in an order of magnitude speedup in hyperparameter optimization of exact GPs.
    Distributed Nonparametric Function Estimation: Optimal Rate of Convergence and Cost of Adaptation. (arXiv:2107.00179v1 [math.ST])
    (2 min) Distributed minimax estimation and distributed adaptive estimation under communication constraints for Gaussian sequence model and white noise model are studied. The minimax rate of convergence for distributed estimation over a given Besov class, which serves as a benchmark for the cost of adaptation, is established. We then quantify the exact communication cost for adaptation and construct an optimally adaptive procedure for distributed estimation over a range of Besov classes. The results demonstrate significant differences between nonparametric function estimation in the distributed setting and the conventional centralized setting. For global estimation, adaptation in general cannot be achieved for free in the distributed setting. The new technical tools to obtain the exact characterization for the cost of adaptation can be of independent interest.
    Multi-modal Graph Learning for Disease Prediction. (arXiv:2107.00206v1 [cs.LG])
    (2 min) Benefiting from the powerful expressive capability of graphs, graph-based approaches have achieved impressive performance in various biomedical applications. Most existing methods tend to define the adjacency matrix among samples manually based on meta-features, and then obtain the node embeddings for downstream tasks by Graph Representation Learning (GRL). However, it is not easy for these approaches to generalize to unseen samples. Meanwhile, the complex correlation between modalities is also ignored. As a result, these factors inevitably yield the inadequacy of providing valid information about the patient's condition for a reliable diagnosis. In this paper, we propose an end-to-end Multimodal Graph Learning framework (MMGL) for disease prediction. To effectively exploit the rich information across multi-modality associated with diseases, amodal-attentional multi-modal fusion is proposed to integrate the features of each modality by leveraging the correlation and complementarity between the modalities. Furthermore, instead of defining the adjacency matrix manually as existing methods, the latent graph structure can be captured through a novel way of adaptive graph learning. It could be jointly optimized with the prediction model, thus revealing the intrinsic connections among samples. Unlike the previous transductive methods, our model is also applicable to the scenario of inductive learning for those unseen data. An extensive group of experiments on two disease prediction problems is then carefully designed and presented, demonstrating that MMGL obtains more favorable performances. In addition, we also visualize and analyze the learned graph structure to provide more reliable decision support for doctors in real medical applications and inspiration for disease research.
    Improving black-box optimization in VAE latent space using decoder uncertainty. (arXiv:2107.00096v1 [cs.LG])
    (2 min) Optimization in the latent space of variational autoencoders is a promising approach to generate high-dimensional discrete objects that maximize an expensive black-box property (e.g., drug-likeness in molecular generation, function approximation with arithmetic expressions). However, existing methods lack robustness as they may decide to explore areas of the latent space for which no data was available during training and where the decoder can be unreliable, leading to the generation of unrealistic or invalid objects. We propose to leverage the epistemic uncertainty of the decoder to guide the optimization process. This is not trivial though, as a naive estimation of uncertainty in the high-dimensional and structured settings we consider would result in high estimator variance. To solve this problem, we introduce an importance sampling-based estimator that provides more robust estimates of epistemic uncertainty. Our uncertainty-guided optimization approach does not require modifications of the model architecture nor the training process. It produces samples with a better trade-off between black-box objective and validity of the generated samples, sometimes improving both simultaneously. We illustrate these advantages across several experimental settings in digit generation, arithmetic expression approximation and molecule generation for drug design.
    Revisiting Knowledge Distillation: An Inheritance and Exploration Framework. (arXiv:2107.00181v1 [cs.LG])
    (2 min) Knowledge Distillation (KD) is a popular technique to transfer knowledge from a teacher model or ensemble to a student model. Its success is generally attributed to the privileged information on similarities/consistency between the class distributions or intermediate feature representations of the teacher model and the student model. However, directly pushing the student model to mimic the probabilities/features of the teacher model to a large extent limits the student model in learning undiscovered knowledge/features. In this paper, we propose a novel inheritance and exploration knowledge distillation framework (IE-KD), in which a student model is split into two parts - inheritance and exploration. The inheritance part is learned with a similarity loss to transfer the existing learned knowledge from the teacher model to the student model, while the exploration part is encouraged to learn representations different from the inherited ones with a dis-similarity loss. Our IE-KD framework is generic and can be easily combined with existing distillation or mutual learning methods for training deep neural networks. Extensive experiments demonstrate that these two parts can jointly push the student model to learn more diversified and effective representations, and our IE-KD can be a general technique to improve the student network to achieve SOTA performance. Furthermore, by applying our IE-KD to the training of two networks, the performance of both can be improved w.r.t. deep mutual learning. The code and models of IE-KD will be make publicly available at https://github.com/yellowtownhz/IE-KD.
    Robust Generative Adversarial Imitation Learning via Local Lipschitzness. (arXiv:2107.00116v1 [cs.LG])
    (2 min) We explore methodologies to improve the robustness of generative adversarial imitation learning (GAIL) algorithms to observation noise. Towards this objective, we study the effect of local Lipschitzness of the discriminator and the generator on the robustness of policies learned by GAIL. In many robotics applications, the learned policies by GAIL typically suffer from a degraded performance at test time since the observations from the environment might be corrupted by noise. Hence, robustifying the learned policies against the observation noise is of critical importance. To this end, we propose a regularization method to induce local Lipschitzness in the generator and the discriminator of adversarial imitation learning methods. We show that the modified objective leads to learning significantly more robust policies. Moreover, we demonstrate -- both theoretically and experimentally -- that training a locally Lipschitz discriminator leads to a locally Lipschitz generator, thereby improving the robustness of the resultant policy. We perform extensive experiments on simulated robot locomotion environments from the MuJoCo suite that demonstrate the proposed method learns policies that significantly outperform the state-of-the-art generative adversarial imitation learning algorithm when applied to test scenarios with noise-corrupted observations.
    From DNNs to GANs: Review of efficient hardware architectures for deep learning. (arXiv:2107.00092v1 [cs.LG])
    (2 min) In recent times, the trend in very large scale integration (VLSI) industry is multi-dimensional, for example, reduction of energy consumption, occupancy of less space, precise result, less power dissipation, faster response. To meet these needs, the hardware architecture should be reliable and robust to these problems. Recently, neural network and deep learning has been started to impact the present research paradigm significantly which consists of parameters in the order of millions, nonlinear function for activation, convolutional operation for feature extraction, regression for classification, generative adversarial networks. These operations involve huge calculation and memory overhead. Presently available DSP processors are incapable of performing these operations and they mostly face the problems, for example, memory overhead, performance drop and compromised accuracy. Moreover, if a huge silicon area is powered to accelerate the operation using parallel computation, the ICs will be having significant chance of burning out due to the considerable generation of heat. Hence, novel dark silicon constraint is developed to reduce the heat dissipation without sacrificing the accuracy. Similarly, different algorithms have been adapted to design a DSP processor compatible for fast performance in neural network, activation function, convolutional neural network and generative adversarial network. In this review, we illustrate the recent developments in hardware for accelerating the efficient implementation of deep learning networks with enhanced performance. The techniques investigated in this review are expected to direct future research challenges of hardware optimization for high-performance computations.
    Mesh-based graph convolutional neural network models of processes with complex initial states. (arXiv:2107.00090v1 [cs.LG])
    (2 min) Predicting the evolution of a representative sample of a material with microstructure is a fundamental problem in homogenization. In this work we propose a graph convolutional neural network that utilizes the discretized representation of the initial microstructure directly, without segmentation or clustering. Compared to feature-based and pixel-based convolutional neural network models, the proposed method has a number of advantages: (a) it is deep in that it does not require featurization but can benefit from it, (b) it has a simple implementation with standard convolutional filters and layers, (c) it works natively on unstructured and structured grid data without interpolation (unlike pixel-based convolutional neural networks), and (d) it preserves rotational invariance like other graph-based convolutional neural networks. We demonstrate the performance of the proposed network and compare it to traditional pixel-based convolution neural network models and feature-based graph convolutional neural networks on three large datasets.

2021-07-01

  • cs.CL updates on arXiv.org

    Revisiting the Primacy of English in Zero-shot Cross-lingual Transfer. (arXiv:2106.16171v1 [cs.CL])
    (2 min) Despite their success, large pre-trained multilingual models have not completely alleviated the need for labeled data, which is cumbersome to collect for all target languages. Zero-shot cross-lingual transfer is emerging as a practical solution: pre-trained models later fine-tuned on one transfer language exhibit surprising performance when tested on many target languages. English is the dominant source language for transfer, as reinforced by popular zero-shot benchmarks. However, this default choice has not been systematically vetted. In our study, we compare English against other transfer languages for fine-tuning, on two pre-trained multilingual models (mBERT and mT5) and multiple classification and question answering tasks. We find that other high-resource languages such as German and Russian often transfer more effectively, especially when the set of target languages is diverse or unknown a priori. Unexpectedly, this can be true even when the training sets were automatically translated from English. This finding can have immediate impact on multilingual zero-shot systems, and should inform future benchmark designs.
    The Volctrans Neural Speech Translation System for IWSLT 2021. (arXiv:2105.07319v2 [cs.CL] UPDATED)
    (2 min) This paper describes the systems submitted to IWSLT 2021 by the Volctrans team. We participate in the offline speech translation and text-to-text simultaneous translation tracks. For offline speech translation, our best end-to-end model achieves 8.1 BLEU improvements over the benchmark on the MuST-C test set and is even approaching the results of a strong cascade solution. For text-to-text simultaneous translation, we explore the best practice to optimize the wait-k model. As a result, our final submitted systems exceed the benchmark at around 7 BLEU on the same latency regime. We will publish our code and model to facilitate both future research works and industrial applications. This paper describes the systems submitted to IWSLT 2021 by the Volctrans team. We participate in the offline speech translation and text-to-text simultaneous translation tracks. For offline speech translation, our best end-to-end model achieves 7.9 BLEU improvements over the benchmark on the MuST-C test set and is even approaching the results of a strong cascade solution. For text-to-text simultaneous translation, we explore the best practice to optimize the wait-k model. As a result, our final submitted systems exceed the benchmark at around 7 BLEU on the same latency regime. We release our code and model at \url{https://github.com/bytedance/neurst/tree/master/examples/iwslt21} to facilitate both future research works and industrial applications.
    Decoding Time Lexical Domain Adaptation for Neural Machine Translation. (arXiv:2101.00421v2 [cs.CL] UPDATED)
    (2 min) Machine translation systems are vulnerable to domain mismatch, especially when the task is low-resource. In this setting, out of domain translations are often of poor quality and prone to hallucinations, due to the translation model preferring to predict common words it has seen during training, as opposed to the more uncommon ones from a different domain. We present two simple methods for improving translation quality in this particular setting: First, we use lexical shortlisting in order to restrict the neural network predictions by IBM model computed alignments. Second, we perform $n$-best list reordering by reranking all translations based on the amount they overlap with each other. Our methods are computationally simpler and faster than alternative approaches, and show a moderate success on low-resource settings with explicit out of domain test sets. However, our methods lose their effectiveness when the domain mismatch is too great, or in high resource setting.
    AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. (arXiv:2006.09199v2 [cs.CV] UPDATED)
    (2 min) Current methods for learning visually grounded language from videos often rely on text annotation, such as human generated captions or machine generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the need for text annotation, we learn audio-visual representations from randomly segmented video clips and their raw audio waveforms. We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks, achieving state-of-the-art performance. We perform analysis of AVLnet's learned representations, showing our model utilizes speech and natural sounds to learn audio-visual concepts. Further, we propose a tri-modal model that jointly processes raw audio, video, and text captions from videos to learn a multi-modal semantic embedding space useful for text-video retrieval. Our code, data, and trained models will be released at avlnet.csail.mit.edu
    MultiSubs: A Large-scale Multimodal and Multilingual Dataset. (arXiv:2103.01910v2 [cs.CL] UPDATED)
    (2 min) This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language. The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles. The dataset is a valuable resource as (i) the images are aligned to text fragments rather than whole sentences; (ii) multiple images are possible for a text fragment and a sentence; (iii) the sentences are free-form and real-world like; (iv) the parallel texts are multilingual. We set up a fill-in-the-blank game for humans to evaluate the quality of the automatic image selection process of our dataset. We show the utility of the dataset on two automatic tasks: (i) fill-in-the blank; (ii) lexical translation. Results of the human evaluation and automatic models demonstrate that images can be a useful complement to the textual context. The dataset will benefit research on visual grounding of words especially in the context of free-form sentences, and can be obtained from https://doi.org/10.5281/zenodo.5034604 under a Creative Commons licence.
    SaRoCo: Detecting Satire in a Novel Romanian Corpus of News Articles. (arXiv:2105.06456v3 [cs.CL] UPDATED)
    (2 min) In this work, we introduce a corpus for satire detection in Romanian news. We gathered 55,608 public news articles from multiple real and satirical news sources, composing one of the largest corpora for satire detection regardless of language and the only one for the Romanian language. We provide an official split of the text samples, such that training news articles belong to different sources than test news articles, thus ensuring that models do not achieve high performance simply due to overfitting. We conduct experiments with two state-of-the-art deep neural models, resulting in a set of strong baselines for our novel corpus. Our results show that the machine-level accuracy for satire detection in Romanian is quite low (under 73% on the test set) compared to the human-level accuracy (87%), leaving enough room for improvement in future research.
    The MultiBERTs: BERT Reproductions for Robustness Analysis. (arXiv:2106.16163v1 [cs.CL])
    (2 min) Experiments with pretrained models such as BERT are often based on a single checkpoint. While the conclusions drawn apply to the artifact (i.e., the particular instance of the model), it is not always clear whether they hold for the more general procedure (which includes the model architecture, training data, initialization scheme, and loss function). Recent work has shown that re-running pretraining can lead to substantially different conclusions about performance, suggesting that alternative evaluations are needed to make principled statements about procedures. To address this question, we introduce MultiBERTs: a set of 25 BERT-base checkpoints, trained with similar hyper-parameters as the original BERT model but differing in random initialization and data shuffling. The aim is to enable researchers to draw robust and statistically justified conclusions about pretraining procedures. The full release includes 25 fully trained checkpoints, as well as statistical guidelines and a code library implementing our recommended hypothesis testing methods. Finally, for five of these models we release a set of 28 intermediate checkpoints in order to support research on learning dynamics.
    Early Risk Detection of Pathological Gambling, Self-Harm and Depression Using BERT. (arXiv:2106.16175v1 [cs.CL])
    (2 min) Early risk detection of mental illnesses has a massive positive impact upon the well-being of people. The eRisk workshop has been at the forefront of enabling interdisciplinary research in developing computational methods to automatically estimate early risk factors for mental issues such as depression, self-harm, anorexia and pathological gambling. In this paper, we present the contributions of the BLUE team in the 2021 edition of the workshop, in which we tackle the problems of early detection of gambling addiction, self-harm and estimating depression severity from social media posts. We employ pre-trained BERT transformers and data crawled automatically from mental health subreddits and obtain reasonable results on all three tasks.
    AutoLAW: Augmented Legal Reasoning through Legal Precedent Prediction. (arXiv:2106.16034v1 [cs.CL])
    (2 min) This paper demonstrate how NLP can be used to address an unmet need of the legal community and increase access to justice. The paper introduces Legal Precedent Prediction (LPP), the task of predicting relevant passages from precedential court decisions given the context of a legal argument. To this end, the paper showcases a BERT model, trained on 530,000 examples of legal arguments made by U.S. federal judges, to predict relevant passages from precedential court decisions given the context of a legal argument. In 96% of unseen test examples the correct target passage is among the top-10 predicted passages. The same model is able to predict relevant precedent given a short summary of a complex and unseen legal brief, predicting the precedent that was actually cited by the brief's co-author, former U.S. Solicitor General and current U.S. Supreme Court Justice Elena Kagan.
    A Comprehensive Assessment of Dialog Evaluation Metrics. (arXiv:2106.03706v2 [cs.CL] UPDATED)
    (2 min) Automatic evaluation metrics are a crucial component of dialog systems research. Standard language evaluation metrics are known to be ineffective for evaluating dialog. As such, recent research has proposed a number of novel, dialog-specific metrics that correlate better with human judgements. Due to the fast pace of research, many of these metrics have been assessed on different datasets and there has as yet been no time for a systematic comparison between them. To this end, this paper provides a comprehensive assessment of recently proposed dialog evaluation metrics on a number of datasets. In this paper, 17 different automatic evaluation metrics are evaluated on 10 different datasets. Furthermore, the metrics are assessed in different settings, to better qualify their respective strengths and weaknesses. Metrics are assessed (1) on both the turn level and the dialog level, (2) for different dialog lengths, (3) for different dialog qualities (e.g., coherence, engaging), (4) for different types of response generation models (i.e., generative, retrieval, simple models and state-of-the-art models), (5) taking into account the similarity of different metrics and (6) exploring combinations of different metrics. This comprehensive assessment offers several takeaways pertaining to dialog evaluation metrics in general. It also suggests how to best assess evaluation metrics and indicates promising directions for future work.
    Improving Factual Consistency of Abstractive Summarization on Customer Feedback. (arXiv:2106.16188v1 [cs.CL])
    (2 min) E-commerce stores collect customer feedback to let sellers learn about customer concerns and enhance customer order experience. Because customer feedback often contains redundant information, a concise summary of the feedback can be generated to help sellers better understand the issues causing customer dissatisfaction. Previous state-of-the-art abstractive text summarization models make two major types of factual errors when producing summaries from customer feedback, which are wrong entity detection (WED) and incorrect product-defect description (IPD). In this work, we introduce a set of methods to enhance the factual consistency of abstractive summarization on customer feedback. We augment the training data with artificially corrupted summaries, and use them as counterparts of the target summaries. We add a contrastive loss term into the training objective so that the model learns to avoid certain factual errors. Evaluation results show that a large portion of WED and IPD errors are alleviated for BART and T5. Furthermore, our approaches do not depend on the structure of the summarization model and thus are generalizable to any abstractive summarization systems.
    Whose Opinions Matter? Perspective-aware Models to Identify Opinions of Hate Speech Victims in Abusive Language Detection. (arXiv:2106.15896v1 [cs.CL])
    (2 min) Social media platforms provide users the freedom of expression and a medium to exchange information and express diverse opinions. Unfortunately, this has also resulted in the growth of abusive content with the purpose of discriminating people and targeting the most vulnerable communities such as immigrants, LGBT, Muslims, Jews and women. Because abusive language is subjective in nature, there might be highly polarizing topics or events involved in the annotation of abusive contents such as hate speech (HS). Therefore, we need novel approaches to model conflicting perspectives and opinions coming from people with different personal and demographic backgrounds. In this paper, we present an in-depth study to model polarized opinions coming from different communities under the hypothesis that similar characteristics (ethnicity, social background, culture etc.) can influence the perspectives of annotators on a certain phenomenon. We believe that by relying on this information, we can divide the annotators into groups sharing similar perspectives. We can create separate gold standards, one for each group, to train state-of-the-art deep learning models. We can employ an ensemble approach to combine the perspective-aware classifiers from different groups to an inclusive model. We also propose a novel resource, a multi-perspective English language dataset annotated according to different sub-categories relevant for characterising online abuse: hate speech, aggressiveness, offensiveness and stereotype. By training state-of-the-art deep learning models on this novel resource, we show how our approach improves the prediction performance of a state-of-the-art supervised classifier.
    Incorporating Domain Knowledge for Extractive Summarization of Legal Case Documents. (arXiv:2106.15876v1 [cs.CL])
    (2 min) Automatic summarization of legal case documents is an important and practical challenge. Apart from many domain-independent text summarization algorithms that can be used for this purpose, several algorithms have been developed specifically for summarizing legal case documents. However, most of the existing algorithms do not systematically incorporate domain knowledge that specifies what information should ideally be present in a legal case document summary. To address this gap, we propose an unsupervised summarization algorithm DELSumm which is designed to systematically incorporate guidelines from legal experts into an optimization setup. We conduct detailed experiments over case documents from the Indian Supreme Court. The experiments show that our proposed unsupervised method outperforms several strong baselines in terms of ROUGE scores, including both general summarization algorithms and legal-specific ones. In fact, though our proposed algorithm is unsupervised, it outperforms several supervised summarization models that are trained over thousands of document-summary pairs.
    News Article Retrieval in Context for Event-centric Narrative Creation. (arXiv:2106.16053v1 [cs.CL])
    (2 min) Writers such as journalists often use automatic tools to find relevant content to include in their narratives. In this paper, we focus on supporting writers in the news domain to develop event-centric narratives. Given an incomplete narrative that specifies a main event and a context, we aim to retrieve news articles that discuss relevant events that would enable the continuation of the narrative. We formally define this task and propose a retrieval dataset construction procedure that relies on existing news articles to simulate incomplete narratives and relevant articles. Experiments on two datasets derived from this procedure show that state-of-the-art lexical and semantic rankers are not sufficient for this task. We show that combining those with a ranker that ranks articles by reverse chronological order outperforms those rankers alone. We also perform an in-depth quantitative and qualitative analysis of the results that sheds light on the characteristics of this task.
    On the Power of Saturated Transformers: A View from Circuit Complexity. (arXiv:2106.16213v1 [cs.CL])
    (2 min) Transformers have become a standard architecture for many NLP problems. This has motivated theoretically analyzing their capabilities as models of language, in order to understand what makes them successful, and what their potential weaknesses might be. Recent work has shown that transformers with hard attention are quite limited in capacity, and in fact can be simulated by constant-depth circuits. However, hard attention is a restrictive assumption, which may complicate the relevance of these results for practical transformers. In this work, we analyze the circuit complexity of transformers with saturated attention: a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers. We show that saturated transformers transcend the limitations of hard-attention transformers. With some minor assumptions, we prove that the number of bits needed to represent a saturated transformer memory vector is $O(\log n)$, which implies saturated transformers can be simulated by log-depth circuits. Thus, the jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
    An Analysis of the Recent Visibility of the SigDial Conference. (arXiv:2106.16196v1 [cs.CL])
    (2 min) Automated speech and text interfaces are continuing to improve, resulting in increased research in the area of dialogue systems. Moreover, conferences and workshops from various fields are focusing more on language through speech and text mediums as candidates for interaction with applications such as search interfaces and robots. In this paper, we explore how visible the SigDial conference is to outside conferences by analysing papers from top Natural Langauge Processing conferences since 2015 to determine the popularity of certain SigDial-related topics, as well as analysing what SigDial papers are being cited by others outside of SigDial. We find that despite a dramatic increase in dialogue-related research, SigDial visibility has not increased. We conclude by offering some suggestions.
    A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers. (arXiv:2106.15772v1 [cs.AI])
    (2 min) We present ASDiv (Academia Sinica Diverse MWP Dataset), a diverse (in terms of both language patterns and problem types) English math word problem (MWP) corpus for evaluating the capability of various MWP solvers. Existing MWP corpora for studying AI progress remain limited either in language usage patterns or in problem types. We thus present a new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem types taught in elementary school. Each MWP is annotated with its problem type and grade level (for indicating the level of difficulty). Furthermore, we propose a metric to measure the lexicon usage diversity of a given MWP corpus, and demonstrate that ASDiv is more diverse than existing corpora. Experiments show that our proposed corpus reflects the true capability of MWP solvers more faithfully.
    Cross-lingual alignments of ELMo contextual embeddings. (arXiv:2106.15986v1 [cs.CL])
    (2 min) Building machine learning prediction models for a specific NLP task requires sufficient training data, which can be difficult to obtain for low-resource languages. Cross-lingual embeddings map word embeddings from a low-resource language to a high-resource language so that a prediction model trained on data from the high-resource language can also be used in the low-resource language. To produce cross-lingual mappings of recent contextual embeddings, anchor points between the embedding spaces have to be words in the same context. We address this issue with a new method for creating datasets for cross-lingual contextual alignments. Based on that, we propose novel cross-lingual mapping methods for ELMo embeddings. Our linear mapping methods use existing vecmap and MUSE alignments on contextual ELMo embeddings. Our new nonlinear ELMoGAN mapping method is based on GANs and does not assume isomorphic embedding spaces. We evaluate the proposed mapping methods on nine languages, using two downstream tasks, NER and dependency parsing. The ELMoGAN method performs well on the NER task, with low cross-lingual loss compared to direct training on some languages. In the dependency parsing, linear alignment variants are more successful.
    Mixed Cross Entropy Loss for Neural Machine Translation. (arXiv:2106.15880v1 [cs.CL])
    (2 min) In neural machine translation, cross entropy (CE) is the standard loss function in two training methods of auto-regressive models, i.e., teacher forcing and scheduled sampling. In this paper, we propose mixed cross entropy loss (mixed CE) as a substitute for CE in both training approaches. In teacher forcing, the model trained with CE regards the translation problem as a one-to-one mapping process, while in mixed CE this process can be relaxed to one-to-many. In scheduled sampling, we show that mixed CE has the potential to encourage the training and testing behaviours to be similar to each other, more effectively mitigating the exposure bias problem. We demonstrate the superiority of mixed CE over CE on several machine translation datasets, WMT'16 Ro-En, WMT'16 Ru-En, and WMT'14 En-De in both teacher forcing and scheduled sampling setups. Furthermore, in WMT'14 En-De, we also find mixed CE consistently outperforms CE on a multi-reference set as well as a challenging paraphrased reference set. We also found the model trained with mixed CE is able to provide a better probability distribution defined over the translation output space. Our code is available at https://github.com/haorannlp/mix.
    O2D2: Out-Of-Distribution Detector to Capture Undecidable Trials in Authorship Verification. (arXiv:2106.15825v1 [cs.CL])
    (2 min) The PAN 2021 authorship verification (AV) challenge is part of a three-year strategy, moving from a cross-topic/closed-set to a cross-topic/open-set AV task over a collection of fanfiction texts. In this work, we present our modified hybrid neural-probabilistic framework. It is based on our 2020 winning submission, with updates to significantly reduce sensitivities to topical variations and to further improve the system's calibration by means of an uncertainty-adaptation layer. Our framework additionally includes an Out-Of-Distribution Detector (O2D2) for defining non-responses, outperforming all other systems that participated in the PAN 2021 AV task.
    Automatically Select Emotion for Response via Personality-affected Emotion Transition. (arXiv:2106.15846v1 [cs.CL])
    (2 min) To provide consistent emotional interaction with users, dialog systems should be capable to automatically select appropriate emotions for responses like humans. However, most existing works focus on rendering specified emotions in responses or empathetically respond to the emotion of users, yet the individual difference in emotion expression is overlooked. This may lead to inconsistent emotional expressions and disinterest users. To tackle this issue, we propose to equip the dialog system with personality and enable it to automatically select emotions in responses by simulating the emotion transition of humans in conversation. In detail, the emotion of the dialog system is transitioned from its preceding emotion in context. The transition is triggered by the preceding dialog context and affected by the specified personality trait. To achieve this, we first model the emotion transition in the dialog system as the variation between the preceding emotion and the response emotion in the Valence-Arousal-Dominance (VAD) emotion space. Then, we design neural networks to encode the preceding dialog context and the specified personality traits to compose the variation. Finally, the emotion for response is selected from the sum of the preceding emotion and the variation. We construct a dialog dataset with emotion and personality labels and conduct emotion prediction tasks for evaluation. Experimental results validate the effectiveness of the personality-affected emotion transition.
    Zero-Shot Estimation of Base Models' Weights in Ensemble of Machine Reading Comprehension Systems for Robust Generalization. (arXiv:2106.16013v1 [cs.CL])
    (2 min) One of the main challenges of the machine reading comprehension (MRC) models is their fragile out-of-domain generalization, which makes these models not properly applicable to real-world general-purpose question answering problems. In this paper, we leverage a zero-shot weighted ensemble method for improving the robustness of out-of-domain generalization in MRC models. In the proposed method, a weight estimation module is used to estimate out-of-domain weights, and an ensemble module aggregate several base models' predictions based on their weights. The experiments indicate that the proposed method not only improves the final accuracy, but also is robust against domain changes.
    Language Modeling with Reduced Densities. (arXiv:2007.03834v3 [cs.CL] UPDATED)
    (2 min) This work originates from the observation that today's state of the art statistical language models are impressive not only for their performance, but also - and quite crucially - because they are built entirely from correlations in unstructured text data. The latter observation prompts a fundamental question that lies at the heart of this paper: What mathematical structure exists in unstructured text data? We put forth enriched category theory as a natural answer. We show that sequences of symbols from a finite alphabet, such as those found in a corpus of text, form a category enriched over probabilities. We then address a second fundamental question: How can this information be stored and modeled in a way that preserves the categorical structure? We answer this by constructing a functor from our enriched category of text to a particular enriched category of reduced density operators. The latter leverages the Loewner order on positive semidefinite operators, which can further be interpreted as a toy example of entailment.
    What Can Unsupervised Machine Translation Contribute to High-Resource Language Pairs?. (arXiv:2106.15818v1 [cs.CL])
    (2 min) Whereas existing literature on unsupervised machine translation (MT) focuses on exploiting unsupervised techniques for low-resource language pairs where bilingual training data is scare or unavailable, we investigate whether unsupervised MT can also improve translation quality of high-resource language pairs where sufficient bitext does exist. We compare the style of correct translations generated by either supervised or unsupervised MT and find that the unsupervised output is less monotonic and more natural than supervised output. We demonstrate a way to combine the benefits of unsupervised and supervised MT into a single system, resulting in better human evaluation of quality and fluency. Our results open the door to discussions about the potential contributions of unsupervised MT in high-resource settings, and how supervised and unsupervised systems might be mutually-beneficial.
    ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information. (arXiv:2106.16038v1 [cs.CL])
    (2 min) Recent pretraining models in Chinese neglect two important aspects specific to the Chinese language: glyph and pinyin, which carry significant syntax and semantic information for language understanding. In this work, we propose ChineseBERT, which incorporates both the {\it glyph} and {\it pinyin} information of Chinese characters into language model pretraining. The glyph embedding is obtained based on different fonts of a Chinese character, being able to capture character semantics from the visual features, and the pinyin embedding characterizes the pronunciation of Chinese characters, which handles the highly prevalent heteronym phenomenon in Chinese (the same character has different pronunciations with different meanings). Pretrained on large-scale unlabeled Chinese corpus, the proposed ChineseBERT model yields significant performance boost over baseline models with fewer training steps. The porpsoed model achieves new SOTA performances on a wide range of Chinese NLP tasks, including machine reading comprehension, natural language inference, text classification, sentence pair matching, and competitive performances in named entity recognition. Code and pretrained models are publicly available at https://github.com/ShannonAI/ChineseBert.
    IMS' Systems for the IWSLT 2021 Low-Resource Speech Translation Task. (arXiv:2106.16055v1 [cs.CL])
    (2 min) This paper describes the submission to the IWSLT 2021 Low-Resource Speech Translation Shared Task by IMS team. We utilize state-of-the-art models combined with several data augmentation, multi-task and transfer learning approaches for the automatic speech recognition (ASR) and machine translation (MT) steps of our cascaded system. Moreover, we also explore the feasibility of a full end-to-end speech translation (ST) model in the case of very constrained amount of ground truth labeled data. Our best system achieves the best performance among all submitted systems for Congolese Swahili to English and French with BLEU scores 7.7 and 13.7 respectively, and the second best result for Coastal Swahili to English with BLEU score 14.9.
    End-to-End Spoken Language Understanding using RNN-Transducer ASR. (arXiv:2106.15919v1 [cs.CL])
    (2 min) We propose an end-to-end trained spoken language understanding (SLU) system that extracts transcripts, intents and slots from an input speech utterance. It consists of a streaming recurrent neural network transducer (RNNT) based automatic speech recognition (ASR) model connected to a neural natural language understanding (NLU) model through a neural interface. This interface allows for end-to-end training using multi-task RNNT and NLU losses. Additionally, we introduce semantic sequence loss training for the joint RNNT-NLU system that allows direct optimization of non-differentiable SLU metrics. This end-to-end SLU model paradigm can leverage state-of-the-art advancements and pretrained models in both ASR and NLU research communities, outperforming recently proposed direct speech-to-semantics models, and conventional pipelined ASR and NLU systems. We show that this method improves both ASR and NLU metrics on both public SLU datasets and large proprietary datasets.
    Alzheimer's Dementia Recognition Using Acoustic, Lexical, Disfluency and Speech Pause Features Robust to Noisy Inputs. (arXiv:2106.15684v1 [cs.CL])
    (2 min) We present two multimodal fusion-based deep learning models that consume ASR transcribed speech and acoustic data simultaneously to classify whether a speaker in a structured diagnostic task has Alzheimer's Disease and to what degree, evaluating the ADReSSo challenge 2021 data. Our best model, a BiLSTM with highway layers using words, word probabilities, disfluency features, pause information, and a variety of acoustic features, achieves an accuracy of 84% and RSME error prediction of 4.26 on MMSE cognitive scores. While predicting cognitive decline is more challenging, our models show improvement using the multimodal approach and word probabilities, disfluency and pause information over word-only models. We show considerable gains for AD classification using multimodal fusion and gating, which can effectively deal with noisy inputs from acoustic features and ASR hypotheses.
    Genre determining prediction: Non-standard TAM marking in football language. (arXiv:2106.15872v1 [cs.CL])
    (2 min) German and French football language display tense-aspect-mood (TAM) forms which differ from the TAM use in other genres. In German football talk, the present indicative may replace the pluperfect subjunctive. In French reports of football matches, the imperfective past may occur instead of a perfective past tense-aspect form. We argue that the two phenomena share a functional core and are licensed in the same way, which is a direct result of the genre they occur in. More precisely, football match reports adhere to a precise script and specific events are temporally determined in terms of objective time. This allows speakers to exploit a secondary function of TAM forms, namely, they shift the temporal perspective. We argue that it is on the grounds of the genre that comprehenders predict the deviating forms and are also able to decode them. We present various corpus studies where we explore the functioning of these phenomena in order to gain insights into their distribution, grammaticalization and their functioning in discourse. Relevant factors are Aktionsart properties, rhetorical relations and their interaction with other TAM forms. This allows us to discuss coping mechanisms on the part of the comprehender. We broaden our understanding of the phenomena, which have only been partly covered for French and up to now seem to have been ignored in German.
    XLM-E: Cross-lingual Language Model Pre-training via ELECTRA. (arXiv:2106.16138v1 [cs.CL])
    (2 min) In this paper, we introduce ELECTRA-style tasks to cross-lingual language model pre-training. Specifically, we present two pre-training tasks, namely multilingual replaced token detection, and translation replaced token detection. Besides, we pretrain the model, named as XLM-E, on both multilingual and parallel corpora. Our model outperforms the baseline models on various cross-lingual understanding tasks with much less computation cost. Moreover, analysis shows that XLM-E tends to obtain better cross-lingual transferability.
    A Conditional Splitting Framework for Efficient Constituency Parsing. (arXiv:2106.15760v1 [cs.CL])
    (2 min) We introduce a generic seq2seq parsing framework that casts constituency parsing problems (syntactic and discourse parsing) into a series of conditional splitting decisions. Our parsing model estimates the conditional probability distribution of possible splitting points in a given text span and supports efficient top-down decoding, which is linear in number of nodes. The conditional splitting formulation together with efficient beam search inference facilitate structural consistency without relying on expensive structured inference. Crucially, for discourse analysis we show that in our formulation, discourse segmentation can be framed as a special case of parsing which allows us to perform discourse parsing without requiring segmentation as a pre-requisite. Experiments show that our model achieves good results on the standard syntactic parsing tasks under settings with/without pre-trained representations and rivals state-of-the-art (SoTA) methods that are more computationally expensive than ours. In discourse parsing, our method outperforms SoTA by a good margin.
    Learning to Ask Conversational Questions by Optimizing Levenshtein Distance. (arXiv:2106.15903v1 [cs.CL])
    (2 min) Conversational Question Simplification (CQS) aims to simplify self-contained questions into conversational ones by incorporating some conversational characteristics, e.g., anaphora and ellipsis. Existing maximum likelihood estimation (MLE) based methods often get trapped in easily learned tokens as all tokens are treated equally during training. In this work, we introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance (MLD) through explicit editing actions. RISE is able to pay attention to tokens that are related to conversational characteristics. To train RISE, we devise an Iterative Reinforce Training (IRT) algorithm with a Dynamic Programming based Sampling (DPS) process to improve exploration. Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods and generalizes well on unseen data.
    HySPA: Hybrid Span Generation for Scalable Text-to-Graph Extraction. (arXiv:2106.15838v1 [cs.CL])
    (2 min) Text-to-Graph extraction aims to automatically extract information graphs consisting of mentions and types from natural language texts. Existing approaches, such as table filling and pairwise scoring, have shown impressive performance on various information extraction tasks, but they are difficult to scale to datasets with longer input texts because of their second-order space/time complexities with respect to the input length. In this work, we propose a Hybrid Span Generator (HySPA) that invertibly maps the information graph to an alternating sequence of nodes and edge types, and directly generates such sequences via a hybrid span decoder which can decode both the spans and the types recurrently in linear time and space complexities. Extensive experiments on the ACE05 dataset show that our approach also significantly outperforms state-of-the-art on the joint entity and relation extraction task.
    SAT Based Analogy Evaluation Framework for Persian Word Embeddings. (arXiv:2106.15674v1 [cs.CL])
    (2 min) In recent years there has been a special interest in word embeddings as a new approach to convert words to vectors. It has been a focal point to understand how much of the semantics of the the words has been transferred into embedding vectors. This is important as the embedding is going to be used as the basis for downstream NLP applications and it will be costly to evaluate the application end-to-end in order to identify quality of the used embedding model. Generally the word embeddings are evaluated through a number of tests, including analogy test. In this paper we propose a test framework for Persian embedding models. Persian is a low resource language and there is no rich semantic benchmark to evaluate word embedding models for this language. In this paper we introduce an evaluation framework including a hand crafted Persian SAT based analogy dataset, a colliquial test set (specific to Persian) and a benchmark to study the impact of various parameters on the semantic evaluation task.
    Evaluation of Thematic Coherence in Microblogs. (arXiv:2106.15971v1 [cs.CL])
    (2 min) Collecting together microblogs representing opinions about the same topics within the same timeframe is useful to a number of different tasks and practitioners. A major question is how to evaluate the quality of such thematic clusters. Here we create a corpus of microblog clusters from three different domains and time windows and define the task of evaluating thematic coherence. We provide annotation guidelines and human annotations of thematic coherence by journalist experts. We subsequently investigate the efficacy of different automated evaluation metrics for the task. We consider a range of metrics including surface level metrics, ones for topic model coherence and text generation metrics (TGMs). While surface level metrics perform well, outperforming topic coherence metrics, they are not as consistent as TGMs. TGMs are more reliable than all other metrics considered for capturing thematic coherence in microblog clusters due to being less sensitive to the effect of time windows.
  • cs.CV updates on arXiv.org

    Leveraging Hidden Structure in Self-Supervised Learning. (arXiv:2106.16060v1 [cs.LG])
    (2 min) This work considers the problem of learning structured representations from raw images using self-supervised learning. We propose a principled framework based on a mutual information objective, which integrates self-supervised and structure learning. Furthermore, we devise a post-hoc procedure to interpret the meaning of the learnt representations. Preliminary experiments on CIFAR-10 show that the proposed framework achieves higher generalization performance in downstream classification tasks and provides more interpretable representations compared to the ones learnt through traditional self-supervised learning.
    Constrained Contrastive Distribution Learning for Unsupervised Anomaly Detection and Localisation in Medical Images. (arXiv:2103.03423v2 [cs.CV] UPDATED)
    (2 min) Unsupervised anomaly detection (UAD) learns one-class classifiers exclusively with normal (i.e., healthy) images to detect any abnormal (i.e., unhealthy) samples that do not conform to the expected normal patterns. UAD has two main advantages over its fully supervised counterpart. Firstly, it is able to directly leverage large datasets available from health screening programs that contain mostly normal image samples, avoiding the costly manual labelling of abnormal samples and the subsequent issues involved in training with extremely class-imbalanced data. Further, UAD approaches can potentially detect and localise any type of lesions that deviate from the normal patterns. One significant challenge faced by UAD methods is how to learn effective low-dimensional image representations to detect and localise subtle abnormalities, generally consisting of small lesions. To address this challenge, we propose a novel self-supervised representation learning method, called Constrained Contrastive Distribution learning for anomaly detection (CCD), which learns fine-grained feature representations by simultaneously predicting the distribution of augmented data and image contexts using contrastive learning with pretext constraints. The learned representations can be leveraged to train more anomaly-sensitive detection models. Extensive experiment results show that our method outperforms current state-of-the-art UAD approaches on three different colonoscopy and fundus screening datasets. Our code is available at https://github.com/tianyu0207/CCD.
    10-mega pixel snapshot compressive imaging with a hybrid coded aperture. (arXiv:2106.15765v1 [eess.IV])
    (2 min) High resolution images are widely used in our daily life, whereas high-speed video capture is challenging due to the low frame rate of cameras working at the high resolution mode. Digging deeper, the main bottleneck lies in the low throughput of existing imaging systems. Towards this end, snapshot compressive imaging (SCI) was proposed as a promising solution to improve the throughput of imaging systems by compressive sampling and computational reconstruction. During acquisition, multiple high-speed images are encoded and collapsed to a single measurement. After this, algorithms are employed to retrieve the video frames from the coded snapshot. Recently developed Plug-and-Play (PnP) algorithms make it possible for SCI reconstruction in large-scale problems. However, the lack of high-resolution encoding systems still precludes SCI's wide application. In this paper, we build a novel hybrid coded aperture snapshot compressive imaging (HCA-SCI) system by incorporating a dynamic liquid crystal on silicon and a high-resolution lithography mask. We further implement a PnP reconstruction algorithm with cascaded denoisers for high quality reconstruction. Based on the proposed HCA-SCI system and algorithm, we achieve a 10-mega pixel SCI system to capture high-speed scenes, leading to a high throughput of 4.6G voxels per second. Both simulation and real data experiments verify the feasibility and performance of our proposed HCA-SCI scheme.
    Machine Learning-based Lie Detector applied to a Novel Annotated Game Dataset. (arXiv:2104.12345v2 [cs.CV] UPDATED)
    (2 min) Lie detection is considered a concern for everyone in their day to day life given its impact on human interactions. Thus, people normally pay attention to both what their interlocutors are saying and also to their visual appearances, including faces, to try to find any signs that indicate whether the person is telling the truth or not. While automatic lie detection may help us to understand this lying characteristics, current systems are still fairly limited, partly due to lack of adequate datasets to evaluate their performance in realistic scenarios. In this work, we have collected an annotated dataset of facial images, comprising both 2D and 3D information of several participants during a card game that encourages players to lie. Using our collected dataset, We evaluated several types of machine learning-based lie detectors in terms of their generalization, person-specific and cross-domain experiments. Our results show that models based on deep learning achieve the best accuracy, reaching up to 57\% for the generalization task and 63\% when dealing with a single participant. Finally, we also highlight the limitation of the deep learning based lie detector when dealing with cross-domain lie detection tasks.
    BLNet: A Fast Deep Learning Framework for Low-Light Image Enhancement with Noise Removal and Color Restoration. (arXiv:2106.15953v1 [eess.IV])
    (3 min) Images obtained in real-world low-light conditions are not only low in brightness, but they also suffer from many other types of degradation, such as color bias, unknown noise, detail loss and halo artifacts. In this paper, we propose a very fast deep learning framework called Bringing the Lightness (denoted as BLNet) that consists of two U-Nets with a series of well-designed loss functions to tackle all of the above degradations. Based on Retinex Theory, the decomposition net in our model can decompose low-light images into reflectance and illumination and remove noise in the reflectance during the decomposition phase. We propose a Noise and Color Bias Control module (NCBC Module) that contains a convolutional neural network and two loss functions (noise loss and color loss). This module is only used to calculate the loss functions during the training phase, so our method is very fast during the test phase. This module can smooth the reflectance to achieve the purpose of noise removal while preserving details and edge information and controlling color bias. We propose a network that can be trained to learn the mapping between low-light and normal-light illumination and enhance the brightness of images taken in low-light illumination. We train and evaluate the performance of our proposed model over the real-world Low-Light (LOL) dataset), and we also test our model over several other frequently used datasets (LIME, DICM and MEF datasets). We conduct extensive experiments to demonstrate that our approach achieves a promising effect with good rubustness and generalization and outperforms many other state-of-the-art methods qualitatively and quantitatively. Our method achieves high speed because we use loss functions instead of introducing additional denoisers for noise removal and color correction. The code and model are available at https://github.com/weixinxu666/BLNet.
    A Survey on Adversarial Image Synthesis. (arXiv:2106.16056v1 [cs.CV])
    (2 min) Generative Adversarial Networks (GANs) have been extremely successful in various application domains. Adversarial image synthesis has drawn increasing attention and made tremendous progress in recent years because of its wide range of applications in many computer vision and image processing problems. Among the many applications of GAN, image synthesis is the most well-studied one, and research in this area has already demonstrated the great potential of using GAN in image synthesis. In this paper, we provide a taxonomy of methods used in image synthesis, review different models for text-to-image synthesis and image-to-image translation, and discuss some evaluation metrics as well as possible future research directions in image synthesis with GAN.
    Looking Outside the Window: Wider-Context Transformer for the Semantic Segmentation of High-Resolution Remote Sensing Images. (arXiv:2106.15754v1 [cs.CV])
    (2 min) Long-range context information is crucial for the semantic segmentation of High-Resolution (HR) Remote Sensing Images (RSIs). The image cropping operations, commonly used for training neural networks, limit the perception of long-range context information in large RSIs. To break this limitation, we propose a Wider-Context Network (WiCNet) for the semantic segmentation of HR RSIs. In the WiCNet, apart from a conventional feature extraction network to aggregate the local information, an extra context branch is designed to explicitly model the context information in a larger image area. The information between the two branches is communicated through a Context Transformer, which is a novel design derived from the Vision Transformer to model the long-range context correlations. Ablation studies and comparative experiments conducted on several benchmark datasets prove the effectiveness of the proposed method. Additionally, we present a new Beijing Land-Use (BLU) dataset. This is a large-scale HR satellite dataset provided with high-quality and fine-grained reference labels, which we hope will boost future studies in this field.
    Transductive Zero-Shot Hashing for Multilabel Image Retrieval. (arXiv:1911.07192v2 [cs.CV] UPDATED)
    (2 min) Hash coding has been widely used in approximate nearest neighbor search for large-scale image retrieval. Given semantic annotations such as class labels and pairwise similarities of the training data, hashing methods can learn and generate effective and compact binary codes. While some newly introduced images may contain undefined semantic labels, which we call unseen images, zeor-shot hashing techniques have been studied. However, existing zeor-shot hashing methods focus on the retrieval of single-label images, and cannot handle multi-label images. In this paper, for the first time, a novel transductive zero-shot hashing method is proposed for multi-label unseen image retrieval. In order to predict the labels of the unseen/target data, a visual-semantic bridge is built via instance-concept coherence ranking on the seen/source data. Then, pairwise similarity loss and focal quantization loss are constructed for training a hashing model using both the seen/source and unseen/target data. Extensive evaluations on three popular multi-label datasets demonstrate that, the proposed hashing method achieves significantly better results than the competing methods.
    Affective Image Content Analysis: Two Decades Review and New Perspectives. (arXiv:2106.16125v1 [cs.CV])
    (2 min) Images can convey rich semantics and induce various emotions in viewers. Recently, with the rapid advancement of emotional intelligence and the explosive growth of visual data, extensive research efforts have been dedicated to affective image content analysis (AICA). In this survey, we will comprehensively review the development of AICA in the recent two decades, especially focusing on the state-of-the-art methods with respect to three main challenges -- the affective gap, perception subjectivity, and label noise and absence. We begin with an introduction to the key emotion representation models that have been widely employed in AICA and description of available datasets for performing evaluation with quantitative comparison of label noise and dataset bias. We then summarize and compare the representative approaches on (1) emotion feature extraction, including both handcrafted and deep features, (2) learning methods on dominant emotion recognition, personalized emotion prediction, emotion distribution learning, and learning from noisy data or few labels, and (3) AICA based applications. Finally, we discuss some challenges and promising research directions in the future, such as image content and context understanding, group emotion clustering, and viewer-image interaction.
    Dual Reweighting Domain Generalization for Face Presentation Attack Detection. (arXiv:2106.16128v1 [cs.CV])
    (2 min) Face anti-spoofing approaches based on domain generalization (DG) have drawn growing attention due to their robustness for unseen scenarios. Previous methods treat each sample from multiple domains indiscriminately during the training process, and endeavor to extract a common feature space to improve the generalization. However, due to complex and biased data distribution, directly treating them equally will corrupt the generalization ability. To settle the issue, we propose a novel Dual Reweighting Domain Generalization (DRDG) framework which iteratively reweights the relative importance between samples to further improve the generalization. Concretely, Sample Reweighting Module is first proposed to identify samples with relatively large domain bias, and reduce their impact on the overall optimization. Afterwards, Feature Reweighting Module is introduced to focus on these samples and extract more domain-irrelevant features via a self-distilling mechanism. Combined with the domain discriminator, the iteration of the two modules promotes the extraction of generalized features. Extensive experiments and visualizations are presented to demonstrate the effectiveness and interpretability of our method against the state-of-the-art competitors.
    Resolution learning in deep convolutional networks using scale-space theory. (arXiv:2106.03412v2 [cs.CV] UPDATED)
    (2 min) Resolution in deep convolutional neural networks (CNNs) is typically bounded by the receptive field size through filter sizes, and subsampling layers or strided convolutions on feature maps. The optimal resolution may vary significantly depending on the dataset. Modern CNNs hard-code their resolution hyper-parameters in the network architecture which makes tuning such hyper-parameters cumbersome. We propose to do away with hard-coded resolution hyper-parameters and aim to learn the appropriate resolution from data. We use scale-space theory to obtain a self-similar parametrization of filters and make use of the N-Jet: a truncated Taylor series to approximate a filter by a learned combination of Gaussian derivative filters. The parameter sigma of the Gaussian basis controls both the amount of detail the filter encodes and the spatial extent of the filter. Since sigma is a continuous parameter, we can optimize it with respect to the loss. The proposed N-Jet layer achieves comparable performance when used in state-of-the art architectures, while learning the correct resolution in each layer automatically. We evaluate our N-Jet layer on both classification and segmentation, and we show that learning sigma is especially beneficial for inputs at multiple sizes.
    Long-Short Temporal Modeling for Efficient Action Recognition. (arXiv:2106.15787v1 [cs.CV])
    (2 min) Efficient long-short temporal modeling is key for enhancing the performance of action recognition task. In this paper, we propose a new two-stream action recognition network, termed as MENet, consisting of a Motion Enhancement (ME) module and a Video-level Aggregation (VLA) module to achieve long-short temporal modeling. Specifically, motion representations have been proved effective in capturing short-term and high-frequency action. However, current motion representations are calculated from adjacent frames, which may have poor interpretation and bring useless information (noisy or blank). Thus, for short-term motions, we design an efficient ME module to enhance the short-term motions by mingling the motion saliency among neighboring segments. As for long-term aggregations, VLA is adopted at the top of the appearance branch to integrate the long-term dependencies across all segments. The two components of MENet are complementary in temporal modeling. Extensive experiments are conducted on UCF101 and HMDB51 benchmarks, which verify the effectiveness and efficiency of our proposed MENet.
    Improving the Efficiency of Transformers for Resource-Constrained Devices. (arXiv:2106.16006v1 [cs.LG])
    (2 min) Transformers provide promising accuracy and have become popular and used in various domains such as natural language processing and computer vision. However, due to their massive number of model parameters, memory and computation requirements, they are not suitable for resource-constrained low-power devices. Even with high-performance and specialized devices, the memory bandwidth can become a performance-limiting bottleneck. In this paper, we present a performance analysis of state-of-the-art vision transformers on several devices. We propose to reduce the overall memory footprint and memory transfers by clustering the model parameters. We show that by using only 64 clusters to represent model parameters, it is possible to reduce the data transfer from the main memory by more than 4x, achieve up to 22% speedup and 39% energy savings on mobile devices with less than 0.1% accuracy loss.
    Recent Advances in Fibrosis and Scar Segmentation from Cardiac MRI: A State-of-the-Art Review and Future Perspectives. (arXiv:2106.15707v1 [eess.IV])
    (2 min) Segmentation of cardiac fibrosis and scar are essential for clinical diagnosis and can provide invaluable guidance for the treatment of cardiac diseases. Late Gadolinium enhancement (LGE) cardiovascular magnetic resonance (CMR) has been successful for its efficacy in guiding the clinical diagnosis and treatment reliably. For LGE CMR, many methods have demonstrated success in accurately segmenting scarring regions. Co-registration with other non-contrast-agent (non-CA) modalities, balanced steady-state free precession (bSSFP) and cine magnetic resonance imaging (MRI) for example, can further enhance the efficacy of automated segmentation of cardiac anatomies. Many conventional methods have been proposed to provide automated or semi-automated segmentation of scars. With the development of deep learning in recent years, we can also see more advanced methods that are more efficient in providing more accurate segmentations. This paper conducts a state-of-the-art review of conventional and current state-of-the-art approaches utilising different modalities for accurate cardiac fibrosis and scar segmentation.
    Multi-Source domain adaptation via supervised contrastive learning and confident consistency regularization. (arXiv:2106.16093v1 [cs.CV])
    (2 min) Multi-Source Unsupervised Domain Adaptation (multi-source UDA) aims to learn a model from several labeled source domains while performing well on a different target domain where only unlabeled data are available at training time. To align source and target features distributions, several recent works use source and target explicit statistics matching such as features moments or class centroids. Yet, these approaches do not guarantee class conditional distributions alignment across domains. In this work, we propose a new framework called Contrastive Multi-Source Domain Adaptation (CMSDA) for multi-source UDA that addresses this limitation. Discriminative features are learned from interpolated source examples via cross entropy minimization and from target examples via consistency regularization and hard pseudo-labeling. Simultaneously, interpolated source examples are leveraged to align source class conditional distributions through an interpolated version of the supervised contrastive loss. This alignment leads to more general and transferable features which further improve the generalization on the target domain. Extensive experiments have been carried out on three standard multi-source UDA datasets where our method reports state-of-the-art results.
    Interventional Assays for the Latent Space of Autoencoders. (arXiv:2106.16091v1 [cs.LG])
    (2 min) The encoders and decoders of autoencoders effectively project the input onto learned manifolds in the latent space and data space respectively. We propose a framework, called latent responses, for probing the learned data manifold using interventions in the latent space. Using this framework, we investigate "holes" in the representation to quantitatively ascertain to what extent the latent space of a trained VAE is consistent with the chosen prior. Furthermore, we use the identified structure to improve interpolation between latent vectors. We evaluate how our analyses improve the quality of the generated samples using the VAE on a variety of benchmark datasets.
    Learning More for Free - A Multi Task Learning Approach for Improved Pathology Classification in Capsule Endoscopy. (arXiv:2106.16162v1 [cs.CV])
    (2 min) The progress in Computer Aided Diagnosis (CADx) of Wireless Capsule Endoscopy (WCE) is thwarted by the lack of data. The inadequacy in richly representative healthy and abnormal conditions results in isolated analyses of pathologies, that can not handle realistic multi-pathology scenarios. In this work, we explore how to learn more for free, from limited data through solving a WCE multicentric, multi-pathology classification problem. Learning more implies to learning more than full supervision would allow with the same data. This is done by combining self supervision with full supervision, under multi task learning. Additionally, we draw inspiration from the Human Visual System (HVS) in designing self supervision tasks and investigate if seemingly ineffectual signals within the data itself can be exploited to gain performance, if so, which signals would be better than others. Further, we present our analysis of the high level features as a stepping stone towards more robust multi-pathology CADx in WCE.
    Recognizing Facial Expressions in the Wild using Multi-Architectural Representations based Ensemble Learning with Distillation. (arXiv:2106.16126v1 [cs.CV])
    (2 min) Facial expressions are the most universal forms of body language and automatic facial expression recognition is one of the challenging tasks due to different uncertainties. However, it has been an active field of research for many years. Nevertheless, efficiency and performance are yet essential aspects for building robust systems. We proposed two models, EmoXNet which is an ensemble learning technique for learning convoluted facial representations, and EmoXNetLite which is a distillation technique that is useful for transferring the knowledge from our ensemble model to an efficient deep neural network using label-smoothen soft labels for able to effectively detect expressions in real-time. Both of the techniques performed quite well, where the ensemble model (EmoXNet) helped to achieve 85.07% test accuracy on FER2013 with FER+ annotations and 86.25% test accuracy on RAF-DB. Moreover, the distilled model (EmoXNetLite) showed 82.07% test accuracy on FER2013 with FER+ annotations and 81.78% test accuracy on RAF-DB.
    CodeVIO: Visual-Inertial Odometry with Learned Optimizable Dense Depth. (arXiv:2012.10133v2 [cs.CV] UPDATED)
    (2 min) In this work, we present a lightweight, tightly-coupled deep depth network and visual-inertial odometry (VIO) system, which can provide accurate state estimates and dense depth maps of the immediate surroundings. Leveraging the proposed lightweight Conditional Variational Autoencoder (CVAE) for depth inference and encoding, we provide the network with previously marginalized sparse features from VIO to increase the accuracy of initial depth prediction and generalization capability. The compact encoded depth maps are then updated jointly with navigation states in a sliding window estimator in order to provide the dense local scene geometry. We additionally propose a novel method to obtain the CVAE's Jacobian which is shown to be more than an order of magnitude faster than previous works, and we additionally leverage First-Estimate Jacobian (FEJ) to avoid recalculation. As opposed to previous works relying on completely dense residuals, we propose to only provide sparse measurements to update the depth code and show through careful experimentation that our choice of sparse measurements and FEJs can still significantly improve the estimated depth maps. Our full system also exhibits state-of-the-art pose estimation accuracy, and we show that it can run in real-time with single-thread execution while utilizing GPU acceleration only for the network and code Jacobian.
    Distill on the Go: Online knowledge distillation in self-supervised learning. (arXiv:2104.09866v2 [cs.CV] UPDATED)
    (2 min) Self-supervised learning solves pretext prediction tasks that do not require annotations to learn feature representations. For vision tasks, pretext tasks such as predicting rotation, solving jigsaw are solely created from the input data. Yet, predicting this known information helps in learning representations useful for downstream tasks. However, recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models. To address the issue of self-supervised pre-training of smaller models, we propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm using single-stage online knowledge distillation to improve the representation quality of the smaller models. We employ deep mutual learning strategy in which two models collaboratively learn from each other to improve one another. Specifically, each model is trained using self-supervised learning along with distillation that aligns each model's softmax probabilities of similarity scores with that of the peer model. We conduct extensive experiments on multiple benchmark datasets, learning objectives, and architectures to demonstrate the potential of our proposed method. Our results show significant performance gain in the presence of noisy and limited labels and generalization to out-of-distribution data.
    Dual-stream Network for Visual Recognition. (arXiv:2105.14734v2 [cs.CV] UPDATED)
    (2 min) Transformers with remarkable global representation capacities achieve competitive results for visual tasks, but fail to consider high-level local pattern information in input images. In this paper, we present a generic Dual-stream Network (DS-Net) to fully explore the representation capacity of local and global pattern features for image classification. Our DS-Net can simultaneously calculate fine-grained and integrated features and efficiently fuse them. Specifically, we propose an Intra-scale Propagation module to process two different resolutions in each block and an Inter-Scale Alignment module to perform information interaction across features at dual scales. Besides, we also design a Dual-stream FPN (DS-FPN) to further enhance contextual information for downstream dense predictions. Without bells and whistles, the propsed DS-Net outperforms Deit-Small by 2.4% in terms of top-1 accuracy on ImageNet-1k and achieves state-of-the-art performance over other Vision Transformers and ResNets. For object detection and instance segmentation, DS-Net-Small respectively outperforms ResNet-50 by 6.4% and 5.5 % in terms of mAP on MSCOCO 2017, and surpasses the previous state-of-the-art scheme, which significantly demonstrates its potential to be a general backbone in vision tasks. The code will be released soon.
    AVLnet: Learning Audio-Visual Language Representations from Instructional Videos. (arXiv:2006.09199v2 [cs.CV] UPDATED)
    (2 min) Current methods for learning visually grounded language from videos often rely on text annotation, such as human generated captions or machine generated automatic speech recognition (ASR) transcripts. In this work, we introduce the Audio-Video Language Network (AVLnet), a self-supervised network that learns a shared audio-visual embedding space directly from raw video inputs. To circumvent the need for text annotation, we learn audio-visual representations from randomly segmented video clips and their raw audio waveforms. We train AVLnet on HowTo100M, a large corpus of publicly available instructional videos, and evaluate on image retrieval and video retrieval tasks, achieving state-of-the-art performance. We perform analysis of AVLnet's learned representations, showing our model utilizes speech and natural sounds to learn audio-visual concepts. Further, we propose a tri-modal model that jointly processes raw audio, video, and text captions from videos to learn a multi-modal semantic embedding space useful for text-video retrieval. Our code, data, and trained models will be released at avlnet.csail.mit.edu
    Escaping the Big Data Paradigm with Compact Transformers. (arXiv:2104.05704v2 [cs.CV] UPDATED)
    (2 min) With the rise of Transformers as the standard for language processing, and their advancements in computer vision, along with their unprecedented size and amounts of training data, many have come to believe that they are not suitable for small sets of data. This trend leads to great concerns, including but not limited to: limited availability of data in certain scientific domains and the exclusion of those with limited resource from research in the field. In this paper, we dispel the myth that transformers are "data hungry" and therefore can only be applied to large sets of data. We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets. Our model eliminates the requirement for class token and positional embeddings through a novel sequence pooling strategy and the use of convolutions. We show that compared to CNNs, our compact transformers have fewer parameters and MACs, while obtaining similar accuracies. Our method is flexible in terms of model size, and can have as little as 0.28M parameters and achieve reasonable results. It can reach an accuracy of 95.29 % when training from scratch on CIFAR-10, which is comparable with modern CNN based approaches, and a significant improvement over previous Transformer based models. Our simple and compact design democratizes transformers by making them accessible to those equipped with basic computing resources and/or dealing with important small datasets. Our method works on larger datasets, such as ImageNet (80.28% accuracy with 29% parameters of ViT), and NLP tasks as well. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Compact-Transformers.
    Muti-view Mouse Social Behaviour Recognition with Deep Graphical Model. (arXiv:2011.02451v2 [cs.CV] UPDATED)
    (2 min) Home-cage social behaviour analysis of mice is an invaluable tool to assess therapeutic efficacy of neurodegenerative diseases. Despite tremendous efforts made within the research community, single-camera video recordings are mainly used for such analysis. Because of the potential to create rich descriptions of mouse social behaviors, the use of multi-view video recordings for rodent observations is increasingly receiving much attention. However, identifying social behaviours from various views is still challenging due to the lack of correspondence across data sources. To address this problem, we here propose a novel multiview latent-attention and dynamic discriminative model that jointly learns view-specific and view-shared sub-structures, where the former captures unique dynamics of each view whilst the latter encodes the interaction between the views. Furthermore, a novel multi-view latent-attention variational autoencoder model is introduced in learning the acquired features, enabling us to learn discriminative features in each view. Experimental results on the standard CRMI13 and our multi-view Parkinson's Disease Mouse Behaviour (PDMB) datasets demonstrate that our model outperforms the other state of the arts technologies and effectively deals with the imbalanced data problem.
    Mutual-GAN: Towards Unsupervised Cross-Weather Adaptation with Mutual Information Constraint. (arXiv:2106.16000v1 [cs.CV])
    (2 min) Convolutional neural network (CNN) have proven its success for semantic segmentation, which is a core task of emerging industrial applications such as autonomous driving. However, most progress in semantic segmentation of urban scenes is reported on standard scenarios, i.e., daytime scenes with favorable illumination conditions. In practical applications, the outdoor weather and illumination are changeable, e.g., cloudy and nighttime, which results in a significant drop of semantic segmentation accuracy of CNN only trained with daytime data. In this paper, we propose a novel generative adversarial network (namely Mutual-GAN) to alleviate the accuracy decline when daytime-trained neural network is applied to videos captured under adverse weather conditions. The proposed Mutual-GAN adopts mutual information constraint to preserve image-objects during cross-weather adaptation, which is an unsolved problem for most unsupervised image-to-image translation approaches (e.g., CycleGAN). The proposed Mutual-GAN is evaluated on two publicly available driving video datasets (i.e., CamVid and SYNTHIA). The experimental results demonstrate that our Mutual-GAN can yield visually plausible translated images and significantly improve the semantic segmentation accuracy of daytime-trained deep learning network while processing videos under challenging weathers.
    SimNet: Enabling Robust Unknown Object Manipulation from Pure Synthetic Data via Stereo. (arXiv:2106.16118v1 [cs.RO])
    (2 min) Robot manipulation of unknown objects in unstructured environments is a challenging problem due to the variety of shapes, materials, arrangements and lighting conditions. Even with large-scale real-world data collection, robust perception and manipulation of transparent and reflective objects across various lighting conditions remain challenging. To address these challenges we propose an approach to performing sim-to-real transfer of robotic perception. The underlying model, SimNet, is trained as a single multi-headed neural network using simulated stereo data as input and simulated object segmentation masks, 3D oriented bounding boxes (OBBs), object keypoints, and disparity as output. A key component of SimNet is the incorporation of a learned stereo sub-network that predicts disparity. SimNet is evaluated on 2D car detection, unknown object detection, and deformable object keypoint detection and significantly outperforms a baseline that uses a structured light RGB-D sensor. By inferring grasp positions using the OBB and keypoint predictions, SimNet can be used to perform end-to-end manipulation of unknown objects in both easy and hard scenarios using our fleet of Toyota HSR robots in four home environments. In unknown object grasping experiments, the predictions from the baseline RGB-D network and SimNet enable successful grasps of most of the easy objects. However, the RGB-D baseline only grasps 35% of the hard (e.g., transparent) objects, while SimNet grasps 95%, suggesting that SimNet can enable robust manipulation of unknown objects, including transparent objects, in unknown environments.
    Image Super-Resolution via Iterative Refinement. (arXiv:2104.07636v2 [eess.IV] UPDATED)
    (2 min) We present SR3, an approach to image Super-Resolution via Repeated Refinement. SR3 adapts denoising diffusion probabilistic models to conditional image generation and performs super-resolution through a stochastic denoising process. Inference starts with pure Gaussian noise and iteratively refines the noisy output using a U-Net model trained on denoising at various noise levels. SR3 exhibits strong performance on super-resolution tasks at different magnification factors, on faces and natural images. We conduct human evaluation on a standard 8X face super-resolution task on CelebA-HQ, comparing with SOTA GAN methods. SR3 achieves a fool rate close to 50%, suggesting photo-realistic outputs, while GANs do not exceed a fool rate of 34%. We further show the effectiveness of SR3 in cascaded image generation, where generative models are chained with super-resolution models, yielding a competitive FID score of 11.3 on ImageNet.
    Automated Onychomycosis Detection Using Deep Neural Networks. (arXiv:2106.16139v1 [cs.CV])
    (2 min) Clinical dermatology, still relies heavily on manual introspection of fungi within a Potassium Hydroxide (KOH) solution using a brightfield microscope. However, this method takes a long time, is based on the experience of the clinician, and has a low accuracy. With the increase of neural network applications in the field of clinical microscopy it is now possible to automate such manual processes increasing both efficiency and accuracy. This study presents a deep neural network structure that enables the rapid solutions for these problems and can perform automatic fungi detection in grayscale images without colorants. Microscopic images of 81 fungi and 235 ceratine were collected. Then, smaller patches were extracted containing 2062 fungi and 2142 ceratine. In order to detect fungus and ceratine, two models were created one of which was a custom neural network and the other was based on the VGG16 architecture. The developed custom model had 99.84% accuracy, and an area under the curve (AUC) value of 1.00, while the VGG16 model had 98.89% accuracy and an AUC value of 0.99. However, average accuracy and AUC value of clinicians is 72.8% and 0.87 respectively. This deep learning model allows the development of an automated system that can detect fungi within microscopic images.
    How to Train Your MAML to Excel in Few-Shot Classification. (arXiv:2106.16245v1 [cs.LG])
    (2 min) Model-agnostic meta-learning (MAML) is arguably the most popular meta-learning algorithm nowadays, given its flexibility to incorporate various model architectures and to be applied to different problems. Nevertheless, its performance on few-shot classification is far behind many recent algorithms dedicated to the problem. In this paper, we point out several key facets of how to train MAML to excel in few-shot classification. First, we find that a large number of gradient steps are needed for the inner loop update, which contradicts the common usage of MAML for few-shot classification. Second, we find that MAML is sensitive to the permutation of class assignments in meta-testing: for a few-shot task of $N$ classes, there are exponentially many ways to assign the learned initialization of the $N$-way classifier to the $N$ classes, leading to an unavoidably huge variance. Third, we investigate several ways for permutation invariance and find that learning a shared classifier initialization for all the classes performs the best. On benchmark datasets such as MiniImageNet and TieredImageNet, our approach, which we name UNICORN-MAML, performs on a par with or even outperforms state-of-the-art algorithms, while keeping the simplicity of MAML without adding any extra sub-networks.
    Domain adaptation for person re-identification on new unlabeled data using AlignedReID++. (arXiv:2106.15693v1 [cs.CV])
    (2 min) In the world where big data reigns and there is plenty of hardware prepared to gather a huge amount of non structured data, data acquisition is no longer a problem. Surveillance cameras are ubiquitous and they capture huge numbers of people walking across different scenes. However, extracting value from this data is challenging, specially for tasks that involve human images, such as face recognition and person re-identification. Annotation of this kind of data is a challenging and expensive task. In this work we propose a domain adaptation workflow to allow CNNs that were trained in one domain to be applied to another domain without the need for new annotation of the target data. Our method uses AlignedReID++ as the baseline, trained using a Triplet loss with batch hard. Domain adaptation is done by using pseudo-labels generated using an unsupervised learning strategy. Our results show that domain adaptation techniques really improve the performance of the CNN when applied in the target domain.
    Small in-distribution changes in 3D perspective and lighting fool both CNNs and Transformers. (arXiv:2106.16198v1 [cs.CV])
    (2 min) Neural networks are susceptible to small transformations including 2D rotations and shifts, image crops, and even changes in object colors. This is often attributed to biases in the training dataset, and the lack of 2D shift-invariance due to not respecting the sampling theorem. In this paper, we challenge this hypothesis by training and testing on unbiased datasets, and showing that networks are brittle to both small 3D perspective changes and lighting variations which cannot be explained by dataset bias or lack of shift-invariance. To find these in-distribution errors, we introduce an evolution strategies (ES) based approach, which we call CMA-Search. Despite training with a large-scale (0.5 million images), unbiased dataset of camera and light variations, in over 71% cases CMA-Search can find camera parameters in the vicinity of a correctly classified image which lead to in-distribution misclassifications with < 3.6% change in parameters. With lighting changes, CMA-Search finds misclassifications in 33% cases with < 11.6% change in parameters. Finally, we extend this method to find misclassifications in the vicinity of ImageNet images for both ResNet and OpenAI's CLIP model.
    Word-level Sign Language Recognition with Multi-stream Neural Networks Focusing on Local Regions. (arXiv:2106.15989v1 [cs.CV])
    (2 min) In recent years, Word-level Sign Language Recognition (WSLR) research has gained popularity in the computer vision community, and thus various approaches have been proposed. Among these approaches, the method using I3D network achieves the highest recognition accuracy on large public datasets for WSLR. However, the method with I3D only utilizes appearance information of the upper body of the signers to recognize sign language words. On the other hand, in WSLR, the information of local regions, such as the hand shape and facial expression, and the positional relationship among the body and both hands are important. Thus in this work, we utilized local region images of both hands and face, along with skeletal information to capture local information and the positions of both hands relative to the body, respectively. In other words, we propose a novel multi-stream WSLR framework, in which a stream with local region images and a stream with skeletal information are introduced by extending I3D network to improve the recognition accuracy of WSLR. From the experimental results on WLASL dataset, it is evident that the proposed method has achieved about 15% improvement in the Top-1 accuracy than the existing conventional methods.
    Zero-shot Learning with Class Description Regularization. (arXiv:2106.16108v1 [cs.CV])
    (2 min) The purpose of generative Zero-shot learning (ZSL) is to learning from seen classes, transfer the learned knowledge, and create samples of unseen classes from the description of these unseen categories. To achieve better ZSL accuracies, models need to better understand the descriptions of unseen classes. We introduce a novel form of regularization that encourages generative ZSL models to pay more attention to the description of each category. Our empirical results demonstrate improvements over the performance of multiple state-of-the-art models on the task of generalized zero-shot recognition and classification when trained on textual description-based datasets like CUB and NABirds and attribute-based datasets like AWA2, aPY and SUN.
    MissFormer: (In-)attention-based handling of missing observations for trajectory filtering and prediction. (arXiv:2106.16009v1 [cs.CV])
    (2 min) In applications such as object tracking, time-series data inevitably carry missing observations. Following the success of deep learning-based models for various sequence learning tasks, these models increasingly replace classic approaches in object tracking applications for inferring the object motions state. While traditional tracking approaches can deal with missing observations, most of their deep counterparts are, by default, not suited for this. Towards this end, this paper introduces a transformer-based approach for handling missing observations in variable input length trajectory data. The model is formed indirectly by successively increasing the complexity of the demanded inference tasks. Starting from reproducing noise-free trajectories, the model then learns to infer trajectories from noisy inputs. By providing missing tokens, binary-encoded missing events, the model learns to in-attend to missing data and infers a complete trajectory conditioned on the remaining inputs. In the case of a sequence of successive missing events, the model then acts as a pure prediction model. The model's abilities are demonstrated on synthetic data and real-world data reflecting prototypical object tracking scenarios.
    Hierarchical Phenotyping and Graph Modeling of Spatial Architecture in Lymphoid Neoplasms. (arXiv:2106.16174v1 [q-bio.QM])
    (2 min) The cells and their spatial patterns in the tumor microenvironment (TME) play a key role in tumor evolution, and yet remains an understudied topic in computational pathology. This study, to the best of our knowledge, is among the first to hybrid local and global graph methods to profile orchestration and interaction of cellular components. To address the challenge in hematolymphoid cancers where the cell classes in TME are unclear, we first implemented cell level unsupervised learning and identified two new cell subtypes. Local cell graphs or supercells were built for each image by considering the individual cell's geospatial location and classes. Then, we applied supercell level clustering and identified two new cell communities. In the end, we built global graphs to abstract spatial interaction patterns and extract features for disease diagnosis. We evaluate the proposed algorithm on H\&E slides of 60 hematolymphoid neoplasm patients and further compared it with three cell level graph-based algorithms, including the global cell graph, cluster cell graph, and FLocK. The proposed algorithm achieves a mean diagnosis accuracy of 0.703 with the repeated 5-fold cross-validation scheme. In conclusion, our algorithm shows superior performance over the existing methods and can be potentially applied to other cancer types.
    Weakly Supervised Temporal Adjacent Network for Language Grounding. (arXiv:2106.16136v1 [cs.CV])
    (2 min) Temporal language grounding (TLG) is a fundamental and challenging problem for vision and language understanding. Existing methods mainly focus on fully supervised setting with temporal boundary labels for training, which, however, suffers expensive cost of annotation. In this work, we are dedicated to weakly supervised TLG, where multiple description sentences are given to an untrimmed video without temporal boundary labels. In this task, it is critical to learn a strong cross-modal semantic alignment between sentence semantics and visual content. To this end, we introduce a novel weakly supervised temporal adjacent network (WSTAN) for temporal language grounding. Specifically, WSTAN learns cross-modal semantic alignment by exploiting temporal adjacent network in a multiple instance learning (MIL) paradigm, with a whole description paragraph as input. Moreover, we integrate a complementary branch into the framework, which explicitly refines the predictions with pseudo supervision from the MIL stage. An additional self-discriminating loss is devised on both the MIL branch and the complementary branch, aiming to enhance semantic discrimination by self-supervising. Extensive experiments are conducted on three widely used benchmark datasets, \emph{i.e.}, ActivityNet-Captions, Charades-STA, and DiDeMo, and the results demonstrate the effectiveness of our approach.
    Decoder-side Cross Resolution Synthesis for Video Compression Enhancement. (arXiv:2012.00650v3 [cs.CV] UPDATED)
    (2 min) This paper proposes a decoder-side Cross Resolution Synthesis (CRS) module to pursue better compression efficiency beyond the latest Versatile Video Coding (VVC), where we encode intra frames at original high resolution (HR), compress inter frames at a lower resolution (LR), and then super-resolve decoded LR inter frames with the help from preceding HR intra and neighboring LR inter frames. For a LR inter frame, a motion alignment and aggregation network (MAN) is devised to produce temporally aggregated motion representation (AMR) for the guarantee of temporal smoothness; Another texture compensation network (TCN) inputs decoded HR intra frame, re-sampled HR intra frame, and this LR inter frame to generate multiscale affinity map (MAM) and multiscale texture representation (MTR) for better augmenting spatial details; Finally, similarity-driven fusion synthesizes AMR, MTR, MAM to upscale LR inter frame for the removal of compression and resolution re-sampling noises. We enhance the VVC using proposed CRS, showing averaged 8.76% and 11.93% Bj{\o}ntegaard Delta Rate (BD-Rate) gains against the latest VVC anchor in Random Access (RA) and Low-delay P (LDP) settings respectively. In addition, experimental comparisons to the state-of-the-art super-resolution (SR) based VVC enhancement methods, and ablation studies are conducted to further report superior efficiency and generalization of proposed algorithm. All materials will be made to public at https://njuvision.github.io/CRS for reproducible research.
    I Want This Product but Different : Multimodal Retrieval with Synthetic Query Expansion. (arXiv:2102.08871v2 [cs.CV] UPDATED)
    (2 min) This paper addresses the problem of media retrieval using a multimodal query (a query which combines visual input with additional semantic information in natural language feedback). We propose a SynthTriplet GAN framework which resolves this task by expanding the multimodal query with a synthetically generated image that captures semantic information from both image and text input. We introduce a novel triplet mining method that uses a synthetic image as an anchor to directly optimize for embedding distances of generated and target images. We demonstrate that apart from the added value of retrieval illustration with synthetic image with the focus on customization and user feedback, the proposed method greatly surpasses other multimodal generation methods and achieves state of the art results in the multimodal retrieval task. We also show that in contrast to other retrieval methods, our method provides explainable embeddings.
    Famous Companies Use More Letters in Logo:A Large-Scale Analysis of Text Area in Logo. (arXiv:2104.00327v2 [cs.CV] UPDATED)
    (2 min) This paper analyzes a large number of logo images from the LLD-logo dataset, by recent deep learning-based techniques, to understand not only design trends of logo images and but also the correlation to their owner company. Especially, we focus on three correlations between logo images and their text areas, between the text areas and the number of followers on Twitter, and between the logo images and the number of followers. Various findings include the weak positive correlation between the text area ratio and the number of followers of the company. In addition, deep regression and deep ranking methods can catch correlations between the logo images and the number of followers.
    Opening Deep Neural Networks with Generative Models. (arXiv:2105.10013v3 [cs.CV] UPDATED)
    (2 min) Image classification methods are usually trained to perform predictions taking into account a predefined group of known classes. Real-world problems, however, may not allow for a full knowledge of the input and label spaces, making failures in recognition a hazard to deep visual learning. Open set recognition methods are characterized by the ability to correctly identify inputs of known and unknown classes. In this context, we propose GeMOS: simple and plug-and-play open set recognition modules that can be attached to pretrained Deep Neural Networks for visual recognition. The GeMOS framework pairs pre-trained Convolutional Neural Networks with generative models for open set recognition to extract open set scores for each sample, allowing for failure recognition in object recognition tasks. We conduct a thorough evaluation of the proposed method in comparison with state-of-the-art open set algorithms, finding that GeMOS either outperforms or is statistically indistinguishable from more complex and costly models.
    Pros and Cons of GAN Evaluation Measures: New Developments. (arXiv:2103.09396v2 [cs.LG] UPDATED)
    (2 min) This work is an update of a previous paper on the same topic published a few years ago. With the dramatic progress in generative modeling, a suite of new quantitative and qualitative techniques to evaluate models has emerged. Although some measures such as Inception Score, Frechet Inception Distance, Precision-Recall, and Perceptual Path Length are relatively more popular, GAN evaluation is not a settled issue and there is still room for improvement. Here, I describe new dimensions that are becoming important in assessing models (e.g. bias and fairness) and discuss the connection between GAN evaluation and deepfakes. These are important areas of concern in the machine learning community today and progress in GAN evaluation can help mitigate them.
    S2C2 - An orthogonal method for Semi-Supervised Learning on fuzzy labels. (arXiv:2106.16209v1 [cs.CV])
    (2 min) Semi-Supervised Learning (SSL) can decrease the amount of required labeled image data and thus the cost for deep learning. Most SSL methods only consider a clear distinction between classes but in many real-world datasets, this clear distinction is not given due to intra- or interobserver variability. This variability can lead to different annotations per image. Thus many images have ambiguous annotations and their label needs to be considered "fuzzy". This fuzziness of labels must be addressed as it will limit the performance of Semi-Supervised Learning (SSL) and deep learning in general. We propose Semi-Supervised Classification & Clustering (S2C2) which can extend many deep SSL algorithms. S2C2 can estimate the fuzziness of a label and applies SSL as a classification to certainly labeled data while creating distinct clusters for images with similar but fuzzy labels. We show that S2C2 results in median 7.4% better F1-score for classifications and 5.4% lower inner distance of clusters across multiple SSL algorithms and datasets while being more interpretable due to the fuzziness estimation of our method. Overall, a combination of Semi-Supervised Learning with our method S2C2 leads to better handling of the fuzziness of labels and thus real-world datasets.
    Multi-Source Domain Adaptation for Object Detection. (arXiv:2106.15793v1 [cs.CV])
    (2 min) To reduce annotation labor associated with object detection, an increasing number of studies focus on transferring the learned knowledge from a labeled source domain to another unlabeled target domain. However, existing methods assume that the labeled data are sampled from a single source domain, which ignores a more generalized scenario, where labeled data are from multiple source domains. For the more challenging task, we propose a unified Faster R-CNN based framework, termed Divide-and-Merge Spindle Network (DMSN), which can simultaneously enhance domain invariance and preserve discriminative power. Specifically, the framework contains multiple source subnets and a pseudo target subnet. First, we propose a hierarchical feature alignment strategy to conduct strong and weak alignments for low- and high-level features, respectively, considering their different effects for object detection. Second, we develop a novel pseudo subnet learning algorithm to approximate optimal parameters of pseudo target subset by weighted combination of parameters in different source subnets. Finally, a consistency regularization for region proposal network is proposed to facilitate each subnet to learn more abstract invariances. Extensive experiments on different adaptation scenarios demonstrate the effectiveness of the proposed model.
    Domain-Adversarial Training of Self-Attention Based Networks for Land Cover Classification using Multi-temporal Sentinel-2 Satellite Imagery. (arXiv:2104.00564v2 [cs.CV] UPDATED)
    (2 min) The increasing availability of large-scale remote sensing labeled data has prompted researchers to develop increasingly precise and accurate data-driven models for land cover and crop classification (LC&CC). Moreover, with the introduction of self-attention and introspection mechanisms, deep learning approaches have shown promising results in processing long temporal sequences in the multi-spectral domain with a contained computational request. Nevertheless, most practical applications cannot rely on labeled data, and in the field, surveys are a time consuming solution that poses strict limitations to the number of collected samples. Moreover, atmospheric conditions and specific geographical region characteristics constitute a relevant domain gap that does not allow direct applicability of a trained model on the available dataset to the area of interest. In this paper, we investigate adversarial training of deep neural networks to bridge the domain discrepancy between distinct geographical zones. In particular, we perform a thorough analysis of domain adaptation applied to challenging multi-spectral, multi-temporal data, accurately highlighting the advantages of adapting state-of-the-art self-attention based models for LC&CC to different target zones where labeled data are not available. Extensive experimentation demonstrated significant performance and generalization gain in applying domain-adversarial training to source and target regions with marked dissimilarities between the distribution of extracted features.
    GAttANet: Global attention agreement for convolutional neural networks. (arXiv:2104.05575v2 [cs.CV] UPDATED)
    (2 min) Transformer attention architectures, similar to those developed for natural language processing, have recently proved efficient also in vision, either in conjunction with or as a replacement for convolutional layers. Typically, visual attention is inserted in the network architecture as a (series of) feedforward self-attention module(s), with mutual key-query agreement as the main selection and routing operation. However efficient, this strategy is only vaguely compatible with the way that attention is implemented in biological brains: as a separate and unified network of attentional selection regions, receiving inputs from and exerting modulatory influence on the entire hierarchy of visual regions. Here, we report experiments with a simple such attention system that can improve the performance of standard convolutional networks, with relatively few additional parameters. Each spatial position in each layer of the network produces a key-query vector pair; all queries are then pooled into a global attention query. On the next iteration, the match between each key and the global attention query modulates the network's activations -- emphasizing or silencing the locations that agree or disagree (respectively) with the global attention system. We demonstrate the usefulness of this brain-inspired Global Attention Agreement network (GAttANet) for various convolutional backbones (from a simple 5-layer toy model to a standard ResNet50 architecture) and datasets (CIFAR10, CIFAR100, Imagenet-1k). Each time, our global attention system improves accuracy over the corresponding baseline.
    Synthetic Data Are as Good as the Real for Association Knowledge Learning in Multi-object Tracking. (arXiv:2106.16100v1 [cs.CV])
    (2 min) Association, aiming to link bounding boxes of the same identity in a video sequence, is a central component in multi-object tracking (MOT). To train association modules, e.g., parametric networks, real video data are usually used. However, annotating person tracks in consecutive video frames is expensive, and such real data, due to its inflexibility, offer us limited opportunities to evaluate the system performance w.r.t changing tracking scenarios. In this paper, we study whether 3D synthetic data can replace real-world videos for association training. Specifically, we introduce a large-scale synthetic data engine named MOTX, where the motion characteristics of cameras and objects are manually configured to be similar to those in real-world datasets. We show that compared with real data, association knowledge obtained from synthetic data can achieve very similar performance on real-world test sets without domain adaption techniques. Our intriguing observation is credited to two factors. First and foremost, 3D engines can well simulate motion factors such as camera movement, camera view and object movement, so that the simulated videos can provide association modules with effective motion features. Second, experimental results show that the appearance domain gap hardly harms the learning of association knowledge. In addition, the strong customization ability of MOTX allows us to quantitatively assess the impact of motion factors on MOT, which brings new insights to the community.
    Shape Completion via IMLE. (arXiv:2106.16237v1 [cs.CV])
    (2 min) Shape completion is the problem of completing partial input shapes such as partial scans. This problem finds important applications in computer vision and robotics due to issues such as occlusion or sparsity in real-world data. However, most of the existing research related to shape completion has been focused on completing shapes by learning a one-to-one mapping which limits the diversity and creativity of the produced results. We propose a novel multimodal shape completion technique that is effectively able to learn a one-to-many mapping and generates diverse complete shapes. Our approach is based on the conditional Implicit MaximumLikelihood Estimation (IMLE) technique wherein we condition our inputs on partial 3D point clouds. We extensively evaluate our approach by comparing it to various baselines both quantitatively and qualitatively. We show that our method is superior to alternatives in terms of completeness and diversity of shapes
    Operator-valued formulas for Riemannian Gradient and Hessian and families of tractable metrics. (arXiv:2009.10159v2 [math.OC] UPDATED)
    (2 min) We provide an explicit formula for the Levi-Civita connection and Riemannian Hessian for a Riemannian manifold that is a quotient of a manifold embedded in an inner product space with a non-constant metric function. Together with a classical formula for projection, this allows us to evaluate Riemannian gradient and Hessian for several families of metrics on classical manifolds, including a family of metrics on Stiefel manifolds connecting both the constant and canonical ambient metrics with closed-form geodesics. Using these formulas, we derive Riemannian optimization frameworks on quotients of Stiefel manifolds, including flag manifolds, and a new family of complete quotient metrics on the manifold of positive-semidefinite matrices of fixed rank, considered as a quotient of a product of Stiefel and positive-definite matrix manifold with affine-invariant metrics. The method is procedural, and in many instances, the Riemannian gradient and Hessian formulas could be derived by symbolic calculus. The method extends the list of potential metrics that could be used in manifold optimization and machine learning.
    Learnable Reconstruction Methods from RGB Images to Hyperspectral Imaging: A Survey. (arXiv:2106.15944v1 [eess.IV])
    (2 min) Hyperspectral imaging enables versatile applications due to its competence in capturing abundant spatial and spectral information, which are crucial for identifying substances. However, the devices for acquiring hyperspectral images are expensive and complicated. Therefore, many alternative spectral imaging methods have been proposed by directly reconstructing the hyperspectral information from lower-cost, more available RGB images. We present a thorough investigation of these state-of-the-art spectral reconstruction methods from the widespread RGB images. A systematic study and comparison of more than 25 methods has revealed that most of the data-driven deep learning methods are superior to prior-based methods in terms of reconstruction accuracy and quality despite lower speeds. This comprehensive review can serve as a fruitful reference source for peer researchers, thus further inspiring future development directions in related domains.
    Recurrently Estimating Reflective Symmetry Planes from Partial Pointclouds. (arXiv:2106.16129v1 [cs.CV])
    (2 min) Many man-made objects are characterised by a shape that is symmetric along one or more planar directions. Estimating the location and orientation of such symmetry planes can aid many tasks such as estimating the overall orientation of an object of interest or performing shape completion, where a partial scan of an object is reflected across the estimated symmetry plane in order to obtain a more detailed shape. Many methods processing 3D data rely on expensive 3D convolutions. In this paper we present an alternative novel encoding that instead slices the data along the height dimension and passes it sequentially to a 2D convolutional recurrent regression scheme. The method also comprises a differentiable least squares step, allowing for end-to-end accurate and fast processing of both full and partial scans of symmetric objects. We use this approach to efficiently handle 3D inputs to design a method to estimate planar reflective symmetries. We show that our approach has an accuracy comparable to state-of-the-art techniques on the task of planar reflective symmetry estimation on full synthetic objects. Additionally, we show that it can be deployed on partial scans of objects in a real-world pipeline to improve the outputs of a 3D object detector.
    Cyclist Trajectory Forecasts by Incorporation of Multi-View Video Information. (arXiv:2106.15991v1 [cs.CV])
    (2 min) This article presents a novel approach to incorporate visual cues from video-data from a wide-angle stereo camera system mounted at an urban intersection into the forecast of cyclist trajectories. We extract features from image and optical flow (OF) sequences using 3D convolutional neural networks (3D-ConvNet) and combine them with features extracted from the cyclist's past trajectory to forecast future cyclist positions. By the use of additional information, we are able to improve positional accuracy by about 7.5 % for our test dataset and by up to 22 % for specific motion types compared to a method solely based on past trajectories. Furthermore, we compare the use of image sequences to the use of OF sequences as additional information, showing that OF alone leads to significant improvements in positional accuracy. By training and testing our methods using a real-world dataset recorded at a heavily frequented public intersection and evaluating the methods' runtimes, we demonstrate the applicability in real traffic scenarios. Our code and parts of our dataset are made publicly available.
    Attention Aware Wavelet-based Detection of Morphed Face Images. (arXiv:2106.15686v1 [cs.CV])
    (2 min) Morphed images have exploited loopholes in the face recognition checkpoints, e.g., Credential Authentication Technology (CAT), used by Transportation Security Administration (TSA), which is a non-trivial security concern. To overcome the risks incurred due to morphed presentations, we propose a wavelet-based morph detection methodology which adopts an end-to-end trainable soft attention mechanism . Our attention-based deep neural network (DNN) focuses on the salient Regions of Interest (ROI) which have the most spatial support for morph detector decision function, i.e, morph class binary softmax output. A retrospective of morph synthesizing procedure aids us to speculate the ROI as regions around facial landmarks , particularly for the case of landmark-based morphing techniques. Moreover, our attention-based DNN is adapted to the wavelet space, where inputs of the network are coarse-to-fine spectral representations, 48 stacked wavelet sub-bands to be exact. We evaluate performance of the proposed framework using three datasets, VISAPP17, LMA, and MorGAN. In addition, as attention maps can be a robust indicator whether a probe image under investigation is genuine or counterfeit, we analyze the estimated attention maps for both a bona fide image and its corresponding morphed image. Finally, we present an ablation study on the efficacy of utilizing attention mechanism for the sake of morph detection.
    Align Yourself: Self-supervised Pre-training for Fine-grained Recognition via Saliency Alignment. (arXiv:2106.15788v1 [cs.CV])
    (2 min) Self-supervised contrastive learning has demonstrated great potential in learning visual representations. Despite their success on various downstream tasks such as image classification and object detection, self-supervised pre-training for fine-grained scenarios is not fully explored. In this paper, we first point out that current contrastive methods are prone to memorizing background/foreground texture and therefore have a limitation in localizing the foreground object. Analysis suggests that learning to extract discriminative texture information and localization are equally crucial for self-supervised pre-training under fine-grained scenarios. Based on our findings, we introduce Cross-view Saliency Alignment (CVSA), a contrastive learning framework that first crops and swaps saliency regions of images as a novel view generation and then guides the model to localize on the foreground object via a cross-view alignment loss. Extensive experiments on four popular fine-grained classification benchmarks show that CVSA significantly improves the learned representation.
    Fast whole-slide cartography in colon cancer histology using superpixels and CNN classification. (arXiv:2106.15893v1 [eess.IV])
    (2 min) Whole-slide-image cartography is the process of automatically detecting and outlining different tissue types in digitized histological specimen. This semantic segmentation provides a basis for many follow-up analyses and can potentially guide subsequent medical decisions. Due to their large size, whole-slide-images typically have to be divided into smaller patches which are then analyzed individually using machine learning-based approaches. Thereby, local dependencies of image regions get lost and since a whole-slide-image comprises many thousands of such patches this process is inherently slow. We propose to subdivide the image into coherent regions prior to classification by grouping visually similar adjacent image pixels into larger segments, i.e. superpixels. Afterwards, only a random subset of patches per superpixel is classified and patch labels are combined into a single superpixel label. The algorithm has been developed and validated on a dataset of 159 hand-annotated whole-slide-images of colon resections and its performance has been compared to a standard patch-based approach. The algorithm shows an average speed-up of 41% on the test data and the overall accuracy is increased from 93.8% to 95.7%. We additionally propose a metric for identifying superpixels with an uncertain classification so they can be excluded from further analysis. Finally, we evaluate two potential medical applications, namely tumor area estimation including tumor invasive margin generation and tumor composition analysis.
    SOLO: A Simple Framework for Instance Segmentation. (arXiv:2106.15947v1 [cs.CV])
    (2 min) Compared to many other dense prediction tasks, e.g., semantic segmentation, it is the arbitrary number of instances that has made instance segmentation much more challenging. In order to predict a mask for each instance, mainstream approaches either follow the 'detect-then-segment' strategy (e.g., Mask R-CNN), or predict embedding vectors first then cluster pixels into individual instances. In this paper, we view the task of instance segmentation from a completely new perspective by introducing the notion of "instance categories", which assigns categories to each pixel within an instance according to the instance's location. With this notion, we propose segmenting objects by locations (SOLO), a simple, direct, and fast framework for instance segmentation with strong performance. We derive a few SOLO variants (e.g., Vanilla SOLO, Decoupled SOLO, Dynamic SOLO) following the basic principle. Our method directly maps a raw input image to the desired object categories and instance masks, eliminating the need for the grouping post-processing or the bounding box detection. Our approach achieves state-of-the-art results for instance segmentation in terms of both speed and accuracy, while being considerably simpler than the existing methods. Besides instance segmentation, our method yields state-of-the-art results in object detection (from our mask byproduct) and panoptic segmentation. We further demonstrate the flexibility and high-quality segmentation of SOLO by extending it to perform one-stage instance-level image matting. Code is available at: https://git.io/AdelaiDet
    RCNN-SliceNet: A Slice and Cluster Approach for Nuclei Centroid Detection in Three-Dimensional Fluorescence Microscopy Images. (arXiv:2106.15753v1 [eess.IV])
    (2 min) Robust and accurate nuclei centroid detection is important for the understanding of biological structures in fluorescence microscopy images. Existing automated nuclei localization methods face three main challenges: (1) Most of object detection methods work only on 2D images and are difficult to extend to 3D volumes; (2) Segmentation-based models can be used on 3D volumes but it is computational expensive for large microscopy volumes and they have difficulty distinguishing different instances of objects; (3) Hand annotated ground truth is limited for 3D microscopy volumes. To address these issues, we present a scalable approach for nuclei centroid detection of 3D microscopy volumes. We describe the RCNN-SliceNet to detect 2D nuclei centroids for each slice of the volume from different directions and 3D agglomerative hierarchical clustering (AHC) is used to estimate the 3D centroids of nuclei in a volume. The model was trained with the synthetic microscopy data generated using Spatially Constrained Cycle-Consistent Adversarial Networks (SpCycleGAN) and tested on different types of real 3D microscopy data. Extensive experimental results demonstrate that our proposed method can accurately count and detect the nuclei centroids in a 3D microscopy volume.
    Positive-unlabeled Learning for Cell Detection in Histopathology Images with Incomplete Annotations. (arXiv:2106.15918v1 [cs.CV])
    (2 min) Cell detection in histopathology images is of great value in clinical practice. \textit{Convolutional neural networks} (CNNs) have been applied to cell detection to improve the detection accuracy, where cell annotations are required for network training. However, due to the variety and large number of cells, complete annotations that include every cell of interest in the training images can be challenging. Usually, incomplete annotations can be achieved, where positive labeling results are carefully examined to ensure their reliability but there can be other positive instances, i.e., cells of interest, that are not included in the annotations. This annotation strategy leads to a lack of knowledge about true negative samples. Most existing methods simply treat instances that are not labeled as positive as truly negative during network training, which can adversely affect the network performance. In this work, to address the problem of incomplete annotations, we formulate the training of detection networks as a positive-unlabeled learning problem. Specifically, the classification loss in network training is revised to take into account incomplete annotations, where the terms corresponding to negative samples are approximated with the true positive samples and the other samples of which the labels are unknown. To evaluate the proposed method, experiments were performed on a publicly available dataset for mitosis detection in breast cancer cells, and the experimental results show that our method improves the performance of cell detection given incomplete annotations for training.
    The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning. (arXiv:2106.15831v1 [cs.LG])
    (2 min) Although machine learning models typically experience a drop in performance on out-of-distribution data, accuracies on in- versus out-of-distribution data are widely observed to follow a single linear trend when evaluated across a testbed of models. Models that are more accurate on the out-of-distribution data relative to this baseline exhibit "effective robustness" and are exceedingly rare. Identifying such models, and understanding their properties, is key to improving out-of-distribution performance. We conduct a thorough empirical investigation of effective robustness during fine-tuning and surprisingly find that models pre-trained on larger datasets exhibit effective robustness during training that vanishes at convergence. We study how properties of the data influence effective robustness, and we show that it increases with the larger size, more diversity, and higher example difficulty of the dataset. We also find that models that display effective robustness are able to correctly classify 10% of the examples that no other current testbed model gets correct. Finally, we discuss several strategies for scaling effective robustness to the high-accuracy regime to improve the out-of-distribution accuracy of state-of-the-art models.
    Diff2Dist: Learning Spectrally Distinct Edge Functions, with Applications to Cell Morphology Analysis. (arXiv:2106.15716v1 [cs.LG])
    (2 min) We present a method for learning "spectrally descriptive" edge weights for graphs. We generalize a previously known distance measure on graphs (Graph Diffusion Distance), thereby allowing it to be tuned to minimize an arbitrary loss function. Because all steps involved in calculating this modified GDD are differentiable, we demonstrate that it is possible for a small neural network model to learn edge weights which minimize loss. GDD alone does not effectively discriminate between graphs constructed from shoot apical meristem images of wild-type vs. mutant \emph{Arabidopsis thaliana} specimens. However, training edge weights and kernel parameters with contrastive loss produces a learned distance metric with large margins between these graph categories. We demonstrate this by showing improved performance of a simple k-nearest-neighbors classifier on the learned distance matrix. We also demonstrate a further application of this method to biological image analysis: once trained, we use our model to compute the distance between the biological graphs and a set of graphs output by a cell division simulator. This allows us to identify simulation parameter regimes which are similar to each class of graph in our original dataset.
    SIMPL: Generating Synthetic Overhead Imagery to Address Zero-shot and Few-Shot Detection Problems. (arXiv:2106.15681v1 [cs.CV])
    (2 min) Recently deep neural networks (DNNs) have achieved tremendous success for object detection in overhead (e.g., satellite) imagery. One ongoing challenge however is the acquisition of training data, due to high costs of obtaining satellite imagery and annotating objects in it. In this work we present a simple approach - termed Synthetic object IMPLantation (SIMPL) - to easily and rapidly generate large quantities of synthetic overhead training data for custom target objects. We demonstrate the effectiveness of using SIMPL synthetic imagery for training DNNs in zero-shot scenarios where no real imagery is available; and few-shot learning scenarios, where limited real-world imagery is available. We also conduct experiments to study the sensitivity of SIMPL's effectiveness to some key design parameters, providing users for insights when designing synthetic imagery for custom objects. We release a software implementation of our SIMPL approach so that others can build upon it, or use it for their own custom problems.
    Dense Graph Convolutional Neural Networks on 3D Meshes for 3D Object Segmentation and Classification. (arXiv:2106.15778v1 [cs.CV])
    (2 min) This paper presents new designs of graph convolutional neural networks (GCNs) on 3D meshes for 3D object segmentation and classification. We use the faces of the mesh as basic processing units and represent a 3D mesh as a graph where each node corresponds to a face. To enhance the descriptive power of the graph, we introduce a 1-ring face neighbourhood structure to derive novel multi-dimensional spatial and structure features to represent the graph nodes. Based on this new graph representation, we then design a densely connected graph convolutional block which aggregates local and regional features as the key construction component to build effective and efficient practical GCN models for 3D object classification and segmentation. We will present experimental results to show that our new technique outperforms state of the art where our models are shown to have the smallest number of parameters and consietently achieve the highest accuracies across a number of benchmark datasets. We will also present ablation studies to demonstrate the soundness of our design principles and the effectiveness of our practical models.
    A Structured Analysis of the Video Degradation Effects on the Performance of a Machine Learning-enabled Pedestrian Detector. (arXiv:2106.15889v1 [cs.CV])
    (2 min) ML-enabled software systems have been incorporated in many public demonstrations for automated driving (AD) systems. Such solutions have also been considered as a crucial approach to aim at SAE Level 5 systems, where the passengers in such vehicles do not have to interact with the system at all anymore. Already in 2016, Nvidia demonstrated a complete end-to-end approach for training the complete software stack covering perception, planning and decision making, and the actual vehicle control. While such approaches show the great potential of such ML-enabled systems, there have also been demonstrations where already changes to single pixels in a video frame can potentially lead to completely different decisions with dangerous consequences. In this paper, a structured analysis has been conducted to explore video degradation effects on the performance of an ML-enabled pedestrian detector. Firstly, a baseline of applying YOLO to 1,026 frames with pedestrian annotations in the KITTI Vision Benchmark Suite has been established. Next, video degradation candidates for each of these frames were generated using the leading video codecs libx264, libx265, Nvidia HEVC, and AV1: 52 frames for the various compression presets for color and gray-scale frames resulting in 104 degradation candidates per original KITTI frame and 426,816 images in total. YOLO was applied to each image to compute the intersection-over-union (IoU) metric to compare the performance with the original baseline. While aggressively lossy compression settings result in significant performance drops as expected, it was also observed that some configurations actually result in slightly better IoU results compared to the baseline. The findings show that carefully chosen lossy video configurations preserve a decent performance of particular ML-enabled systems while allowing for substantial savings when storing or transmitting data.
    Augmented Shortcuts for Vision Transformers. (arXiv:2106.15941v1 [cs.CV])
    (2 min) Transformer models have achieved great progress on computer vision tasks recently. The rapid development of vision transformers is mainly contributed by their high representation ability for extracting informative features from input images. However, the mainstream transformer models are designed with deep architectures, and the feature diversity will be continuously reduced as the depth increases, i.e., feature collapse. In this paper, we theoretically analyze the feature collapse phenomenon and study the relationship between shortcuts and feature diversity in these transformer models. Then, we present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel on the original shortcuts. To save the computational costs, we further explore an efficient approach that uses the block-circulant projection to implement augmented shortcuts. Extensive experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method, which brings about 1% accuracy increase of the state-of-the-art visual transformers without obviously increasing their parameters and FLOPs.
    Efficient Spatio-Temporal Recurrent Neural Network for Video Deblurring. (arXiv:2106.16028v1 [cs.CV])
    (2 min) Real-time video deblurring still remains a challenging task due to the complexity of spatially and temporally varying blur itself and the requirement of low computational cost. To improve the network efficiency, we adopt residual dense blocks into RNN cells, so as to efficiently extract the spatial features of the current frame. Furthermore, a global spatio-temporal attention module is proposed to fuse the effective hierarchical features from past and future frames to help better deblur the current frame. Another issue needs to be addressed urgently is the lack of a real-world benchmark dataset. Thus, we contribute a novel dataset (BSD) to the community, by collecting paired blurry/sharp video clips using a co-axis beam splitter acquisition system. Experimental results show that the proposed method (ESTRNN) can achieve better deblurring performance both quantitatively and qualitatively with less computational cost against state-of-the-art video deblurring methods. In addition, cross-validation experiments between datasets illustrate the high generality of BSD over the synthetic datasets. The code and dataset are released at https://github.com/zzh-tech/ESTRNN.
    ResViT: Residual vision transformers for multi-modal medical image synthesis. (arXiv:2106.16031v1 [eess.IV])
    (2 min) Multi-modal imaging is a key healthcare technology in the diagnosis and management of disease, but it is often underutilized due to costs associated with multiple separate scans. This limitation yields the need for synthesis of unacquired modalities from the subset of available modalities. In recent years, generative adversarial network (GAN) models with superior depiction of structural details have been established as state-of-the-art in numerous medical image synthesis tasks. However, GANs are characteristically based on convolutional neural network (CNN) backbones that perform local processing with compact filters. This inductive bias, in turn, compromises learning of long-range spatial dependencies. While attention maps incorporated in GANs can multiplicatively modulate CNN features to emphasize critical image regions, their capture of global context is mostly implicit. Here, we propose a novel generative adversarial approach for medical image synthesis, ResViT, to combine local precision of convolution operators with contextual sensitivity of vision transformers. Based on an encoder-decoder architecture, ResViT employs a central bottleneck comprising novel aggregated residual transformer (ART) blocks that synergistically combine convolutional and transformer modules. Comprehensive demonstrations are performed for synthesizing missing sequences in multi-contrast MRI and CT images from MRI. Our results indicate the superiority of ResViT against competing methods in terms of qualitative observations and quantitative metrics.
    Content-Aware Convolutional Neural Networks. (arXiv:2106.15797v1 [cs.CV])
    (2 min) Convolutional Neural Networks (CNNs) have achieved great success due to the powerful feature learning ability of convolution layers. Specifically, the standard convolution traverses the input images/features using a sliding window scheme to extract features. However, not all the windows contribute equally to the prediction results of CNNs. In practice, the convolutional operation on some of the windows (e.g., smooth windows that contain very similar pixels) can be very redundant and may introduce noises into the computation. Such redundancy may not only deteriorate the performance but also incur the unnecessary computational cost. Thus, it is important to reduce the computational redundancy of convolution to improve the performance. To this end, we propose a Content-aware Convolution (CAC) that automatically detects the smooth windows and applies a 1x1 convolutional kernel to replace the original large kernel. In this sense, we are able to effectively avoid the redundant computation on similar pixels. By replacing the standard convolution in CNNs with our CAC, the resultant models yield significantly better performance and lower computational cost than the baseline models with the standard convolution. More critically, we are able to dynamically allocate suitable computation resources according to the data smoothness of different images, making it possible for content-aware computation. Extensive experiments on various computer vision tasks demonstrate the superiority of our method over existing methods.
    Semantic Segmentation of Periocular Near-Infra-Red Eye Images Under Alcohol Effects. (arXiv:2106.15828v1 [cs.CV])
    (2 min) This paper proposes a new framework to detect, segment, and estimate the localization of the eyes from a periocular Near-Infra-Red iris image under alcohol consumption. The purpose of the system is to measure the fitness for duty. Fitness systems allow us to determine whether a person is physically or psychologically able to perform their tasks. Our framework is based on an object detector trained from scratch to detect both eyes from a single image. Then, two efficient networks were used for semantic segmentation; a Criss-Cross attention network and DenseNet10, with only 122,514 and 210,732 parameters, respectively. These networks can find the pupil, iris, and sclera. In the end, the binary output eye mask is used for pupil and iris diameter estimation with high precision. Five state-of-the-art algorithms were used for this purpose. A mixed proposal reached the best results. A second contribution is establishing an alcohol behavior curve to detect the alcohol presence utilizing a stream of images captured from an iris instance. Also, a manually labeled database with more than 20k images was created. Our best method obtains a mean Intersection-over-Union of 94.54% with DenseNet10 with only 210,732 parameters and an error of only 1-pixel on average.
    RICE: Refining Instance Masks in Cluttered Environments with Graph Neural Networks. (arXiv:2106.15711v1 [cs.CV])
    (2 min) Segmenting unseen object instances in cluttered environments is an important capability that robots need when functioning in unstructured environments. While previous methods have exhibited promising results, they still tend to provide incorrect results in highly cluttered scenes. We postulate that a network architecture that encodes relations between objects at a high-level can be beneficial. Thus, in this work, we propose a novel framework that refines the output of such methods by utilizing a graph-based representation of instance masks. We train deep networks capable of sampling smart perturbations to the segmentations, and a graph neural network, which can encode relations between objects, to evaluate the perturbed segmentations. Our proposed method is orthogonal to previous works and achieves state-of-the-art performance when combined with them. We demonstrate an application that uses uncertainty estimates generated by our method to guide a manipulator, leading to efficient understanding of cluttered scenes. Code, models, and video can be found at https://github.com/chrisdxie/rice .
    Learning to Map for Active Semantic Goal Navigation. (arXiv:2106.15648v1 [cs.CV])
    (2 min) We consider the problem of object goal navigation in unseen environments. In our view, solving this problem requires learning of contextual semantic priors, a challenging endeavour given the spatial and semantic variability of indoor environments. Current methods learn to implicitly encode these priors through goal-oriented navigation policy functions operating on spatial representations that are limited to the agent's observable areas. In this work, we propose a novel framework that actively learns to generate semantic maps outside the field of view of the agent and leverages the uncertainty over the semantic classes in the unobserved areas to decide on long term goals. We demonstrate that through this spatial prediction strategy, we are able to learn semantic priors in scenes that can be leveraged in unknown environments. Additionally, we show how different objectives can be defined by balancing exploration with exploitation during searching for semantic targets. Our method is validated in the visually realistic environments offered by the Matterport3D dataset and show state of the art results on the object goal navigation task.
    Monocular 3D Object Detection: An Extrinsic Parameter Free Approach. (arXiv:2106.15796v1 [cs.CV])
    (2 min) Monocular 3D object detection is an important task in autonomous driving. It can be easily intractable where there exists ego-car pose change w.r.t. ground plane. This is common due to the slight fluctuation of road smoothness and slope. Due to the lack of insight in industrial application, existing methods on open datasets neglect the camera pose information, which inevitably results in the detector being susceptible to camera extrinsic parameters. The perturbation of objects is very popular in most autonomous driving cases for industrial products. To this end, we propose a novel method to capture camera pose to formulate the detector free from extrinsic perturbation. Specifically, the proposed framework predicts camera extrinsic parameters by detecting vanishing point and horizon change. A converter is designed to rectify perturbative features in the latent space. By doing so, our 3D detector works independent of the extrinsic parameter variations and produces accurate results in realistic cases, e.g., potholed and uneven roads, where almost all existing monocular detectors fail to handle. Experiments demonstrate our method yields the best performance compared with the other state-of-the-arts by a large margin on both KITTI 3D and nuScenes datasets.
    Single-Step Adversarial Training for Semantic Segmentation. (arXiv:2106.15998v1 [cs.CV])
    (2 min) Even though deep neural networks succeed on many different tasks including semantic segmentation, they lack on robustness against adversarial examples. To counteract this exploit, often adversarial training is used. However, it is known that adversarial training with weak adversarial attacks (e.g. using the Fast Gradient Method) does not improve the robustness against stronger attacks. Recent research shows that it is possible to increase the robustness of such single-step methods by choosing an appropriate step size during the training. Finding such a step size, without increasing the computational effort of single-step adversarial training, is still an open challenge. In this work we address the computationally particularly demanding task of semantic segmentation and propose a new step size control algorithm that increases the robustness of single-step adversarial training. The proposed algorithm does not increase the computational effort of single-step adversarial training considerably and also simplifies training, because it is free of meta-parameter. We show that the robustness of our approach can compete with multi-step adversarial training on two popular benchmarks for semantic segmentation.
    When Video Classification Meets Incremental Classes. (arXiv:2106.15827v1 [cs.CV])
    (2 min) With the rapid development of social media, tremendous videos with new classes are generated daily, which raise an urgent demand for video classification methods that can continuously update new classes while maintaining the knowledge of old videos with limited storage and computing resources. In this paper, we summarize this task as \textit{Class-Incremental Video Classification (CIVC)} and propose a novel framework to address it. As a subarea of incremental learning tasks, the challenge of \textit{catastrophic forgetting} is unavoidable in CIVC. To better alleviate it, we utilize some characteristics of videos. First, we decompose the spatio-temporal knowledge before distillation rather than treating it as a whole in the knowledge transfer process; trajectory is also used to refine the decomposition. Second, we propose a dual granularity exemplar selection method to select and store representative video instances of old classes and key-frames inside videos under a tight storage budget. We benchmark our method and previous SOTA class-incremental learning methods on Something-Something V2 and Kinetics datasets, and our method outperforms previous methods significantly.
  • cs.IR updates on arXiv.org

    The Evaluation of Rating Systems in Team-based Battle Royale Games. (arXiv:2105.14069v2 [cs.IR] UPDATED)
    (2 min) Online competitive games have become a mainstream entertainment platform. To create a fair and exciting experience, these games use rating systems to match players with similar skills. While there has been an increasing amount of research on improving the performance of these systems, less attention has been paid to how their performance is evaluated. In this paper, we explore the utility of several metrics for evaluating three popular rating systems on a real-world dataset of over 25,000 team battle royale matches. Our results suggest considerable differences in their evaluation patterns. Some metrics were highly impacted by the inclusion of new players. Many could not capture the real differences between certain groups of players. Among all metrics studied, normalized discounted cumulative gain (NDCG) demonstrated more reliable performance and more flexibility. It alleviated most of the challenges faced by the other metrics while adding the freedom to adjust the focus of the evaluations on different groups of players.
    Discovering Collaborative Signals for Next POI Recommendation with Iterative Seq2Graph Augmentation. (arXiv:2106.15814v1 [cs.IR])
    (2 min) Being an indispensable component in location-based social networks, next point-of-interest (POI) recommendation recommends users unexplored POIs based on their recent visiting histories. However, existing work mainly models check-in data as isolated POI sequences, neglecting the crucial collaborative signals from cross-sequence check-in information. Furthermore, the sparse POI-POI transitions restrict the ability of a model to learn effective sequential patterns for recommendation. In this paper, we propose Sequence-to-Graph (Seq2Graph) augmentation for each POI sequence, allowing collaborative signals to be propagated from correlated POIs belonging to other sequences. We then devise a novel Sequence-to-Graph POI Recommender (SGRec), which jointly learns POI embeddings and infers a user's temporal preferences from the graph-augmented POI sequence. To overcome the sparsity of POI-level interactions, we further infuse category-awareness into SGRec with a multi-task learning scheme that captures the denser category-wise transitions. As such, SGRec makes full use of the collaborative signals for learning expressive POI representations, and also comprehensively uncovers multi-level sequential patterns for user preference modelling. Extensive experiments on two real-world datasets demonstrate the superiority of SGRec against state-of-the-art methods in next POI recommendation.
    Dual Adversarial Variational Embedding for Robust Recommendation. (arXiv:2106.15779v1 [cs.IR])
    (2 min) Robust recommendation aims at capturing true preference of users from noisy data, for which there are two lines of methods have been proposed. One is based on noise injection, and the other is to adopt the generative model Variational Auto-encoder (VAE). However, the existing works still face two challenges. First, the noise injection based methods often draw the noise from a fixed noise distribution given in advance, while in real world, the noise distributions of different users and items may differ from each other due to personal behaviors and item usage patterns. Second, the VAE based models are not expressive enough to capture the true preference since VAE often yields an embedding space of a single modal, while in real world, user-item interactions usually exhibit multi-modality on user preference distribution. In this paper, we propose a novel model called Dual Adversarial Variational Embedding (DAVE) for robust recommendation, which can provide personalized noise reduction for different users and items, and capture the multi-modality of the embedding space, by combining the advantages of VAE and adversarial training between the introduced auxiliary discriminators and the variational inference networks. The extensive experiments conducted on real datasets verify the effectiveness of DAVE on robust recommendation.
    Incorporating Domain Knowledge for Extractive Summarization of Legal Case Documents. (arXiv:2106.15876v1 [cs.CL])
    (2 min) Automatic summarization of legal case documents is an important and practical challenge. Apart from many domain-independent text summarization algorithms that can be used for this purpose, several algorithms have been developed specifically for summarizing legal case documents. However, most of the existing algorithms do not systematically incorporate domain knowledge that specifies what information should ideally be present in a legal case document summary. To address this gap, we propose an unsupervised summarization algorithm DELSumm which is designed to systematically incorporate guidelines from legal experts into an optimization setup. We conduct detailed experiments over case documents from the Indian Supreme Court. The experiments show that our proposed unsupervised method outperforms several strong baselines in terms of ROUGE scores, including both general summarization algorithms and legal-specific ones. In fact, though our proposed algorithm is unsupervised, it outperforms several supervised summarization models that are trained over thousands of document-summary pairs.
    Generative Adversarial Networks for Spatio-temporal Data: A Survey. (arXiv:2008.08903v3 [cs.LG] UPDATED)
    (2 min) Generative Adversarial Networks (GANs) have shown remarkable success in producing realistic-looking images in the computer vision area. Recently, GAN-based techniques are shown to be promising for spatio-temporal-based applications such as trajectory prediction, events generation and time-series data imputation. While several reviews for GANs in computer vision have been presented, no one has considered addressing the practical applications and challenges relevant to spatio-temporal data. In this paper, we have conducted a comprehensive review of the recent developments of GANs for spatio-temporal data. We summarise the application of popular GAN architectures for spatio-temporal data and the common practices for evaluating the performance of spatio-temporal applications with GANs. Finally, we point out future research directions to benefit researchers in this area.
    News Article Retrieval in Context for Event-centric Narrative Creation. (arXiv:2106.16053v1 [cs.CL])
    (2 min) Writers such as journalists often use automatic tools to find relevant content to include in their narratives. In this paper, we focus on supporting writers in the news domain to develop event-centric narratives. Given an incomplete narrative that specifies a main event and a context, we aim to retrieve news articles that discuss relevant events that would enable the continuation of the narrative. We formally define this task and propose a retrieval dataset construction procedure that relies on existing news articles to simulate incomplete narratives and relevant articles. Experiments on two datasets derived from this procedure show that state-of-the-art lexical and semantic rankers are not sufficient for this task. We show that combining those with a ranker that ranks articles by reverse chronological order outperforms those rankers alone. We also perform an in-depth quantitative and qualitative analysis of the results that sheds light on the characteristics of this task.
    Machine Reading of Hypotheses for Organizational Research Reviews and Pre-trained Models via R Shiny App for Non-Programmers. (arXiv:2106.16102v1 [cs.IR])
    (2 min) The volume of scientific publications in organizational research becomes exceedingly overwhelming for human researchers who seek to timely extract and review knowledge. This paper introduces natural language processing (NLP) models to accelerate the discovery, extraction, and organization of theoretical developments (i.e., hypotheses) from social science publications. We illustrate and evaluate NLP models in the context of a systematic review of stakeholder value constructs and hypotheses. Specifically, we develop NLP models to automatically 1) detect sentences in scholarly documents as hypotheses or not (Hypothesis Detection), 2) deconstruct the hypotheses into nodes (constructs) and links (causal/associative relationships) (Relationship Deconstruction ), and 3) classify the features of links in terms causality (versus association) and direction (positive, negative, versus nonlinear) (Feature Classification). Our models have reported high performance metrics for all three tasks. While our models are built in Python, we have made the pre-trained models fully accessible for non-programmers. We have provided instructions on installing and using our pre-trained models via an R Shiny app graphic user interface (GUI). Finally, we suggest the next paths to extend our methodology for computer-assisted knowledge synthesis.
    GraphFM: Graph Factorization Machines for Feature Interaction Modeling. (arXiv:2105.11866v2 [cs.LG] UPDATED)
    (2 min) Factorization machine (FM) is a prevalent approach to modeling pairwise (second-order) feature interactions when dealing with high-dimensional sparse data. However, on the one hand, FM fails to capture higher-order feature interactions suffering from combinatorial expansion, on the other hand, taking into account interaction between every pair of features may introduce noise and degrade prediction accuracy. To solve the problems, we propose a novel approach Graph Factorization Machine (GraphFM) by naturally representing features in the graph structure. In particular, a novel mechanism is designed to select the beneficial feature interactions and formulate them as edges between features. Then our proposed model which integrates the interaction function of FM into the feature aggregation strategy of Graph Neural Network (GNN), can model arbitrary-order feature interactions on the graph-structured features by stacking layers. Experimental results on several real-world datasets has demonstrated the rationality and effectiveness of our proposed approach.
    Context-Aware Attention-Based Data Augmentation for POI Recommendation. (arXiv:2106.15984v1 [cs.IR])
    (2 min) With the rapid growth of location-based social networks (LBSNs), Point-Of-Interest (POI) recommendation has been broadly studied in this decade. Recently, the next POI recommendation, a natural extension of POI recommendation, has attracted much attention. It aims at suggesting the next POI to a user in spatial and temporal context, which is a practical yet challenging task in various applications. Existing approaches mainly model the spatial and temporal information, and memorize historical patterns through user's trajectories for recommendation. However, they suffer from the negative impact of missing and irregular check-in data, which significantly influences the model performance. In this paper, we propose an attention-based sequence-to-sequence generative model, namely POI-Augmentation Seq2Seq (PA-Seq2Seq), to address the sparsity of training set by making check-in records to be evenly-spaced. Specifically, the encoder summarises each check-in sequence and the decoder predicts the possible missing check-ins based on the encoded information. In order to learn time-aware correlation among user history, we employ local attention mechanism to help the decoder focus on a specific range of context information when predicting a certain missing check-in point. Extensive experiments have been conducted on two real-world check-in datasets, Gowalla and Brightkite, for performance and effectiveness evaluation.
    Multi-Modal Chorus Recognition for Improving Song Search. (arXiv:2106.16153v1 [cs.IR])
    (2 min) We discuss a novel task, Chorus Recognition, which could potentially benefit downstream tasks such as song search and music summarization. Different from the existing tasks such as music summarization or lyrics summarization relying on single-modal information, this paper models chorus recognition as a multi-modal one by utilizing both the lyrics and the tune information of songs. We propose a multi-modal Chorus Recognition model that considers diverse features. Besides, we also create and publish the first Chorus Recognition dataset containing 627 songs for public use. Our empirical study performed on the dataset demonstrates that our approach outperforms several baselines in chorus recognition. In addition, our approach also helps to improve the accuracy of its downstream task - song search by more than 10.6%.
    Learning to Ask Conversational Questions by Optimizing Levenshtein Distance. (arXiv:2106.15903v1 [cs.CL])
    (2 min) Conversational Question Simplification (CQS) aims to simplify self-contained questions into conversational ones by incorporating some conversational characteristics, e.g., anaphora and ellipsis. Existing maximum likelihood estimation (MLE) based methods often get trapped in easily learned tokens as all tokens are treated equally during training. In this work, we introduce a Reinforcement Iterative Sequence Editing (RISE) framework that optimizes the minimum Levenshtein distance (MLD) through explicit editing actions. RISE is able to pay attention to tokens that are related to conversational characteristics. To train RISE, we devise an Iterative Reinforce Training (IRT) algorithm with a Dynamic Programming based Sampling (DPS) process to improve exploration. Experimental results on two benchmark datasets show that RISE significantly outperforms state-of-the-art methods and generalizes well on unseen data.
  • cs.LG updates on arXiv.org

    Escaping the Big Data Paradigm with Compact Transformers. (arXiv:2104.05704v2 [cs.CV] UPDATED)
    (2 min) With the rise of Transformers as the standard for language processing, and their advancements in computer vision, along with their unprecedented size and amounts of training data, many have come to believe that they are not suitable for small sets of data. This trend leads to great concerns, including but not limited to: limited availability of data in certain scientific domains and the exclusion of those with limited resource from research in the field. In this paper, we dispel the myth that transformers are "data hungry" and therefore can only be applied to large sets of data. We show for the first time that with the right size and tokenization, transformers can perform head-to-head with state-of-the-art CNNs on small datasets. Our model eliminates the requirement for class token and positional embeddings through a novel sequence pooling strategy and the use of convolutions. We show that compared to CNNs, our compact transformers have fewer parameters and MACs, while obtaining similar accuracies. Our method is flexible in terms of model size, and can have as little as 0.28M parameters and achieve reasonable results. It can reach an accuracy of 95.29 % when training from scratch on CIFAR-10, which is comparable with modern CNN based approaches, and a significant improvement over previous Transformer based models. Our simple and compact design democratizes transformers by making them accessible to those equipped with basic computing resources and/or dealing with important small datasets. Our method works on larger datasets, such as ImageNet (80.28% accuracy with 29% parameters of ViT), and NLP tasks as well. Our code and pre-trained models are publicly available at https://github.com/SHI-Labs/Compact-Transformers.
    Bias-reduced Multi-step Hindsight Experience Replay for Efficient Multi-goal Reinforcement Learning. (arXiv:2102.12962v2 [cs.LG] UPDATED)
    (2 min) Multi-goal reinforcement learning is widely applied in planning and robot manipulation. Two main challenges in multi-goal reinforcement learning are sparse rewards and sample inefficiency. Hindsight Experience Replay (HER) aims to tackle the two challenges via goal relabeling. However, HER-related works still need millions of samples and a huge computation. In this paper, we propose Multi-step Hindsight Experience Replay (MHER), incorporating multi-step relabeled returns based on $n$-step relabeling to improve sample efficiency. Despite the advantages of $n$-step relabeling, we theoretically and experimentally prove the off-policy $n$-step bias introduced by $n$-step relabeling may lead to poor performance in many environments. To address the above issue, two bias-reduced MHER algorithms, MHER($\lambda$) and Model-based MHER (MMHER) are presented. MHER($\lambda$) exploits the $\lambda$ return while MMHER benefits from model-based value expansions. Experimental results on numerous multi-goal robotic tasks show that our solutions can successfully alleviate off-policy $n$-step bias and achieve significantly higher sample efficiency than HER and Curriculum-guided HER with little additional computation beyond HER.
    Concentration of Non-Isotropic Random Tensors with Applications to Learning and Empirical Risk Minimization. (arXiv:2102.04259v2 [stat.ML] UPDATED)
    (2 min) Dimension is an inherent bottleneck to some modern learning tasks, where optimization methods suffer from the size of the data. In this paper, we study non-isotropic distributions of data and develop tools that aim at reducing these dimensional costs by a dependency on an effective dimension rather than the ambient one. Based on non-asymptotic estimates of the metric entropy of ellipsoids -- that prove to generalize to infinite dimensions -- and on a chaining argument, our uniform concentration bounds involve an effective dimension instead of the global dimension, improving over existing results. We show the importance of taking advantage of non-isotropic properties in learning problems with the following applications: i) we improve state-of-the-art results in statistical preconditioning for communication-efficient distributed optimization, ii) we introduce a non-isotropic randomized smoothing for non-smooth optimization. Both applications cover a class of functions that encompasses empirical risk minization (ERM) for linear models.
    Discovering conservation laws from trajectories via machine learning. (arXiv:2102.04008v2 [cs.LG] UPDATED)
    (2 min) Invariants and conservation laws convey critical information about the underlying dynamics of a system, yet it is generally infeasible to find them from large-scale data without any prior knowledge or human insight. We propose ConservNet to achieve this goal, a neural network that spontaneously discovers a conserved quantity from grouped data where the members of each group share invariants, similar to a general experimental setting where trajectories from different trials are observed. As a neural network trained with a novel and intuitive loss function called noise-variance loss, ConservNet learns the hidden invariants in each group of multi-dimensional observables in a data-driven, end-to-end manner. Our model successfully discovers underlying invariants from the simulated systems having invariants as well as a real-world double pendulum trajectory. Since the model is robust to various noises and data conditions compared to baseline, our approach is directly applicable to experimental data for discovering hidden conservation laws and further, general relationships between variables.
    Pretrained Transformers as Universal Computation Engines. (arXiv:2103.05247v2 [cs.LG] UPDATED)
    (2 min) We investigate the capability of a transformer pretrained on natural language to generalize to other modalities with minimal finetuning -- in particular, without finetuning of the self-attention and feedforward layers of the residual blocks. We consider such a model, which we call a Frozen Pretrained Transformer (FPT), and study finetuning it on a variety of sequence classification tasks spanning numerical computation, vision, and protein fold prediction. In contrast to prior works which investigate finetuning on the same modality as the pretraining dataset, we show that pretraining on natural language can improve performance and compute efficiency on non-language downstream tasks. Additionally, we perform an analysis of the architecture, comparing the performance of a random initialized transformer to a random LSTM. Combining the two insights, we find language-pretrained transformers can obtain strong performance on a variety of non-language tasks.
    Pros and Cons of GAN Evaluation Measures: New Developments. (arXiv:2103.09396v2 [cs.LG] UPDATED)
    (2 min) This work is an update of a previous paper on the same topic published a few years ago. With the dramatic progress in generative modeling, a suite of new quantitative and qualitative techniques to evaluate models has emerged. Although some measures such as Inception Score, Frechet Inception Distance, Precision-Recall, and Perceptual Path Length are relatively more popular, GAN evaluation is not a settled issue and there is still room for improvement. Here, I describe new dimensions that are becoming important in assessing models (e.g. bias and fairness) and discuss the connection between GAN evaluation and deepfakes. These are important areas of concern in the machine learning community today and progress in GAN evaluation can help mitigate them.
    ppAURORA: Privacy Preserving Area Under Receiver Operating Characteristic and Precision-Recall Curves with Secure 3-Party Computation. (arXiv:2102.08788v2 [cs.LG] UPDATED)
    (3 min) Computing an AUC as a performance measure to compare the quality of different machine learning models is one of the final steps of many research projects. Many of these methods are trained on privacy-sensitive data and there are several different approaches like $\epsilon$-differential privacy, federated machine learning and methods based on cryptographic approaches if the datasets cannot be shared or evaluated jointly at one place. In this setting, it can also be a problem to compute the global performance measure like an AUC, since the labels might also contain privacy-sensitive information. There have been approaches based on $\epsilon$-differential privacy to deal with this problem, but to the best of our knowledge, no exact privacy preserving solution has been introduced. In this paper, we propose an MPC-based framework, called \fw{}, with private merging of sorted lists and novel methods for comparing two secret-shared values, selecting between two secret-shared values, converting the modulus, and performing division to compute the exact AUC as one could obtain on the pooled original test samples. With \fw{} computation of the exact area under precision-recall curve and receiver operating characteristic curve is even possible when ties between prediction confidence values exist. To show the applicability of \fw{}, we use it to evaluate a model trained to predict acute myeloid leukemia therapy response and we also assess its scalability via experiments on synthetic data. The experiments show that we efficiently compute exactly the same AUC with both evaluation metrics in a privacy preserving manner as one can obtain on the pooled test samples in the plaintext domain. Our solution provides security against semi-honest corruption of at most one of the servers performing the secure computation.
    Weight Divergence Driven Divide-and-Conquer Approach for Optimal Federated Learning from non-IID Data. (arXiv:2106.14503v2 [cs.LG] UPDATED)
    (2 min) Federated Learning allows training of data stored in distributed devices without the need for centralizing training data, thereby maintaining data privacy. Addressing the ability to handle data heterogeneity (non-identical and independent distribution or non-IID) is a key enabler for the wider deployment of Federated Learning. In this paper, we propose a novel Divide-and-Conquer training methodology that enables the use of the popular FedAvg aggregation algorithm by overcoming the acknowledged FedAvg limitations in non-IID environments. We propose a novel use of Cosine-distance based Weight Divergence metric to determine the exact point where a Deep Learning network can be divided into class agnostic initial layers and class-specific deep layers for performing a Divide and Conquer training. We show that the methodology achieves trained model accuracy at par (and in certain cases exceeding) with numbers achieved by state-of-the-art Aggregation algorithms like FedProx, FedMA, etc. Also, we show that this methodology leads to compute and bandwidth optimizations under certain documented conditions.
    On the Utility of Gradient Compression in Distributed Training Systems. (arXiv:2103.00543v3 [cs.DC] UPDATED)
    (2 min) A rich body of prior work has highlighted the existence of communication bottlenecks in synchronous data-parallel training. To alleviate these bottlenecks, a long line of recent work proposes gradient and model compression methods. In this work, we evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD across more than 200 different setups. Surprisingly, we observe that only in 6 cases out of more than 200, gradient compression methods provide speedup over optimized synchronous data-parallel training in the typical data-center setting. We conduct an extensive investigation to identify the root causes of this phenomenon, and offer a performance model that can be used to identify the benefits of gradient compression for a variety of system setups. Based on our analysis, we propose a list of desirable properties that gradient compression methods should satisfy, in order for them to provide a meaningful end-to-end speedup.
    Machine Learning-enhanced Receive Processing for MU-MIMO OFDM Systems. (arXiv:2106.16074v1 [cs.IT])
    (2 min) Machine learning (ML) can be used in various ways to improve multi-user multiple-input multiple-output (MU-MIMO) receive processing. Typical approaches either augment a single processing step, such as symbol detection, or replace multiple steps jointly by a single neural network (NN). These techniques demonstrate promising results but often assume perfect channel state information (CSI) or fail to satisfy the interpretability and scalability constraints imposed by practical systems. In this paper, we propose a new strategy which preserves the benefits of a conventional receiver, but enhances specific parts with ML components. The key idea is to exploit the orthogonal frequency-division multiplexing (OFDM) signal structure to improve both the demapping and the computation of the channel estimation error statistics. Evaluation results show that the proposed ML-enhanced receiver beats practical baselines on all considered scenarios, with significant gains at high speeds.
    Evolving Metric Learning for Incremental and Decremental Features. (arXiv:2006.15334v2 [cs.LG] UPDATED)
    (2 min) Online metric learning has been widely exploited for large-scale data classification due to the low computational cost. However, amongst online practical scenarios where the features are evolving (e.g., some features are vanished and some new features are augmented), most metric learning models cannot be successfully applied to these scenarios, although they can tackle the evolving instances efficiently. To address the challenge, we develop a new online Evolving Metric Learning (EML) model for incremental and decremental features, which can handle the instance and feature evolutions simultaneously by incorporating with a smoothed Wasserstein metric distance. Specifically, our model contains two essential stages: a Transforming stage (T-stage) and a Inheriting stage (I-stage). For the T-stage, we propose to extract important information from vanished features while neglecting non-informative knowledge, and forward it into survived features by transforming them into a low-rank discriminative metric space. It further explores the intrinsic low-rank structure of heterogeneous samples to reduce the computation and memory burden especially for highly-dimensional large-scale data. For the I-stage, we inherit the metric performance of survived features from the T-stage and then expand to include the new augmented features. Moreover, a smoothed Wasserstein distance is utilized to characterize the similarity relationships among the heterogeneous and complex samples, since the evolving features are not strictly aligned in the different stages. In addition to tackling the challenges in one-shot case, we also extend our model into multishot scenario. After deriving an efficient optimization strategy for both T-stage and I-stage, extensive experiments on several datasets verify the superior performance of our EML model.
    Image Super-Resolution via Iterative Refinement. (arXiv:2104.07636v2 [eess.IV] UPDATED)
    (2 min) We present SR3, an approach to image Super-Resolution via Repeated Refinement. SR3 adapts denoising diffusion probabilistic models to conditional image generation and performs super-resolution through a stochastic denoising process. Inference starts with pure Gaussian noise and iteratively refines the noisy output using a U-Net model trained on denoising at various noise levels. SR3 exhibits strong performance on super-resolution tasks at different magnification factors, on faces and natural images. We conduct human evaluation on a standard 8X face super-resolution task on CelebA-HQ, comparing with SOTA GAN methods. SR3 achieves a fool rate close to 50%, suggesting photo-realistic outputs, while GANs do not exceed a fool rate of 34%. We further show the effectiveness of SR3 in cascaded image generation, where generative models are chained with super-resolution models, yielding a competitive FID score of 11.3 on ImageNet.
    Language Modeling with Reduced Densities. (arXiv:2007.03834v3 [cs.CL] UPDATED)
    (2 min) This work originates from the observation that today's state of the art statistical language models are impressive not only for their performance, but also - and quite crucially - because they are built entirely from correlations in unstructured text data. The latter observation prompts a fundamental question that lies at the heart of this paper: What mathematical structure exists in unstructured text data? We put forth enriched category theory as a natural answer. We show that sequences of symbols from a finite alphabet, such as those found in a corpus of text, form a category enriched over probabilities. We then address a second fundamental question: How can this information be stored and modeled in a way that preserves the categorical structure? We answer this by constructing a functor from our enriched category of text to a particular enriched category of reduced density operators. The latter leverages the Loewner order on positive semidefinite operators, which can further be interpreted as a toy example of entailment.
    SGD Generalizes Better Than GD (And Regularization Doesn't Help). (arXiv:2102.01117v2 [cs.LG] UPDATED)
    (2 min) We give a new separation result between the generalization performance of stochastic gradient descent (SGD) and of full-batch gradient descent (GD) in the fundamental stochastic convex optimization model. While for SGD it is well-known that $O(1/\epsilon^2)$ iterations suffice for obtaining a solution with $\epsilon$ excess expected risk, we show that with the same number of steps GD may overfit and emit a solution with $\Omega(1)$ generalization error. Moreover, we show that in fact $\Omega(1/\epsilon^4)$ iterations are necessary for GD to match the generalization performance of SGD, which is also tight due to recent work by Bassily et al. (2020). We further discuss how regularizing the empirical risk minimized by GD essentially does not change the above result, and revisit the concepts of stability, implicit bias and the role of the learning algorithm in generalization.
    Exponential Savings in Agnostic Active Learning through Abstention. (arXiv:2102.00451v2 [cs.LG] UPDATED)
    (2 min) We show that in pool-based active classification without assumptions on the underlying distribution, if the learner is given the power to abstain from some predictions by paying the price marginally smaller than the average loss $1/2$ of a random guess, exponential savings in the number of label requests are possible whenever they are possible in the corresponding realizable problem. We extend this result to provide a necessary and sufficient condition for exponential savings in pool-based active classification under the model misspecification.
    1-bit Adam: Communication Efficient Large-Scale Training with Adam's Convergence Speed. (arXiv:2102.02888v2 [cs.LG] UPDATED)
    (2 min) Scalable training of large models (like BERT and GPT-3) requires careful optimization rooted in model design, architecture, and system capabilities. From a system standpoint, communication has become a major bottleneck, especially on commodity systems with standard TCP interconnects that offer limited network bandwidth. Communication compression is an important technique to reduce training time on such systems. One of the most effective methods is error-compensated compression, which offers robust convergence speed even under 1-bit compression. However, state-of-the-art error compensation techniques only work with basic optimizers like SGD and momentum SGD, which are linearly dependent on the gradients. They do not work with non-linear gradient-based optimizers like Adam, which offer state-of-the-art convergence efficiency and accuracy for models like BERT. In this paper, we propose 1-bit Adam that reduces the communication volume by up to $5\times$, offers much better scalability, and provides the same convergence speed as uncompressed Adam. Our key finding is that Adam's variance (non-linear term) becomes stable (after a warmup phase) and can be used as a fixed precondition for the rest of the training (compression phase). Experiments on up to 256 GPUs show that 1-bit Adam enables up to $3.3\times$ higher throughput for BERT-Large pre-training and up to $2.9\times$ higher throughput for SQuAD fine-tuning. In addition, we provide theoretical analysis for our proposed work.
    Distill on the Go: Online knowledge distillation in self-supervised learning. (arXiv:2104.09866v2 [cs.CV] UPDATED)
    (2 min) Self-supervised learning solves pretext prediction tasks that do not require annotations to learn feature representations. For vision tasks, pretext tasks such as predicting rotation, solving jigsaw are solely created from the input data. Yet, predicting this known information helps in learning representations useful for downstream tasks. However, recent works have shown that wider and deeper models benefit more from self-supervised learning than smaller models. To address the issue of self-supervised pre-training of smaller models, we propose Distill-on-the-Go (DoGo), a self-supervised learning paradigm using single-stage online knowledge distillation to improve the representation quality of the smaller models. We employ deep mutual learning strategy in which two models collaboratively learn from each other to improve one another. Specifically, each model is trained using self-supervised learning along with distillation that aligns each model's softmax probabilities of similarity scores with that of the peer model. We conduct extensive experiments on multiple benchmark datasets, learning objectives, and architectures to demonstrate the potential of our proposed method. Our results show significant performance gain in the presence of noisy and limited labels and generalization to out-of-distribution data.
    Enhancing Human-Machine Teaming for Medical Prognosis Through Neural Ordinary Differential Equations (NODEs). (arXiv:2102.04121v2 [cs.AI] UPDATED)
    (2 min) Machine Learning (ML) has recently been demonstrated to rival expert-level human accuracy in prediction and detection tasks in a variety of domains, including medicine. Despite these impressive findings, however, a key barrier to the full realization of ML's potential in medical prognoses is technology acceptance. Recent efforts to produce explainable AI (XAI) have made progress in improving the interpretability of some ML models, but these efforts suffer from limitations intrinsic to their design: they work best at identifying why a system fails, but do poorly at explaining when and why a model's prediction is correct. We posit that the acceptability of ML predictions in expert domains is limited by two key factors: the machine's horizon of prediction that extends beyond human capability, and the inability for machine predictions to incorporate human intuition into their models. We propose the use of a novel ML architecture, Neural Ordinary Differential Equations (NODEs) to enhance human understanding and encourage acceptability. Our approach prioritizes human cognitive intuition at the center of the algorithm design, and offers a distribution of predictions rather than single outputs. We explain how this approach may significantly improve human-machine collaboration in prediction tasks in expert domains such as medical prognoses. We propose a model and demonstrate, by expanding a concrete example from the literature, how our model advances the vision of future hybrid Human-AI systems.
    Machine Learning for MU-MIMO Receive Processing in OFDM Systems. (arXiv:2012.08177v2 [cs.IT] UPDATED)
    (2 min) Machine learning (ML) starts to be widely used to enhance the performance of multi-user multiple-input multiple-output (MU-MIMO) receivers. However, it is still unclear if such methods are truly competitive with respect to conventional methods in realistic scenarios and under practical constraints. In addition to enabling accurate signal reconstruction on realistic channel models, MU-MIMO receive algorithms must allow for easy adaptation to a varying number of users without the need for retraining. In contrast to existing work, we propose an ML-enhanced MU-MIMO receiver that builds on top of a conventional linear minimum mean squared error (LMMSE) architecture. It preserves the interpretability and scalability of the LMMSE receiver, while improving its accuracy in two ways. First, convolutional neural networks (CNNs) are used to compute an approximation of the second-order statistics of the channel estimation error which are required for accurate equalization. Second, a CNN-based demapper jointly processes a large number of orthogonal frequency-division multiplexing (OFDM) symbols and subcarriers, which allows it to compute better log likelihood ratios (LLRs) by compensating for channel aging. The resulting architecture can be used in the up- and downlink and is trained in an end-to-end manner, removing the need for hard-to-get perfect channel state information (CSI) during the training phase. Simulation results demonstrate consistent performance improvements over the baseline which are especially pronounced in high mobility scenarios.
    A Reinforcement Learning Approach to the Orienteering Problem with Time Windows. (arXiv:2011.03647v2 [cs.LG] UPDATED)
    (2 min) The Orienteering Problem with Time Windows (OPTW) is a combinatorial optimization problem where the goal is to maximize the total score collected from different visited locations. The application of neural network models to combinatorial optimization has recently shown promising results in dealing with similar problems, like the Travelling Salesman Problem. A neural network allows learning solutions using reinforcement learning or supervised learning, depending on the available data. After the learning stage, it can be generalized and quickly fine-tuned to further improve performance and personalization. The advantages are evident since, for real-world applications, solution quality, personalization, and execution times are all important factors that should be taken into account. This study explores the use of Pointer Network models trained using reinforcement learning to solve the OPTW problem. We propose a modified architecture that leverages Pointer Networks to better address problems related with dynamic time-dependent constraints. Among its various applications, the OPTW can be used to model the Tourist Trip Design Problem (TTDP). We train the Pointer Network with the TTDP problem in mind, by sampling variables that can change across tourists visiting a particular instance-region: starting position, starting time, available time, and the scores given to each point of interest. Once a model-region is trained, it can infer a solution for a particular tourist using beam search. We based the assessment of our approach on several existing benchmark OPTW instances. We show that it generalizes across different tourists that visit each region and that it generally outperforms the most commonly used heuristic, while computing the solution in realistic times.
    Knowledge-Based Learning of Nonlinear Dynamics and Chaos. (arXiv:2010.03415v3 [nlin.CD] UPDATED)
    (2 min) Extracting predictive models from nonlinear systems is a central task in scientific machine learning. One key problem is the reconciliation between modern data-driven approaches and first principles. Despite rapid advances in machine learning techniques, embedding domain knowledge into data-driven models remains a challenge. In this work, we present a universal learning framework for extracting predictive models from nonlinear systems based on observations. Our framework can readily incorporate first principle knowledge because it naturally models nonlinear systems as continuous-time systems. This both improves the extracted models' extrapolation power and reduces the amount of data needed for training. In addition, our framework has the advantages of robustness to observational noise and applicability to irregularly sampled data. We demonstrate the effectiveness of our scheme by learning predictive models for a wide variety of systems including a stiff Van der Pol oscillator, the Lorenz system, and the Kuramoto-Sivashinsky equation. For the Lorenz system, different types of domain knowledge are incorporated to demonstrate the strength of knowledge embedding in data-driven system identification.
    How to Train Your MAML to Excel in Few-Shot Classification. (arXiv:2106.16245v1 [cs.LG])
    (2 min) Model-agnostic meta-learning (MAML) is arguably the most popular meta-learning algorithm nowadays, given its flexibility to incorporate various model architectures and to be applied to different problems. Nevertheless, its performance on few-shot classification is far behind many recent algorithms dedicated to the problem. In this paper, we point out several key facets of how to train MAML to excel in few-shot classification. First, we find that a large number of gradient steps are needed for the inner loop update, which contradicts the common usage of MAML for few-shot classification. Second, we find that MAML is sensitive to the permutation of class assignments in meta-testing: for a few-shot task of $N$ classes, there are exponentially many ways to assign the learned initialization of the $N$-way classifier to the $N$ classes, leading to an unavoidably huge variance. Third, we investigate several ways for permutation invariance and find that learning a shared classifier initialization for all the classes performs the best. On benchmark datasets such as MiniImageNet and TieredImageNet, our approach, which we name UNICORN-MAML, performs on a par with or even outperforms state-of-the-art algorithms, while keeping the simplicity of MAML without adding any extra sub-networks.
    Scalable Normalizing Flows for Permutation Invariant Densities. (arXiv:2010.03242v2 [cs.LG] UPDATED)
    (2 min) Modeling sets is an important problem in machine learning since this type of data can be found in many domains. A promising approach defines a family of permutation invariant densities with continuous normalizing flows. This allows us to maximize the likelihood directly and sample new realizations with ease. In this work, we demonstrate how calculating the trace, a crucial step in this method, raises issues that occur both during training and inference, limiting its practicality. We propose an alternative way of defining permutation equivariant transformations that give closed form trace. This leads not only to improvements while training, but also to better final performance. We demonstrate the benefits of our approach on point processes and general set modeling.
    Challenges and Opportunities in High-dimensional Variational Inference. (arXiv:2103.01085v2 [cs.LG] UPDATED)
    (2 min) Current black-box variational inference (BBVI) methods require the user to make numerous design choices -- such as the selection of variational objective and approximating family -- yet there is little principled guidance on how to do so. We develop a conceptual framework and set of experimental tools to understand the effects of these choices, which we leverage to propose best practices for maximizing posterior approximation accuracy. Our approach is based on studying the pre-asymptotic tail behavior of the density ratios between the joint distribution and the variational approximation, then exploiting insights and tools from the importance sampling literature. Our framework and supporting experiments help to distinguish between the behavior of BBVI methods for approximating low-dimensional versus moderate-to-high-dimensional posteriors. In the latter case, we show that mass-covering variational objectives are difficult to optimize and do not improve accuracy, but flexible variational families can improve accuracy and the effectiveness of importance sampling -- at the cost of additional optimization challenges. Therefore, for moderate-to-high-dimensional posteriors we recommend using the (mode-seeking) exclusive KL divergence since it is the easiest to optimize, and improving the variational family or using model parameter transformations to make the posterior and optimal variational approximation more similar. On the other hand, in low-dimensional settings, we show that heavy-tailed variational families and mass-covering divergences are effective and can increase the chances that the approximation can be improved by importance sampling.
    Is Reinforcement Learning More Difficult Than Bandits? A Near-optimal Algorithm Escaping the Curse of Horizon. (arXiv:2009.13503v2 [cs.LG] UPDATED)
    (2 min) Episodic reinforcement learning and contextual bandits are two widely studied sequential decision-making problems. Episodic reinforcement learning generalizes contextual bandits and is often perceived to be more difficult due to long planning horizon and unknown state-dependent transitions. The current paper shows that the long planning horizon and the unknown state-dependent transitions (at most) pose little additional difficulty on sample complexity. We consider the episodic reinforcement learning with $S$ states, $A$ actions, planning horizon $H$, total reward bounded by $1$, and the agent plays for $K$ episodes. We propose a new algorithm, \textbf{M}onotonic \textbf{V}alue \textbf{P}ropagation (MVP), which relies on a new Bernstein-type bonus. Compared to existing bonus constructions, the new bonus is tighter since it is based on a well-designed monotonic value function. In particular, the \emph{constants} in the bonus should be subtly setting to ensure optimism and monotonicity. We show MVP enjoys an $O\left(\left(\sqrt{SAK} + S^2A\right) \poly\log \left(SAHK\right)\right)$ regret, approaching the $\Omega\left(\sqrt{SAK}\right)$ lower bound of \emph{contextual bandits} up to logarithmic terms. Notably, this result 1) \emph{exponentially} improves the state-of-the-art polynomial-time algorithms by Dann et al. [2019] and Zanette et al. [2019] in terms of the dependency on $H$, and 2) \emph{exponentially} improves the running time in [Wang et al. 2020] and significantly improves the dependency on $S$, $A$ and $K$ in sample complexity.
    Analytic Insights into Structure and Rank of Neural Network Hessian Maps. (arXiv:2106.16225v1 [cs.LG])
    (2 min) The Hessian of a neural network captures parameter interactions through second-order derivatives of the loss. It is a fundamental object of study, closely tied to various problems in deep learning, including model design, optimization, and generalization. Most prior work has been empirical, typically focusing on low-rank approximations and heuristics that are blind to the network structure. In contrast, we develop theoretical tools to analyze the range of the Hessian map, providing us with a precise understanding of its rank deficiency as well as the structural reasons behind it. This yields exact formulas and tight upper bounds for the Hessian rank of deep linear networks, allowing for an elegant interpretation in terms of rank deficiency. Moreover, we demonstrate that our bounds remain faithful as an estimate of the numerical Hessian rank, for a larger class of models such as rectified and hyperbolic tangent networks. Further, we also investigate the implications of model architecture (e.g.~width, depth, bias) on the rank deficiency. Overall, our work provides novel insights into the source and extent of redundancy in overparameterized networks.
    A Robust Classification-autoencoder to Defend Outliers and Adversaries. (arXiv:2106.15927v1 [cs.LG])
    (2 min) In this paper, we present a robust classification-autoencoder (CAE) which has strong ability to recognize outliers and defend adversaries. The basic idea is to change the autoencoder from an unsupervised learning method into a classifier. The CAE is a modified autoencoder, where the encoder is used to compress samples with different labels into disjoint compression spaces and the decoder is used to recover a sample with a given label from the corresponding compression space. The encoder is used as a classifier and the decoder is used to decide whether the classification given by the encoder is correct by comparing the input sample with the output. Since adversary samples are seeming inevitable for the current DNN framework, we introduce the list classification based on CAE to defend adversaries, which outputs several labels and the corresponding samples recovered by the CAE. The CAE is evaluated using the MNIST dataset in great detail. It is shown that the CAE network can recognize almost all outliers and the list classification contains the correct label for almost all adversaries.
    Cautiously Optimistic Policy Optimization and Exploration with Linear Function Approximation. (arXiv:2103.12923v2 [cs.LG] UPDATED)
    (2 min) Policy optimization methods are popular reinforcement learning algorithms, because their incremental and on-policy nature makes them more stable than the value-based counterparts. However, the same properties also make them slow to converge and sample inefficient, as the on-policy requirement precludes data reuse and the incremental updates couple large iteration complexity into the sample complexity. These characteristics have been observed in experiments as well as in theory in the recent work of~\citet{agarwal2020pc}, which provides a policy optimization method PCPG that can robustly find near optimal polices for approximately linear Markov decision processes but suffers from an extremely poor sample complexity compared with value-based techniques. In this paper, we propose a new algorithm, COPOE, that overcomes the sample complexity issue of PCPG while retaining its robustness to model misspecification. Compared with PCPG, COPOE makes several important algorithmic enhancements, such as enabling data reuse, and uses more refined analysis techniques, which we expect to be more broadly applicable to designing new reinforcement learning algorithms. The result is an improvement in sample complexity from $\widetilde{O}(1/\epsilon^{11})$ for PCPG to $\widetilde{O}(1/\epsilon^3)$ for PCPG, nearly bridging the gap with value-based techniques.
    Gym-ANM: Reinforcement Learning Environments for Active Network Management Tasks in Electricity Distribution Systems. (arXiv:2103.07932v2 [cs.LG] UPDATED)
    (2 min) Active network management (ANM) of electricity distribution networks include many complex stochastic sequential optimization problems. These problems need to be solved for integrating renewable energies and distributed storage into future electrical grids. In this work, we introduce Gym-ANM, a framework for designing reinforcement learning (RL) environments that model ANM tasks in electricity distribution networks. These environments provide new playgrounds for RL research in the management of electricity networks that do not require an extensive knowledge of the underlying dynamics of such systems. Along with this work, we are releasing an implementation of an introductory toy-environment, ANM6-Easy, designed to emphasize common challenges in ANM. We also show that state-of-the-art RL algorithms can already achieve good performance on ANM6-Easy when compared against a model predictive control (MPC) approach. Finally, we provide guidelines to create new Gym-ANM environments differing in terms of (a) the distribution network topology and parameters, (b) the observation space, (c) the modelling of the stochastic processes present in the system, and (d) a set of hyperparameters influencing the reward signal. Gym-ANM can be downloaded at https://github.com/robinhenry/gym-anm.
    MAGIC: Learning Macro-Actions for Online POMDP Planning. (arXiv:2011.03813v3 [cs.RO] UPDATED)
    (2 min) The partially observable Markov decision process (POMDP) is a principled general framework for robot decision making under uncertainty, but POMDP planning suffers from high computational complexity, when long-term planning is required. While temporally-extended macro-actions help to cut down the effective planning horizon and significantly improve computational efficiency, how do we acquire good macro-actions? This paper proposes Macro-Action Generator-Critic (MAGIC), which performs offline learning of macro-actions optimized for online POMDP planning. Specifically, MAGIC learns a macro-action generator end-to-end, using an online planner's performance as the feedback. During online planning, the generator generates on the fly situation-aware macro-actions conditioned on the robot's belief and the environment context. We evaluated MAGIC on several long-horizon planning tasks both in simulation and on a real robot. The experimental results show that the learned macro-actions offer significant benefits in online planning performance, compared with primitive actions and handcrafted macro-actions.
    COKE: Communication-Censored Decentralized Kernel Learning. (arXiv:2001.10133v2 [cs.LG] UPDATED)
    (2 min) This paper studies the decentralized optimization and learning problem where multiple interconnected agents aim to learn an optimal decision function defined over a reproducing kernel Hilbert space by jointly minimizing a global objective function, with access to their own locally observed dataset. As a non-parametric approach, kernel learning faces a major challenge in distributed implementation: the decision variables of local objective functions are data-dependent and thus cannot be optimized under the decentralized consensus framework without any raw data exchange among agents. To circumvent this major challenge, we leverage the random feature (RF) approximation approach to enable consensus on the function modeled in the RF space by data-independent parameters across different agents. We then design an iterative algorithm, termed DKLA, for fast-convergent implementation via ADMM. Based on DKLA, we further develop a communication-censored kernel learning (COKE) algorithm that reduces the communication load of DKLA by preventing an agent from transmitting at every iteration unless its local updates are deemed informative. Theoretical results in terms of linear convergence guarantee and generalization performance analysis of DKLA and COKE are provided. Comprehensive tests on both synthetic and real datasets are conducted to verify the communication efficiency and learning effectiveness of COKE.
    Relational VAE: A Continuous Latent Variable Model for Graph Structured Data. (arXiv:2106.16049v1 [cs.CE])
    (2 min) Graph Networks (GNs) enable the fusion of prior knowledge and relational reasoning with flexible function approximations. In this work, a general GN-based model is proposed which takes full advantage of the relational modeling capabilities of GNs and extends these to probabilistic modeling with Variational Bayes (VB). To that end, we combine complementary pre-existing approaches on VB for graph data and propose an approach that relies on graph-structured latent and conditioning variables. It is demonstrated that Neural Processes can also be viewed through the lens of the proposed model. We show applications on the problem of structured probability density modeling for simulated and real wind farm monitoring data, as well as on the meta-learning of simulated Gaussian Process data. We release the source code, along with the simulated datasets.
    An Inertial Newton Algorithm for Deep Learning. (arXiv:1905.12278v5 [cs.LG] UPDATED)
    (0 min) We introduce a new second-order inertial optimization method for machine learning called INNA. It exploits the geometry of the loss function while only requiring stochastic approximations of the function values and the generalized gradients. This makes INNA fully implementable and adapted to large-scale optimization problems such as the training of deep neural networks. The algorithm combines both gradient-descent and Newton-like behaviors as well as inertia. We prove the convergence of INNA for most deep learning problems. To do so, we provide a well-suited framework to analyze deep learning loss functions involving tame optimization in which we study a continuous dynamical system together with its discrete stochastic approximations. We prove sublinear convergence for the continuous-time differential inclusion which underlies our algorithm. Additionally, we also show how standard optimization mini-batch methods applied to non-smooth non-convex problems can yield a certain type of spurious stationary points never discussed before. We address this issue by providing a theoretical framework around the new idea of $D$-criticality; we then give a simple asymptotic analysis of INNA. Our algorithm allows for using an aggressive learning rate of $o(1/\log k)$. From an empirical viewpoint, we show that INNA returns competitive results with respect to state of the art (stochastic gradient descent, ADAGRAD, ADAM) on popular deep learning benchmark problems.
    Reinforcement Learning based Disease Progression Model for Alzheimer's Disease. (arXiv:2106.16187v1 [cs.LG])
    (2 min) We model Alzheimer's disease (AD) progression by combining differential equations (DEs) and reinforcement learning (RL) with domain knowledge. DEs provide relationships between some, but not all, factors relevant to AD. We assume that the missing relationships must satisfy general criteria about the working of the brain, for e.g., maximizing cognition while minimizing the cost of supporting cognition. This allows us to extract the missing relationships by using RL to optimize an objective (reward) function that captures the above criteria. We use our model consisting of DEs (as a simulator) and the trained RL agent to predict individualized 10-year AD progression using baseline (year 0) features on synthetic and real data. The model was comparable or better at predicting 10-year cognition trajectories than state-of-the-art learning-based models. Our interpretable model demonstrated, and provided insights into, "recovery/compensatory" processes that mitigate the effect of AD, even though those processes were not explicitly encoded in the model. Our framework combines DEs with RL for modelling AD progression and has broad applicability for understanding other neurological disorders.
    Improving Factual Consistency of Abstractive Summarization on Customer Feedback. (arXiv:2106.16188v1 [cs.CL])
    (0 min) E-commerce stores collect customer feedback to let sellers learn about customer concerns and enhance customer order experience. Because customer feedback often contains redundant information, a concise summary of the feedback can be generated to help sellers better understand the issues causing customer dissatisfaction. Previous state-of-the-art abstractive text summarization models make two major types of factual errors when producing summaries from customer feedback, which are wrong entity detection (WED) and incorrect product-defect description (IPD). In this work, we introduce a set of methods to enhance the factual consistency of abstractive summarization on customer feedback. We augment the training data with artificially corrupted summaries, and use them as counterparts of the target summaries. We add a contrastive loss term into the training objective so that the model learns to avoid certain factual errors. Evaluation results show that a large portion of WED and IPD errors are alleviated for BART and T5. Furthermore, our approaches do not depend on the structure of the summarization model and thus are generalizable to any abstractive summarization systems.
    Long Short-term Cognitive Networks. (arXiv:2106.16233v1 [cs.LG])
    (2 min) In this paper, we present a recurrent neural system named Long Short-term Cognitive Networks (LSTCNs) as a generalisation of the Short-term Cognitive Network (STCN) model. Such a generalisation is motivated by the difficulty of forecasting very long time series in an efficient, greener fashion. The LSTCN model can be defined as a collection of STCN blocks, each processing a specific time patch of the (multivariate) time series being modelled. In this neural ensemble, each block passes information to the subsequent one in the form of a weight matrix referred to as the prior knowledge matrix. As a second contribution, we propose a deterministic learning algorithm to compute the learnable weights while preserving the prior knowledge resulting from previous learning processes. As a third contribution, we introduce a feature influence score as a proxy to explain the forecasting process in multivariate time series. The simulations using three case studies show that our neural system reports small forecasting errors while being up to thousands of times faster than state-of-the-art recurrent models.
    Detecting Errors and Estimating Accuracy on Unlabeled Data with Self-training Ensembles. (arXiv:2106.15728v1 [cs.LG])
    (2 min) When a deep learning model is deployed in the wild, it can encounter test data drawn from distributions different from the training data distribution and suffer drop in performance. For safe deployment, it is essential to estimate the accuracy of the pre-trained model on the test data. However, the labels for the test inputs are usually not immediately available in practice, and obtaining them can be expensive. This observation leads to two challenging tasks: (1) unsupervised accuracy estimation, which aims to estimate the accuracy of a pre-trained classifier on a set of unlabeled test inputs; (2) error detection, which aims to identify mis-classified test inputs. In this paper, we propose a principled and practically effective framework that simultaneously addresses the two tasks. The proposed framework iteratively learns an ensemble of models to identify mis-classified data points and performs self-training to improve the ensemble with the identified points. Theoretical analysis demonstrates that our framework enjoys provable guarantees for both accuracy estimation and error detection under mild conditions readily satisfied by practical deep learning models. Along with the framework, we proposed and experimented with two instantiations and achieved state-of-the-art results on 59 tasks. For example, on iWildCam, one instantiation reduces the estimation error for unsupervised accuracy estimation by at least 70% and improves the F1 score for error detection by at least 4.7% compared to existing methods.
    Limited-Fronthaul Cell-Free Hybrid Beamforming with Distributed Deep Neural Network. (arXiv:2106.16194v1 [eess.SP])
    (2 min) Cell-free massive MIMO (CF-mMIMO) systems represent a promising approach to increase the spectral efficiency of wireless communication systems. However, near-optimal solutions require a large amount of signaling exchange between access points (APs) and the network controller (NC). In addition, the use of hybrid beamforming in each AP reduces the number of power hungry RF chains, but imposes a large computational complexity to find near-optimal precoders. In this letter, we propose two unsupervised deep neural networks (DNN) architectures, fully and partially distributed, that can perform coordinated hybrid beamforming with zero or limited communication overhead between APs and NC, while achieving near-optimal sum-rate with a reduced computational complexity compared to conventional near-optimal solutions.
    Asymptotically Optimal Information-Directed Sampling. (arXiv:2011.05944v3 [stat.ML] UPDATED)
    (2 min) We introduce a simple and efficient algorithm for stochastic linear bandits with finitely many actions that is asymptotically optimal and (nearly) worst-case optimal in finite time. The approach is based on the frequentist information-directed sampling (IDS) framework, with a surrogate for the information gain that is informed by the optimization problem that defines the asymptotic lower bound. Our analysis sheds light on how IDS balances the trade-off between regret and information and uncovers a surprising connection between the recently proposed primal-dual methods and the IDS algorithm. We demonstrate empirically that IDS is competitive with UCB in finite-time, and can be significantly better in the asymptotic regime.
    Invertible Manifold Learning for Dimension Reduction. (arXiv:2010.04012v2 [cs.LG] UPDATED)
    (2 min) Dimension reduction (DR) aims to learn low-dimensional representations of high-dimensional data with the preservation of essential information. In the context of manifold learning, we define that the representation after information-lossless DR preserves the topological and geometric properties of data manifolds formally, and propose a novel two-stage DR method, called invertible manifold learning (inv-ML) to bridge the gap between theoretical information-lossless and practical DR. The first stage includes a homeomorphic sparse coordinate transformation to learn low-dimensional representations without destroying topology and a local isometry constraint to preserve local geometry. In the second stage, a linear compression is implemented for the trade-off between the target dimension and the incurred information loss in excessive DR scenarios. Experiments are conducted on seven datasets with a neural network implementation of inv-ML, called i-ML-Enc. Empirically, i-ML-Enc achieves invertible DR in comparison with typical existing methods as well as reveals the characteristics of the learned manifolds. Through latent space interpolation on real-world datasets, we find that the reliability of tangent space approximated by the local neighborhood is the key to the success of manifold-based DR algorithms.
    Bayesian Joint Chance Constrained Optimization: Approximations and Statistical Consistency. (arXiv:2106.12199v2 [math.ST] CROSS LISTED)
    (2 min) This paper considers data-driven chance-constrained stochastic optimization problems in a Bayesian framework. Bayesian posteriors afford a principled mechanism to incorporate data and prior knowledge into stochastic optimization problems. However, the computation of Bayesian posteriors is typically an intractable problem, and has spawned a large literature on approximate Bayesian computation. Here, in the context of chance-constrained optimization, we focus on the question of statistical consistency (in an appropriate sense) of the optimal value, computed using an approximate posterior distribution. To this end, we rigorously prove a frequentist consistency result demonstrating the convergence of the optimal value to the optimal value of a fixed, parameterized constrained optimization problem. We augment this by also establishing a probabilistic rate of convergence of the optimal value. We also prove the convex feasibility of the approximate Bayesian stochastic optimization problem. Finally, we demonstrate the utility of our approach on an optimal staffing problem for an M/M/c queueing model.
    Diff2Dist: Learning Spectrally Distinct Edge Functions, with Applications to Cell Morphology Analysis. (arXiv:2106.15716v1 [cs.LG])
    (2 min) We present a method for learning "spectrally descriptive" edge weights for graphs. We generalize a previously known distance measure on graphs (Graph Diffusion Distance), thereby allowing it to be tuned to minimize an arbitrary loss function. Because all steps involved in calculating this modified GDD are differentiable, we demonstrate that it is possible for a small neural network model to learn edge weights which minimize loss. GDD alone does not effectively discriminate between graphs constructed from shoot apical meristem images of wild-type vs. mutant \emph{Arabidopsis thaliana} specimens. However, training edge weights and kernel parameters with contrastive loss produces a learned distance metric with large margins between these graph categories. We demonstrate this by showing improved performance of a simple k-nearest-neighbors classifier on the learned distance matrix. We also demonstrate a further application of this method to biological image analysis: once trained, we use our model to compute the distance between the biological graphs and a set of graphs output by a cell division simulator. This allows us to identify simulation parameter regimes which are similar to each class of graph in our original dataset.
    Protein-Ligand Docking Surrogate Models: A SARS-CoV-2 Benchmark for Deep Learning Accelerated Virtual Screening. (arXiv:2106.07036v2 [q-bio.BM] UPDATED)
    (2 min) We propose a benchmark to study surrogate model accuracy for protein-ligand docking. We share a dataset consisting of 200 million 3D complex structures and 2D structure scores across a consistent set of 13 million "in-stock" molecules over 15 receptors, or binding sites, across the SARS-CoV-2 proteome. Our work shows surrogate docking models have six orders of magnitude more throughput than standard docking protocols on the same supercomputer node types. We demonstrate the power of high-speed surrogate models by running each target against 1 billion molecules in under a day (50k predictions per GPU seconds). We showcase a workflow for docking utilizing surrogate ML models as a pre-filter. Our workflow is ten times faster at screening a library of compounds than the standard technique, with an error rate less than 0.01\% of detecting the underlying best scoring 0.1\% of compounds. Our analysis of the speedup explains that to screen more molecules under a docking paradigm, another order of magnitude speedup must come from model accuracy rather than computing speed (which, if increased, will not anymore alter our throughput to screen molecules). We believe this is strong evidence for the community to begin focusing on improving the accuracy of surrogate models to improve the ability to screen massive compound libraries 100x or even 1000x faster than current techniques.
    A Unified View of Stochastic Hamiltonian Sampling. (arXiv:2106.16200v1 [cs.LG])
    (2 min) In this work, we revisit the theoretical properties of Hamiltonian stochastic differential equations (SDEs) for Bayesian posterior sampling, and we study the two types of errors that arise from numerical SDE simulation: the discretization error and the error due to noisy gradient estimates in the context of data subsampling. We consider overlooked results describing the ergodic convergence rates of numerical integration schemes, and we produce a novel analysis for the effect of mini-batches through the lens of differential operator splitting. In our analysis, the stochastic component of the proposed Hamiltonian SDE is decoupled from the gradient noise, for which we make no normality assumptions. This allows us to derive interesting connections among different sampling schemes, including the original Hamiltonian Monte Carlo (HMC) algorithm, and explain their performance. We show that for a careful selection of numerical integrators, both errors vanish at a rate $\mathcal{O}(\eta^2)$, where $\eta$ is the integrator step size. Our theoretical results are supported by an empirical study on a variety of regression and classification tasks for Bayesian neural networks.
    On the Power of Saturated Transformers: A View from Circuit Complexity. (arXiv:2106.16213v1 [cs.CL])
    (2 min) Transformers have become a standard architecture for many NLP problems. This has motivated theoretically analyzing their capabilities as models of language, in order to understand what makes them successful, and what their potential weaknesses might be. Recent work has shown that transformers with hard attention are quite limited in capacity, and in fact can be simulated by constant-depth circuits. However, hard attention is a restrictive assumption, which may complicate the relevance of these results for practical transformers. In this work, we analyze the circuit complexity of transformers with saturated attention: a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers. We show that saturated transformers transcend the limitations of hard-attention transformers. With some minor assumptions, we prove that the number of bits needed to represent a saturated transformer memory vector is $O(\log n)$, which implies saturated transformers can be simulated by log-depth circuits. Thus, the jump from hard to saturated attention can be understood as increasing the transformer's effective circuit depth by a factor of $O(\log n)$.
    Advantages and Bottlenecks of Quantum Machine Learning for Remote Sensing. (arXiv:2101.10657v3 [quant-ph] UPDATED)
    (2 min) This concept paper aims to provide a brief outline of quantum computers, explore existing methods of quantum image classification techniques, so focusing on remote sensing applications, and discuss the bottlenecks of performing these algorithms on currently available open source platforms. Initial results demonstrate feasibility. Next steps include expanding the size of the quantum hidden layer and increasing the variety of output image options.
    SaRoCo: Detecting Satire in a Novel Romanian Corpus of News Articles. (arXiv:2105.06456v3 [cs.CL] UPDATED)
    (2 min) In this work, we introduce a corpus for satire detection in Romanian news. We gathered 55,608 public news articles from multiple real and satirical news sources, composing one of the largest corpora for satire detection regardless of language and the only one for the Romanian language. We provide an official split of the text samples, such that training news articles belong to different sources than test news articles, thus ensuring that models do not achieve high performance simply due to overfitting. We conduct experiments with two state-of-the-art deep neural models, resulting in a set of strong baselines for our novel corpus. Our results show that the machine-level accuracy for satire detection in Romanian is quite low (under 73% on the test set) compared to the human-level accuracy (87%), leaving enough room for improvement in future research.
    Universal Regular Conditional Distributions. (arXiv:2105.07743v2 [cs.LG] UPDATED)
    (2 min) We introduce a general framework for approximating regular conditional distributions (RCDs). Our approximations of these RCDs are implemented by a new class of geometric deep learning models with inputs in $\mathbb{R}^d$ and outputs in the Wasserstein-$1$ space $\mathcal{P}_1(\mathbb{R}^D)$. We find that the models built using our framework can approximate any continuous functions from $\mathbb{R}^d$ to $\mathcal{P}_1(\mathbb{R}^D)$ uniformly on compacts, and quantitative rates are obtained. We identify two methods for avoiding the "curse of dimensionality"; i.e.: the number of parameters determining the approximating neural network depends only polynomially on the involved dimension and the approximation error. The first solution describes functions in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$ which can be efficiently approximated on any compact subset of $\mathbb{R}^d$. Conversely, the second approach describes sets in $\mathbb{R}^d$, on which any function in $C(\mathbb{R}^d,\mathcal{P}_1(\mathbb{R}^D))$ can be efficiently approximated. Our framework is used to obtain an affirmative answer to the open conjecture of Bishop (1994); namely: mixture density networks are universal regular conditional distributions. The predictive performance of the proposed models is evaluated against comparable learning models on various probabilistic predictions tasks in the context of ELMs, model uncertainty, and heteroscedastic regression. All the results are obtained for more general input and output spaces and thus apply to geometric deep learning contexts.
    SimNet: Enabling Robust Unknown Object Manipulation from Pure Synthetic Data via Stereo. (arXiv:2106.16118v1 [cs.RO])
    (2 min) Robot manipulation of unknown objects in unstructured environments is a challenging problem due to the variety of shapes, materials, arrangements and lighting conditions. Even with large-scale real-world data collection, robust perception and manipulation of transparent and reflective objects across various lighting conditions remain challenging. To address these challenges we propose an approach to performing sim-to-real transfer of robotic perception. The underlying model, SimNet, is trained as a single multi-headed neural network using simulated stereo data as input and simulated object segmentation masks, 3D oriented bounding boxes (OBBs), object keypoints, and disparity as output. A key component of SimNet is the incorporation of a learned stereo sub-network that predicts disparity. SimNet is evaluated on 2D car detection, unknown object detection, and deformable object keypoint detection and significantly outperforms a baseline that uses a structured light RGB-D sensor. By inferring grasp positions using the OBB and keypoint predictions, SimNet can be used to perform end-to-end manipulation of unknown objects in both easy and hard scenarios using our fleet of Toyota HSR robots in four home environments. In unknown object grasping experiments, the predictions from the baseline RGB-D network and SimNet enable successful grasps of most of the easy objects. However, the RGB-D baseline only grasps 35% of the hard (e.g., transparent) objects, while SimNet grasps 95%, suggesting that SimNet can enable robust manipulation of unknown objects, including transparent objects, in unknown environments.
    Generative Adversarial Networks for Spatio-temporal Data: A Survey. (arXiv:2008.08903v3 [cs.LG] UPDATED)
    (2 min) Generative Adversarial Networks (GANs) have shown remarkable success in producing realistic-looking images in the computer vision area. Recently, GAN-based techniques are shown to be promising for spatio-temporal-based applications such as trajectory prediction, events generation and time-series data imputation. While several reviews for GANs in computer vision have been presented, no one has considered addressing the practical applications and challenges relevant to spatio-temporal data. In this paper, we have conducted a comprehensive review of the recent developments of GANs for spatio-temporal data. We summarise the application of popular GAN architectures for spatio-temporal data and the common practices for evaluating the performance of spatio-temporal applications with GANs. Finally, we point out future research directions to benefit researchers in this area.
    Monte Carlo Variational Auto-Encoders. (arXiv:2106.15921v1 [stat.ML])
    (2 min) Variational auto-encoders (VAE) are popular deep latent variable models which are trained by maximizing an Evidence Lower Bound (ELBO). To obtain tighter ELBO and hence better variational approximations, it has been proposed to use importance sampling to get a lower variance estimate of the evidence. However, importance sampling is known to perform poorly in high dimensions. While it has been suggested many times in the literature to use more sophisticated algorithms such as Annealed Importance Sampling (AIS) and its Sequential Importance Sampling (SIS) extensions, the potential benefits brought by these advanced techniques have never been realized for VAE: the AIS estimate cannot be easily differentiated, while SIS requires the specification of carefully chosen backward Markov kernels. In this paper, we address both issues and demonstrate the performance of the resulting Monte Carlo VAEs on a variety of applications.
    Linear-Mapping based Variational Ensemble Kalman Filter. (arXiv:2103.06315v3 [math.NA] UPDATED)
    (2 min) We propose a linear-mapping based variational Ensemble Kalman filter for sequential Bayesian filtering problems with generic observation models. Specifically, the proposed method is formulated as to construct a linear mapping from the prior ensemble to the posterior one, and the linear mapping is computed via a variational Bayesian formulation, i.e., by minimizing the Kullback-Leibler divergence between the transformed distribution by the linear mapping and the actual posterior. A gradient descent scheme is proposed to solve the resulting optimization problem. With numerical examples we demonstrate that the method has competitive performance against existing methods.
    Likelihoods and Parameter Priors for Bayesian Networks. (arXiv:2105.06241v2 [cs.LG] UPDATED)
    (2 min) We develop simple methods for constructing likelihoods and parameter priors for learning about the parameters and structure of a Bayesian network. In particular, we introduce several assumptions that permit the construction of likelihoods and parameter priors for a large number of Bayesian-network structures from a small set of assessments. The most notable assumption is that of likelihood equivalence, which says that data can not help to discriminate network structures that encode the same assertions of conditional independence. We describe the constructions that follow from these assumptions, and also present a method for directly computing the marginal likelihood of a random sample with no missing observations. Also, we show how these assumptions lead to a general framework for characterizing parameter priors of multivariate distributions.
    Operator-valued formulas for Riemannian Gradient and Hessian and families of tractable metrics. (arXiv:2009.10159v2 [math.OC] UPDATED)
    (2 min) We provide an explicit formula for the Levi-Civita connection and Riemannian Hessian for a Riemannian manifold that is a quotient of a manifold embedded in an inner product space with a non-constant metric function. Together with a classical formula for projection, this allows us to evaluate Riemannian gradient and Hessian for several families of metrics on classical manifolds, including a family of metrics on Stiefel manifolds connecting both the constant and canonical ambient metrics with closed-form geodesics. Using these formulas, we derive Riemannian optimization frameworks on quotients of Stiefel manifolds, including flag manifolds, and a new family of complete quotient metrics on the manifold of positive-semidefinite matrices of fixed rank, considered as a quotient of a product of Stiefel and positive-definite matrix manifold with affine-invariant metrics. The method is procedural, and in many instances, the Riemannian gradient and Hessian formulas could be derived by symbolic calculus. The method extends the list of potential metrics that could be used in manifold optimization and machine learning.
    PSD Representations for Effective Probability Models. (arXiv:2106.16116v1 [cs.LG])
    (2 min) Finding a good way to model probability densities is key to probabilistic inference. An ideal model should be able to concisely approximate any probability, while being also compatible with two main operations: multiplications of two models (product rule) and marginalization with respect to a subset of the random variables (sum rule). In this work, we show that a recently proposed class of positive semi-definite (PSD) models for non-negative functions is particularly suited to this end. In particular, we characterize both approximation and generalization capabilities of PSD models, showing that they enjoy strong theoretical guarantees. Moreover, we show that we can perform efficiently both sum and product rule in closed form via matrix operations, enjoying the same versatility of mixture models. Our results open the way to applications of PSD models to density estimation, decision theory and inference. Preliminary empirical evaluation supports our findings.
    Partially-Connected Differentiable Architecture Search for Deepfake and Spoofing Detection. (arXiv:2104.03123v2 [cs.LG] UPDATED)
    (2 min) This paper reports the first successful application of a differentiable architecture search (DARTS) approach to the deepfake and spoofing detection problems. An example of neural architecture search, DARTS operates upon a continuous, differentiable search space which enables both the architecture and parameters to be optimised via gradient descent. Solutions based on partially-connected DARTS use random channel masking in the search space to reduce GPU time and automatically learn and optimise complex neural architectures composed of convolutional operations and residual blocks. Despite being learned quickly with little human effort, the resulting networks are competitive with the best performing systems reported in the literature. Some are also far less complex, containing 85% fewer parameters than a Res2Net competitor.
    Application of deep reinforcement learning for Indian stock trading automation. (arXiv:2106.16088v1 [q-fin.TR])
    (0 min) In stock trading, feature extraction and trading strategy design are the two important tasks to achieve long-term benefits using machine learning techniques. Several methods have been proposed to design trading strategy by acquiring trading signals to maximize the rewards. In the present paper the theory of deep reinforcement learning is applied for stock trading strategy and investment decisions to Indian markets. The experiments are performed systematically with three classical Deep Reinforcement Learning models Deep Q-Network, Double Deep Q-Network and Dueling Double Deep Q-Network on ten Indian stock datasets. The performance of the models are evaluated and comparison is made.
    Small in-distribution changes in 3D perspective and lighting fool both CNNs and Transformers. (arXiv:2106.16198v1 [cs.CV])
    (2 min) Neural networks are susceptible to small transformations including 2D rotations and shifts, image crops, and even changes in object colors. This is often attributed to biases in the training dataset, and the lack of 2D shift-invariance due to not respecting the sampling theorem. In this paper, we challenge this hypothesis by training and testing on unbiased datasets, and showing that networks are brittle to both small 3D perspective changes and lighting variations which cannot be explained by dataset bias or lack of shift-invariance. To find these in-distribution errors, we introduce an evolution strategies (ES) based approach, which we call CMA-Search. Despite training with a large-scale (0.5 million images), unbiased dataset of camera and light variations, in over 71% cases CMA-Search can find camera parameters in the vicinity of a correctly classified image which lead to in-distribution misclassifications with < 3.6% change in parameters. With lighting changes, CMA-Search finds misclassifications in 33% cases with < 11.6% change in parameters. Finally, we extend this method to find misclassifications in the vicinity of ImageNet images for both ResNet and OpenAI's CLIP model.
    Exponential Weights Algorithms for Selective Learning. (arXiv:2106.15662v1 [cs.LG])
    (2 min) We study the selective learning problem introduced by Qiao and Valiant (2019), in which the learner observes $n$ labeled data points one at a time. At a time of its choosing, the learner selects a window length $w$ and a model $\hat\ell$ from the model class $\mathcal{L}$, and then labels the next $w$ data points using $\hat\ell$. The excess risk incurred by the learner is defined as the difference between the average loss of $\hat\ell$ over those $w$ data points and the smallest possible average loss among all models in $\mathcal{L}$ over those $w$ data points. We give an improved algorithm, termed the hybrid exponential weights algorithm, that achieves an expected excess risk of $O((\log\log|\mathcal{L}| + \log\log n)/\log n)$. This result gives a doubly exponential improvement in the dependence on $|\mathcal{L}|$ over the best known bound of $O(\sqrt{|\mathcal{L}|/\log n})$. We complement the positive result with an almost matching lower bound, which suggests the worst-case optimality of the algorithm. We also study a more restrictive family of learning algorithms that are bounded-recall in the sense that when a prediction window of length $w$ is chosen, the learner's decision only depends on the most recent $w$ data points. We analyze an exponential weights variant of the ERM algorithm in Qiao and Valiant (2019). This new algorithm achieves an expected excess risk of $O(\sqrt{\log |\mathcal{L}|/\log n})$, which is shown to be nearly optimal among all bounded-recall learners. Our analysis builds on a generalized version of the selective mean prediction problem in Drucker (2013); Qiao and Valiant (2019), which may be of independent interest.
    Grey-box models for wave loading prediction. (arXiv:2105.13813v2 [cs.LG] UPDATED)
    (2 min) The quantification of wave loading on offshore structures and components is a crucial element in the assessment of their useful remaining life. In many applications the well-known Morison's equation is employed to estimate the forcing from waves with assumed particle velocities and accelerations. This paper develops a grey-box modelling approach to improve the predictions of the force on structural members. A grey-box model intends to exploit the enhanced predictive capabilities of data-based modelling whilst retaining physical insight into the behaviour of the system; in the context of the work carried out here, this can be considered as physics-informed machine learning. There are a number of possible approaches to establish a grey-box model. This paper demonstrates two means of combining physics (white box) and data-based (black box) components; one where the model is a simple summation of the two components, the second where the white-box prediction is fed into the black box as an additional input. Here Morison's equation is used as the physics-based component in combination with a data-based Gaussian process NARX - a dynamic variant of the more well-known Gaussian process regression. Two key challenges with employing the GP-NARX formulation that are addressed here are the selection of appropriate lag terms and the proper treatment of uncertainty propagation within the dynamic GP. The best performing grey-box model, the residual modelling GP-NARX, was able to achieve a 29.13\% and 5.48\% relative reduction in NMSE over Morison's Equation and a black-box GP-NARX respectively, alongside significant benefits in extrapolative capabilities of the model, in circumstances of low dataset coverage.
    FROCC: Fast Random projection-based One-Class Classification. (arXiv:2011.14317v3 [cs.LG] UPDATED)
    (2 min) We present Fast Random projection-based One-Class Classification (FROCC), an extremely efficient method for one-class classification. Our method is based on a simple idea of transforming the training data by projecting it onto a set of random unit vectors that are chosen uniformly and independently from the unit sphere, and bounding the regions based on separation of the data. FROCC can be naturally extended with kernels. We theoretically prove that FROCC generalizes well in the sense that it is stable and has low bias. FROCC achieves up to 3.1 percent points better ROC, with 1.2--67.8x speedup in training and test times over a range of state-of-the-art benchmarks including the SVM and the deep learning based models for the OCC task.
    Opening Deep Neural Networks with Generative Models. (arXiv:2105.10013v3 [cs.CV] UPDATED)
    (2 min) Image classification methods are usually trained to perform predictions taking into account a predefined group of known classes. Real-world problems, however, may not allow for a full knowledge of the input and label spaces, making failures in recognition a hazard to deep visual learning. Open set recognition methods are characterized by the ability to correctly identify inputs of known and unknown classes. In this context, we propose GeMOS: simple and plug-and-play open set recognition modules that can be attached to pretrained Deep Neural Networks for visual recognition. The GeMOS framework pairs pre-trained Convolutional Neural Networks with generative models for open set recognition to extract open set scores for each sample, allowing for failure recognition in object recognition tasks. We conduct a thorough evaluation of the proposed method in comparison with state-of-the-art open set algorithms, finding that GeMOS either outperforms or is statistically indistinguishable from more complex and costly models.
    Probabilistic Graphical Models and Tensor Networks: A Hybrid Framework. (arXiv:2106.15666v1 [stat.ML])
    (2 min) We investigate a correspondence between two formalisms for discrete probabilistic modeling: probabilistic graphical models (PGMs) and tensor networks (TNs), a powerful modeling framework for simulating complex quantum systems. The graphical calculus of PGMs and TNs exhibits many similarities, with discrete undirected graphical models (UGMs) being a special case of TNs. However, more general probabilistic TN models such as Born machines (BMs) employ complex-valued hidden states to produce novel forms of correlation among the probabilities. While representing a new modeling resource for capturing structure in discrete probability distributions, this behavior also renders the direct application of standard PGM tools impossible. We aim to bridge this gap by introducing a hybrid PGM-TN formalism that integrates quantum-like correlations into PGM models in a principled manner, using the physically-motivated concept of decoherence. We first prove that applying decoherence to the entirety of a BM model converts it into a discrete UGM, and conversely, that any subgraph of a discrete UGM can be represented as a decohered BM. This method allows a broad family of probabilistic TN models to be encoded as partially decohered BMs, a fact we leverage to combine the representational strengths of both model families. We experimentally verify the performance of such hybrid models in a sequential modeling task, and identify promising uses of our method within the context of existing applications of graphical models.
    Learning to Minimize Age of Information over an Unreliable Channel with Energy Harvesting. (arXiv:2106.16037v1 [cs.IT])
    (2 min) The time average expected age of information (AoI) is studied for status updates sent over an error-prone channel from an energy-harvesting transmitter with a finite-capacity battery. Energy cost of sensing new status updates is taken into account as well as the transmission energy cost better capturing practical systems. The optimal scheduling policy is first studied under the hybrid automatic repeat request (HARQ) protocol when the channel and energy harvesting statistics are known, and the existence of a threshold-based optimal policy is shown. For the case of unknown environments, average-cost reinforcement-learning algorithms are proposed that learn the system parameters and the status update policy in real-time. The effectiveness of the proposed methods is demonstrated through numerical results.
    Dual Aspect Self-Attention based on Transformer for Remaining Useful Life Prediction. (arXiv:2106.15842v1 [eess.SP])
    (2 min) Remaining useful life prediction (RUL) is one of the key technologies of condition-based maintenance, which is important to maintain the reliability and safety of industrial equipments. While deep learning has achieved great success in RUL prediction, existing methods have difficulties in processing long sequences and extracting information from the sensor and time step aspects. In this paper, we propose Dual Aspect Self-attention based on Transformer (DAST), a novel deep RUL prediction method. DAST consists of two encoders, which work in parallel to simultaneously extract features of different sensors and time steps. Solely based on self-attention, the DAST encoders are more effective in processing long data sequences, and are capable of adaptively learning to focus on more important parts of input. Moreover, the parallel feature extraction design avoids mutual influence of information from two aspects. Experimental results on two real turbofan engine datasets show that our method significantly outperforms state-of-the-art methods.
    GraphFM: Graph Factorization Machines for Feature Interaction Modeling. (arXiv:2105.11866v2 [cs.LG] UPDATED)
    (2 min) Factorization machine (FM) is a prevalent approach to modeling pairwise (second-order) feature interactions when dealing with high-dimensional sparse data. However, on the one hand, FM fails to capture higher-order feature interactions suffering from combinatorial expansion, on the other hand, taking into account interaction between every pair of features may introduce noise and degrade prediction accuracy. To solve the problems, we propose a novel approach Graph Factorization Machine (GraphFM) by naturally representing features in the graph structure. In particular, a novel mechanism is designed to select the beneficial feature interactions and formulate them as edges between features. Then our proposed model which integrates the interaction function of FM into the feature aggregation strategy of Graph Neural Network (GNN), can model arbitrary-order feature interactions on the graph-structured features by stacking layers. Experimental results on several real-world datasets has demonstrated the rationality and effectiveness of our proposed approach.
    A Generative Model for Raw Audio Using Transformer Architectures. (arXiv:2106.16036v1 [cs.SD])
    (2 min) This paper proposes a novel way of doing audio synthesis at the waveform level using Transformer architectures. We propose a deep neural network for generating waveforms, similar to wavenet \cite{oord2016wavenet}. This is fully probabilistic, auto-regressive, and causal, i.e. each sample generated depends only on the previously observed samples. Our approach outperforms a widely used wavenet architecture by up to 9\% on a similar dataset for predicting the next step. Using the attention mechanism, we enable the architecture to learn which audio samples are important for the prediction of the future sample. We show how causal transformer generative models can be used for raw waveform synthesis. We also show that this performance can be improved by another 2\% by conditioning samples over a wider context. The flexibility of the current model to synthesize audio from latent representations suggests a large number of potential applications. The novel approach of using generative transformer architectures for raw audio synthesis is, however, still far away from generating any meaningful music, without using latent codes/meta-data to aid the generation process.
    Cyclist Trajectory Forecasts by Incorporation of Multi-View Video Information. (arXiv:2106.15991v1 [cs.CV])
    (2 min) This article presents a novel approach to incorporate visual cues from video-data from a wide-angle stereo camera system mounted at an urban intersection into the forecast of cyclist trajectories. We extract features from image and optical flow (OF) sequences using 3D convolutional neural networks (3D-ConvNet) and combine them with features extracted from the cyclist's past trajectory to forecast future cyclist positions. By the use of additional information, we are able to improve positional accuracy by about 7.5 % for our test dataset and by up to 22 % for specific motion types compared to a method solely based on past trajectories. Furthermore, we compare the use of image sequences to the use of OF sequences as additional information, showing that OF alone leads to significant improvements in positional accuracy. By training and testing our methods using a real-world dataset recorded at a heavily frequented public intersection and evaluating the methods' runtimes, we demonstrate the applicability in real traffic scenarios. Our code and parts of our dataset are made publicly available.
    On the Landscape of One-hidden-layer Sparse Networks and Beyond. (arXiv:2009.07439v3 [cs.LG] UPDATED)
    (2 min) Sparse neural networks have received increasing interests due to their small size compared to dense networks. Nevertheless, most existing works on neural network theory have focused on dense neural networks, and our understanding of sparse networks is very limited. In this paper, we study the loss landscape of one-hidden-layer sparse networks. We first consider sparse networks with linear activations. We show that sparse linear networks can have spurious strict minima, which is in sharp contrast to dense linear networks which do not even have spurious minima. Second, we show that spurious valleys can exist for wide sparse non-linear networks. This is different from wide dense networks which do not have spurious valleys under mild assumptions.
    Learning from Informants: Relations between Learning Success Criteria. (arXiv:1801.10502v5 [cs.FL] UPDATED)
    (2 min) Learning from positive and negative information, so-called \emph{informants}, being one of the models for human and machine learning introduced by E.~M.~Gold, is investigated. Particularly, naturally arising questions about this learning setting, originating in results on learning from solely positive information, are answered. By a carefully arranged argument learners can be assumed to only change their hypothesis in case it is inconsistent with the data (such a learning behavior is called \emph{conservative}). The deduced main theorem states the relations between the most important delayable learning success criteria, being the ones not ruined by a delayed in time hypothesis output. Additionally, our investigations concerning the non-delayable requirement of consistent learning underpin the claim for \emph{delayability} being the right structural property to gain a deeper understanding concerning the nature of learning success criteria. Moreover, we obtain an anomalous \emph{hierarchy} when allowing for an increasing finite number of \emph{anomalies} of the hypothesized language by the learner compared with the language to be learned. In contrast to the vacillatory hierarchy for learning from solely positive information, we observe a \emph{duality} depending on whether infinitely many \emph{vacillations} between different (almost) correct hypotheses are still considered a successful learning behavior.
    Parameter Priors for Directed Acyclic Graphical Models and the Characterization of Several Probability Distributions. (arXiv:2105.03248v2 [stat.ML] UPDATED)
    (2 min) We develop simple methods for constructing parameter priors for model choice among Directed Acyclic Graphical (DAG) models. In particular, we introduce several assumptions that permit the construction of parameter priors for a large number of DAG models from a small set of assessments. We then present a method for directly computing the marginal likelihood of every DAG model given a random sample with no missing observations. We apply this methodology to Gaussian DAG models which consist of a recursive set of linear regression models. We show that the only parameter prior for complete Gaussian DAG models that satisfies our assumptions is the normal-Wishart distribution. Our analysis is based on the following new characterization of the Wishart distribution: let $W$ be an $n \times n$, $n \ge 3$, positive-definite symmetric matrix of random variables and $f(W)$ be a pdf of $W$. Then, f$(W)$ is a Wishart distribution if and only if $W_{11} - W_{12} W_{22}^{-1} W'_{12}$ is independent of $\{W_{12},W_{22}\}$ for every block partitioning $W_{11},W_{12}, W'_{12}, W_{22}$ of $W$. Similar characterizations of the normal and normal-Wishart distributions are provided as well.
    Understanding and Improving Early Stopping for Learning with Noisy Labels. (arXiv:2106.15853v1 [cs.LG])
    (2 min) The memorization effect of deep neural network (DNN) plays a pivotal role in many state-of-the-art label-noise learning methods. To exploit this property, the early stopping trick, which stops the optimization at the early stage of training, is usually adopted. Current methods generally decide the early stopping point by considering a DNN as a whole. However, a DNN can be considered as a composition of a series of layers, and we find that the latter layers in a DNN are much more sensitive to label noise, while their former counterparts are quite robust. Therefore, selecting a stopping point for the whole network may make different DNN layers antagonistically affected each other, thus degrading the final performance. In this paper, we propose to separate a DNN into different parts and progressively train them to address this problem. Instead of the early stopping, which trains a whole DNN all at once, we initially train former DNN layers by optimizing the DNN with a relatively large number of epochs. During training, we progressively train the latter DNN layers by using a smaller number of epochs with the preceding layers fixed to counteract the impact of noisy labels. We term the proposed method as progressive early stopping (PES). Despite its simplicity, compared with the early stopping, PES can help to obtain more promising and stable results. Furthermore, by combining PES with existing approaches on noisy label training, we achieve state-of-the-art performance on image classification benchmarks.
    GPyTorch: Blackbox Matrix-Matrix Gaussian Process Inference with GPU Acceleration. (arXiv:1809.11165v6 [cs.LG] UPDATED)
    (2 min) Despite advances in scalable models, the inference tools used for Gaussian processes (GPs) have yet to fully capitalize on developments in computing hardware. We present an efficient and general approach to GP inference based on Blackbox Matrix-Matrix multiplication (BBMM). BBMM inference uses a modified batched version of the conjugate gradients algorithm to derive all terms for training and inference in a single call. BBMM reduces the asymptotic complexity of exact GP inference from $O(n^3)$ to $O(n^2)$. Adapting this algorithm to scalable approximations and complex GP models simply requires a routine for efficient matrix-matrix multiplication with the kernel and its derivative. In addition, BBMM uses a specialized preconditioner to substantially speed up convergence. In experiments we show that BBMM effectively uses GPU hardware to dramatically accelerate both exact GP inference and scalable approximations. Additionally, we provide GPyTorch, a software platform for scalable GP inference via BBMM, built on PyTorch.
    Fixed points of monotonic and (weakly) scalable neural networks. (arXiv:2106.16239v1 [stat.ML])
    (2 min) We derive conditions for the existence of fixed points of neural networks, an important research objective to understand their behavior in modern applications involving autoencoders and loop unrolling techniques, among others. In particular, we focus on networks with nonnegative inputs and nonnegative network parameters, as often considered in the literature. We show that such networks can be recognized as monotonic and (weakly) scalable functions within the framework of nonlinear Perron-Frobenius theory. This fact enables us to derive conditions for the existence of a nonempty fixed point set of the neural networks, and these conditions are weaker than those obtained recently using arguments in convex analysis, which are typically based on the assumption of nonexpansivity of the activation functions. Furthermore, we prove that the shape of the fixed point set of monotonic and weakly scalable neural networks is often an interval, which degenerates to a point for the case of scalable networks. The chief results of this paper are verified in numerical simulations, where we consider an autoencoder-type network that first compresses angular power spectra in massive MIMO systems, and, second, reconstruct the input spectra from the compressed signal.
    Reducing Representation Drift in Online Continual Learning. (arXiv:2104.05025v2 [cs.LG] UPDATED)
    (2 min) In the online continual learning paradigm, agents must learn from a changing distribution while respecting memory and compute constraints. Previous work in this setting often tries to reduce catastrophic forgetting by limiting changes in the space of model parameters. In this work we instead focus on the change in representations of observed data that arises when previously unobserved classes appear in the incoming data stream, and new classes must be distinguished from previous ones. Starting from a popular approach, experience replay, we consider metric learning based loss functions which, when adjusted to appropriately select negative samples, can effectively nudge the learned representations to be more robust to new future classes. We show that this selection of negatives is in fact critical for reducing representation drift of previously observed data. Motivated by this we further introduce a simple adjustment to the standard cross entropy loss used in prior experience replay that achieves similar effect. Our approach directly improves the performance of experience replay for this setting, obtaining state-of-the-art results on several existing benchmarks in online continual learning, while remaining efficient in both memory and compute. We release an implementation of our experiments at https://github.com/naderAsadi/AML
    Interventional Assays for the Latent Space of Autoencoders. (arXiv:2106.16091v1 [cs.LG])
    (2 min) The encoders and decoders of autoencoders effectively project the input onto learned manifolds in the latent space and data space respectively. We propose a framework, called latent responses, for probing the learned data manifold using interventions in the latent space. Using this framework, we investigate "holes" in the representation to quantitatively ascertain to what extent the latent space of a trained VAE is consistent with the chosen prior. Furthermore, we use the identified structure to improve interpolation between latent vectors. We evaluate how our analyses improve the quality of the generated samples using the VAE on a variety of benchmark datasets.
    Improving the Efficiency of Transformers for Resource-Constrained Devices. (arXiv:2106.16006v1 [cs.LG])
    (2 min) Transformers provide promising accuracy and have become popular and used in various domains such as natural language processing and computer vision. However, due to their massive number of model parameters, memory and computation requirements, they are not suitable for resource-constrained low-power devices. Even with high-performance and specialized devices, the memory bandwidth can become a performance-limiting bottleneck. In this paper, we present a performance analysis of state-of-the-art vision transformers on several devices. We propose to reduce the overall memory footprint and memory transfers by clustering the model parameters. We show that by using only 64 clusters to represent model parameters, it is possible to reduce the data transfer from the main memory by more than 4x, achieve up to 22% speedup and 39% energy savings on mobile devices with less than 0.1% accuracy loss.
    Multiagent Deep Reinforcement Learning: Challenges and Directions Towards Human-Like Approaches. (arXiv:2106.15691v1 [cs.LG])
    (2 min) This paper surveys the field of multiagent deep reinforcement learning. The combination of deep neural networks with reinforcement learning has gained increased traction in recent years and is slowly shifting the focus from single-agent to multiagent environments. Dealing with multiple agents is inherently more complex as (a) the future rewards depend on the joint actions of multiple players and (b) the computational complexity of functions increases. We present the most common multiagent problem representations and their main challenges, and identify five research areas that address one or more of these challenges: centralised training and decentralised execution, opponent modelling, communication, efficient coordination, and reward shaping. We find that many computational studies rely on unrealistic assumptions or are not generalisable to other settings; they struggle to overcome the curse of dimensionality or nonstationarity. Approaches from psychology and sociology capture promising relevant behaviours such as communication and coordination. We suggest that, for multiagent reinforcement learning to be successful, future research addresses these challenges with an interdisciplinary approach to open up new possibilities for more human-oriented solutions in multiagent reinforcement learning.
    Nearly-Tight and Oblivious Algorithms for Explainable Clustering. (arXiv:2106.16147v1 [cs.DS])
    (2 min) We study the problem of explainable clustering in the setting first formalized by Moshkovitz, Dasgupta, Rashtchian, and Frost (ICML 2020). A $k$-clustering is said to be explainable if it is given by a decision tree where each internal node splits data points with a threshold cut in a single dimension (feature), and each of the $k$ leaves corresponds to a cluster. We give an algorithm that outputs an explainable clustering that loses at most a factor of $O(\log^2 k)$ compared to an optimal (not necessarily explainable) clustering for the $k$-medians objective, and a factor of $O(k \log^2 k)$ for the $k$-means objective. This improves over the previous best upper bounds of $O(k)$ and $O(k^2)$, respectively, and nearly matches the previous $\Omega(\log k)$ lower bound for $k$-medians and our new $\Omega(k)$ lower bound for $k$-means. The algorithm is remarkably simple. In particular, given an initial not necessarily explainable clustering in $\mathbb{R}^d$, it is oblivious to the data points and runs in time $O(dk \log^2 k)$, independent of the number of data points $n$. Our upper and lower bounds also generalize to objectives given by higher $\ell_p$-norms.
    Latent Space Model for Higher-order Networks and Generalized Tensor Decomposition. (arXiv:2106.16042v1 [cs.LG])
    (2 min) We introduce a unified framework, formulated as general latent space models, to study complex higher-order network interactions among multiple entities. Our framework covers several popular models in recent network analysis literature, including mixture multi-layer latent space model and hypergraph latent space model. We formulate the relationship between the latent positions and the observed data via a generalized multilinear kernel as the link function. While our model enjoys decent generality, its maximum likelihood parameter estimation is also convenient via a generalized tensor decomposition procedure.We propose a novel algorithm using projected gradient descent on Grassmannians. We also develop original theoretical guarantees for our algorithm. First, we show its linear convergence under mild conditions. Second, we establish finite-sample statistical error rates of latent position estimation, determined by the signal strength, degrees of freedom and the smoothness of link function, for both general and specific latent space models. We demonstrate the effectiveness of our method on synthetic data. We also showcase the merit of our method on two real-world datasets that are conventionally described by different specific models in producing meaningful and interpretable parameter estimations and accurate link prediction. We demonstrate the effectiveness of our method on synthetic data. We also showcase the merit of our method on two real-world datasets that are conventionally described by different specific models in producing meaningful and interpretable parameter estimations and accurate link prediction.
    Faithful Edge Federated Learning: Scalability and Privacy. (arXiv:2106.15905v1 [cs.LG])
    (2 min) Federated learning enables machine learning algorithms to be trained over a network of multiple decentralized edge devices without requiring the exchange of local datasets. Successfully deploying federated learning requires ensuring that agents (e.g., mobile devices) faithfully execute the intended algorithm, which has been largely overlooked in the literature. In this study, we first use risk bounds to analyze how the key feature of federated learning, unbalanced and non-i.i.d. data, affects agents' incentives to voluntarily participate and obediently follow traditional federated learning algorithms. To be more specific, our analysis reveals that agents with less typical data distributions and relatively more samples are more likely to opt out of or tamper with federated learning algorithms. To this end, we formulate the first faithful implementation problem of federated learning and design two faithful federated learning mechanisms which satisfy economic properties, scalability, and privacy. Further, the time complexity of computing all agents' payments in the number of agents is $\mathcal{O}(1)$. First, we design a Faithful Federated Learning (FFL) mechanism which approximates the Vickrey-Clarke-Groves (VCG) payments via an incremental computation. We show that it achieves (probably approximate) optimality, faithful implementation, voluntary participation, and some other economic properties (such as budget balance). Second, by partitioning agents into several subsets, we present a scalable VCG mechanism approximation. We further design a scalable and Differentially Private FFL (DP-FFL) mechanism, the first differentially private faithful mechanism, that maintains the economic properties. Our mechanism enables one to make three-way performance tradeoffs among privacy, the iterations needed, and payment accuracy loss.
    Group Testing under Superspreading Dynamics. (arXiv:2106.15988v1 [stat.AP])
    (2 min) Testing is recommended for all close contacts of confirmed COVID-19 patients. However, existing group testing methods are oblivious to the circumstances of contagion provided by contact tracing. Here, we build upon a well-known semi-adaptive pool testing method, Dorfman's method with imperfect tests, and derive a simple group testing method based on dynamic programming that is specifically designed to use the information provided by contact tracing. Experiments using a variety of reproduction numbers and dispersion levels, including those estimated in the context of the COVID-19 pandemic, show that the pools found using our method result in a significantly lower number of tests than those found using standard Dorfman's method, especially when the number of contacts of an infected individual is small. Moreover, our results show that our method can be more beneficial when the secondary infections are highly overdispersed.
    What can linear interpolation of neural network loss landscapes tell us?. (arXiv:2106.16004v1 [cs.LG])
    (2 min) Studying neural network loss landscapes provides insights into the nature of the underlying optimization problems. Unfortunately, loss landscapes are notoriously difficult to visualize in a human-comprehensible fashion. One common way to address this problem is to plot linear slices of the landscape, for example from the initial state of the network to the final state after optimization. On the basis of this analysis, prior work has drawn broader conclusions about the difficulty of the optimization problem. In this paper, we put inferences of this kind to the test, systematically evaluating how linear interpolation and final performance vary when altering the data, choice of initialization, and other optimizer and architecture design choices. Further, we use linear interpolation to study the role played by individual layers and substructures of the network. We find that certain layers are more sensitive to the choice of initialization and optimizer hyperparameter settings, and we exploit these observations to design custom optimization schemes. However, our results cast doubt on the broader intuition that the presence or absence of barriers when interpolating necessarily relates to the success of optimization.
    Multi-Scale Spectrogram Modelling for Neural Text-to-Speech. (arXiv:2106.15649v1 [eess.AS])
    (2 min) We propose a novel Multi-Scale Spectrogram (MSS) modelling approach to synthesise speech with an improved coarse and fine-grained prosody. We present a generic multi-scale spectrogram prediction mechanism where the system first predicts coarser scale mel-spectrograms that capture the suprasegmental information in speech, and later uses these coarser scale mel-spectrograms to predict finer scale mel-spectrograms capturing fine-grained prosody. We present details for two specific versions of MSS called Word-level MSS and Sentence-level MSS where the scales in our system are motivated by the linguistic units. The Word-level MSS models word, phoneme, and frame-level spectrograms while Sentence-level MSS models sentence-level spectrogram in addition. Subjective evaluations show that Word-level MSS performs statistically significantly better compared to the baseline on two voices.
    Emotions in Macroeconomic News and their Impact on the European Bond Market. (arXiv:2106.15698v1 [econ.GN])
    (2 min) We show how emotions extracted from macroeconomic news can be used to explain and forecast future behaviour of sovereign bond yield spreads in Italy and Spain. We use a big, open-source, database known as Global Database of Events, Language and Tone to construct emotion indicators of bond market affective states. We find that negative emotions extracted from news improve the forecasting power of government yield spread models during distressed periods even after controlling for the number of negative words present in the text. In addition, stronger negative emotions, such as panic, reveal useful information for predicting changes in spread at the short-term horizon, while milder emotions, such as distress, are useful at longer time horizons. Emotions generated by the Italian political turmoil propagate to the Spanish news affecting this neighbourhood market.
    On the Generative Utility of Cyclic Conditionals. (arXiv:2106.15962v1 [cs.LG])
    (2 min) We study whether and how can we model a joint distribution $p(x,z)$ using two conditional models $p(x|z)$ and $q(z|x)$ that form a cycle. This is motivated by the observation that deep generative models, in addition to a likelihood model $p(x|z)$, often also use an inference model $q(z|x)$ for data representation, but they rely on a usually uninformative prior distribution $p(z)$ to define a joint distribution, which may render problems like posterior collapse and manifold mismatch. To explore the possibility to model a joint distribution using only $p(x|z)$ and $q(z|x)$, we study their compatibility and determinacy, corresponding to the existence and uniqueness of a joint distribution whose conditional distributions coincide with them. We develop a general theory for novel and operable equivalence criteria for compatibility, and sufficient conditions for determinacy. Based on the theory, we propose the CyGen framework for cyclic-conditional generative modeling, including methods to enforce compatibility and use the determined distribution to fit and generate data. With the prior constraint removed, CyGen better fits data and captures more representative features, supported by experiments showing better generation and downstream classification performance.
    Leveraging Hidden Structure in Self-Supervised Learning. (arXiv:2106.16060v1 [cs.LG])
    (2 min) This work considers the problem of learning structured representations from raw images using self-supervised learning. We propose a principled framework based on a mutual information objective, which integrates self-supervised and structure learning. Furthermore, we devise a post-hoc procedure to interpret the meaning of the learnt representations. Preliminary experiments on CIFAR-10 show that the proposed framework achieves higher generalization performance in downstream classification tasks and provides more interpretable representations compared to the ones learnt through traditional self-supervised learning.
    Reservoir Based Edge Training on RF Data To Deliver Intelligent and Efficient IoT Spectrum Sensors. (arXiv:2106.16087v1 [eess.SP])
    (2 min) Current radio frequency (RF) sensors at the Edge lack the computational resources to support practical, in-situ training for intelligent spectrum monitoring, and sensor data classification in general. We propose a solution via Deep Delay Loop Reservoir Computing (DLR), a processing architecture that supports general machine learning algorithms on compact mobile devices by leveraging delay-loop reservoir computing in combination with innovative electrooptical hardware. With both digital and photonic realizations of our design of the loops, DLR delivers reductions in form factor, hardware complexity and latency, compared to the State-of-the-Art (SoA). The main impact of the reservoir is to project the input data into a higher dimensional space of reservoir state vectors in order to linearly separate the input classes. Once the classes are well separated, traditionally complex, power-hungry classification models are no longer needed for the learning process. Yet, even with simple classifiers based on Ridge regression (RR), the complexity grows at least quadratically with the input size. Hence, the hardware reduction required for training on compact devices is in contradiction with the large dimension of state vectors. DLR employs a RR-based classifier to exceed the SoA accuracy, while further reducing power consumption by leveraging the architecture of parallel (split) loops. We present DLR architectures composed of multiple smaller loops whose state vectors are linearly combined to create a lower dimensional input into Ridge regression. We demonstrate the advantages of using DLR for two distinct applications: RF Specific Emitter Identification (SEI) for IoT authentication, and wireless protocol recognition for IoT situational awareness.
    DAEMA: Denoising Autoencoder with Mask Attention. (arXiv:2106.16057v1 [cs.LG])
    (2 min) Missing data is a recurrent and challenging problem, especially when using machine learning algorithms for real-world applications. For this reason, missing data imputation has become an active research area, in which recent deep learning approaches have achieved state-of-the-art results. We propose DAEMA (Denoising Autoencoder with Mask Attention), an algorithm based on a denoising autoencoder architecture with an attention mechanism. While most imputation algorithms use incomplete inputs as they would use complete data - up to basic preprocessing (e.g. mean imputation) - DAEMA leverages a mask-based attention mechanism to focus on the observed values of its inputs. We evaluate DAEMA both in terms of reconstruction capabilities and downstream prediction and show that it achieves superior performance to state-of-the-art algorithms on several publicly available real-world datasets under various missingness settings.
    Parameter Priors for Directed Acyclic Graphical Models and the Characterization of Several Probability Distributions. (arXiv:1301.6697v4 [cs.LG] UPDATED)
    (2 min) We show that the only parameter prior for complete Gaussian DAG models that satisfies global parameter independence, complete model equivalence, and some weak regularity assumptions, is the normal-Wishart distribution. Our analysis is based on the following new characterization of the Wishart distribution: let W be an n x n, n >= 3, positive-definite symmetric matrix of random variables and f(W) be a pdf of W. Then, f(W) is a Wishart distribution if and only if W_{11}-W_{12}W_{22}^{-1}W_{12}' is independent of {W_{12}, W_{22}} for every block partitioning W_{11}, W_{12}, W_{12}', W_{22} of W. Similar characterizations of the normal and normal-Wishart distributions are provided as well. We also show how to construct a prior for every DAG model over X from the prior of a single regression model.
    HybridDeepRx: Deep Learning Receiver for High-EVM Signals. (arXiv:2106.16079v1 [eess.SP])
    (2 min) In this paper, we propose a machine learning (ML) based physical layer receiver solution for demodulating OFDM signals that are subject to a high level of nonlinear distortion. Specifically, a novel deep learning based convolutional neural network receiver is devised, containing layers in both time- and frequency domains, allowing to demodulate and decode the transmitted bits reliably despite the high error vector magnitude (EVM) in the transmit signal. Extensive set of numerical results is provided, in the context of 5G NR uplink incorporating also measured terminal power amplifier characteristics. The obtained results show that the proposed receiver system is able to clearly outperform classical linear receivers as well as existing ML receiver approaches, especially when the EVM is high in comparison with modulation order. The proposed ML receiver can thus facilitate pushing the terminal power amplifier (PA) systems deeper into saturation, and thereon improve the terminal power-efficiency, radiated power and network coverage.
    Augmented Shortcuts for Vision Transformers. (arXiv:2106.15941v1 [cs.CV])
    (2 min) Transformer models have achieved great progress on computer vision tasks recently. The rapid development of vision transformers is mainly contributed by their high representation ability for extracting informative features from input images. However, the mainstream transformer models are designed with deep architectures, and the feature diversity will be continuously reduced as the depth increases, i.e., feature collapse. In this paper, we theoretically analyze the feature collapse phenomenon and study the relationship between shortcuts and feature diversity in these transformer models. Then, we present an augmented shortcut scheme, which inserts additional paths with learnable parameters in parallel on the original shortcuts. To save the computational costs, we further explore an efficient approach that uses the block-circulant projection to implement augmented shortcuts. Extensive experiments conducted on benchmark datasets demonstrate the effectiveness of the proposed method, which brings about 1% accuracy increase of the state-of-the-art visual transformers without obviously increasing their parameters and FLOPs.
    Transductive Zero-Shot Hashing for Multilabel Image Retrieval. (arXiv:1911.07192v2 [cs.CV] UPDATED)
    (2 min) Hash coding has been widely used in approximate nearest neighbor search for large-scale image retrieval. Given semantic annotations such as class labels and pairwise similarities of the training data, hashing methods can learn and generate effective and compact binary codes. While some newly introduced images may contain undefined semantic labels, which we call unseen images, zeor-shot hashing techniques have been studied. However, existing zeor-shot hashing methods focus on the retrieval of single-label images, and cannot handle multi-label images. In this paper, for the first time, a novel transductive zero-shot hashing method is proposed for multi-label unseen image retrieval. In order to predict the labels of the unseen/target data, a visual-semantic bridge is built via instance-concept coherence ranking on the seen/source data. Then, pairwise similarity loss and focal quantization loss are constructed for training a hashing model using both the seen/source and unseen/target data. Extensive evaluations on three popular multi-label datasets demonstrate that, the proposed hashing method achieves significantly better results than the competing methods.
    Graph Signal Restoration Using Nested Deep Algorithm Unrolling. (arXiv:2106.15910v1 [eess.SP])
    (2 min) Graph signal processing is a ubiquitous task in many applications such as sensor, social, transportation and brain networks, point cloud processing, and graph neural networks. Graph signals are often corrupted through sensing processes, and need to be restored for the above applications. In this paper, we propose two graph signal restoration methods based on deep algorithm unrolling (DAU). First, we present a graph signal denoiser by unrolling iterations of the alternating direction method of multiplier (ADMM). We then propose a general restoration method for linear degradation by unrolling iterations of Plug-and-Play ADMM (PnP-ADMM). In the second method, the unrolled ADMM-based denoiser is incorporated as a submodule. Therefore, our restoration method has a nested DAU structure. Thanks to DAU, parameters in the proposed denoising/restoration methods are trainable in an end-to-end manner. Since the proposed restoration methods are based on iterations of a (convex) optimization algorithm, the method is interpretable and keeps the number of parameters small because we only need to tune graph-independent regularization parameters. We solve two main problems in existing graph signal restoration methods: 1) limited performance of convex optimization algorithms due to fixed parameters which are often determined manually. 2) large number of parameters of graph neural networks that result in difficulty of training. Several experiments for graph signal denoising and interpolation are performed on synthetic and real-world data. The proposed methods show performance improvements to several existing methods in terms of root mean squared error in both tasks.
    Robust and Interpretable Temporal Convolution Network for Event Detection in Lung Sound Recordings. (arXiv:2106.15835v1 [cs.SD])
    (2 min) This paper proposes a novel framework for lung sound event detection, segmenting continuous lung sound recordings into discrete events and performing recognition on each event. Exploiting the lightweight nature of Temporal Convolution Networks (TCNs) and their superior results compared to their recurrent counterparts, we propose a lightweight, yet robust, and completely interpretable framework for lung sound event detection. We propose the use of a multi-branch TCN architecture and exploit a novel fusion strategy to combine the resultant features from these branches. This not only allows the network to retain the most salient information across different temporal granularities and disregards irrelevant information, but also allows our network to process recordings of arbitrary length. Results: The proposed method is evaluated on multiple public and in-house benchmarks of irregular and noisy recordings of the respiratory auscultation process for the identification of numerous auscultation events including inhalation, exhalation, crackles, wheeze, stridor, and rhonchi. We exceed the state-of-the-art results in all evaluations. Furthermore, we empirically analyse the effect of the proposed multi-branch TCN architecture and the feature fusion strategy and provide quantitative and qualitative evaluations to illustrate their efficiency. Moreover, we provide an end-to-end model interpretation pipeline that interprets the operations of all the components of the proposed framework. Our analysis of different feature fusion strategies shows that the proposed feature concatenation method leads to better suppression of non-informative features, which drastically reduces the classifier overhead resulting in a robust lightweight network.The lightweight nature of our model allows it to be deployed in end-user devices such as smartphones, and it has the ability to generate predictions in real-time.
    AdaGDA: Faster Adaptive Gradient Descent Ascent Methods for Minimax Optimization. (arXiv:2106.16101v1 [math.OC])
    (2 min) In the paper, we propose a class of faster adaptive gradient descent ascent methods for solving the nonconvex-strongly-concave minimax problems by using unified adaptive matrices used in the SUPER-ADAM \citep{huang2021super}. Specifically, we propose a fast adaptive gradient decent ascent (AdaGDA) method based on the basic momentum technique, which reaches a low sample complexity of $O(\kappa^4\epsilon^{-4})$ for finding an $\epsilon$-stationary point without large batches, which improves the existing result of adaptive minimax optimization method by a factor of $O(\sqrt{\kappa})$. Moreover, we present an accelerated version of AdaGDA (VR-AdaGDA) method based on the momentum-based variance reduced technique, which achieves the best known sample complexity of $O(\kappa^3\epsilon^{-3})$ for finding an $\epsilon$-stationary point without large batches. Further assume the bounded Lipschitz parameter of objective function, we prove that our VR-AdaGDA method reaches a lower sample complexity of $O(\kappa^{2.5}\epsilon^{-3})$ with the mini-batch size $O(\kappa)$. In particular, we provide an effective convergence analysis framework for our adaptive methods based on unified adaptive matrices, which include almost existing adaptive learning rates.
    Explanation-Guided Diagnosis of Machine Learning Evasion Attacks. (arXiv:2106.15820v1 [cs.CR])
    (2 min) Machine Learning (ML) models are susceptible to evasion attacks. Evasion accuracy is typically assessed using aggregate evasion rate, and it is an open question whether aggregate evasion rate enables feature-level diagnosis on the effect of adversarial perturbations on evasive predictions. In this paper, we introduce a novel framework that harnesses explainable ML methods to guide high-fidelity assessment of ML evasion attacks. Our framework enables explanation-guided correlation analysis between pre-evasion perturbations and post-evasion explanations. Towards systematic assessment of ML evasion attacks, we propose and evaluate a novel suite of model-agnostic metrics for sample-level and dataset-level correlation analysis. Using malware and image classifiers, we conduct comprehensive evaluations across diverse model architectures and complementary feature representations. Our explanation-guided correlation analysis reveals correlation gaps between adversarial samples and the corresponding perturbations performed on them. Using a case study on explanation-guided evasion, we show the broader usage of our methodology for assessing robustness of ML models.
    Exploring Robustness of Neural Networks through Graph Measures. (arXiv:2106.15850v1 [cs.LG])
    (2 min) Motivated by graph theory, artificial neural networks (ANNs) are traditionally structured as layers of neurons (nodes), which learn useful information by the passage of data through interconnections (edges). In the machine learning realm, graph structures (i.e., neurons and connections) of ANNs have recently been explored using various graph-theoretic measures linked to their predictive performance. On the other hand, in network science (NetSci), certain graph measures including entropy and curvature are known to provide insight into the robustness and fragility of real-world networks. In this work, we use these graph measures to explore the robustness of various ANNs to adversarial attacks. To this end, we (1) explore the design space of inter-layer and intra-layers connectivity regimes of ANNs in the graph domain and record their predictive performance after training under different types of adversarial attacks, (2) use graph representations for both inter-layer and intra-layers connectivity regimes to calculate various graph-theoretic measures, including curvature and entropy, and (3) analyze the relationship between these graph measures and the adversarial performance of ANNs. We show that curvature and entropy, while operating in the graph domain, can quantify the robustness of ANNs without having to train these ANNs. Our results suggest that the real-world networks, including brain networks, financial networks, and social networks may provide important clues to the neural architecture search for robust ANNs. We propose a search strategy that efficiently finds robust ANNs amongst a set of well-performing ANNs without having a need to train all of these ANNs.
    Edge Proposal Sets for Link Prediction. (arXiv:2106.15810v1 [cs.SI])
    (2 min) Graphs are a common model for complex relational data such as social networks and protein interactions, and such data can evolve over time (e.g., new friendships) and be noisy (e.g., unmeasured interactions). Link prediction aims to predict future edges or infer missing edges in the graph, and has diverse applications in recommender systems, experimental design, and complex systems. Even though link prediction algorithms strongly depend on the set of edges in the graph, existing approaches typically do not modify the graph topology to improve performance. Here, we demonstrate how simply adding a set of edges, which we call a \emph{proposal set}, to the graph as a pre-processing step can improve the performance of several link prediction algorithms. The underlying idea is that if the edges in the proposal set generally align with the structure of the graph, link prediction algorithms are further guided towards predicting the right edges; in other words, adding a proposal set of edges is a signal-boosting pre-processing step. We show how to use existing link prediction algorithms to generate effective proposal sets and evaluate this approach on various synthetic and empirical datasets. We find that proposal sets meaningfully improve the accuracy of link prediction algorithms based on both neighborhood heuristics and graph neural networks. Code is available at \url{https://github.com/CUAI/Edge-Proposal-Sets}.
    Koopman Spectrum Nonlinear Regulator and Provably Efficient Online Learning. (arXiv:2106.15775v1 [cs.LG])
    (2 min) Most modern reinforcement learning algorithms optimize a cumulative single-step cost along a trajectory. The optimized motions are often 'unnatural', representing, for example, behaviors with sudden accelerations that waste energy and lack predictability. In this work, we present a novel paradigm of controlling nonlinear systems via the minimization of the Koopman spectrum cost: a cost over the Koopman operator of the controlled dynamics. This induces a broader class of dynamical behaviors that evolve over stable manifolds such as nonlinear oscillators, closed loops, and smooth movements. We demonstrate that some dynamics realizations that are not possible with a cumulative cost are feasible in this paradigm. Moreover, we present a provably efficient online learning algorithm for our problem that enjoys a sub-linear regret bound under some structural assumptions.
    Variational Refinement for Importance Sampling Using the Forward Kullback-Leibler Divergence. (arXiv:2106.15980v1 [stat.ML])
    (2 min) Variational Inference (VI) is a popular alternative to asymptotically exact sampling in Bayesian inference. Its main workhorse is optimization over a reverse Kullback-Leibler divergence (RKL), which typically underestimates the tail of the posterior leading to miscalibration and potential degeneracy. Importance sampling (IS), on the other hand, is often used to fine-tune and de-bias the estimates of approximate Bayesian inference procedures. The quality of IS crucially depends on the choice of the proposal distribution. Ideally, the proposal distribution has heavier tails than the target, which is rarely achievable by minimizing the RKL. We thus propose a novel combination of optimization and sampling techniques for approximate Bayesian inference by constructing an IS proposal distribution through the minimization of a forward KL (FKL) divergence. This approach guarantees asymptotic consistency and a fast convergence towards both the optimal IS estimator and the optimal variational approximation. We empirically demonstrate on real data that our method is competitive with variational boosting and MCMC.
    The Evolution of Out-of-Distribution Robustness Throughout Fine-Tuning. (arXiv:2106.15831v1 [cs.LG])
    (2 min) Although machine learning models typically experience a drop in performance on out-of-distribution data, accuracies on in- versus out-of-distribution data are widely observed to follow a single linear trend when evaluated across a testbed of models. Models that are more accurate on the out-of-distribution data relative to this baseline exhibit "effective robustness" and are exceedingly rare. Identifying such models, and understanding their properties, is key to improving out-of-distribution performance. We conduct a thorough empirical investigation of effective robustness during fine-tuning and surprisingly find that models pre-trained on larger datasets exhibit effective robustness during training that vanishes at convergence. We study how properties of the data influence effective robustness, and we show that it increases with the larger size, more diversity, and higher example difficulty of the dataset. We also find that models that display effective robustness are able to correctly classify 10% of the examples that no other current testbed model gets correct. Finally, we discuss several strategies for scaling effective robustness to the high-accuracy regime to improve the out-of-distribution accuracy of state-of-the-art models.
    End-to-End Learning of OFDM Waveforms with PAPR and ACLR Constraints. (arXiv:2106.16039v1 [cs.IT])
    (2 min) Orthogonal frequency-division multiplexing (OFDM) is widely used in modern wireless networks thanks to its efficient handling of multipath environment. However, it suffers from a poor peak-to-average power ratio (PAPR) which requires a large power backoff, degrading the power amplifier (PA) efficiency. In this work, we propose to use a neural network (NN) at the transmitter to learn a high-dimensional modulation scheme allowing to control the PAPR and adjacent channel leakage ratio (ACLR). On the receiver side, a NN-based receiver is implemented to carry out demapping of the transmitted bits. The two NNs operate on top of OFDM, and are jointly optimized in and end-to-end manner using a training algorithm that enforces constraints on the PAPR and ACLR. Simulation results show that the learned waveforms enable higher information rates than a tone reservation baseline, while satisfying predefined PAPR and ACLR targets.
    Hierarchical Phenotyping and Graph Modeling of Spatial Architecture in Lymphoid Neoplasms. (arXiv:2106.16174v1 [q-bio.QM])
    (2 min) The cells and their spatial patterns in the tumor microenvironment (TME) play a key role in tumor evolution, and yet remains an understudied topic in computational pathology. This study, to the best of our knowledge, is among the first to hybrid local and global graph methods to profile orchestration and interaction of cellular components. To address the challenge in hematolymphoid cancers where the cell classes in TME are unclear, we first implemented cell level unsupervised learning and identified two new cell subtypes. Local cell graphs or supercells were built for each image by considering the individual cell's geospatial location and classes. Then, we applied supercell level clustering and identified two new cell communities. In the end, we built global graphs to abstract spatial interaction patterns and extract features for disease diagnosis. We evaluate the proposed algorithm on H\&E slides of 60 hematolymphoid neoplasm patients and further compared it with three cell level graph-based algorithms, including the global cell graph, cluster cell graph, and FLocK. The proposed algorithm achieves a mean diagnosis accuracy of 0.703 with the repeated 5-fold cross-validation scheme. In conclusion, our algorithm shows superior performance over the existing methods and can be potentially applied to other cancer types.
    Resilient UAV Swarm Communications with Graph Convolutional Neural Network. (arXiv:2106.16048v1 [eess.SP])
    (2 min) In this paper, we study the self-healing problem of unmanned aerial vehicle (UAV) swarm network (USNET) that is required to quickly rebuild the communication connectivity under unpredictable external disruptions (UEDs). Firstly, to cope with the one-off UEDs, we propose a graph convolutional neural network (GCN) and find the recovery topology of the USNET in an on-line manner. Secondly, to cope with general UEDs, we develop a GCN based trajectory planning algorithm that can make UAVs rebuild the communication connectivity during the self-healing process. We also design a meta learning scheme to facilitate the on-line executions of the GCN. Numerical results show that the proposed algorithms can rebuild the communication connectivity of the USNET more quickly than the existing algorithms under both one-off UEDs and general UEDs. The simulation results also show that the meta learning scheme can not only enhance the performance of the GCN but also reduce the time complexity of the on-line executions.
    Exploring Context Modeling Techniques on the Spatiotemporal Crowd Flow Prediction. (arXiv:2106.16046v1 [cs.LG])
    (2 min) In the big data and AI era, context is widely exploited as extra information which makes it easier to learn a more complex pattern in machine learning systems. However, most of the existing related studies seldom take context into account. The difficulty lies in the unknown generalization ability of both context and its modeling techniques across different scenarios. To fill the above gaps, we conduct a large-scale analytical and empirical study on the spatiotemporal crowd prediction (STCFP) problem that is a widely-studied and hot research topic. We mainly make three efforts:(i) we develop new taxonomy about both context features and context modeling techniques based on extensive investigations in prevailing STCFP research; (ii) we conduct extensive experiments on seven datasets with hundreds of millions of records to quantitatively evaluate the generalization ability of both distinct context features and context modeling techniques; (iii) we summarize some guidelines for researchers to conveniently utilize context in diverse applications.
    Tuning Mixed Input Hyperparameters on the Fly for Efficient Population Based AutoRL. (arXiv:2106.15883v1 [cs.LG])
    (2 min) Despite a series of recent successes in reinforcement learning (RL), many RL algorithms remain sensitive to hyperparameters. As such, there has recently been interest in the field of AutoRL, which seeks to automate design decisions to create more general algorithms. Recent work suggests that population based approaches may be effective AutoRL algorithms, by learning hyperparameter schedules on the fly. In particular, the PB2 algorithm is able to achieve strong performance in RL tasks by formulating online hyperparameter optimization as time varying GP-bandit problem, while also providing theoretical guarantees. However, PB2 is only designed to work for continuous hyperparameters, which severely limits its utility in practice. In this paper we introduce a new (provably) efficient hierarchical approach for optimizing both continuous and categorical variables, using a new time-varying bandit algorithm specifically designed for the population based training regime. We evaluate our approach on the challenging Procgen benchmark, where we show that explicitly modelling dependence between data augmentation and other hyperparameters improves generalization.
    Understanding Adversarial Attacks on Observations in Deep Reinforcement Learning. (arXiv:2106.15860v1 [cs.LG])
    (2 min) Recent works demonstrate that deep reinforcement learning (DRL) models are vulnerable to adversarial attacks which can decrease the victim's total reward by manipulating the observations. Compared with adversarial attacks in supervised learning, it is much more challenging to deceive a DRL model since the adversary has to infer the environmental dynamics. To address this issue, we reformulate the problem of adversarial attacks in function space and separate the previous gradient based attacks into several subspace. Following the analysis of the function space, we design a generic two-stage framework in the subspace where the adversary lures the agent to a target trajectory or a deceptive policy. In the first stage, we train a deceptive policy by hacking the environment, and discover a set of trajectories routing to the lowest reward. The adversary then misleads the victim to imitate the deceptive policy by perturbing the observations. Our method provides a tighter theoretical upper bound for the attacked agent's performance than the existing approaches. Extensive experiments demonstrate the superiority of our method and we achieve the state-of-the-art performance on both Atari and MuJoCo environments.
    Optimal Epidemic Control as a Contextual Combinatorial Bandit with Budget. (arXiv:2106.15808v1 [cs.LG])
    (2 min) In light of the COVID-19 pandemic, it is an open challenge and critical practical problem to find a optimal way to dynamically prescribe the best policies that balance both the governmental resources and epidemic control in different countries and regions. To solve this multi-dimensional tradeoff of exploitation and exploration, we formulate this technical challenge as a contextual combinatorial bandit problem that jointly optimizes a multi-criteria reward function. Given the historical daily cases in a region and the past intervention plans in place, the agent should generate useful intervention plans that policy makers can implement in real time to minimizing both the number of daily COVID-19 cases and the stringency of the recommended interventions. We prove this concept with simulations of multiple realistic policy making scenarios.
    Anomaly Detection: How to Artificially Increase your F1-Score with a Biased Evaluation Protocol. (arXiv:2106.16020v1 [cs.LG])
    (2 min) Anomaly detection is a widely explored domain in machine learning. Many models are proposed in the literature, and compared through different metrics measured on various datasets. The most popular metrics used to compare performances are F1-score, AUC and AVPR. In this paper, we show that F1-score and AVPR are highly sensitive to the contamination rate. One consequence is that it is possible to artificially increase their values by modifying the train-test split procedure. This leads to misleading comparisons between algorithms in the literature, especially when the evaluation protocol is not well detailed. Moreover, we show that the F1-score and the AVPR cannot be used to compare performances on different datasets as they do not reflect the intrinsic difficulty of modeling such data. Based on these observations, we claim that F1-score and AVPR should not be used as metrics for anomaly detection. We recommend a generic evaluation procedure for unsupervised anomaly detection, including the use of other metrics such as the AUC, which are more robust to arbitrary choices in the evaluation protocol.
    On the Periodic Behavior of Neural Network Training with Batch Normalization and Weight Decay. (arXiv:2106.15739v1 [cs.LG])
    (2 min) Despite the conventional wisdom that using batch normalization with weight decay may improve neural network training, some recent works show their joint usage may cause instabilities at the late stages of training. Other works, in contrast, show convergence to the equilibrium, i.e., the stabilization of training metrics. In this paper, we study this contradiction and show that instead of converging to a stable equilibrium, the training dynamics converge to consistent periodic behavior. That is, the training process regularly exhibits instabilities which, however, do not lead to complete training failure, but cause a new period of training. We rigorously investigate the mechanism underlying this discovered periodic behavior both from an empirical and theoretical point of view and show that this periodic behavior is indeed caused by the interaction between batch normalization and weight decay.
    Deep Linear Networks Dynamics: Low-Rank Biases Induced by Initialization Scale and L2 Regularization. (arXiv:2106.15933v1 [stat.ML])
    (2 min) For deep linear networks (DLN), various hyperparameters alter the dynamics of training dramatically. We investigate how the rank of the linear map found by gradient descent is affected by (1) the initialization norm and (2) the addition of $L_{2}$ regularization on the parameters. For (1), we study two regimes: (1a) the linear/lazy regime, for large norm initialization; (1b) a \textquotedbl saddle-to-saddle\textquotedbl{} regime for small initialization norm. In the (1a) setting, the dynamics of a DLN of any depth is similar to that of a standard linear model, without any low-rank bias. In the (1b) setting, we conjecture that throughout training, gradient descent approaches a sequence of saddles, each corresponding to linear maps of increasing rank, until reaching a minimal rank global minimum. We support this conjecture with a partial proof and some numerical experiments. For (2), we show that adding a $L_{2}$ regularization on the parameters corresponds to the addition to the cost of a $L_{p}$-Schatten (quasi)norm on the linear map with $p=\frac{2}{L}$ (for a depth-$L$ network), leading to a stronger low-rank bias as $L$ grows. The effect of $L_{2}$ regularization on the loss surface depends on the depth: for shallow networks, all critical points are either strict saddles or global minima, whereas for deep networks, some local minima appear. We numerically observe that these local minima can generalize better than global ones in some settings.
    Local Reweighting for Adversarial Training. (arXiv:2106.15776v1 [cs.LG])
    (2 min) Instances-reweighted adversarial training (IRAT) can significantly boost the robustness of trained models, where data being less/more vulnerable to the given attack are assigned smaller/larger weights during training. However, when tested on attacks different from the given attack simulated in training, the robustness may drop significantly (e.g., even worse than no reweighting). In this paper, we study this problem and propose our solution--locally reweighted adversarial training (LRAT). The rationale behind IRAT is that we do not need to pay much attention to an instance that is already safe under the attack. We argue that the safeness should be attack-dependent, so that for the same instance, its weight can change given different attacks based on the same model. Thus, if the attack simulated in training is mis-specified, the weights of IRAT are misleading. To this end, LRAT pairs each instance with its adversarial variants and performs local reweighting inside each pair, while performing no global reweighting--the rationale is to fit the instance itself if it is immune to the attack, but not to skip the pair, in order to passively defend different attacks in future. Experiments show that LRAT works better than both IRAT (i.e., global reweighting) and the standard AT (i.e., no reweighting) when trained with an attack and tested on different attacks.
    Curvature Graph Neural Network. (arXiv:2106.15762v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) have achieved great success in many graph-based tasks. Much work is dedicated to empowering GNNs with the adaptive locality ability, which enables measuring the importance of neighboring nodes to the target node by a node-specific mechanism. However, the current node-specific mechanisms are deficient in distinguishing the importance of nodes in the topology structure. We believe that the structural importance of neighboring nodes is closely related to their importance in aggregation. In this paper, we introduce discrete graph curvature (the Ricci curvature) to quantify the strength of structural connection of pairwise nodes. And we propose Curvature Graph Neural Network (CGNN), which effectively improves the adaptive locality ability of GNNs by leveraging the structural property of graph curvature. To improve the adaptability of curvature to various datasets, we explicitly transform curvature into the weights of neighboring nodes by the necessary Negative Curvature Processing Module and Curvature Normalization Module. Then, we conduct numerous experiments on various synthetic datasets and real-world datasets. The experimental results on synthetic datasets show that CGNN effectively exploits the topology structure information, and the performance is improved significantly. CGNN outperforms the baselines on 5 dense node classification benchmark datasets. This study deepens the understanding of how to utilize advanced topology information and assign the importance of neighboring nodes from the perspective of graph curvature and encourages us to bridge the gap between graph theory and neural networks.
    The Threat of Offensive AI to Organizations. (arXiv:2106.15764v1 [cs.AI])
    (2 min) AI has provided us with the ability to automate tasks, extract information from vast amounts of data, and synthesize media that is nearly indistinguishable from the real thing. However, positive tools can also be used for negative purposes. In particular, cyber adversaries can use AI (such as machine learning) to enhance their attacks and expand their campaigns. Although offensive AI has been discussed in the past, there is a need to analyze and understand the threat in the context of organizations. For example, how does an AI-capable adversary impact the cyber kill chain? Does AI benefit the attacker more than the defender? What are the most significant AI threats facing organizations today and what will be their impact on the future? In this survey, we explore the threat of offensive AI on organizations. First, we present the background and discuss how AI changes the adversary's methods, strategies, goals, and overall attack model. Then, through a literature review, we identify 33 offensive AI capabilities which adversaries can use to enhance their attacks. Finally, through a user study spanning industry and academia, we rank the AI threats and provide insights on the adversaries.
    UAV-assisted Online Machine Learning over Multi-Tiered Networks: A Hierarchical Nested Personalized Federated Learning Approach. (arXiv:2106.15734v1 [cs.LG])
    (2 min) We consider distributed machine learning (ML) through unmanned aerial vehicles (UAVs) for geo-distributed device clusters. We propose five new technologies/techniques: (i) stratified UAV swarms with leader, worker, and coordinator UAVs, (ii) hierarchical nested personalized federated learning (HN-PFL): a holistic distributed ML framework for personalized model training across the worker-leader-core network hierarchy, (iii) cooperative UAV resource pooling for distributed ML using the UAVs' local computational capabilities, (iv) aerial data caching and relaying for efficient data relaying to conduct ML, and (v) concept/model drift, capturing online data variations at the devices. We split the UAV-enabled model training problem as two parts. (a) Network-aware HN-PFL, where we optimize a tradeoff between energy consumption and ML model performance by configuring data offloading among devices-UAVs and UAV-UAVs, UAVs' CPU frequencies, and mini-batch sizes subject to communication/computation network heterogeneity. We tackle this optimization problem via the method of posynomial condensation and propose a distributed algorithm with a performance guarantee. (b) Macro-trajectory and learning duration design, which we formulate as a sequential decision making problem, tackled via deep reinforcement learning. Our simulations demonstrate the superiority of our methodology with regards to the distributed ML performance, the optimization of network resources, and the swarm trajectory efficiency.
    Learning Bounds for Open-Set Learning. (arXiv:2106.15792v1 [cs.LG])
    (2 min) Traditional supervised learning aims to train a classifier in the closed-set world, where training and test samples share the same label space. In this paper, we target a more challenging and realistic setting: open-set learning (OSL), where there exist test samples from the classes that are unseen during training. Although researchers have designed many methods from the algorithmic perspectives, there are few methods that provide generalization guarantees on their ability to achieve consistent performance on different training samples drawn from the same distribution. Motivated by the transfer learning and probably approximate correct (PAC) theory, we make a bold attempt to study OSL by proving its generalization error-given training samples with size n, the estimation error will get close to order O_p(1/\sqrt{n}). This is the first study to provide a generalization bound for OSL, which we do by theoretically investigating the risk of the target classifier on unknown classes. According to our theory, a novel algorithm, called auxiliary open-set risk (AOSR) is proposed to address the OSL problem. Experiments verify the efficacy of AOSR. The code is available at github.com/Anjin-Liu/Openset_Learning_AOSR.
    Multi-Source Domain Adaptation for Object Detection. (arXiv:2106.15793v1 [cs.CV])
    (2 min) To reduce annotation labor associated with object detection, an increasing number of studies focus on transferring the learned knowledge from a labeled source domain to another unlabeled target domain. However, existing methods assume that the labeled data are sampled from a single source domain, which ignores a more generalized scenario, where labeled data are from multiple source domains. For the more challenging task, we propose a unified Faster R-CNN based framework, termed Divide-and-Merge Spindle Network (DMSN), which can simultaneously enhance domain invariance and preserve discriminative power. Specifically, the framework contains multiple source subnets and a pseudo target subnet. First, we propose a hierarchical feature alignment strategy to conduct strong and weak alignments for low- and high-level features, respectively, considering their different effects for object detection. Second, we develop a novel pseudo subnet learning algorithm to approximate optimal parameters of pseudo target subset by weighted combination of parameters in different source subnets. Finally, a consistency regularization for region proposal network is proposed to facilitate each subnet to learn more abstract invariances. Extensive experiments on different adaptation scenarios demonstrate the effectiveness of the proposed model.
    SAT Based Analogy Evaluation Framework for Persian Word Embeddings. (arXiv:2106.15674v1 [cs.CL])
    (2 min) In recent years there has been a special interest in word embeddings as a new approach to convert words to vectors. It has been a focal point to understand how much of the semantics of the the words has been transferred into embedding vectors. This is important as the embedding is going to be used as the basis for downstream NLP applications and it will be costly to evaluate the application end-to-end in order to identify quality of the used embedding model. Generally the word embeddings are evaluated through a number of tests, including analogy test. In this paper we propose a test framework for Persian embedding models. Persian is a low resource language and there is no rich semantic benchmark to evaluate word embedding models for this language. In this paper we introduce an evaluation framework including a hand crafted Persian SAT based analogy dataset, a colliquial test set (specific to Persian) and a benchmark to study the impact of various parameters on the semantic evaluation task.
    Single-Step Adversarial Training for Semantic Segmentation. (arXiv:2106.15998v1 [cs.CV])
    (2 min) Even though deep neural networks succeed on many different tasks including semantic segmentation, they lack on robustness against adversarial examples. To counteract this exploit, often adversarial training is used. However, it is known that adversarial training with weak adversarial attacks (e.g. using the Fast Gradient Method) does not improve the robustness against stronger attacks. Recent research shows that it is possible to increase the robustness of such single-step methods by choosing an appropriate step size during the training. Finding such a step size, without increasing the computational effort of single-step adversarial training, is still an open challenge. In this work we address the computationally particularly demanding task of semantic segmentation and propose a new step size control algorithm that increases the robustness of single-step adversarial training. The proposed algorithm does not increase the computational effort of single-step adversarial training considerably and also simplifies training, because it is free of meta-parameter. We show that the robustness of our approach can compete with multi-step adversarial training on two popular benchmarks for semantic segmentation.
    Diffusion Priors In Variational Autoencoders. (arXiv:2106.15671v1 [cs.LG])
    (2 min) Among likelihood-based approaches for deep generative modelling, variational autoencoders (VAEs) offer scalable amortized posterior inference and fast sampling. However, VAEs are also more and more outperformed by competing models such as normalizing flows (NFs), deep-energy models, or the new denoising diffusion probabilistic models (DDPMs). In this preliminary work, we improve VAEs by demonstrating how DDPMs can be used for modelling the prior distribution of the latent variables. The diffusion prior model improves upon Gaussian priors of classical VAEs and is competitive with NF-based priors. Finally, we hypothesize that hierarchical VAEs could similarly benefit from the enhanced capacity of diffusion priors.
    Dual GNNs: Graph Neural Network Learning with Limited Supervision. (arXiv:2106.15755v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) require a relatively large number of labeled nodes and a reliable/uncorrupted graph connectivity structure in order to obtain good performance on the semi-supervised node classification task. The performance of GNNs can degrade significantly as the number of labeled nodes decreases or the graph connectivity structure is corrupted by adversarial attacks or due to noises in data measurement /collection. Therefore, it is important to develop GNN models that are able to achieve good performance when there is limited supervision knowledge -- a few labeled nodes and noisy graph structures. In this paper, we propose a novel Dual GNN learning framework to address this challenge task. The proposed framework has two GNN based node prediction modules. The primary module uses the input graph structure to induce regular node embeddings and predictions with a regular GNN baseline, while the auxiliary module constructs a new graph structure through fine-grained spectral clusterings and learns new node embeddings and predictions. By integrating the two modules in a dual GNN learning framework, we perform joint learning in an end-to-end fashion. This general framework can be applied on many GNN baseline models. The experimental results validate that the proposed dual GNN framework can greatly outperform the GNN baseline methods when the labeled nodes are scarce and the graph connectivity structure is noisy.
    Edge Representation Learning with Hypergraphs. (arXiv:2106.15845v1 [cs.LG])
    (2 min) Graph neural networks have recently achieved remarkable success in representing graph-structured data, with rapid progress in both the node embedding and graph pooling methods. Yet, they mostly focus on capturing information from the nodes considering their connectivity, and not much work has been done in representing the edges, which are essential components of a graph. However, for tasks such as graph reconstruction and generation, as well as graph classification tasks for which the edges are important for discrimination, accurately representing edges of a given graph is crucial to the success of the graph representation learning. To this end, we propose a novel edge representation learning framework based on Dual Hypergraph Transformation (DHT), which transforms the edges of a graph into the nodes of a hypergraph. This dual hypergraph construction allows us to apply message passing techniques for node representations to edges. After obtaining edge representations from the hypergraphs, we then cluster or drop edges to obtain holistic graph-level edge representations. We validate our edge representation learning method with hypergraphs on diverse graph datasets for graph representation and generation performance, on which our method largely outperforms existing graph representation learning methods. Moreover, our edge representation learning and pooling method also largely outperforms state-of-the-art graph pooling methods on graph classification, not only because of its accurate edge representation learning, but also due to its lossless compression of the nodes and removal of irrelevant edges for effective message passing.
    Distributionally Robust Learning with Stable Adversarial Training. (arXiv:2106.15791v1 [cs.LG])
    (2 min) Machine learning algorithms with empirical risk minimization are vulnerable under distributional shifts due to the greedy adoption of all the correlations found in training data. There is an emerging literature on tackling this problem by minimizing the worst-case risk over an uncertainty set. However, existing methods mostly construct ambiguity sets by treating all variables equally regardless of the stability of their correlations with the target, resulting in the overwhelmingly-large uncertainty set and low confidence of the learner. In this paper, we propose a novel Stable Adversarial Learning (SAL) algorithm that leverages heterogeneous data sources to construct a more practical uncertainty set and conduct differentiated robustness optimization, where covariates are differentiated according to the stability of their correlations with the target. We theoretically show that our method is tractable for stochastic gradient-based optimization and provide the performance guarantees for our method. Empirical studies on both simulation and real datasets validate the effectiveness of our method in terms of uniformly good performance across unknown distributional shifts.
    Unaware Fairness: Hierarchical Random Forest for Protected Classes. (arXiv:2106.15767v1 [cs.LG])
    (2 min) Procedural fairness has been a public concern, which leads to controversy when making decisions with respect to protected classes, such as race, social status, and disability. Some protected classes can be inferred according to some safe proxies like surname and geolocation for the race. Hence, implicitly utilizing the predicted protected classes based on the related proxies when making decisions is an efficient approach to circumvent this issue and seek just decisions. In this article, we propose a hierarchical random forest model for prediction without explicitly involving protected classes. Simulation experiments are conducted to show the performance of the hierarchical random forest model. An example is analyzed from Boston police interview records to illustrate the usefulness of the proposed model.

2021-06-30

  • cs.CL updates on arXiv.org

    Span-based Joint Entity and Relation Extraction with Transformer Pre-training. (arXiv:1909.07755v4 [cs.CL] UPDATED)
    (2 min) We introduce SpERT, an attention model for span-based joint entity and relation extraction. Our key contribution is a light-weight reasoning on BERT embeddings, which features entity recognition and filtering, as well as relation classification with a localized, marker-free context representation. The model is trained using strong within-sentence negative samples, which are efficiently extracted in a single BERT pass. These aspects facilitate a search over all spans in the sentence. In ablation studies, we demonstrate the benefits of pre-training, strong negative sampling and localized context. Our model outperforms prior work by up to 2.6% F1 score on several datasets for joint entity and relation extraction.
    Unsupervised Technique To Conversational Machine Reading. (arXiv:2106.15247v1 [cs.CL])
    (2 min) Conversational machine reading (CMR) tools have seen a rapid progress in the recent past. The current existing tools rely on the supervised learning technique which require labeled dataset for their training. The supervised technique necessitates that for every new rule text, a manually labeled dataset must be created. This is tedious and error prone. This paper introduces and demonstrates how unsupervised learning technique can be applied in the development of CMR. Specifically, we demonstrate how unsupervised learning can be used in rule extraction and entailment modules of CMR. Compared to the current best CMR tool, our developed framework reports 3.3% improvement in micro averaged accuracy and 1.4 % improvement in macro averaged accuracy.
    Differential Privacy for Credit Risk Model. (arXiv:2106.15343v1 [cs.CR])
    (2 min) The use of machine learning algorithms to model user behavior and drive business decisions has become increasingly commonplace, specifically providing intelligent recommendations to automated decision making. This has led to an increase in the use of customers personal data to analyze customer behavior and predict their interests in a companys products. Increased use of this customer personal data can lead to better models but also to the potential of customer data being leaked, reverse engineered, and mishandled. In this paper, we assess differential privacy as a solution to address these privacy problems by building privacy protections into the data engineering and model training stages of predictive model development. Our interest is a pragmatic implementation in an operational environment, which necessitates a general purpose differentially private modeling framework, and we evaluate one such tool from LeapYear as applied to the Credit Risk modeling domain. Credit Risk Model is a major modeling methodology in banking and finance where user data is analyzed to determine the total Expected Loss to the bank. We examine the application of differential privacy on the credit risk model and evaluate the performance of a Differentially Private Model with a Non Differentially Private Model. Credit Risk Model is a major modeling methodology in banking and finance where users data is analyzed to determine the total Expected Loss to the bank. In this paper, we explore the application of differential privacy on the credit risk model and evaluate the performance of a Non Differentially Private Model with Differentially Private Model.
    Multitask Recalibrated Aggregation Network for Medical Code Prediction. (arXiv:2104.00952v3 [cs.CL] UPDATED)
    (2 min) Medical coding translates professionally written medical reports into standardized codes, which is an essential part of medical information systems and health insurance reimbursement. Manual coding by trained human coders is time-consuming and error-prone. Thus, automated coding algorithms have been developed, building especially on the recent advances in machine learning and deep neural networks. To solve the challenges of encoding lengthy and noisy clinical documents and capturing code associations, we propose a multitask recalibrated aggregation network. In particular, multitask learning shares information across different coding schemes and captures the dependencies between different medical codes. Feature recalibration and aggregation in shared modules enhance representation learning for lengthy notes. Experiments with a real-world MIMIC-III dataset show significantly improved predictive performance.
    Few-Shot Electronic Health Record Coding through Graph Contrastive Learning. (arXiv:2106.15467v1 [cs.AI])
    (2 min) Electronic health record (EHR) coding is the task of assigning ICD codes to each EHR. Most previous studies either only focus on the frequent ICD codes or treat rare and frequent ICD codes in the same way. These methods perform well on frequent ICD codes but due to the extremely unbalanced distribution of ICD codes, the performance on rare ones is far from satisfactory. We seek to improve the performance for both frequent and rare ICD codes by using a contrastive graph-based EHR coding framework, CoGraph, which re-casts EHR coding as a few-shot learning task. First, we construct a heterogeneous EHR word-entity (HEWE) graph for each EHR, where the words and entities extracted from an EHR serve as nodes and the relations between them serve as edges. Then, CoGraph learns similarities and dissimilarities between HEWE graphs from different ICD codes so that information can be transferred among them. In a few-shot learning scenario, the model only has access to frequent ICD codes during training, which might force it to encode features that are useful for frequent ICD codes only. To mitigate this risk, CoGraph devises two graph contrastive learning schemes, GSCL and GECL, that exploit the HEWE graph structures so as to encode transferable features. GSCL utilizes the intra-correlation of different sub-graphs sampled from HEWE graphs while GECL exploits the inter-correlation among HEWE graphs at different clinical stages. Experiments on the MIMIC-III benchmark dataset show that CoGraph significantly outperforms state-of-the-art methods on EHR coding, not only on frequent ICD codes, but also on rare codes, in terms of several evaluation indicators. On frequent ICD codes, GSCL and GECL improve the classification accuracy and F1 by 1.31% and 0.61%, respectively, and on rare ICD codes CoGraph has more obvious improvements by 2.12% and 2.95%.
    A Survey on Neural Speech Synthesis. (arXiv:2106.15561v1 [eess.AS])
    (2 min) Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad applications in the industry. As the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends. We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc. We further summarize resources related to TTS (e.g., datasets, opensource implementations) and discuss future research directions. This survey can serve both academic researchers and industry practitioners working on TTS.
    Topic Modeling Based Extractive Text Summarization. (arXiv:2106.15313v1 [cs.CL])
    (2 min) Text summarization is an approach for identifying important information present within text documents. This computational technique aims to generate shorter versions of the source text, by including only the relevant and salient information present within the source text. In this paper, we propose a novel method to summarize a text document by clustering its contents based on latent topics produced using topic modeling techniques and by generating extractive summaries for each of the identified text clusters. All extractive sub-summaries are later combined to generate a summary for any given source document. We utilize the lesser used and challenging WikiHow dataset in our approach to text summarization. This dataset is unlike the commonly used news datasets which are available for text summarization. The well-known news datasets present their most important information in the first few lines of their source texts, which make their summarization a lesser challenging task when compared to summarizing the WikiHow dataset. Contrary to these news datasets, the documents in the WikiHow dataset are written using a generalized approach and have lesser abstractedness and higher compression ratio, thus proposing a greater challenge to generate summaries. A lot of the current state-of-the-art text summarization techniques tend to eliminate important information present in source documents in the favor of brevity. Our proposed technique aims to capture all the varied information present in source documents. Although the dataset proved challenging, after performing extensive tests within our experimental setup, we have discovered that our model produces encouraging ROUGE results and summaries when compared to the other published extractive and abstractive text summarization models.
    Arabic Speech Recognition by End-to-End, Modular Systems and Human. (arXiv:2101.08454v2 [eess.AS] UPDATED)
    (2 min) Recent advances in automatic speech recognition (ASR) have achieved accuracy levels comparable to human transcribers, which led researchers to debate if the machine has reached human performance. Previous work focused on the English language and modular hidden Markov model-deep neural network (HMM-DNN) systems. In this paper, we perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition (HSR) on the Arabic language and its dialects. For the HSR, we evaluate linguist performance and lay-native speaker performance on a new dataset collected as a part of this study. For ASR the end-to-end work led to 12.5%, 27.5%, 33.8% WER; a new performance milestone for the MGB2, MGB3, and MGB5 challenges respectively. Our results suggest that human performance in the Arabic language is still considerably better than the machine with an absolute WER gap of 3.5% on average.
    Towards Interpretable Natural Language Understanding with Explanations as Latent Variables. (arXiv:2011.05268v2 [cs.CL] UPDATED)
    (2 min) Recently generating natural language explanations has shown very promising results in not only offering interpretable explanations but also providing additional information and supervision for prediction. However, existing approaches usually require a large set of human annotated explanations for training while collecting a large set of explanations is not only time consuming but also expensive. In this paper, we develop a general framework for interpretable natural language understanding that requires only a small set of human annotated explanations for training. Our framework treats natural language explanations as latent variables that model the underlying reasoning process of a neural model. We develop a variational EM framework for optimization where an explanation generation module and an explanation-augmented prediction module are alternatively optimized and mutually enhance each other. Moreover, we further propose an explanation-based self-training method under this framework for semi-supervised learning. It alternates between assigning pseudo-labels to unlabeled data and generating new explanations to iteratively improve each other. Experiments on two natural language understanding tasks demonstrate that our framework can not only make effective predictions in both supervised and semi-supervised settings, but also generate good natural language explanation.
    Hate speech detection using static BERT embeddings. (arXiv:2106.15537v1 [cs.CL])
    (2 min) With increasing popularity of social media platforms hate speech is emerging as a major concern, where it expresses abusive speech that targets specific group characteristics, such as gender, religion or ethnicity to spread violence. Earlier people use to verbally deliver hate speeches but now with the expansion of technology, some people are deliberately using social media platforms to spread hate by posting, sharing, commenting, etc. Whether it is Christchurch mosque shootings or hate crimes against Asians in west, it has been observed that the convicts are very much influenced from hate text present online. Even though AI systems are in place to flag such text but one of the key challenges is to reduce the false positive rate (marking non hate as hate), so that these systems can detect hate speech without undermining the freedom of expression. In this paper, we use ETHOS hate speech detection dataset and analyze the performance of hate speech detection classifier by replacing or integrating the word embeddings (fastText (FT), GloVe (GV) or FT + GV) with static BERT embeddings (BE). With the extensive experimental trails it is observed that the neural network performed better with static BE compared to using FT, GV or FT + GV as word embeddings. In comparison to fine-tuned BERT, one metric that significantly improved is specificity.
    Classification of Consumer Belief Statements From Social Media. (arXiv:2106.15498v1 [cs.LG])
    (2 min) Social media offer plenty of information to perform market research in order to meet the requirements of customers. One way how this research is conducted is that a domain expert gathers and categorizes user-generated content into a complex and fine-grained class structure. In many of such cases, little data meets complex annotations. It is not yet fully understood how this can be leveraged successfully for classification. We examine the classification accuracy of expert labels when used with a) many fine-grained classes and b) few abstract classes. For scenario b) we compare abstract class labels given by the domain expert as baseline and by automatic hierarchical clustering. We compare this to another baseline where the entire class structure is given by a completely unsupervised clustering approach. By doing so, this work can serve as an example of how complex expert annotations are potentially beneficial and can be utilized in the most optimal way for opinion mining in highly specific domains. By exploring across a range of techniques and experiments, we find that automated class abstraction approaches in particular the unsupervised approach performs remarkably well against domain expert baseline on text classification tasks. This has the potential to inspire opinion mining applications in order to support market researchers in practice and to inspire fine-grained automated content analysis on a large scale.
    On the Interaction of Belief Bias and Explanations. (arXiv:2106.15355v1 [cs.CL])
    (2 min) A myriad of explainability methods have been proposed in recent years, but there is little consensus on how to evaluate them. While automatic metrics allow for quick benchmarking, it isn't clear how such metrics reflect human interaction with explanations. Human evaluation is of paramount importance, but previous protocols fail to account for belief biases affecting human performance, which may lead to misleading conclusions. We provide an overview of belief bias, its role in human evaluation, and ideas for NLP practitioners on how to account for it. For two experimental paradigms, we present a case study of gradient-based explainability introducing simple ways to account for humans' prior beliefs: models of varying quality and adversarial examples. We show that conclusions about the highest performing methods change when introducing such controls, pointing to the importance of accounting for belief bias in evaluation.
    Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers. (arXiv:2106.15195v1 [cs.CL])
    (2 min) This paper presents the first large-scale meta-evaluation of machine translation (MT). We annotated MT evaluations conducted in 769 research papers published from 2010 to 2020. Our study shows that practices for automatic MT evaluation have dramatically changed during the past decade and follow concerning trends. An increasing number of MT evaluations exclusively rely on differences between BLEU scores to draw conclusions, without performing any kind of statistical significance testing nor human evaluation, while at least 108 metrics claiming to be better than BLEU have been proposed. MT evaluations in recent papers tend to copy and compare automatic metric scores from previous work to claim the superiority of a method or an algorithm without confirming neither exactly the same training, validating, and testing data have been used nor the metric scores are comparable. Furthermore, tools for reporting standardized metric scores are still far from being widely adopted by the MT community. After showing how the accumulation of these pitfalls leads to dubious evaluation, we propose a guideline to encourage better automatic MT evaluation along with a simple meta-evaluation scoring method to assess its credibility.
    Leveraging Static Models for Link Prediction in Temporal Knowledge Graphs. (arXiv:2106.15223v1 [cs.LG])
    (2 min) The inclusion of temporal scopes of facts in knowledge graph embedding (KGE) presents significant opportunities for improving the resulting embeddings, and consequently for increased performance in downstream applications. Yet, little research effort has focussed on this area and much of the carried out research reports only marginally improved results compared to models trained without temporal scopes (static models). Furthermore, rather than leveraging existing work on static models, they introduce new models specific to temporal knowledge graphs. We propose a novel perspective that takes advantage of the power of existing static embedding models by focussing effort on manipulating the data instead. Our method, SpliMe, draws inspiration from the field of signal processing and early work in graph embedding. We show that SpliMe competes with or outperforms the current state of the art in temporal KGE. Additionally, we uncover issues with the procedure currently used to assess the performance of static models on temporal graphs and introduce two ways to counteract them.
    Representation based meta-learning for few-shot spoken intent recognition. (arXiv:2106.15238v1 [cs.CL])
    (2 min) Spoken intent detection has become a popular approach to interface with various smart devices with ease. However, such systems are limited to the preset list of intents-terms or commands, which restricts the quick customization of personal devices to new intents. This paper presents a few-shot spoken intent classification approach with task-agnostic representations via meta-learning paradigm. Specifically, we leverage the popular representation-based meta-learning learning to build a task-agnostic representation of utterances, that then use a linear classifier for prediction. We evaluate three such approaches on our novel experimental protocol developed on two popular spoken intent classification datasets: Google Commands and the Fluent Speech Commands dataset. For a 5-shot (1-shot) classification of novel classes, the proposed framework provides an average classification accuracy of 88.6% (76.3%) on the Google Commands dataset, and 78.5% (64.2%) on the Fluent Speech Commands dataset. The performance is comparable to traditionally supervised classification models with abundant training samples.
    Rethinking the Evaluation of Neural Machine Translation. (arXiv:2106.15217v1 [cs.CL])
    (2 min) The evaluation of neural machine translation systems is usually built upon generated translation of a certain decoding method (e.g., beam search) with evaluation metrics over the generated translation (e.g., BLEU). However, this evaluation framework suffers from high search errors brought by heuristic search algorithms and is limited by its nature of evaluation over one best candidate. In this paper, we propose a novel evaluation protocol, which not only avoids the effect of search errors but provides a system-level evaluation in the perspective of model ranking. In particular, our method is based on our newly proposed exact top-$k$ decoding instead of beam search. Our approach evaluates model errors by the distance between the candidate spaces scored by the references and the model respectively. Extensive experiments on WMT'14 English-German demonstrate that bad ranking ability is connected to the well-known beam search curse, and state-of-the-art Transformer models are facing serious ranking errors. By evaluating various model architectures and techniques, we provide several interesting findings. Finally, to effectively approximate the exact search algorithm with same time cost as original beam search, we present a minimum heap augmented beam search algorithm.
    New Arabic Medical Dataset for Diseases Classification. (arXiv:2106.15236v1 [cs.CL])
    (2 min) The Arabic language suffers from a great shortage of datasets suitable for training deep learning models, and the existing ones include general non-specialized classifications. In this work, we introduce a new Arab medical dataset, which includes two thousand medical documents collected from several Arabic medical websites, in addition to the Arab Medical Encyclopedia. The dataset was built for the task of classifying texts and includes 10 classes (Blood, Bone, Cardiovascular, Ear, Endocrine, Eye, Gastrointestinal, Immune, Liver and Nephrological) diseases. Experiments on the dataset were performed by fine-tuning three pre-trained models: BERT from Google, Arabert that based on BERT with large Arabic corpus, and AraBioNER that based on Arabert with Arabic medical corpus.
    Exploring the Efficacy of Automatically Generated Counterfactuals for Sentiment Analysis. (arXiv:2106.15231v1 [cs.CL])
    (2 min) While state-of-the-art NLP models have been achieving the excellent performance of a wide range of tasks in recent years, important questions are being raised about their robustness and their underlying sensitivity to systematic biases that may exist in their training and test data. Such issues come to be manifest in performance problems when faced with out-of-distribution data in the field. One recent solution has been to use counterfactually augmented datasets in order to reduce any reliance on spurious patterns that may exist in the original data. Producing high-quality augmented data can be costly and time-consuming as it usually needs to involve human feedback and crowdsourcing efforts. In this work, we propose an alternative by describing and evaluating an approach to automatically generating counterfactual data for data augmentation and explanation. A comprehensive evaluation on several different datasets and using a variety of state-of-the-art benchmarks demonstrate how our approach can achieve significant improvements in model performance when compared to models training on the original data and even when compared to models trained with the benefit of human-generated augmented data.
    Topic-to-Essay Generation with Comprehensive Knowledge Enhancement. (arXiv:2106.15142v1 [cs.CL])
    (2 min) Generating high-quality and diverse essays with a set of topics is a challenging task in natural language generation. Since several given topics only provide limited source information, utilizing various topic-related knowledge is essential for improving essay generation performance. However, previous works cannot sufficiently use that knowledge to facilitate the generation procedure. This paper aims to improve essay generation by extracting information from both internal and external knowledge. Thus, a topic-to-essay generation model with comprehensive knowledge enhancement, named TEGKE, is proposed. For internal knowledge enhancement, both topics and related essays are fed to a teacher network as source information. Then, informative features would be obtained from the teacher network and transferred to a student network which only takes topics as input but provides comparable information compared with the teacher network. For external knowledge enhancement, a topic knowledge graph encoder is proposed. Unlike the previous works only using the nearest neighbors of topics in the commonsense base, our topic knowledge graph encoder could exploit more structural and semantic information of the commonsense knowledge graph to facilitate essay generation. Moreover, the adversarial training based on the Wasserstein distance is proposed to improve generation quality. Experimental results demonstrate that TEGKE could achieve state-of-the-art performance on both automatic and human evaluation.
    Learning from Miscellaneous Other-Class Words for Few-shot Named Entity Recognition. (arXiv:2106.15167v1 [cs.CL])
    (2 min) Few-shot Named Entity Recognition (NER) exploits only a handful of annotations to identify and classify named entity mentions. Prototypical network shows superior performance on few-shot NER. However, existing prototypical methods fail to differentiate rich semantics in other-class words, which will aggravate overfitting under few shot scenario. To address the issue, we propose a novel model, Mining Undefined Classes from Other-class (MUCO), that can automatically induce different undefined classes from the other class to improve few-shot NER. With these extra-labeled undefined classes, our method will improve the discriminative ability of NER classifier and enhance the understanding of predefined classes with stand-by semantic knowledge. Experimental results demonstrate that our model outperforms five state-of-the-art models in both 1-shot and 5-shots settings on four NER benchmarks. We will release the code upon acceptance. The source code is released on https: //github.com/shuaiwa16/OtherClassNER.git.
    GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis. (arXiv:2106.15153v1 [eess.AS])
    (2 min) Recent advances in neural multi-speaker text-to-speech (TTS) models have enabled the generation of reasonably good speech quality with a single model and made it possible to synthesize the speech of a speaker with limited training data. Fine-tuning to the target speaker data with the multi-speaker model can achieve better quality, however, there still exists a gap compared to the real speech sample and the model depends on the speaker. In this work, we propose GANSpeech, which is a high-fidelity multi-speaker TTS model that adopts the adversarial training method to a non-autoregressive multi-speaker TTS model. In addition, we propose simple but efficient automatic scaling methods for feature matching loss used in adversarial training. In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models, and showed a better MOS score than the speaker-specific fine-tuned FastSpeech2.
    Neural Machine Translation for Low-Resource Languages: A Survey. (arXiv:2106.15115v1 [cs.CL])
    (2 min) Neural Machine Translation (NMT) has seen a tremendous spurt of growth in less than ten years, and has already entered a mature phase. While considered as the most widely used solution for Machine Translation, its performance on low-resource language pairs still remains sub-optimal compared to the high-resource counterparts, due to the unavailability of large parallel corpora. Therefore, the implementation of NMT techniques for low-resource language pairs has been receiving the spotlight in the recent NMT research arena, thus leading to a substantial amount of research reported on this topic. This paper presents a detailed survey of research advancements in low-resource language NMT (LRL-NMT), along with a quantitative analysis aimed at identifying the most popular solutions. Based on our findings from reviewing previous work, this survey paper provides a set of guidelines to select the possible NMT technique for a given LRL data setting. It also presents a holistic view of the LRL-NMT research landscape and provides a list of recommendations to further enhance the research efforts on LRL-NMT.
    Language Lexicons for Hindi-English Multilingual Text Processing. (arXiv:2106.15105v1 [cs.CL])
    (2 min) Language Identification in textual documents is the process of automatically detecting the language contained in a document based on its content. The present Language Identification techniques presume that a document contains text in one of the fixed set of languages, however, this presumption is incorrect when dealing with multilingual document which includes content in more than one possible language. Due to the unavailability of large standard corpora for Hindi-English mixed lingual language processing tasks we propose the language lexicons, a novel kind of lexical database that supports several multilingual language processing tasks. These lexicons are built by learning classifiers over transliterated Hindi and English vocabulary. The designed lexicons possess richer quantitative characteristic than its primary source of collection which is revealed using the visualization techniques.
    Time-Aware Language Models as Temporal Knowledge Bases. (arXiv:2106.15110v1 [cs.CL])
    (2 min) Many facts come with an expiration date, from the name of the President to the basketball team Lebron James plays for. But language models (LMs) are trained on snapshots of data collected at a specific moment in time, and this can limit their utility, especially in the closed-book setting where the pretraining corpus must contain the facts the model should memorize. We introduce a diagnostic dataset aimed at probing LMs for factual knowledge that changes over time and highlight problems with LMs at either end of the spectrum -- those trained on specific slices of temporal data, as well as those trained on a wide range of temporal data. To mitigate these problems, we propose a simple technique for jointly modeling text with its timestamp. This improves memorization of seen facts from the training time period, as well as calibration on predictions about unseen facts from future time periods. We also show that models trained with temporal context can be efficiently ``refreshed'' as new data arrives, without the need for retraining from scratch.
    A Simple and Efficient Probabilistic Language model for Code-Mixed Text. (arXiv:2106.15102v1 [cs.CL])
    (2 min) The conventional natural language processing approaches are not accustomed to the social media text due to colloquial discourse and non-homogeneous characteristics. Significantly, the language identification in a multilingual document is ascertained to be a preceding subtask in several information extraction applications such as information retrieval, named entity recognition, relation extraction, etc. The problem is often more challenging in code-mixed documents wherein foreign languages words are drawn into base language while framing the text. The word embeddings are powerful language modeling tools for representation of text documents useful in obtaining similarity between words or documents. We present a simple probabilistic approach for building efficient word embedding for code-mixed text and exemplifying it over language identification of Hindi-English short test messages scrapped from Twitter. We examine its efficacy for the classification task using bidirectional LSTMs and SVMs and observe its improved scores over various existing code-mixed embeddings
    TWAG: A Topic-Guided Wikipedia Abstract Generator. (arXiv:2106.15135v1 [cs.CL])
    (2 min) Wikipedia abstract generation aims to distill a Wikipedia abstract from web sources and has met significant success by adopting multi-document summarization techniques. However, previous works generally view the abstract as plain text, ignoring the fact that it is a description of a certain entity and can be decomposed into different topics. In this paper, we propose a two-stage model TWAG that guides the abstract generation with topical information. First, we detect the topic of each input paragraph with a classifier trained on existing Wikipedia articles to divide input documents into different topics. Then, we predict the topic distribution of each abstract sentence, and decode the sentence from topic-aware representations with a Pointer-Generator network. We evaluate our model on the WikiCatSum dataset, and the results show that \modelnames outperforms various existing baselines and is capable of generating comprehensive abstracts. Our code and dataset can be accessed at \url{https://github.com/THU-KEG/TWAG}
    Don't Take It Literally: An Edit-Invariant Sequence Loss for Text Generation. (arXiv:2106.15078v1 [cs.CL])
    (2 min) Neural text generation models are typically trained by maximizing log-likelihood with the sequence cross entropy loss, which encourages an exact token-by-token match between a target sequence with a generated sequence. Such training objective is sub-optimal when the target sequence not perfect, e.g., when the target sequence is corrupted with noises, or when only weak sequence supervision is available. To address this challenge, we propose a novel Edit-Invariant Sequence Loss (EISL), which computes the matching loss of a target n-gram with all n-grams in the generated sequence. EISL draws inspirations from convolutional networks (ConvNets) which are shift-invariant to images, hence is robust to the shift of n-grams to tolerate edits in the target sequences. Moreover, the computation of EISL is essentially a convolution operation with target n-grams as kernels, which is easy to implement with existing libraries. To demonstrate the effectiveness of EISL, we conduct experiments on three tasks: machine translation with noisy target sequences, unsupervised text style transfer, and non-autoregressive machine translation. Experimental results show our method significantly outperforms cross entropy loss on these three tasks.
    Automatic Construction of Enterprise Knowledge Base. (arXiv:2106.15085v1 [cs.CL])
    (2 min) In this paper, we present an automatic knowledge base construction system from large scale enterprise documents with minimal efforts of human intervention. In the design and deployment of such a knowledge mining system for enterprise, we faced several challenges including data distributional shift, performance evaluation, compliance requirements and other practical issues. We leveraged state-of-the-art deep learning models to extract information (named entities and definitions) at per document level, then further applied classical machine learning techniques to process global statistical information to improve the knowledge base. Experimental results are reported on actual enterprise documents. This system is currently serving as part of a Microsoft 365 service.
    Sexism in the Judiciary. (arXiv:2106.15103v1 [cs.CL])
    (2 min) We analyze 6.7 million case law documents to determine the presence of gender bias within our judicial system. We find that current bias detectino methods in NLP are insufficient to determine gender bias in our case law database and propose an alternative approach. We show that existing algorithms' inconsistent results are consequences of prior research's definition of biases themselves. Bias detection algorithms rely on groups of words to represent bias (e.g., 'salary,' 'job,' and 'boss' to represent employment as a potentially biased theme against women in text). However, the methods to build these groups of words have several weaknesses, primarily that the word lists are based on the researchers' own intuitions. We suggest two new methods of automating the creation of word lists to represent biases. We find that our methods outperform current NLP bias detection methods. Our research improves the capabilities of NLP technology to detect bias and highlights gender biases present in influential case law. In order test our NLP bias detection method's performance, we regress our results of bias in case law against U.S census data of women's participation in the workforce in the last 100 years.
    Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding. (arXiv:2106.15065v1 [cs.CL])
    (2 min) Decomposable tasks are complex and comprise of a hierarchy of sub-tasks. Spoken intent prediction, for example, combines automatic speech recognition and natural language understanding. Existing benchmarks, however, typically hold out examples for only the surface-level sub-task. As a result, models with similar performance on these benchmarks may have unobserved performance differences on the other sub-tasks. To allow insightful comparisons between competitive end-to-end architectures, we propose a framework to construct robust test sets using coordinate ascent over sub-task specific utility functions. Given a dataset for a decomposable task, our method optimally creates a test set for each sub-task to individually assess sub-components of the end-to-end model. Using spoken language understanding as a case study, we generate new splits for the Fluent Speech Commands and Snips SmartLights datasets. Each split has two test sets: one with held-out utterances assessing natural language understanding abilities, and one with held-out speakers to test speech processing skills. Our splits identify performance gaps up to 10% between end-to-end systems that were within 1% of each other on the original test sets. These performance gaps allow more realistic and actionable comparisons between different architectures, driving future model development. We release our splits and tools for the community.
    Overview of BioASQ 2021: The ninth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering. (arXiv:2106.14885v1 [cs.CL])
    (2 min) Advancing the state-of-the-art in large-scale biomedical semantic indexing and question answering is the main focus of the BioASQ challenge. BioASQ organizes respective tasks where different teams develop systems that are evaluated on the same benchmark datasets that represent the real information needs of experts in the biomedical domain. This paper presents an overview of the ninth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2021. In this year, a new question answering task, named Synergy, is introduced to support researchers studying the COVID-19 disease and measure the ability of the participating teams to discern information while the problem is still developing. In total, 42 teams with more than 170 systems were registered to participate in the four tasks of the challenge. The evaluation results, similarly to previous years, show a performance gain against the baselines which indicates the continuous improvement of the state-of-the-art in this field.
  • cs.CV updates on arXiv.org

    AdderNet: Do We Really Need Multiplications in Deep Learning?. (arXiv:1912.13200v5 [cs.CV] UPDATED)
    (2 min) Compared with cheap addition operation, multiplication operation is of much higher computation complexity. The widely-used convolutions in deep neural networks are exactly cross-correlation to measure the similarity between input feature and convolution filters, which involves massive multiplications between float values. In this paper, we present adder networks (AdderNets) to trade these massive multiplications in deep neural networks, especially convolutional neural networks (CNNs), for much cheaper additions to reduce computation costs. In AdderNets, we take the $\ell_1$-norm distance between filters and input feature as the output response. The influence of this new similarity measure on the optimization of neural network have been thoroughly analyzed. To achieve a better performance, we develop a special back-propagation approach for AdderNets by investigating the full-precision gradient. We then propose an adaptive learning rate strategy to enhance the training procedure of AdderNets according to the magnitude of each neuron's gradient. As a result, the proposed AdderNets can achieve 74.9% Top-1 accuracy 91.7% Top-5 accuracy using ResNet-50 on the ImageNet dataset without any multiplication in convolution layer. The codes are publicly available at: https://github.com/huaweinoah/AdderNet.
    Patch-Based Image Restoration using Expectation Propagation. (arXiv:2106.15327v1 [cs.CV])
    (2 min) This paper presents a new Expectation Propagation (EP) framework for image restoration using patch-based prior distributions. While Monte Carlo techniques are classically used to sample from intractable posterior distributions, they can suffer from scalability issues in high-dimensional inference problems such as image restoration. To address this issue, EP is used here to approximate the posterior distributions using products of multivariate Gaussian densities. Moreover, imposing structural constraints on the covariance matrices of these densities allows for greater scalability and distributed computation. While the method is naturally suited to handle additive Gaussian observation noise, it can also be extended to non-Gaussian noise. Experiments conducted for denoising, inpainting and deconvolution problems with Gaussian and Poisson noise illustrate the potential benefits of such flexible approximate Bayesian method for uncertainty quantification in imaging problems, at a reduced computational cost compared to sampling techniques.
    How to Reach Real-Time AI on Consumer Devices? Solutions for Programmable and Custom Architectures. (arXiv:2106.15021v1 [cs.LG])
    (2 min) The unprecedented performance of deep neural networks (DNNs) has led to large strides in various Artificial Intelligence (AI) inference tasks, such as object and speech recognition. Nevertheless, deploying such AI models across commodity devices faces significant challenges: large computational cost, multiple performance objectives, hardware heterogeneity and a common need for high accuracy, together pose critical problems to the deployment of DNNs across the various embedded and mobile devices in the wild. As such, we have yet to witness the mainstream usage of state-of-the-art deep learning algorithms across consumer devices. In this paper, we provide preliminary answers to this potentially game-changing question by presenting an array of design techniques for efficient AI systems. We start by examining the major roadblocks when targeting both programmable processors and custom accelerators. Then, we present diverse methods for achieving real-time performance following a cross-stack approach. These span model-, system- and hardware-level techniques, and their combination. Our findings provide illustrative examples of AI systems that do not overburden mobile hardware, while also indicating how they can improve inference accuracy. Moreover, we showcase how custom ASIC- and FPGA-based accelerators can be an enabling factor for next-generation AI applications, such as multi-DNN systems. Collectively, these results highlight the critical need for further exploration as to how the various cross-stack solutions can be best combined in order to bring the latest advances in deep learning close to users, in a robust and efficient manner.
    Achieving Real-Time Object Detection on MobileDevices with Neural Pruning Search. (arXiv:2106.14943v1 [cs.CV])
    (2 min) Object detection plays an important role in self-driving cars for security development. However, mobile systems on self-driving cars with limited computation resources lead to difficulties for object detection. To facilitate this, we propose a compiler-aware neural pruning search framework to achieve high-speed inference on autonomous vehicles for 2D and 3D object detection. The framework automatically searches the pruning scheme and rate for each layer to find a best-suited pruning for optimizing detection accuracy and speed performance under compiler optimization. Our experiments demonstrate that for the first time, the proposed method achieves (close-to) real-time, 55ms and 99ms inference times for YOLOv4 based 2D object detection and PointPillars based 3D detection, respectively, on an off-the-shelf mobile phone with minor (or no) accuracy loss.
    Open-Set Representation Learning through Combinatorial Embedding. (arXiv:2106.15278v1 [cs.CV])
    (2 min) Visual recognition tasks are often limited to dealing with a small subset of classes simply because the labels for the remaining classes are unavailable. We are interested in identifying novel concepts in a dataset through representation learning based on the examples in both labeled and unlabeled classes, and extending the horizon of recognition to both known and novel classes. To address this challenging task, we propose a combinatorial learning approach, which naturally clusters the examples in unseen classes using the compositional knowledge given by multiple supervised meta-classifiers on heterogeneous label spaces. We also introduce a metric learning strategy to estimate pairwise pseudo-labels for improving representations of unlabeled examples, which preserves semantic relations across known and novel classes effectively. The proposed algorithm discovers novel concepts via a joint optimization of enhancing the discrimitiveness of unseen classes as well as learning the representations of known classes generalizable to novel ones. Our extensive experiments demonstrate remarkable performance gains by the proposed approach in multiple image retrieval and novel class discovery benchmarks.
    TNCR: Table Net Detection and Classification Dataset. (arXiv:2106.15322v1 [cs.CV])
    (2 min) We present TNCR, a new table dataset with varying image quality collected from free websites. The TNCR dataset can be used for table detection in scanned document images and their classification into 5 different classes. TNCR contains 9428 high-quality labeled images. In this paper, we have implemented state-of-the-art deep learning-based methods for table detection to create several strong baselines. Cascade Mask R-CNN with ResNeXt-101-64x4d Backbone Network achieves the highest performance compared to other methods with a precision of 79.7%, recall of 89.8%, and f1 score of 84.4% on the TNCR dataset. We have made TNCR open source in the hope of encouraging more deep learning approaches to table detection, classification, and structure recognition. The dataset and trained model checkpoints are available at https://github.com/abdoelsayed2016/TNCR_Dataset.
    An Image is Worth More Than a Thousand Words: Towards Disentanglement in the Wild. (arXiv:2106.15610v1 [cs.CV])
    (2 min) Unsupervised disentanglement has been shown to be theoretically impossible without inductive biases on the models and the data. As an alternative approach, recent methods rely on limited supervision to disentangle the factors of variation and allow their identifiability. While annotating the true generative factors is only required for a limited number of observations, we argue that it is infeasible to enumerate all the factors of variation that describe a real-world image distribution. To this end, we propose a method for disentangling a set of factors which are only partially labeled, as well as separating the complementary set of residual factors that are never explicitly specified. Our success in this challenging setting, demonstrated on synthetic benchmarks, gives rise to leveraging off-the-shelf image descriptors to partially annotate a subset of attributes in real image domains (e.g. of human faces) with minimal manual effort. Specifically, we use a recent language-image embedding model (CLIP) to annotate a set of attributes of interest in a zero-shot manner and demonstrate state-of-the-art disentangled image manipulation results.
    Face Sketch Synthesis via Semantic-Driven Generative Adversarial Network. (arXiv:2106.15121v1 [cs.CV])
    (2 min) Face sketch synthesis has made significant progress with the development of deep neural networks in these years. The delicate depiction of sketch portraits facilitates a wide range of applications like digital entertainment and law enforcement. However, accurate and realistic face sketch generation is still a challenging task due to the illumination variations and complex backgrounds in the real scenes. To tackle these challenges, we propose a novel Semantic-Driven Generative Adversarial Network (SDGAN) which embeds global structure-level style injection and local class-level knowledge re-weighting. Specifically, we conduct facial saliency detection on the input face photos to provide overall facial texture structure, which could be used as a global type of prior information. In addition, we exploit face parsing layouts as the semantic-level spatial prior to enforce globally structural style injection in the generator of SDGAN. Furthermore, to enhance the realistic effect of the details, we propose a novel Adaptive Re-weighting Loss (ARLoss) which dedicates to balance the contributions of different semantic classes. Experimentally, our extensive experiments on CUFS and CUFSF datasets show that our proposed algorithm achieves state-of-the-art performance.
    Xihe: A 3D Vision-based Lighting Estimation Framework for Mobile Augmented Reality. (arXiv:2106.15280v1 [cs.CV])
    (2 min) Omnidirectional lighting provides the foundation for achieving spatially-variant photorealistic 3D rendering, a desirable property for mobile augmented reality applications. However, in practice, estimating omnidirectional lighting can be challenging due to limitations such as partial panoramas of the rendering positions, and the inherent environment lighting and mobile user dynamics. A new opportunity arises recently with the advancements in mobile 3D vision, including built-in high-accuracy depth sensors and deep learning-powered algorithms, which provide the means to better sense and understand the physical surroundings. Centering the key idea of 3D vision, in this work, we design an edge-assisted framework called Xihe to provide mobile AR applications the ability to obtain accurate omnidirectional lighting estimation in real time. Specifically, we develop a novel sampling technique that efficiently compresses the raw point cloud input generated at the mobile device. This technique is derived based on our empirical analysis of a recent 3D indoor dataset and plays a key role in our 3D vision-based lighting estimator pipeline design. To achieve the real-time goal, we develop a tailored GPU pipeline for on-device point cloud processing and use an encoding technique that reduces network transmitted bytes. Finally, we present an adaptive triggering strategy that allows Xihe to skip unnecessary lighting estimations and a practical way to provide temporal coherent rendering integration with the mobile AR ecosystem. We evaluate both the lighting estimation accuracy and time of Xihe using a reference mobile application developed with Xihe's APIs. Our results show that Xihe takes as fast as 20.67ms per lighting estimation and achieves 9.4% better estimation accuracy than a state-of-the-art neural network.
    An End-to-End Autofocus Camera for Iris on the Move. (arXiv:2106.15069v1 [cs.CV])
    (2 min) For distant iris recognition, a long focal length lens is generally used to ensure the resolution ofiris images, which reduces the depth of field and leads to potential defocus blur. To accommodate users at different distances, it is necessary to control focus quickly and accurately. While for users in motion, it is expected to maintain the correct focus on the iris area continuously. In this paper, we introduced a novel rapid autofocus camera for active refocusing ofthe iris area ofthe moving objects using a focus-tunable lens. Our end-to-end computational algorithm can predict the best focus position from one single blurred image and generate a lens diopter control signal automatically. This scene-based active manipulation method enables real-time focus tracking of the iris area ofa moving object. We built a testing bench to collect real-world focal stacks for evaluation of the autofocus methods. Our camera has reached an autofocus speed ofover 50 fps. The results demonstrate the advantages of our proposed camera for biometric perception in static and dynamic scenes. The code is available at https://github.com/Debatrix/AquulaCam.
    VolterraNet: A higher order convolutional network with group equivariance for homogeneous manifolds. (arXiv:2106.15301v1 [cs.CV])
    (2 min) Convolutional neural networks have been highly successful in image-based learning tasks due to their translation equivariance property. Recent work has generalized the traditional convolutional layer of a convolutional neural network to non-Euclidean spaces and shown group equivariance of the generalized convolution operation. In this paper, we present a novel higher order Volterra convolutional neural network (VolterraNet) for data defined as samples of functions on Riemannian homogeneous spaces. Analagous to the result for traditional convolutions, we prove that the Volterra functional convolutions are equivariant to the action of the isometry group admitted by the Riemannian homogeneous spaces, and under some restrictions, any non-linear equivariant function can be expressed as our homogeneous space Volterra convolution, generalizing the non-linear shift equivariant characterization of Volterra expansions in Euclidean space. We also prove that second order functional convolution operations can be represented as cascaded convolutions which leads to an efficient implementation. Beyond this, we also propose a dilated VolterraNet model. These advances lead to large parameter reductions relative to baseline non-Euclidean CNNs. To demonstrate the efficacy of the VolterraNet performance, we present several real data experiments involving classification tasks on spherical-MNIST, atomic energy, Shrec17 data sets, and group testing on diffusion MRI data. Performance comparisons to the state-of-the-art are also presented.
    Perception-aware Multi-sensor Fusion for 3D LiDAR Semantic Segmentation. (arXiv:2106.15277v1 [cs.CV])
    (2 min) 3D LiDAR (light detection and ranging) based semantic segmentation is important in scene understanding for many applications, such as auto-driving and robotics. For example, for autonomous cars equipped with RGB cameras and LiDAR, it is crucial to fuse complementary information from different sensors for robust and accurate segmentation. Existing fusion-based methods, however, may not achieve promising performance due to the vast difference between two modalities. In this work, we investigate a collaborative fusion scheme called perception-aware multi-sensor fusion (PMF) to exploit perceptual information from two modalities, namely, appearance information from RGB images and spatio-depth information from point clouds. To this end, we first project point clouds to the camera coordinates to provide spatio-depth information for RGB images. Then, we propose a two-stream network to extract features from the two modalities, separately, and fuse the features by effective residual-based fusion modules. Moreover, we propose additional perception-aware losses to measure the great perceptual difference between the two modalities. Extensive experiments on two benchmark data sets show the superiority of our method. For example, on nuScenes, our PMF outperforms the state-of-the-art method by 0.8% in mIoU.
    Spatio-Temporal Context for Action Detection. (arXiv:2106.15171v1 [cs.CV])
    (2 min) Research in action detection has grown in the recentyears, as it plays a key role in video understanding. Modelling the interactions (either spatial or temporal) between actors and their context has proven to be essential for this task. While recent works use spatial features with aggregated temporal information, this work proposes to use non-aggregated temporal information. This is done by adding an attention based method that leverages spatio-temporal interactions between elements in the scene along the clip.The main contribution of this work is the introduction of two cross attention blocks to effectively model the spatial relations and capture short range temporal interactions.Experiments on the AVA dataset show the advantages of the proposed approach that models spatio-temporal relations between relevant elements in the scene, outperforming other methods that model actor interactions with their context by +0.31 mAP.
    Understanding Cognitive Fatigue from fMRI Scans with Self-supervised Learning. (arXiv:2106.15009v1 [cs.CV])
    (2 min) Functional magnetic resonance imaging (fMRI) is a neuroimaging technique that records neural activations in the brain by capturing the blood oxygen level in different regions based on the task performed by a subject. Given fMRI data, the problem of predicting the state of cognitive fatigue in a person has not been investigated to its full extent. This paper proposes tackling this issue as a multi-class classification problem by dividing the state of cognitive fatigue into six different levels, ranging from no-fatigue to extreme fatigue conditions. We built a spatio-temporal model that uses convolutional neural networks (CNN) for spatial feature extraction and a long short-term memory (LSTM) network for temporal modeling of 4D fMRI scans. We also applied a self-supervised method called MoCo to pre-train our model on a public dataset BOLD5000 and fine-tuned it on our labeled dataset to classify cognitive fatigue. Our novel dataset contains fMRI scans from Traumatic Brain Injury (TBI) patients and healthy controls (HCs) while performing a series of cognitive tasks. This method establishes a state-of-the-art technique to analyze cognitive fatigue from fMRI data and beats previous approaches to solve this problem.
    Deep Learning for Face Anti-Spoofing: A Survey. (arXiv:2106.14948v1 [cs.CV])
    (2 min) Face anti-spoofing (FAS) has lately attracted increasing attention due to its vital role in securing face recognition systems from presentation attacks (PAs). As more and more realistic PAs with novel types spring up, traditional FAS methods based on handcrafted features become unreliable due to their limited representation capacity. With the emergence of large-scale academic datasets in the recent decade, deep learning based FAS achieves remarkable performance and dominates this area. However, existing reviews in this field mainly focus on the handcrafted features, which are outdated and uninspiring for the progress of FAS community. In this paper, to stimulate future research, we present the first comprehensive review of recent advances in deep learning based FAS. It covers several novel and insightful components: 1) besides supervision with binary label (e.g., '0' for bonafide vs. '1' for PAs), we also investigate recent methods with pixel-wise supervision (e.g., pseudo depth map); 2) in addition to traditional intra-dataset evaluation, we collect and analyze the latest methods specially designed for domain generalization and open-set FAS; and 3) besides commercial RGB camera, we summarize the deep learning applications under multi-modal (e.g., depth and infrared) or specialized (e.g., light field and flash) sensors. We conclude this survey by emphasizing current open issues and highlighting potential prospects.
    An Efficient Cervical Whole Slide Image Analysis Framework Based on Multi-scale Semantic and Spatial Features using Deep Learning. (arXiv:2106.15113v1 [cs.CV])
    (2 min) Digital gigapixel whole slide image (WSI) is widely used in clinical diagnosis, and automated WSI analysis is key for computer-aided diagnosis. Currently, analyzing the integrated descriptor of probabilities or feature maps from massive local patches encoded by ResNet classifier is the main manner for WSI-level prediction. Feature representations of the sparse and tiny lesion cells in cervical slides, however, are still challengeable for the under-promoted upstream encoders, while the unused spatial representations of cervical cells are the available features to supply the semantics analysis. As well as patches sampling with overlap and repetitive processing incur the inefficiency and the unpredictable side effect. This study designs a novel inline connection network (InCNet) by enriching the multi-scale connectivity to build the lightweight model named You Only Look Cytopathology Once (YOLCO) with the additional supervision of spatial information. The proposed model allows the input size enlarged to megapixel that can stitch the WSI without any overlap by the average repeats decreased from $10^3\sim10^4$ to $10^1\sim10^2$ for collecting features and predictions at two scales. Based on Transformer for classifying the integrated multi-scale multi-task features, the experimental results appear $0.872$ AUC score better and $2.51\times$ faster than the best conventional method in WSI classification on multicohort datasets of 2,019 slides from four scanning devices.
    Constructing Stronger and Faster Baselines for Skeleton-based Action Recognition. (arXiv:2106.15125v1 [cs.CV])
    (2 min) One essential problem in skeleton-based action recognition is how to extract discriminative features over all skeleton joints. However, the complexity of the recent State-Of-The-Art (SOTA) models for this task tends to be exceedingly sophisticated and over-parameterized. The low efficiency in model training and inference has increased the validation costs of model architectures in large-scale datasets. To address the above issue, recent advanced separable convolutional layers are embedded into an early fused Multiple Input Branches (MIB) network, constructing an efficient Graph Convolutional Network (GCN) baseline for skeleton-based action recognition. In addition, based on such the baseline, we design a compound scaling strategy to expand the model's width and depth synchronously, and eventually obtain a family of efficient GCN baselines with high accuracies and small amounts of trainable parameters, termed EfficientGCN-Bx, where ''x'' denotes the scaling coefficient. On two large-scale datasets, i.e., NTU RGB+D 60 and 120, the proposed EfficientGCN-B4 baseline outperforms other SOTA methods, e.g., achieving 91.7% accuracy on the cross-subject benchmark of NTU 60 dataset, while being 3.15x smaller and 3.21x faster than MS-G3D, which is one of the best SOTA methods. The source code in PyTorch version and the pretrained models are available at https://github.com/yfsong0709/EfficientGCNv1.
    Uncertainty-Guided Progressive GANs for Medical Image Translation. (arXiv:2106.15542v1 [cs.CV])
    (2 min) Image-to-image translation plays a vital role in tackling various medical imaging tasks such as attenuation correction, motion correction, undersampled reconstruction, and denoising. Generative adversarial networks have been shown to achieve the state-of-the-art in generating high fidelity images for these tasks. However, the state-of-the-art GAN-based frameworks do not estimate the uncertainty in the predictions made by the network that is essential for making informed medical decisions and subsequent revision by medical experts and has recently been shown to improve the performance and interpretability of the model. In this work, we propose an uncertainty-guided progressive learning scheme for image-to-image translation. By incorporating aleatoric uncertainty as attention maps for GANs trained in a progressive manner, we generate images of increasing fidelity progressively. We demonstrate the efficacy of our model on three challenging medical image translation tasks, including PET to CT translation, undersampled MRI reconstruction, and MRI motion artefact correction. Our model generalizes well in three different tasks and improves performance over state of the art under full-supervision and weak-supervision with limited data. Code is released here: https://github.com/ExplainableML/UncerGuidedI2I
    Contrastive Semantic Similarity Learning for Image Captioning Evaluation with Intrinsic Auto-encoder. (arXiv:2106.15312v1 [cs.CV])
    (2 min) Automatically evaluating the quality of image captions can be very challenging since human language is quite flexible that there can be various expressions for the same meaning. Most of the current captioning metrics rely on token level matching between candidate caption and the ground truth label sentences. It usually neglects the sentence-level information. Motivated by the auto-encoder mechanism and contrastive representation learning advances, we propose a learning-based metric for image captioning, which we call Intrinsic Image Captioning Evaluation($I^2CE$). We develop three progressive model structures to learn the sentence level representations--single branch model, dual branches model, and triple branches model. Our empirical tests show that $I^2CE$ trained with dual branches structure achieves better consistency with human judgments to contemporary image captioning evaluation metrics. Furthermore, We select several state-of-the-art image captioning models and test their performances on the MS COCO dataset concerning both contemporary metrics and the proposed $I^2CE$. Experiment results show that our proposed method can align well with the scores generated from other contemporary metrics. On this concern, the proposed metric could serve as a novel indicator of the intrinsic information between captions, which may be complementary to the existing ones.
    Tackling Catastrophic Forgetting and Background Shift in Continual Semantic Segmentation. (arXiv:2106.15287v1 [cs.CV])
    (2 min) Deep learning approaches are nowadays ubiquitously used to tackle computer vision tasks such as semantic segmentation, requiring large datasets and substantial computational power. Continual learning for semantic segmentation (CSS) is an emerging trend that consists in updating an old model by sequentially adding new classes. However, continual learning methods are usually prone to catastrophic forgetting. This issue is further aggravated in CSS where, at each step, old classes from previous iterations are collapsed into the background. In this paper, we propose Local POD, a multi-scale pooling distillation scheme that preserves long- and short-range spatial relationships at feature level. Furthermore, we design an entropy-based pseudo-labelling of the background w.r.t. classes predicted by the old model to deal with background shift and avoid catastrophic forgetting of the old classes. Finally, we introduce a novel rehearsal method that is particularly suited for segmentation. Our approach, called PLOP, significantly outperforms state-of-the-art methods in existing CSS scenarios, as well as in newly proposed challenging benchmarks.
    Evaluating Deep Neural Networks for Image Document Enhancement. (arXiv:2106.15286v1 [cs.CV])
    (2 min) This work evaluates six state-of-the-art deep neural network (DNN) architectures applied to the problem of enhancing camera-captured document images. The results from each network were evaluated both qualitatively and quantitatively using Image Quality Assessment (IQA) metrics, and also compared with an existing approach based on traditional computer vision techniques. The best performing architectures generally produced good enhancement compared to the existing algorithm, showing that it is possible to use DNNs for document image enhancement. Furthermore, the best performing architectures could work as a baseline for future investigations on document enhancement using deep learning techniques. The main contributions of this paper are: a baseline of deep learning techniques that can be further improved to provide better results, and a evaluation methodology using IQA metrics for quantitatively comparing the produced images from the neural networks to a ground truth.
    Soft Attention: Does it Actually Help to Learn Social Interactions in Pedestrian Trajectory Prediction?. (arXiv:2106.15321v1 [cs.CV])
    (2 min) We consider the problem of predicting the future path of a pedestrian using its motion history and the motion history of the surrounding pedestrians, called social information. Since the seminal paper on Social-LSTM, deep-learning has become the main tool used to model the impact of social interactions on a pedestrian's motion. The demonstration that these models can learn social interactions relies on an ablative study of these models. The models are compared with and without their social interactions module on two standard metrics, the Average Displacement Error and Final Displacement Error. Yet, these complex models were recently outperformed by a simple constant-velocity approach. This questions if they actually allow to model social interactions as well as the validity of the proof. In this paper, we focus on the deep-learning models with a soft-attention mechanism for social interaction modeling and study whether they use social information at prediction time. We conduct two experiments across four state-of-the-art approaches on the ETH and UCY datasets, which were also used in previous work. First, the models are trained by replacing the social information with random noise and compared to model trained with actual social information. Second, we use a gating mechanism along with a $L_0$ penalty, allowing models to shut down their inner components. The models consistently learn to prune their soft-attention mechanism. For both experiments, neither the course of the convergence nor the prediction performance were altered. This demonstrates that the soft-attention mechanism and therefore the social information are ignored by the models.
    Convolutional Sparse Coding Fast Approximation with Application to Seismic Reflectivity Estimation. (arXiv:2106.15296v1 [cs.LG])
    (2 min) In sparse coding, we attempt to extract features of input vectors, assuming that the data is inherently structured as a sparse superposition of basic building blocks. Similarly, neural networks perform a given task by learning features of the training data set. Recently both data-driven and model-driven feature extracting methods have become extremely popular and have achieved remarkable results. Nevertheless, practical implementations are often too slow to be employed in real-life scenarios, especially for real-time applications. We propose a speed-up upgraded version of the classic iterative thresholding algorithm, that produces a good approximation of the convolutional sparse code within 2-5 iterations. The speed advantage is gained mostly from the observation that most solvers are slowed down by inefficient global thresholding. The main idea is to normalize each data point by the local receptive field energy, before applying a threshold. This way, the natural inclination towards strong feature expressions is suppressed, so that one can rely on a global threshold that can be easily approximated, or learned during training. The proposed algorithm can be employed with a known predetermined dictionary, or with a trained dictionary. The trained version is implemented as a neural net designed as the unfolding of the proposed solver. The performance of the proposed solution is demonstrated via the seismic inversion problem in both synthetic and real data scenarios. We also provide theoretical guarantees for a stable support recovery. Namely, we prove that under certain conditions the true support is perfectly recovered within the first iteration.
    SDL: New data generation tools for full-level annotated document layout. (arXiv:2106.15117v1 [cs.CV])
    (2 min) We present a novel data generation tool for document processing. The tool focuses on providing a maximal level of visual information in a normal type document, ranging from character position to paragraph-level position. It also enables working with a large dataset on low-resource languages as well as providing a mean of processing thorough full-level information of the documented text. The data generation tools come with a dataset of 320000 Vietnamese synthetic document images and an instruction to generate a dataset of similar size in other languages. The repository can be found at: https://github.com/tson1997/SDL-Document-Image-Generation
    The U-Net based GLOW for Optical-Flow-free Video Interframe Generation. (arXiv:2103.09576v3 [cs.CV] UPDATED)
    (2 min) Video frame interpolation is the task of creating an interframe between two adjacent frames along the time axis. So, instead of simply averaging two adjacent frames to create an intermediate image, this operation should maintain semantic continuity with the adjacent frames. Most conventional methods use optical flow, and various tools such as occlusion handling and object smoothing are indispensable. Since the use of these various tools leads to complex problems, we tried to tackle the video interframe generation problem without using problematic optical flow . To enable this , we have tried to use a deep neural network with an invertible structure, and developed an U-Net based Generative Flow which is a modified normalizing flow. In addition, we propose a learning method with a new consistency loss in the latent space to maintain semantic temporal consistency between frames. The resolution of the generated image is guaranteed to be identical to that of the original images by using an invertible network. Furthermore, as it is not a random image like the ones by generative models, our network guarantees stable outputs without flicker. Through experiments, we \sam {confirmed the feasibility of the proposed algorithm and would like to suggest the U-Net based Generative Flow as a new possibility for baseline in video frame interpolation. This paper is meaningful in that it is the world's first attempt to use invertible networks instead of optical flows for video interpolation.
    AutoNovel: Automatically Discovering and Learning Novel Visual Categories. (arXiv:2106.15252v1 [cs.CV])
    (2 min) We tackle the problem of discovering novel classes in an image collection given labelled examples of other classes. We present a new approach called AutoNovel to address this problem by combining three ideas: (1) we suggest that the common approach of bootstrapping an image representation using the labelled data only introduces an unwanted bias, and that this can be avoided by using self-supervised learning to train the representation from scratch on the union of labelled and unlabelled data; (2) we use ranking statistics to transfer the model's knowledge of the labelled classes to the problem of clustering the unlabelled images; and, (3) we train the data representation by optimizing a joint objective function on the labelled and unlabelled subsets of the data, improving both the supervised classification of the labelled data, and the clustering of the unlabelled data. Moreover, we propose a method to estimate the number of classes for the case where the number of new categories is not known a priori. We evaluate AutoNovel on standard classification benchmarks and substantially outperform current methods for novel category discovery. In addition, we also show that AutoNovel can be used for fully unsupervised image clustering, achieving promising results.
    Multi-stage Optimization based Adversarial Training. (arXiv:2106.15357v1 [cs.LG])
    (2 min) In the field of adversarial robustness, there is a common practice that adopts the single-step adversarial training for quickly developing adversarially robust models. However, the single-step adversarial training is most likely to cause catastrophic overfitting, as after a few training epochs it will be hard to generate strong adversarial examples to continuously boost the adversarial robustness. In this work, we aim to avoid the catastrophic overfitting by introducing multi-step adversarial examples during the single-step adversarial training. Then, to balance the large training overhead of generating multi-step adversarial examples, we propose a Multi-stage Optimization based Adversarial Training (MOAT) method that periodically trains the model on mixed benign examples, single-step adversarial examples, and multi-step adversarial examples stage by stage. In this way, the overall training overhead is reduced significantly, meanwhile, the model could avoid catastrophic overfitting. Extensive experiments on CIFAR-10 and CIFAR-100 datasets demonstrate that under similar amount of training overhead, the proposed MOAT exhibits better robustness than either single-step or multi-step adversarial training methods.
    How Does Heterogeneous Label Noise Impact Generalization in Neural Nets?. (arXiv:2106.15475v1 [cs.CV])
    (2 min) Incorrectly labeled examples, or label noise, is common in real-world computer vision datasets. While the impact of label noise on learning in deep neural networks has been studied in prior work, these studies have exclusively focused on homogeneous label noise, i.e., the degree of label noise is the same across all categories. However, in the real-world, label noise is often heterogeneous, with some categories being affected to a greater extent than others. Here, we address this gap in the literature. We hypothesized that heterogeneous label noise would only affect the classes that had label noise unless there was transfer from those classes to the classes without label noise. To test this hypothesis, we designed a series of computer vision studies using MNIST, CIFAR-10, CIFAR-100, and MS-COCO where we imposed heterogeneous label noise during the training of multi-class, multi-task, and multi-label systems. Our results provide evidence in support of our hypothesis: label noise only affects the class affected by it unless there is transfer.
    Efficient Realistic Data Generation Framework leveraging Deep Learning-based Human Digitization. (arXiv:2106.15409v1 [cs.CV])
    (2 min) The performance of supervised deep learning algorithms depends significantly on the scale, quality and diversity of the data used for their training. Collecting and manually annotating large amount of data can be both time-consuming and costly tasks to perform. In the case of tasks related to visual human-centric perception, the collection and distribution of such data may also face restrictions due to legislation regarding privacy. In addition, the design and testing of complex systems, e.g., robots, which often employ deep learning-based perception models, may face severe difficulties as even state-of-the-art methods trained on real and large-scale datasets cannot always perform adequately as they have not adapted to the visual differences between the virtual and the real world data. As an attempt to tackle and mitigate the effect of these issues, we present a method that automatically generates realistic synthetic data with annotations for a) person detection, b) face recognition, and c) human pose estimation. The proposed method takes as input real background images and populates them with human figures in various poses. Instead of using hand-made 3D human models, we propose the use of models generated through deep learning methods, further reducing the dataset creation costs, while maintaining a high level of realism. In addition, we provide open-source and easy to use tools that implement the proposed pipeline, allowing for generating highly-realistic synthetic datasets for a variety of tasks. A benchmarking and evaluation in the corresponding tasks shows that synthetic data can be effectively used as a supplement to real data.
    Variational Image Restoration Network. (arXiv:2008.10796v2 [eess.IV] UPDATED)
    (2 min) Deep neural networks (DNNs) have achieved significant success in image restoration tasks by directly learning a powerful non-linear mapping from corrupted images to their latent clean ones. However, there still exist two major limitations for these deep learning (DL)-based methods. Firstly, the noises contained in real corrupted images are very complex, usually neglected and largely under-estimated in most current methods. Secondly, existing DL methods are mostly trained on one pre-assumed degradation process for all of the training image pairs, such as the widely used bicubic downsampling assumption in the image super-resolution task, inevitably leading to poor generalization performance when the true degradation does not match with such assumed one. To address these issues, we propose a unified generative model for the image restoration, which elaborately configures the degradation process from the latent clean image to the observed corrupted one. Specifically, different from most of current methods, the pixel-wisely non-i.i.d. Gaussian distribution, being with more flexibility, is adopted in our method to fit the complex real noises. Furthermore, the method is built on the general image degradation process, making it capable of adapting diverse degradations under one single model. Besides, we design a variational inference algorithm to learn all parameters involved in the proposed model with explicit form of objective loss. Specifically, beyond traditional variational methodology, two DNNs are employed to parameterize the posteriori distributions, one to infer the distribution of the latent clean image, and another to infer the distribution of the image noise. Extensive experiments demonstrate the superiority of the proposed method on three classical image restoration tasks, including image denoising, image super-resolution and JPEG image deblocking.
    A Novel lightweight Convolutional Neural Network, ExquisiteNetV2. (arXiv:2105.09008v3 [cs.CV] UPDATED)
    (2 min) In the paper of ExquisiteNetV1, the ability of classification of ExquisiteNetV1 is worse than DenseNet. In this article, we propose a faster and better model ExquisiteNetV2. We conduct many experiments to evaluate its performance. We test ExquisiteNetV2, ExquisiteNetV1 and other 9 well-known models on 15 credible datasets under the same condition. According to the experimental results, ExquisiteNetV2 gets the highest classification accuracy over half of the datasets. Important of all, ExquisiteNetV2 has fewest amounts of parameters. Besides, in most instances, ExquisiteNetV2 has fastest computing speed.
    Information-Theoretic Segmentation by Inpainting Error Maximization. (arXiv:2012.07287v3 [cs.CV] UPDATED)
    (2 min) We study image segmentation from an information-theoretic perspective, proposing a novel adversarial method that performs unsupervised segmentation by partitioning images into maximally independent sets. More specifically, we group image pixels into foreground and background, with the goal of minimizing predictability of one set from the other. An easily computed loss drives a greedy search process to maximize inpainting error over these partitions. Our method does not involve training deep networks, is computationally cheap, class-agnostic, and even applicable in isolation to a single unlabeled image. Experiments demonstrate that it achieves a new state-of-the-art in unsupervised segmentation quality, while being substantially faster and more general than competing approaches.
    Winner Team Mia at TextVQA Challenge 2021: Vision-and-Language Representation Learning with Pre-trained Sequence-to-Sequence Model. (arXiv:2106.15332v1 [cs.CV])
    (2 min) TextVQA requires models to read and reason about text in images to answer questions about them. Specifically, models need to incorporate a new modality of text present in the images and reason over it to answer TextVQA questions. In this challenge, we use generative model T5 for TextVQA task. Based on pre-trained checkpoint T5-3B from HuggingFace repository, two other pre-training tasks including masked language modeling(MLM) and relative position prediction(RPP) are designed to better align object feature and scene text. In the stage of pre-training, encoder is dedicate to handle the fusion among multiple modalities: question text, object text labels, scene text labels, object visual features, scene visual features. After that decoder generates the text sequence step-by-step, cross entropy loss is required by default. We use a large-scale scene text dataset in pre-training and then fine-tune the T5-3B with the TextVQA dataset only.
    Deep Learning for Multi-View Stereo via Plane Sweep: A Survey. (arXiv:2106.15328v1 [cs.CV])
    (2 min) 3D reconstruction has lately attracted increasing attention due to its wide application in many areas, such as autonomous driving, robotics and virtual reality. As a dominant technique in artificial intelligence, deep learning has been successfully adopted to solve various computer vision problems. However, deep learning for 3D reconstruction is still at its infancy due to its unique challenges and varying pipelines. To stimulate future research, this paper presents a review of recent progress in deep learning methods for Multi-view Stereo (MVS), which is considered as a crucial task of image-based 3D reconstruction. It also presents comparative results on several publicly available datasets, with insightful observations and inspiring future research directions.
    Cloud based Scalable Object Recognition from Video Streams using Orientation Fusion and Convolutional Neural Networks. (arXiv:2106.15329v1 [cs.CV])
    (2 min) Object recognition from live video streams comes with numerous challenges such as the variation in illumination conditions and poses. Convolutional neural networks (CNNs) have been widely used to perform intelligent visual object recognition. Yet, CNNs still suffer from severe accuracy degradation, particularly on illumination-variant datasets. To address this problem, we propose a new CNN method based on orientation fusion for visual object recognition. The proposed cloud-based video analytics system pioneers the use of bi-dimensional empirical mode decomposition to split a video frame into intrinsic mode functions (IMFs). We further propose these IMFs to endure Reisz transform to produce monogenic object components, which are in turn used for the training of CNNs. Past works have demonstrated how the object orientation component may be used to pursue accuracy levels as high as 93\%. Herein we demonstrate how a feature-fusion strategy of the orientation components leads to further improving visual recognition accuracy to 97\%. We also assess the scalability of our method, looking at both the number and the size of the video streams under scrutiny. We carry out extensive experimentation on the publicly available Yale dataset, including also a self generated video datasets, finding significant improvements (both in accuracy and scale), in comparison to AlexNet, LeNet and SE-ResNeXt, which are the three most commonly used deep learning models for visual object recognition and classification.
    Where is the disease? Semi-supervised pseudo-normality synthesis from an abnormal image. (arXiv:2106.15345v1 [cs.CV])
    (2 min) Pseudo-normality synthesis, which computationally generates a pseudo-normal image from an abnormal one (e.g., with lesions), is critical in many perspectives, from lesion detection, data augmentation to clinical surgery suggestion. However, it is challenging to generate high-quality pseudo-normal images in the absence of the lesion information. Thus, expensive lesion segmentation data have been introduced to provide lesion information for the generative models and improve the quality of the synthetic images. In this paper, we aim to alleviate the need of a large amount of lesion segmentation data when generating pseudo-normal images. We propose a Semi-supervised Medical Image generative LEarning network (SMILE) which not only utilizes limited medical images with segmentation masks, but also leverages massive medical images without segmentation masks to generate realistic pseudo-normal images. Extensive experiments show that our model outperforms the best state-of-the-art model by up to 6% for data augmentation task and 3% in generating high-quality images. Moreover, the proposed semi-supervised learning achieves comparable medical image synthesis quality with supervised learning model, using only 50 of segmentation data.
    LB-CNN: An Open Source Framework for Fast Training of Light Binary Convolutional Neural Networks using Chainer and Cupy. (arXiv:2106.15350v1 [cs.LG])
    (2 min) Light binary convolutional neural networks (LB-CNN) are particularly useful when implemented in low-energy computing platforms as required in many industrial applications. Herein, a framework for optimizing compact LB-CNN is introduced and its effectiveness is evaluated. The framework is freely available and may run on free-access cloud platforms, thus requiring no major investments. The optimized model is saved in the standardized .h5 format and can be used as input to specialized tools for further deployment into specific technologies, thus enabling the rapid development of various intelligent image sensors. The main ingredient in accelerating the optimization of our model, particularly the selection of binary convolution kernels, is the Chainer/Cupy machine learning library offering significant speed-ups for training the output layer as an extreme-learning machine. Additional training of the output layer using Keras/Tensorflow is included, as it allows an increase in accuracy. Results for widely used datasets including MNIST, GTSRB, ORL, VGG show very good compromise between accuracy and complexity. Particularly, for face recognition problems a carefully optimized LB-CNN model provides up to 100% accuracies. Such TinyML solutions are well suited for industrial applications requiring image recognition with low energy consumption.
    Six-channel Image Representation for Cross-domain Object Detection. (arXiv:2101.00561v2 [cs.CV] UPDATED)
    (2 min) Most deep learning models are data-driven and the excellent performance is highly dependent on the abundant and diverse datasets. However, it is very hard to obtain and label the datasets of some specific scenes or applications. If we train the detector using the data from one domain, it cannot perform well on the data from another domain due to domain shift, which is one of the big challenges of most object detection models. To address this issue, some image-to-image translation techniques have been employed to generate some fake data of some specific scenes to train the models. With the advent of Generative Adversarial Networks (GANs), we could realize unsupervised image-to-image translation in both directions from a source to a target domain and from the target to the source domain. In this study, we report a new approach to making use of the generated images. We propose to concatenate the original 3-channel images and their corresponding GAN-generated fake images to form 6-channel representations of the dataset, hoping to address the domain shift problem while exploiting the success of available detection models. The idea of augmented data representation may inspire further study on object detection and other applications.
    Unified Questioner Transformer for Descriptive Question Generation in Goal-Oriented Visual Dialogue. (arXiv:2106.15550v1 [cs.CV])
    (2 min) Building an interactive artificial intelligence that can ask questions about the real world is one of the biggest challenges for vision and language problems. In particular, goal-oriented visual dialogue, where the aim of the agent is to seek information by asking questions during a turn-taking dialogue, has been gaining scholarly attention recently. While several existing models based on the GuessWhat?! dataset have been proposed, the Questioner typically asks simple category-based questions or absolute spatial questions. This might be problematic for complex scenes where the objects share attributes or in cases where descriptive questions are required to distinguish objects. In this paper, we propose a novel Questioner architecture, called Unified Questioner Transformer (UniQer), for descriptive question generation with referring expressions. In addition, we build a goal-oriented visual dialogue task called CLEVR Ask. It synthesizes complex scenes that require the Questioner to generate descriptive questions. We train our model with two variants of CLEVR Ask datasets. The results of the quantitative and qualitative evaluations show that UniQer outperforms the baseline.
    A Behavior-aware Graph Convolution Network Model for Video Recommendation. (arXiv:2106.15402v1 [cs.CV])
    (2 min) Interactions between users and videos are the major data source of performing video recommendation. Despite lots of existing recommendation methods, user behaviors on videos, which imply the complex relations between users and videos, are still far from being fully explored. In the paper, we present a model named Sagittarius. Sagittarius adopts a graph convolutional neural network to capture the influence between users and videos. In particular, Sagittarius differentiates between different user behaviors by weighting and fuses the semantics of user behaviors into the embeddings of users and videos. Moreover, Sagittarius combines multiple optimization objectives to learn user and video embeddings and then achieves the video recommendation by the learned user and video embeddings. The experimental results on multiple datasets show that Sagittarius outperforms several state-of-the-art models in terms of recall, unique recall and NDCG.
    Evaluation of Automated Image Descriptions for Visually Impaired Students. (arXiv:2106.15553v1 [cs.HC])
    (2 min) Illustrations are widely used in education, and sometimes, alternatives are not available for visually impaired students. Therefore, those students would benefit greatly from an automatic illustration description system, but only if those descriptions were complete, correct, and easily understandable using a screenreader. In this paper, we report on a study for the assessment of automated image descriptions. We interviewed experts to establish evaluation criteria, which we then used to create an evaluation questionnaire for sighted non-expert raters, and description templates. We used this questionnaire to evaluate the quality of descriptions which could be generated with a template-based automatic image describer. We present evidence that these templates have the potential to generate useful descriptions, and that the questionnaire identifies problems with description templates.
    Temporal Cluster Matching for Change Detection of Structures from Satellite Imagery. (arXiv:2103.09787v2 [cs.CV] UPDATED)
    (3 min) Longitudinal studies are vital to understanding dynamic changes of the planet, but labels (e.g., buildings, facilities, roads) are often available only for a single point in time. We propose a general model, Temporal Cluster Matching (TCM), for detecting building changes in time series of remotely sensed imagery when footprint labels are observed only once. The intuition behind the model is that the relationship between spectral values inside and outside of building's footprint will change when a building is constructed (or demolished). For instance, in rural settings, the pre-construction area may look similar to the surrounding environment until the building is constructed. Similarly, in urban settings, the pre-construction areas will look different from the surrounding environment until construction. We further propose a heuristic method for selecting the parameters of our model which allows it to be applied in novel settings without requiring data labeling efforts (to fit the parameters). We apply our model over a dataset of poultry barns from 2016/2017 high-resolution aerial imagery in the Delmarva Peninsula and a dataset of solar farms from a 2020 mosaic of Sentinel 2 imagery in India. Our results show that our model performs as well when fit using the proposed heuristic as it does when fit with labeled data, and further, that supervised versions of our model perform the best among all the baselines we test against. Finally, we show that our proposed approach can act as an effective data augmentation strategy -- it enables researchers to augment existing structure footprint labels along the time dimension and thus use imagery from multiple points in time to train deep learning models. We show that this improves the spatial generalization of such models when evaluated on the same change detection task.
    Probabilistic Attention for Interactive Segmentation. (arXiv:2106.15338v1 [cs.CV])
    (2 min) We provide a probabilistic interpretation of attention and show that the standard dot-product attention in transformers is a special case of Maximum A Posteriori (MAP) inference. The proposed approach suggests the use of Expectation Maximization algorithms for online adaptation of key and value model parameters. This approach is useful for cases in which external agents, e.g., annotators, provide inference-time information about the correct values of some tokens, e.g, the semantic category of some pixels, and we need for this new information to propagate to other tokens in a principled manner. We illustrate the approach on an interactive semantic segmentation task in which annotators and models collaborate online to improve annotation efficiency. Using standard benchmarks, we observe that key adaptation boosts model performance ($\sim10\%$ mIoU) in the low feedback regime and value propagation improves model responsiveness in the high feedback regime. A PyTorch layer implementation of our probabilistic attention model will be made publicly available.
    Wasserstein Adversarial Regularization (WAR) on label noise. (arXiv:1904.03936v3 [cs.LG] UPDATED)
    (2 min) Noisy labels often occur in vision datasets, especially when they are obtained from crowdsourcing or Web scraping. We propose a new regularization method, which enables learning robust classifiers in presence of noisy data. To achieve this goal, we propose a new adversarial regularization scheme based on the Wasserstein distance. Using this distance allows taking into account specific relations between classes by leveraging the geometric properties of the labels space. Our Wasserstein Adversarial Regularization (WAR) encodes a selective regularization, which promotes smoothness of the classifier between some classes, while preserving sufficient complexity of the decision boundary between others. We first discuss how and why adversarial regularization can be used in the context of label noise and then show the effectiveness of our method on five datasets corrupted with noisy labels: in both benchmarks and real datasets, WAR outperforms the state-of-the-art competitors.
    Spiking-GAN: A Spiking Generative Adversarial Network Using Time-To-First-Spike Coding. (arXiv:2106.15420v1 [cs.NE])
    (2 min) Spiking Neural Networks (SNNs) have shown great potential in solving deep learning problems in an energy-efficient manner. However, they are still limited to simple classification tasks. In this paper, we propose Spiking-GAN, the first spike-based Generative Adversarial Network (GAN). It employs a kind of temporal coding scheme called time-to-first-spike coding. We train it using approximate backpropagation in the temporal domain. We use simple integrate-and-fire (IF) neurons with very high refractory period for our network which ensures a maximum of one spike per neuron. This makes the model much sparser than a spike rate-based system. Our modified temporal loss function called 'Aggressive TTFS' improves the inference time of the network by over 33% and reduces the number of spikes in the network by more than 11% compared to previous works. Our experiments show that on training the network on the MNIST dataset using this approach, we can generate high quality samples. Thereby demonstrating the potential of this framework for solving such problems in the spiking domain.
    IMENet: Joint 3D Semantic Scene Completion and 2D Semantic Segmentation through Iterative Mutual Enhancement. (arXiv:2106.15413v1 [cs.CV])
    (2 min) 3D semantic scene completion and 2D semantic segmentation are two tightly correlated tasks that are both essential for indoor scene understanding, because they predict the same semantic classes, using positively correlated high-level features. Current methods use 2D features extracted from early-fused RGB-D images for 2D segmentation to improve 3D scene completion. We argue that this sequential scheme does not ensure these two tasks fully benefit each other, and present an Iterative Mutual Enhancement Network (IMENet) to solve them jointly, which interactively refines the two tasks at the late prediction stage. Specifically, two refinement modules are developed under a unified framework for the two tasks. The first is a 2D Deformable Context Pyramid (DCP) module, which receives the projection from the current 3D predictions to refine the 2D predictions. In turn, a 3D Deformable Depth Attention (DDA) module is proposed to leverage the reprojected results from 2D predictions to update the coarse 3D predictions. This iterative fusion happens to the stable high-level features of both tasks at a late stage. Extensive experiments on NYU and NYUCAD datasets verify the effectiveness of the proposed iterative late fusion scheme, and our approach outperforms the state of the art on both 3D semantic scene completion and 2D semantic segmentation.
    A Systematic Evaluation of Domain Adaptation in Facial Expression Recognition. (arXiv:2106.15453v1 [cs.CV])
    (2 min) Facial Expression Recognition is a commercially important application, but one common limitation is that applications often require making predictions on out-of-sample distributions, where target images may have very different properties from the images that the model was trained on. How well, or badly, do these models do on unseen target domains? In this paper, we provide a systematic evaluation of domain adaptation in facial expression recognition. Using state-of-the-art transfer learning techniques and six commonly-used facial expression datasets (three collected in the lab and three "in-the-wild"), we conduct extensive round-robin experiments to examine the classification accuracies for a state-of-the-art CNN model. We also perform multi-source experiments where we examine a model's ability to transfer from multiple source datasets, including (i) within-setting (e.g., lab to lab), (ii) cross-setting (e.g., in-the-wild to lab), (iii) mixed-setting (e.g., lab and wild to lab) transfer learning experiments. We find sobering results that the accuracy of transfer learning is not high, and varies idiosyncratically with the target dataset, and to a lesser extent the source dataset. Generally, the best settings for transfer include fine-tuning the weights of a pre-trained model, and we find that training with more datasets, regardless of setting, improves transfer performance. We end with a discussion of the need for more -- and regular -- systematic investigations into the generalizability of FER models, especially for deployed applications.
    Source-free Domain Adaptation via Avatar Prototype Generation and Adaptation. (arXiv:2106.15326v1 [cs.CV])
    (2 min) We study a practical domain adaptation task, called source-free unsupervised domain adaptation (UDA) problem, in which we cannot access source domain data due to data privacy issues but only a pre-trained source model and unlabeled target data are available. This task, however, is very difficult due to one key challenge: the lack of source data and target domain labels makes model adaptation very challenging. To address this, we propose to mine the hidden knowledge in the source model and exploit it to generate source avatar prototypes (i.e., representative features for each source class) as well as target pseudo labels for domain alignment. To this end, we propose a Contrastive Prototype Generation and Adaptation (CPGA) method. Specifically, CPGA consists of two stages: (1) prototype generation: by exploring the classification boundary information of the source model, we train a prototype generator to generate avatar prototypes via contrastive learning. (2) prototype adaptation: based on the generated source prototypes and target pseudo labels, we develop a new robust contrastive prototype adaptation strategy to align each pseudo-labeled target data to the corresponding source prototypes. Extensive experiments on three UDA benchmark datasets demonstrate the effectiveness and superiority of the proposed method.
    Contrastive Attraction and Contrastive Repulsion for Representation Learning. (arXiv:2105.03746v2 [cs.LG] UPDATED)
    (2 min) Contrastive learning (CL) is effective in learning data representations without label supervision, where the encoder needs to contrast each positive sample over multiple negative samples via a one-vs-many softmax cross-entropy loss. However, conventional CL is sensitive to how many negative samples are included and how they are selected. Proposed in this paper is a doubly CL strategy that contrasts positive samples and negative ones within themselves separately. We realize this strategy with contrastive attraction and contrastive repulsion (CACR) makes the query not only exert a greater force to attract more distant positive samples but also do so to repel closer negative samples. Theoretical analysis reveals the connection between CACR and CL from the perspectives of both positive attraction and negative repulsion and shows the benefits in both efficiency and robustness brought by separately contrasting within the sampled positive and negative pairs. Extensive large-scale experiments on standard vision tasks show that CACR not only consistently outperforms existing CL methods on benchmark datasets in representation learning, but also provides interpretable contrastive weights, demonstrating the efficacy of the proposed doubly contrastive strategy.
    Benchmarking Unsupervised Object Representations for Video Sequences. (arXiv:2006.07034v2 [cs.CV] UPDATED)
    (2 min) Perceiving the world in terms of objects and tracking them through time is a crucial prerequisite for reasoning and scene understanding. Recently, several methods have been proposed for unsupervised learning of object-centric representations. However, since these models were evaluated on different downstream tasks, it remains unclear how they compare in terms of basic perceptual abilities such as detection, figure-ground segmentation and tracking of objects. To close this gap, we design a benchmark with four data sets of varying complexity and seven additional test sets featuring challenging tracking scenarios relevant for natural videos. Using this benchmark, we compare the perceptual abilities of four object-centric approaches: ViMON, a video-extension of MONet, based on recurrent spatial attention, OP3, which exploits clustering via spatial mixture models, as well as TBA and SCALOR, which use explicit factorization via spatial transformers. Our results suggest that the architectures with unconstrained latent representations learn more powerful representations in terms of object detection, segmentation and tracking than the spatial transformer based architectures. We also observe that none of the methods are able to gracefully handle the most challenging tracking scenarios despite their synthetic nature, suggesting that our benchmark may provide fruitful guidance towards learning more robust object-centric video representations.
    Study of visual processing techniques for dynamic speckles: a comparative analysis. (arXiv:2106.15507v1 [cs.CV])
    (2 min) Main visual techniques used to obtain information from speckle patterns are Fujii method, generalized difference, weighted generalized difference, mean windowed difference, structural function (SF), modified SF, etc. In this work, a comparative analysis of major visual techniques for natural gum sample is carried out. Obtained results conclusively establish SF based method as an optimum tool for visual inspection of dynamic speckle data.
    Parallel Medical Imaging for Intelligent Medical Image Analysis: Concepts, Methods, and Applications. (arXiv:1903.04855v3 [cs.CV] UPDATED)
    (2 min) There has been much progress in data-driven artificial intelligence technology for medical image analysis in the last decades. However, it still remains challenging due to its distinctive complexity of acquiring and annotating image data, extracting medical domain knowledge, and explaining the diagnostic decision for medical image analysis. In this paper, we propose a data-knowledge-driven framework termed as Parallel Medical Imaging (PMI) for intelligent medical image analysis based on the methodology of interactive ACP-based parallel intelligence. In the PMI framework, computational experiments with predictive learning in a data-driven way are conducted to extract medical knowledge for diagnostic decision support. Artificial imaging systems are introduced to select and prescriptively generate medical image data in a knowledge-driven way to utilize medical domain knowledge. Through the closed-loop optimization based on parallel execution, our proposed PMI framework can boost the generalization ability and alleviate the limitation of medical interpretation for diagnostic decisions. Furthermore, we illustrate the preliminary implementation of PMI method through the case studies of mammogram analysis and skin lesion image analysis. Experimental results on several public medical image datasets demonstrate the effectiveness of proposed PMI.
    MemX: An Attention-Aware Smart Eyewear System for Personalized Moment Auto-capture. (arXiv:2105.00916v2 [cs.CV] UPDATED)
    (2 min) This work presents MemX: a biologically-inspired attention-aware eyewear system developed with the goal of pursuing the long-awaited vision of a personalized visual Memex. MemX captures human visual attention on the fly, analyzes the salient visual content, and records moments of personal interest in the form of compact video snippets. Accurate attentive scene detection and analysis on resource-constrained platforms is challenging because these tasks are computation and energy intensive. We propose a new temporal visual attention network that unifies human visual attention tracking and salient visual content analysis. Attention tracking focuses computation-intensive video analysis on salient regions, while video analysis makes human attention detection and tracking more accurate. Using the YouTube-VIS dataset and 30 participants, we experimentally show that MemX significantly improves the attention tracking accuracy over the eye-tracking-alone method, while maintaining high system energy efficiency. We have also conducted 11 in-field pilot studies across a range of daily usage scenarios, which demonstrate the feasibility and potential benefits of MemX.
    Fast and Accurate Road Crack Detection Based on Adaptive Cost-Sensitive Loss Function. (arXiv:2106.15510v1 [cs.CV])
    (2 min) Numerous detection problems in computer vision, including road crack detection, suffer from exceedingly foreground-background imbalance. Fortunately, modification of loss function appears to solve this puzzle once and for all. In this paper, we propose a pixel-based adaptive weighted cross-entropy loss in conjunction with Jaccard distance to facilitate high-quality pixel-level road crack detection. Our work profoundly demonstrates the influence of loss functions on detection outcomes, and sheds light on the sophisticated consecutive improvements in the realm of crack detection. Specifically, to verify the effectiveness of the proposed loss, we conduct extensive experiments on four public databases, i.e., CrackForest, AigleRN, Crack360, and BJN260. Compared with the vanilla weighted cross-entropy, the proposed loss significantly speeds up the training process while retaining the test accuracy.
    Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery. (arXiv:2011.01619v2 [cs.CV] UPDATED)
    (2 min) Automatic surgical gesture recognition is fundamentally important to enable intelligent cognitive assistance in robotic surgery. With recent advancement in robot-assisted minimally invasive surgery, rich information including surgical videos and robotic kinematics can be recorded, which provide complementary knowledge for understanding surgical gestures. However, existing methods either solely adopt uni-modal data or directly concatenate multi-modal representations, which can not sufficiently exploit the informative correlations inherent in visual and kinematics data to boost gesture recognition accuracies. In this regard, we propose a novel online approach of multi-modal relational graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information through interactive message propagation in the latent feature space. In specific, we first extract embeddings from video and kinematics sequences with temporal convolutional networks and LSTM units. Next, we identify multi-relations in these multi-modal embeddings and leverage them through a hierarchical relational graph learning module. The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset, outperforming current uni-modal and multi-modal methods on both suturing and knot typing tasks. Furthermore, we validated our method on in-house visual-kinematics datasets collected with da Vinci Research Kit (dVRK) platforms in two centers, with consistent promising performance achieved.
    Hate speech detection using static BERT embeddings. (arXiv:2106.15537v1 [cs.CL])
    (2 min) With increasing popularity of social media platforms hate speech is emerging as a major concern, where it expresses abusive speech that targets specific group characteristics, such as gender, religion or ethnicity to spread violence. Earlier people use to verbally deliver hate speeches but now with the expansion of technology, some people are deliberately using social media platforms to spread hate by posting, sharing, commenting, etc. Whether it is Christchurch mosque shootings or hate crimes against Asians in west, it has been observed that the convicts are very much influenced from hate text present online. Even though AI systems are in place to flag such text but one of the key challenges is to reduce the false positive rate (marking non hate as hate), so that these systems can detect hate speech without undermining the freedom of expression. In this paper, we use ETHOS hate speech detection dataset and analyze the performance of hate speech detection classifier by replacing or integrating the word embeddings (fastText (FT), GloVe (GV) or FT + GV) with static BERT embeddings (BE). With the extensive experimental trails it is observed that the neural network performed better with static BE compared to using FT, GV or FT + GV as word embeddings. In comparison to fine-tuned BERT, one metric that significantly improved is specificity.
    Progressive Joint Low-light Enhancement and Noise Removal for Raw Images. (arXiv:2106.14844v2 [eess.IV] UPDATED)
    (2 min) Low-light imaging on mobile devices is typically challenging due to insufficient incident light coming through the relatively small aperture, resulting in a low signal-to-noise ratio. Most of the previous works on low-light image processing focus either only on a single task such as illumination adjustment, color enhancement, or noise removal; or on a joint illumination adjustment and denoising task that heavily relies on short-long exposure image pairs collected from specific camera models, and thus these approaches are less practical and generalizable in real-world settings where camera-specific joint enhancement and restoration is required. To tackle this problem, in this paper, we propose a low-light image processing framework that performs joint illumination adjustment, color enhancement, and denoising. Considering the difficulty in model-specific data collection and the ultra-high definition of the captured images, we design two branches: a coefficient estimation branch as well as a joint enhancement and denoising branch. The coefficient estimation branch works in a low-resolution space and predicts the coefficients for enhancement via bilateral learning, whereas the joint enhancement and denoising branch works in a full-resolution space and performs joint enhancement and denoising in a progressive manner. In contrast to existing methods, our framework does not need to recollect massive data when being adapted to another camera model, which significantly reduces the efforts required to fine-tune our approach for practical usage. Through extensive experiments, we demonstrate its great potential in real-world low-light imaging applications when compared with current state-of-the-art methods.
    Multimodal End-to-End Learning for Autonomous Steering in Adverse Road and Weather Conditions. (arXiv:2010.14924v2 [cs.CV] UPDATED)
    (2 min) Autonomous driving is challenging in adverse road and weather conditions in which there might not be lane lines, the road might be covered in snow and the visibility might be poor. We extend the previous work on end-to-end learning for autonomous steering to operate in these adverse real-life conditions with multimodal data. We collected 28 hours of driving data in several road and weather conditions and trained convolutional neural networks to predict the car steering wheel angle from front-facing color camera images and lidar range and reflectance data. We compared the CNN model performances based on the different modalities and our results show that the lidar modality improves the performances of different multimodal sensor-fusion models. We also performed on-road tests with different models and they support this observation.
    Domain-Class Correlation Decomposition for Generalizable Person Re-Identification. (arXiv:2106.15206v1 [cs.CV])
    (2 min) Domain generalization in person re-identification is a highly important meaningful and practical task in which a model trained with data from several source domains is expected to generalize well to unseen target domains. Domain adversarial learning is a promising domain generalization method that aims to remove domain information in the latent representation through adversarial training. However, in person re-identification, the domain and class are correlated, and we theoretically show that domain adversarial learning will lose certain information about class due to this domain-class correlation. Inspired by casual inference, we propose to perform interventions to the domain factor $d$, aiming to decompose the domain-class correlation. To achieve this goal, we proposed estimating the resulting representation $z^{*}$ caused by the intervention through first- and second-order statistical characteristic matching. Specifically, we build a memory bank to restore the statistical characteristics of each domain. Then, we use the newly generated samples $\{z^{*},y,d^{*}\}$ to compute the loss function. These samples are domain-class correlation decomposed; thus, we can learn a domain-invariant representation that can capture more class-related features. Extensive experiments show that our model outperforms the state-of-the-art methods on the large-scale domain generalization Re-ID benchmark.
    Deep Learning Body Region Classification of MRI and CT examinations. (arXiv:2104.13826v2 [eess.IV] UPDATED)
    (2 min) Standardized body region labelling of individual images provides data that can improve human and computer use of medical images. A CNN-based classifier was developed to identify body regions in CT and MRI. 17 CT (18 MRI) body regions covering the entire human body were defined for the classification task. Three retrospective databases were built for the AI model training, validation, and testing, with a balanced distribution of studies per body region. The test databases originated from a different healthcare network. Accuracy, recall and precision of the classifier was evaluated for patient age, patient gender, institution, scanner manufacturer, contrast, slice thickness, MRI sequence, and CT kernel. The data included a retrospective cohort of 2,934 anonymized CT cases (training: 1,804 studies, validation: 602 studies, test: 528 studies) and 3,185 anonymized MRI cases (training: 1,911 studies, validation: 636 studies, test: 638 studies). 27 institutions from primary care hospitals, community hospitals and imaging centers contributed to the test datasets. The data included cases of all genders in equal proportions and subjects aged from a few months old to +90 years old. An image-level prediction accuracy of 91.9% (90.2 - 92.1) for CT, and 94.2% (92.0 - 95.6) for MRI was achieved. The classification results were robust across all body regions and confounding factors. Due to limited data, performance results for subjects under 10 years-old could not be reliably evaluated. We show that deep learning models can classify CT and MRI images by body region including lower and upper extremities with high accuracy.
    Deep Features for training Support Vector Machine. (arXiv:2104.03488v2 [cs.CV] UPDATED)
    (2 min) Features play a crucial role in computer vision. Initially designed to detect salient elements by means of handcrafted algorithms, features are now often learned by different layers in Convolutional Neural Networks (CNNs). This paper develops a generic computer vision system based on features extracted from trained CNNs. Multiple learned features are combined into a single structure to work on different image classification tasks. The proposed system was experimentally derived by testing several approaches for extracting features from the inner layers of CNNs and using them as inputs to SVMs that are then combined by sum rule. Dimensionality reduction techniques are used to reduce the high dimensionality of inner layers. The resulting vision system is shown to significantly boost the performance of standard CNNs across a large and diverse collection of image data sets. An ensemble of different topologies using the same approach obtains state-of-the-art results on a virus data set.
    Segmentation with Multiple Acceptable Annotations: A Case Study of Myocardial Segmentation in Contrast Echocardiography. (arXiv:2106.15597v1 [cs.CV])
    (2 min) Most existing deep learning-based frameworks for image segmentation assume that a unique ground truth is known and can be used for performance evaluation. This is true for many applications, but not all. Myocardial segmentation of Myocardial Contrast Echocardiography (MCE), a critical task in automatic myocardial perfusion analysis, is an example. Due to the low resolution and serious artifacts in MCE data, annotations from different cardiologists can vary significantly, and it is hard to tell which one is the best. In this case, how can we find a good way to evaluate segmentation performance and how do we train the neural network? In this paper, we address the first problem by proposing a new extended Dice to effectively evaluate the segmentation performance when multiple accepted ground truth is available. Then based on our proposed metric, we solve the second problem by further incorporating the new metric into a loss function that enables neural networks to flexibly learn general features of myocardium. Experiment results on our clinical MCE data set demonstrate that the neural network trained with the proposed loss function outperforms those existing ones that try to obtain a unique ground truth from multiple annotations, both quantitatively and qualitatively. Finally, our grading study shows that using extended Dice as an evaluation metric can better identify segmentation results that need manual correction compared with using Dice.
    Logic could be learned from images. (arXiv:1908.01931v2 [cs.CV] UPDATED)
    (2 min) Logic reasoning is a significant ability of human intelligence and also an important task in artificial intelligence. The existing logic reasoning methods, quite often, need to design some reasoning patterns beforehand. This has led to an interesting question: can logic reasoning patterns be directly learned from given data? The problem is termed as a data concept logic. In this study, a learning logic task from images, called a LiLi task, first is proposed. This task is to learn and reason the logic relation from images, without presetting any reasoning patterns. As a preliminary exploration, we design six LiLi data sets (Bitwise And, Bitwise Or, Bitwise Xor, Addition, Subtraction and Multiplication), in which each image is embedded with a n-digit number. It is worth noting that a learning model beforehand does not know the meaning of the n-digit numbers embedded in images and the relation between the input images and the output image. In order to tackle the task, in this work we use many typical neural network models and produce fruitful results. However, these models have the poor performances on the difficult logic task. For furthermore addressing this task, a novel network framework called a divide and conquer model by adding some label information is designed, achieving a high testing accuracy.
    Two-Stage Self-Supervised Cycle-Consistency Network for Reconstruction of Thin-Slice MR Images. (arXiv:2106.15395v1 [eess.IV])
    (2 min) The thick-slice magnetic resonance (MR) images are often structurally blurred in coronal and sagittal views, which causes harm to diagnosis and image post-processing. Deep learning (DL) has shown great potential to re-construct the high-resolution (HR) thin-slice MR images from those low-resolution (LR) cases, which we refer to as the slice interpolation task in this work. However, since it is generally difficult to sample abundant paired LR-HR MR images, the classical fully supervised DL-based models cannot be effectively trained to get robust performance. To this end, we propose a novel Two-stage Self-supervised Cycle-consistency Network (TSCNet) for MR slice interpolation, in which a two-stage self-supervised learning (SSL) strategy is developed for unsupervised DL network training. The paired LR-HR images are synthesized along the sagittal and coronal directions of input LR images for network pretraining in the first-stage SSL, and then a cyclic in-terpolation procedure based on triplet axial slices is designed in the second-stage SSL for further refinement. More training samples with rich contexts along all directions are exploited as guidance to guarantee the improved in-terpolation performance. Moreover, a new cycle-consistency constraint is proposed to supervise this cyclic procedure, which encourages the network to reconstruct more realistic HR images. The experimental results on a real MRI dataset indicate that TSCNet achieves superior performance over the conventional and other SSL-based algorithms, and obtains competitive quali-tative and quantitative results compared with the fully supervised algorithm.
    Multiple Graph Learning for Scalable Multi-view Clustering. (arXiv:2106.15382v1 [cs.CV])
    (2 min) Graph-based multi-view clustering has become an active topic due to the efficiency in characterizing both the complex structure and relationship between multimedia data. However, existing methods have the following shortcomings: (1) They are inefficient or even fail for graph learning in large scale due to the graph construction and eigen-decomposition. (2) They cannot well exploit both the complementary information and spatial structure embedded in graphs of different views. To well exploit complementary information and tackle the scalability issue plaguing graph-based multi-view clustering, we propose an efficient multiple graph learning model via a small number of anchor points and tensor Schatten p-norm minimization. Specifically, we construct a hidden and tractable large graph by anchor graph for each view and well exploit complementary information embedded in anchor graphs of different views by tensor Schatten p-norm regularizer. Finally, we develop an efficient algorithm, which scales linearly with the data size, to solve our proposed model. Extensive experimental results on several datasets indicate that our proposed method outperforms some state-of-the-art multi-view clustering algorithms.
    Unmixing Convolutional Features for Crisp Edge Detection. (arXiv:2011.09808v2 [cs.CV] UPDATED)
    (2 min) This paper presents a context-aware tracing strategy (CATS) for crisp edge detection with deep edge detectors, based on an observation that the localization ambiguity of deep edge detectors is mainly caused by the mixing phenomenon of convolutional neural networks: feature mixing in edge classification and side mixing during fusing side predictions. The CATS consists of two modules: a novel tracing loss that performs feature unmixing by tracing boundaries for better side edge learning, and a context-aware fusion block that tackles the side mixing by aggregating the complementary merits of learned side edges. Experiments demonstrate that the proposed CATS can be integrated into modern deep edge detectors to improve localization accuracy. With the vanilla VGG16 backbone, in terms of BSDS500 dataset, our CATS improves the F-measure (ODS) of the RCF and BDCN deep edge detectors by 12% and 6% respectively when evaluating without using the morphological non-maximal suppression scheme for edge detection.
    Effective Evaluation of Deep Active Learning on Image Classification Tasks. (arXiv:2106.15324v1 [cs.CV])
    (2 min) With the goal of making deep learning more label-efficient, a growing number of papers have been studying active learning (AL) for deep models. However, there are a number of issues in the prevalent experimental settings, mainly stemming from a lack of unified implementation and benchmarking. Issues in the current literature include sometimes contradictory observations on the performance of different AL algorithms, unintended exclusion of important generalization approaches such as data augmentation and SGD for optimization, a lack of study of evaluation facets like the labeling efficiency of AL, and little or no clarity on the scenarios in which AL outperforms random sampling (RS). In this work, we present a unified re-implementation of state-of-the-art AL algorithms in the context of image classification, and we carefully study these issues as facets of effective evaluation. On the positive side, we show that AL techniques are 2x to 4x more label-efficient compared to RS with the use of data augmentation. Surprisingly, when data augmentation is included, there is no longer a consistent gain in using BADGE, a state-of-the-art approach, over simple uncertainty sampling. We then do a careful analysis of how existing approaches perform with varying amounts of redundancy and number of examples per class. Finally, we provide several insights for AL practitioners to consider in future work, such as the effect of the AL batch size, the effect of initialization, the importance of retraining a new model at every round, and other insights.
    Symmetry meets AI. (arXiv:2103.06115v2 [cs.LG] UPDATED)
    (2 min) We explore whether Neural Networks (NNs) can {\it discover} the presence of symmetries as they learn to perform a task. For this, we train hundreds of NNs on a {\it decoy task} based on well-controlled Physics templates, where no information on symmetry is provided. We use the output from the last hidden layer of all these NNs, projected to fewer dimensions, as the input for a symmetry classification task, and show that information on symmetry had indeed been identified by the original NN without guidance. As an interdisciplinary application of this procedure, we identify the presence and level of symmetry in artistic paintings from different styles such as those of Picasso, Pollock and Van Gogh.
    TANet++: Triple Attention Network with Filtered Pointcloud on 3D Detection. (arXiv:2106.15366v1 [cs.CV])
    (2 min) TANet is one of state-of-the-art 3D object detection method on KITTI and JRDB benchmark, the network contains a Triple Attention module and Coarse-to-Fine Regression module to improve the robustness and accuracy of 3D Detection. However, since the original input data (point clouds) contains a lot of noise during collecting the data, which will further affect the training of the model. For example, the object is far from the robot, the sensor is difficult to obtain enough pointcloud. If the objects only contains few point clouds, and the samples are fed into model with the normal samples together during training, the detector will be difficult to distinguish the individual with few pointcloud belong to object or background. In this paper, we propose TANet++ to improve the performance on 3D Detection, which adopt a novel training strategy on training the TANet. In order to reduce the negative impact by the weak samples, the training strategy previously filtered the training data, and then the TANet++ is trained by the rest of data. The experimental results shows that AP score of TANet++ is 8.98\% higher than TANet on JRDB benchmark.
    Detecting Cattle and Elk in the Wild from Space. (arXiv:2106.15448v1 [cs.CV])
    (2 min) Localizing and counting large ungulates -- hoofed mammals like cows and elk -- in very high-resolution satellite imagery is an important task for supporting ecological studies. Prior work has shown that this is feasible with deep learning based methods and sub-meter multi-spectral satellite imagery. We extend this line of work by proposing a baseline method, CowNet, that simultaneously estimates the number of animals in an image (counts), as well as predicts their location at a pixel level (localizes). We also propose an methodology for evaluating such models on counting and localization tasks across large scenes that takes the uncertainty of noisy labels and the information needed by stakeholders in ecological monitoring tasks into account. Finally, we benchmark our baseline method with state of the art vision methods for counting objects in scenes. We specifically test the temporal generalization of the resulting models over a large landscape in Point Reyes Seashore, CA. We find that the LC-FCN model performs the best and achieves an average precision between 0.56 and 0.61 and an average recall between 0.78 and 0.92 over three held out test scenes.
    Roof Damage Assessment from Automated 3D Building Models. (arXiv:2106.15294v1 [cs.CV])
    (2 min) The 3D building modelling is important in urban planning and related domains that draw upon the content of 3D models of urban scenes. Such 3D models can be used to visualize city images at multiple scales from individual buildings to entire cities prior to and after a change has occurred. This ability is of great importance in day-to-day work and special projects undertaken by planners, geo-designers, and architects. In this research, we implemented a novel approach to 3D building models for such matter, which included the integration of geographic information systems (GIS) and 3D Computer Graphics (3DCG) components that generate 3D house models from building footprints (polygons), and the automated generation of simple and complex roof geometries for rapid roof area damage reporting. These polygons (footprints) are usually orthogonal. A complicated orthogonal polygon can be partitioned into a set of rectangles. The proposed GIS and 3DCG integrated system partitions orthogonal building polygons into a set of rectangles and places rectangular roofs and box-shaped building bodies on these rectangles. Since technicians are drawing these polygons manually with digitizers, depending on aerial photos, not all building polygons are precisely orthogonal. But, when placing a set of boxes as building bodies for creating the buildings, there may be gaps or overlaps between these boxes if building polygons are not precisely orthogonal. In our proposal, after approximately orthogonal building polygons are partitioned and rectified into a set of mutually orthogonal rectangles, each rectangle knows which rectangle is adjacent to and which edge of the rectangle is adjacent to, which will avoid unwanted intersection of windows and doors when building bodies combined.
    Cells are Actors: Social Network Analysis with Classical ML for SOTA Histology Image Classification. (arXiv:2106.15299v1 [cs.CV])
    (2 min) Digitization of histology images and the advent of new computational methods, like deep learning, have helped the automatic grading of colorectal adenocarcinoma cancer (CRA). Present automated CRA grading methods, however, usually use tiny image patches and thus fail to integrate the entire tissue micro-architecture for grading purposes. To tackle these challenges, we propose to use a statistical network analysis method to describe the complex structure of the tissue micro-environment by modelling nuclei and their connections as a network. We show that by analyzing only the interactions between the cells in a network, we can extract highly discriminative statistical features for CRA grading. Unlike other deep learning or convolutional graph-based approaches, our method is highly scalable (can be used for cell networks consist of millions of nodes), completely explainable, and computationally inexpensive. We create cell networks on a broad CRC histology image dataset, experiment with our method, and report state-of-the-art performance for the prediction of three-class CRA grading.
    Text Prior Guided Scene Text Image Super-resolution. (arXiv:2106.15368v1 [cs.CV])
    (2 min) Scene text image super-resolution (STISR) aims to improve the resolution and visual quality of low-resolution (LR) scene text images, and consequently boost the performance of text recognition. However, most of existing STISR methods regard text images as natural scene images, ignoring the categorical information of text. In this paper, we make an inspiring attempt to embed categorical text prior into STISR model training. Specifically, we adopt the character probability sequence as the text prior, which can be obtained conveniently from a text recognition model. The text prior provides categorical guidance to recover high-resolution (HR) text images. On the other hand, the reconstructed HR image can refine the text prior in return. Finally, we present a multi-stage text prior guided super-resolution (TPGSR) framework for STISR. Our experiments on the benchmark TextZoom dataset show that TPGSR can not only effectively improve the visual quality of scene text images, but also significantly improve the text recognition accuracy over existing STISR methods. Our model trained on TextZoom also demonstrates certain generalization capability to the LR images in other datasets.
    High-Fidelity 3D Digital Human Head Creation from RGB-D Selfies. (arXiv:2010.05562v2 [cs.CV] UPDATED)
    (2 min) We present a fully automatic system that can produce high-fidelity, photo-realistic 3D digital human heads with a consumer RGB-D selfie camera. The system only needs the user to take a short selfie RGB-D video while rotating his/her head, and can produce a high quality head reconstruction in less than 30 seconds. Our main contribution is a new facial geometry modeling and reflectance synthesis procedure that significantly improves the state-of-the-art. Specifically, given the input video a two-stage frame selection procedure is first employed to select a few high-quality frames for reconstruction. Then a differentiable renderer based 3D Morphable Model (3DMM) fitting algorithm is applied to recover facial geometries from multiview RGB-D data, which takes advantages of a powerful 3DMM basis constructed with extensive data generation and perturbation. Our 3DMM has much larger expressive capacities than conventional 3DMM, allowing us to recover more accurate facial geometry using merely linear basis. For reflectance synthesis, we present a hybrid approach that combines parametric fitting and CNNs to synthesize high-resolution albedo/normal maps with realistic hair/pore/wrinkle details. Results show that our system can produce faithful 3D digital human faces with extremely realistic details. The main code and the newly constructed 3DMM basis is publicly available.
    ACN: Adversarial Co-training Network for Brain Tumor Segmentation with Missing Modalities. (arXiv:2106.14591v2 [eess.IV] UPDATED)
    (2 min) Accurate segmentation of brain tumors from magnetic resonance imaging (MRI) is clinically relevant in diagnoses, prognoses and surgery treatment, which requires multiple modalities to provide complementary morphological and physiopathologic information. However, missing modality commonly occurs due to image corruption, artifacts, different acquisition protocols or allergies to certain contrast agents in clinical practice. Though existing efforts demonstrate the possibility of a unified model for all missing situations, most of them perform poorly when more than one modality is missing. In this paper, we propose a novel Adversarial Co-training Network (ACN) to solve this issue, in which a series of independent yet related models are trained dedicated to each missing situation with significantly better results. Specifically, ACN adopts a novel co-training network, which enables a coupled learning process for both full modality and missing modality to supplement each other's domain and feature representations, and more importantly, to recover the `missing' information of absent modalities. Then, two unsupervised modules, i.e., entropy and knowledge adversarial learning modules are proposed to minimize the domain gap while enhancing prediction reliability and encouraging the alignment of latent representations, respectively. We also adapt modality-mutual information knowledge transfer learning to ACN to retain the rich mutual information among modalities. Extensive experiments on BraTS2018 dataset show that our proposed method significantly outperforms all state-of-the-art methods under any missing situation.
    A Mixed-Supervision Multilevel GAN Framework for Image Quality Enhancement. (arXiv:2106.15575v1 [eess.IV])
    (2 min) Deep neural networks for image quality enhancement typically need large quantities of highly-curated training data comprising pairs of low-quality images and their corresponding high-quality images. While high-quality image acquisition is typically expensive and time-consuming, medium-quality images are faster to acquire, at lower equipment costs, and available in larger quantities. Thus, we propose a novel generative adversarial network (GAN) that can leverage training data at multiple levels of quality (e.g., high and medium quality) to improve performance while limiting costs of data curation. We apply our mixed-supervision GAN to (i) super-resolve histopathology images and (ii) enhance laparoscopy images by combining super-resolution and surgical smoke removal. Results on large clinical and pre-clinical datasets show the benefits of our mixed-supervision GAN over the state of the art.
    DeepFaceLab: Integrated, flexible and extensible face-swapping framework. (arXiv:2005.05535v5 [cs.CV] UPDATED)
    (2 min) Deepfake defense not only requires the research of detection but also requires the efforts of generation methods. However, current deepfake methods suffer the effects of obscure workflow and poor performance. To solve this problem, we present DeepFaceLab, the current dominant deepfake framework for face-swapping. It provides the necessary tools as well as an easy-to-use way to conduct high-quality face-swapping. It also offers a flexible and loose coupling structure for people who need to strengthen their pipeline with other features without writing complicated boilerplate code. We detail the principles that drive the implementation of DeepFaceLab and introduce its pipeline, through which every aspect of the pipeline can be modified painlessly by users to achieve their customization purpose. It is noteworthy that DeepFaceLab could achieve cinema-quality results with high fidelity. We demonstrate the advantage of our system by comparing our approach with other face-swapping methods.For more information, please visit:https://github.com/iperov/DeepFaceLab/.
    Serial-EMD: Fast Empirical Mode Decomposition Method for Multi-dimensional Signals Based on Serialization. (arXiv:2106.15319v1 [cs.CV])
    (2 min) Empirical mode decomposition (EMD) has developed into a prominent tool for adaptive, scale-based signal analysis in various fields like robotics, security and biomedical engineering. Since the dramatic increase in amount of data puts forward higher requirements for the capability of real-time signal analysis, it is difficult for existing EMD and its variants to trade off the growth of data dimension and the speed of signal analysis. In order to decompose multi-dimensional signals at a faster speed, we present a novel signal-serialization method (serial-EMD), which concatenates multi-variate or multi-dimensional signals into a one-dimensional signal and uses various one-dimensional EMD algorithms to decompose it. To verify the effects of the proposed method, synthetic multi-variate time series, artificial 2D images with various textures and real-world facial images are tested. Compared with existing multi-EMD algorithms, the decomposition time becomes significantly reduced. In addition, the results of facial recognition with Intrinsic Mode Functions (IMFs) extracted using our method can achieve a higher accuracy than those obtained by existing multi-EMD algorithms, which demonstrates the superior performance of our method in terms of the quality of IMFs. Furthermore, this method can provide a new perspective to optimize the existing EMD algorithms, that is, transforming the structure of the input signal rather than being constrained by developing envelope computation techniques or signal decomposition methods. In summary, the study suggests that the serial-EMD technique is a highly competitive and fast alternative for multi-dimensional signal analysis.
    Unified Framework for Spectral Dimensionality Reduction, Maximum Variance Unfolding, and Kernel Learning By Semidefinite Programming: Tutorial and Survey. (arXiv:2106.15379v1 [stat.ML])
    (2 min) This is a tutorial and survey paper on unification of spectral dimensionality reduction methods, kernel learning by Semidefinite Programming (SDP), Maximum Variance Unfolding (MVU) or Semidefinite Embedding (SDE), and its variants. We first explain how the spectral dimensionality reduction methods can be unified as kernel Principal Component Analysis (PCA) with different kernels. This unification can be interpreted as eigenfunction learning or representation of kernel in terms of distance matrix. Then, since the spectral methods are unified as kernel PCA, we say let us learn the best kernel for unfolding the manifold of data to its maximum variance. We first briefly introduce kernel learning by SDP for the transduction task. Then, we explain MVU in detail. Various versions of supervised MVU using nearest neighbors graph, by class-wise unfolding, by Fisher criterion, and by colored MVU are explained. We also explain out-of-sample extension of MVU using eigenfunctions and kernel mapping. Finally, we introduce other variants of MVU including action respecting embedding, relaxed MVU, and landmark MVU for big data.
    ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. (arXiv:2106.15320v1 [cs.CV])
    (3 min) We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility, since more than 6 million are publicly available, and they constitute an important corpus to aid research and education across disciplines. The corpus is growing as new born-digital documents are included, and since millions of older theses and dissertations have been converted to digital form to be disseminated electronically in institutional repositories. In ETDs, as with other scholarly works, figures and tables can communicate a large amount of information in a concise way. Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well with scanned ETDs. Considering this problem, our assessment of state-of-the-art figure extraction systems is that the reason they do not function well on scanned PDFs is that they have only been trained on born-digital documents. To address this limitation, we present ScanBank, a new dataset containing 10 thousand scanned page images, manually labeled by humans as to the presence of the 3.3 thousand figures or tables found therein. We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs. We pose and answer important research questions aimed at finding better methods for figure extraction from scanned documents. One of those concerns the value for training, of data augmentation techniques applied to born-digital documents which are used to train models better suited for figure extraction from scanned documents. To the best of our knowledge, ScanBank is the first manually annotated dataset for figure and table extraction for scanned ETDs. A YOLOv5-based model, trained on ScanBank, outperforms existing comparable open-source and freely available baseline methods by a considerable margin.
    Improved Padding in CNNs for Quantitative Susceptibility Mapping. (arXiv:2106.15331v1 [cs.CV])
    (2 min) Recently, deep learning methods have been proposed for quantitative susceptibility mapping (QSM) data processing: background field removal, field-to-source inversion, and single-step QSM reconstruction. However, the conventional padding mechanism used in convolutional neural networks (CNNs) can introduce spatial artifacts, especially in QSM background field removal and single-step QSM which requires inference from total fields with extreme large values at the edge boundaries of volume of interest. To address this issue, we propose an improved padding technique which utilizes the neighboring valid voxels to estimate the invalid voxels of feature maps at volume boundaries in the neural networks. Studies using simulated and in-vivo data show that the proposed padding greatly improves estimation accuracy and reduces artifacts in the results in the tasks of background field removal, field-to-source inversion, and single-step QSM reconstruction.
    Syntactically Guided Generative Embeddings for Zero-Shot Skeleton Action Recognition. (arXiv:2101.11530v2 [cs.CV] UPDATED)
    (2 min) We introduce SynSE, a novel syntactically guided generative approach for Zero-Shot Learning (ZSL). Our end-to-end approach learns progressively refined generative embedding spaces constrained within and across the involved modalities (visual, language). The inter-modal constraints are defined between action sequence embedding and embeddings of Parts of Speech (PoS) tagged words in the corresponding action description. We deploy SynSE for the task of skeleton-based action sequence recognition. Our design choices enable SynSE to generalize compositionally, i.e., recognize sequences whose action descriptions contain words not encountered during training. We also extend our approach to the more challenging Generalized Zero-Shot Learning (GZSL) problem via a confidence-based gating mechanism. We are the first to present zero-shot skeleton action recognition results on the large-scale NTU-60 and NTU-120 skeleton action datasets with multiple splits. Our results demonstrate SynSE's state of the art performance in both ZSL and GZSL settings compared to strong baselines on the NTU-60 and NTU-120 datasets. The code and pretrained models are available at https://github.com/skelemoa/synse-zsl
    Boggart: Accelerating Retrospective Video Analytics via Model-Agnostic Ingest Processing. (arXiv:2106.15315v1 [cs.CV])
    (2 min) Delivering fast responses to retrospective queries on video datasets is difficult due to the large number of frames to consider and the high costs of running convolutional neural networks (CNNs) on each one. A natural solution is to perform a subset of the necessary computations ahead of time, as video is ingested. However, existing ingest-time systems require knowledge of the specific CNN that will be used in future queries -- a challenging requisite given the evergrowing space of CNN architectures and training datasets/methodologies. This paper presents Boggart, a retrospective video analytics system that delivers ingest-time speedups in a model-agnostic manner. Our underlying insight is that traditional computer vision (CV) algorithms are capable of performing computations that can be used to accelerate diverse queries with wide-ranging CNNs. Building on this, at ingest-time, Boggart carefully employs a variety of motion tracking algorithms to identify potential objects and their trajectories across frames. Then, at query-time, Boggart uses several novel techniques to collect the smallest sample of CNN results required to meet the target accuracy: (1) a clustering strategy to efficiently unearth the inevitable discrepancies between CV- and CNN-generated outputs, and (2) a set of accuracy-preserving propagation techniques to safely extend sampled results along each trajectory. Across many videos, CNNs, and queries Boggart consistently meets accuracy targets while using CNNs sparingly (on 3-54% of frames).
    Framework for an Intelligent Affect Aware Smart Home Environment for Elderly People. (arXiv:2106.15599v1 [cs.HC])
    (2 min) The population of elderly people has been increasing at a rapid rate over the last few decades and their population is expected to further increase in the upcoming future. Their increasing population is associated with their increasing needs due to problems like physical disabilities, cognitive issues, weakened memory and disorganized behavior, that elderly people face with increasing age. To reduce their financial burden on the world economy and to enhance their quality of life, it is essential to develop technology-based solutions that are adaptive, assistive and intelligent in nature. Intelligent Affect Aware Systems that can not only analyze but also predict the behavior of elderly people in the context of their day to day interactions with technology in an IoT-based environment, holds immense potential for serving as a long-term solution for improving the user experience of elderly in smart homes. This work therefore proposes the framework for an Intelligent Affect Aware environment for elderly people that can not only analyze the affective components of their interactions but also predict their likely user experience even before they start engaging in any activity in the given smart home environment. This forecasting of user experience would provide scope for enhancing the same, thereby increasing the assistive and adaptive nature of such intelligent systems. To uphold the efficacy of this proposed framework for improving the quality of life of elderly people in smart homes, it has been tested on three datasets and the results are presented and discussed.
    Quantifying urban streetscapes with deep learning: focus on aesthetic evaluation. (arXiv:2106.15361v1 [cs.CV])
    (2 min) The disorder of urban streetscapes would negatively affect people's perception of their aesthetic quality. The presence of billboards on building facades has been regarded as an important factor of the disorder, but its quantification methodology has not yet been developed in a scalable manner. To fill the gap, this paper reports the performance of our deep learning model on a unique data set prepared in Tokyo to recognize the areas covered by facades and billboards in streetscapes, respectively. The model achieved 63.17 % of accuracy, measured by Intersection-over-Union (IoU), thus enabling researchers and practitioners to obtain insights on urban streetscape design by combining data of people's preferences.
    Predicting the Solar Potential of Rooftops using Image Segmentation and Structured Data. (arXiv:2106.15268v1 [cs.CV])
    (2 min) Estimating the amount of electricity that can be produced by rooftop photovoltaic systems is a time-consuming process that requires on-site measurements, a difficult task to achieve on a large scale. In this paper, we present an approach to estimate the solar potential of rooftops based on their location and architectural characteristics, as well as the amount of solar radiation they receive annually. Our technique uses computer vision to achieve semantic segmentation of roof sections and roof objects on the one hand, and a machine learning model based on structured building features to predict roof pitch on the other hand. We then compute the azimuth and maximum number of solar panels that can be installed on a rooftop with geometric approaches. Finally, we compute precise shading masks and combine them with solar irradiation data that enables us to estimate the yearly solar potential of a rooftop.
    Automatic 2D-3D Registration without Contrast Agent during Neurovascular Interventions. (arXiv:2106.15308v1 [cs.CV])
    (2 min) Fusing live fluoroscopy images with a 3D rotational reconstruction of the vasculature allows to navigate endovascular devices in minimally invasive neuro-vascular treatment, while reducing the usage of harmful iodine contrast medium. The alignment of the fluoroscopy images and the 3D reconstruction is initialized using the sensor information of the X-ray C-arm geometry. Patient motion is then corrected by an image-based registration algorithm, based on a gradient difference similarity measure using digital reconstructed radiographs of the 3D reconstruction. This algorithm does not require the vessels in the fluoroscopy image to be filled with iodine contrast agent, but rather relies on gradients in the image (bone structures, sinuses) as landmark features. This paper investigates the accuracy, robustness and computation time aspects of the image-based registration algorithm. Using phantom experiments 97% of the registration attempts passed the success criterion of a residual registration error of less than 1 mm translation and 3{\deg} rotation. The paper establishes a new method for validation of 2D-3D registration without requiring changes to the clinical workflow, such as attaching fiducial markers. As a consequence, this method can be retrospectively applied to pre-existing clinical data. For clinical data experiments, 87% of the registration attempts passed the criterion of a residual translational error of < 1 mm, and 84% possessed a rotational error of < 3{\deg}.
    Multimodal Semantic Scene Graphs for Holistic Modeling of Surgical Procedures. (arXiv:2106.15309v1 [cs.CV])
    (2 min) From a computer science viewpoint, a surgical domain model needs to be a conceptual one incorporating both behavior and data. It should therefore model actors, devices, tools, their complex interactions and data flow. To capture and model these, we take advantage of the latest computer vision methodologies for generating 3D scene graphs from camera views. We then introduce the Multimodal Semantic Scene Graph (MSSG) which aims at providing a unified symbolic, spatiotemporal and semantic representation of surgical procedures. This methodology aims at modeling the relationship between different components in surgical domain including medical staff, imaging systems, and surgical devices, opening the path towards holistic understanding and modeling of surgical procedures. We then use MSSG to introduce a dynamically generated graphical user interface tool for surgical procedure analysis which could be used for many applications including process optimization, OR design and automatic report generation. We finally demonstrate that the proposed MSSGs could also be used for synchronizing different complex surgical procedures. While the system still needs to be integrated into real operating rooms before getting validated, this conference paper aims mainly at providing the community with the basic principles of this novel concept through a first prototypal partial realization based on MVOR dataset.
    SE-MD: A Single-encoder multiple-decoder deep network for point cloud generation from 2D images. (arXiv:2106.15325v1 [cs.CV])
    (2 min) 3D model generation from single 2D RGB images is a challenging and actively researched computer vision task. Various techniques using conventional network architectures have been proposed for the same. However, the body of research work is limited and there are various issues like using inefficient 3D representation formats, weak 3D model generation backbones, inability to generate dense point clouds, dependence of post-processing for generation of dense point clouds, and dependence on silhouettes in RGB images. In this paper, a novel 2D RGB image to point cloud conversion technique is proposed, which improves the state of art in the field due to its efficient, robust and simple model by using the concept of parallelization in network architecture. It not only uses the efficient and rich 3D representation of point clouds, but also uses a novel and robust point cloud generation backbone in order to address the prevalent issues. This involves using a single-encoder multiple-decoder deep network architecture wherein each decoder generates certain fixed viewpoints. This is followed by fusing all the viewpoints to generate a dense point cloud. Various experiments are conducted on the technique and its performance is compared with those of other state of the art techniques and impressive gains in performance are demonstrated. Code is available at https://github.com/mueedhafiz1982/
    Adaptive Sample Selection for Robust Learning under Label Noise. (arXiv:2106.15292v1 [cs.LG])
    (2 min) Deep Neural Networks (DNNs) have been shown to be susceptible to memorization or overfitting in the presence of noisily labelled data. For the problem of robust learning under such noisy data, several algorithms have been proposed. A prominent class of algorithms rely on sample selection strategies, motivated by curriculum learning. For example, many algorithms use the `small loss trick' wherein a fraction of samples with loss values below a certain threshold are selected for training. These algorithms are sensitive to such thresholds, and it is difficult to fix or learn these thresholds. Often, these algorithms also require information such as label noise rates which are typically unavailable in practice. In this paper, we propose a data-dependent, adaptive sample selection strategy that relies only on batch statistics of a given mini-batch to provide robustness against label noise. The algorithm does not have any additional hyperparameters for sample selection, does not need any information on noise rates, and does not need access to separate data with clean labels. We empirically demonstrate the effectiveness of our algorithm on benchmark datasets.
    Face Identification Proficiency Test Designed Using Item Response Theory. (arXiv:2106.15323v1 [cs.CV])
    (2 min) Measures of face identification proficiency are essential to ensure accurate and consistent performance by professional forensic face examiners and others who perform face identification tasks in applied scenarios. Current proficiency tests rely on static sets of stimulus items, and so, cannot be administered validly to the same individual multiple times. To create a proficiency test, a large number of items of "known" difficulty must be assembled. Multiple tests of equal difficulty can be constructed then using subsets of items. Here, we introduce a proficiency test, the Triad Identity Matching (TIM) test, based on stimulus difficulty measures based on Item Response Theory (IRT). Participants view face-image "triads" (N=225) (two images of one identity and one image of a different identity) and select the different identity. In Experiment 1, university students (N=197) showed wide-ranging accuracy on the TIM test. Furthermore, IRT modeling demonstrated that the TIM test produces items of various difficulty levels. In Experiment 2, IRT-based item difficulty measures were used to partition the TIM test into three equally "easy" and three equally "difficult" subsets. Simulation results indicated that the full set, as well as curated subsets, of the TIM items yielded reliable estimates of subject ability. In summary, the TIM test can provide a starting point for developing a framework that is flexible, calibrated, and adaptive to measure proficiency across various ability levels (e.g., professionals or populations with face processing deficits)
    Cascaded Diffusion Models for High Fidelity Image Generation. (arXiv:2106.15282v1 [cs.CV])
    (2 min) We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation challenge, without any assistance from auxiliary image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution, beginning with a standard diffusion model at the lowest resolution, followed by one or more super-resolution diffusion models that successively upsample the image and add higher resolution details. We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation, our proposed method of data augmentation of the lower resolution conditioning inputs to the super-resolution models. Our experiments show that conditioning augmentation prevents compounding error during sampling in a cascaded model, helping us to train cascading pipelines achieving FID scores of 1.48 at 64x64, 3.52 at 128x128 and 4.88 at 256x256 resolutions, outperforming BigGAN-deep.
    Joint Learning of Portrait Intrinsic Decomposition and Relighting. (arXiv:2106.15305v1 [cs.CV])
    (2 min) Inverse rendering is the problem of decomposing an image into its intrinsic components, i.e. albedo, normal and lighting. To solve this ill-posed problem from single image, state-of-the-art methods in shape from shading mostly resort to supervised training on all the components on either synthetic or real datasets. Here, we propose a new self-supervised training paradigm that 1) reduces the need for full supervision on the decomposition task and 2) takes into account the relighting task. We introduce new self-supervised loss terms that leverage the consistencies between multi-lit images (images of the same scene under different illuminations). Our approach is applicable to multi-lit datasets. We apply our training approach in two settings: 1) train on a mixture of synthetic and real data, 2) train on real datasets with limited supervision. We show-case the effectiveness of our training paradigm on both intrinsic decomposition and relighting and demonstrate how the model struggles in both tasks without the self-supervised loss terms in limited supervision settings. We provide results of comprehensive experiments on SfSNet, CelebA and Photoface datasets and verify the performance of our approach on images in the wild.
    Analysing Affective Behavior in the second ABAW2 Competition. (arXiv:2106.15318v1 [cs.CV])
    (2 min) The Affective Behavior Analysis in-the-wild (ABAW2) 2021 Competition is the second -- following the first very successful ABAW Competition held in conjunction with IEEE FG 2020- Competition that aims at automatically analyzing affect. ABAW2 is split into three Challenges, each one addressing one of the three main behavior tasks of valence-arousal estimation, basic expression classification and action unit detection. All three Challenges are based on a common benchmark database, Aff-Wild2, which is a large scale in-the-wild database and the first one to be annotated for all these three tasks. In this paper, we describe this Competition, to be held in conjunction with ICCV 2021. We present the three Challenges, with the utilized Competition corpora. We outline the evaluation metrics and present the baseline system with its results. More information regarding the Competition is provided in the Competition site: https://ibug.doc.ic.ac.uk/resources/iccv-2021-2nd-abaw.
    Towards Fast and Accurate Multi-Person Pose Estimation on Mobile Devices. (arXiv:2106.15304v1 [cs.CV])
    (2 min) The rapid development of autonomous driving, abnormal behavior detection, and behavior recognition makes an increasing demand for multi-person pose estimation-based applications, especially on mobile platforms. However, to achieve high accuracy, state-of-the-art methods tend to have a large model size and complex post-processing algorithm, which costs intense computation and long end-to-end latency. To solve this problem, we propose an architecture optimization and weight pruning framework to accelerate inference of multi-person pose estimation on mobile devices. With our optimization framework, we achieve up to 2.51x faster model inference speed with higher accuracy compared to representative lightweight multi-person pose estimator.
    Image Inpainting Using Wasserstein Generative Adversarial Imputation Network. (arXiv:2106.15341v1 [cs.CV])
    (2 min) Image inpainting is one of the important tasks in computer vision which focuses on the reconstruction of missing regions in an image. The aim of this paper is to introduce an image inpainting model based on Wasserstein Generative Adversarial Imputation Network. The generator network of the model uses building blocks of convolutional layers with different dilation rates, together with skip connections that help the model reproduce fine details of the output. This combination yields a universal imputation model that is able to handle various scenarios of missingness with sufficient quality. To show this experimentally, the model is simultaneously trained to deal with three scenarios given by missing pixels at random, missing various smaller square regions, and one missing square placed in the center of the image. It turns out that our model achieves high-quality inpainting results on all scenarios. Performance is evaluated using peak signal-to-noise ratio and structural similarity index on two real-world benchmark datasets, CelebA faces and Paris StreetView. The results of our model are compared to biharmonic imputation and to some of the other state-of-the-art image inpainting methods.
    Inconspicuous Adversarial Patches for Fooling Image Recognition Systems on Mobile Devices. (arXiv:2106.15202v1 [cs.CV])
    (2 min) Deep learning based image recognition systems have been widely deployed on mobile devices in today's world. In recent studies, however, deep learning models are shown vulnerable to adversarial examples. One variant of adversarial examples, called adversarial patch, draws researchers' attention due to its strong attack abilities. Though adversarial patches achieve high attack success rates, they are easily being detected because of the visual inconsistency between the patches and the original images. Besides, it usually requires a large amount of data for adversarial patch generation in the literature, which is computationally expensive and time-consuming. To tackle these challenges, we propose an approach to generate inconspicuous adversarial patches with one single image. In our approach, we first decide the patch locations basing on the perceptual sensitivity of victim models, then produce adversarial patches in a coarse-to-fine way by utilizing multiple-scale generators and discriminators. The patches are encouraged to be consistent with the background images with adversarial training while preserving strong attack abilities. Our approach shows the strong attack abilities in white-box settings and the excellent transferability in black-box settings through extensive experiments on various models with different architectures and training methods. Compared to other adversarial patches, our adversarial patches hold the most negligible risks to be detected and can evade human observations, which is supported by the illustrations of saliency maps and results of user evaluations. Lastly, we show that our adversarial patches can be applied in the physical world.
    Artificial Intelligence in Minimally Invasive Interventional Treatment. (arXiv:2106.15306v1 [cs.CV])
    (2 min) Minimally invasive image guided treatment procedures often employ advanced image processing algorithms. The recent developments of artificial intelligence algorithms harbor potential to further enhance this domain. In this article we explore several application areas within the minimally invasive treatment space and discuss the deployment of artificial intelligence within these areas.
    Similarity Embedding Networks for Robust Human Activity Recognition. (arXiv:2106.15283v1 [cs.CV])
    (2 min) Deep learning models for human activity recognition (HAR) based on sensor data have been heavily studied recently. However, the generalization ability of deep models on complex real-world HAR data is limited by the availability of high-quality labeled activity data, which are hard to obtain. In this paper, we design a similarity embedding neural network that maps input sensor signals onto real vectors through carefully designed convolutional and LSTM layers. The embedding network is trained with a pairwise similarity loss, encouraging the clustering of samples from the same class in the embedded real space, and can be effectively trained on a small dataset and even on a noisy dataset with mislabeled samples. Based on the learned embeddings, we further propose both nonparametric and parametric approaches for activity recognition. Extensive evaluation based on two public datasets has shown that the proposed similarity embedding network significantly outperforms state-of-the-art deep models on HAR classification tasks, is robust to mislabeled samples in the training set, and can also be used to effectively denoise a noisy dataset.
    Predicting Depth from Semantic Segmentation using Game Engine Dataset. (arXiv:2106.15257v1 [cs.CV])
    (2 min) Depth perception is fundamental for robots to understand the surrounding environment. As the view of cognitive neuroscience, visual depth perception methods are divided into three categories, namely binocular, active, and pictorial. The first two categories have been studied for decades in detail. However, research for the exploration of the third category is still in its infancy and has got momentum by the advent of deep learning methods in recent years. In cognitive neuroscience, it is known that pictorial depth perception mechanisms are dependent on the perception of seen objects. Inspired by this fact, in this thesis, we investigated the relation of perception of objects and depth estimation convolutional neural networks. For this purpose, we developed new network structures based on a simple depth estimation network that only used a single image at its input. Our proposed structures use both an image and a semantic label of the image as their input. We used semantic labels as the output of object perception. The obtained results of performance comparison between the developed network and original network showed that our novel structures can improve the performance of depth estimation by 52\% of relative error of distance in the examined cases. Most of the experimental studies were carried out on synthetic datasets that were generated by game engines to isolate the performance comparison from the effect of inaccurate depth and semantic labels of non-synthetic datasets. It is shown that particular synthetic datasets may be used for training of depth networks in cases that an appropriate dataset is not available. Furthermore, we showed that in these cases, usage of semantic labels improves the robustness of the network against domain shift from synthetic training data to non-synthetic test data.
    On-board Volcanic Eruption Detection through CNNs and Satellite Multispectral Imagery. (arXiv:2106.15281v1 [cs.CV])
    (2 min) In recent years, the growth of Machine Learning algorithms in a variety of different applications has raised numerous studies on the applicability of these algorithms in real scenarios. Among all, one of the hardest scenarios, due to its physical requirements, is the aerospace one. In this context, the authors of this work aim to propose a first prototype and a study of feasibility for an AI model to be 'loaded' on board. As a case study, the authors decided to investigate the detection of volcanic eruptions as a method to swiftly produce alerts. Two Convolutional Neural Networks have been proposed and created, also showing how to correctly implement them on real hardware and how the complexity of a CNN can be adapted to fit computational requirements.
    MFR 2021: Masked Face Recognition Competition. (arXiv:2106.15288v1 [cs.CV])
    (2 min) This paper presents a summary of the Masked Face Recognition Competitions (MFR) held within the 2021 International Joint Conference on Biometrics (IJCB 2021). The competition attracted a total of 10 participating teams with valid submissions. The affiliations of these teams are diverse and associated with academia and industry in nine different countries. These teams successfully submitted 18 valid solutions. The competition is designed to motivate solutions aiming at enhancing the face recognition accuracy of masked faces. Moreover, the competition considered the deployability of the proposed solutions by taking the compactness of the face recognition models into account. A private dataset representing a collaborative, multi-session, real masked, capture scenario is used to evaluate the submitted solutions. In comparison to one of the top-performing academic face recognition solutions, 10 out of the 18 submitted solutions did score higher masked face verification accuracy.
    IREM: High-Resolution Magnetic Resonance (MR) Image Reconstruction via Implicit Neural Representation. (arXiv:2106.15097v1 [eess.IV])
    (2 min) For collecting high-quality high-resolution (HR) MR image, we propose a novel image reconstruction network named IREM, which is trained on multiple low-resolution (LR) MR images and achieve an arbitrary up-sampling rate for HR image reconstruction. In this work, we suppose the desired HR image as an implicit continuous function of the 3D image spatial coordinate and the thick-slice LR images as several sparse discrete samplings of this function. Then the super-resolution (SR) task is to learn the continuous volumetric function from a limited observations using an fully-connected neural network combined with Fourier feature positional encoding. By simply minimizing the error between the network prediction and the acquired LR image intensity across each imaging plane, IREM is trained to represent a continuous model of the observed tissue anatomy. Experimental results indicate that IREM succeeds in representing high frequency image feature, and in real scene data collection, IREM reduces scan time and achieves high-quality high-resolution MR imaging in terms of SNR and local image detail.
    Wrong Colored Vermeer: Color-Symmetric Image Distortion. (arXiv:2106.15179v1 [cs.CV])
    (2 min) Color symmetry implies that the colors of geometrical objects are assigned according to their symmetry properties. It is defined by associating the elements of the symmetry group with a color permutation. I use this concept for generative art and apply symmetry-consistent color distortions to images of paintings by Johannes Vermeer. The color permutations are realized as mappings of the HSV color space onto itself.
    Do Not Deceive Your Employer with a Virtual Background: A Video Conferencing Manipulation-Detection System. (arXiv:2106.15130v1 [cs.CR])
    (2 min) The last-generation video conferencing software allows users to utilize a virtual background to conceal their personal environment due to privacy concerns, especially in official meetings with other employers. On the other hand, users maybe want to fool people in the meeting by considering the virtual background to conceal where they are. In this case, developing tools to understand the virtual background utilize for fooling people in meeting plays an important role. Besides, such detectors must prove robust against different kinds of attacks since a malicious user can fool the detector by applying a set of adversarial editing steps on the video to conceal any revealing footprint. In this paper, we study the feasibility of an efficient tool to detect whether a videoconferencing user background is real. In particular, we provide the first tool which computes pixel co-occurrences matrices and uses them to search for inconsistencies among spectral and spatial bands. Our experiments confirm that cross co-occurrences matrices improve the robustness of the detector against different kinds of attacks. This work's performance is especially noteworthy with regard to color SPAM features. Moreover, the performance especially is significant with regard to robustness versus post-processing, like geometric transformations, filtering, contrast enhancement, and JPEG compression with different quality factors.
    ElephantBook: A Semi-Automated Human-in-the-Loop System for Elephant Re-Identification. (arXiv:2106.15083v1 [cs.LG])
    (2 min) African elephants are vital to their ecosystems, but their populations are threatened by a rise in human-elephant conflict and poaching. Monitoring population dynamics is essential in conservation efforts; however, tracking elephants is a difficult task, usually relying on the invasive and sometimes dangerous placement of GPS collars. Although there have been many recent successes in the use of computer vision techniques for automated identification of other species, identification of elephants is extremely difficult and typically requires expertise as well as familiarity with elephants in the population. We have built and deployed a web-based platform and database for human-in-the-loop re-identification of elephants combining manual attribute labeling and state-of-the-art computer vision algorithms, known as ElephantBook. Our system is currently in use at the Mara Elephant Project, helping monitor the protected and at-risk population of elephants in the Greater Maasai Mara ecosystem. ElephantBook makes elephant re-identification usable by non-experts and scalable for use by multiple conservation NGOs.
    SRF-Net: Selective Receptive Field Network for Anchor-Free Temporal Action Detection. (arXiv:2106.15258v1 [cs.CV])
    (2 min) Temporal action detection (TAD) is a challenging task which aims to temporally localize and recognize the human action in untrimmed videos. Current mainstream one-stage TAD approaches localize and classify action proposals relying on pre-defined anchors, where the location and scale for action instances are set by designers. Obviously, such an anchor-based TAD method limits its generalization capability and will lead to performance degradation when videos contain rich action variation. In this study, we explore to remove the requirement of pre-defined anchors for TAD methods. A novel TAD model termed as Selective Receptive Field Network (SRF-Net) is developed, in which the location offsets and classification scores at each temporal location can be directly estimated in the feature map and SRF-Net is trained in an end-to-end manner. Innovatively, a building block called Selective Receptive Field Convolution (SRFC) is dedicatedly designed which is able to adaptively adjust its receptive field size according to multiple scales of input information at each temporal location in the feature map. Extensive experiments are conducted on the THUMOS14 dataset, and superior results are reported comparing to state-of-the-art TAD approaches.
    TUCaN: Progressively Teaching Colourisation to Capsules. (arXiv:2106.15176v1 [cs.CV])
    (2 min) Automatic image colourisation is the computer vision research path that studies how to colourise greyscale images (for restoration). Deep learning techniques improved image colourisation yielding astonishing results. These differ by various factors, such as structural differences, input types, user assistance, etc. Most of them, base the architectural structure on convolutional layers with no emphasis on layers specialised in object features extraction. We introduce a novel downsampling upsampling architecture named TUCaN (Tiny UCapsNet) that exploits the collaboration of convolutional layers and capsule layers to obtain a neat colourisation of entities present in every single image. This is obtained by enforcing collaboration among such layers by skip and residual connections. We pose the problem as a per pixel colour classification task that identifies colours as a bin in a quantized space. To train the network, in contrast with the standard end to end learning method, we propose the progressive learning scheme to extract the context of objects by only manipulating the learning process without changing the model. In this scheme, the upsampling starts from the reconstruction of low resolution images and progressively grows to high resolution images throughout the training phase. Experimental results on three benchmark datasets show that our approach with ImageNet10k dataset outperforms existing methods on standard quality metrics and achieves state of the art performances on image colourisation. We performed a user study to quantify the perceptual realism of the colourisation results demonstrating: that progressive learning let the TUCaN achieve better colours than the end to end scheme; and pointing out the limitations of the existing evaluation metrics.
    Multi-Exit Vision Transformer for Dynamic Inference. (arXiv:2106.15183v1 [cs.CV])
    (2 min) Deep neural networks can be converted to multi-exit architectures by inserting early exit branches after some of their intermediate layers. This allows their inference process to become dynamic, which is useful for time critical IoT applications with stringent latency requirements, but with time-variant communication and computation resources. In particular, in edge computing systems and IoT networks where the exact computation time budget is variable and not known beforehand. Vision Transformer is a recently proposed architecture which has since found many applications across various domains of computer vision. In this work, we propose seven different architectures for early exit branches that can be used for dynamic inference in Vision Transformer backbones. Through extensive experiments involving both classification and regression problems, we show that each one of our proposed architectures could prove useful in the trade-off between accuracy and speed.
    Multimodal Trajectory Prediction Conditioned on Lane-Graph Traversals. (arXiv:2106.15004v1 [cs.CV])
    (2 min) Accurately predicting the future motion of surrounding vehicles requires reasoning about the inherent uncertainty in goals and driving behavior. This uncertainty can be loosely decoupled into lateral (e.g., keeping lane, turning) and longitudinal (e.g., accelerating, braking). We present a novel method that combines learned discrete policy rollouts with a focused decoder on subsets of the lane graph. The policy rollouts explore different goals given our current observations, ensuring that the model captures lateral variability. The longitudinal variability is captured by our novel latent variable model decoder that is conditioned on various subsets of the lane graph. Our model achieves state-of-the-art performance on the nuScenes motion prediction dataset, and qualitatively demonstrates excellent scene compliance. Detailed ablations highlight the importance of both the policy rollouts and the decoder architecture.
    Autonomous Driving Implementation in an Experimental Environment. (arXiv:2106.15274v1 [cs.RO])
    (2 min) Autonomous systems require identifying the environment and it has a long way to go before putting it safely into practice. In autonomous driving systems, the detection of obstacles and traffic lights are of importance as well as lane tracking. In this study, an autonomous driving system is developed and tested in the experimental environment designed for this purpose. In this system, a model vehicle having a camera is used to trace the lanes and avoid obstacles to experimentally study autonomous driving behavior. Convolutional Neural Network models were trained for Lane tracking. For the vehicle to avoid obstacles, corner detection, optical flow, focus of expansion, time to collision, balance calculation, and decision mechanism were created, respectively.
    O2O-Afford: Annotation-Free Large-Scale Object-Object Affordance Learning. (arXiv:2106.15087v1 [cs.CV])
    (2 min) Contrary to the vast literature in modeling, perceiving, and understanding agent-object (e.g., human-object, hand-object, robot-object) interaction in computer vision and robotics, very few past works have studied the task of object-object interaction, which also plays an important role in robotic manipulation and planning tasks. There is a rich space of object-object interaction scenarios in our daily life, such as placing an object on a messy tabletop, fitting an object inside a drawer, pushing an object using a tool, etc. In this paper, we propose a unified affordance learning framework to learn object-object interaction for various tasks. By constructing four object-object interaction task environments using physical simulation (SAPIEN) and thousands of ShapeNet models with rich geometric diversity, we are able to conduct large-scale object-object affordance learning without the need for human annotations or demonstrations. At the core of technical contribution, we propose an object-kernel point convolution network to reason about detailed interaction between two objects. Experiments on large-scale synthetic data and real-world data prove the effectiveness of the proposed approach. Please refer to the project webpage for code, data, video, and more materials: https://cs.stanford.edu/~kaichun/o2oafford
    Using Robust Regression to Find Font Usage Trends. (arXiv:2106.15232v1 [cs.CV])
    (2 min) Fonts have had trends throughout their history, not only in when they were invented but also in their usage and popularity. In this paper, we attempt to specifically find the trends in font usage using robust regression on a large collection of text images. We utilize movie posters as the source of fonts for this task because movie posters can represent time periods by using their release date. In addition, movie posters are documents that are carefully designed and represent a wide range of fonts. To understand the relationship between the fonts of movie posters and time, we use a regression Convolutional Neural Network (CNN) to estimate the release year of a movie using an isolated title text image. Due to the difficulty of the task, we propose to use of a hybrid training regimen that uses a combination of Mean Squared Error (MSE) and Tukey's biweight loss. Furthermore, we perform a thorough analysis on the trends of fonts through time.
    Object Detection Based Handwriting Localization. (arXiv:2106.14989v1 [cs.CV])
    (2 min) We present an object detection based approach to localize handwritten regions from documents, which initially aims to enhance the anonymization during the data transmission. The concatenated fusion of original and preprocessed images containing both printed texts and handwritten notes or signatures are fed into the convolutional neural network, where the bounding boxes are learned to detect the handwriting. Afterwards, the handwritten regions can be processed (e.g. replaced with redacted signatures) to conceal the personally identifiable information (PII). This processing pipeline based on the deep learning network Cascade R-CNN works at 10 fps on a GPU during the inference, which ensures the enhanced anonymization with minimal computational overheads. Furthermore, the impressive generalizability has been empirically showcased: the trained model based on the English-dominant dataset works well on the fictitious unseen invoices, even in Chinese. The proposed approach is also expected to facilitate other tasks such as handwriting recognition and signature verification.
    Improving Transferability of Adversarial Patches on Face Recognition with Generative Models. (arXiv:2106.15058v1 [cs.CV])
    (2 min) Face recognition is greatly improved by deep convolutional neural networks (CNNs). Recently, these face recognition models have been used for identity authentication in security sensitive applications. However, deep CNNs are vulnerable to adversarial patches, which are physically realizable and stealthy, raising new security concerns on the real-world applications of these models. In this paper, we evaluate the robustness of face recognition models using adversarial patches based on transferability, where the attacker has limited accessibility to the target models. First, we extend the existing transfer-based attack techniques to generate transferable adversarial patches. However, we observe that the transferability is sensitive to initialization and degrades when the perturbation magnitude is large, indicating the overfitting to the substitute models. Second, we propose to regularize the adversarial patches on the low dimensional data manifold. The manifold is represented by generative models pre-trained on legitimate human face images. Using face-like features as adversarial perturbations through optimization on the manifold, we show that the gaps between the responses of substitute models and the target models dramatically decrease, exhibiting a better transferability. Extensive digital world experiments are conducted to demonstrate the superiority of the proposed method in the black-box setting. We apply the proposed method in the physical world as well.
    Towards Understanding the Effectiveness of Attention Mechanism. (arXiv:2106.15067v1 [cs.CV])
    (2 min) Attention Mechanism is a widely used method for improving the performance of convolutional neural networks (CNNs) on computer vision tasks. Despite its pervasiveness, we have a poor understanding of what its effectiveness stems from. It is popularly believed that its effectiveness stems from the visual attention explanation, advocating focusing on the important part of input data rather than ingesting the entire input. In this paper, we find that there is only a weak consistency between the attention weights of features and their importance. Instead, we verify the crucial role of feature map multiplication in attention mechanism and uncover a fundamental impact of feature map multiplication on the learned landscapes of CNNs: with the high order non-linearity brought by the feature map multiplication, it played a regularization role on CNNs, which made them learn smoother and more stable landscapes near real samples compared to vanilla CNNs. This smoothness and stability induce a more predictive and stable behavior in-between real samples, and make CNNs generate better. Moreover, motivated by the proposed effectiveness of feature map multiplication, we design feature map multiplication network (FMMNet) by simply replacing the feature map addition in ResNet with feature map multiplication. FMMNet outperforms ResNet on various datasets, and this indicates that feature map multiplication plays a vital role in improving the performance even without finely designed attention mechanism in existing methods.
    Data augmentation for deep learning based accelerated MRI reconstruction with limited data. (arXiv:2106.14947v1 [eess.IV])
    (2 min) Deep neural networks have emerged as very successful tools for image restoration and reconstruction tasks. These networks are often trained end-to-end to directly reconstruct an image from a noisy or corrupted measurement of that image. To achieve state-of-the-art performance, training on large and diverse sets of images is considered critical. However, it is often difficult and/or expensive to collect large amounts of training images. Inspired by the success of Data Augmentation (DA) for classification problems, in this paper, we propose a pipeline for data augmentation for accelerated MRI reconstruction and study its effectiveness at reducing the required training data in a variety of settings. Our DA pipeline, MRAugment, is specifically designed to utilize the invariances present in medical imaging measurements as naive DA strategies that neglect the physics of the problem fail. Through extensive studies on multiple datasets we demonstrate that in the low-data regime DA prevents overfitting and can match or even surpass the state of the art while using significantly fewer training data, whereas in the high-data regime it has diminishing returns. Furthermore, our findings show that DA can improve the robustness of the model against various shifts in the test distribution.
    EVPropNet: Detecting Drones By Finding Propellers For Mid-Air Landing And Following. (arXiv:2106.15045v1 [cs.CV])
    (2 min) The rapid rise of accessibility of unmanned aerial vehicles or drones pose a threat to general security and confidentiality. Most of the commercially available or custom-built drones are multi-rotors and are comprised of multiple propellers. Since these propellers rotate at a high-speed, they are generally the fastest moving parts of an image and cannot be directly "seen" by a classical camera without severe motion blur. We utilize a class of sensors that are particularly suitable for such scenarios called event cameras, which have a high temporal resolution, low-latency, and high dynamic range. In this paper, we model the geometry of a propeller and use it to generate simulated events which are used to train a deep neural network called EVPropNet to detect propellers from the data of an event camera. EVPropNet directly transfers to the real world without any fine-tuning or retraining. We present two applications of our network: (a) tracking and following an unmarked drone and (b) landing on a near-hover drone. We successfully evaluate and demonstrate the proposed approach in many real-world experiments with different propeller shapes and sizes. Our network can detect propellers at a rate of 85.1% even when 60% of the propeller is occluded and can run at upto 35Hz on a 2W power budget. To our knowledge, this is the first deep learning-based solution for detecting propellers (to detect drones). Finally, our applications also show an impressive success rate of 92% and 90% for the tracking and landing tasks respectively.
    Striking the Right Balance: Recall Loss for Semantic Segmentation. (arXiv:2106.14917v1 [cs.CV])
    (2 min) Class imbalance is a fundamental problem in computer vision applications such as semantic segmentation. Specifically, uneven class distributions in a training dataset often result in unsatisfactory performance on under-represented classes. Many works have proposed to weight the standard cross entropy loss function with pre-computed weights based on class statistics, such as the number of samples and class margins. There are two major drawbacks to these methods: 1) constantly up-weighting minority classes can introduce excessive false positives in semantic segmentation; 2) a minority class is not necessarily a hard class. The consequence is low precision due to excessive false positives. In this regard, we propose a hard-class mining loss by reshaping the vanilla cross entropy loss such that it weights the loss for each class dynamically based on instantaneous recall performance. We show that the novel recall loss changes gradually between the standard cross entropy loss and the inverse frequency weighted loss. Recall loss also leads to improved mean accuracy while offering competitive mean Intersection over Union (IoU) performance. On Synthia dataset, recall loss achieves 9% relative improvement on mean accuracy with competitive mean IoU using DeepLab-ResNet18 compared to the cross entropy loss. Code available at https://github.com/PotatoTian/recall-semseg.
    Are conditional GANs explicitly conditional?. (arXiv:2106.15011v1 [cs.CV])
    (2 min) This paper proposes two important contributions for conditional Generative Adversarial Networks (cGANs) to improve the wide variety of applications that exploit this architecture. The first main contribution is an analysis of cGANs to show that they are not explicitly conditional. In particular, it will be shown that the discriminator and subsequently the cGAN does not automatically learn the conditionality between inputs. The second contribution is a new method, called acontrario, that explicitly models conditionality for both parts of the adversarial architecture via a novel acontrario loss that involves training the discriminator to learn unconditional (adverse) examples. This leads to a novel type of data augmentation approach for GANs (acontrario learning) which allows to restrict the search space of the generator to conditional outputs using adverse examples. Extensive experimentation is carried out to evaluate the conditionality of the discriminator by proposing a probability distribution analysis. Comparisons with the cGAN architecture for different applications show significant improvements in performance on well known datasets including, semantic image synthesis, image segmentation and monocular depth prediction using different metrics including Fr\'echet Inception Distance(FID), mean Intersection over Union (mIoU), Root Mean Square Error log (RMSE log) and Number of statistically-Different Bins (NDB)
    Fast Training of Neural Lumigraph Representations using Meta Learning. (arXiv:2106.14942v1 [cs.CV])
    (2 min) Novel view synthesis is a long-standing problem in machine learning and computer vision. Significant progress has recently been made in developing neural scene representations and rendering techniques that synthesize photorealistic images from arbitrary views. These representations, however, are extremely slow to train and often also slow to render. Inspired by neural variants of image-based rendering, we develop a new neural rendering approach with the goal of quickly learning a high-quality representation which can also be rendered in real-time. Our approach, MetaNLR++, accomplishes this by using a unique combination of a neural shape representation and 2D CNN-based image feature extraction, aggregation, and re-projection. To push representation convergence times down to minutes, we leverage meta learning to learn neural shape and image feature priors which accelerate training. The optimized shape and image features can then be extracted using traditional graphics techniques and rendered in real time. We show that MetaNLR++ achieves similar or better novel view synthesis results in a fraction of the time that competing methods require.
    Constructing Forest Biomass Prediction Maps from Radar Backscatter by Sequential Regression with a Conditional Generative Adversarial Network. (arXiv:2106.15020v1 [cs.LG])
    (2 min) This paper studies construction of above-ground biomass (AGB) prediction maps from synthetic aperture radar (SAR) intensity images. The purpose is to improve traditional regression models based on SAR intensity, trained with a limited amount of AGB in situ measurements. Although it is costly to collect, data from airborne laser scanning (ALS) sensors are highly correlated with AGB. Therefore, we propose using AGB predictions based on ALS data as surrogate response variables for SAR data in a sequential modelling fashion. This increases the amount of training data dramatically. To model the regression function between SAR intensity and ALS-predicted AGB we propose to utilise a conditional generative adversarial network (cGAN), i.e. the Pix2Pix convolutional neural network. This enables the recreation of existing ALS-based AGB prediction maps. The generated synthesised ALS-based AGB predictions are evaluated qualitatively and quantitatively against ALS-based AGB predictions retrieved from a traditional non-sequential regression model trained in the same area. Results show that the proposed architecture manages to capture characteristics of the actual data. This suggests that the use of ALS-guided generative models is a promising avenue for AGB prediction from SAR intensity. Further research on this area has the potential of providing both large-scale and low-cost predictions of AGB.
    Cosmic-CoNN: A Cosmic Ray Detection Deep-Learning Framework, Dataset, and Toolkit. (arXiv:2106.14922v1 [astro-ph.IM])
    (2 min) Rejecting cosmic rays (CRs) is essential for scientific interpretation of CCD-captured data, but detecting CRs in single-exposure images has remained challenging. Conventional CR-detection algorithms require tuning multiple parameters experimentally making it hard to automate across different instruments or observation requests. Recent work using deep learning to train CR-detection models has demonstrated promising results. However, instrument-specific models suffer from performance loss on images from ground-based facilities not included in the training data. In this work, we present Cosmic-CoNN, a deep-learning framework designed to produce generic CR-detection models. We build a large, diverse ground-based CR dataset leveraging thousands of images from the Las Cumbres Observatory global telescope network to produce a generic CR-detection model which achieves a 99.91% true-positive detection rate and maintains over 96.40% true-positive rates on unseen data from Gemini GMOS-N/S, with a false-positive rate of 0.01%. Apart from the open-source framework and dataset, we also build a suite of tools including console commands, a web-based application, and Python APIs to make automatic, robust CR detection widely accessible by the community of astronomers.
    GuidedMix-Net: Learning to Improve Pseudo Masks Using Labeled Images as Reference. (arXiv:2106.15064v1 [cs.CV])
    (2 min) Semi-supervised learning is a challenging problem which aims to construct a model by learning from a limited number of labeled examples. Numerous methods have been proposed to tackle this problem, with most focusing on utilizing the predictions of unlabeled instances consistency alone to regularize networks. However, treating labeled and unlabeled data separately often leads to the discarding of mass prior knowledge learned from the labeled examples, and failure to mine the feature interaction between the labeled and unlabeled image pairs. In this paper, we propose a novel method for semi-supervised semantic segmentation named GuidedMix-Net, by leveraging labeled information to guide the learning of unlabeled instances. Specifically, we first introduce a feature alignment objective between labeled and unlabeled data to capture potentially similar image pairs and then generate mixed inputs from them. The proposed mutual information transfer (MITrans), based on the cluster assumption, is shown to be a powerful knowledge module for further progressive refining features of unlabeled data in the mixed data space. To take advantage of the labeled examples and guide unlabeled data learning, we further propose a mask generation module to generate high-quality pseudo masks for the unlabeled data. Along with supervised learning for labeled data, the prediction of unlabeled data is jointly learned with the generated pseudo masks from the mixed data. Extensive experiments on PASCAL VOC 2012, PASCAL-Context and Cityscapes demonstrate the effectiveness of our GuidedMix-Net, which achieves competitive segmentation accuracy and significantly improves the mIoU by +7$\%$ compared to previous state-of-the-art approaches.
    An Uncertainty Estimation Framework for Probabilistic Object Detection. (arXiv:2106.15007v1 [cs.CV])
    (2 min) In this paper, we introduce a new technique that combines two popular methods to estimate uncertainty in object detection. Quantifying uncertainty is critical in real-world robotic applications. Traditional detection models can be ambiguous even when they provide a high-probability output. Robot actions based on high-confidence, yet unreliable predictions, may result in serious repercussions. Our framework employs deep ensembles and Monte Carlo dropout for approximating predictive uncertainty, and it improves upon the uncertainty estimation quality of the baseline method. The proposed approach is evaluated on publicly available synthetic image datasets captured from sequences of video.
  • cs.IR updates on arXiv.org

    Classification of Consumer Belief Statements From Social Media. (arXiv:2106.15498v1 [cs.LG])
    (2 min) Social media offer plenty of information to perform market research in order to meet the requirements of customers. One way how this research is conducted is that a domain expert gathers and categorizes user-generated content into a complex and fine-grained class structure. In many of such cases, little data meets complex annotations. It is not yet fully understood how this can be leveraged successfully for classification. We examine the classification accuracy of expert labels when used with a) many fine-grained classes and b) few abstract classes. For scenario b) we compare abstract class labels given by the domain expert as baseline and by automatic hierarchical clustering. We compare this to another baseline where the entire class structure is given by a completely unsupervised clustering approach. By doing so, this work can serve as an example of how complex expert annotations are potentially beneficial and can be utilized in the most optimal way for opinion mining in highly specific domains. By exploring across a range of techniques and experiments, we find that automated class abstraction approaches in particular the unsupervised approach performs remarkably well against domain expert baseline on text classification tasks. This has the potential to inspire opinion mining applications in order to support market researchers in practice and to inspire fine-grained automated content analysis on a large scale.
    When standard network measures fail to rank journals: A theoretical and empirical analysis. (arXiv:2106.15541v1 [cs.DL])
    (2 min) Journal rankings are widely used and are often based on citation data in combination with a network perspective. We argue that some of these network-based rankings can produce misleading results. From a theoretical point of view, we show that the standard network modelling approach of citation data at the journal level (i.e., the projection of paper citations onto journals) introduces fictitious relations among journals. To overcome this problem, we propose a citation path perspective, and empirically show that rankings based on the network and the citation path perspective are very different. Based on our theoretical and empirical analysis, we highlight the limitations of standard network metrics, and propose a method to overcome these limitations and compute journal rankings.
    On component interactions in two-stage recommender systems. (arXiv:2106.14979v1 [cs.IR])
    (2 min) Thanks to their scalability, two-stage recommenders are used by many of today's largest online platforms, including YouTube, LinkedIn, and Pinterest. These systems produce recommendations in two steps: (i) multiple nominators -- tuned for low prediction latency -- preselect a small subset of candidates from the whole item pool; (ii)~a slower but more accurate ranker further narrows down the nominated items, and serves to the user. Despite their popularity, the literature on two-stage recommenders is relatively scarce, and the algorithms are often treated as the sum of their parts. Such treatment presupposes that the two-stage performance is explained by the behavior of individual components if they were deployed independently. This is not the case: using synthetic and real-world data, we demonstrate that interactions between the ranker and the nominators substantially affect the overall performance. Motivated by these findings, we derive a generalization lower bound which shows that careful choice of each nominator's training set is sometimes the only difference between a poor and an optimal two-stage recommender. Since searching for a good choice manually is difficult, we learn one instead. In particular, using a Mixture-of-Experts approach, we train the nominators (experts) to specialize on different subsets of the item pool. This significantly improves performance.
    Topic Modeling Based Extractive Text Summarization. (arXiv:2106.15313v1 [cs.CL])
    (2 min) Text summarization is an approach for identifying important information present within text documents. This computational technique aims to generate shorter versions of the source text, by including only the relevant and salient information present within the source text. In this paper, we propose a novel method to summarize a text document by clustering its contents based on latent topics produced using topic modeling techniques and by generating extractive summaries for each of the identified text clusters. All extractive sub-summaries are later combined to generate a summary for any given source document. We utilize the lesser used and challenging WikiHow dataset in our approach to text summarization. This dataset is unlike the commonly used news datasets which are available for text summarization. The well-known news datasets present their most important information in the first few lines of their source texts, which make their summarization a lesser challenging task when compared to summarizing the WikiHow dataset. Contrary to these news datasets, the documents in the WikiHow dataset are written using a generalized approach and have lesser abstractedness and higher compression ratio, thus proposing a greater challenge to generate summaries. A lot of the current state-of-the-art text summarization techniques tend to eliminate important information present in source documents in the favor of brevity. Our proposed technique aims to capture all the varied information present in source documents. Although the dataset proved challenging, after performing extensive tests within our experimental setup, we have discovered that our model produces encouraging ROUGE results and summaries when compared to the other published extractive and abstractive text summarization models.
    A scalable solution to the nearest neighbor search problem through local-search methods on neighbor graphs. (arXiv:1705.10351v4 [cs.DS] UPDATED)
    (2 min) Near neighbor search (NNS) is a powerful abstraction for data access; however, data indexing is troublesome even for approximate indexes. For intrinsically high-dimensional data, high-quality fast searches demand either indexes with impractically large memory usage or preprocessing time. In this paper, we introduce an algorithm to solve a nearest-neighbor query $q$ by minimizing a kernel function defined by the distance from $q$ to each object in the database. The minimization is performed using metaheuristics to solve the problem rapidly; even when some methods in the literature use this strategy behind the scenes, our approach is the first one using it explicitly. We also provide two approaches to select edges in the graph's construction stage that limit memory footprint and reduce the number of free parameters simultaneously. We carry out a thorough experimental comparison with state-of-the-art indexes through synthetic and real-world datasets; we found out that our contributions achieve competitive performances regarding speed, accuracy, and memory in almost any of our benchmarks.
    Context-aware Heterogeneous Graph Attention Network for User Behavior Prediction in Local Consumer Service Platform. (arXiv:2106.14652v2 [cs.IR] UPDATED)
    (2 min) As a new type of e-commerce platform developed in recent years, local consumer service platform provides users with software to consume service to the nearby store or to the home, such as Groupon and Koubei. Different from other common e-commerce platforms, the behavior of users on the local consumer service platform is closely related to their real-time local context information. Therefore, building a context-aware user behavior prediction system is able to provide both merchants and users better service in local consumer service platforms. However, most of the previous work just treats the contextual information as an ordinary feature into the prediction model to obtain the prediction list under a specific context, which ignores the fact that the interest of a user in different contexts is often significantly different. Hence, in this paper, we propose a context-aware heterogeneous graph attention network (CHGAT) to dynamically generate the representation of the user and to estimate the probability for future behavior. Specifically, we first construct the meta-path based heterogeneous graphs with the historical behaviors from multiple sources and comprehend heterogeneous vertices in the graph with a novel unified knowledge representing approach. Next, a multi-level attention mechanism is introduced for context-aware aggregation with graph vertices, which contains the vertex-level attention network and the path-level attention network. Both of them aim to capture the semantic correlation between information contained in the graph and the outside real-time contextual information in the search system. Then the model proposed in this paper aggregates specific graphs with their corresponding context features and obtains the representation of user interest under a specific context and input it into the prediction network to finally obtain the predicted probability of user behavior.
    Overview of BioASQ 2021: The ninth BioASQ challenge on Large-Scale Biomedical Semantic Indexing and Question Answering. (arXiv:2106.14885v1 [cs.CL])
    (2 min) Advancing the state-of-the-art in large-scale biomedical semantic indexing and question answering is the main focus of the BioASQ challenge. BioASQ organizes respective tasks where different teams develop systems that are evaluated on the same benchmark datasets that represent the real information needs of experts in the biomedical domain. This paper presents an overview of the ninth edition of the BioASQ challenge in the context of the Conference and Labs of the Evaluation Forum (CLEF) 2021. In this year, a new question answering task, named Synergy, is introduced to support researchers studying the COVID-19 disease and measure the ability of the participating teams to discern information while the problem is still developing. In total, 42 teams with more than 170 systems were registered to participate in the four tasks of the challenge. The evaluation results, similarly to previous years, show a performance gain against the baselines which indicates the continuous improvement of the state-of-the-art in this field.
    A Bytecode-based Approach for Smart Contract Classification. (arXiv:2106.15497v1 [cs.IR])
    (2 min) With the development of blockchain technologies, the number of smart contracts deployed on blockchain platforms is growing exponentially, which makes it difficult for users to find desired services by manual screening. The automatic classification of smart contracts can provide blockchain users with keyword-based contract searching and helps to manage smart contracts effectively. Current research on smart contract classification focuses on Natural Language Processing (NLP) solutions which are based on contract source code. However, more than 94% of smart contracts are not open-source, so the application scenarios of NLP methods are very limited. Meanwhile, NLP models are vulnerable to adversarial attacks. This paper proposes a classification model based on features from contract bytecode instead of source code to solve these problems. We also use feature selection and ensemble learning to optimize the model. Our experimental studies on over 3,300 real-world Ethereum smart contracts show that our model can classify smart contracts without source code and has better performance than baseline models. Our model also has good resistance to adversarial attacks compared with NLP-based models. In addition, our analysis reveals that account features used in many smart contract classification models have little effect on classification and can be excluded.
    On the Capacity of Quantum Private Information Retrieval from MDS-Coded and Colluding Servers. (arXiv:2106.14719v2 [cs.IT] UPDATED)
    (2 min) In quantum private information retrieval (QPIR), a user retrieves a classical file from multiple servers by downloading quantum systems without revealing the identity of the file. The QPIR capacity is the maximal achievable ratio of the retrieved file size to the total download size. In this paper, the capacity of QPIR from MDS-coded and colluding servers is studied. Two classes of QPIR, called stabilizer QPIR and dimension squared QPIR induced from classical strongly linear PIR are defined, and the related QPIR capacities are derived. For the non-colluding case, the general QPIR capacity is derived when the number of files goes to infinity. The capacities of symmetric and non-symmetric QPIR with coded and colluding servers are proved to coincide, being double to their classical counterparts. A general statement on the converse bound for QPIR with coded and colluding servers is derived showing that the capacities of stabilizer QPIR and dimension squared QPIR induced from any class of PIR are upper bounded by twice the classical capacity of the respective PIR class. The proposed capacity-achieving scheme combines the star-product scheme by Freij-Hollanti et al. and the stabilizer QPIR scheme by Song et al. by employing (weakly) self-dual Reed--Solomon codes.
    Convolutional Hypercomplex Embeddings for Link Prediction. (arXiv:2106.15230v1 [cs.LG])
    (2 min) Knowledge graph embedding research has mainly focused on the two smallest normed division algebras, $\mathbb{R}$ and $\mathbb{C}$. Recent results suggest that trilinear products of quaternion-valued embeddings can be a more effective means to tackle link prediction. In addition, models based on convolutions on real-valued embeddings often yield state-of-the-art results for link prediction. In this paper, we investigate a composition of convolution operations with hypercomplex multiplications. We propose the four approaches QMult, OMult, ConvQ and ConvO to tackle the link prediction problem. QMult and OMult can be considered as quaternion and octonion extensions of previous state-of-the-art approaches, including DistMult and ComplEx. ConvQ and ConvO build upon QMult and OMult by including convolution operations in a way inspired by the residual learning framework. We evaluated our approaches on seven link prediction datasets including WN18RR, FB15K-237 and YAGO3-10. Experimental results suggest that the benefits of learning hypercomplex-valued vector representations become more apparent as the size and complexity of the knowledge graph grows. ConvO outperforms state-of-the-art approaches on FB15K-237 in MRR, Hit@1 and Hit@3, while QMult, OMult, ConvQ and ConvO outperform state-of-the-approaches on YAGO3-10 in all metrics. Results also suggest that link prediction performances can be further improved via prediction averaging. To foster reproducible research, we provide an open-source implementation of approaches, including training and evaluation scripts as well as pretrained models.
  • cs.LG updates on arXiv.org

    Analyzing the Stability of Non-coplanar Circumbinary Planets using Machine Learning. (arXiv:2101.02316v2 [astro-ph.EP] UPDATED)
    (2 min) Exoplanet detection in the past decade by efforts including NASA's Kepler and TESS missions has discovered many worlds that differ substantially from planets in our own Solar system, including more than 400 exoplanets orbiting binary or multi-star systems. This not only broadens our understanding of the diversity of exoplanets, but also promotes our study of exoplanets in the complex binary and multi-star systems and provides motivation to explore their habitability. In this study, we analyze orbital stability of exoplanets in non-coplanar circumbinary systems using a numerical simulation method, with which a large number of circumbinary planet samples are generated in order to quantify the effects of various orbital parameters on orbital stability. We also train a machine learning model that can quickly determine the stability of the circumbinary planetary systems. Our results indicate that larger inclinations of the planet tend to increase the stability of its orbit, but change in the planet's mass range between Earth and Jupiter has little effect on the stability of the system. In addition, we find that Deep Neural Networks (DNNs) have higher accuracy and precision than other machine learning algorithms.
    A function approximation approach to the prediction of blood glucose levels. (arXiv:2105.05893v2 [cs.LG] UPDATED)
    (2 min) The problem of real time prediction of blood glucose (BG) levels based on the readings from a continuous glucose monitoring (CGM) device is a problem of great importance in diabetes care, and therefore, has attracted a lot of research in recent years, especially based on machine learning. An accurate prediction with a 30, 60, or 90 minute prediction horizon has the potential of saving millions of dollars in emergency care costs. In this paper, we treat the problem as one of function approximation, where the value of the BG level at time $t+h$ (where $h$ the prediction horizon) is considered to be an unknown function of $d$ readings prior to the time $t$. This unknown function may be supported in particular on some unknown submanifold of the $d$-dimensional Euclidean space. While manifold learning is classically done in a semi-supervised setting, where the entire data has to be known in advance, we use recent ideas to achieve an accurate function approximation in a supervised setting; i.e., construct a model for the target function. We use the state-of-the-art clinically relevant PRED-EGA grid to evaluate our results, and demonstrate that for a real life dataset, our method performs better than a standard deep network, especially in hypoglycemic and hyperglycemic regimes. One noteworthy aspect of this work is that the training data and test data may come from different distributions.
    Spatio-Temporal Graph Convolution for Resting-State fMRI Analysis. (arXiv:2003.10613v3 [cs.LG] UPDATED)
    (2 min) The Blood-Oxygen-Level-Dependent (BOLD) signal of resting-state fMRI (rs-fMRI) records the temporal dynamics of intrinsic functional networks in the brain. However, existing deep learning methods applied to rs-fMRI either neglect the functional dependency between different brain regions in a network or discard the information in the temporal dynamics of brain activity. To overcome those shortcomings, we propose to formulate functional connectivity networks within the context of spatio-temporal graphs. We train a spatio-temporal graph convolutional network (ST-GCN) on short sub-sequences of the BOLD time series to model the non-stationary nature of functional connectivity. Simultaneously, the model learns the importance of graph edges within ST-GCN to gain insight into the functional connectivities contributing to the prediction. In analyzing the rs-fMRI of the Human Connectome Project (HCP, N=1,091) and the National Consortium on Alcohol and Neurodevelopment in Adolescence (NCANDA, N=773), ST-GCN is significantly more accurate than common approaches in predicting gender and age based on BOLD signals. Furthermore, the brain regions and functional connections significantly contributing to the predictions of our model are important markers according to the neuroscience literature.
    The Values Encoded in Machine Learning Research. (arXiv:2106.15590v1 [cs.LG])
    (2 min) Machine learning (ML) currently exerts an outsized influence on the world, increasingly affecting communities and institutional practices. It is therefore critical that we question vague conceptions of the field as value-neutral or universally beneficial, and investigate what specific values the field is advancing. In this paper, we present a rigorous examination of the values of the field by quantitatively and qualitatively analyzing 100 highly cited ML papers published at premier ML conferences, ICML and NeurIPS. We annotate key features of papers which reveal their values: how they justify their choice of project, which aspects they uplift, their consideration of potential negative consequences, and their institutional affiliations and funding sources. We find that societal needs are typically very loosely connected to the choice of project, if mentioned at all, and that consideration of negative consequences is extremely rare. We identify 67 values that are uplifted in machine learning research, and, of these, we find that papers most frequently justify and assess themselves based on performance, generalization, efficiency, researcher understanding, novelty, and building on previous work. We present extensive textual evidence and analysis of how these values are operationalized. Notably, we find that each of these top values is currently being defined and applied with assumptions and implications generally supporting the centralization of power. Finally, we find increasingly close ties between these highly cited papers and tech companies and elite universities.
    Towards Interpretable Natural Language Understanding with Explanations as Latent Variables. (arXiv:2011.05268v2 [cs.CL] UPDATED)
    (2 min) Recently generating natural language explanations has shown very promising results in not only offering interpretable explanations but also providing additional information and supervision for prediction. However, existing approaches usually require a large set of human annotated explanations for training while collecting a large set of explanations is not only time consuming but also expensive. In this paper, we develop a general framework for interpretable natural language understanding that requires only a small set of human annotated explanations for training. Our framework treats natural language explanations as latent variables that model the underlying reasoning process of a neural model. We develop a variational EM framework for optimization where an explanation generation module and an explanation-augmented prediction module are alternatively optimized and mutually enhance each other. Moreover, we further propose an explanation-based self-training method under this framework for semi-supervised learning. It alternates between assigning pseudo-labels to unlabeled data and generating new explanations to iteratively improve each other. Experiments on two natural language understanding tasks demonstrate that our framework can not only make effective predictions in both supervised and semi-supervised settings, but also generate good natural language explanation.
    Against Membership Inference Attack: Pruning is All You Need. (arXiv:2008.13578v3 [cs.LG] UPDATED)
    (2 min) The large model size, high computational operations, and vulnerability against membership inference attack (MIA) have impeded deep learning or deep neural networks (DNNs) popularity, especially on mobile devices. To address the challenge, we envision that the weight pruning technique will help DNNs against MIA while reducing model storage and computational operation. In this work, we propose a pruning algorithm, and we show that the proposed algorithm can find a subnetwork that can prevent privacy leakage from MIA and achieves competitive accuracy with the original DNNs. We also verify our theoretical insights with experiments. Our experimental results illustrate that the attack accuracy using model compression is up to 13.6% and 10% lower than that of the baseline and Min-Max game, accordingly.
    PPFL: Privacy-preserving Federated Learning with Trusted Execution Environments. (arXiv:2104.14380v2 [cs.CR] UPDATED)
    (2 min) We propose and implement a Privacy-preserving Federated Learning ($PPFL$) framework for mobile systems to limit privacy leakages in federated learning. Leveraging the widespread presence of Trusted Execution Environments (TEEs) in high-end and mobile devices, we utilize TEEs on clients for local training, and on servers for secure aggregation, so that model/gradient updates are hidden from adversaries. Challenged by the limited memory size of current TEEs, we leverage greedy layer-wise training to train each model's layer inside the trusted area until its convergence. The performance evaluation of our implementation shows that $PPFL$ can significantly improve privacy while incurring small system overheads at the client-side. In particular, $PPFL$ can successfully defend the trained model against data reconstruction, property inference, and membership inference attacks. Furthermore, it can achieve comparable model utility with fewer communication rounds (0.54$\times$) and a similar amount of network traffic (1.002$\times$) compared to the standard federated learning of a complete model. This is achieved while only introducing up to ~15% CPU time, ~18% memory usage, and ~21% energy consumption overhead in $PPFL$'s client-side.
    Domain Generalization using Causal Matching. (arXiv:2006.07500v3 [cs.LG] UPDATED)
    (2 min) In the domain generalization literature, a common objective is to learn representations independent of the domain after conditioning on the class label. We show that this objective is not sufficient: there exist counter-examples where a model fails to generalize to unseen domains even after satisfying class-conditional domain invariance. We formalize this observation through a structural causal model and show the importance of modeling within-class variations for generalization. Specifically, classes contain objects that characterize specific causal features, and domains can be interpreted as interventions on these objects that change non-causal features. We highlight an alternative condition: inputs across domains should have the same representation if they are derived from the same object. Based on this objective, we propose matching-based algorithms when base objects are observed (e.g., through data augmentation) and approximate the objective when objects are not observed (MatchDG). Our simple matching-based algorithms are competitive to prior work on out-of-domain accuracy for rotated MNIST, Fashion-MNIST, PACS, and Chest-Xray datasets. Our method MatchDG also recovers ground-truth object matches: on MNIST and Fashion-MNIST, top-10 matches from MatchDG have over 50% overlap with ground-truth matches.
    Expert Q-learning: Deep Q-learning With State Values From Expert Examples. (arXiv:2106.14642v2 [cs.LG] UPDATED)
    (2 min) We propose a novel algorithm named Expert Q-learning. Expert Q-learning was inspired by Dueling Q-learning and aimed at incorporating the ideas from semi-supervised learning into reinforcement learning through splitting Q-values into state values and action advantages. Different from Generative Adversarial Imitation Learning and Deep Q-Learning from Demonstrations, the offline expert we have used only predicts the value of a state from {-1, 0, 1}, indicating whether this is a bad, neutral or good state. An expert network was designed in addition to the Q-network, which updates each time following the regular offline minibatch update whenever the expert example buffer is not empty. The Q-network plays the role of the advantage function only during the update. Our algorithm also keeps asynchronous copies of the Q-network and expert network, predicting the target values using the same manner as of Double Q-learning. We compared on the game of Othello our algorithm with the state-of-the-art Q-learning algorithm, which was a combination of Double Q-learning and Dueling Q-learning. The results showed that Expert Q-learning was indeed useful and more resistant to the overestimation bias of Q-learning. The baseline Q-learning algorithm exhibited unstable and suboptimal behavior, especially when playing against a stochastic player, whereas Expert Q-learning demonstrated more robust performance with higher scores. Expert Q-learning without using examples has also gained better results than the baseline algorithm when trained and tested against a fixed player. On the other hand, Expert Q-learning without examples cannot win against the baseline Q-learning algorithm in direct game competitions despite the fact that it has also shown the strength of reducing the overestimation bias.
    Doing good by fighting fraud: Ethical anti-fraud systems for mobile payments. (arXiv:2106.14861v2 [cs.CR] UPDATED)
    (2 min) App builders commonly use security challenges, a form of step-up authentication, to add security to their apps. However, the ethical implications of this type of architecture has not been studied previously. In this paper, we present a large-scale measurement study of running an existing anti-fraud security challenge, Boxer, in real apps running on mobile devices. We find that although Boxer does work well overall, it is unable to scan effectively on devices that run its machine learning models at less than one frame per second (FPS), blocking users who use inexpensive devices. With the insights from our study, we design Daredevil, anew anti-fraud system for scanning payment cards that work swell across the broad range of performance characteristics and hardware configurations found on modern mobile devices. Daredevil reduces the number of devices that run at less than one FPS by an order of magnitude compared to Boxer, providing a more equitable system for fighting fraud. In total, we collect data from 5,085,444 real devices spread across 496 real apps running production software and interacting with real users.
    Inductive Bias of Multi-Channel Linear Convolutional Networks with Bounded Weight Norm. (arXiv:2102.12238v2 [cs.LG] UPDATED)
    (2 min) We study the function space characterization of the inductive bias resulting from controlling the $\ell_2$ norm of the weights in linear convolutional networks. We view this in terms of an induced regularizer in the function space given by the minimum norm of weights required to realize a linear function. For two layer linear convolutional networks with $C$ output channels and kernel size $K$, we show the following: (a) If the inputs to the network have a single channel, the induced regularizer for any $K$ is a norm given by a semidefinite program (SDP) that is independent of the number of output channels $C$. (b) In contrast, for networks with multi-channel inputs, multiple output channels can be necessary to merely realize all matrix-valued linear functions and thus the inductive bias does depend on $C$. Further, for sufficiently large $C$, the induced regularizer for $K=1$ and $K=D$ are the nuclear norm and the $\ell_{2,1}$ group-sparse norm, respectively, of the Fourier coefficients. (c) Complementing our theoretical results, we show through experiments on MNIST and CIFAR-10 that our key findings extend to implicit biases from gradient descent in overparameterized networks.
    Symmetry meets AI. (arXiv:2103.06115v2 [cs.LG] UPDATED)
    (2 min) We explore whether Neural Networks (NNs) can {\it discover} the presence of symmetries as they learn to perform a task. For this, we train hundreds of NNs on a {\it decoy task} based on well-controlled Physics templates, where no information on symmetry is provided. We use the output from the last hidden layer of all these NNs, projected to fewer dimensions, as the input for a symmetry classification task, and show that information on symmetry had indeed been identified by the original NN without guidance. As an interdisciplinary application of this procedure, we identify the presence and level of symmetry in artistic paintings from different styles such as those of Picasso, Pollock and Van Gogh.
    Improved Approximation Properties of Dictionaries and Applications to Neural Networks. (arXiv:2101.12365v6 [stat.ML] UPDATED)
    (2 min) This article addresses the problem of approximating a function in a Hilbert space by an expansion over a dictionary $\mathbb{D}$. We introduce the notion of a smoothly parameterized dictionary and give upper bounds on the approximation rates, metric entropy and $n$-widths of the absolute convex hull, which we denote $B_1(\mathbb{D})$, of such dictionaries. The upper bounds depend upon the order of smoothness of the parameterization, and improve upon existing results in many cases. The main applications of these results is to the dictionaries $\mathbb{D} = \{\sigma(\omega\cdot x + b)\}\subset L^2$ corresponding to shallow neural networks with activation function $\sigma$, and to the dictionary of decaying Fourier modes corresponding to the spectral Barron space. This improves upon existing approximation rates for shallow neural networks when $\sigma = \text{ReLU}^k$ for $k\geq 2$, sharpens bounds on the metric entropy, and provides the first bounds on the Gelfand $n$-widths of the Barron space and spectral Barron space.
    Improved Prediction and Network Estimation Using the Monotone Single Index Multi-variate Autoregressive Model. (arXiv:2106.14630v2 [stat.ML] UPDATED)
    (2 min) Network estimation from multi-variate point process or time series data is a problem of fundamental importance. Prior work has focused on parametric approaches that require a known parametric model, which makes estimation procedures less robust to model mis-specification, non-linearities and heterogeneities. In this paper, we develop a semi-parametric approach based on the monotone single-index multi-variate autoregressive model (SIMAM) which addresses these challenges. We provide theoretical guarantees for dependent data and an alternating projected gradient descent algorithm. Significantly we do not explicitly assume mixing conditions on the process (although we do require conditions analogous to restricted strong convexity) and we achieve rates of the form $O(T^{-\frac{1}{3}} \sqrt{s\log(TM)})$ (optimal in the independent design case) where $s$ is the threshold for the maximum in-degree of the network that indicates the sparsity level, $M$ is the number of actors and $T$ is the number of time points. In addition, we demonstrate the superior performance both on simulated data and two real data examples where our SIMAM approach out-performs state-of-the-art parametric methods both in terms of prediction and network estimation.
    Size and Depth Separation in Approximating Benign Functions with Neural Networks. (arXiv:2102.00314v3 [cs.LG] UPDATED)
    (3 min) When studying the expressive power of neural networks, a main challenge is to understand how the size and depth of the network affect its ability to approximate real functions. However, not all functions are interesting from a practical viewpoint: functions of interest usually have a polynomially-bounded Lipschitz constant, and can be computed efficiently. We call functions that satisfy these conditions "benign", and explore the benefits of size and depth for approximation of benign functions with ReLU networks. As we show, this problem is more challenging than the corresponding problem for non-benign functions. We give barriers to showing depth-lower-bounds: Proving existence of a benign function that cannot be approximated by polynomial-size networks of depth $4$ would settle longstanding open problems in computational complexity. It implies that beyond depth $4$ there is a barrier to showing depth-separation for benign functions, even between networks of constant depth and networks of nonconstant depth. We also study size-separation, namely, whether there are benign functions that can be approximated with networks of size $O(s(d))$, but not with networks of size $O(s'(d))$. We show a complexity-theoretic barrier to proving such results beyond size $O(d\log^2(d))$, but also show an explicit benign function, that can be approximated with networks of size $O(d)$ and not with networks of size $o(d/\log d)$. For approximation in $L_\infty$ we achieve such separation already between size $O(d)$ and size $o(d)$. Moreover, we show superpolynomial size lower bounds and barriers to such lower bounds, depending on the assumptions on the function. Our size-separation results rely on an analysis of size lower bounds for Boolean functions, which is of independent interest: We show linear size lower bounds for computing explicit Boolean functions with neural networks and threshold circuits.
    Complexity of Stochastic Dual Dynamic Programming. (arXiv:1912.07702v6 [math.OC] UPDATED)
    (2 min) Stochastic dual dynamic programming is a cutting plane type algorithm for multi-stage stochastic optimization originated about 30 years ago. In spite of its popularity in practice, there does not exist any analysis on the convergence rates of this method. In this paper, we first establish the number of iterations, i.e., iteration complexity, required by a basic dynamic cutting plane method for solving relatively simple multi-stage optimization problems, by introducing novel mathematical tools including the saturation of search points. We then refine these basic tools and establish the iteration complexity for both deterministic and stochastic dual dynamic programming methods for solving more general multi-stage stochastic optimization problems under the standard stage-wise independence assumption. Our results indicate that the complexity of these methods mildly increases with the number of stages $T$, in fact linearly dependent on $T$ for discounted problems. Therefore, they are efficient for strategic decision making which involves a large number of stages, but with a relatively small number of decision variables in each stage. Without explicitly discretizing the state and action spaces, these methods might also be pertinent to the related reinforcement learning and stochastic control areas.
    Deep Learning Body Region Classification of MRI and CT examinations. (arXiv:2104.13826v2 [eess.IV] UPDATED)
    (2 min) Standardized body region labelling of individual images provides data that can improve human and computer use of medical images. A CNN-based classifier was developed to identify body regions in CT and MRI. 17 CT (18 MRI) body regions covering the entire human body were defined for the classification task. Three retrospective databases were built for the AI model training, validation, and testing, with a balanced distribution of studies per body region. The test databases originated from a different healthcare network. Accuracy, recall and precision of the classifier was evaluated for patient age, patient gender, institution, scanner manufacturer, contrast, slice thickness, MRI sequence, and CT kernel. The data included a retrospective cohort of 2,934 anonymized CT cases (training: 1,804 studies, validation: 602 studies, test: 528 studies) and 3,185 anonymized MRI cases (training: 1,911 studies, validation: 636 studies, test: 638 studies). 27 institutions from primary care hospitals, community hospitals and imaging centers contributed to the test datasets. The data included cases of all genders in equal proportions and subjects aged from a few months old to +90 years old. An image-level prediction accuracy of 91.9% (90.2 - 92.1) for CT, and 94.2% (92.0 - 95.6) for MRI was achieved. The classification results were robust across all body regions and confounding factors. Due to limited data, performance results for subjects under 10 years-old could not be reliably evaluated. We show that deep learning models can classify CT and MRI images by body region including lower and upper extremities with high accuracy.
    Off-Policy Risk Assessment in Contextual Bandits. (arXiv:2104.08977v2 [cs.LG] UPDATED)
    (2 min) Even when unable to run experiments, practitioners can evaluate prospective policies, using previously logged data. However, while the bandits literature has adopted a diverse set of objectives, most research on off-policy evaluation to date focuses on the expected reward. In this paper, we introduce Lipschitz risk functionals, a broad class of objectives that subsumes conditional value-at-risk (CVaR), variance, mean-variance, many distorted risks, and CPT risks, among others. We propose Off-Policy Risk Assessment (OPRA), a framework that first estimates a target policy's CDF and then generates plugin estimates for any collection of Lipschitz risks, providing finite sample guarantees that hold simultaneously over the entire class. We instantiate OPRA with both importance sampling and doubly robust estimators. Our primary theoretical contributions are (i) the first uniform concentration inequalities for both CDF estimators in contextual bandits and (ii) error bounds on our Lipschitz risk estimates, which all converge at a rate of $O(1/\sqrt{n})$.
    Universal Approximation Theorems for Differentiable Geometric Deep Learning. (arXiv:2101.05390v3 [cs.LG] UPDATED)
    (2 min) This paper addresses the growing need to process non-Euclidean data, by introducing a geometric deep learning (GDL) framework for building universal feedforward-type models compatible with differentiable manifold geometries. We show that our GDL models can approximate any continuous target function uniformly on compacts of a controlled maximum diameter. We obtain curvature dependant lower-bounds on this maximum diameter and upper-bounds on the depth of our approximating GDL models. Conversely, we find that there is always a continuous function between any two non-degenerate compact manifolds that any "locally-defined" GDL model cannot uniformly approximate. Our last main result identifies data-dependent conditions guaranteeing that the GDL model implementing our approximation breaks "the curse of dimensionality." We find that any "real-world" (i.e. finite) dataset always satisfies our condition and, conversely, any dataset satisfies our requirement if the target function is smooth. As applications, we confirm the universal approximation capabilities of the following GDL models: Ganea et al. (2018)'s hyperbolic feedforward networks, the architecture implementing Krishnan et al. (2015)'s deep Kalman-Filter, and deep softmax classifiers. We build universal extensions/variants of: the SPD-matrix regressor of Meyer et al. (2011), and Fletcher et al. (2009)'s Procrustean regressor. In the Euclidean setting, our results imply a quantitative version of Kidger and Lyons (2020)'s approximation theorem and a data-dependent version of Yarotsky and Zhevnerchuk (2020)'s uncursed approximation rates.
    As easy as APC: Leveraging self-supervised learning in the context of time series classification with varying levels of sparsity and severe class imbalance. (arXiv:2106.15577v1 [cs.LG])
    (2 min) High levels of sparsity and strong class imbalance are ubiquitous challenges that are often presented simultaneously in real-world time series data. While most methods tackle each problem separately, our proposed approach handles both in conjunction, while imposing fewer assumptions on the data. In this work, we propose leveraging a self-supervised learning method, specifically Autoregressive Predictive Coding (APC), to learn relevant hidden representations of time series data in the context of both missing data and class imbalance. We apply APC using either a GRU or GRU-D encoder on two real-world datasets, and show that applying one-step-ahead prediction with APC improves the classification results in all settings. In fact, by applying GRU-D - APC, we achieve state-of-the-art AUPRC results on the Physionet benchmark.
    Rapid parameter estimation of discrete decaying signals using autoencoder networks. (arXiv:2103.08663v2 [eess.SP] UPDATED)
    (2 min) In this work we demonstrate the use of neural networks for rapid extraction of signal parameters of discretely sampled signals. In particular, we use dense autoencoder networks to extract the parameters of interest from exponentially decaying signals and decaying oscillations. By using a three-stage training method and careful choice of the neural network size, we are able to retrieve the relevant signal parameters directly from the latent space of the autoencoder network at significantly improved rates compared to traditional algorithmic signal-analysis approaches. We show that the achievable precision and accuracy of this method of analysis is similar to conventional algorithm-based signal analysis methods, by demonstrating that the extracted signal parameters are approaching their fundamental parameter estimation limit as provided by the Cram\'er-Rao bound. Furthermore, we demonstrate that autoencoder networks are able to achieve signal analysis, and, hence, parameter extraction, at rates of 75 kHz, orders-of-magnitude faster than conventional techniques with similar precision. Finally, we explore the limitations of our approach, demonstrating that analysis rates of $>$200 kHz are feasible with further optimization of the transfer rate between the data-acquisition system and data-analysis system.
    Graph Partitioning and Sparse Matrix Ordering using Reinforcement Learning and Graph Neural Networks. (arXiv:2104.03546v2 [cs.LG] UPDATED)
    (2 min) We present a novel method for graph partitioning, based on reinforcement learning and graph convolutional neural networks. Our approach is to recursively partition coarser representations of a given graph. The neural network is implemented using SAGE graph convolution layers, and trained using an advantage actor critic (A2C) agent. We present two variants, one for finding an edge separator that minimizes the normalized cut or quotient cut, and one that finds a small vertex separator. The vertex separators are then used to construct a nested dissection ordering to permute a sparse matrix so that its triangular factorization will incur less fill-in. The partitioning quality is compared with partitions obtained using METIS and SCOTCH, and the nested dissection ordering is evaluated in the sparse solver SuperLU. Our results show that the proposed method achieves similar partitioning quality as METIS and SCOTCH. Furthermore, the method generalizes across different classes of graphs, and works well on a variety of graphs from the SuiteSparse sparse matrix collection.
    Fast Classification Learning with Neural Networks and Conceptors for Speech Recognition and Car Driving Maneuvers. (arXiv:2102.05588v3 [cs.LG] UPDATED)
    (2 min) Recurrent neural networks are a powerful means in diverse applications. We show that, together with so-called conceptors, they also allow fast learning, in contrast to other deep learning methods. In addition, a relatively small number of examples suffices to train neural networks with high accuracy. We demonstrate this with two applications, namely speech recognition and detecting car driving maneuvers. We improve the state of the art by application-specific preparation techniques: For speech recognition, we use mel frequency cepstral coefficients leading to a compact representation of the frequency spectra, and detecting car driving maneuvers can be done without the commonly used polynomial interpolation, as our evaluation suggests.
    Learning Off-Policy with Online Planning. (arXiv:2008.10066v3 [cs.LG] UPDATED)
    (2 min) Reinforcement learning (RL) in low-data and risk-sensitive domains requires performant and flexible deployment policies that can readily incorporate constraints during deployment. One such class of policies are the semi-parametric H-step lookahead policies, which select actions using trajectory optimization over a dynamics model for a fixed horizon with a terminal value function. In this work, we investigate a novel instantiation of H-step lookahead with a learned model and a terminal value function learned by a model-free off-policy algorithm, named Learning Off-Policy with Online Planning (LOOP). We provide a theoretical analysis of this method, suggesting a tradeoff between model errors and value function errors and empirically demonstrate this tradeoff to be beneficial in deep reinforcement learning. Furthermore, we identify the "Actor Divergence" issue in this framework and propose Actor Regularized Control (ARC), a modified trajectory optimization procedure. We evaluate our method on a set of robotic tasks for Offline and Online RL and demonstrate improved performance. We also show the flexibility of LOOP to incorporate safety constraints during deployment with a set of navigation environments. We demonstrate that LOOP is a desirable framework for robotics applications based on its strong performance in various important RL settings.
    Towards Finding Longer Proofs. (arXiv:1905.13100v2 [cs.LO] UPDATED)
    (2 min) We present a reinforcement learning (RL) based guidance system for automated theorem proving geared towards Finding Longer Proofs (FLoP). Unlike most learning based approaches, we focus on generalising from very little training data and achieving near complete confidence. We use several simple, structured datasets with very long proofs to show that FLoP can successfully generalise a single training proof to a large class of related problems. On these benchmarks, FLoP is competitive with strong theorem provers despite using very limited search, due to its ability to solve problems that are prohibitively long for other systems.
    Impossibility of Partial Recovery in the Graph Alignment Problem. (arXiv:2102.02685v2 [stat.ML] UPDATED)
    (2 min) Random graph alignment refers to recovering the underlying vertex correspondence between two random graphs with correlated edges. This can be viewed as an average-case and noisy version of the well-known graph isomorphism problem. For the correlated Erd\"os-R\'enyi model, we prove an impossibility result for partial recovery in the sparse regime, with constant average degree and correlation, as well as a general bound on the maximal reachable overlap. Our bound is tight in the noiseless case (the graph isomorphism problem) and we conjecture that it is still tight with noise. Our proof technique relies on a careful application of the probabilistic method to build automorphisms between tree components of a subcritical Erd\"os-R\'enyi graph.
    Exploiting Chain Rule and Bayes' Theorem to Compare Probability Distributions. (arXiv:2012.14100v4 [stat.ML] UPDATED)
    (2 min) To measure the difference between two probability distributions, referred to as the source and target, respectively, we exploit both the chain rule and Bayes' theorem to construct conditional transport (CT), which is constituted by both a forward component and a backward one. The forward CT is the expected cost of moving a source data point to a target one, with their joint distribution defined by the product of the source probability density function (PDF) and a source-dependent conditional distribution, which is related to the target PDF via Bayes' theorem. The backward CT is defined by reversing the direction. The CT cost can be approximated by replacing the source and target PDFs with their discrete empirical distributions supported on mini-batches, making it amenable to implicit distributions and stochastic gradient descent-based optimization. When applied to train a generative model, CT is shown to strike a good balance between mode-covering and mode-seeking behaviors and strongly resist mode collapse. On a wide variety of benchmark datasets for generative modeling, substituting the default statistical distance of an existing generative adversarial network with CT is shown to consistently improve the performance. PyTorch-style code is provided.
    An Ambient Intelligence-Based Human Behavior Monitoring Framework for Ubiquitous Environments. (arXiv:2106.15609v1 [cs.HC])
    (2 min) This framework for human behavior monitoring aims to take a holistic approach to study, track, monitor, and analyze human behavior during activities of daily living (ADLs). The framework consists of two novel functionalities. First, it can perform the semantic analysis of user interactions on the diverse contextual parameters during ADLs to identify a list of distinct behavioral patterns associated with different complex activities. Second, it consists of an intelligent decision-making algorithm that can analyze these behavioral patterns and their relationships with the dynamic contextual and spatial features of the environment to detect any anomalies in user behavior that could constitute an emergency. These functionalities of this interdisciplinary framework were developed by integrating the latest advancements and technologies in human-computer interaction, machine learning, Internet of Things, pattern recognition, and ubiquitous computing. The framework was evaluated on a dataset of ADLs, and the performance accuracies of these two functionalities were found to be 76.71% and 83.87%, respectively. The presented and discussed results uphold the relevance and immense potential of this framework to contribute towards improving the quality of life and assisted living of the aging population in the future of Internet of Things (IoT)-based ubiquitous living environments, e.g., smart homes.
    Span-based Joint Entity and Relation Extraction with Transformer Pre-training. (arXiv:1909.07755v4 [cs.CL] UPDATED)
    (2 min) We introduce SpERT, an attention model for span-based joint entity and relation extraction. Our key contribution is a light-weight reasoning on BERT embeddings, which features entity recognition and filtering, as well as relation classification with a localized, marker-free context representation. The model is trained using strong within-sentence negative samples, which are efficiently extracted in a single BERT pass. These aspects facilitate a search over all spans in the sentence. In ablation studies, we demonstrate the benefits of pre-training, strong negative sampling and localized context. Our model outperforms prior work by up to 2.6% F1 score on several datasets for joint entity and relation extraction.
    Non-asymptotic Superlinear Convergence of Standard Quasi-Newton Methods. (arXiv:2003.13607v3 [math.OC] UPDATED)
    (2 min) In this paper, we study and prove the non-asymptotic superlinear convergence rate of the Broyden class of quasi-Newton methods including Davidon--Fletcher--Powell (DFP) method and Broyden--Fletcher--Goldfarb--Shanno (BFGS) method. The asymptotic superlinear convergence rate of these quasi-Newton methods has been extensively studied, but their explicit finite time local convergence rate is not fully investigated. In this paper, we provide a finite time (non-asymptotic) convergence analysis for BFGS and DFP methods under the assumptions that the objective function is strongly convex, its gradient is Lipschitz continuous, and its Hessian is Lipschitz continuous only in the direction of the optimal solution. We show that in a local neighborhood of the optimal solution, the iterates generated by both DFP and BFGS converge to the optimal solution at a superlinear rate of $(1/k)^{k/2}$, where $k$ is the number of iterations. We also prove the same local superlinear convergence rate in the case that the objective function is self-concordant. Numerical experiments on different objective functions confirm our explicit convergence rates. Our theoretical guarantee is one of the first results that provide a non-asymptotic superlinear convergence rate for DFP and BFGS quasi-Newton methods.
    Joint Learning of Portrait Intrinsic Decomposition and Relighting. (arXiv:2106.15305v1 [cs.CV])
    (2 min) Inverse rendering is the problem of decomposing an image into its intrinsic components, i.e. albedo, normal and lighting. To solve this ill-posed problem from single image, state-of-the-art methods in shape from shading mostly resort to supervised training on all the components on either synthetic or real datasets. Here, we propose a new self-supervised training paradigm that 1) reduces the need for full supervision on the decomposition task and 2) takes into account the relighting task. We introduce new self-supervised loss terms that leverage the consistencies between multi-lit images (images of the same scene under different illuminations). Our approach is applicable to multi-lit datasets. We apply our training approach in two settings: 1) train on a mixture of synthetic and real data, 2) train on real datasets with limited supervision. We show-case the effectiveness of our training paradigm on both intrinsic decomposition and relighting and demonstrate how the model struggles in both tasks without the self-supervised loss terms in limited supervision settings. We provide results of comprehensive experiments on SfSNet, CelebA and Photoface datasets and verify the performance of our approach on images in the wild.
    Near field Acoustic Holography on arbitrary shapes using Convolutional Neural Network. (arXiv:2103.16935v2 [cs.SD] UPDATED)
    (2 min) Near-field Acoustic Holography (NAH) is a well-known problem aimed at estimating the vibrational velocity field of a structure by means of acoustic measurements. In this paper, we propose a NAH technique based on Convolutional Neural Network (CNN). The devised CNN predicts the vibrational field on the surface of arbitrary shaped plates (violin plates) with orthotropic material properties from a limited number of measurements. In particular, the architecture, named Super Resolution CNN (SRCNN), is able to estimate the vibrational field with a higher spatial resolution compared to the input pressure. The pressure and velocity datasets have been generated through Finite Element Method simulations. We validate the proposed method by comparing the estimates with the synthesized ground truth and with a state-of-the-art technique. Moreover, we evaluate the robustness of the devised network against noisy input data.
    Bias-Free Scalable Gaussian Processes via Randomized Truncations. (arXiv:2102.06695v2 [cs.LG] UPDATED)
    (2 min) Scalable Gaussian Process methods are computationally attractive, yet introduce modeling biases that require rigorous study. This paper analyzes two common techniques: early truncated conjugate gradients (CG) and random Fourier features (RFF). We find that both methods introduce a systematic bias on the learned hyperparameters: CG tends to underfit while RFF tends to overfit. We address these issues using randomized truncation estimators that eliminate bias in exchange for increased variance. In the case of RFF, we show that the bias-to-variance conversion is indeed a trade-off: the additional variance proves detrimental to optimization. However, in the case of CG, our unbiased learning procedure meaningfully outperforms its biased counterpart with minimal additional computation.
    Globally Optimal Hierarchical Reinforcement Learning for Linearly-Solvable Markov Decision Processes. (arXiv:2106.15380v1 [cs.LG])
    (2 min) In this work we present a novel approach to hierarchical reinforcement learning for linearly-solvable Markov decision processes. Our approach assumes that the state space is partitioned, and the subtasks consist in moving between the partitions. We represent value functions on several levels of abstraction, and use the compositionality of subtasks to estimate the optimal values of the states in each partition. The policy is implicitly defined on these optimal value estimates, rather than being decomposed among the subtasks. As a consequence, our approach can learn the globally optimal policy, and does not suffer from the non-stationarity of high-level decisions. If several partitions have equivalent dynamics, the subtasks of those partitions can be shared. If the set of boundary states is smaller than the entire state space, our approach can have significantly smaller sample complexity than that of a flat learner, and we validate this empirically in several experiments.
    Conditional Gradient Methods for Convex Optimization with General Affine and Nonlinear Constraints. (arXiv:2007.00153v3 [math.OC] UPDATED)
    (2 min) Conditional gradient methods have attracted much attention in both machine learning and optimization communities recently. These simple methods can guarantee the generation of sparse solutions. In addition, without the computation of full gradients, they can handle huge-scale problems sometimes even with an exponentially increasing number of decision variables. This paper aims to significantly expand the application areas of these methods by presenting new conditional gradient methods for solving convex optimization problems with general affine and nonlinear constraints. More specifically, we first present a new constraint extrapolated condition gradient (CoexCG) method that can achieve an ${\cal O}(1/\epsilon^2)$ iteration complexity for both smooth and structured nonsmooth function constrained convex optimization. We further develop novel variants of CoexCG, namely constraint extrapolated and dual regularized conditional gradient (CoexDurCG) methods, that can achieve similar iteration complexity to CoexCG but allow adaptive selection for algorithmic parameters. We illustrate the effectiveness of these methods for solving an important class of radiation therapy treatment planning problems arising from healthcare industry. To the best of our knowledge, all the algorithmic schemes and their complexity results are new in the area of projection-free methods.
    Modularity in Reinforcement Learning via Algorithmic Independence in Credit Assignment. (arXiv:2106.14993v1 [cs.LG])
    (2 min) Many transfer problems require re-using previously optimal decisions for solving new tasks, which suggests the need for learning algorithms that can modify the mechanisms for choosing certain actions independently of those for choosing others. However, there is currently no formalism nor theory for how to achieve this kind of modular credit assignment. To answer this question, we define modular credit assignment as a constraint on minimizing the algorithmic mutual information among feedback signals for different decisions. We introduce what we call the modularity criterion for testing whether a learning algorithm satisfies this constraint by performing causal analysis on the algorithm itself. We generalize the recently proposed societal decision-making framework as a more granular formalism than the Markov decision process to prove that for decision sequences that do not contain cycles, certain single-step temporal difference action-value methods meet this criterion while all policy-gradient methods do not. Empirical evidence suggests that such action-value methods are more sample efficient than policy-gradient methods on transfer problems that require only sparse changes to a sequence of previously optimal decisions.
    Contrastive Attraction and Contrastive Repulsion for Representation Learning. (arXiv:2105.03746v2 [cs.LG] UPDATED)
    (2 min) Contrastive learning (CL) is effective in learning data representations without label supervision, where the encoder needs to contrast each positive sample over multiple negative samples via a one-vs-many softmax cross-entropy loss. However, conventional CL is sensitive to how many negative samples are included and how they are selected. Proposed in this paper is a doubly CL strategy that contrasts positive samples and negative ones within themselves separately. We realize this strategy with contrastive attraction and contrastive repulsion (CACR) makes the query not only exert a greater force to attract more distant positive samples but also do so to repel closer negative samples. Theoretical analysis reveals the connection between CACR and CL from the perspectives of both positive attraction and negative repulsion and shows the benefits in both efficiency and robustness brought by separately contrasting within the sampled positive and negative pairs. Extensive large-scale experiments on standard vision tasks show that CACR not only consistently outperforms existing CL methods on benchmark datasets in representation learning, but also provides interpretable contrastive weights, demonstrating the efficacy of the proposed doubly contrastive strategy.
    SpreadsheetCoder: Formula Prediction from Semi-structured Context. (arXiv:2106.15339v1 [cs.SE])
    (2 min) Spreadsheet formula prediction has been an important program synthesis problem with many real-world applications. Previous works typically utilize input-output examples as the specification for spreadsheet formula synthesis, where each input-output pair simulates a separate row in the spreadsheet. However, this formulation does not fully capture the rich context in real-world spreadsheets. First, spreadsheet data entries are organized as tables, thus rows and columns are not necessarily independent from each other. In addition, many spreadsheet tables include headers, which provide high-level descriptions of the cell data. However, previous synthesis approaches do not consider headers as part of the specification. In this work, we present the first approach for synthesizing spreadsheet formulas from tabular context, which includes both headers and semi-structured tabular data. In particular, we propose SpreadsheetCoder, a BERT-based model architecture to represent the tabular context in both row-based and column-based formats. We train our model on a large dataset of spreadsheets, and demonstrate that SpreadsheetCoder achieves top-1 prediction accuracy of 42.51%, which is a considerable improvement over baselines that do not employ rich tabular context. Compared to the rule-based system, SpreadsheetCoder assists 82% more users in composing formulas on Google Sheets.
    Generalization of Reinforcement Learning with Policy-Aware Adversarial Data Augmentation. (arXiv:2106.15587v1 [cs.LG])
    (2 min) The generalization gap in reinforcement learning (RL) has been a significant obstacle that prevents the RL agent from learning general skills and adapting to varying environments. Increasing the generalization capacity of the RL systems can significantly improve their performance on real-world working environments. In this work, we propose a novel policy-aware adversarial data augmentation method to augment the standard policy learning method with automatically generated trajectory data. Different from the commonly used observation transformation based data augmentations, our proposed method adversarially generates new trajectory data based on the policy gradient objective and aims to more effectively increase the RL agent's generalization ability with the policy-aware data augmentation. Moreover, we further deploy a mixup step to integrate the original and generated data to enhance the generalization capacity while mitigating the over-deviation of the adversarial data. We conduct experiments on a number of RL tasks to investigate the generalization performance of the proposed method by comparing it with the standard baselines and the state-of-the-art mixreg approach. The results show our method can generalize well with limited training diversity, and achieve the state-of-the-art generalization test performance.
    On Graph Neural Network Ensembles for Large-Scale Molecular Property Prediction. (arXiv:2106.15529v1 [cs.LG])
    (2 min) In order to advance large-scale graph machine learning, the Open Graph Benchmark Large Scale Challenge (OGB-LSC) was proposed at the KDD Cup 2021. The PCQM4M-LSC dataset defines a molecular HOMO-LUMO property prediction task on about 3.8M graphs. In this short paper, we show our current work-in-progress solution which builds an ensemble of three graph neural networks models based on GIN, Bayesian Neural Networks and DiffPool. Our approach outperforms the provided baseline by 7.6%. Moreover, using uncertainty in our ensemble's prediction, we can identify molecules whose HOMO-LUMO gaps are harder to predict (with Pearson's correlation of 0.5181). We anticipate that this will facilitate active learning.
    A Convergent and Efficient Deep Q Network Algorithm. (arXiv:2106.15419v1 [cs.LG])
    (2 min) Despite the empirical success of the deep Q network (DQN) reinforcement learning algorithm and its variants, DQN is still not well understood and it does not guarantee convergence. In this work, we show that DQN can diverge and cease to operate in realistic settings. Although there exist gradient-based convergent methods, we show that they actually have inherent problems in learning behaviour and elucidate why they often fail in practice. To overcome these problems, we propose a convergent DQN algorithm (C-DQN) by carefully modifying DQN, and we show that the algorithm is convergent and can work with large discount factors (0.9998). It learns robustly in difficult settings and can learn several difficult games in the Atari 2600 benchmark where DQN fail, within a moderate computational budget. Our codes have been publicly released and can be used to reproduce our results.
    How to Reach Real-Time AI on Consumer Devices? Solutions for Programmable and Custom Architectures. (arXiv:2106.15021v1 [cs.LG])
    (2 min) The unprecedented performance of deep neural networks (DNNs) has led to large strides in various Artificial Intelligence (AI) inference tasks, such as object and speech recognition. Nevertheless, deploying such AI models across commodity devices faces significant challenges: large computational cost, multiple performance objectives, hardware heterogeneity and a common need for high accuracy, together pose critical problems to the deployment of DNNs across the various embedded and mobile devices in the wild. As such, we have yet to witness the mainstream usage of state-of-the-art deep learning algorithms across consumer devices. In this paper, we provide preliminary answers to this potentially game-changing question by presenting an array of design techniques for efficient AI systems. We start by examining the major roadblocks when targeting both programmable processors and custom accelerators. Then, we present diverse methods for achieving real-time performance following a cross-stack approach. These span model-, system- and hardware-level techniques, and their combination. Our findings provide illustrative examples of AI systems that do not overburden mobile hardware, while also indicating how they can improve inference accuracy. Moreover, we showcase how custom ASIC- and FPGA-based accelerators can be an enabling factor for next-generation AI applications, such as multi-DNN systems. Collectively, these results highlight the critical need for further exploration as to how the various cross-stack solutions can be best combined in order to bring the latest advances in deep learning close to users, in a robust and efficient manner.
    Effective Evaluation of Deep Active Learning on Image Classification Tasks. (arXiv:2106.15324v1 [cs.CV])
    (2 min) With the goal of making deep learning more label-efficient, a growing number of papers have been studying active learning (AL) for deep models. However, there are a number of issues in the prevalent experimental settings, mainly stemming from a lack of unified implementation and benchmarking. Issues in the current literature include sometimes contradictory observations on the performance of different AL algorithms, unintended exclusion of important generalization approaches such as data augmentation and SGD for optimization, a lack of study of evaluation facets like the labeling efficiency of AL, and little or no clarity on the scenarios in which AL outperforms random sampling (RS). In this work, we present a unified re-implementation of state-of-the-art AL algorithms in the context of image classification, and we carefully study these issues as facets of effective evaluation. On the positive side, we show that AL techniques are 2x to 4x more label-efficient compared to RS with the use of data augmentation. Surprisingly, when data augmentation is included, there is no longer a consistent gain in using BADGE, a state-of-the-art approach, over simple uncertainty sampling. We then do a careful analysis of how existing approaches perform with varying amounts of redundancy and number of examples per class. Finally, we provide several insights for AL practitioners to consider in future work, such as the effect of the AL batch size, the effect of initialization, the importance of retraining a new model at every round, and other insights.
    Uncertainty-Guided Progressive GANs for Medical Image Translation. (arXiv:2106.15542v1 [cs.CV])
    (2 min) Image-to-image translation plays a vital role in tackling various medical imaging tasks such as attenuation correction, motion correction, undersampled reconstruction, and denoising. Generative adversarial networks have been shown to achieve the state-of-the-art in generating high fidelity images for these tasks. However, the state-of-the-art GAN-based frameworks do not estimate the uncertainty in the predictions made by the network that is essential for making informed medical decisions and subsequent revision by medical experts and has recently been shown to improve the performance and interpretability of the model. In this work, we propose an uncertainty-guided progressive learning scheme for image-to-image translation. By incorporating aleatoric uncertainty as attention maps for GANs trained in a progressive manner, we generate images of increasing fidelity progressively. We demonstrate the efficacy of our model on three challenging medical image translation tasks, including PET to CT translation, undersampled MRI reconstruction, and MRI motion artefact correction. Our model generalizes well in three different tasks and improves performance over state of the art under full-supervision and weak-supervision with limited data. Code is released here: https://github.com/ExplainableML/UncerGuidedI2I
    Automated Evolutionary Approach for the Design of Composite Machine Learning Pipelines. (arXiv:2106.15397v1 [cs.LG])
    (2 min) The effectiveness of the machine learning methods for real-world tasks depends on the proper structure of the modeling pipeline. The proposed approach is aimed to automate the design of composite machine learning pipelines, which is equivalent to computation workflows that consist of models and data operations. The approach combines key ideas of both automated machine learning and workflow management systems. It designs the pipelines with a customizable graph-based structure, analyzes the obtained results, and reproduces them. The evolutionary approach is used for the flexible identification of pipeline structure. The additional algorithms for sensitivity analysis, atomization, and hyperparameter tuning are implemented to improve the effectiveness of the approach. Also, the software implementation on this approach is presented as an open-source framework. The set of experiments is conducted for the different datasets and tasks (classification, regression, time series forecasting). The obtained results confirm the correctness and effectiveness of the proposed approach in the comparison with the state-of-the-art competitors and baseline solutions.
    US Fatal Police Shooting Analysis and Prediction. (arXiv:2106.15298v1 [physics.soc-ph])
    (2 min) We believe that "all men are created equal". With the rise of the police shootings reported by media, more people in the U.S. think that police use excessive force during law enforcement, especially to a specific group of people. We want to apply multidimensional statistical analysis to reveal more facts than the monotone mainstream media. Our paper has three parts. First, we proposed a new method to quantify fatal police shooting news reporting deviation of mainstream media, which includes CNN, FOX, ABC, and NBC. Second, we analyzed the most comprehensive US fatal police shooting dataset from Washington Post. We used FP-growth to reveal the frequent patterns and DBSCAN clustering to find fatal shooting hotspots. We brought multi-attributes (social economics, demographics, political tendency, education, gun ownership rate, police training hours, etc.) to reveal connections under the iceberg. We found that the police shooting rate of a state depends on many variables. The top four most relevant attributes were state joined year, state land area, gun ownership rate, and violent crime rate. Third, we proposed four regression models to predict police shooting rates at the state level. The best model Kstar could predict the fatal police shooting rate with about 88.53% correlation coefficient. We also proposed classification models, including Gradient Boosting Machine, Multi-class Classifier, Logistic Regression, and Naive Bayes Classifier, to predict the race of fatal police shooting victims. Our classification models show no significant evidence to conclude that racial discrimination happened during fatal police shootings recorded by the WP dataset.
    SE-MD: A Single-encoder multiple-decoder deep network for point cloud generation from 2D images. (arXiv:2106.15325v1 [cs.CV])
    (2 min) 3D model generation from single 2D RGB images is a challenging and actively researched computer vision task. Various techniques using conventional network architectures have been proposed for the same. However, the body of research work is limited and there are various issues like using inefficient 3D representation formats, weak 3D model generation backbones, inability to generate dense point clouds, dependence of post-processing for generation of dense point clouds, and dependence on silhouettes in RGB images. In this paper, a novel 2D RGB image to point cloud conversion technique is proposed, which improves the state of art in the field due to its efficient, robust and simple model by using the concept of parallelization in network architecture. It not only uses the efficient and rich 3D representation of point clouds, but also uses a novel and robust point cloud generation backbone in order to address the prevalent issues. This involves using a single-encoder multiple-decoder deep network architecture wherein each decoder generates certain fixed viewpoints. This is followed by fusing all the viewpoints to generate a dense point cloud. Various experiments are conducted on the technique and its performance is compared with those of other state of the art techniques and impressive gains in performance are demonstrated. Code is available at https://github.com/mueedhafiz1982/
    ASAM: Adaptive Sharpness-Aware Minimization for Scale-Invariant Learning of Deep Neural Networks. (arXiv:2102.11600v3 [cs.LG] UPDATED)
    (2 min) Recently, learning algorithms motivated from sharpness of loss surface as an effective measure of generalization gap have shown state-of-the-art performances. Nevertheless, sharpness defined in a rigid region with a fixed radius, has a drawback in sensitivity to parameter re-scaling which leaves the loss unaffected, leading to weakening of the connection between sharpness and generalization gap. In this paper, we introduce the concept of adaptive sharpness which is scale-invariant and propose the corresponding generalization bound. We suggest a novel learning method, adaptive sharpness-aware minimization (ASAM), utilizing the proposed generalization bound. Experimental results in various benchmark datasets show that ASAM contributes to significant improvement of model generalization performance.
    Automated classification of plasma regions using 3D particle energy distributions. (arXiv:1908.05715v3 [physics.space-ph] UPDATED)
    (2 min) We investigate the properties of the ion sky maps produced by the Dual Ion Spectrometers (DIS) from the Fast Plasma Investigation (FPI). We have trained a convolutional neural network classifier to predict four regions crossed by the MMS on the dayside magnetosphere: solar wind, ion foreshock, magnetosheath, and magnetopause using solely DIS spectrograms. The accuracy of the classifier is >98%. We use the classifier to detect mixed plasma regions, in particular to find the bow shock regions. A similar approach can be used to identify the magnetopause crossings and reveal regions prone to magnetic reconnection. Data processing through the trained classifier is fast and efficient and thus can be used for classification for the whole MMS database.
    Towards Generalisable Deep Inertial Tracking via Geometry-Aware Learning. (arXiv:2106.15178v1 [cs.LG])
    (2 min) Autonomous navigation in uninstrumented and unprepared environments is a fundamental demand for next generation indoor and outdoor location-based services. To bring about such ambition, a suite of collaborative sensing modalities is required in order to sustain performance irrespective of challenging dynamic conditions. Of the many modalities on offer, inertial tracking plays a key role under momentary unfavourable operational conditions owing to its independence of the surrounding environment. However, inertial tracking has traditionally (i) suffered from excessive error growth and (ii) required extensive and cumbersome tuning. Both of these issues have limited the appeal and utility of inertial tracking. In this paper, we present DIT: a novel Deep learning Inertial Tracking system that overcomes prior limitations; namely, by (i) significantly reducing tracking drift and (ii) seamlessly constructing robust and generalisable learned models. DIT describes two core contributions: (i) DIT employs a robotic platform augmented with a mechanical slider subsystem that automatically samples inertial signal variabilities arising from different sensor mounting geometries. We use the platform to curate in-house a 7.2 million sample dataset covering an aggregate distance of 21 kilometres split into 11 indexed sensor mounting geometries. (ii) DIT uses deep learning, optimal transport, and domain adaptation (DA) to create a model which is robust to variabilities in sensor mounting geometry. The overall system synthesises high-performance and generalisable inertial navigation models in an end-to-end, robotic-learning fashion. In our evaluation, DIT outperforms an industrial-grade sensor fusion baseline by 10x (90th percentile) and a state-of-the-art adversarial DA technique by > 2.5x in performance (90th percentile) and >10x in training time.
    Security Analysis of Camera-LiDAR Semantic-Level Fusion Against Black-Box Attacks on Autonomous Vehicles. (arXiv:2106.07098v2 [cs.CR] UPDATED)
    (2 min) To enable safe and reliable decision-making, autonomous vehicles (AVs) feed sensor data to perception algorithms to understand the environment. Sensor fusion, and particularly semantic fusion, with multi-frame tracking is becoming increasingly popular for detecting 3D objects. Recently, it was shown that LiDAR-based perception built on deep neural networks is vulnerable to LiDAR spoofing attacks. Thus, in this work, we perform the first analysis of camera-LiDAR fusion under spoofing attacks and the first security analysis of semantic fusion in any AV context. We find first that fusion is more successful than existing defenses at guarding against naive spoofing. However, we then define the frustum attack as a new class of attacks on AVs and find that semantic camera-LiDAR fusion exhibits widespread vulnerability to frustum attacks with between 70% and 90% success against target models. Importantly, the attacker needs less than 20 random spoof points on average for successful attacks - an order of magnitude less than established maximum capability. Finally, we are the first to analyze the longitudinal impact of perception attacks by showing the impact of multi-frame attacks.
    Curious Explorer: a provable exploration strategy in Policy Learning. (arXiv:2106.15503v1 [cs.LG])
    (2 min) Having access to an exploring restart distribution (the so-called wide coverage assumption) is critical with policy gradient methods. This is due to the fact that, while the objective function is insensitive to updates in unlikely states, the agent may still need improvements in those states in order to reach a nearly optimal payoff. For this reason, wide coverage is used in some form when analyzing theoretical properties of practical policy gradient methods. However, this assumption can be unfeasible in certain environments, for instance when learning is online, or when restarts are possible only from a fixed initial state. In these cases, classical policy gradient algorithms can have very poor convergence properties and sample efficiency. In this paper, we develop Curious Explorer, a novel and simple iterative state space exploration strategy that can be used with any starting distribution $\rho$. Curious Explorer starts from $\rho$, then using intrinsic rewards assigned to the set of poorly visited states produces a sequence of policies, each one more exploratory than the previous one in an informed way, and finally outputs a restart model $\mu$ based on the state visitation distribution of the exploratory policies. Curious Explorer is provable, in the sense that we provide theoretical upper bounds on how often an optimal policy visits poorly visited states. These bounds can be used to prove PAC convergence and sample efficiency results when a PAC optimizer is plugged in Curious Explorer. This allows to achieve global convergence and sample efficiency results without any coverage assumption for REINFORCE, and potentially for any other policy gradient method ensuring PAC convergence with wide coverage. Finally, we plug (the output of) Curious Explorer into REINFORCE and TRPO, and show empirically that it can improve performance in MDPs with challenging exploration.
    The U-Net based GLOW for Optical-Flow-free Video Interframe Generation. (arXiv:2103.09576v3 [cs.CV] UPDATED)
    (2 min) Video frame interpolation is the task of creating an interframe between two adjacent frames along the time axis. So, instead of simply averaging two adjacent frames to create an intermediate image, this operation should maintain semantic continuity with the adjacent frames. Most conventional methods use optical flow, and various tools such as occlusion handling and object smoothing are indispensable. Since the use of these various tools leads to complex problems, we tried to tackle the video interframe generation problem without using problematic optical flow . To enable this , we have tried to use a deep neural network with an invertible structure, and developed an U-Net based Generative Flow which is a modified normalizing flow. In addition, we propose a learning method with a new consistency loss in the latent space to maintain semantic temporal consistency between frames. The resolution of the generated image is guaranteed to be identical to that of the original images by using an invertible network. Furthermore, as it is not a random image like the ones by generative models, our network guarantees stable outputs without flicker. Through experiments, we \sam {confirmed the feasibility of the proposed algorithm and would like to suggest the U-Net based Generative Flow as a new possibility for baseline in video frame interpolation. This paper is meaningful in that it is the world's first attempt to use invertible networks instead of optical flows for video interpolation.
    Unified Framework for Spectral Dimensionality Reduction, Maximum Variance Unfolding, and Kernel Learning By Semidefinite Programming: Tutorial and Survey. (arXiv:2106.15379v1 [stat.ML])
    (2 min) This is a tutorial and survey paper on unification of spectral dimensionality reduction methods, kernel learning by Semidefinite Programming (SDP), Maximum Variance Unfolding (MVU) or Semidefinite Embedding (SDE), and its variants. We first explain how the spectral dimensionality reduction methods can be unified as kernel Principal Component Analysis (PCA) with different kernels. This unification can be interpreted as eigenfunction learning or representation of kernel in terms of distance matrix. Then, since the spectral methods are unified as kernel PCA, we say let us learn the best kernel for unfolding the manifold of data to its maximum variance. We first briefly introduce kernel learning by SDP for the transduction task. Then, we explain MVU in detail. Various versions of supervised MVU using nearest neighbors graph, by class-wise unfolding, by Fisher criterion, and by colored MVU are explained. We also explain out-of-sample extension of MVU using eigenfunctions and kernel mapping. Finally, we introduce other variants of MVU including action respecting embedding, relaxed MVU, and landmark MVU for big data.
    TUCaN: Progressively Teaching Colourisation to Capsules. (arXiv:2106.15176v1 [cs.CV])
    (2 min) Automatic image colourisation is the computer vision research path that studies how to colourise greyscale images (for restoration). Deep learning techniques improved image colourisation yielding astonishing results. These differ by various factors, such as structural differences, input types, user assistance, etc. Most of them, base the architectural structure on convolutional layers with no emphasis on layers specialised in object features extraction. We introduce a novel downsampling upsampling architecture named TUCaN (Tiny UCapsNet) that exploits the collaboration of convolutional layers and capsule layers to obtain a neat colourisation of entities present in every single image. This is obtained by enforcing collaboration among such layers by skip and residual connections. We pose the problem as a per pixel colour classification task that identifies colours as a bin in a quantized space. To train the network, in contrast with the standard end to end learning method, we propose the progressive learning scheme to extract the context of objects by only manipulating the learning process without changing the model. In this scheme, the upsampling starts from the reconstruction of low resolution images and progressively grows to high resolution images throughout the training phase. Experimental results on three benchmark datasets show that our approach with ImageNet10k dataset outperforms existing methods on standard quality metrics and achieves state of the art performances on image colourisation. We performed a user study to quantify the perceptual realism of the colourisation results demonstrating: that progressive learning let the TUCaN achieve better colours than the end to end scheme; and pointing out the limitations of the existing evaluation metrics.
    Multiple Graph Learning for Scalable Multi-view Clustering. (arXiv:2106.15382v1 [cs.CV])
    (2 min) Graph-based multi-view clustering has become an active topic due to the efficiency in characterizing both the complex structure and relationship between multimedia data. However, existing methods have the following shortcomings: (1) They are inefficient or even fail for graph learning in large scale due to the graph construction and eigen-decomposition. (2) They cannot well exploit both the complementary information and spatial structure embedded in graphs of different views. To well exploit complementary information and tackle the scalability issue plaguing graph-based multi-view clustering, we propose an efficient multiple graph learning model via a small number of anchor points and tensor Schatten p-norm minimization. Specifically, we construct a hidden and tractable large graph by anchor graph for each view and well exploit complementary information embedded in anchor graphs of different views by tensor Schatten p-norm regularizer. Finally, we develop an efficient algorithm, which scales linearly with the data size, to solve our proposed model. Extensive experimental results on several datasets indicate that our proposed method outperforms some state-of-the-art multi-view clustering algorithms.
    Regularized OFU: an Efficient UCB Estimator forNon-linear Contextual Bandit. (arXiv:2106.15128v1 [cs.LG])
    (2 min) Balancing exploration and exploitation (EE) is a fundamental problem in contex-tual bandit. One powerful principle for EE trade-off isOptimism in Face of Uncer-tainty(OFU), in which the agent takes the action according to an upper confidencebound (UCB) of reward. OFU has achieved (near-)optimal regret bound for lin-ear/kernel contextual bandits. However, it is in general unknown how to deriveefficient and effective EE trade-off methods for non-linearcomplex tasks, suchas contextual bandit with deep neural network as the reward function. In thispaper, we propose a novel OFU algorithm namedregularized OFU(ROFU). InROFU, we measure the uncertainty of the reward by a differentiable function andcompute the upper confidence bound by solving a regularized optimization prob-lem. We prove that, for multi-armed bandit, kernel contextual bandit and neuraltangent kernel bandit, ROFU achieves (near-)optimal regret bounds with certainuncertainty measure, which theoretically justifies its effectiveness on EE trade-off.Importantly, ROFU admits a very efficient implementation with gradient-basedoptimizer, which easily extends to general deep neural network models beyondneural tangent kernel, in sharp contrast with previous OFU methods. The em-pirical evaluation demonstrates that ROFU works extremelywell for contextualbandits under various settings.
    Interactive Dimensionality Reduction for Comparative Analysis. (arXiv:2106.15481v1 [cs.LG])
    (2 min) Finding the similarities and differences between two or more groups of datasets is a fundamental analysis task. For high-dimensional data, dimensionality reduction (DR) methods are often used to find the characteristics of each group. However, existing DR methods provide limited capability and flexibility for such comparative analysis as each method is designed only for a narrow analysis target, such as identifying factors that most differentiate groups. In this work, we introduce an interactive DR framework where we integrate our new DR method, called ULCA (unified linear comparative analysis), with an interactive visual interface. ULCA unifies two DR schemes, discriminant analysis and contrastive learning, to support various comparative analysis tasks. To provide flexibility for comparative analysis, we develop an optimization algorithm that enables analysts to interactively refine ULCA results. Additionally, we provide an interactive visualization interface to examine ULCA results with a rich set of analysis libraries. We evaluate ULCA and the optimization algorithm to show their efficiency as well as present multiple case studies using real-world datasets to demonstrate the usefulness of our framework.
    Open-Set Representation Learning through Combinatorial Embedding. (arXiv:2106.15278v1 [cs.CV])
    (2 min) Visual recognition tasks are often limited to dealing with a small subset of classes simply because the labels for the remaining classes are unavailable. We are interested in identifying novel concepts in a dataset through representation learning based on the examples in both labeled and unlabeled classes, and extending the horizon of recognition to both known and novel classes. To address this challenging task, we propose a combinatorial learning approach, which naturally clusters the examples in unseen classes using the compositional knowledge given by multiple supervised meta-classifiers on heterogeneous label spaces. We also introduce a metric learning strategy to estimate pairwise pseudo-labels for improving representations of unlabeled examples, which preserves semantic relations across known and novel classes effectively. The proposed algorithm discovers novel concepts via a joint optimization of enhancing the discrimitiveness of unseen classes as well as learning the representations of known classes generalizable to novel ones. Our extensive experiments demonstrate remarkable performance gains by the proposed approach in multiple image retrieval and novel class discovery benchmarks.
    Characterization of the Variation Spaces Corresponding to Shallow Neural Networks. (arXiv:2106.15002v1 [stat.ML])
    (2 min) We consider the variation space corresponding to a dictionary of functions in $L^2(\Omega)$ and present the basic theory of approximation in these spaces. Specifically, we compare the definition based on integral representations with the definition in terms of convex hulls. We show that in many cases, including the dictionaries corresponding to shallow ReLU$^k$ networks and a dictionary of decaying Fourier modes, that the two definitions coincide. We also give a partial characterization of the variation space for shallow ReLU$^k$ networks and show that the variation space with respect to the dictionary of decaying Fourier modes corresponds to the Barron spectral space.
    Reinforcement Learning of Implicit and Explicit Control Flow in Instructions. (arXiv:2102.13195v2 [cs.LG] UPDATED)
    (2 min) Learning to flexibly follow task instructions in dynamic environments poses interesting challenges for reinforcement learning agents. We focus here on the problem of learning control flow that deviates from a strict step-by-step execution of instructions -- that is, control flow that may skip forward over parts of the instructions or return backward to previously completed or skipped steps. Demand for such flexible control arises in two fundamental ways: explicitly when control is specified in the instructions themselves (such as conditional branching and looping) and implicitly when stochastic environment dynamics require re-completion of instructions whose effects have been perturbed, or opportunistic skipping of instructions whose effects are already present. We formulate an attention-based architecture that meets these challenges by learning, from task reward only, to flexibly attend to and condition behavior on an internal encoding of the instructions. We test the architecture's ability to learn both explicit and implicit control in two illustrative domains -- one inspired by Minecraft and the other by StarCraft -- and show that the architecture exhibits zero-shot generalization to novel instructions of length greater than those in a training set, at a performance level unmatched by two baseline recurrent architectures and one ablation architecture.
    Relational Graph Learning on Visual and Kinematics Embeddings for Accurate Gesture Recognition in Robotic Surgery. (arXiv:2011.01619v2 [cs.CV] UPDATED)
    (2 min) Automatic surgical gesture recognition is fundamentally important to enable intelligent cognitive assistance in robotic surgery. With recent advancement in robot-assisted minimally invasive surgery, rich information including surgical videos and robotic kinematics can be recorded, which provide complementary knowledge for understanding surgical gestures. However, existing methods either solely adopt uni-modal data or directly concatenate multi-modal representations, which can not sufficiently exploit the informative correlations inherent in visual and kinematics data to boost gesture recognition accuracies. In this regard, we propose a novel online approach of multi-modal relational graph network (i.e., MRG-Net) to dynamically integrate visual and kinematics information through interactive message propagation in the latent feature space. In specific, we first extract embeddings from video and kinematics sequences with temporal convolutional networks and LSTM units. Next, we identify multi-relations in these multi-modal embeddings and leverage them through a hierarchical relational graph learning module. The effectiveness of our method is demonstrated with state-of-the-art results on the public JIGSAWS dataset, outperforming current uni-modal and multi-modal methods on both suturing and knot typing tasks. Furthermore, we validated our method on in-house visual-kinematics datasets collected with da Vinci Research Kit (dVRK) platforms in two centers, with consistent promising performance achieved.
    INN: A Method Identifying Clean-annotated Samples via Consistency Effect in Deep Neural Networks. (arXiv:2106.15185v1 [cs.LG])
    (2 min) In many classification problems, collecting massive clean-annotated data is not easy, and thus a lot of researches have been done to handle data with noisy labels. Most recent state-of-art solutions for noisy label problems are built on the small-loss strategy which exploits the memorization effect. While it is a powerful tool, the memorization effect has several drawbacks. The performances are sensitive to the choice of a training epoch required for utilizing the memorization effect. In addition, when the labels are heavily contaminated or imbalanced, the memorization effect may not occur in which case the methods based on the small-loss strategy fail to identify clean labeled data. We introduce a new method called INN(Integration with the Nearest Neighborhoods) to refine clean labeled data from training data with noisy labels. The proposed method is based on a new discovery that a prediction pattern at neighbor regions of clean labeled data is consistently different from that of noisy labeled data regardless of training epochs. The INN method requires more computation but is much stable and powerful than the small-loss strategy. By carrying out various experiments, we demonstrate that the INN method resolves the shortcomings in the memorization effect successfully and thus is helpful to construct more accurate deep prediction models with training data with noisy labels.
    Multimodal Approaches for Indoor Localization for Ambient Assisted Living in Smart Homes. (arXiv:2106.15606v1 [cs.HC])
    (2 min) This work makes multiple scientific contributions to the field of Indoor Localization for Ambient Assisted Living in Smart Homes. First, it presents a Big-Data driven methodology that studies the multimodal components of user interactions and analyzes the data from Bluetooth Low Energy (BLE) beacons and BLE scanners to detect a user's indoor location in a specific activity-based zone during Activities of Daily Living. Second, it introduces a context independent approach that can interpret the accelerometer and gyroscope data from diverse behavioral patterns to detect the zone-based indoor location of a user in any Internet of Things (IoT)-based environment. These two approaches achieved performance accuracies of 81.36% and 81.13%, respectively, when tested on a dataset. Third, it presents a methodology to detect the spatial coordinates of a user's indoor position that outperforms all similar works in this field, as per the associated root mean squared error - one of the performance evaluation metrics in ISO/IEC18305:2016- an international standard for testing Localization and Tracking Systems. Finally, it presents a comprehensive comparative study that includes Random Forest, Artificial Neural Network, Decision Tree, Support Vector Machine, k-NN, Gradient Boosted Trees, Deep Learning, and Linear Regression, to address the challenge of identifying the optimal machine learning approach for Indoor Localization.
    Towards Understanding the Effectiveness of Attention Mechanism. (arXiv:2106.15067v1 [cs.CV])
    (2 min) Attention Mechanism is a widely used method for improving the performance of convolutional neural networks (CNNs) on computer vision tasks. Despite its pervasiveness, we have a poor understanding of what its effectiveness stems from. It is popularly believed that its effectiveness stems from the visual attention explanation, advocating focusing on the important part of input data rather than ingesting the entire input. In this paper, we find that there is only a weak consistency between the attention weights of features and their importance. Instead, we verify the crucial role of feature map multiplication in attention mechanism and uncover a fundamental impact of feature map multiplication on the learned landscapes of CNNs: with the high order non-linearity brought by the feature map multiplication, it played a regularization role on CNNs, which made them learn smoother and more stable landscapes near real samples compared to vanilla CNNs. This smoothness and stability induce a more predictive and stable behavior in-between real samples, and make CNNs generate better. Moreover, motivated by the proposed effectiveness of feature map multiplication, we design feature map multiplication network (FMMNet) by simply replacing the feature map addition in ResNet with feature map multiplication. FMMNet outperforms ResNet on various datasets, and this indicates that feature map multiplication plays a vital role in improving the performance even without finely designed attention mechanism in existing methods.
    GraphAnoGAN: Detecting Anomalous Snapshots from Attributed Graphs. (arXiv:2106.15504v1 [cs.LG])
    (2 min) Finding anomalous snapshots from a graph has garnered huge attention recently. Existing studies address the problem using shallow learning mechanisms such as subspace selection, ego-network, or community analysis. These models do not take into account the multifaceted interactions between the structure and attributes in the network. In this paper, we propose GraphAnoGAN, an anomalous snapshot ranking framework, which consists of two core components -- generative and discriminative models. Specifically, the generative model learns to approximate the distribution of anomalous samples from the candidate set of graph snapshots, and the discriminative model detects whether the sampled snapshot is from the ground-truth or not. Experiments on 4 real-world networks show that GraphAnoGAN outperforms 6 baselines with a significant margin (28.29% and 22.01% higher precision and recall, respectively compared to the best baseline, averaged across all datasets).
    A Mechanism for Producing Aligned Latent Spaces with Autoencoders. (arXiv:2106.15456v1 [cs.LG])
    (2 min) Aligned latent spaces, where meaningful semantic shifts in the input space correspond to a translation in the embedding space, play an important role in the success of downstream tasks such as unsupervised clustering and data imputation. In this work, we prove that linear and nonlinear autoencoders produce aligned latent spaces by stretching along the left singular vectors of the data. We fully characterize the amount of stretching in linear autoencoders and provide an initialization scheme to arbitrarily stretch along the top directions using these networks. We also quantify the amount of stretching in nonlinear autoencoders in a simplified setting. We use our theoretical results to align drug signatures across cell types in gene expression space and semantic shifts in word embedding spaces.
    Self-Contrastive Learning. (arXiv:2106.15499v1 [cs.LG])
    (2 min) This paper proposes a novel contrastive learning framework, coined as Self-Contrastive (SelfCon) Learning, that self-contrasts within multiple outputs from the different levels of a network. We confirmed that SelfCon loss guarantees the lower bound of mutual information (MI) between the intermediate and last representations. Besides, we empirically showed, via various MI estimators, that SelfCon loss highly correlates to the increase of MI and better classification performance. In our experiments, SelfCon surpasses supervised contrastive (SupCon) learning without the need for a multi-viewed batch and with the cheaper computational cost. Especially on ResNet-18, we achieved top-1 classification accuracy of 76.45% for the CIFAR-100 dataset, which is 2.87% and 4.36% higher than SupCon and cross-entropy loss, respectively. We found that mitigating both vanishing gradient and overfitting issue makes our method outperform the counterparts.
    Explaining the Performance of Multi-label Classification Methods with Data Set Properties. (arXiv:2106.15411v1 [cs.LG])
    (2 min) Meta learning generalizes the empirical experience with different learning tasks and holds promise for providing important empirical insight into the behaviour of machine learning algorithms. In this paper, we present a comprehensive meta-learning study of data sets and methods for multi-label classification (MLC). MLC is a practically relevant machine learning task where each example is labelled with multiple labels simultaneously. Here, we analyze 40 MLC data sets by using 50 meta features describing different properties of the data. The main findings of this study are as follows. First, the most prominent meta features that describe the space of MLC data sets are the ones assessing different aspects of the label space. Second, the meta models show that the most important meta features describe the label space, and, the meta features describing the relationships among the labels tend to occur a bit more often than the meta features describing the distributions between and within the individual labels. Third, the optimization of the hyperparameters can improve the predictive performance, however, quite often the extent of the improvements does not always justify the resource utilization.
    MAML is a Noisy Contrastive Learner. (arXiv:2106.15367v1 [cs.LG])
    (2 min) Model-agnostic meta-learning (MAML) is one of the most popular and widely-adopted meta-learning algorithms nowadays, which achieves remarkable success in various learning problems. Yet, with the unique design of nested inner-loop and outer-loop updates which respectively govern the task-specific and meta-model-centric learning, the underlying learning objective of MAML still remains implicit and thus impedes a more straightforward understanding of it. In this paper, we provide a new perspective to the working mechanism of MAML and discover that: MAML is analogous to a meta-learner using a supervised contrastive objective function, where the query features are pulled towards the support features of the same class and against those of different classes, in which such contrastiveness is experimentally verified via an analysis based on the cosine similarity. Moreover, our analysis reveals that the vanilla MAML algorithm has an undesirable interference term originating from the random initialization and the cross-task interaction. We therefore propose a simple but effective technique, zeroing trick, to alleviate such interference, where the extensive experiments are then conducted on both miniImagenet and Omniglot datasets to demonstrate the consistent improvement brought by our proposed technique thus well validating its effectiveness.
    A Bytecode-based Approach for Smart Contract Classification. (arXiv:2106.15497v1 [cs.IR])
    (2 min) With the development of blockchain technologies, the number of smart contracts deployed on blockchain platforms is growing exponentially, which makes it difficult for users to find desired services by manual screening. The automatic classification of smart contracts can provide blockchain users with keyword-based contract searching and helps to manage smart contracts effectively. Current research on smart contract classification focuses on Natural Language Processing (NLP) solutions which are based on contract source code. However, more than 94% of smart contracts are not open-source, so the application scenarios of NLP methods are very limited. Meanwhile, NLP models are vulnerable to adversarial attacks. This paper proposes a classification model based on features from contract bytecode instead of source code to solve these problems. We also use feature selection and ensemble learning to optimize the model. Our experimental studies on over 3,300 real-world Ethereum smart contracts show that our model can classify smart contracts without source code and has better performance than baseline models. Our model also has good resistance to adversarial attacks compared with NLP-based models. In addition, our analysis reveals that account features used in many smart contract classification models have little effect on classification and can be excluded.
    Detecting Cattle and Elk in the Wild from Space. (arXiv:2106.15448v1 [cs.CV])
    (2 min) Localizing and counting large ungulates -- hoofed mammals like cows and elk -- in very high-resolution satellite imagery is an important task for supporting ecological studies. Prior work has shown that this is feasible with deep learning based methods and sub-meter multi-spectral satellite imagery. We extend this line of work by proposing a baseline method, CowNet, that simultaneously estimates the number of animals in an image (counts), as well as predicts their location at a pixel level (localizes). We also propose an methodology for evaluating such models on counting and localization tasks across large scenes that takes the uncertainty of noisy labels and the information needed by stakeholders in ecological monitoring tasks into account. Finally, we benchmark our baseline method with state of the art vision methods for counting objects in scenes. We specifically test the temporal generalization of the resulting models over a large landscape in Point Reyes Seashore, CA. We find that the LC-FCN model performs the best and achieves an average precision between 0.56 and 0.61 and an average recall between 0.78 and 0.92 over three held out test scenes.
    Multi-stage Optimization based Adversarial Training. (arXiv:2106.15357v1 [cs.LG])
    (2 min) In the field of adversarial robustness, there is a common practice that adopts the single-step adversarial training for quickly developing adversarially robust models. However, the single-step adversarial training is most likely to cause catastrophic overfitting, as after a few training epochs it will be hard to generate strong adversarial examples to continuously boost the adversarial robustness. In this work, we aim to avoid the catastrophic overfitting by introducing multi-step adversarial examples during the single-step adversarial training. Then, to balance the large training overhead of generating multi-step adversarial examples, we propose a Multi-stage Optimization based Adversarial Training (MOAT) method that periodically trains the model on mixed benign examples, single-step adversarial examples, and multi-step adversarial examples stage by stage. In this way, the overall training overhead is reduced significantly, meanwhile, the model could avoid catastrophic overfitting. Extensive experiments on CIFAR-10 and CIFAR-100 datasets demonstrate that under similar amount of training overhead, the proposed MOAT exhibits better robustness than either single-step or multi-step adversarial training methods.
    Fast Approximation of the Sliced-Wasserstein Distance Using Concentration of Random Projections. (arXiv:2106.15427v1 [stat.ML])
    (2 min) The Sliced-Wasserstein distance (SW) is being increasingly used in machine learning applications as an alternative to the Wasserstein distance and offers significant computational and statistical benefits. Since it is defined as an expectation over random projections, SW is commonly approximated by Monte Carlo. We adopt a new perspective to approximate SW by making use of the concentration of measure phenomenon: under mild assumptions, one-dimensional projections of a high-dimensional random vector are approximately Gaussian. Based on this observation, we develop a simple deterministic approximation for SW. Our method does not require sampling a number of random projections, and is therefore both accurate and easy to use compared to the usual Monte Carlo approximation. We derive nonasymptotical guarantees for our approach, and show that the approximation error goes to zero as the dimension increases, under a weak dependence condition on the data distribution. We validate our theoretical findings on synthetic datasets, and illustrate the proposed approximation on a generative modeling problem.
    DRILL-- Deep Reinforcement Learning for Refinement Operators in $\mathcal{ALC}$. (arXiv:2106.15373v1 [cs.AI])
    (2 min) Approaches based on refinement operators have been successfully applied to class expression learning on RDF knowledge graphs. These approaches often need to explore a large number of concepts to find adequate hypotheses. This need arguably stems from current approaches relying on myopic heuristic functions to guide their search through an infinite concept space. In turn, deep reinforcement learning provides effective means to address myopia by estimating how much discounted cumulated future reward states promise. In this work, we leverage deep reinforcement learning to accelerate the learning of concepts in $\mathcal{ALC}$ by proposing DRILL -- a novel class expression learning approach that uses a convolutional deep Q-learning model to steer its search. By virtue of its architecture, DRILL is able to compute the expected discounted cumulated future reward of more than $10^3$ class expressions in a second on standard hardware. We evaluate DRILL on four benchmark datasets against state-of-the-art approaches. Our results suggest that DRILL converges to goal states at least 2.7$\times$ faster than state-of-the-art models on all benchmark datasets. We provide an open-source implementation of our approach, including training and evaluation scripts as well as pre-trained models.
    Scalable Gaussian Processes for Data-Driven Design using Big Data with Categorical Factors. (arXiv:2106.15356v1 [cs.LG])
    (2 min) Scientific and engineering problems often require the use of artificial intelligence to aid understanding and the search for promising designs. While Gaussian processes (GP) stand out as easy-to-use and interpretable learners, they have difficulties in accommodating big datasets, categorical inputs, and multiple responses, which has become a common challenge for a growing number of data-driven design applications. In this paper, we propose a GP model that utilizes latent variables and functions obtained through variational inference to address the aforementioned challenges simultaneously. The method is built upon the latent variable Gaussian process (LVGP) model where categorical factors are mapped into a continuous latent space to enable GP modeling of mixed-variable datasets. By extending variational inference to LVGP models, the large training dataset is replaced by a small set of inducing points to address the scalability issue. Output response vectors are represented by a linear combination of independent latent functions, forming a flexible kernel structure to handle multiple responses that might have distinct behaviors. Comparative studies demonstrate that the proposed method scales well for large datasets with over 10^4 data points, while outperforming state-of-the-art machine learning methods without requiring much hyperparameter tuning. In addition, an interpretable latent space is obtained to draw insights into the effect of categorical factors, such as those associated with building blocks of architectures and element choices in metamaterial and materials design. Our approach is demonstrated for machine learning of ternary oxide materials and topology optimization of a multiscale compliant mechanism with aperiodic microstructures and multiple materials.
    Differential Privacy for Credit Risk Model. (arXiv:2106.15343v1 [cs.CR])
    (2 min) The use of machine learning algorithms to model user behavior and drive business decisions has become increasingly commonplace, specifically providing intelligent recommendations to automated decision making. This has led to an increase in the use of customers personal data to analyze customer behavior and predict their interests in a companys products. Increased use of this customer personal data can lead to better models but also to the potential of customer data being leaked, reverse engineered, and mishandled. In this paper, we assess differential privacy as a solution to address these privacy problems by building privacy protections into the data engineering and model training stages of predictive model development. Our interest is a pragmatic implementation in an operational environment, which necessitates a general purpose differentially private modeling framework, and we evaluate one such tool from LeapYear as applied to the Credit Risk modeling domain. Credit Risk Model is a major modeling methodology in banking and finance where user data is analyzed to determine the total Expected Loss to the bank. We examine the application of differential privacy on the credit risk model and evaluate the performance of a Differentially Private Model with a Non Differentially Private Model. Credit Risk Model is a major modeling methodology in banking and finance where users data is analyzed to determine the total Expected Loss to the bank. In this paper, we explore the application of differential privacy on the credit risk model and evaluate the performance of a Non Differentially Private Model with Differentially Private Model.
    High-dimensional separability for one- and few-shot learning. (arXiv:2106.15416v1 [cs.LG])
    (2 min) This work is driven by a practical question, corrections of Artificial Intelligence (AI) errors. Systematic re-training of a large AI system is hardly possible. To solve this problem, special external devices, correctors, are developed. They should provide quick and non-iterative system fix without modification of a legacy AI system. A common universal part of the AI corrector is a classifier that should separate undesired and erroneous behavior from normal operation. Training of such classifiers is a grand challenge at the heart of the one- and few-shot learning methods. Effectiveness of one- and few-short methods is based on either significant dimensionality reductions or the blessing of dimensionality effects. Stochastic separability is a blessing of dimensionality phenomenon that allows one-and few-shot error correction: in high-dimensional datasets under broad assumptions each point can be separated from the rest of the set by simple and robust linear discriminant. The hierarchical structure of data universe is introduced where each data cluster has a granular internal structure, etc. New stochastic separation theorems for the data distributions with fine-grained structure are formulated and proved. Separation theorems in infinite-dimensional limits are proven under assumptions of compact embedding of patterns into data space. New multi-correctors of AI systems are presented and illustrated with examples of predicting errors and learning new classes of objects by a deep convolutional neural network.
    Attentive Neural Processes and Batch Bayesian Optimization for Scalable Calibration of Physics-Informed Digital Twins. (arXiv:2106.15502v1 [cs.LG])
    (2 min) Physics-informed dynamical system models form critical components of digital twins of the built environment. These digital twins enable the design of energy-efficient infrastructure, but must be properly calibrated to accurately reflect system behavior for downstream prediction and analysis. Dynamical system models of modern buildings are typically described by a large number of parameters and incur significant computational expenditure during simulations. To handle large-scale calibration of digital twins without exorbitant simulations, we propose ANP-BBO: a scalable and parallelizable batch-wise Bayesian optimization (BBO) methodology that leverages attentive neural processes (ANPs).
    Personalized Federated Learning with Gaussian Processes. (arXiv:2106.15482v1 [cs.LG])
    (2 min) Federated learning aims to learn a global model that performs well on client devices with limited cross-client communication. Personalized federated learning (PFL) further extends this setup to handle data heterogeneity between clients by learning personalized models. A key challenge in this setting is to learn effectively across clients even though each client has unique data that is often limited in size. Here we present pFedGP, a solution to PFL that is based on Gaussian processes (GPs) with deep kernel learning. GPs are highly expressive models that work well in the low data regime due to their Bayesian nature. However, applying GPs to PFL raises multiple challenges. Mainly, GPs performance depends heavily on access to a good kernel function, and learning a kernel requires a large training set. Therefore, we propose learning a shared kernel function across all clients, parameterized by a neural network, with a personal GP classifier for each client. We further extend pFedGP to include inducing points using two novel methods, the first helps to improve generalization in the low data regime and the second reduces the computational cost. We derive a PAC-Bayes generalization bound on novel clients and empirically show that it gives non-vacuous guarantees. Extensive experiments on standard PFL benchmarks with CIFAR-10, CIFAR-100, and CINIC-10, and on a new setup of learning under input noise show that pFedGP achieves well-calibrated predictions while significantly outperforming baseline methods, reaching up to 21% in accuracy gain.
    Zoo-Tuning: Adaptive Transfer from a Zoo of Models. (arXiv:2106.15434v1 [cs.LG])
    (2 min) With the development of deep networks on various large-scale datasets, a large zoo of pretrained models are available. When transferring from a model zoo, applying classic single-model based transfer learning methods to each source model suffers from high computational burden and cannot fully utilize the rich knowledge in the zoo. We propose \emph{Zoo-Tuning} to address these challenges, which learns to adaptively transfer the parameters of pretrained models to the target task. With the learnable channel alignment layer and adaptive aggregation layer, Zoo-Tuning \emph{adaptively aggregates channel aligned pretrained parameters} to derive the target model, which promotes knowledge transfer by simultaneously adapting multiple source models to downstream tasks. The adaptive aggregation substantially reduces the computation cost at both training and inference. We further propose lite Zoo-Tuning with the temporal ensemble of batch average gating values to reduce the storage cost at the inference time. We evaluate our approach on a variety of tasks, including reinforcement learning, image classification, and facial landmark detection. Experiment results demonstrate that the proposed adaptive transfer learning approach can transfer knowledge from a zoo of models more effectively and efficiently.
    VolterraNet: A higher order convolutional network with group equivariance for homogeneous manifolds. (arXiv:2106.15301v1 [cs.CV])
    (2 min) Convolutional neural networks have been highly successful in image-based learning tasks due to their translation equivariance property. Recent work has generalized the traditional convolutional layer of a convolutional neural network to non-Euclidean spaces and shown group equivariance of the generalized convolution operation. In this paper, we present a novel higher order Volterra convolutional neural network (VolterraNet) for data defined as samples of functions on Riemannian homogeneous spaces. Analagous to the result for traditional convolutions, we prove that the Volterra functional convolutions are equivariant to the action of the isometry group admitted by the Riemannian homogeneous spaces, and under some restrictions, any non-linear equivariant function can be expressed as our homogeneous space Volterra convolution, generalizing the non-linear shift equivariant characterization of Volterra expansions in Euclidean space. We also prove that second order functional convolution operations can be represented as cascaded convolutions which leads to an efficient implementation. Beyond this, we also propose a dilated VolterraNet model. These advances lead to large parameter reductions relative to baseline non-Euclidean CNNs. To demonstrate the efficacy of the VolterraNet performance, we present several real data experiments involving classification tasks on spherical-MNIST, atomic energy, Shrec17 data sets, and group testing on diffusion MRI data. Performance comparisons to the state-of-the-art are also presented.
    Convolutional Sparse Coding Fast Approximation with Application to Seismic Reflectivity Estimation. (arXiv:2106.15296v1 [cs.LG])
    (2 min) In sparse coding, we attempt to extract features of input vectors, assuming that the data is inherently structured as a sparse superposition of basic building blocks. Similarly, neural networks perform a given task by learning features of the training data set. Recently both data-driven and model-driven feature extracting methods have become extremely popular and have achieved remarkable results. Nevertheless, practical implementations are often too slow to be employed in real-life scenarios, especially for real-time applications. We propose a speed-up upgraded version of the classic iterative thresholding algorithm, that produces a good approximation of the convolutional sparse code within 2-5 iterations. The speed advantage is gained mostly from the observation that most solvers are slowed down by inefficient global thresholding. The main idea is to normalize each data point by the local receptive field energy, before applying a threshold. This way, the natural inclination towards strong feature expressions is suppressed, so that one can rely on a global threshold that can be easily approximated, or learned during training. The proposed algorithm can be employed with a known predetermined dictionary, or with a trained dictionary. The trained version is implemented as a neural net designed as the unfolding of the proposed solver. The performance of the proposed solution is demonstrated via the seismic inversion problem in both synthetic and real data scenarios. We also provide theoretical guarantees for a stable support recovery. Namely, we prove that under certain conditions the true support is perfectly recovered within the first iteration.
    Short-Term Load Forecasting for Smart HomeAppliances with Sequence to Sequence Learning. (arXiv:2106.15348v1 [eess.SP])
    (2 min) Appliance-level load forecasting plays a critical role in residential energy management, besides having significant importance for ancillary services performed by the utilities. In this paper, we propose to use an LSTM-based sequence-to-sequence (seq2seq) learning model that can capture the load profiles of appliances. We use a real dataset collected fromfour residential buildings and compare our proposed schemewith three other techniques, namely VARMA, Dilated One Dimensional Convolutional Neural Network, and an LSTM model.The results show that the proposed LSTM-based seq2seq model outperforms other techniques in terms of prediction error in most cases.
    Federated Learning for Intrusion Detection in IoT Security: A Hybrid Ensemble Approach. (arXiv:2106.15349v1 [cs.CR])
    (2 min) Critical role of Internet of Things (IoT) in various domains like smart city, healthcare, supply chain and transportation has made them the target of malicious attacks. Past works in this area focused on centralized Intrusion Detection System (IDS), assuming the existence of a central entity to perform data analysis and identify threats. However, such IDS may not always be feasible, mainly due to spread of data across multiple sources and gathering at central node can be costly. Also, the earlier works primarily focused on improving True Positive Rate (TPR) and ignored the False Positive Rate (FPR), which is also essential to avoid unnecessary downtime of the systems. In this paper, we first present an architecture for IDS based on hybrid ensemble model, named PHEC, which gives improved performance compared to state-of-the-art architectures. We then adapt this model to a federated learning framework that performs local training and aggregates only the model parameters. Next, we propose Noise-Tolerant PHEC in centralized and federated settings to address the label-noise problem. The proposed idea uses classifiers using weighted convex surrogate loss functions. Natural robustness of KNN classifier towards noisy data is also used in the proposed architecture. Experimental results on four benchmark datasets drawn from various security attacks show that our model achieves high TPR while keeping FPR low on noisy and clean data. Further, they also demonstrate that the hybrid ensemble models achieve performance in federated settings close to that of the centralized settings.
    On exploring practical potentials of quantum auto-encoder with advantages. (arXiv:2106.15432v1 [quant-ph])
    (2 min) Quantum auto-encoder (QAE) is a powerful tool to relieve the curse of dimensionality encountered in quantum physics, celebrated by the ability to extract low-dimensional patterns from quantum states living in the high-dimensional space. Despite its attractive properties, little is known about the practical applications of QAE with provable advantages. To address these issues, here we prove that QAE can be used to efficiently calculate the eigenvalues and prepare the corresponding eigenvectors of a high-dimensional quantum state with the low-rank property. With this regard, we devise three effective QAE-based learning protocols to solve the low-rank state fidelity estimation, the quantum Gibbs state preparation, and the quantum metrology tasks, respectively. Notably, all of these protocols are scalable and can be readily executed on near-term quantum machines. Moreover, we prove that the error bounds of the proposed QAE-based methods outperform those in previous literature. Numerical simulations collaborate with our theoretical analysis. Our work opens a new avenue of utilizing QAE to tackle various quantum physics and quantum information processing problems in a scalable way.
    Where is the disease? Semi-supervised pseudo-normality synthesis from an abnormal image. (arXiv:2106.15345v1 [cs.CV])
    (2 min) Pseudo-normality synthesis, which computationally generates a pseudo-normal image from an abnormal one (e.g., with lesions), is critical in many perspectives, from lesion detection, data augmentation to clinical surgery suggestion. However, it is challenging to generate high-quality pseudo-normal images in the absence of the lesion information. Thus, expensive lesion segmentation data have been introduced to provide lesion information for the generative models and improve the quality of the synthetic images. In this paper, we aim to alleviate the need of a large amount of lesion segmentation data when generating pseudo-normal images. We propose a Semi-supervised Medical Image generative LEarning network (SMILE) which not only utilizes limited medical images with segmentation masks, but also leverages massive medical images without segmentation masks to generate realistic pseudo-normal images. Extensive experiments show that our model outperforms the best state-of-the-art model by up to 6% for data augmentation task and 3% in generating high-quality images. Moreover, the proposed semi-supervised learning achieves comparable medical image synthesis quality with supervised learning model, using only 50 of segmentation data.
    LB-CNN: An Open Source Framework for Fast Training of Light Binary Convolutional Neural Networks using Chainer and Cupy. (arXiv:2106.15350v1 [cs.LG])
    (2 min) Light binary convolutional neural networks (LB-CNN) are particularly useful when implemented in low-energy computing platforms as required in many industrial applications. Herein, a framework for optimizing compact LB-CNN is introduced and its effectiveness is evaluated. The framework is freely available and may run on free-access cloud platforms, thus requiring no major investments. The optimized model is saved in the standardized .h5 format and can be used as input to specialized tools for further deployment into specific technologies, thus enabling the rapid development of various intelligent image sensors. The main ingredient in accelerating the optimization of our model, particularly the selection of binary convolution kernels, is the Chainer/Cupy machine learning library offering significant speed-ups for training the output layer as an extreme-learning machine. Additional training of the output layer using Keras/Tensorflow is included, as it allows an increase in accuracy. Results for widely used datasets including MNIST, GTSRB, ORL, VGG show very good compromise between accuracy and complexity. Particularly, for face recognition problems a carefully optimized LB-CNN model provides up to 100% accuracies. Such TinyML solutions are well suited for industrial applications requiring image recognition with low energy consumption.
    Reliable and Fast Recurrent Neural Network Architecture Optimization. (arXiv:2106.15295v1 [cs.NE])
    (2 min) This article introduces Random Error Sampling-based Neuroevolution (RESN), a novel automatic method to optimize recurrent neural network architectures. RESN combines an evolutionary algorithm with a training-free evaluation approach. The results show that RESN achieves state-of-the-art error performance while reducing by half the computational time.
    Privacy Budget Scheduling. (arXiv:2106.15335v1 [cs.CR])
    (2 min) Machine learning (ML) models trained on personal data have been shown to leak information about users. Differential privacy (DP) enables model training with a guaranteed bound on this leakage. Each new model trained with DP increases the bound on data leakage and can be seen as consuming part of a global privacy budget that should not be exceeded. This budget is a scarce resource that must be carefully managed to maximize the number of successfully trained models. We describe PrivateKube, an extension to the popular Kubernetes datacenter orchestrator that adds privacy as a new type of resource to be managed alongside other traditional compute resources, such as CPU, GPU, and memory. The abstractions we design for the privacy resource mirror those defined by Kubernetes for traditional resources, but there are also major differences. For example, traditional compute resources are replenishable while privacy is not: a CPU can be regained after a model finishes execution while privacy budget cannot. This distinction forces a re-design of the scheduler. We present DPF (Dominant Private Block Fairness) -- a variant of the popular Dominant Resource Fairness (DRF) algorithm -- that is geared toward the non-replenishable privacy resource but enjoys similar theoretical properties as DRF. We evaluate PrivateKube and DPF on microbenchmarks and an ML workload on Amazon Reviews data. Compared to existing baselines, DPF allows training more models under the same global privacy guarantee. This is especially true for DPF over R\'enyi DP, a highly composable form of DP.
    Predicting the Solar Potential of Rooftops using Image Segmentation and Structured Data. (arXiv:2106.15268v1 [cs.CV])
    (2 min) Estimating the amount of electricity that can be produced by rooftop photovoltaic systems is a time-consuming process that requires on-site measurements, a difficult task to achieve on a large scale. In this paper, we present an approach to estimate the solar potential of rooftops based on their location and architectural characteristics, as well as the amount of solar radiation they receive annually. Our technique uses computer vision to achieve semantic segmentation of roof sections and roof objects on the one hand, and a machine learning model based on structured building features to predict roof pitch on the other hand. We then compute the azimuth and maximum number of solar panels that can be installed on a rooftop with geometric approaches. Finally, we compute precise shading masks and combine them with solar irradiation data that enables us to estimate the yearly solar potential of a rooftop.
    DeepGD: A Deep Learning Framework for Graph Drawing Using GNN. (arXiv:2106.15347v1 [cs.LG])
    (2 min) In the past decades, many graph drawing techniques have been proposed for generating aesthetically pleasing graph layouts. However, it remains a challenging task since different layout methods tend to highlight different characteristics of the graphs. Recently, studies on deep learning based graph drawing algorithm have emerged but they are often not generalizable to arbitrary graphs without re-training. In this paper, we propose a Convolutional Graph Neural Network based deep learning framework, DeepGD, which can draw arbitrary graphs once trained. It attempts to generate layouts by compromising among multiple pre-specified aesthetics considering a good graph layout usually complies with multiple aesthetics simultaneously. In order to balance the trade-off, we propose two adaptive training strategies which adjust the weight factor of each aesthetic dynamically during training. The quantitative and qualitative assessment of DeepGD demonstrates that it is capable of drawing arbitrary graphs effectively, while being flexible at accommodating different aesthetic criteria.
    Learning Task Informed Abstraction. (arXiv:2106.15612v1 [cs.LG])
    (2 min) Current model-based reinforcement learning methods struggle when operating from complex visual scenes due to their inability to prioritize task-relevant features. To mitigate this problem, we propose learning Task Informed Abstractions (TIA) that explicitly separates reward-correlated visual features from distractors. For learning TIA, we introduce the formalism of Task Informed MDP (TiMDP) that is realized by training two models that learn visual features via cooperative reconstruction, but one model is adversarially dissociated from the reward signal. Empirical evaluation shows that TIA leads to significant performance gains over state-of-the-art methods on many visual control tasks where natural and unconstrained visual distractions pose a formidable challenge.
    ScanBank: A Benchmark Dataset for Figure Extraction from Scanned Electronic Theses and Dissertations. (arXiv:2106.15320v1 [cs.CV])
    (3 min) We focus on electronic theses and dissertations (ETDs), aiming to improve access and expand their utility, since more than 6 million are publicly available, and they constitute an important corpus to aid research and education across disciplines. The corpus is growing as new born-digital documents are included, and since millions of older theses and dissertations have been converted to digital form to be disseminated electronically in institutional repositories. In ETDs, as with other scholarly works, figures and tables can communicate a large amount of information in a concise way. Although methods have been proposed for extracting figures and tables from born-digital PDFs, they do not work well with scanned ETDs. Considering this problem, our assessment of state-of-the-art figure extraction systems is that the reason they do not function well on scanned PDFs is that they have only been trained on born-digital documents. To address this limitation, we present ScanBank, a new dataset containing 10 thousand scanned page images, manually labeled by humans as to the presence of the 3.3 thousand figures or tables found therein. We use this dataset to train a deep neural network model based on YOLOv5 to accurately extract figures and tables from scanned ETDs. We pose and answer important research questions aimed at finding better methods for figure extraction from scanned documents. One of those concerns the value for training, of data augmentation techniques applied to born-digital documents which are used to train models better suited for figure extraction from scanned documents. To the best of our knowledge, ScanBank is the first manually annotated dataset for figure and table extraction for scanned ETDs. A YOLOv5-based model, trained on ScanBank, outperforms existing comparable open-source and freely available baseline methods by a considerable margin.
    Towards Sample-Optimal Compressive Phase Retrieval with Sparse and Generative Priors. (arXiv:2106.15358v1 [stat.ML])
    (2 min) Compressive phase retrieval is a popular variant of the standard compressive sensing problem, in which the measurements only contain magnitude information. In this paper, motivated by recent advances in deep generative models, we provide recovery guarantees with order-optimal sample complexity bounds for phase retrieval with generative priors. We first show that when using i.i.d. Gaussian measurements and an $L$-Lipschitz continuous generative model with bounded $k$-dimensional inputs, roughly $O(k \log L)$ samples suffice to guarantee that the signal is close to any vector that minimizes an amplitude-based empirical loss function. Attaining this sample complexity with a practical algorithm remains a difficult challenge, and a popular spectral initialization method has been observed to pose a major bottleneck. To partially address this, we further show that roughly $O(k \log L)$ samples ensure sufficient closeness between the signal and any {\em globally optimal} solution to an optimization problem designed for spectral initialization (though finding such a solution may still be challenging). We adapt this result to sparse phase retrieval, and show that $O(s \log n)$ samples are sufficient for a similar guarantee when the underlying signal is $s$-sparse and $n$-dimensional, matching an information-theoretic lower bound. While our guarantees do not directly correspond to a practical algorithm, we propose a practical spectral initialization method motivated by our findings, and experimentally observe significant performance gains over various existing spectral initialization methods of sparse phase retrieval.
    Artificial Intelligence in Minimally Invasive Interventional Treatment. (arXiv:2106.15306v1 [cs.CV])
    (2 min) Minimally invasive image guided treatment procedures often employ advanced image processing algorithms. The recent developments of artificial intelligence algorithms harbor potential to further enhance this domain. In this article we explore several application areas within the minimally invasive treatment space and discuss the deployment of artificial intelligence within these areas.
    Continuous Latent Process Flows. (arXiv:2106.15580v1 [cs.LG])
    (2 min) Partial observations of continuous time-series dynamics at arbitrary time stamps exist in many disciplines. Fitting this type of data using statistical models with continuous dynamics is not only promising at an intuitive level but also has practical benefits, including the ability to generate continuous trajectories and to perform inference on previously unseen time stamps. Despite exciting progress in this area, the existing models still face challenges in terms of their representational power and the quality of their variational approximations. We tackle these challenges with continuous latent process flows (CLPF), a principled architecture decoding continuous latent processes into continuous observable processes using a time-dependent normalizing flow driven by a stochastic differential equation. To optimize our model using maximum likelihood, we propose a novel piecewise construction of a variational posterior process and derive the corresponding variational lower bound using trajectory re-weighting. Our ablation studies demonstrate the effectiveness of our contributions in various inference tasks on irregular time grids. Comparisons to state-of-the-art baselines show our model's favourable performance on both synthetic and real-world time-series data.
    Probabilistic Attention for Interactive Segmentation. (arXiv:2106.15338v1 [cs.CV])
    (2 min) We provide a probabilistic interpretation of attention and show that the standard dot-product attention in transformers is a special case of Maximum A Posteriori (MAP) inference. The proposed approach suggests the use of Expectation Maximization algorithms for online adaptation of key and value model parameters. This approach is useful for cases in which external agents, e.g., annotators, provide inference-time information about the correct values of some tokens, e.g, the semantic category of some pixels, and we need for this new information to propagate to other tokens in a principled manner. We illustrate the approach on an interactive semantic segmentation task in which annotators and models collaborate online to improve annotation efficiency. Using standard benchmarks, we observe that key adaptation boosts model performance ($\sim10\%$ mIoU) in the low feedback regime and value propagation improves model responsiveness in the high feedback regime. A PyTorch layer implementation of our probabilistic attention model will be made publicly available.
    Attack Transferability Characterization for Adversarially Robust Multi-label Classification. (arXiv:2106.15360v1 [cs.LG])
    (2 min) Despite of the pervasive existence of multi-label evasion attack, it is an open yet essential problem to characterize the origin of the adversarial vulnerability of a multi-label learning system and assess its attackability. In this study, we focus on non-targeted evasion attack against multi-label classifiers. The goal of the threat is to cause miss-classification with respect to as many labels as possible, with the same input perturbation. Our work gains in-depth understanding about the multi-label adversarial attack by first characterizing the transferability of the attack based on the functional properties of the multi-label classifier. We unveil how the transferability level of the attack determines the attackability of the classifier via establishing an information-theoretic analysis of the adversarial risk. Furthermore, we propose a transferability-centered attackability assessment, named Soft Attackability Estimator (SAE), to evaluate the intrinsic vulnerability level of the targeted multi-label classifier. This estimator is then integrated as a transferability-tuning regularization term into the multi-label learning paradigm to achieve adversarially robust classification. The experimental study on real-world data echos the theoretical analysis and verify the validity of the transferability-regularized multi-label learning method.
    Adaptive Sample Selection for Robust Learning under Label Noise. (arXiv:2106.15292v1 [cs.LG])
    (2 min) Deep Neural Networks (DNNs) have been shown to be susceptible to memorization or overfitting in the presence of noisily labelled data. For the problem of robust learning under such noisy data, several algorithms have been proposed. A prominent class of algorithms rely on sample selection strategies, motivated by curriculum learning. For example, many algorithms use the `small loss trick' wherein a fraction of samples with loss values below a certain threshold are selected for training. These algorithms are sensitive to such thresholds, and it is difficult to fix or learn these thresholds. Often, these algorithms also require information such as label noise rates which are typically unavailable in practice. In this paper, we propose a data-dependent, adaptive sample selection strategy that relies only on batch statistics of a given mini-batch to provide robustness against label noise. The algorithm does not have any additional hyperparameters for sample selection, does not need any information on noise rates, and does not need access to separate data with clean labels. We empirically demonstrate the effectiveness of our algorithm on benchmark datasets.
    Autonomous Driving Implementation in an Experimental Environment. (arXiv:2106.15274v1 [cs.RO])
    (2 min) Autonomous systems require identifying the environment and it has a long way to go before putting it safely into practice. In autonomous driving systems, the detection of obstacles and traffic lights are of importance as well as lane tracking. In this study, an autonomous driving system is developed and tested in the experimental environment designed for this purpose. In this system, a model vehicle having a camera is used to trace the lanes and avoid obstacles to experimentally study autonomous driving behavior. Convolutional Neural Network models were trained for Lane tracking. For the vehicle to avoid obstacles, corner detection, optical flow, focus of expansion, time to collision, balance calculation, and decision mechanism were created, respectively.
    Evaluating Deep Neural Networks for Image Document Enhancement. (arXiv:2106.15286v1 [cs.CV])
    (2 min) This work evaluates six state-of-the-art deep neural network (DNN) architectures applied to the problem of enhancing camera-captured document images. The results from each network were evaluated both qualitatively and quantitatively using Image Quality Assessment (IQA) metrics, and also compared with an existing approach based on traditional computer vision techniques. The best performing architectures generally produced good enhancement compared to the existing algorithm, showing that it is possible to use DNNs for document image enhancement. Furthermore, the best performing architectures could work as a baseline for future investigations on document enhancement using deep learning techniques. The main contributions of this paper are: a baseline of deep learning techniques that can be further improved to provide better results, and a evaluation methodology using IQA metrics for quantitatively comparing the produced images from the neural networks to a ground truth.
    Anomaly Detection and Automated Labeling for Voter Registration File Changes. (arXiv:2106.15285v1 [cs.CR])
    (2 min) Voter eligibility in United States elections is determined by a patchwork of state databases containing information about which citizens are eligible to vote. Administrators at the state and local level are faced with the exceedingly difficult task of ensuring that each of their jurisdictions is properly managed, while also monitoring for improper modifications to the database. Monitoring changes to Voter Registration Files (VRFs) is crucial, given that a malicious actor wishing to disrupt the democratic process in the US would be well-advised to manipulate the contents of these files in order to achieve their goals. In 2020, we saw election officials perform admirably when faced with administering one of the most contentious elections in US history, but much work remains to secure and monitor the election systems Americans rely on. Using data created by comparing snapshots taken of VRFs over time, we present a set of methods that make use of machine learning to ease the burden on analysts and administrators in protecting voter rolls. We first evaluate the effectiveness of multiple unsupervised anomaly detection methods in detecting VRF modifications by modeling anomalous changes as sparse additive noise. In this setting we determine that statistical models comparing administrative districts within a short time span and non-negative matrix factorization are most effective for surfacing anomalous events for review. These methods were deployed during 2019-2020 in our organization's monitoring system and were used in collaboration with the office of the Iowa Secretary of State. Additionally, we propose a newly deployed model which uses historical and demographic metadata to label the likely root cause of database modifications. We hope to use this model to predict which modifications have known causes and therefore better identify potentially anomalous modifications.
    Image Inpainting Using Wasserstein Generative Adversarial Imputation Network. (arXiv:2106.15341v1 [cs.CV])
    (2 min) Image inpainting is one of the important tasks in computer vision which focuses on the reconstruction of missing regions in an image. The aim of this paper is to introduce an image inpainting model based on Wasserstein Generative Adversarial Imputation Network. The generator network of the model uses building blocks of convolutional layers with different dilation rates, together with skip connections that help the model reproduce fine details of the output. This combination yields a universal imputation model that is able to handle various scenarios of missingness with sufficient quality. To show this experimentally, the model is simultaneously trained to deal with three scenarios given by missing pixels at random, missing various smaller square regions, and one missing square placed in the center of the image. It turns out that our model achieves high-quality inpainting results on all scenarios. Performance is evaluated using peak signal-to-noise ratio and structural similarity index on two real-world benchmark datasets, CelebA faces and Paris StreetView. The results of our model are compared to biharmonic imputation and to some of the other state-of-the-art image inpainting methods.
    Cascaded Diffusion Models for High Fidelity Image Generation. (arXiv:2106.15282v1 [cs.CV])
    (2 min) We show that cascaded diffusion models are capable of generating high fidelity images on the class-conditional ImageNet generation challenge, without any assistance from auxiliary image classifiers to boost sample quality. A cascaded diffusion model comprises a pipeline of multiple diffusion models that generate images of increasing resolution, beginning with a standard diffusion model at the lowest resolution, followed by one or more super-resolution diffusion models that successively upsample the image and add higher resolution details. We find that the sample quality of a cascading pipeline relies crucially on conditioning augmentation, our proposed method of data augmentation of the lower resolution conditioning inputs to the super-resolution models. Our experiments show that conditioning augmentation prevents compounding error during sampling in a cascaded model, helping us to train cascading pipelines achieving FID scores of 1.48 at 64x64, 3.52 at 128x128 and 4.88 at 256x256 resolutions, outperforming BigGAN-deep.
    A Mixed-Supervision Multilevel GAN Framework for Image Quality Enhancement. (arXiv:2106.15575v1 [eess.IV])
    (2 min) Deep neural networks for image quality enhancement typically need large quantities of highly-curated training data comprising pairs of low-quality images and their corresponding high-quality images. While high-quality image acquisition is typically expensive and time-consuming, medium-quality images are faster to acquire, at lower equipment costs, and available in larger quantities. Thus, we propose a novel generative adversarial network (GAN) that can leverage training data at multiple levels of quality (e.g., high and medium quality) to improve performance while limiting costs of data curation. We apply our mixed-supervision GAN to (i) super-resolve histopathology images and (ii) enhance laparoscopy images by combining super-resolution and surgical smoke removal. Results on large clinical and pre-clinical datasets show the benefits of our mixed-supervision GAN over the state of the art.
    An Image is Worth More Than a Thousand Words: Towards Disentanglement in the Wild. (arXiv:2106.15610v1 [cs.CV])
    (2 min) Unsupervised disentanglement has been shown to be theoretically impossible without inductive biases on the models and the data. As an alternative approach, recent methods rely on limited supervision to disentangle the factors of variation and allow their identifiability. While annotating the true generative factors is only required for a limited number of observations, we argue that it is infeasible to enumerate all the factors of variation that describe a real-world image distribution. To this end, we propose a method for disentangling a set of factors which are only partially labeled, as well as separating the complementary set of residual factors that are never explicitly specified. Our success in this challenging setting, demonstrated on synthetic benchmarks, gives rise to leveraging off-the-shelf image descriptors to partially annotate a subset of attributes in real image domains (e.g. of human faces) with minimal manual effort. Specifically, we use a recent language-image embedding model (CLIP) to annotate a set of attributes of interest in a zero-shot manner and demonstrate state-of-the-art disentangled image manipulation results.
    Generating the Graph Gestalt: Kernel-Regularized Graph Representation Learning. (arXiv:2106.15239v1 [cs.LG])
    (2 min) Recent work on graph generative models has made remarkable progress towards generating increasingly realistic graphs, as measured by global graph features such as degree distribution, density, and clustering coefficients. Deep generative models have also made significant advances through better modelling of the local correlations in the graph topology, which have been very useful for predicting unobserved graph components, such as the existence of a link or the class of a node, from nearby observed graph components. A complete scientific understanding of graph data should address both global and local structure. In this paper, we propose a joint model for both as complementary objectives in a graph VAE framework. Global structure is captured by incorporating graph kernels in a probabilistic model whose loss function is closely related to the maximum mean discrepancy(MMD) between the global structures of the reconstructed and the input graphs. The ELBO objective derived from the model regularizes a standard local link reconstruction term with an MMD term. Our experiments demonstrate a significant improvement in the realism of the generated graph structures, typically by 1-2 orders of magnitude of graph structure metrics, compared to leading graph VAEand GAN models. Local link reconstruction improves as well in many cases.
    Convolutional Hypercomplex Embeddings for Link Prediction. (arXiv:2106.15230v1 [cs.LG])
    (2 min) Knowledge graph embedding research has mainly focused on the two smallest normed division algebras, $\mathbb{R}$ and $\mathbb{C}$. Recent results suggest that trilinear products of quaternion-valued embeddings can be a more effective means to tackle link prediction. In addition, models based on convolutions on real-valued embeddings often yield state-of-the-art results for link prediction. In this paper, we investigate a composition of convolution operations with hypercomplex multiplications. We propose the four approaches QMult, OMult, ConvQ and ConvO to tackle the link prediction problem. QMult and OMult can be considered as quaternion and octonion extensions of previous state-of-the-art approaches, including DistMult and ComplEx. ConvQ and ConvO build upon QMult and OMult by including convolution operations in a way inspired by the residual learning framework. We evaluated our approaches on seven link prediction datasets including WN18RR, FB15K-237 and YAGO3-10. Experimental results suggest that the benefits of learning hypercomplex-valued vector representations become more apparent as the size and complexity of the knowledge graph grows. ConvO outperforms state-of-the-art approaches on FB15K-237 in MRR, Hit@1 and Hit@3, while QMult, OMult, ConvQ and ConvO outperform state-of-the-approaches on YAGO3-10 in all metrics. Results also suggest that link prediction performances can be further improved via prediction averaging. To foster reproducible research, we provide an open-source implementation of approaches, including training and evaluation scripts as well as pretrained models.
    Soft Attention: Does it Actually Help to Learn Social Interactions in Pedestrian Trajectory Prediction?. (arXiv:2106.15321v1 [cs.CV])
    (2 min) We consider the problem of predicting the future path of a pedestrian using its motion history and the motion history of the surrounding pedestrians, called social information. Since the seminal paper on Social-LSTM, deep-learning has become the main tool used to model the impact of social interactions on a pedestrian's motion. The demonstration that these models can learn social interactions relies on an ablative study of these models. The models are compared with and without their social interactions module on two standard metrics, the Average Displacement Error and Final Displacement Error. Yet, these complex models were recently outperformed by a simple constant-velocity approach. This questions if they actually allow to model social interactions as well as the validity of the proof. In this paper, we focus on the deep-learning models with a soft-attention mechanism for social interaction modeling and study whether they use social information at prediction time. We conduct two experiments across four state-of-the-art approaches on the ETH and UCY datasets, which were also used in previous work. First, the models are trained by replacing the social information with random noise and compared to model trained with actual social information. Second, we use a gating mechanism along with a $L_0$ penalty, allowing models to shut down their inner components. The models consistently learn to prune their soft-attention mechanism. For both experiments, neither the course of the convergence nor the prediction performance were altered. This demonstrates that the soft-attention mechanism and therefore the social information are ignored by the models.
    Similarity Embedding Networks for Robust Human Activity Recognition. (arXiv:2106.15283v1 [cs.CV])
    (2 min) Deep learning models for human activity recognition (HAR) based on sensor data have been heavily studied recently. However, the generalization ability of deep models on complex real-world HAR data is limited by the availability of high-quality labeled activity data, which are hard to obtain. In this paper, we design a similarity embedding neural network that maps input sensor signals onto real vectors through carefully designed convolutional and LSTM layers. The embedding network is trained with a pairwise similarity loss, encouraging the clustering of samples from the same class in the embedded real space, and can be effectively trained on a small dataset and even on a noisy dataset with mislabeled samples. Based on the learned embeddings, we further propose both nonparametric and parametric approaches for activity recognition. Extensive evaluation based on two public datasets has shown that the proposed similarity embedding network significantly outperforms state-of-the-art deep models on HAR classification tasks, is robust to mislabeled samples in the training set, and can also be used to effectively denoise a noisy dataset.
    Predicting Depth from Semantic Segmentation using Game Engine Dataset. (arXiv:2106.15257v1 [cs.CV])
    (2 min) Depth perception is fundamental for robots to understand the surrounding environment. As the view of cognitive neuroscience, visual depth perception methods are divided into three categories, namely binocular, active, and pictorial. The first two categories have been studied for decades in detail. However, research for the exploration of the third category is still in its infancy and has got momentum by the advent of deep learning methods in recent years. In cognitive neuroscience, it is known that pictorial depth perception mechanisms are dependent on the perception of seen objects. Inspired by this fact, in this thesis, we investigated the relation of perception of objects and depth estimation convolutional neural networks. For this purpose, we developed new network structures based on a simple depth estimation network that only used a single image at its input. Our proposed structures use both an image and a semantic label of the image as their input. We used semantic labels as the output of object perception. The obtained results of performance comparison between the developed network and original network showed that our novel structures can improve the performance of depth estimation by 52\% of relative error of distance in the examined cases. Most of the experimental studies were carried out on synthetic datasets that were generated by game engines to isolate the performance comparison from the effect of inaccurate depth and semantic labels of non-synthetic datasets. It is shown that particular synthetic datasets may be used for training of depth networks in cases that an appropriate dataset is not available. Furthermore, we showed that in these cases, usage of semantic labels improves the robustness of the network against domain shift from synthetic training data to non-synthetic test data.
    Semi-supervised learning with Bayesian Confidence Propagation Neural Network. (arXiv:2106.15546v1 [cs.LG])
    (2 min) Learning internal representations from data using no or few labels is useful for machine learning research, as it allows using massive amounts of unlabeled data. In this work, we use the Bayesian Confidence Propagation Neural Network (BCPNN) model developed as a biologically plausible model of the cortex. Recent work has demonstrated that these networks can learn useful internal representations from data using local Bayesian-Hebbian learning rules. In this work, we show how such representations can be leveraged in a semi-supervised setting by introducing and comparing different classifiers. We also evaluate and compare such networks with other popular semi-supervised classifiers.
    Efficient Realistic Data Generation Framework leveraging Deep Learning-based Human Digitization. (arXiv:2106.15409v1 [cs.CV])
    (2 min) The performance of supervised deep learning algorithms depends significantly on the scale, quality and diversity of the data used for their training. Collecting and manually annotating large amount of data can be both time-consuming and costly tasks to perform. In the case of tasks related to visual human-centric perception, the collection and distribution of such data may also face restrictions due to legislation regarding privacy. In addition, the design and testing of complex systems, e.g., robots, which often employ deep learning-based perception models, may face severe difficulties as even state-of-the-art methods trained on real and large-scale datasets cannot always perform adequately as they have not adapted to the visual differences between the virtual and the real world data. As an attempt to tackle and mitigate the effect of these issues, we present a method that automatically generates realistic synthetic data with annotations for a) person detection, b) face recognition, and c) human pose estimation. The proposed method takes as input real background images and populates them with human figures in various poses. Instead of using hand-made 3D human models, we propose the use of models generated through deep learning methods, further reducing the dataset creation costs, while maintaining a high level of realism. In addition, we provide open-source and easy to use tools that implement the proposed pipeline, allowing for generating highly-realistic synthetic datasets for a variety of tasks. A benchmarking and evaluation in the corresponding tasks shows that synthetic data can be effectively used as a supplement to real data.
    Robust Distributed Optimization With Randomly Corrupted Gradients. (arXiv:2106.14956v1 [math.OC])
    (2 min) In this paper, we propose a first-order distributed optimization algorithm that is provably robust to Byzantine failures-arbitrary and potentially adversarial behavior, where all the participating agents are prone to failure. We model each agent's state over time as a two-state Markov chain that indicates Byzantine or trustworthy behaviors at different time instants. We set no restrictions on the maximum number of Byzantine agents at any given time. We design our method based on three layers of defense: 1) Temporal gradient averaging, 2) robust aggregation, and 3) gradient normalization. We study two settings for stochastic optimization, namely Sample Average Approximation and Stochastic Approximation, and prove that for strongly convex and smooth non-convex cost functions, our algorithm achieves order-optimal statistical error and convergence rates.
    DeepFaceLab: Integrated, flexible and extensible face-swapping framework. (arXiv:2005.05535v5 [cs.CV] UPDATED)
    (2 min) Deepfake defense not only requires the research of detection but also requires the efforts of generation methods. However, current deepfake methods suffer the effects of obscure workflow and poor performance. To solve this problem, we present DeepFaceLab, the current dominant deepfake framework for face-swapping. It provides the necessary tools as well as an easy-to-use way to conduct high-quality face-swapping. It also offers a flexible and loose coupling structure for people who need to strengthen their pipeline with other features without writing complicated boilerplate code. We detail the principles that drive the implementation of DeepFaceLab and introduce its pipeline, through which every aspect of the pipeline can be modified painlessly by users to achieve their customization purpose. It is noteworthy that DeepFaceLab could achieve cinema-quality results with high fidelity. We demonstrate the advantage of our system by comparing our approach with other face-swapping methods.For more information, please visit:https://github.com/iperov/DeepFaceLab/.
    Framework for an Intelligent Affect Aware Smart Home Environment for Elderly People. (arXiv:2106.15599v1 [cs.HC])
    (2 min) The population of elderly people has been increasing at a rapid rate over the last few decades and their population is expected to further increase in the upcoming future. Their increasing population is associated with their increasing needs due to problems like physical disabilities, cognitive issues, weakened memory and disorganized behavior, that elderly people face with increasing age. To reduce their financial burden on the world economy and to enhance their quality of life, it is essential to develop technology-based solutions that are adaptive, assistive and intelligent in nature. Intelligent Affect Aware Systems that can not only analyze but also predict the behavior of elderly people in the context of their day to day interactions with technology in an IoT-based environment, holds immense potential for serving as a long-term solution for improving the user experience of elderly in smart homes. This work therefore proposes the framework for an Intelligent Affect Aware environment for elderly people that can not only analyze the affective components of their interactions but also predict their likely user experience even before they start engaging in any activity in the given smart home environment. This forecasting of user experience would provide scope for enhancing the same, thereby increasing the assistive and adaptive nature of such intelligent systems. To uphold the efficacy of this proposed framework for improving the quality of life of elderly people in smart homes, it has been tested on three datasets and the results are presented and discussed.
    Online Interaction Detection for Click-Through Rate Prediction. (arXiv:2106.15400v1 [cs.LG])
    (2 min) Click-Through Rate prediction aims to predict the ratio of clicks to impressions of a specific link. This is a challenging task since (1) there are usually categorical features, and the inputs will be extremely high-dimensional if one-hot encoding is applied, (2) not only the original features but also their interactions are important, (3) an effective prediction may rely on different features and interactions in different time periods. To overcome these difficulties, we propose a new interaction detection method, named Online Random Intersection Chains. The method, which is based on the idea of frequent itemset mining, detects informative interactions by observing the intersections of randomly chosen samples. The discovered interactions enjoy high interpretability as they can be comprehended as logical expressions. ORIC can be updated every time new data is collected, without being retrained on historical data. What's more, the importance of the historical and latest data can be controlled by a tuning parameter. A framework is designed to deal with the streaming interactions, so almost all existing models for CTR prediction can be applied after interaction detection. Empirical results demonstrate the efficiency and effectiveness of ORIC on three benchmark datasets.
    Arabic Speech Recognition by End-to-End, Modular Systems and Human. (arXiv:2101.08454v2 [eess.AS] UPDATED)
    (2 min) Recent advances in automatic speech recognition (ASR) have achieved accuracy levels comparable to human transcribers, which led researchers to debate if the machine has reached human performance. Previous work focused on the English language and modular hidden Markov model-deep neural network (HMM-DNN) systems. In this paper, we perform a comprehensive benchmarking for end-to-end transformer ASR, modular HMM-DNN ASR, and human speech recognition (HSR) on the Arabic language and its dialects. For the HSR, we evaluate linguist performance and lay-native speaker performance on a new dataset collected as a part of this study. For ASR the end-to-end work led to 12.5%, 27.5%, 33.8% WER; a new performance milestone for the MGB2, MGB3, and MGB5 challenges respectively. Our results suggest that human performance in the Arabic language is still considerably better than the machine with an absolute WER gap of 3.5% on average.
    Cross-Gradient Aggregation for Decentralized Learning from Non-IID data. (arXiv:2103.02051v2 [cs.LG] UPDATED)
    (2 min) Decentralized learning enables a group of collaborative agents to learn models using a distributed dataset without the need for a central parameter server. Recently, decentralized learning algorithms have demonstrated state-of-the-art results on benchmark data sets, comparable with centralized algorithms. However, the key assumption to achieve competitive performance is that the data is independently and identically distributed (IID) among the agents which, in real-life applications, is often not applicable. Inspired by ideas from continual learning, we propose Cross-Gradient Aggregation (CGA), a novel decentralized learning algorithm where (i) each agent aggregates cross-gradient information, i.e., derivatives of its model with respect to its neighbors' datasets, and (ii) updates its model using a projected gradient based on quadratic programming (QP). We theoretically analyze the convergence characteristics of CGA and demonstrate its efficiency on non-IID data distributions sampled from the MNIST and CIFAR-10 datasets. Our empirical comparisons show superior learning performance of CGA over existing state-of-the-art decentralized learning algorithms, as well as maintaining the improved performance under information compression to reduce peer-to-peer communication overhead. The code is available here on GitHub.
    Achieving Statistical Optimality of Federated Learning: Beyond Stationary Points. (arXiv:2106.15216v1 [stat.ML])
    (2 min) Federated Learning (FL) is a promising framework that has great potentials in privacy preservation and in lowering the computation load at the cloud. FedAvg and FedProx are two widely adopted algorithms. However, recent work raised concerns on these two methods: (1) their fixed points do not correspond to the stationary points of the original optimization problem, and (2) the common model found might not generalize well locally. In this paper, we alleviate these concerns. Towards this, we adopt the statistical learning perspective yet allow the distributions to be heterogeneous and the local data to be unbalanced. We show, in the general kernel regression setting, that both FedAvg and FedProx converge to the minimax-optimal error rates. Moreover, when the kernel function has a finite rank, the convergence is exponentially fast. Our results further analytically quantify the impact of the model heterogeneity and characterize the federation gain - the reduction of the estimation error for a worker to join the federated learning compared to the best local estimator. To the best of our knowledge, we are the first to show the achievability of minimax error rates under FedAvg and FedProx, and the first to characterize the gains in joining FL. Numerical experiments further corroborate our theoretical findings on the statistical optimality of FedAvg and FedProx and the federation gains.
    Look-Ahead Screening Rules for the Lasso. (arXiv:2105.05648v2 [stat.ML] UPDATED)
    (2 min) The lasso is a popular method to induce shrinkage and sparsity in the solution vector (coefficients) of regression problems, particularly when there are many predictors relative to the number of observations. Solving the lasso in this high-dimensional setting can, however, be computationally demanding. Fortunately, this demand can be alleviated via the use of screening rules that discard predictors prior to fitting the model, leading to a reduced problem to be solved. In this paper, we present a new screening strategy: look-ahead screening. Our method uses safe screening rules to find a range of penalty values for which a given predictor cannot enter the model, thereby screening predictors along the remainder of the path. In experiments we show that these look-ahead screening rules outperform the active warm-start version of the Gap Safe rules.
    Evolving-Graph Gaussian Processes. (arXiv:2106.15127v1 [cs.LG])
    (2 min) Graph Gaussian Processes (GGPs) provide a data-efficient solution on graph structured domains. Existing approaches have focused on static structures, whereas many real graph data represent a dynamic structure, limiting the applications of GGPs. To overcome this we propose evolving-Graph Gaussian Processes (e-GGPs). The proposed method is capable of learning the transition function of graph vertices over time with a neighbourhood kernel to model the connectivity and interaction changes between vertices. We assess the performance of our method on time-series regression problems where graphs evolve over time. We demonstrate the benefits of e-GGPs over static graph Gaussian Process approaches.
    Online Estimation and Coverage Control with Heterogeneous Sensing Information. (arXiv:2106.14984v1 [cs.RO])
    (2 min) Heterogeneous multi-robot sensing systems are able to characterize physical processes more comprehensively than homogeneous systems. Access to multiple modalities of sensory data allow such systems to fuse information between complementary sources and learn richer representations of a phenomenon of interest. Often, these data are correlated but vary in fidelity, i.e., accuracy (bias) and precision (noise). Low-fidelity data may be more plentiful, while high-fidelity data may be more trustworthy. In this paper, we address the problem of multi-robot online estimation and coverage control by combining low- and high-fidelity data to learn and cover a sensory function of interest. We propose two algorithms for this task of heterogeneous learning and coverage -- namely Stochastic Sequencing of Multi-fidelity Learning and Coverage (SMLC) and Deterministic Sequencing of Multi-fidelity Learning and Coverage (DMLC) -- and prove that they converge asymptotically. In addition, we demonstrate the empirical efficacy of SMLC and DMLC through numerical simulations.
    A Representation Learning Perspective on the Importance of Train-Validation Splitting in Meta-Learning. (arXiv:2106.15615v1 [cs.LG])
    (2 min) An effective approach in meta-learning is to utilize multiple "train tasks" to learn a good initialization for model parameters that can help solve unseen "test tasks" with very few samples by fine-tuning from this initialization. Although successful in practice, theoretical understanding of such methods is limited. This work studies an important aspect of these methods: splitting the data from each task into train (support) and validation (query) sets during meta-training. Inspired by recent work (Raghu et al., 2020), we view such meta-learning methods through the lens of representation learning and argue that the train-validation split encourages the learned representation to be low-rank without compromising on expressivity, as opposed to the non-splitting variant that encourages high-rank representations. Since sample efficiency benefits from low-rankness, the splitting strategy will require very few samples to solve unseen test tasks. We present theoretical results that formalize this idea for linear representation learning on a subspace meta-learning instance, and experimentally verify this practical benefit of splitting in simulations and on standard meta-learning benchmarks.
    Causal Policy Gradients: Leveraging Structure for Efficient Learning in (Factored) MOMDPs. (arXiv:2102.10362v2 [cs.LG] UPDATED)
    (2 min) Policy gradient methods can solve complex tasks but often fail when the dimensionality of the action-space or objective multiplicity grow very large. This occurs, in part, because the variance on score-based gradient estimators scales quadratically. In this paper, we address this problem through a causal baseline which exploits independence structure encoded in a novel action-target influence network. Causal policy gradients (CPGs), which follow, provide a common framework for analysing key state-of-the-art algorithms, are shown to generalise traditional policy gradients, and yield a principled way of incorporating prior knowledge of a problem domain's generative processes. We provide an analysis of the proposed estimator and identify the conditions under which variance is reduced. The algorithmic aspects of CPGs are discussed, including optimal policy factorisation, as characterised by minimum biclique coverings, and the implications for the bias-variance trade-off of incorrectly specifying the network. Finally, we demonstrate the performance advantages of our algorithm on large-scale bandit and traffic intersection problems, providing a novel contribution to the latter in the form of a spatio-causal approximation.
    On-board Volcanic Eruption Detection through CNNs and Satellite Multispectral Imagery. (arXiv:2106.15281v1 [cs.CV])
    (2 min) In recent years, the growth of Machine Learning algorithms in a variety of different applications has raised numerous studies on the applicability of these algorithms in real scenarios. Among all, one of the hardest scenarios, due to its physical requirements, is the aerospace one. In this context, the authors of this work aim to propose a first prototype and a study of feasibility for an AI model to be 'loaded' on board. As a case study, the authors decided to investigate the detection of volcanic eruptions as a method to swiftly produce alerts. Two Convolutional Neural Networks have been proposed and created, also showing how to correctly implement them on real hardware and how the complexity of a CNN can be adapted to fit computational requirements.
    Optimal Rates for Random Order Online Optimization. (arXiv:2106.15207v1 [cs.LG])
    (2 min) We study online convex optimization in the random order model, recently proposed by \citet{garber2020online}, where the loss functions may be chosen by an adversary, but are then presented to the online algorithm in a uniformly random order. Focusing on the scenario where the cumulative loss function is (strongly) convex, yet individual loss functions are smooth but might be non-convex, we give algorithms that achieve the optimal bounds and significantly outperform the results of \citet{garber2020online}, completely removing the dimension dependence and improving their scaling with respect to the strong convexity parameter. Our analysis relies on novel connections between algorithmic stability and generalization for sampling without-replacement analogous to those studied in the with-replacement i.i.d.~setting, as well as on a refined average stability analysis of stochastic gradient descent.
    A Comprehensive Survey of Incentive Mechanism for Federated Learning. (arXiv:2106.15406v1 [cs.LG])
    (2 min) Federated learning utilizes various resources provided by participants to collaboratively train a global model, which potentially address the data privacy issue of machine learning. In such promising paradigm, the performance will be deteriorated without sufficient training data and other resources in the learning process. Thus, it is quite crucial to inspire more participants to contribute their valuable resources with some payments for federated learning. In this paper, we present a comprehensive survey of incentive schemes for federate learning. Specifically, we identify the incentive problem in federated learning and then provide a taxonomy for various schemes. Subsequently, we summarize the existing incentive mechanisms in terms of the main techniques, such as Stackelberg game, auction, contract theory, Shapley value, reinforcement learning, blockchain. By reviewing and comparing some impressive results, we figure out three directions for the future study.
    Wasserstein Adversarial Regularization (WAR) on label noise. (arXiv:1904.03936v3 [cs.LG] UPDATED)
    (2 min) Noisy labels often occur in vision datasets, especially when they are obtained from crowdsourcing or Web scraping. We propose a new regularization method, which enables learning robust classifiers in presence of noisy data. To achieve this goal, we propose a new adversarial regularization scheme based on the Wasserstein distance. Using this distance allows taking into account specific relations between classes by leveraging the geometric properties of the labels space. Our Wasserstein Adversarial Regularization (WAR) encodes a selective regularization, which promotes smoothness of the classifier between some classes, while preserving sufficient complexity of the decision boundary between others. We first discuss how and why adversarial regularization can be used in the context of label noise and then show the effectiveness of our method on five datasets corrupted with noisy labels: in both benchmarks and real datasets, WAR outperforms the state-of-the-art competitors.
    FallDeF5: A Fall Detection Framework Using 5G-based Deep Gated Recurrent Unit Networks. (arXiv:2106.15049v1 [cs.LG])
    (2 min) Fall prevalence is high among elderly people, which is challenging due to the severe consequences of falling. This is why rapid assistance is a critical task. Ambient assisted living (AAL) uses recent technologies such as 5G networks and the internet of medical things (IoMT) to address this research area. Edge computing can reduce the cost of cloud communication, including high latency and bandwidth use, by moving conventional healthcare services and applications closer to end-users. Artificial intelligence (AI) techniques such as deep learning (DL) have been used recently for automatic fall detection, as well as supporting healthcare services. However, DL requires a vast amount of data and substantial processing power to improve its performance for the IoMT linked to the traditional edge computing environment. This research proposes an effective fall detection framework based on DL algorithms and mobile edge computing (MEC) within 5G wireless networks, the aim being to empower IoMT-based healthcare applications. We also propose the use of a deep gated recurrent unit (DGRU) neural network to improve the accuracy of existing DL-based fall detection methods. DGRU has the advantage of dealing with time-series IoMT data, and it can reduce the number of parameters and avoid the vanishing gradient problem. The experimental results on two public datasets show that the DGRU model of the proposed framework achieves higher accuracy rates compared to the current related works on the same datasets.
    Test-Time Adaptation to Distribution Shift by Confidence Maximization and Input Transformation. (arXiv:2106.14999v1 [stat.ML])
    (2 min) Deep neural networks often exhibit poor performance on data that is unlikely under the train-time data distribution, for instance data affected by corruptions. Previous works demonstrate that test-time adaptation to data shift, for instance using entropy minimization, effectively improves performance on such shifted distributions. This paper focuses on the fully test-time adaptation setting, where only unlabeled data from the target distribution is required. This allows adapting arbitrary pretrained networks. Specifically, we propose a novel loss that improves test-time adaptation by addressing both premature convergence and instability of entropy minimization. This is achieved by replacing the entropy by a non-saturating surrogate and adding a diversity regularizer based on batch-wise entropy maximization that prevents convergence to trivial collapsed solutions. Moreover, we propose to prepend an input transformation module to the network that can partially undo test-time distribution shifts. Surprisingly, this preprocessing can be learned solely using the fully test-time adaptation loss in an end-to-end fashion without any target domain labels or source domain data. We show that our approach outperforms previous work in improving the robustness of publicly available pretrained image classifiers to common corruptions on such challenging benchmarks as ImageNet-C.
    A Survey on Neural Speech Synthesis. (arXiv:2106.15561v1 [eess.AS])
    (2 min) Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in speech, language, and machine learning communities and has broad applications in the industry. As the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends. We focus on the key components in neural TTS, including text analysis, acoustic models and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS, etc. We further summarize resources related to TTS (e.g., datasets, opensource implementations) and discuss future research directions. This survey can serve both academic researchers and industry practitioners working on TTS.
    Forest Fire Clustering: Iterative Label Propagation Clustering and Monte Carlo Validation For Single-cell Sequencing Analysis. (arXiv:2103.11802v3 [cs.LG] UPDATED)
    (2 min) With the rise of single-cell sequencing technologies, there is a growing need for robust clustering algorithms to extract deeper insights from data. Here, we introduce an intuitive and efficient clustering method, Forest Fire Clustering, for discovering and validating cell types in single-cell sequencing analysis. Compared to existing methods, our clustering algorithm makes minimum prior assumptions about the data distribution and can provide a point-wise significance value via Monte Carlo simulations for internal validation. Additionally, point-wise label entropies can highlight novel transition cell types \emph{de novo} along developmental pseudo-time manifolds. Lastly, our inductive algorithm has the ability to make robust inferences in an online-learning context. In this paper, we describe the method, provide a summary of its performance against common clustering benchmarks, and demonstrate that Forest Fire Clustering is uniquely suitable for single-cell sequencing analysis.
    Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. (arXiv:2106.15013v1 [cs.LG])
    (2 min) Recently there has been significant theoretical progress on understanding the convergence and generalization of gradient-based methods on nonconvex losses with overparameterized models. Nevertheless, many aspects of optimization and generalization and in particular the critical role of small random initialization are not fully understood. In this paper, we take a step towards demystifying this role by proving that small random initialization followed by a few iterations of gradient descent behaves akin to popular spectral methods. We also show that this implicit spectral bias from small random initialization, which is provably more prominent for overparameterized models, also puts the gradient descent iterations on a particular trajectory towards solutions that are not only globally optimal but also generalize well. Concretely, we focus on the problem of reconstructing a low-rank matrix from a few measurements via a natural nonconvex formulation. In this setting, we show that the trajectory of the gradient descent iterations from small random initialization can be approximately decomposed into three phases: (I) a spectral or alignment phase where we show that that the iterates have an implicit spectral bias akin to spectral initialization allowing us to show that at the end of this phase the column space of the iterates and the underlying low-rank matrix are sufficiently aligned, (II) a saddle avoidance/refinement phase where we show that the trajectory of the gradient iterates moves away from certain degenerate saddle points, and (III) a local refinement phase where we show that after avoiding the saddles the iterates converge quickly to the underlying low-rank matrix. Underlying our analysis are insights for the analysis of overparameterized nonconvex optimization schemes that may have implications for computational problems beyond low-rank reconstruction.
    Learning complex dependency structure of gene regulatory networks from high dimensional micro-array data with Gaussian Bayesian networks. (arXiv:2106.15365v1 [q-bio.MN])
    (2 min) Gene expression datasets consist of thousand of genes with relatively small samplesizes (i.e. are large-$p$-small-$n$). Moreover, dependencies of various orders co-exist in the datasets. In the Undirected probabilistic Graphical Model (UGM) framework the Glasso algorithm has been proposed to deal with high dimensional micro-array datasets forcing sparsity. Also, modifications of the default Glasso algorithm are developed to overcome the problem of complex interaction structure. In this work we advocate the use of a simple score-based Hill Climbing algorithm (HC) that learns Gaussian Bayesian Networks (BNs) leaning on Directed Acyclic Graphs (DAGs). We compare HC with Glasso and its modifications in the UGM framework on their capability to reconstruct GRNs from micro-array data belonging to the Escherichia Coli genome. We benefit from the analytical properties of the Joint Probability Density (JPD) function on which both directed and undirected PGMs build to convert DAGs to UGMs. We conclude that dependencies in complex data are learned best by the HC algorithm, presenting them most accurately and efficiently, simultaneously modelling strong local and weaker but significant global connections coexisting in the gene expression dataset. The HC algorithm adapts intrinsically to the complex dependency structure of the dataset, without forcing a specific structure in advance. On the contrary, Glasso and modifications model unnecessary dependencies at the expense of the probabilistic information in the network and of a structural bias in the JPD function that can only be relieved including many parameters.
    Machine learning for plant microRNA prediction: A systematic review. (arXiv:2106.15159v1 [q-bio.GN])
    (2 min) MicroRNAs (miRNAs) are endogenous small non-coding RNAs that play an important role in post-transcriptional gene regulation. However, the experimental determination of miRNA sequence and structure is both expensive and time-consuming. Therefore, computational and machine learning-based approaches have been adopted to predict novel microRNAs. With the involvement of data science and machine learning in biology, multiple research studies have been conducted to find microRNAs with different computational methods and different miRNA features. Multiple approaches are discussed in detail considering the learning algorithm/s used, features considered, dataset/s used and the criteria used in evaluations. This systematic review focuses on the machine learning methods developed for miRNA identification in plants. This will help researchers to gain a detailed idea about past studies and identify novel paths that solve drawbacks occurred in past studies. Our findings highlight the need for plant-specific computational methods for miRNA identification.
    Data augmentation for deep learning based accelerated MRI reconstruction with limited data. (arXiv:2106.14947v1 [eess.IV])
    (2 min) Deep neural networks have emerged as very successful tools for image restoration and reconstruction tasks. These networks are often trained end-to-end to directly reconstruct an image from a noisy or corrupted measurement of that image. To achieve state-of-the-art performance, training on large and diverse sets of images is considered critical. However, it is often difficult and/or expensive to collect large amounts of training images. Inspired by the success of Data Augmentation (DA) for classification problems, in this paper, we propose a pipeline for data augmentation for accelerated MRI reconstruction and study its effectiveness at reducing the required training data in a variety of settings. Our DA pipeline, MRAugment, is specifically designed to utilize the invariances present in medical imaging measurements as naive DA strategies that neglect the physics of the problem fail. Through extensive studies on multiple datasets we demonstrate that in the low-data regime DA prevents overfitting and can match or even surpass the state of the art while using significantly fewer training data, whereas in the high-data regime it has diminishing returns. Furthermore, our findings show that DA can improve the robustness of the model against various shifts in the test distribution.
    Federated Dynamic Spectrum Access. (arXiv:2106.14976v1 [eess.SP])
    (2 min) Due to the growing volume of data traffic produced by the surge of Internet of Things (IoT) devices, the demand for radio spectrum resources is approaching their limitation defined by Federal Communications Commission (FCC). To this end, Dynamic Spectrum Access (DSA) is considered as a promising technology to handle this spectrum scarcity. However, standard DSA techniques often rely on analytical modeling wireless networks, making its application intractable in under-measured network environments. Therefore, utilizing neural networks to approximate the network dynamics is an alternative approach. In this article, we introduce a Federated Learning (FL) based framework for the task of DSA, where FL is a distributive machine learning framework that can reserve the privacy of network terminals under heterogeneous data distributions. We discuss the opportunities, challenges, and opening problems of this framework. To evaluate its feasibility, we implement a Multi-Agent Reinforcement Learning (MARL)-based FL as a realization associated with its initial evaluation results.
    Near-Optimal Explainable $k$-Means for All Dimensions. (arXiv:2106.15566v1 [cs.LG])
    (2 min) Many clustering algorithms are guided by certain cost functions such as the widely-used $k$-means cost. These algorithms divide data points into clusters with often complicated boundaries, creating difficulties in explaining the clustering decision. In a recent work, Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML'20) introduced explainable clustering, where the cluster boundaries are axis-parallel hyperplanes and the clustering is obtained by applying a decision tree to the data. The central question here is: how much does the explainability constraint increase the value of the cost function? Given $d$-dimensional data points, we show an efficient algorithm that finds an explainable clustering whose $k$-means cost is at most $k^{1 - 2/d}\mathrm{poly}(d\log k)$ times the minimum cost achievable by a clustering without the explainability constraint, assuming $k,d\ge 2$. Combining this with an independent work by Makarychev and Shan (ICML'21), we get an improved bound of $k^{1 - 2/d}\mathrm{polylog}(k)$, which we show is optimal for every choice of $k,d\ge 2$ up to a poly-logarithmic factor in $k$. For $d = 2$ in particular, we show an $O(\log k\log\log k)$ bound, improving exponentially over the previous best bound of $\widetilde O(k)$.
    Limited depth bandit-based strategy for Monte Carlo planning in continuous action spaces. (arXiv:2106.15594v1 [math.OC])
    (2 min) This paper addresses the problem of optimal control using search trees. We start by considering multi-armed bandit problems with continuous action spaces and propose LD-HOO, a limited depth variant of the hierarchical optimistic optimization (HOO) algorithm. We provide a regret analysis for LD-HOO and show that, asymptotically, our algorithm exhibits the same cumulative regret as the original HOO while being faster and more memory efficient. We then propose a Monte Carlo tree search algorithm based on LD-HOO for optimal control problems and illustrate the resulting approach's application in several optimal control problems.
    Subgroup Generalization and Fairness of Graph Neural Networks. (arXiv:2106.15535v1 [cs.LG])
    (2 min) Despite enormous successful applications of graph neural networks (GNNs) recently, theoretical understandings of their generalization ability, especially for node-level tasks where data are not independent and identically-distributed (IID), have been sparse. The theoretical investigation of the generalization performance is beneficial for understanding fundamental issues (such as fairness) of GNN models and designing better learning methods. In this paper, we present a novel PAC-Bayesian analysis for GNNs under a non-IID semi-supervised learning setup. Moreover, we analyze the generalization performances on different subgroups of unlabeled nodes, which allows us to further study an accuracy-(dis)parity-style (un)fairness of GNNs from a theoretical perspective. Under reasonable assumptions, we demonstrate that the distance between a test subgroup and the training set can be a key factor affecting the GNN performance on that subgroup, which calls special attention to the training node selection for fair learning. Experiments across multiple GNN models and datasets support our theoretical results.
    Learning latent causal graphs via mixture oracles. (arXiv:2106.15563v1 [cs.LG])
    (2 min) We study the problem of reconstructing a causal graphical model from data in the presence of latent variables. The main problem of interest is recovering the causal structure over the latent variables while allowing for general, potentially nonlinear dependence between the variables. In many practical problems, the dependence between raw observations (e.g. pixels in an image) is much less relevant than the dependence between certain high-level, latent features (e.g. concepts or objects), and this is the setting of interest. We provide conditions under which both the latent representations and the underlying latent causal model are identifiable by a reduction to a mixture oracle. The proof is constructive, and leads to several algorithms for explicitly reconstructing the full graphical model. We discuss efficient algorithms and provide experiments illustrating the algorithms in practice.
    Cross-domain error minimization for unsupervised domain adaptation. (arXiv:2106.15057v1 [cs.LG])
    (2 min) Unsupervised domain adaptation aims to transfer knowledge from a labeled source domain to an unlabeled target domain. Previous methods focus on learning domain-invariant features to decrease the discrepancy between the feature distributions as well as minimizing the source error and have made remarkable progress. However, a recently proposed theory reveals that such a strategy is not sufficient for a successful domain adaptation. It shows that besides a small source error, both the discrepancy between the feature distributions and the discrepancy between the labeling functions should be small across domains. The discrepancy between the labeling functions is essentially the cross-domain errors which are ignored by existing methods. To overcome this issue, in this paper, a novel method is proposed to integrate all the objectives into a unified optimization framework. Moreover, the incorrect pseudo labels widely used in previous methods can lead to error accumulation during learning. To alleviate this problem, the pseudo labels are obtained by utilizing structural information of the target domain besides source classifier and we propose a curriculum learning based strategy to select the target samples with more accurate pseudo-labels during training. Comprehensive experiments are conducted, and the results validate that our approach outperforms state-of-the-art methods.
    SCARF: Self-Supervised Contrastive Learning using Random Feature Corruption. (arXiv:2106.15147v1 [cs.LG])
    (2 min) Self-supervised contrastive representation learning has proved incredibly successful in the vision and natural language domains, enabling state-of-the-art performance with orders of magnitude less labeled data. However, such methods are domain-specific and little has been done to leverage this technique on real-world tabular datasets. We propose SCARF, a simple, widely-applicable technique for contrastive learning, where views are formed by corrupting a random subset of features. When applied to pre-train deep neural networks on the 69 real-world, tabular classification datasets from the OpenML-CC18 benchmark, SCARF not only improves classification accuracy in the fully-supervised setting but does so also in the presence of label noise and in the semi-supervised setting where only a fraction of the available training data is labeled. We show that SCARF complements existing strategies and outperforms alternatives like autoencoders. We conduct comprehensive ablations, detailing the importance of a range of factors.
    GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis. (arXiv:2106.15153v1 [eess.AS])
    (2 min) Recent advances in neural multi-speaker text-to-speech (TTS) models have enabled the generation of reasonably good speech quality with a single model and made it possible to synthesize the speech of a speaker with limited training data. Fine-tuning to the target speaker data with the multi-speaker model can achieve better quality, however, there still exists a gap compared to the real speech sample and the model depends on the speaker. In this work, we propose GANSpeech, which is a high-fidelity multi-speaker TTS model that adopts the adversarial training method to a non-autoregressive multi-speaker TTS model. In addition, we propose simple but efficient automatic scaling methods for feature matching loss used in adversarial training. In the subjective listening tests, GANSpeech significantly outperformed the baseline multi-speaker FastSpeech and FastSpeech2 models, and showed a better MOS score than the speaker-specific fine-tuned FastSpeech2.
    Counterfactual Explanations for Arbitrary Regression Models. (arXiv:2106.15212v1 [cs.LG])
    (2 min) We present a new method for counterfactual explanations (CFEs) based on Bayesian optimisation that applies to both classification and regression models. Our method is a globally convergent search algorithm with support for arbitrary regression models and constraints like feature sparsity and actionable recourse, and furthermore can answer multiple counterfactual questions in parallel while learning from previous queries. We formulate CFE search for regression models in a rigorous mathematical framework using differentiable potentials, which resolves robustness issues in threshold-based objectives. We prove that in this framework, (a) verifying the existence of counterfactuals is NP-complete; and (b) that finding instances using such potentials is CLS-complete. We describe a unified algorithm for CFEs using a specialised acquisition function that composes both expected improvement and an exponential-polynomial (EP) family with desirable properties. Our evaluation on real-world benchmark domains demonstrate high sample-efficiency and precision.
    End-to-end Waveform Learning Through Joint Optimization of Pulse and Constellation Shaping. (arXiv:2106.15158v1 [cs.IT])
    (2 min) As communication systems are foreseen to enable new services such as joint communication and sensing and utilize parts of the sub-THz spectrum, the design of novel waveforms that can support these emerging applications becomes increasingly challenging. We present in this work an end-to-end learning approach to design waveforms through joint learning of pulse shaping and constellation geometry, together with a neural network (NN)-based receiver. Optimization is performed to maximize an achievable information rate, while satisfying constraints on out-of-band emission and power envelope. Our results show that the proposed approach enables up to orders of magnitude smaller adjacent channel leakage ratios (ACLRs) with peak-to-average power ratios (PAPRs) competitive with traditional filters, without significant loss of information rate on an additive white Gaussian noise (AWGN) channel, and no additional complexity at the transmitter.
    FastPitchFormant: Source-filter based Decomposed Modeling for Speech Synthesis. (arXiv:2106.15123v1 [eess.AS])
    (2 min) Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models. Prosodic speech can be generated by conditioning acoustic features. However, synthesized speech with a large pitch-shift scale suffers from audio quality degradation, and speaker characteristics deformation. To address this problem, we propose a feed-forward Transformer based TTS model that is designed based on the source-filter theory. This model, called FastPitchFormant, has a unique structure that handles text and acoustic features in parallel. With modeling each feature separately, the tendency that the model learns the relationship between two features can be mitigated.
    Constructing Forest Biomass Prediction Maps from Radar Backscatter by Sequential Regression with a Conditional Generative Adversarial Network. (arXiv:2106.15020v1 [cs.LG])
    (2 min) This paper studies construction of above-ground biomass (AGB) prediction maps from synthetic aperture radar (SAR) intensity images. The purpose is to improve traditional regression models based on SAR intensity, trained with a limited amount of AGB in situ measurements. Although it is costly to collect, data from airborne laser scanning (ALS) sensors are highly correlated with AGB. Therefore, we propose using AGB predictions based on ALS data as surrogate response variables for SAR data in a sequential modelling fashion. This increases the amount of training data dramatically. To model the regression function between SAR intensity and ALS-predicted AGB we propose to utilise a conditional generative adversarial network (cGAN), i.e. the Pix2Pix convolutional neural network. This enables the recreation of existing ALS-based AGB prediction maps. The generated synthesised ALS-based AGB predictions are evaluated qualitatively and quantitatively against ALS-based AGB predictions retrieved from a traditional non-sequential regression model trained in the same area. Results show that the proposed architecture manages to capture characteristics of the actual data. This suggests that the use of ALS-guided generative models is a promising avenue for AGB prediction from SAR intensity. Further research on this area has the potential of providing both large-scale and low-cost predictions of AGB.
    ElephantBook: A Semi-Automated Human-in-the-Loop System for Elephant Re-Identification. (arXiv:2106.15083v1 [cs.LG])
    (2 min) African elephants are vital to their ecosystems, but their populations are threatened by a rise in human-elephant conflict and poaching. Monitoring population dynamics is essential in conservation efforts; however, tracking elephants is a difficult task, usually relying on the invasive and sometimes dangerous placement of GPS collars. Although there have been many recent successes in the use of computer vision techniques for automated identification of other species, identification of elephants is extremely difficult and typically requires expertise as well as familiarity with elephants in the population. We have built and deployed a web-based platform and database for human-in-the-loop re-identification of elephants combining manual attribute labeling and state-of-the-art computer vision algorithms, known as ElephantBook. Our system is currently in use at the Mara Elephant Project, helping monitor the protected and at-risk population of elephants in the Greater Maasai Mara ecosystem. ElephantBook makes elephant re-identification usable by non-experts and scalable for use by multiple conservation NGOs.
    An Efficient Batch Constrained Bayesian Optimization Approach for Analog Circuit Synthesis via Multi-objective Acquisition Ensemble. (arXiv:2106.15412v1 [cs.LG])
    (2 min) Bayesian optimization is a promising methodology for analog circuit synthesis. However, the sequential nature of the Bayesian optimization framework significantly limits its ability to fully utilize real-world computational resources. In this paper, we propose an efficient parallelizable Bayesian optimization algorithm via Multi-objective ACquisition function Ensemble (MACE) to further accelerate the optimization procedure. By sampling query points from the Pareto front of the probability of improvement (PI), expected improvement (EI) and lower confidence bound (LCB), we combine the benefits of state-of-the-art acquisition functions to achieve a delicate tradeoff between exploration and exploitation for the unconstrained optimization problem. Based on this batch design, we further adjust the algorithm for the constrained optimization problem. By dividing the optimization procedure into two stages and first focusing on finding an initial feasible point, we manage to gain more information about the valid region and can better avoid sampling around the infeasible area. After achieving the first feasible point, we favor the feasible region by adopting a specially designed penalization term to the acquisition function ensemble. The experimental results quantitatively demonstrate that our proposed algorithm can reduce the overall simulation time by up to 74 times compared to differential evolution (DE) for the unconstrained optimization problem when the batch size is 15. For the constrained optimization problem, our proposed algorithm can speed up the optimization process by up to 15 times compared to the weighted expected improvement based Bayesian optimization (WEIBO) approach, when the batch size is 15.
    Sounds of COVID-19: exploring realistic performance of audio-based digital testing. (arXiv:2106.15523v1 [cs.SD])
    (2 min) Researchers have been battling with the question of how we can identify Coronavirus disease (COVID-19) cases efficiently, affordably and at scale. Recent work has shown how audio based approaches, which collect respiratory audio data (cough, breathing and voice) can be used for testing, however there is a lack of exploration of how biases and methodological decisions impact these tools' performance in practice. In this paper, we explore the realistic performance of audio-based digital testing of COVID-19. To investigate this, we collected a large crowdsourced respiratory audio dataset through a mobile app, alongside recent COVID-19 test result and symptoms intended as a ground truth. Within the collected dataset, we selected 5,240 samples from 2,478 participants and split them into different participant-independent sets for model development and validation. Among these, we controlled for potential confounding factors (such as demographics and language). The unbiased model takes features extracted from breathing, coughs, and voice signals as predictors and yields an AUC-ROC of 0.71 (95\% CI: 0.65$-$0.77). We further explore different unbalanced distributions to show how biases and participant splits affect performance. Finally, we discuss how the realistic model presented could be integrated in clinical practice to realize continuous, ubiquitous, sustainable and affordable testing at population scale.
    DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic Sound Event Localization and Detection. (arXiv:2106.15190v1 [eess.AS])
    (2 min) Sound event localization and detection consists of two subtasks which are sound event detection and direction-of-arrival estimation. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses magnitude or phase differences between microphones to estimate source directions. Therefore, it is often difficult to jointly train these two subtasks simultaneously. We propose a novel feature called spatial cue-augmented log-spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source direction-of-arrival. The feature includes multichannel log-spectrograms stacked along with the estimated direct-to-reverberant ratio and a normalized version of the principal eigenvector of the spatial covariance matrix at each time-frequency bin on the spectrograms. Experimental results on the DCASE 2021 dataset for sound event localization and detection with directional interference showed that the deep learning-based models trained on this new feature outperformed the DCASE challenge baseline by a large margin. We combined several models with slightly different architectures that were trained on the new feature to further improve the system performances for the DCASE sound event localization and detection challenge.
    Leveraging Static Models for Link Prediction in Temporal Knowledge Graphs. (arXiv:2106.15223v1 [cs.LG])
    (2 min) The inclusion of temporal scopes of facts in knowledge graph embedding (KGE) presents significant opportunities for improving the resulting embeddings, and consequently for increased performance in downstream applications. Yet, little research effort has focussed on this area and much of the carried out research reports only marginally improved results compared to models trained without temporal scopes (static models). Furthermore, rather than leveraging existing work on static models, they introduce new models specific to temporal knowledge graphs. We propose a novel perspective that takes advantage of the power of existing static embedding models by focussing effort on manipulating the data instead. Our method, SpliMe, draws inspiration from the field of signal processing and early work in graph embedding. We show that SpliMe competes with or outperforms the current state of the art in temporal KGE. Additionally, we uncover issues with the procedure currently used to assess the performance of static models on temporal graphs and introduce two ways to counteract them.
    Evading Adversarial Example Detection Defenses with Orthogonal Projected Gradient Descent. (arXiv:2106.15023v1 [cs.LG])
    (2 min) Evading adversarial example detection defenses requires finding adversarial examples that must simultaneously (a) be misclassified by the model and (b) be detected as non-adversarial. We find that existing attacks that attempt to satisfy multiple simultaneous constraints often over-optimize against one constraint at the cost of satisfying another. We introduce Orthogonal Projected Gradient Descent, an improved attack technique to generate adversarial examples that avoids this problem by orthogonalizing the gradients when running standard gradient-based attacks. We use our technique to evade four state-of-the-art detection defenses, reducing their accuracy to 0% while maintaining a 0% detection rate.
    Geometry-aware Transformer for molecular property prediction. (arXiv:2106.15516v1 [cs.LG])
    (2 min) Recently, graph neural networks (GNNs) have achieved remarkable performances for quantum mechanical problems. However, a graph convolution can only cover a localized region, and cannot capture long-range interactions of atoms. This behavior is contrary to theoretical interatomic potentials, which is a fundamental limitation of the spatial based GNNs. In this work, we propose a novel attention-based framework for molecular property prediction tasks. We represent a molecular conformation as a discrete atomic sequence combined by atom-atom distance attributes, named Geometry-aware Transformer (GeoT). In particular, we adopt a Transformer architecture, which has been widely used for sequential data. Our proposed model trains sequential representations of molecular graphs based on globally constructed attentions, maintaining all spatial arrangements of atom pairs. Our method does not suffer from cost intensive computations, such as angle calculations. The experimental results on several public benchmarks and visualization maps verified that keeping the long-range interatomic attributes can significantly improve the model predictability.
    Classification of Consumer Belief Statements From Social Media. (arXiv:2106.15498v1 [cs.LG])
    (2 min) Social media offer plenty of information to perform market research in order to meet the requirements of customers. One way how this research is conducted is that a domain expert gathers and categorizes user-generated content into a complex and fine-grained class structure. In many of such cases, little data meets complex annotations. It is not yet fully understood how this can be leveraged successfully for classification. We examine the classification accuracy of expert labels when used with a) many fine-grained classes and b) few abstract classes. For scenario b) we compare abstract class labels given by the domain expert as baseline and by automatic hierarchical clustering. We compare this to another baseline where the entire class structure is given by a completely unsupervised clustering approach. By doing so, this work can serve as an example of how complex expert annotations are potentially beneficial and can be utilized in the most optimal way for opinion mining in highly specific domains. By exploring across a range of techniques and experiments, we find that automated class abstraction approaches in particular the unsupervised approach performs remarkably well against domain expert baseline on text classification tasks. This has the potential to inspire opinion mining applications in order to support market researchers in practice and to inspire fine-grained automated content analysis on a large scale.
    Attaining entropy production and dissipation maps from Brownian movies via neural networks. (arXiv:2106.15108v1 [cond-mat.stat-mech])
    (2 min) Quantifying entropy production (EP) is essential to understand stochastic systems at mesoscopic scales, such as living organisms or biological assemblies. However, without tracking the relevant variables, it is challenging to figure out where and to what extent EP occurs from recorded time-series image data from experiments. Here, applying a convolutional neural network (CNN), a powerful tool for image processing, we develop an estimation method for EP through an unsupervised learning algorithm that calculates only from movies. Together with an attention map of the CNN's last layer, our method can not only quantify stochastic EP but also produce the spatiotemporal pattern of the EP (dissipation map). We show that our method accurately measures the EP and creates a dissipation map in two nonequilibrium systems, the bead-spring model and a network of elastic filaments. We further confirm high performance even with noisy, low spatial resolution data, and partially observed situations. Our method will provide a practical way to obtain dissipation maps and ultimately contribute to uncovering the nonequilibrium nature of complex systems.
    Certifiable Machine Unlearning for Linear Models. (arXiv:2106.15093v1 [cs.LG])
    (2 min) Machine unlearning is the task of updating machine learning (ML) models after a subset of the training data they were trained on is deleted. Methods for the task are desired to combine effectiveness and efficiency, i.e., they should effectively "unlearn" deleted data, but in a way that does not require excessive computation effort (e.g., a full retraining) for a small amount of deletions. Such a combination is typically achieved by tolerating some amount of approximation in the unlearning. In addition, laws and regulations in the spirit of "the right to be forgotten" have given rise to requirements for certifiability, i.e., the ability to demonstrate that the deleted data has indeed been unlearned by the ML model. In this paper, we present an experimental study of the three state-of-the-art approximate unlearning methods for linear models and demonstrate the trade-offs between efficiency, effectiveness and certifiability offered by each method. In implementing the study, we extend some of the existing works and describe a common ML pipeline to compare and evaluate the unlearning methods on six real-world datasets and a variety of settings. We provide insights into the effect of the quantity and distribution of the deleted data on ML models and the performance of each unlearning method in different settings. We also propose a practical online strategy to determine when the accumulated error from approximate unlearning is large enough to warrant a full retrain of the ML model.
    Joint Majorization-Minimization for Nonnegative Matrix Factorization with the $\beta$-divergence. (arXiv:2106.15214v1 [cs.LG])
    (2 min) This article proposes new multiplicative updates for nonnegative matrix factorization (NMF) with the $\beta$-divergence objective function. Our new updates are derived from a joint majorization-minimization (MM) scheme, in which an auxiliary function (a tight upper bound of the objective function) is built for the two factors jointly and minimized at each iteration. This is in contrast with the classic approach in which the factors are optimized alternately and a MM scheme is applied to each factor individually. Like the classic approach, our joint MM algorithm also results in multiplicative updates that are simple to implement. They however yield a significant drop of computation time (for equally good solutions), in particular for some $\beta$-divergences of important applicative interest, such as the squared Euclidean distance and the Kullback-Leibler or Itakura-Saito divergences. We report experimental results using diverse datasets: face images, audio spectrograms, hyperspectral data and song play counts. Depending on the value of $\beta$ and on the dataset, our joint MM approach yields a CPU time reduction of about $10\%$ to $78\%$ in comparison to the classic alternating scheme.
    Improving Transferability of Adversarial Patches on Face Recognition with Generative Models. (arXiv:2106.15058v1 [cs.CV])
    (2 min) Face recognition is greatly improved by deep convolutional neural networks (CNNs). Recently, these face recognition models have been used for identity authentication in security sensitive applications. However, deep CNNs are vulnerable to adversarial patches, which are physically realizable and stealthy, raising new security concerns on the real-world applications of these models. In this paper, we evaluate the robustness of face recognition models using adversarial patches based on transferability, where the attacker has limited accessibility to the target models. First, we extend the existing transfer-based attack techniques to generate transferable adversarial patches. However, we observe that the transferability is sensitive to initialization and degrades when the perturbation magnitude is large, indicating the overfitting to the substitute models. Second, we propose to regularize the adversarial patches on the low dimensional data manifold. The manifold is represented by generative models pre-trained on legitimate human face images. Using face-like features as adversarial perturbations through optimization on the manifold, we show that the gaps between the responses of substitute models and the target models dramatically decrease, exhibiting a better transferability. Extensive digital world experiments are conducted to demonstrate the superiority of the proposed method in the black-box setting. We apply the proposed method in the physical world as well.
    GraphPiece: Efficiently Generating High-Quality Molecular Graph with Substructures. (arXiv:2106.15098v1 [cs.LG])
    (2 min) Molecular graph generation is a fundamental but challenging task in various applications such as drug discovery and material science, which requires generating valid molecules with desired properties. Auto-regressive models, which usually construct graphs following sequential actions of adding nodes and edges at the atom-level, have made rapid progress in recent years. However, these atom-level models ignore high-frequency subgraphs that not only capture the regularities of atomic combination in molecules but also are often related to desired chemical properties. In this paper, we propose a method to automatically discover such common substructures, which we call {\em graph pieces}, from given molecular graphs. Based on graph pieces, we leverage a variational autoencoder to generate molecules in two phases: piece-level graph generation followed by bond completion. Experiments show that our graph piece variational autoencoder achieves better performance over state-of-the-art baselines on property optimization and constrained property optimization tasks with higher computational efficiency.
    Meta-learning for Matrix Factorization without Shared Rows or Columns. (arXiv:2106.15133v1 [stat.ML])
    (2 min) We propose a method that meta-learns a knowledge on matrix factorization from various matrices, and uses the knowledge for factorizing unseen matrices. The proposed method uses a neural network that takes a matrix as input, and generates prior distributions of factorized matrices of the given matrix. The neural network is meta-learned such that the expected imputation error is minimized when the factorized matrices are adapted to each matrix by a maximum a posteriori (MAP) estimation. We use a gradient descent method for the MAP estimation, which enables us to backpropagate the expected imputation error through the gradient descent steps for updating neural network parameters since each gradient descent step is written in a closed form and is differentiable. The proposed method can meta-learn from matrices even when their rows and columns are not shared, and their sizes are different from each other. In our experiments with three user-item rating datasets, we demonstrate that our proposed method can impute the missing values from a limited number of observations in unseen matrices after being trained with different matrices.
    Learning from Multiple Annotators by Incorporating Instance Features. (arXiv:2106.15146v1 [cs.LG])
    (2 min) Learning from multiple annotators aims to induce a high-quality classifier from training instances, where each of them is associated with a set of possibly noisy labels provided by multiple annotators under the influence of their varying abilities and own biases. In modeling the probability transition process from latent true labels to observed labels, most existing methods adopt class-level confusion matrices of annotators that observed labels do not depend on the instance features, just determined by the true labels. It may limit the performance that the classifier can achieve. In this work, we propose the noise transition matrix, which incorporates the influence of instance features on annotators' performance based on confusion matrices. Furthermore, we propose a simple yet effective learning framework, which consists of a classifier module and a noise transition matrix module in a unified neural network architecture. Experimental results demonstrate the superiority of our method in comparison with state-of-the-art methods.
    Early Mobility Recognition for Intensive Care Unit Patients Using Accelerometers. (arXiv:2106.15017v1 [cs.LG])
    (2 min) With the development of the Internet of Things(IoT) and Artificial Intelligence(AI) technologies, human activity recognition has enabled various applications, such as smart homes and assisted living. In this paper, we target a new healthcare application of human activity recognition, early mobility recognition for Intensive Care Unit(ICU) patients. Early mobility is essential for ICU patients who suffer from long-time immobilization. Our system includes accelerometer-based data collection from ICU patients and an AI model to recognize patients' early mobility. To improve the model accuracy and stability, we identify features that are insensitive to sensor orientations and propose a segment voting process that leverages a majority voting strategy to recognize each segment's activity. Our results show that our system improves model accuracy from 77.78\% to 81.86\% and reduces the model instability (standard deviation) from 16.69\% to 6.92\%, compared to the same AI model without our feature engineering and segment voting process.
    MuViS: Online MU-MIMO Grouping for Multi-User Applications Over Commodity WiFi. (arXiv:2106.15262v1 [cs.NI])
    (2 min) Over the last decade, the bandwidth expansion and MU-MIMO spectral efficiency have promised to increase data throughput by allowing concurrent communication between one Access Point and multiple users. However, we are still a long way from enjoying such MU-MIMO MAC protocol improvements for bandwidth hungry applications such as video streaming in practical WiFi network settings due to heterogeneous channel conditions and devices, unreliable transmissions, and lack of useful feedback exchange among the lower and upper layers' requirements. This paper introduces MuViS, a novel dual-phase optimization framework that proposes a Quality of Experience (QoE) aware MU-MIMO optimization for multi-user video streaming over IEEE 802.11ac. MuViS first employs reinforcement learning to optimize the MU-MIMO user group and mode selection for users based on their PHY/MAC layer characteristics. The video bitrate is then optimized based on the user's mode (Multi-User (MU) or Single-User (SU)). We present our design and its evaluation on smartphones and laptops using 802.11ac WiFi. Our experimental results in various indoor environments and configurations show a scalable framework that can support a large number of users with streaming at high video rates and satisfying QoE requirements.
    Do Not Deceive Your Employer with a Virtual Background: A Video Conferencing Manipulation-Detection System. (arXiv:2106.15130v1 [cs.CR])
    (2 min) The last-generation video conferencing software allows users to utilize a virtual background to conceal their personal environment due to privacy concerns, especially in official meetings with other employers. On the other hand, users maybe want to fool people in the meeting by considering the virtual background to conceal where they are. In this case, developing tools to understand the virtual background utilize for fooling people in meeting plays an important role. Besides, such detectors must prove robust against different kinds of attacks since a malicious user can fool the detector by applying a set of adversarial editing steps on the video to conceal any revealing footprint. In this paper, we study the feasibility of an efficient tool to detect whether a videoconferencing user background is real. In particular, we provide the first tool which computes pixel co-occurrences matrices and uses them to search for inconsistencies among spectral and spatial bands. Our experiments confirm that cross co-occurrences matrices improve the robustness of the detector against different kinds of attacks. This work's performance is especially noteworthy with regard to color SPAM features. Moreover, the performance especially is significant with regard to robustness versus post-processing, like geometric transformations, filtering, contrast enhancement, and JPEG compression with different quality factors.
    Sharp Lower Bounds on the Approximation Rate of Shallow Neural Networks. (arXiv:2106.14997v1 [stat.ML])
    (2 min) We consider the approximation rates of shallow neural networks with respect to the variation norm. Upper bounds on these rates have been established for sigmoidal and ReLU activation functions, but it has remained an important open problem whether these rates are sharp. In this article, we provide a solution to this problem by proving sharp lower bounds on the approximation rates for shallow neural networks, which are obtained by lower bounding the $L^2$-metric entropy of the convex hull of the neural network basis functions. In addition, our methods also give sharp lower bounds on the Kolmogorov $n$-widths of this convex hull, which show that the variation spaces corresponding to shallow neural networks cannot be efficiently approximated by linear methods. These lower bounds apply to both sigmoidal activation functions with bounded variation and to activation functions which are a power of the ReLU. Our results also quantify how much stronger the Barron spectral norm is than the variation norm and, combined with previous results, give the asymptotics of the $L^\infty$-metric entropy up to logarithmic factors in the case of the ReLU activation function.
    On component interactions in two-stage recommender systems. (arXiv:2106.14979v1 [cs.IR])
    (2 min) Thanks to their scalability, two-stage recommenders are used by many of today's largest online platforms, including YouTube, LinkedIn, and Pinterest. These systems produce recommendations in two steps: (i) multiple nominators -- tuned for low prediction latency -- preselect a small subset of candidates from the whole item pool; (ii)~a slower but more accurate ranker further narrows down the nominated items, and serves to the user. Despite their popularity, the literature on two-stage recommenders is relatively scarce, and the algorithms are often treated as the sum of their parts. Such treatment presupposes that the two-stage performance is explained by the behavior of individual components if they were deployed independently. This is not the case: using synthetic and real-world data, we demonstrate that interactions between the ranker and the nominators substantially affect the overall performance. Motivated by these findings, we derive a generalization lower bound which shows that careful choice of each nominator's training set is sometimes the only difference between a poor and an optimal two-stage recommender. Since searching for a good choice manually is difficult, we learn one instead. In particular, using a Mixture-of-Experts approach, we train the nominators (experts) to specialize on different subsets of the item pool. This significantly improves performance.
    Adversarial Robustness of Streaming Algorithms through Importance Sampling. (arXiv:2106.14952v1 [cs.LG])
    (2 min) In this paper, we introduce adversarially robust streaming algorithms for central machine learning and algorithmic tasks, such as regression and clustering, as well as their more general counterparts, subspace embedding, low-rank approximation, and coreset construction. For regression and other numerical linear algebra related tasks, we consider the row arrival streaming model. Our results are based on a simple, but powerful, observation that many importance sampling-based algorithms give rise to adversarial robustness which is in contrast to sketching based algorithms, which are very prevalent in the streaming literature but suffer from adversarial attacks. In addition, we show that the well-known merge and reduce paradigm in streaming is adversarially robust. Since the merge and reduce paradigm allows coreset constructions in the streaming setting, we thus obtain robust algorithms for $k$-means, $k$-median, $k$-center, Bregman clustering, projective clustering, principal component analysis (PCA) and non-negative matrix factorization. To the best of our knowledge, these are the first adversarially robust results for these problems yet require no new algorithmic implementations. Finally, we empirically confirm the robustness of our algorithms on various adversarial attacks and demonstrate that by contrast, some common existing algorithms are not robust. (Abstract shortened to meet arXiv limits)
    Feature selection for intrusion detection systems. (arXiv:2106.14941v1 [cs.CR])
    (2 min) In this paper, we analyze existing feature selection methods to identify the key elements of network traffic data that allow intrusion detection. In addition, we propose a new feature selection method that addresses the challenge of considering continuous input features and discrete target values. We show that the proposed method performs well against the benchmark selection methods. We use our findings to develop a highly effective machine learning-based detection systems that achieves 99.9% accuracy in distinguishing between DDoS and benign signals. We believe that our results can be useful to experts who are interested in designing and building automated intrusion detection systems.
    Fast Training of Neural Lumigraph Representations using Meta Learning. (arXiv:2106.14942v1 [cs.CV])
    (2 min) Novel view synthesis is a long-standing problem in machine learning and computer vision. Significant progress has recently been made in developing neural scene representations and rendering techniques that synthesize photorealistic images from arbitrary views. These representations, however, are extremely slow to train and often also slow to render. Inspired by neural variants of image-based rendering, we develop a new neural rendering approach with the goal of quickly learning a high-quality representation which can also be rendered in real-time. Our approach, MetaNLR++, accomplishes this by using a unique combination of a neural shape representation and 2D CNN-based image feature extraction, aggregation, and re-projection. To push representation convergence times down to minutes, we leverage meta learning to learn neural shape and image feature priors which accelerate training. The optimized shape and image features can then be extracted using traditional graphics techniques and rendered in real time. We show that MetaNLR++ achieves similar or better novel view synthesis results in a fraction of the time that competing methods require.

2021-06-29

  • cs.CL updates on arXiv.org

    Integrating topic modeling and word embedding to characterize violent deaths. (arXiv:2106.14365v1 [cs.CL])
    (2 min) There is an escalating need for methods to identify latent patterns in text data from many domains. We introduce a new method to identify topics in a corpus and represent documents as topic sequences. Discourse Atom Topic Modeling draws on advances in theoretical machine learning to integrate topic modeling and word embedding, capitalizing on the distinct capabilities of each. We first identify a set of vectors ("discourse atoms") that provide a sparse representation of an embedding space. Atom vectors can be interpreted as latent topics: Through a generative model, atoms map onto distributions over words; one can also infer the topic that generated a sequence of words. We illustrate our method with a prominent example of underutilized text: the U.S. National Violent Death Reporting System (NVDRS). The NVDRS summarizes violent death incidents with structured variables and unstructured narratives. We identify 225 latent topics in the narratives (e.g., preparation for death and physical aggression); many of these topics are not captured by existing structured variables. Motivated by known patterns in suicide and homicide by gender, and recent research on gender biases in semantic space, we identify the gender bias of our topics (e.g., a topic about pain medication is feminine). We then compare the gender bias of topics to their prevalence in narratives of female versus male victims. Results provide a detailed quantitative picture of reporting about lethal violence and its gendered nature. Our method offers a flexible and broadly applicable approach to model topics in text data.
    Sparsely Overlapped Speech Training in the Time Domain: Joint Learning of Target Speech Separation and Personal VAD Benefits. (arXiv:2106.14371v1 [cs.SD])
    (2 min) Target speech separation is the process of filtering a certain speaker's voice out of speech mixtures according to the additional speaker identity information provided. Recent works have made considerable improvement by processing signals in the time domain directly. The majority of them take fully overlapped speech mixtures for training. However, since most real-life conversations occur randomly and are sparsely overlapped, we argue that training with different overlap ratio data benefits. To do so, an unavoidable problem is that the popularly used SI-SNR loss has no definition for silent sources. This paper proposes the weighted SI-SNR loss, together with the joint learning of target speech separation and personal VAD. The weighted SI-SNR loss imposes a weight factor that is proportional to the target speaker's duration and returns zero when the target speaker is absent. Meanwhile, the personal VAD generates masks and sets non-target speech to silence. Experiments show that our proposed method outperforms the baseline by 1.73 dB in terms of SDR on fully overlapped speech, as well as by 4.17 dB and 0.9 dB on sparsely overlapped speech of clean and noisy conditions. Besides, with slight degradation in performance, our model could reduce the time costs in inference.
    A Closer Look at How Fine-tuning Changes BERT. (arXiv:2106.14282v1 [cs.CL])
    (2 min) Given the prevalence of pre-trained contextualized representations in today's NLP, there have been several efforts to understand what information such representations contain. A common strategy to use such representations is to fine-tune them for an end task. However, how fine-tuning for a task changes the underlying space is less studied. In this work, we study the English BERT family and use two probing techniques to analyze how fine-tuning changes the space. Our experiments reveal that fine-tuning improves performance because it pushes points associated with a label away from other labels. By comparing the representations before and after fine-tuning, we also discover that fine-tuning does not change the representations arbitrarily; instead, it adjusts the representations to downstream tasks while preserving the original structure. Finally, using carefully constructed experiments, we show that fine-tuning can encode training sets in a representation, suggesting an overfitting problem of a new kind.
    LRG at SemEval-2021 Task 4: Improving Reading Comprehension with Abstract Words using Augmentation, Linguistic Features and Voting. (arXiv:2102.12255v2 [cs.CL] UPDATED)
    (2 min) In this article, we present our methodologies for SemEval-2021 Task-4: Reading Comprehension of Abstract Meaning. Given a fill-in-the-blank-type question and a corresponding context, the task is to predict the most suitable word from a list of 5 options. There are three sub-tasks within this task: Imperceptibility (subtask-I), Non-Specificity (subtask-II), and Intersection (subtask-III). We use encoders of transformers-based models pre-trained on the masked language modelling (MLM) task to build our Fill-in-the-blank (FitB) models. Moreover, to model imperceptibility, we define certain linguistic features, and to model non-specificity, we leverage information from hypernyms and hyponyms provided by a lexical database. Specifically, for non-specificity, we try out augmentation techniques, and other statistical techniques. We also propose variants, namely Chunk Voting and Max Context, to take care of input length restrictions for BERT, etc. Additionally, we perform a thorough ablation study, and use Integrated Gradients to explain our predictions on a few samples. Our best submissions achieve accuracies of 75.31% and 77.84%, on the test sets for subtask-I and subtask-II, respectively. For subtask-III, we achieve accuracies of 65.64% and 62.27%.
    Persian Causality Corpus (PerCause) and the Causality Detection Benchmark. (arXiv:2106.14165v1 [cs.CL])
    (2 min) Recognizing causal elements and causal relations in text is one of the challenging issues in natural language processing; specifically, in low resource languages such as Persian. In this research we prepare a causality human annotated corpus for the Persian language which consists of 4446 sentences and 5128 causal relations and three labels of cause, effect and causal mark -- if possibl -- are specified for each relation. We have used this corpus to train a system for detecting causal elements boundaries. Also, we present a causality detection benchmark for three machine learning methods and two deep learning systems based on this corpus. Performance evaluations indicate that our best total result is obtained through CRF classifier which has F-measure of 0.76 and the best accuracy obtained through Bi-LSTM-CRF deep learning method with Accuracy equal to %91.4.
    NLRG at SemEval-2021 Task 5: Toxic Spans Detection Leveraging BERT-based Token Classification and Span Prediction Techniques. (arXiv:2102.12254v2 [cs.CL] UPDATED)
    (2 min) Toxicity detection of text has been a popular NLP task in the recent years. In SemEval-2021 Task-5 Toxic Spans Detection, the focus is on detecting toxic spans within passages. Most state-of-the-art span detection approaches employ various techniques, each of which can be broadly classified into Token Classification or Span Prediction approaches. In our paper, we explore simple versions of both of these approaches and their performance on the task. Specifically, we use BERT-based models -- BERT, RoBERTa, and SpanBERT for both approaches. We also combine these approaches and modify them to bring improvements for Toxic Spans prediction. To this end, we investigate results on four hybrid approaches -- Multi-Span, Span+Token, LSTM-CRF, and a combination of predicted offsets using union/intersection. Additionally, we perform a thorough ablative analysis and analyze our observed results. Our best submission -- a combination of SpanBERT Span Predictor and RoBERTa Token Classifier predictions -- achieves an F1 score of 0.6753 on the test set. Our best post-eval F1 score is 0.6895 on intersection of predicted offsets from top-3 RoBERTa Token Classification checkpoints. These approaches improve the performance by 3% on average than those of the shared baseline models -- RNNSL and SpaCy NER.
    MiniLMv2: Multi-Head Self-Attention Relation Distillation for Compressing Pretrained Transformers. (arXiv:2012.15828v2 [cs.CL] UPDATED)
    (2 min) We generalize deep self-attention distillation in MiniLM (Wang et al., 2020) by only using self-attention relation distillation for task-agnostic compression of pretrained Transformers. In particular, we define multi-head self-attention relations as scaled dot-product between the pairs of query, key, and value vectors within each self-attention module. Then we employ the above relational knowledge to train the student model. Besides its simplicity and unified principle, more favorably, there is no restriction in terms of the number of student's attention heads, while most previous work has to guarantee the same head number between teacher and student. Moreover, the fine-grained self-attention relations tend to fully exploit the interaction knowledge learned by Transformer. In addition, we thoroughly examine the layer selection strategy for teacher models, rather than just relying on the last layer as in MiniLM. We conduct extensive experiments on compressing both monolingual and multilingual pretrained models. Experimental results demonstrate that our models distilled from base-size and large-size teachers (BERT, RoBERTa and XLM-R) outperform the state-of-the-art.
    IITP at AILA 2019: System Report for Artificial Intelligence for Legal Assistance Shared Task. (arXiv:2105.11347v2 [cs.CL] UPDATED)
    (2 min) In this article, we present a description of our systems as a part of our participation in the shared task namely Artificial Intelligence for Legal Assistance (AILA 2019). This is an integral event of Forum for Information Retrieval Evaluation-2019. The outcomes of this track would be helpful for the automation of the working process of the Indian Judiciary System. The manual working procedures and documentation at any level (from lower to higher court) of the judiciary system are very complex in nature. The systems produced as a part of this track would assist the law practitioners. It would be helpful for common men too. This kind of track also opens the path of research of Natural Language Processing (NLP) in the judicial domain. This track defined two problems such as Task 1: Identifying relevant prior cases for a given situation and Task 2: Identifying the most relevant statutes for a given situation. We tackled both of them. Our proposed approaches are based on BM25 and Doc2Vec. As per the results declared by the task organizers, we are in 3rd and a modest position in Task 1 and Task 2 respectively.
    Visual Conceptual Blending with Large-scale Language and Vision Models. (arXiv:2106.14127v1 [cs.CL])
    (2 min) We ask the question: to what extent can recent large-scale language and image generation models blend visual concepts? Given an arbitrary object, we identify a relevant object and generate a single-sentence description of the blend of the two using a language model. We then generate a visual depiction of the blend using a text-based image generation model. Quantitative and qualitative evaluations demonstrate the superiority of language models over classical methods for conceptual blending, and of recent large-scale image generation models over prior models for the visual depiction.
    Efficient Dialogue State Tracking by Masked Hierarchical Transformer. (arXiv:2106.14433v1 [cs.CL])
    (2 min) This paper describes our approach to DSTC 9 Track 2: Cross-lingual Multi-domain Dialog State Tracking, the task goal is to build a Cross-lingual dialog state tracker with a training set in rich resource language and a testing set in low resource language. We formulate a method for joint learning of slot operation classification task and state tracking task respectively. Furthermore, we design a novel mask mechanism for fusing contextual information about dialogue, the results show the proposed model achieves excellent performance on DSTC Challenge II with a joint accuracy of 62.37% and 23.96% in MultiWOZ(en - zh) dataset and CrossWOZ(zh - en) dataset, respectively.
    End-to-End Speech Translation with Pre-trained Models and Adapters: UPC at IWSLT 2021. (arXiv:2105.04512v2 [cs.CL] UPDATED)
    (2 min) This paper describes the submission to the IWSLT 2021 offline speech translation task by the UPC Machine Translation group. The task consists of building a system capable of translating English audio recordings extracted from TED talks into German text. Submitted systems can be either cascade or end-to-end and use a custom or given segmentation. Our submission is an end-to-end speech translation system, which combines pre-trained models (Wav2Vec 2.0 and mBART) with coupling modules between the encoder and decoder, and uses an efficient fine-tuning technique, which trains only 20% of its total parameters. We show that adding an Adapter to the system and pre-training it, can increase the convergence speed and the final result, with which we achieve a BLEU score of 27.3 on the MuST-C test set. Our final model is an ensemble that obtains 28.22 BLEU score on the same set. Our submission also uses a custom segmentation algorithm that employs pre-trained Wav2Vec 2.0 for identifying periods of untranscribable text and can bring improvements of 2.5 to 3 BLEU score on the IWSLT 2019 test set, as compared to the result with the given segmentation.
    PhyCRNet: Physics-informed Convolutional-Recurrent Network for Solving Spatiotemporal PDEs. (arXiv:2106.14103v1 [cs.LG])
    (2 min) Partial differential equations (PDEs) play a fundamental role in modeling and simulating problems across a wide range of disciplines. Recent advances in deep learning have shown the great potential of physics-informed neural networks (PINNs) to solve PDEs as a basis for data-driven modeling and inverse analysis. However, the majority of existing PINN methods, based on fully-connected NNs, pose intrinsic limitations to low-dimensional spatiotemporal parameterizations. Moreover, since the initial/boundary conditions (I/BCs) are softly imposed via penalty, the solution quality heavily relies on hyperparameter tuning. To this end, we propose the novel physics-informed convolutional-recurrent learning architectures (PhyCRNet and PhyCRNet-s) for solving PDEs without any labeled data. Specifically, an encoder-decoder convolutional long short-term memory network is proposed for low-dimensional spatial feature extraction and temporal evolution learning. The loss function is defined as the aggregated discretized PDE residuals, while the I/BCs are hard-encoded in the network to ensure forcible satisfaction (e.g., periodic boundary padding). The networks are further enhanced by autoregressive and residual connections that explicitly simulate time marching. The performance of our proposed methods has been assessed by solving three nonlinear PDEs (e.g., 2D Burgers' equations, the $\lambda$-$\omega$ and FitzHugh Nagumo reaction-diffusion equations), and compared against the start-of-the-art baseline algorithms. The numerical results demonstrate the superiority of our proposed methodology in the context of solution accuracy, extrapolability and generalizability.
    RadGraph: Extracting Clinical Entities and Relations from Radiology Reports. (arXiv:2106.14463v1 [cs.CL])
    (2 min) Extracting structured clinical information from free-text radiology reports can enable the use of radiology report information for a variety of critical healthcare applications. In our work, we present RadGraph, a dataset of entities and relations in full-text chest X-ray radiology reports based on a novel information extraction schema we designed to structure radiology reports. We release a development dataset, which contains board-certified radiologist annotations for 500 radiology reports from the MIMIC-CXR dataset (14,579 entities and 10,889 relations), and a test dataset, which contains two independent sets of board-certified radiologist annotations for 100 radiology reports split equally across the MIMIC-CXR and CheXpert datasets. Using these datasets, we train and test a deep learning model, RadGraph Benchmark, that achieves a micro F1 of 0.82 and 0.73 on relation extraction on the MIMIC-CXR and CheXpert test sets respectively. Additionally, we release an inference dataset, which contains annotations automatically generated by RadGraph Benchmark across 220,763 MIMIC-CXR reports (around 6 million entities and 4 million relations) and 500 CheXpert reports (13,783 entities and 9,908 relations) with mappings to associated chest radiographs. Our freely available dataset can facilitate a wide range of research in medical natural language processing, as well as computer vision and multi-modal learning when linked to chest radiographs.
    A Training-free and Reference-free Summarization Evaluation Metric via Centrality-weighted Relevance and Self-referenced Redundancy. (arXiv:2106.13945v1 [cs.CL])
    (2 min) In recent years, reference-based and supervised summarization evaluation metrics have been widely explored. However, collecting human-annotated references and ratings are costly and time-consuming. To avoid these limitations, we propose a training-free and reference-free summarization evaluation metric. Our metric consists of a centrality-weighted relevance score and a self-referenced redundancy score. The relevance score is computed between the pseudo reference built from the source document and the given summary, where the pseudo reference content is weighted by the sentence centrality to provide importance guidance. Besides an $F_1$-based relevance score, we also design an $F_\beta$-based variant that pays more attention to the recall score. As for the redundancy score of the summary, we compute a self-masked similarity score with the summary itself to evaluate the redundant information in the summary. Finally, we combine the relevance and redundancy scores to produce the final evaluation score of the given summary. Extensive experiments show that our methods can significantly outperform existing methods on both multi-document and single-document summarization evaluation.
    Perspective-corrected Spatial Referring Expression Generation for Human-Robot Interaction. (arXiv:2104.01558v2 [cs.RO] UPDATED)
    (2 min) Intelligent robots designed to interact with humans in real scenarios need to be able to refer to entities actively by natural language. In spatial referring expression generation, the ambiguity is unavoidable due to the diversity of reference frames, which will lead to an understanding gap between humans and robots. To narrow this gap, in this paper, we propose a novel perspective-corrected spatial referring expression generation (PcSREG) approach for human-robot interaction by considering the selection of reference frames. The task of referring expression generation is simplified into the process of generating diverse spatial relation units. First, we pick out all landmarks in these spatial relation units according to the entropy of preference and allow its updating through a stack model. Then all possible referring expressions are generated according to different reference frame strategies. Finally, we evaluate every expression using a probabilistic referring expression resolution model and find the best expression that satisfies both of the appropriateness and effectiveness. We implement the proposed approach on a robot system and empirical experiments show that our approach can generate more effective spatial referring expressions for practical applications.
    PGST: a Polyglot Gender Style Transfer method. (arXiv:2009.01040v2 [cs.CL] UPDATED)
    (2 min) Recent developments in Text Style Transfer have led this field to be more highlighted than ever. The task of transferring an input's style to another is accompanied by plenty of challenges (e.g., fluency and content preservation) that need to be taken care of. In this research, we introduce PGST, a novel polyglot text style transfer approach in the gender domain, composed of different constitutive elements. In contrast to prior studies, it is feasible to apply a style transfer method in multiple languages by fulfilling our method's predefined elements. We have proceeded with a pre-trained word embedding for token replacement purposes, a character-based token classifier for gender exchange purposes, and a beam search algorithm for extracting the most fluent combination. Since different approaches are introduced in our research, we determine a trade-off value for evaluating different models' success in faking our gender identification model with transferred text. To demonstrate our method's multilingual applicability, we applied our method on both English and Persian corpora and ended up defeating our proposed gender identification model by 45.6% and 39.2%, respectively. While this research's focus is not limited to a specific language, our obtained evaluation results are highly competitive in an analogy among English state of the art methods.
    A Knowledge-Grounded Dialog System Based on Pre-Trained Language Models. (arXiv:2106.14444v1 [cs.CL])
    (2 min) We present a knowledge-grounded dialog system developed for the ninth Dialog System Technology Challenge (DSTC9) Track 1 - Beyond Domain APIs: Task-oriented Conversational Modeling with Unstructured Knowledge Access. We leverage transfer learning with existing language models to accomplish the tasks in this challenge track. Specifically, we divided the task into four sub-tasks and fine-tuned several Transformer models on each of the sub-tasks. We made additional changes that yielded gains in both performance and efficiency, including the combination of the model with traditional entity-matching techniques, and the addition of a pointer network to the output layer of the language model.
    Enhancing the Generalization for Intent Classification and Out-of-Domain Detection in SLU. (arXiv:2106.14464v1 [cs.CL])
    (2 min) Intent classification is a major task in spoken language understanding (SLU). Since most models are built with pre-collected in-domain (IND) training utterances, their ability to detect unsupported out-of-domain (OOD) utterances has a critical effect in practical use. Recent works have shown that using extra data and labels can improve the OOD detection performance, yet it could be costly to collect such data. This paper proposes to train a model with only IND data while supporting both IND intent classification and OOD detection. Our method designs a novel domain-regularized module (DRM) to reduce the overconfident phenomenon of a vanilla classifier, achieving a better generalization in both cases. Besides, DRM can be used as a drop-in replacement for the last layer in any neural network-based intent classifier, providing a low-cost strategy for a significant improvement. The evaluation on four datasets shows that our method built on BERT and RoBERTa models achieves state-of-the-art performance against existing approaches and the strong baselines we created for the comparisons.
    Word2Box: Learning Word Representation Using Box Embeddings. (arXiv:2106.14361v1 [cs.CL])
    (2 min) Learning vector representations for words is one of the most fundamental topics in NLP, capable of capturing syntactic and semantic relationships useful in a variety of downstream NLP tasks. Vector representations can be limiting, however, in that typical scoring such as dot product similarity intertwines position and magnitude of the vector in space. Exciting innovations in the space of representation learning have proposed alternative fundamental representations, such as distributions, hyperbolic vectors, or regions. Our model, Word2Box, takes a region-based approach to the problem of word representation, representing words as $n$-dimensional rectangles. These representations encode position and breadth independently and provide additional geometric operations such as intersection and containment which allow them to model co-occurrence patterns vectors struggle with. We demonstrate improved performance on various word similarity tasks, particularly on less common words, and perform a qualitative analysis exploring the additional unique expressivity provided by Word2Box.
    Current Landscape of the Russian Sentiment Corpora. (arXiv:2106.14434v1 [cs.CL])
    (2 min) Currently, there are more than a dozen Russian-language corpora for sentiment analysis, differing in the source of the texts, domain, size, number and ratio of sentiment classes, and annotation method. This work examines publicly available Russian-language corpora, presents their qualitative and quantitative characteristics, which make it possible to get an idea of the current landscape of the corpora for sentiment analysis. The ranking of corpora by annotation quality is proposed, which can be useful when choosing corpora for training and testing. The influence of the training dataset on the performance of sentiment analysis is investigated based on the use of the deep neural network model BERT. The experiments with review corpora allow us to conclude that on average the quality of models increases with an increase in the number of training corpora. For the first time, quality scores were obtained for the corpus of reviews of ROMIP seminars based on the BERT model. Also, the study proposes the task of the building a universal model for sentiment analysis.
    Traditional Machine Learning and Deep Learning Models for Argumentation Mining in Russian Texts. (arXiv:2106.14438v1 [cs.CL])
    (2 min) Argumentation mining is a field of computational linguistics that is devoted to extracting from texts and classifying arguments and relations between them, as well as constructing an argumentative structure. A significant obstacle to research in this area for the Russian language is the lack of annotated Russian-language text corpora. This article explores the possibility of improving the quality of argumentation mining using the extension of the Russian-language version of the Argumentative Microtext Corpus (ArgMicro) based on the machine translation of the Persuasive Essays Corpus (PersEssays). To make it possible to use these two corpora combined, we propose a Joint Argument Annotation Scheme based on the schemes used in ArgMicro and PersEssays. We solve the problem of classifying argumentative discourse units (ADUs) into two classes - "pro" ("for") and "opp" ("against") using traditional machine learning techniques (SVM, Bagging and XGBoost) and a deep neural network (BERT model). An ensemble of XGBoost and BERT models was proposed, which showed the highest performance of ADUs classification for both corpora.
    A Span-Based Model for Joint Overlapped and Discontinuous Named Entity Recognition. (arXiv:2106.14373v1 [cs.CL])
    (2 min) Research on overlapped and discontinuous named entity recognition (NER) has received increasing attention. The majority of previous work focuses on either overlapped or discontinuous entities. In this paper, we propose a novel span-based model that can recognize both overlapped and discontinuous entities jointly. The model includes two major steps. First, entity fragments are recognized by traversing over all possible text spans, thus, overlapped entities can be recognized. Second, we perform relation classification to judge whether a given pair of entity fragments to be overlapping or succession. In this way, we can recognize not only discontinuous entities, and meanwhile doubly check the overlapped entities. As a whole, our model can be regarded as a relation extraction paradigm essentially. Experimental results on multiple benchmark datasets (i.e., CLEF, GENIA and ACE05) show that our model is highly competitive for overlapped and discontinuous NER.
    Draw Me a Flower: Grounding Formal Abstract Structures Stated in Informal Natural Language. (arXiv:2106.14321v1 [cs.CL])
    (2 min) Forming and interpreting abstraction is a core process in human communication. In particular, when giving and performing complex instructions stated in natural language (NL), people may naturally evoke abstract constructs such as objects, loops, conditions and functions to convey their intentions in an efficient and precise way. Yet, interpreting and grounding abstraction stated in NL has not been systematically studied in NLP/AI. To elicit naturally-occurring abstractions in NL we develop the Hexagons referential game, where players describe increasingly complex images on a two-dimensional Hexagons board, and other players need to follow these instructions to recreate the images. Using this game we collected the Hexagons dataset, which consists of 164 images and over 3000 naturally-occurring instructions, rich with diverse abstractions. Results of our baseline models on an instruction-to-execution task derived from the Hexagons dataset confirm that higher-level abstractions in NL are indeed more challenging for current systems to process. Thus, this dataset exposes a new and challenging dimension for grounded semantic parsing, and we propose it for the community as a future benchmark to explore more sophisticated and high-level communication within NLP applications.
    SymbolicGPT: A Generative Transformer Model for Symbolic Regression. (arXiv:2106.14131v1 [cs.LG])
    (2 min) Symbolic regression is the task of identifying a mathematical expression that best fits a provided dataset of input and output values. Due to the richness of the space of mathematical expressions, symbolic regression is generally a challenging problem. While conventional approaches based on genetic evolution algorithms have been used for decades, deep learning-based methods are relatively new and an active research area. In this work, we present SymbolicGPT, a novel transformer-based language model for symbolic regression. This model exploits the advantages of probabilistic language models like GPT, including strength in performance and flexibility. Through comprehensive experiments, we show that our model performs strongly compared to competing models with respect to the accuracy, running time, and data efficiency.
    A Case Study of LLVM-Based Analysis for Optimizing SIMD Code Generation. (arXiv:2106.14332v1 [cs.DC])
    (2 min) This paper presents a methodology for using LLVM-based tools to tune the DCA++ (dynamical clusterapproximation) application that targets the new ARM A64FX processor. The goal is to describethe changes required for the new architecture and generate efficient single instruction/multiple data(SIMD) instructions that target the new Scalable Vector Extension instruction set. During manualtuning, the authors used the LLVM tools to improve code parallelization by using OpenMP SIMD,refactored the code and applied transformation that enabled SIMD optimizations, and ensured thatthe correct libraries were used to achieve optimal performance. By applying these code changes, codespeed was increased by 1.98X and 78 GFlops were achieved on the A64FX processor. The authorsaim to automatize parts of the efforts in the OpenMP Advisor tool, which is built on top of existingand newly introduced LLVM tooling.
    Effective Cascade Dual-Decoder Model for Joint Entity and Relation Extraction. (arXiv:2106.14163v1 [cs.CL])
    (2 min) Extracting relational triples from texts is a fundamental task in knowledge graph construction. The popular way of existing methods is to jointly extract entities and relations using a single model, which often suffers from the overlapping triple problem. That is, there are multiple relational triples that share the same entities within one sentence. In this work, we propose an effective cascade dual-decoder approach to extract overlapping relational triples, which includes a text-specific relation decoder and a relation-corresponded entity decoder. Our approach is straightforward: the text-specific relation decoder detects relations from a sentence according to its text semantics and treats them as extra features to guide the entity extraction; for each extracted relation, which is with trainable embedding, the relation-corresponded entity decoder detects the corresponding head and tail entities using a span-based tagging scheme. In this way, the overlapping triple problem is tackled naturally. Experiments on two public datasets demonstrate that our proposed approach outperforms state-of-the-art methods and achieves better F1 scores under the strict evaluation metric. Our implementation is available at https://github.com/prastunlp/DualDec.
    Political Ideology and Polarization of Policy Positions: A Multi-dimensional Approach. (arXiv:2106.14387v1 [cs.CL])
    (2 min) Analyzing political ideology and polarization is of critical importance in advancing our understanding of the political context in society. Recent research has made great strides towards understanding the ideological bias (i.e., stance) of news media along a left-right spectrum. In this work, we take a novel approach and study the ideology of the policy under discussion teasing apart the nuanced co-existence of stance and ideology. Aligned with the theoretical accounts in political science, we treat ideology as a multi-dimensional construct, and introduce the first diachronic dataset of news articles whose political ideology under discussion is annotated by trained political scientists and linguists at the paragraph-level. We showcase that this framework enables quantitative analysis of polarization, a temporal, multifaceted measure of ideological distance. We further present baseline models for ideology prediction.
    PeCoQ: A Dataset for Persian Complex Question Answering over Knowledge Graph. (arXiv:2106.14167v1 [cs.CL])
    (2 min) Question answering systems may find the answers to users' questions from either unstructured texts or structured data such as knowledge graphs. Answering questions using supervised learning approaches including deep learning models need large training datasets. In recent years, some datasets have been presented for the task of Question answering over knowledge graphs, which is the focus of this paper. Although many datasets in English were proposed, there have been a few question-answering datasets in Persian. This paper introduces \textit{PeCoQ}, a dataset for Persian question answering. This dataset contains 10,000 complex questions and answers extracted from the Persian knowledge graph, FarsBase. For each question, the SPARQL query and two paraphrases that were written by linguists are provided as well. There are different types of complexities in the dataset, such as multi-relation, multi-entity, ordinal, and temporal constraints. In this paper, we discuss the dataset's characteristics and describe our methodology for building it.
    Analyzing Research Trends in Inorganic Materials Literature Using NLP. (arXiv:2106.14157v1 [cs.CL])
    (2 min) In the field of inorganic materials science, there is a growing demand to extract knowledge such as physical properties and synthesis processes of materials by machine-reading a large number of papers. This is because materials researchers refer to many papers in order to come up with promising terms of experiments for material synthesis. However, there are only a few systems that can extract material names and their properties. This study proposes a large-scale natural language processing (NLP) pipeline for extracting material names and properties from materials science literature to enable the search and retrieval of results in materials science. Therefore, we propose a label definition for extracting material names and properties and accordingly build a corpus containing 836 annotated paragraphs extracted from 301 papers for training a named entity recognition (NER) model. Experimental results demonstrate the utility of this NER model; it achieves successful extraction with a micro-F1 score of 78.1%. To demonstrate the efficacy of our approach, we present a thorough evaluation on a real-world automatically annotated corpus by applying our trained NER model to 12,895 materials science papers. We analyze the trend in materials science by visualizing the outputs of the NLP pipeline. For example, the country-by-year analysis indicates that in recent years, the number of papers on "MoS2," a material used in perovskite solar cells, has been increasing rapidly in China but decreasing in the United States. Further, according to the conditions-by-year analysis, the processing temperature of the catalyst material "PEDOT:PSS" is shifting below 200 degree, and the number of reports with a processing time exceeding 5 h is increasing slightly.
    KGRefiner: Knowledge Graph Refinement for Improving Accuracy of Translational Link Prediction Methods. (arXiv:2106.14233v1 [cs.CL])
    (2 min) Link prediction is the task of predicting missing relations between entities of the knowledge graph by inferring from the facts contained in it. Recent work in link prediction has attempted to provide a model for increasing link prediction accuracy by using more layers in neural network architecture or methods that add to the computational complexity of models. This paper we proposed a method for refining the knowledge graph, which makes the knowledge graph more informative, and link prediction operations can be performed more accurately using relatively fast translational models. Translational link prediction models, such as TransE, TransH, TransD, etc., have much less complexity than deep learning approaches. This method uses the hierarchy of relationships and also the hierarchy of entities in the knowledge graph to add the entity information as a new entity to the graph and connect it to the nodes which contain this information in their hierarchy. Our experiments show that our method can significantly increase the performance of translational link prediction methods in H@10, MR, MRR.
    UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning. (arXiv:2106.14019v1 [cs.CL])
    (2 min) Despite the success of various text generation metrics such as BERTScore, it is still difficult to evaluate the image captions without enough reference captions due to the diversity of the descriptions. In this paper, we introduce a new metric UMIC, an Unreferenced Metric for Image Captioning which does not require reference captions to evaluate image captions. Based on Vision-and-Language BERT, we train UMIC to discriminate negative captions via contrastive learning. Also, we observe critical problems of the previous benchmark dataset (i.e., human annotations) on image captioning metric, and introduce a new collection of human annotations on the generated captions. We validate UMIC on four datasets, including our new dataset, and show that UMIC has a higher correlation than all previous metrics that require multiple references. We release the benchmark dataset and pre-trained models to compute the UMIC.
    Core Challenges in Embodied Vision-Language Planning. (arXiv:2106.13948v1 [cs.LG])
    (2 min) Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.
    Rationale-Inspired Natural Language Explanations with Commonsense. (arXiv:2106.13876v1 [cs.CL])
    (2 min) Explainable machine learning models primarily justify predicted labels using either extractive rationales (i.e., subsets of input features) or free-text natural language explanations (NLEs) as abstractive justifications. While NLEs can be more comprehensive than extractive rationales, machine-generated NLEs have been shown to sometimes lack commonsense knowledge. Here, we show that commonsense knowledge can act as a bridge between extractive rationales and NLEs, rendering both types of explanations better. More precisely, we introduce a unified framework, called RExC (Rationale-Inspired Explanations with Commonsense), that (1) extracts rationales as a set of features responsible for machine predictions, (2) expands the extractive rationales using available commonsense resources, and (3) uses the expanded knowledge to generate natural language explanations. Our framework surpasses by a large margin the previous state-of-the-art in generating NLEs across five tasks in both natural language processing and vision-language understanding, with human annotators consistently rating the explanations generated by RExC to be more comprehensive, grounded in commonsense, and overall preferred compared to previous state-of-the-art models. Moreover, our work shows that commonsense-grounded explanations can enhance both task performance and rationales extraction capabilities.
    Multimodal Few-Shot Learning with Frozen Language Models. (arXiv:2106.13884v1 [cs.CV])
    (2 min) When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
    Benchmarking Differential Privacy and Federated Learning for BERT Models. (arXiv:2106.13973v1 [cs.CL])
    (2 min) Natural Language Processing (NLP) techniques can be applied to help with the diagnosis of medical conditions such as depression, using a collection of a person's utterances. Depression is a serious medical illness that can have adverse effects on how one feels, thinks, and acts, which can lead to emotional and physical problems. Due to the sensitive nature of such data, privacy measures need to be taken for handling and training models with such data. In this work, we study the effects that the application of Differential Privacy (DP) has, in both a centralized and a Federated Learning (FL) setup, on training contextualized language models (BERT, ALBERT, RoBERTa and DistilBERT). We offer insights on how to privately train NLP models and what architectures and setups provide more desirable privacy utility trade-offs. We envisage this work to be used in future healthcare and mental health studies to keep medical history private. Therefore, we provide an open-source implementation of this work.
    XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages. (arXiv:2106.13822v1 [cs.CL])
    (2 min) Contemporary works on abstractive text summarization have focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low/mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model, with XL-Sum and experiment on multilingual and low-resource summarization tasks. XL-Sum induces competitive results compared to the ones obtained using similar monolingual datasets: we show higher than 11 ROUGE-2 scores on 10 languages we benchmark on, with some of them exceeding 15, as obtained by multilingual training. Additionally, training on low-resource languages individually also provides competitive performance. To the best of our knowledge, XL-Sum is the largest abstractive summarization dataset in terms of the number of samples collected from a single source and the number of languages covered. We are releasing our dataset and models to encourage future research on multilingual abstractive summarization. The resources can be found at \url{https://github.com/csebuetnlp/xl-sum}.
    Persian Rhetorical Structure Theory. (arXiv:2106.13833v1 [cs.CL])
    (2 min) Over the past years, interest in discourse analysis and discourse parsing has steadily grown, and many discourse-annotated corpora and, as a result, discourse parsers have been built. In this paper, we present a discourse-annotated corpus for the Persian language built in the framework of Rhetorical Structure Theory as well as a discourse parser built upon the DPLP parser, an open-source discourse parser. Our corpus consists of 150 journalistic texts, each text having an average of around 400 words. Corpus texts were annotated using 18 discourse relations and based on the annotation guideline of the English RST Discourse Treebank corpus. Our text-level discourse parser is trained using gold segmentation and is built upon the DPLP discourse parser, which uses a large-margin transition-based approach to solve the problem of discourse parsing. The performance of our discourse parser in span (S), nuclearity (N) and relation (R) detection is around 78%, 64%, 44% respectively, in terms of F1 measure.
    Semantic Parsing Natural Language into Relational Algebra. (arXiv:2106.13858v1 [cs.CL])
    (2 min) Natural interface to database (NLIDB) has been researched a lot during the past decades. In the core of NLIDB, is a semantic parser used to convert natural language into SQL. Solutions from traditional NLP methodology focuses on grammar rule pattern learning and pairing via intermediate logic forms. Although those methods give an acceptable performance on certain specific database and parsing tasks, they are hard to generalize and scale. On the other hand, recent progress in neural deep learning seems to provide a promising direction towards building a general NLIDB system. Unlike the traditional approach, those neural methodologies treat the parsing problem as a sequence-to-sequence learning problem. In this paper, we experimented on several sequence-to-sequence learning models and evaluate their performance on general database parsing task.
  • cs.CV updates on arXiv.org

    Channel Pruning in a White Box for Efficient Image Classification. (arXiv:2104.11883v2 [cs.CV] UPDATED)
    (2 min) Channel Pruning has been long studied to compress CNNs for efficient image classification. Prior works implement channel pruning in an unexplainable manner, which tends to reduce the final classification errors while failing to consider the internal influence of each channel. In this paper, we conduct channel pruning in a white box. Through deep visualization of feature maps activated by different channels, we observe that different channels have a varying contribution to different categories in image classification. Inspired by this, we choose to preserve channels contributing to most categories. Specifically, to model the contribution of each channel to differentiating categories, we develop a class-wise mask for each channel, implemented in a dynamic training manner w.r.t. the input image's category. On the basis of the learned class-wise mask, we perform a global voting mechanism to remove channels with less category discrimination. Lastly, a fine-tuning process is conducted to recover the performance of the pruned model. To our best knowledge, it is the first time that CNN interpretability theory is considered to guide channel pruning. Extensive experiments on representative image classification tasks demonstrate the superiority of our White-Box over many state-of-the-arts. For instance, on CIFAR-10, it reduces 65.23% FLOPs with even 0.62% accuracy improvement for ResNet-110. On ILSVRC-2012, White-Box achieves a 45.6% FLOPs reduction with only a small loss of 0.83% in the top-1 accuracy for ResNet-50.
    Touch-based Curiosity for Sparse-Reward Tasks. (arXiv:2104.00442v2 [cs.LG] UPDATED)
    (2 min) Robots in many real-world settings have access to force/torque sensors in their gripper and tactile sensing is often necessary in tasks that involve contact-rich motion. In this work, we leverage surprise from mismatches in touch feedback to guide exploration in hard sparse-reward reinforcement learning tasks. Our approach, Touch-based Curiosity (ToC), learns what visible objects interactions are supposed to "feel" like. We encourage exploration by rewarding interactions where the expectation and the experience don't match. In our proposed method, an initial task-independent exploration phase is followed by an on-task learning phase, in which the original interactions are relabeled with on-task rewards. We test our approach on a range of touch-intensive robot arm tasks (e.g. pushing objects, opening doors), which we also release as part of this work. Across multiple experiments in a simulated setting, we demonstrate that our method is able to learn these difficult tasks through sparse reward and curiosity alone. We compare our cross-modal approach to single-modality (touch- or vision-only) approaches as well as other curiosity-based methods and find that our method performs better and is more sample-efficient.
    Multiple Meta-model Quantifying for Medical Visual Question Answering. (arXiv:2105.08913v2 [cs.CV] UPDATED)
    (2 min) Transfer learning is an important step to extract meaningful features and overcome the data limitation in the medical Visual Question Answering (VQA) task. However, most of the existing medical VQA methods rely on external data for transfer learning, while the meta-data within the dataset is not fully utilized. In this paper, we present a new multiple meta-model quantifying method that effectively learns meta-annotation and leverages meaningful features to the medical VQA task. Our proposed method is designed to increase meta-data by auto-annotation, deal with noisy labels, and output meta-models which provide robust features for medical VQA tasks. Extensively experimental results on two public medical VQA datasets show that our approach achieves superior accuracy in comparison with other state-of-the-art methods, while does not require external data to train meta-models.
    Darker than Black-Box: Face Reconstruction from Similarity Queries. (arXiv:2106.14290v1 [cs.CV])
    (2 min) Several methods for inversion of face recognition models were recently presented, attempting to reconstruct a face from deep templates. Although some of these approaches work in a black-box setup using only face embeddings, usually, on the end-user side, only similarity scores are provided. Therefore, these algorithms are inapplicable in such scenarios. We propose a novel approach that allows reconstructing the face querying only similarity scores of the black-box model. While our algorithm operates in a more general setup, experiments show that it is query efficient and outperforms the existing methods.
    Driving-Signal Aware Full-Body Avatars. (arXiv:2105.10441v2 [cs.CV] UPDATED)
    (2 min) We present a learning-based method for building driving-signal aware full-body avatars. Our model is a conditional variational autoencoder that can be animated with incomplete driving signals, such as human pose and facial keypoints, and produces a high-quality representation of human geometry and view-dependent appearance. The core intuition behind our method is that better drivability and generalization can be achieved by disentangling the driving signals and remaining generative factors, which are not available during animation. To this end, we explicitly account for information deficiency in the driving signal by introducing a latent space that exclusively captures the remaining information, thus enabling the imputation of the missing factors required during full-body animation, while remaining faithful to the driving signal. We also propose a learnable localized compression for the driving signal which promotes better generalization, and helps minimize the influence of global chance-correlations often found in real datasets. For a given driving signal, the resulting variational model produces a compact space of uncertainty for missing factors that allows for an imputation strategy best suited to a particular application. We demonstrate the efficacy of our approach on the challenging problem of full-body animation for virtual telepresence with driving signals acquired from minimal sensors placed in the environment and mounted on a VR-headset.
    Semi-Supervised Classification and Segmentation on High Resolution Aerial Images. (arXiv:2105.08655v2 [cs.CV] UPDATED)
    (2 min) FloodNet is a high-resolution image dataset acquired by a small UAV platform, DJI Mavic Pro quadcopters, after Hurricane Harvey. The dataset presents a unique challenge of advancing the damage assessment process for post-disaster scenarios using unlabeled and limited labeled dataset. We propose a solution to address their classification and semantic segmentation challenge. We approach this problem by generating pseudo labels for both classification and segmentation during training and slowly incrementing the amount by which the pseudo label loss affects the final loss. Using this semi-supervised method of training helped us improve our baseline supervised loss by a huge margin for classification, allowing the model to generalize and perform better on the validation and test splits of the dataset. In this paper, we compare and contrast the various methods and models for image classification and semantic segmentation on the FloodNet dataset.
    Learning to Amend Facial Expression Representation via De-albino and Affinity. (arXiv:2103.10189v2 [cs.CV] UPDATED)
    (2 min) Facial Expression Recognition (FER) is a classification task that points to face variants. Hence, there are certain affinity features between facial expressions, receiving little attention in the FER literature. Convolution padding, despite helping capture the edge information, causes erosion of the feature map simultaneously. After multi-layer filling convolution, the output feature map named albino feature definitely weakens the representation of the expression. To tackle these challenges, we propose a novel architecture named Amending Representation Module (ARM). ARM is a substitute for the pooling layer. Theoretically, it can be embedded in the back end of any network to deal with the Padding Erosion. ARM efficiently enhances facial expression representation from two different directions: 1) reducing the weight of eroded features to offset the side effect of padding, and 2) sharing affinity features over mini-batch to strengthen the representation learning. Experiments on public benchmarks prove that our ARM boosts the performance of FER remarkably. The validation accuracies are respectively 92.05% on RAF-DB, 65.2% on Affect-Net, and 58.71% on SFEW, exceeding current state-of-the-art methods. Our implementation and trained models are available at https://github.com/JiaweiShiCV/Amend-Representation-Module.
    OSKDet: Towards Orientation-sensitive Keypoint Localization for Rotated Object Detection. (arXiv:2104.08697v2 [cs.CV] UPDATED)
    (2 min) Rotated object detection is a challenging issue of computer vision field. Loss of spatial information and confusion of parametric order have been the bottleneck for rotated detection accuracy. In this paper, we propose an orientation-sensitive keypoint based rotated detector OSKDet. We adopt a set of keypoints to characterize the target and predict the keypoint heatmap on ROI to form a rotated target. By proposing the orientation-sensitive heatmap, OSKDet could learn the shape and direction of rotated target implicitly and has stronger modeling capabilities for target representation, which improves the localization accuracy and acquires high quality detection results. To extract highly effective features at border areas, we design a rotation-aware deformable convolution module. Furthermore, we explore a new keypoint reorder algorithm and feature fusion module based on the angle distribution to eliminate the confusion of keypoint order. Experimental results on several public benchmarks show the state-of-the-art performance of OSKDet. Specifically, we achieve an AP of 77.81% on DOTA, 89.91% on HRSC2016, and 97.18% on UCAS-AOD, respectively.
    Low-Dose CT Denoising Using a Structure-Preserving Kernel Prediction Network. (arXiv:2105.14758v2 [eess.IV] UPDATED)
    (2 min) Low-dose CT has been a key diagnostic imaging modality to reduce the potential risk of radiation overdose to patient health. Despite recent advances, CNN-based approaches typically apply filters in a spatially invariant way and adopt similar pixel-level losses, which treat all regions of the CT image equally and can be inefficient when fine-grained structures coexist with non-uniformly distributed noises. To address this issue, we propose a Structure-preserving Kernel Prediction Network (StructKPN) that combines the kernel prediction network with a structure-aware loss function that utilizes the pixel gradient statistics and guides the model towards spatially-variant filters that enhance noise removal, prevent over-smoothing and preserve detailed structures for different regions in CT imaging. Extensive experiments demonstrated that our approach achieved superior performance on both synthetic and non-synthetic datasets, and better preserves structures that are highly desired in clinical screening and low-dose protocol optimization.
    TransBTS: Multimodal Brain Tumor Segmentation Using Transformer. (arXiv:2103.04430v2 [cs.CV] UPDATED)
    (2 min) Transformer, which can benefit from global (long-range) information modeling using self-attention mechanisms, has been successful in natural language processing and 2D image classification recently. However, both local and global features are crucial for dense prediction tasks, especially for 3D medical image segmentation. In this paper, we for the first time exploit Transformer in 3D CNN for MRI Brain Tumor Segmentation and propose a novel network named TransBTS based on the encoder-decoder structure. To capture the local 3D context information, the encoder first utilizes 3D CNN to extract the volumetric spatial feature maps. Meanwhile, the feature maps are reformed elaborately for tokens that are fed into Transformer for global feature modeling. The decoder leverages the features embedded by Transformer and performs progressive upsampling to predict the detailed segmentation map. Extensive experimental results on both BraTS 2019 and 2020 datasets show that TransBTS achieves comparable or higher results than previous state-of-the-art 3D methods for brain tumor segmentation on 3D MRI scans. The source code is available at https://github.com/Wenxuan-1119/TransBTS
    BAM: A Lightweight and Efficient Balanced Attention Mechanism for Single Image Super Resolution. (arXiv:2104.07566v2 [eess.IV] UPDATED)
    (2 min) Attention mechanism has shown enormous potential for single image super-resolution (SISR). However, existing works only proposed some attention mechanism for a specific network. A universal attention mechanism for SISR, which could further improve the performance of networks without attention and provide a baseline for networks with attention, is still lacking. To fit this gap, we propose a lightweight and efficient Balanced Attention Mechanism (BAM), which consists of Avgpool Channel Attention Module (ACAM) and Maxpool Spatial Attention Module (MSAM) in parallel. The information extraction mechanism of ACAM and MSAM effectively filters redundant information, making the overall structure of BAM very lightweight. Owing to the parallel structure, during the gradient backpropagation process of BAM, ACAM and MSAM not only conduct self-optimization, but also mutual optimization so as to generate more balanced attention information. To verify the effectiveness and robustness of BAM, we applied it to 12 state-ofthe-art SISR networks. The results on 4 benchmark datasets demonstrate that BAM can efficiently improve the networks' performance, and for those with attention, the substitution with BAM further reduces the amount of parameters and increase the inference speed. Moreover, ablation experiments were conducted to prove the minimalism of BAM.
    Exploring to establish an appropriate model for image aesthetic assessment via CNN-based RSRL: An empirical study. (arXiv:2106.03316v2 [cs.CV] UPDATED)
    (2 min) To establish an appropriate model for photo aesthetic assessment, in this paper, a D-measure which reflects the disentanglement degree of the final layer FC nodes of CNN is introduced. By combining F-measure with D-measure to obtain a FD measure, an algorithm of determining the optimal model from the multiple photo score prediction models generated by CNN-based repetitively self-revised learning(RSRL) is proposed. Furthermore, the first fixation perspective(FFP) and the assessment interest region(AIR) of the models are defined and calculated. The experimental results show that the FD measure is effective for establishing the appropriate model from the multiple score prediction models with different CNN structures. Moreover, the FD-determined optimal models with the comparatively high FD always have the FFP an AIR which are close to the human's aesthetic perception when enjoying photos.
    VAT-Mart: Learning Visual Action Trajectory Proposals for Manipulating 3D ARTiculated Objects. (arXiv:2106.14440v1 [cs.CV])
    (2 min) Perceiving and manipulating 3D articulated objects (e.g., cabinets, doors) in human environments is an important yet challenging task for future home-assistant robots. The space of 3D articulated objects is exceptionally rich in their myriad semantic categories, diverse shape geometry, and complicated part functionality. Previous works mostly abstract kinematic structure with estimated joint parameters and part poses as the visual representations for manipulating 3D articulated objects. In this paper, we propose object-centric actionable visual priors as a novel perception-interaction handshaking point that the perception system outputs more actionable guidance than kinematic structure estimation, by predicting dense geometry-aware, interaction-aware, and task-aware visual action affordance and trajectory proposals. We design an interaction-for-perception framework VAT-Mart to learn such actionable visual representations by simultaneously training a curiosity-driven reinforcement learning policy exploring diverse interaction trajectories and a perception module summarizing and generalizing the explored knowledge for pointwise predictions among diverse shapes. Experiments prove the effectiveness of the proposed approach using the large-scale PartNet-Mobility dataset in SAPIEN environment and show promising generalization capabilities to novel test shapes, unseen object categories, and real-world data. Project page: https://hyperplane-lab.github.io/vat-mart
    Transformer Transforms Salient Object Detection and Camouflaged Object Detection. (arXiv:2104.10127v2 [cs.CV] UPDATED)
    (3 min) The transformer networks are particularly good at modeling long-range dependencies within a long sequence. In this paper, we conduct research on applying the transformer networks for salient object detection (SOD). We adopt the dense transformer backbone for fully supervised RGB image based SOD, RGB-D image pair based SOD, and weakly supervised SOD within a unified framework based on the observation that the transformer backbone can provide accurate structure modeling, which makes it powerful in learning from weak labels with less structure information. Further, we find that the vision transformer architectures do not offer direct spatial supervision, instead encoding position as a feature. Therefore, we investigate the contributions of two strategies to provide stronger spatial supervision through the transformer layers within our unified framework, namely deep supervision and difficulty-aware learning. We find that deep supervision can get gradients back into the higher level features, thus leads to uniform activation within the same semantic object. Difficulty-aware learning on the other hand is capable of identifying the hard pixels for effective hard negative mining. We also visualize features of conventional backbone and transformer backbone before and after fine-tuning them for SOD, and find that transformer backbone encodes more accurate object structure information and more distinct semantic information within the lower and higher level features respectively. We also apply our model to camouflaged object detection (COD) and achieve similar observations as the above three SOD tasks. Extensive experimental results on various SOD and COD tasks illustrate that transformer networks can transform SOD and COD, leading to new benchmarks for each related task. The source code and experimental results are available via our project page: https://github.com/fupiao1998/TrasformerSOD.
    3rd Place Solution for Short-video Face Parsing Challenge. (arXiv:2106.07409v2 [cs.CV] UPDATED)
    (2 min) This is a short technical report introducing the solution of Team Rat for Short-video Parsing Face Parsing Track of The 3rd Person in Context (PIC) Workshop and Challenge at CVPR 2021. In this report, we propose an Edge-Aware Network (EANet) that uses edge information to refine the segmentation edge. To further obtain the finer edge results, we introduce edge attention loss that only compute cross entropy on the edges, it can effectively reduce the classification error around edge and get more smooth boundary. Benefiting from the edge information and edge attention loss, the proposed EANet achieves 86.16\% accuracy in the Short-video Face Parsing track of the 3rd Person in Context (PIC) Workshop and Challenge, ranked the third place.
    Kimera-Multi: Robust, Distributed, Dense Metric-Semantic SLAM for Multi-Robot Systems. (arXiv:2106.14386v1 [cs.RO])
    (2 min) This paper presents Kimera-Multi, the first multi-robot system that (i) is robust and capable of identifying and rejecting incorrect inter and intra-robot loop closures resulting from perceptual aliasing, (ii) is fully distributed and only relies on local (peer-to-peer) communication to achieve distributed localization and mapping, and (iii) builds a globally consistent metric-semantic 3D mesh model of the environment in real-time, where faces of the mesh are annotated with semantic labels. Kimera-Multi is implemented by a team of robots equipped with visual-inertial sensors. Each robot builds a local trajectory estimate and a local mesh using Kimera. When communication is available, robots initiate a distributed place recognition and robust pose graph optimization protocol based on a novel distributed graduated non-convexity algorithm. The proposed protocol allows the robots to improve their local trajectory estimates by leveraging inter-robot loop closures while being robust to outliers. Finally, each robot uses its improved trajectory estimate to correct the local mesh using mesh deformation techniques. We demonstrate Kimera-Multi in photo-realistic simulations, SLAM benchmarking datasets, and challenging outdoor datasets collected using ground robots. Both real and simulated experiments involve long trajectories (e.g., up to 800 meters per robot). The experiments show that Kimera-Multi (i) outperforms the state of the art in terms of robustness and accuracy, (ii) achieves estimation errors comparable to a centralized SLAM system while being fully distributed, (iii) is parsimonious in terms of communication bandwidth, (iv) produces accurate metric-semantic 3D meshes, and (v) is modular and can be also used for standard 3D reconstruction (i.e., without semantic labels) or for trajectory estimation (i.e., without reconstructing a 3D mesh).
    Adversarial Generation of Continuous Images. (arXiv:2011.12026v2 [cs.CV] UPDATED)
    (2 min) In most existing learning systems, images are typically viewed as 2D pixel arrays. However, in another paradigm gaining popularity, a 2D image is represented as an implicit neural representation (INR) - an MLP that predicts an RGB pixel value given its (x,y) coordinate. In this paper, we propose two novel architectural techniques for building INR-based image decoders: factorized multiplicative modulation and multi-scale INRs, and use them to build a state-of-the-art continuous image GAN. Previous attempts to adapt INRs for image generation were limited to MNIST-like datasets and do not scale to complex real-world data. Our proposed INR-GAN architecture improves the performance of continuous image generators by several times, greatly reducing the gap between continuous image GANs and pixel-based ones. Apart from that, we explore several exciting properties of the INR-based decoders, like out-of-the-box superresolution, meaningful image-space interpolation, accelerated inference of low-resolution images, an ability to extrapolate outside of image boundaries, and strong geometric prior. The project page is located at https://universome.github.io/inr-gan.
    Enhancing Deep Neural Network Saliency Visualizations with Gradual Extrapolation. (arXiv:2104.04945v2 [cs.CV] UPDATED)
    (2 min) In this paper, an enhancement technique for the class activation mapping methods such as gradient-weighted class activation maps or excitation backpropagation is proposed to present the visual explanations of decisions from convolutional neural network-based models. The proposed idea, called Gradual Extrapolation, can supplement any method that generates a heatmap picture by sharpening the output. Instead of producing a coarse localization map that highlights the important predictive regions in the image, the proposed method outputs the specific shape that most contributes to the model output. Thus, the proposed method improves the accuracy of saliency maps. The effect has been achieved by the gradual propagation of the crude map obtained in the deep layer through all preceding layers with respect to their activations. In validation tests conducted on a selected set of images, the faithfulness, interpretability, and applicability of the method are evaluated. The proposed technique significantly improves the localization detection of the neural networks attention at low additional computational costs. Furthermore, the proposed method is applicable to a variety deep neural network models. The code for the method can be found at https://github.com/szandala/gradual-extrapolation
    Continual Learning of Context-dependent Processing in Neural Networks. (arXiv:1810.01256v3 [cs.LG] UPDATED)
    (2 min) Deep neural networks (DNNs) are powerful tools in learning sophisticated but fixed mapping rules between inputs and outputs, thereby limiting their application in more complex and dynamic situations in which the mapping rules are not kept the same but changing according to different contexts. To lift such limits, we developed a novel approach involving a learning algorithm, called orthogonal weights modification (OWM), with the addition of a context-dependent processing (CDP) module. We demonstrated that with OWM to overcome the problem of catastrophic forgetting, and the CDP module to learn how to reuse a feature representation and a classifier for different contexts, a single network can acquire numerous context-dependent mapping rules in an online and continual manner, with as few as $\sim$10 samples to learn each. This should enable highly compact systems to gradually learn myriad regularities of the real world and eventually behave appropriately within it.
    NTIRE 2021 Challenge on Perceptual Image Quality Assessment. (arXiv:2105.03072v3 [eess.IV] UPDATED)
    (2 min) This paper reports on the NTIRE 2021 challenge on perceptual image quality assessment (IQA), held in conjunction with the New Trends in Image Restoration and Enhancement workshop (NTIRE) workshop at CVPR 2021. As a new type of image processing technology, perceptual image processing algorithms based on Generative Adversarial Networks (GAN) have produced images with more realistic textures. These output images have completely different characteristics from traditional distortions, thus pose a new challenge for IQA methods to evaluate their visual quality. In comparison with previous IQA challenges, the training and testing datasets in this challenge include the outputs of perceptual image processing algorithms and the corresponding subjective scores. Thus they can be used to develop and evaluate IQA methods on GAN-based distortions. The challenge has 270 registered participants in total. In the final testing stage, 13 participating teams submitted their models and fact sheets. Almost all of them have achieved much better results than existing IQA methods, while the winning method can demonstrate state-of-the-art performance.
    QuickBrowser: A Unified Model to Detect and Read Simple Object in Real-time. (arXiv:2102.07354v2 [cs.CV] UPDATED)
    (2 min) There are many real-life use cases such as barcode scanning or billboard reading where people need to detect objects and read the object contents. Commonly existing methods are first trying to localize object regions, then determine layout and lastly classify content units. However, for simple fixed structured objects like license plates, this approach becomes overkill and lengthy to run. This work aims to solve this detect-and-read problem in a lightweight way by integrating multi-digit recognition into a one-stage object detection model. Our unified method not only eliminates the duplication in feature extraction (one for localizing, one again for classifying) but also provides useful contextual information around object regions for classification. Additionally, our choice of backbones and modifications in architecture, loss function, data augmentation and training make the method robust, efficient and speedy. Secondly, we made a public benchmark dataset of diverse real-life 1D barcodes for a reliable evaluation, which we collected, annotated and checked carefully. Eventually, experimental results prove the method's efficiency on the barcode problem by outperforming industrial tools in both detecting and decoding rates with a real-time fps at a VGA-similar resolution. It also did a great job expectedly on the license-plate recognition task (on the AOLP dataset) by outperforming the current state-of-the-art method significantly in terms of recognition rate and inference time.
    Exploring Data-Efficient 3D Scene Understanding with Contrastive Scene Contexts. (arXiv:2012.09165v3 [cs.CV] UPDATED)
    (2 min) The rapid progress in 3D scene understanding has come with growing demand for data; however, collecting and annotating 3D scenes (e.g. point clouds) are notoriously hard. For example, the number of scenes (e.g. indoor rooms) that can be accessed and scanned might be limited; even given sufficient data, acquiring 3D labels (e.g. instance masks) requires intensive human labor. In this paper, we explore data-efficient learning for 3D point cloud. As a first step towards this direction, we propose Contrastive Scene Contexts, a 3D pre-training method that makes use of both point-level correspondences and spatial contexts in a scene. Our method achieves state-of-the-art results on a suite of benchmarks where training data or labels are scarce. Our study reveals that exhaustive labelling of 3D point clouds might be unnecessary; and remarkably, on ScanNet, even using 0.1% of point labels, we still achieve 89% (instance segmentation) and 96% (semantic segmentation) of the baseline performance that uses full annotations.
    TiVGAN: Text to Image to Video Generation with Step-by-Step Evolutionary Generator. (arXiv:2009.02018v2 [cs.CV] UPDATED)
    (2 min) Advances in technology have led to the development of methods that can create desired visual multimedia. In particular, image generation using deep learning has been extensively studied across diverse fields. In comparison, video generation, especially on conditional inputs, remains a challenging and less explored area. To narrow this gap, we aim to train our model to produce a video corresponding to a given text description. We propose a novel training framework, Text-to-Image-to-Video Generative Adversarial Network (TiVGAN), which evolves frame-by-frame and finally produces a full-length video. In the first phase, we focus on creating a high-quality single video frame while learning the relationship between the text and an image. As the steps proceed, our model is trained gradually on more number of consecutive frames.This step-by-step learning process helps stabilize the training and enables the creation of high-resolution video based on conditional text descriptions. Qualitative and quantitative experimental results on various datasets demonstrate the effectiveness of the proposed method.
    ComBiNet: Compact Convolutional Bayesian Neural Network for Image Segmentation. (arXiv:2104.06957v2 [cs.CV] UPDATED)
    (2 min) Fully convolutional U-shaped neural networks have largely been the dominant approach for pixel-wise image segmentation. In this work, we tackle two defects that hinder their deployment in real-world applications: 1) Predictions lack uncertainty quantification that may be crucial to many decision-making systems; 2) Large memory storage and computational consumption demanding extensive hardware resources. To address these issues and improve their practicality we demonstrate a few-parameter compact Bayesian convolutional architecture, that achieves a marginal improvement in accuracy in comparison to related work using significantly fewer parameters and compute operations. The architecture combines parameter-efficient operations such as separable convolutions, bilinear interpolation, multi-scale feature propagation and Bayesian inference for per-pixel uncertainty quantification through Monte Carlo Dropout. The best performing configurations required fewer than 2.5 million parameters on diverse challenging datasets with few observations.
    MAAD-Face: A Massively Annotated Attribute Dataset for Face Images. (arXiv:2012.01030v2 [cs.CV] UPDATED)
    (2 min) Soft-biometrics play an important role in face biometrics and related fields since these might lead to biased performances, threatens the user's privacy, or are valuable for commercial aspects. Current face databases are specifically constructed for the development of face recognition applications. Consequently, these databases contain large amount of face images but lack in the number of attribute annotations and the overall annotation correctness. In this work, we propose MAADFace, a new face annotations database that is characterized by the large number of its high-quality attribute annotations. MAADFace is build on the VGGFace2 database and thus, consists of 3.3M faces of over 9k individuals. Using a novel annotation transfer-pipeline that allows an accurate label-transfer from multiple source-datasets to a target-dataset, MAAD-Face consists of 123.9M attribute annotations of 47 different binary attributes. Consequently, it provides 15 and 137 times more attribute labels than CelebA and LFW. Our investigation on the annotation quality by three human evaluators demonstrated the superiority of the MAAD-Face annotations over existing databases. Additionally, we make use of the large amount of high-quality annotations from MAAD-Face to study the viability of soft-biometrics for recognition, providing insights about which attributes support genuine and imposter decisions. The MAAD-Face annotations dataset is publicly available.
    MSAF: Multimodal Split Attention Fusion. (arXiv:2012.07175v2 [cs.CV] UPDATED)
    (2 min) Multimodal learning mimics the reasoning process of the human multi-sensory system, which is used to perceive the surrounding world. While making a prediction, the human brain tends to relate crucial cues from multiple sources of information. In this work, we propose a novel multimodal fusion module that learns to emphasize more contributive features across all modalities. Specifically, the proposed Multimodal Split Attention Fusion (MSAF) module splits each modality into channel-wise equal feature blocks and creates a joint representation that is used to generate soft attention for each channel across the feature blocks. Further, the MSAF module is designed to be compatible with features of various spatial dimensions and sequence lengths, suitable for both CNNs and RNNs. Thus, MSAF can be easily added to fuse features of any unimodal networks and utilize existing pretrained unimodal model weights. To demonstrate the effectiveness of our fusion module, we design three multimodal networks with MSAF for emotion recognition, sentiment analysis, and action recognition tasks. Our approach achieves competitive results in each task and outperforms other application-specific networks and multimodal fusion benchmarks.
    Going Beyond Saliency Maps: Training Deep Models to Interpret Deep Models. (arXiv:2102.08239v2 [eess.IV] UPDATED)
    (2 min) Interpretability is a critical factor in applying complex deep learning models to advance the understanding of brain disorders in neuroimaging studies. To interpret the decision process of a trained classifier, existing techniques typically rely on saliency maps to quantify the voxel-wise or feature-level importance for classification through partial derivatives. Despite providing some level of localization, these maps are not human-understandable from the neuroscience perspective as they do not inform the specific meaning of the alteration linked to the brain disorder. Inspired by the image-to-image translation scheme, we propose to train simulator networks that can warp a given image to inject or remove patterns of the disease. These networks are trained such that the classifier produces consistently increased or decreased prediction logits for the simulated images. Moreover, we propose to couple all the simulators into a unified model based on conditional convolution. We applied our approach to interpreting classifiers trained on a synthetic dataset and two neuroimaging datasets to visualize the effect of the Alzheimer's disease and alcohol use disorder. Compared to the saliency maps generated by baseline approaches, our simulations and visualizations based on the Jacobian determinants of the warping field reveal meaningful and understandable patterns related to the diseases.
    Matching Point Sets with Quantum Circuit Learning. (arXiv:2102.06697v2 [cs.CV] UPDATED)
    (2 min) In this work, we propose a parameterised quantum circuit learning approach to point set matching problem. In contrast to previous annealing-based methods, we propose a quantum circuit-based framework whose parameters are optimised via descending the gradients w.r.t a kernel-based loss function. We formulate the shape matching problem into a distribution learning task; that is, to learn the distribution of the optimal transformation parameters. We show that this framework is able to find multiple optimal solutions for symmetric shapes and is more accurate, scalable and robust than the previous annealing-based method. Code, data and pre-trained weights are available at the project page: \href{https://hansen7.github.io/qKC}{https://hansen7.github.io/qKC}
    Smart Inference for Multidigit Convolutional Neural Network based Barcode Decoding. (arXiv:2004.06297v3 [cs.CV] UPDATED)
    (2 min) Barcodes are ubiquitous and have been used in most of critical daily activities for decades. However, most of traditional decoders require well-founded barcode under a relatively standard condition. While wilder conditioned barcodes such as underexposed, occluded, blurry, wrinkled and rotated are commonly captured in reality, those traditional decoders show weakness of recognizing. Several works attempted to solve those challenging barcodes, but many limitations still exist. This work aims to solve the decoding problem using deep convolutional neural network with the possibility of running on portable devices. Firstly, we proposed a special modification of inference based on the feature of having checksum and test-time augmentation, named as Smart Inference (SI) in prediction phase of a trained model. SI considerably boosts accuracy and reduces the false prediction for trained models. Secondly, we have created a large practical evaluation dataset of real captured 1D barcode under various challenging conditions to test our methods vigorously, which is publicly available for other researchers. The experiments' results demonstrated the SI effectiveness with the highest accuracy of 95.85% which outperformed many existing decoders on the evaluation set. Finally, we successfully minimized the best model by knowledge distillation to a shallow model which is shown to have high accuracy (90.85%) with good inference speed of 34.2 ms per image on a real edge device.
    Benchmarking convolutional neural networks for diagnosing Lyme disease from images. (arXiv:2106.14465v1 [eess.IV])
    (3 min) Lyme disease is one of the most common infectious vector-borne diseases in the world. In the early stage, the disease manifests itself in most cases with erythema migrans (EM) skin lesions. Better diagnosis of these early forms would allow improving the prognosis by preventing the transition to a severe late form thanks to appropriate antibiotic therapy. Recent studies show that convolutional neural networks (CNNs) perform very well to identify skin lesions from the image but, there is not much work for Lyme disease prediction from EM lesion images. The main objective of this study is to extensively analyze the effectiveness of CNNs for diagnosing Lyme disease from images and to find out the best CNN architecture for the purpose. There is no publicly available EM image dataset for Lyme dis…
    Confounder-Aware Visualization of ConvNets. (arXiv:1907.12727v2 [cs.LG] UPDATED)
    (2 min) With recent advances in deep learning, neuroimaging studies increasingly rely on convolutional networks (ConvNets) to predict diagnosis based on MR images. To gain a better understanding of how a disease impacts the brain, the studies visualize the salience maps of the ConvNet highlighting voxels within the brain majorly contributing to the prediction. However, these salience maps are generally confounded, i.e., some salient regions are more predictive of confounding variables (such as age) than the diagnosis. To avoid such misinterpretation, we propose in this paper an approach that aims to visualize confounder-free saliency maps that only highlight voxels predictive of the diagnosis. The approach incorporates univariate statistical tests to identify confounding effects within the intermediate features learned by ConvNet. The influence from the subset of confounded features is then removed by a novel partial back-propagation procedure. We use this two-step approach to visualize confounder-free saliency maps extracted from synthetic and two real datasets. These experiments reveal the potential of our visualization in producing unbiased model-interpretation.
    Unifying Remote Sensing Image Retrieval and Classification with Robust Fine-tuning. (arXiv:2102.13392v2 [cs.CV] UPDATED)
    (2 min) Advances in high resolution remote sensing image analysis are currently hampered by the difficulty of gathering enough annotated data for training deep learning methods, giving rise to a variety of small datasets and associated dataset-specific methods. Moreover, typical tasks such as classification and retrieval lack a systematic evaluation on standard benchmarks and training datasets, which make it hard to identify durable and generalizable scientific contributions. We aim at unifying remote sensing image retrieval and classification with a new large-scale training and testing dataset, SF300, including both vertical and oblique aerial images and made available to the research community, and an associated fine-tuning method. We additionally propose a new adversarial fine-tuning method for global descriptors. We show that our framework systematically achieves a boost of retrieval and classification performance on nine different datasets compared to an ImageNet pretrained baseline, with currently no other method to compare to.
    A 3D CNN Network with BERT For Automatic COVID-19 Diagnosis From CT-Scan Images. (arXiv:2106.14403v1 [eess.IV])
    (2 min) We present an automatic COVID1-19 diagnosis framework from lung CT-scan slice images. In this framework, the slice images of a CT-scan volume are first proprocessed using segmentation techniques to filter out images of closed lung, and to remove the useless background. Then a resampling method is used to select one or multiple sets of a fixed number of slice images for training and validation. A 3D CNN network with BERT is used to classify this set of selected slice images. In this network, an embedding feature is also extracted. In cases where there are more than one set of slice images in a volume, the features of all sets are extracted and pooled into a global feature vector for the whole CT-scan volume. A simple multiple-layer perceptron (MLP) network is used to further classify the aggregated feature vector. The models are trained and evaluated on the provided training and validation datasets. On the validation dataset, the accuracy is 0.9278 and the F1 score is 0.9261.
    Classification and understanding of cloud structures via satellite images with EfficientUNet. (arXiv:2009.12931v3 [eess.IV] UPDATED)
    (2 min) Climate change has been a common interest and the forefront of crucial political discussion and decision-making for many years. Shallow clouds play a significant role in understanding the Earth's climate, but they are challenging to interpret and represent in a climate model. By classifying these cloud structures, there is a better possibility of understanding the physical structures of the clouds, which would improve the climate model generation, resulting in a better prediction of climate change or forecasting weather update. Clouds organise in many forms, which makes it challenging to build traditional rule-based algorithms to separate cloud features. In this paper, classification of cloud organization patterns was performed using a new scaled-up version of Convolutional Neural Network (CNN) named as EfficientNet as the encoder and UNet as decoder where they worked as feature extractor and reconstructor of fine grained feature map and was used as a classifier, which will help experts to understand how clouds will shape the future climate. By using a segmentation model in a classification task, it was shown that with a good encoder alongside UNet, it is possible to obtain good performance from this dataset. Dice coefficient has been used for the final evaluation metric, which gave the score of 66.26\% and 66.02\% for public and private (test set) leaderboard on Kaggle competition respectively.
    MQA: Answering the Question via Robotic Manipulation. (arXiv:2003.04641v3 [cs.AI] UPDATED)
    (2 min) In this paper, we propose a novel task, Manipulation Question Answering (MQA), where the robot performs manipulation actions to change the environment in order to answer a given question. To solve this problem, a framework consisting of a QA module and a manipulation module is proposed. For the QA module, we adopt the method for the Visual Question Answering (VQA) task. For the manipulation module, a Deep Q Network (DQN) model is designed to generate manipulation actions for the robot to interact with the environment. We consider the situation where the robot continuously manipulating objects inside a bin until the answer to the question is found. Besides, a novel dataset that contains a variety of object models, scenarios and corresponding question-answer pairs is established in a simulation environment. Extensive experiments have been conducted to validate the effectiveness of the proposed framework.
    Tiny Video Networks. (arXiv:1910.06961v2 [cs.CV] UPDATED)
    (2 min) Video understanding is a challenging problem with great impact on the abilities of autonomous agents working in the real-world. Yet, solutions so far have been computationally intensive, with the fastest algorithms running for more than half a second per video snippet on powerful GPUs. We propose a novel idea on video architecture learning - Tiny Video Networks - which automatically designs highly efficient models for video understanding. The tiny video models run with competitive performance for as low as 37 milliseconds per video on a CPU and 10 milliseconds on a standard GPU.
    Prior Flow Variational Autoencoder: A density estimation model for Non-Intrusive Load Monitoring. (arXiv:2011.14870v2 [cs.LG] UPDATED)
    (2 min) Non-Intrusive Load Monitoring (NILM) is a computational technique to estimate the power loads' appliance-by-appliance from the whole consumption measured by a single meter. In this paper, we propose a conditional density estimation model, based on deep neural networks, that joins a Conditional Variational Autoencoder with a Conditional Invertible Normalizing Flow model to estimate the individual appliance's power demand. The resulting model is called Prior Flow Variational Autoencoder or, for simplicity PFVAE. Thus, instead of having one model per appliance, the resulting model is responsible for estimating the power demand, appliance-by-appliance, at once. We train and evaluate our proposed model in a publicly available dataset composed of power demand measures from a poultry feed factory located in Brazil. The proposed model's quality is evaluated by comparing the obtained normalized disaggregation error (NDE) and signal aggregated error (SAE) with the previous work values on the same dataset. Our proposal achieves highly competitive results, and for six of the eight machines belonging to the dataset, we observe consistent improvements that go from 28% up to 81% in NDE and from 27% up to 86% in SAE.
    Virtual Normal: Enforcing Geometric Constraints for Accurate and Robust Depth Prediction. (arXiv:2103.04216v5 [cs.CV] UPDATED)
    (3 min) Monocular depth prediction plays a crucial role in understanding 3D scene geometry. Although recent methods have achieved impressive progress in terms of evaluation metrics such as the pixel-wise relative error, most methods neglect the geometric constraints in the 3D space. In this work, we show the importance of the high-order 3D geometric constraints for depth prediction. By designing a loss term that enforces a simple geometric constraint, namely, virtual normal directions determined by randomly sampled three points in the reconstructed 3D space, we significantly improve the accuracy and robustness of monocular depth estimation. Significantly, the virtual normal loss can not only improve the performance of learning metric depth, but also disentangle the scale information and enrich the model with better shape information. Therefore, when not having access to absolute metric depth training data, we can use virtual normal to learn a robust affine-invariant depth generated on diverse scenes. In experiments, We show state-of-the-art results of learning metric depth on NYU Depth-V2 and KITTI. From the high-quality predicted depth, we are now able to recover good 3D structures of the scene such as the point cloud and surface normal directly, eliminating the necessity of relying on additional models as was previously done. To demonstrate the excellent generalizability of learning affine-invariant depth on diverse data with the virtual normal loss, we construct a large-scale and diverse dataset for training affine-invariant depth, termed Diverse Scene Depth dataset (DiverseDepth), and test on five datasets with the zero-shot test setting. Code is available at: https://git.io/Depth
    False Negative Reduction in Video Instance Segmentation using Uncertainty Estimates. (arXiv:2106.14474v1 [cs.CV])
    (2 min) Instance segmentation of images is an important tool for automated scene understanding. Neural networks are usually trained to optimize their overall performance in terms of accuracy. Meanwhile, in applications such as automated driving, an overlooked pedestrian seems more harmful than a falsely detected one. In this work, we present a false negative detection method for image sequences based on inconsistencies in time series of tracked instances given the availability of image sequences in online applications. As the number of instances can be greatly increased by this algorithm, we apply a false positive pruning using uncertainty estimates aggregated over instances. To this end, instance-wise metrics are constructed which characterize uncertainty and geometry of a given instance or are predicated on depth estimation. The proposed method serves as a post-processing step applicable to any neural network that can also be trained on single frames only. In our tests, we obtain an improved trade-off between false negative and false positive instances by our fused detection approach in comparison to the use of an ordinary score value provided by the instance segmentation network during inference.
    Semantic Labeling of Large-Area Geographic Regions Using Multi-View and Multi-Date Satellite Images and Noisy OSM Training Labels. (arXiv:2008.10271v5 [cs.CV] UPDATED)
    (3 min) We present a novel multi-view training framework and CNN architecture for combining information from multiple overlapping satellite images and noisy training labels derived from OpenStreetMap (OSM) to semantically label buildings and roads across large geographic regions (100 km$^2$). Our approach to multi-view semantic segmentation yields a 4-7% improvement in the per-class IoU scores compared to the traditional approaches that use the views independently of one another. A unique (and, perhaps, surprising) property of our system is that modifications that are added to the tail-end of the CNN for learning from the multi-view data can be discarded at the time of inference with a relatively small penalty in the overall performance. This implies that the benefits of training using multiple views are absorbed by all the layers of the network. Additionally, our approach only adds a small overhead in terms of the GPU-memory consumption even when training with as many as 32 views per scene. The system we present is end-to-end automated, which facilitates comparing the classifiers trained directly on true orthophotos vis-a-vis first training them on the off-nadir images and subsequently translating the predicted labels to geographical coordinates. With no human supervision, our IoU scores for the buildings and roads classes are 0.8 and 0.64 respectively which are better than state-of-the-art approaches that use OSM labels and that are not completely automated.
    Towards End-to-End Text Spotting in Natural Scenes. (arXiv:1906.06013v6 [cs.CV] UPDATED)
    (2 min) Text spotting in natural scene images is of great importance for many image understanding tasks. It includes two sub-tasks: text detection and recognition. In this work, we propose a unified network that simultaneously localizes and recognizes text with a single forward pass, avoiding intermediate processes such as image cropping and feature re-calculation, word separation, and character grouping. In contrast to existing approaches that consider text detection and recognition as two distinct tasks and tackle them one by one, the proposed framework settles these two tasks concurrently. The whole framework can be trained end-to-end and is able to handle text of arbitrary shapes. The convolutional features are calculated only once and shared by both detection and recognition modules. Through multi-task training, the learned features become more discriminate and improve the overall performance. By employing the $2$D attention model in word recognition, the irregularity of text can be robustly addressed. It provides the spatial location for each character, which not only helps local feature extraction in word recognition, but also indicates an orientation angle to refine text localization. Our proposed method has achieved state-of-the-art performance on several standard text spotting benchmarks, including both regular and irregular ones.
    Dizygotic Conditional Variational AutoEncoder for Multi-Modal and Partial Modality Absent Few-Shot Learning. (arXiv:2106.14467v1 [cs.CV])
    (2 min) Data augmentation is a powerful technique for improving the performance of the few-shot classification task. It generates more samples as supplements, and then this task can be transformed into a common supervised learning issue for solution. However, most mainstream data augmentation based approaches only consider the single modality information, which leads to the low diversity and quality of generated features. In this paper, we present a novel multi-modal data augmentation approach named Dizygotic Conditional Variational AutoEncoder (DCVAE) for addressing the aforementioned issue. DCVAE conducts feature synthesis via pairing two Conditional Variational AutoEncoders (CVAEs) with the same seed but different modality conditions in a dizygotic symbiosis manner. Subsequently, the generated features of two CVAEs are adaptively combined to yield the final feature, which can be converted back into its paired conditions while ensuring these conditions are consistent with the original conditions not only in representation but also in function. DCVAE essentially provides a new idea of data augmentation in various multi-modal scenarios by exploiting the complement of different modality prior information. Extensive experimental results demonstrate our work achieves state-of-the-art performances on miniImageNet, CIFAR-FS and CUB datasets, and is able to work well in the partial modality absence case.
    Progressive Class-based Expansion Learning For Image Classification. (arXiv:2106.14412v1 [cs.CV])
    (2 min) In this paper, we propose a novel image process scheme called class-based expansion learning for image classification, which aims at improving the supervision-stimulation frequency for the samples of the confusing classes. Class-based expansion learning takes a bottom-up growing strategy in a class-based expansion optimization fashion, which pays more attention to the quality of learning the fine-grained classification boundaries for the preferentially selected classes. Besides, we develop a class confusion criterion to select the confusing class preferentially for training. In this way, the classification boundaries of the confusing classes are frequently stimulated, resulting in a fine-grained form. Experimental results demonstrate the effectiveness of the proposed scheme on several benchmarks.
    Residual Moment Loss for Medical Image Segmentation. (arXiv:2106.14178v1 [eess.IV])
    (2 min) Location information is proven to benefit the deep learning models on capturing the manifold structure of target objects, and accordingly boosts the accuracy of medical image segmentation. However, most existing methods encode the location information in an implicit way, e.g. the distance transform maps, which describe the relative distance from each pixel to the contour boundary, for the network to learn. These implicit approaches do not fully exploit the position information (i.e. absolute location) of targets. In this paper, we propose a novel loss function, namely residual moment (RM) loss, to explicitly embed the location information of segmentation targets during the training of deep learning networks. Particularly, motivated by image moments, the segmentation prediction map and ground-truth map are weighted by coordinate information. Then our RM loss encourages the networks to maintain the consistency between the two weighted maps, which promotes the segmentation networks to easily locate the targets and extract manifold-structure-related features. We validate the proposed RM loss by conducting extensive experiments on two publicly available datasets, i.e., 2D optic cup and disk segmentation and 3D left atrial segmentation. The experimental results demonstrate the effectiveness of our RM loss, which significantly boosts the accuracy of segmentation networks.
    Seismic Facies Analysis: A Deep Domain Adaptation Approach. (arXiv:2011.10510v2 [physics.geo-ph] UPDATED)
    (2 min) Deep neural networks (DNNs) can learn accurately from large quantities of labeled input data, but DNNs sometimes fail to generalize to test data sampled from different input distributions. Unsupervised Deep Domain Adaptation (DDA) proves useful when no input labels are available, and distribution shifts are observed in the target domain (TD). Experiments are performed on seismic images of the F3 block 3D dataset from offshore Netherlands (source domain; SD) and Penobscot 3D survey data from Canada (target domain; TD). Three geological classes from SD and TD that have similar reflection patterns are considered. In the present study, an improved deep neural network architecture named EarthAdaptNet (EAN) is proposed to semantically segment the seismic images. We specifically use a transposed residual unit to replace the traditional dilated convolution in the decoder block. The EAN achieved a pixel-level accuracy >84% and an accuracy of ~70% for the minority classes, showing improved performance compared to existing architectures. In addition, we introduced the CORAL (Correlation Alignment) method to the EAN to create an unsupervised deep domain adaptation network (EAN-DDA) for the classification of seismic reflections fromF3 and Penobscot. Maximum class accuracy achieved was ~99% for class 2 of Penobscot with >50% overall accuracy. Taken together, EAN-DDA has the potential to classify target domain seismic facies classes with high accuracy.
    Towards Better Explanations of Class Activation Mapping. (arXiv:2102.05228v2 [cs.CV] UPDATED)
    (2 min) Increasing demands for understanding the internal behavior of convolutional neural networks (CNNs) have led to remarkable improvements in explanation methods. Particularly, several class activation mapping (CAM) based methods, which generate visual explanation maps by a linear combination of activation maps from CNNs, have been proposed. However, the majority of the methods lack a clear theoretical basis on how they assign the coefficients of the linear combination. In this paper, we revisit the intrinsic linearity of CAM with respect to the activation maps; we construct an explanation model of CNN as a linear function of binary variables that denote the existence of the corresponding activation maps. With this approach, the explanation model can be determined by additive feature attribution methods in an analytic manner. We then demonstrate the adequacy of SHAP values, which is a unique solution for the explanation model with a set of desirable properties, as the coefficients of CAM. Since the exact SHAP values are unattainable, we introduce an efficient approximation method, LIFT-CAM, based on DeepLIFT. Our proposed LIFT-CAM can estimate the SHAP values of the activation maps with high speed and accuracy. Furthermore, it greatly outperforms other previous CAM-based methods in both qualitative and quantitative aspects.
    Learning Adaptive Classifiers Synthesis for Generalized Few-Shot Learning. (arXiv:1906.02944v5 [cs.CV] UPDATED)
    (2 min) Object recognition in the real-world requires handling long-tailed or even open-ended data. An ideal visual system needs to recognize the populated head visual concepts reliably and meanwhile efficiently learn about emerging new tail categories with a few training instances. Class-balanced many-shot learning and few-shot learning tackle one side of this problem, by either learning strong classifiers for head or learning to learn few-shot classifiers for the tail. In this paper, we investigate the problem of generalized few-shot learning (GFSL) -- a model during the deployment is required to learn about tail categories with few shots and simultaneously classify the head classes. We propose the ClAssifier SynThesis LEarning (CASTLE), a learning framework that learns how to synthesize calibrated few-shot classifiers in addition to the multi-class classifiers of head classes with a shared neural dictionary, shedding light upon the inductive GFSL. Furthermore, we propose an adaptive version of CASTLE (ACASTLE) that adapts the head classifiers conditioned on the incoming tail training examples, yielding a framework that allows effective backward knowledge transfer. As a consequence, ACASTLE can handle GFSL with classes from heterogeneous domains effectively. CASTLE and ACASTLE demonstrate superior performances than existing GFSL algorithms and strong baselines on MiniImageNet as well as TieredImageNet datasets. More interestingly, they outperform previous state-of-the-art methods when evaluated with standard few-shot learning criteria.
    Quasiconformal model with CNN features for large deformation image registration. (arXiv:2011.00731v3 [cs.CV] UPDATED)
    (2 min) Image registration has been widely studied over the past several decades, with numerous applications in science, engineering and medicine. Most of the conventional mathematical models for large deformation image registration rely on prescribed landmarks, which usually require tedious manual labeling and are prone to error. In recent years, there has been a surge of interest in the use of machine learning for image registration. In this paper, we develop a novel method for large deformation image registration by a fusion of quasiconformal theory and convolutional neural network (CNN). More specifically, we propose a quasiconformal energy model with a novel fidelity term that incorporates the features extracted using a pre-trained CNN, thereby allowing us to obtain meaningful registration results without any guidance of prescribed landmarks. Moreover, unlike many prior image registration methods, the bijectivity of our method is guaranteed by quasiconformal theory. Experimental results are presented to demonstrate the effectiveness of the proposed method. More broadly, our work sheds light on how rigorous mathematical theories and practical machine learning approaches can be integrated for developing computational methods with improved performance.
    Few-Shot Domain Expansion for Face Anti-Spoofing. (arXiv:2106.14162v1 [cs.CV])
    (2 min) Face anti-spoofing (FAS) is an indispensable and widely used module in face recognition systems. Although high accuracy has been achieved, a FAS system will never be perfect due to the non-stationary applied environments and the potential emergence of new types of presentation attacks in real-world applications. In practice, given a handful of labeled samples from a new deployment scenario (target domain) and abundant labeled face images in the existing source domain, the FAS system is expected to perform well in the new scenario without sacrificing the performance on the original domain. To this end, we identify and address a more practical problem: Few-Shot Domain Expansion for Face Anti-Spoofing (FSDE-FAS). This problem is challenging since with insufficient target domain training samples, the model may suffer from both overfitting to the target domain and catastrophic forgetting of the source domain. To address the problem, this paper proposes a Style transfer-based Augmentation for Semantic Alignment (SASA) framework. We propose to augment the target data by generating auxiliary samples based on photorealistic style transfer. With the assistant of the augmented data, we further propose a carefully designed mechanism to align different domains from both instance-level and distribution-level, and then stabilize the performance on the source domain with a less-forgetting constraint. Two benchmarks are proposed to simulate the FSDE-FAS scenarios, and the experimental results show that the proposed SASA method outperforms state-of-the-art methods.
    An XAI Approach to Deep Learning Models in the Detection of Ductal Carcinoma in Situ. (arXiv:2106.14186v1 [eess.IV])
    (2 min) During the last decade or so, there has been an insurgence in the deep learning community to solve health-related issues, particularly breast cancer. Following the Camelyon-16 challenge in 2016, several researchers have dedicated their time to build Convolutional Neural Networks (CNNs) to help radiologists and other clinicians diagnose breast cancer. In particular, there has been an emphasis on Ductal Carcinoma in Situ (DCIS); the clinical term for early-stage breast cancer. Large companies have given their fair share of research into this subject, among these Google Deepmind who developed a model in 2020 that has proven to be better than radiologists themselves to diagnose breast cancer correctly. We found that among the issues which exist, there is a need for an explanatory system that goes through the hidden layers of a CNN to highlight those pixels that contributed to the classification of a mammogram. We then chose an open-source, reasonably successful project developed by Prof. Shen, using the CBIS-DDSM image database to run our experiments on. It was later improved using the Resnet-50 and VGG-16 patch-classifiers, analytically comparing the outcome of both. The results showed that the Resnet-50 one converged earlier in the experiments. Following the research by Montavon and Binder, we used the DeepTaylor Layer-wise Relevance Propagation (LRP) model to highlight those pixels and regions within a mammogram which contribute most to its classification. This is represented as a map of those pixels in the original image, which contribute to the diagnosis and the extent to which they contribute to the final classification. The most significant advantage of this algorithm is that it performs exceptionally well with the Resnet-50 patch classifier architecture.
    Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection. (arXiv:2106.14447v1 [cs.CV])
    (2 min) With rapidly evolving internet technologies and emerging tools, sports related videos generated online are increasing at an unprecedentedly fast pace. To automate sports video editing/highlight generation process, a key task is to precisely recognize and locate the events in the long untrimmed videos. In this tech report, we present a two-stage paradigm to detect what and when events happen in soccer broadcast videos. Specifically, we fine-tune multiple action recognition models on soccer data to extract high-level semantic features, and design a transformer based temporal detection module to locate the target events. This approach achieved the state-of-the-art performance in both two tasks, i.e., action spotting and replay grounding, in the SoccerNet-v2 Challenge, under CVPR 2021 ActivityNet workshop. Our soccer embedding features are released at https://github.com/baidu-research/vidpress-sports. By sharing these features with the broader community, we hope to accelerate the research into soccer video understanding.
    Longitudinal Self-Supervised Learning. (arXiv:2006.06930v2 [cs.LG] UPDATED)
    (2 min) Machine learning analysis of longitudinal neuroimaging data is typically based on supervised learning, which requires a large number of ground-truth labels to be informative. As ground-truth labels are often missing or expensive to obtain in neuroscience, we avoid them in our analysis by combing factor disentanglement with self-supervised learning to identify changes and consistencies across the multiple MRIs acquired of each individual over time. Specifically, we propose a new definition of disentanglement by formulating a multivariate mapping between factors (e.g., brain age) associated with an MRI and a latent image representation. Then, factors that evolve across acquisitions of longitudinal sequences are disentangled from that mapping by self-supervised learning in such a way that changes in a single factor induce change along one direction in the representation space. We implement this model, named Longitudinal Self-Supervised Learning (LSSL), via a standard autoencoding structure with a cosine loss to disentangle brain age from the image representation. We apply LSSL to two longitudinal neuroimaging studies to highlight its strength in extracting the brain-age information from MRI and revealing informative characteristics associated with neurodegenerative and neuropsychological disorders. Moreover, the representations learned by LSSL facilitate supervised classification by recording faster convergence and higher (or similar) prediction accuracy compared to several other representation learning techniques.
    A Time-Delay Feedback Neural Network for Discriminating Small, Fast-Moving Targets in Complex Dynamic Environments. (arXiv:2001.05846v5 [cs.CV] UPDATED)
    (3 min) Discriminating small moving objects within complex visual environments is a significant challenge for autonomous micro robots that are generally limited in computational power. By exploiting their highly evolved visual systems, flying insects can effectively detect mates and track prey during rapid pursuits, even though the small targets equate to only a few pixels in their visual field. The high degree of sensitivity to small target movement is supported by a class of specialized neurons called small target motion detectors (STMDs). Existing STMD-based computational models normally comprise four sequentially arranged neural layers interconnected via feedforward loops to extract information on small target motion from raw visual inputs. However, feedback, another important regulatory circuit for motion perception, has not been investigated in the STMD pathway and its functional roles for small target motion detection are not clear. In this paper, we propose an STMD-based neural network with feedback connection (Feedback STMD), where the network output is temporally delayed, then fed back to the lower layers to mediate neural responses. We compare the properties of the model with and without the time-delay feedback loop, and find it shows preference for high-velocity objects. Extensive experiments suggest that the Feedback STMD achieves superior detection performance for fast-moving small targets, while significantly suppressing background false positive movements which display lower velocities. The proposed feedback model provides an effective solution in robotic visual systems for detecting fast-moving small targets that are always salient and potentially threatening.
    Recurrent neural network transducer for Japanese and Chinese offline handwritten text recognition. (arXiv:2106.14459v1 [cs.CV])
    (2 min) In this paper, we propose an RNN-Transducer model for recognizing Japanese and Chinese offline handwritten text line images. As far as we know, it is the first approach that adopts the RNN-Transducer model for offline handwritten text recognition. The proposed model consists of three main components: a visual feature encoder that extracts visual features from an input image by CNN and then encodes the visual features by BLSTM; a linguistic context encoder that extracts and encodes linguistic features from the input image by embedded layers and LSTM; and a joint decoder that combines and then decodes the visual features and the linguistic features into the final label sequence by fully connected and softmax layers. The proposed model takes advantage of both visual and linguistic information from the input image. In the experiments, we evaluated the performance of the proposed model on the two datasets: Kuzushiji and SCUT-EPT. Experimental results show that the proposed model achieves state-of-the-art performance on all datasets.
    Using deep learning to detect patients at risk for prostate cancer despite benign biopsies. (arXiv:2106.14256v1 [eess.IV])
    (3 min) Background: Transrectal ultrasound guided systematic biopsies of the prostate is a routine procedure to establish a prostate cancer diagnosis. However, the 10-12 prostate core biopsies only sample a relatively small volume of the prostate, and tumour lesions in regions between biopsy cores can be missed, leading to a well-known low sensitivity to detect clinically relevant cancer. As a proof-of-principle, we developed and validated a deep convolutional neural network model to distinguish between morphological patterns in benign prostate biopsy whole slide images from men with and without established cancer. Methods: This study included 14,354 hematoxylin and eosin stained whole slide images from benign prostate biopsies from 1,508 men in two groups: men without an established prostate cancer (PCa) diagnosis and men with at least one core biopsy diagnosed with PCa. 80% of the participants were assigned as training data and used for model optimization (1,211 men), and the remaining 20% (297 men) as a held-out test set used to evaluate model performance. An ensemble of 10 deep convolutional neural network models was optimized for classification of biopsies from men with and without established cancer. Hyperparameter optimization and model selection was performed by cross-validation in the training data . Results: Area under the receiver operating characteristic curve (ROC-AUC) was estimated as 0.727 (bootstrap 95% CI: 0.708-0.745) on biopsy level and 0.738 (bootstrap 95% CI: 0.682 - 0.796) on man level. At a specificity of 0.9 the model had an estimated sensitivity of 0.348. Conclusion: The developed model has the ability to detect men with risk of missed PCa due to under-sampling of the prostate. The proposed model has the potential to reduce the number of false negative cases in routine systematic prostate biopsies and to indicate men who could benefit from MRI-guided re-biopsy.
    Indoor Panorama Planar 3D Reconstruction via Divide and Conquer. (arXiv:2106.14166v1 [cs.CV])
    (2 min) Indoor panorama typically consists of human-made structures parallel or perpendicular to gravity. We leverage this phenomenon to approximate the scene in a 360-degree image with (H)orizontal-planes and (V)ertical-planes. To this end, we propose an effective divide-and-conquer strategy that divides pixels based on their plane orientation estimation; then, the succeeding instance segmentation module conquers the task of planes clustering more easily in each plane orientation group. Besides, parameters of V-planes depend on camera yaw rotation, but translation-invariant CNNs are less aware of the yaw change. We thus propose a yaw-invariant V-planar reparameterization for CNNs to learn. We create a benchmark for indoor panorama planar reconstruction by extending existing 360 depth datasets with ground truth H\&V-planes (referred to as PanoH&V dataset) and adopt state-of-the-art planar reconstruction methods to predict H\&V-planes as our baselines. Our method outperforms the baselines by a large margin on the proposed dataset.
    Blind Non-Uniform Motion Deblurring using Atrous Spatial Pyramid Deformable Convolution and Deblurring-Reblurring Consistency. (arXiv:2106.14336v1 [cs.CV])
    (2 min) Many deep learning based methods are designed to remove non-uniform (spatially variant) motion blur caused by object motion and camera shake without knowing the blur kernel. Some methods directly output the latent sharp image in one stage, while others utilize a multi-stage strategy (\eg multi-scale, multi-patch, or multi-temporal) to gradually restore the sharp image. However, these methods have the following two main issues: 1) The computational cost of multi-stage is high; 2) The same convolution kernel is applied in different regions, which is not an ideal choice for non-uniform blur. Hence, non-uniform motion deblurring is still a challenging and open problem. In this paper, we propose a new architecture which consists of multiple Atrous Spatial Pyramid Deformable Convolution (ASPDC) modules to deblur an image end-to-end with more flexibility. Multiple ASPDC modules implicitly learn the pixel-specific motion with different dilation rates in the same layer to handle movements of different magnitude. To improve the training, we also propose a reblurring network to map the deblurred output back to the blurred input, which constrains the solution space. Our experimental results show that the proposed method outperforms state-of-the-art methods on the benchmark datasets.
    Prior-Induced Information Alignment for Image Matting. (arXiv:2106.14439v1 [cs.CV])
    (2 min) Image matting is an ill-posed problem that aims to estimate the opacity of foreground pixels in an image. However, most existing deep learning-based methods still suffer from the coarse-grained details. In general, these algorithms are incapable of felicitously distinguishing the degree of exploration between deterministic domains (certain FG and BG pixels) and undetermined domains (uncertain in-between pixels), or inevitably lose information in the continuous sampling process, leading to a sub-optimal result. In this paper, we propose a novel network named Prior-Induced Information Alignment Matting Network (PIIAMatting), which can efficiently model the distinction of pixel-wise response maps and the correlation of layer-wise feature maps. It mainly consists of a Dynamic Gaussian Modulation mechanism (DGM) and an Information Alignment strategy (IA). Specifically, the DGM can dynamically acquire a pixel-wise domain response map learned from the prior distribution. The response map can present the relationship between the opacity variation and the convergence process during training. On the other hand, the IA comprises an Information Match Module (IMM) and an Information Aggregation Module (IAM), jointly scheduled to match and aggregate the adjacent layer-wise features adaptively. Besides, we also develop a Multi-Scale Refinement (MSR) module to integrate multi-scale receptive field information at the refinement stage to recover the fluctuating appearance details. Extensive quantitative and qualitative evaluations demonstrate that the proposed PIIAMatting performs favourably against state-of-the-art image matting methods on the Alphamatting.com, Composition-1K and Distinctions-646 dataset.
    MTrans: Multi-Modal Transformer for Accelerated MR Imaging. (arXiv:2106.14248v1 [eess.IV])
    (2 min) Accelerating multi-modal magnetic resonance (MR) imaging is a new and effective solution for fast MR imaging, providing superior performance in restoring the target modality from its undersampled counterpart with guidance from an auxiliary modality. However, existing works simply introduce the auxiliary modality as prior information, lacking in-depth investigations on the potential mechanisms for fusing two modalities. Further, they usually rely on the convolutional neural networks (CNNs), which focus on local information and prevent them from fully capturing the long-distance dependencies of global knowledge. To this end, we propose a multi-modal transformer (MTrans), which is capable of transferring multi-scale features from the target modality to the auxiliary modality, for accelerated MR imaging. By restructuring the transformer architecture, our MTrans gains a powerful ability to capture deep multi-modal information. More specifically, the target modality and the auxiliary modality are first split into two branches and then fused using a multi-modal transformer module. This module is based on an improved multi-head attention mechanism, named the cross attention module, which absorbs features from the auxiliary modality that contribute to the target modality. Our framework provides two appealing benefits: (i) MTrans is the first attempt at using improved transformers for multi-modal MR imaging, affording more global information compared with CNN-based methods. (ii) A new cross attention module is proposed to exploit the useful information in each branch at different scales. It affords both distinct structural information and subtle pixel-level information, which supplement the target modality effectively.
    SDOF-Tracker: Fast and Accurate Multiple Human Tracking by Skipped-Detection and Optical-Flow. (arXiv:2106.14259v1 [cs.CV])
    (2 min) Multiple human tracking is a fundamental problem for scene understanding. Although both accuracy and speed are required in real-world applications, recent tracking methods based on deep learning have focused on accuracy and require substantial running time. This study aims to improve running speed by performing human detection at a certain frame interval because it accounts for most of the running time. The question is how to maintain accuracy while skipping human detection. In this paper, we propose a method that complements the detection results with optical flow, based on the fact that someone's appearance does not change much between adjacent frames. To maintain the tracking accuracy, we introduce robust interest point selection within human regions and a tracking termination metric calculated by the distribution of the interest points. On the MOT20 dataset in the MOTChallenge, the proposed SDOF-Tracker achieved the best performance in terms of the total running speed while maintaining the MOTA metric. Our code is available at https://anonymous.4open.science/r/sdof-tracker-75AE.
    Mitigating severe over-parameterization in deep convolutional neural networks through forced feature abstraction and compression with an entropy-based heuristic. (arXiv:2106.14190v1 [cs.CV])
    (2 min) Convolutional Neural Networks (CNNs) such as ResNet-50, DenseNet-40 and ResNeXt-56 are severely over-parameterized, necessitating a consequent increase in the computational resources required for model training which scales exponentially for increments in model depth. In this paper, we propose an Entropy-Based Convolutional Layer Estimation (EBCLE) heuristic which is robust and simple, yet effective in resolving the problem of over-parameterization with regards to network depth of CNN model. The EBCLE heuristic employs a priori knowledge of the entropic data distribution of input datasets to determine an upper bound for convolutional network depth, beyond which identity transformations are prevalent offering insignificant contributions for enhancing model performance. Restricting depth redundancies by forcing feature compression and abstraction restricts over-parameterization while decreasing training time by 24.99% - 78.59% without degradation in model performance. We present empirical evidence to emphasize the relative effectiveness of broader, yet shallower models trained using the EBCLE heuristic, which maintains or outperforms baseline classification accuracies of narrower yet deeper models. The EBCLE heuristic is architecturally agnostic and EBCLE based CNN models restrict depth redundancies resulting in enhanced utilization of the available computational resources. The proposed EBCLE heuristic is a compelling technique for researchers to analytically justify their HyperParameter (HP) choices for CNNs. Empirical validation of the EBCLE heuristic in training CNN models was established on five benchmarking datasets (ImageNet32, CIFAR-10/100, STL-10, MNIST) and four network architectures (DenseNet, ResNet, ResNeXt and EfficientNet B0-B2) with appropriate statistical tests employed to infer any conclusive claims presented in this paper.
    Memory Guided Road Detection. (arXiv:2106.14184v1 [cs.CV])
    (2 min) In self driving car applications, there is a requirement to predict the location of the lane given an input RGB front facing image. In this paper, we propose an architecture that allows us to increase the speed and robustness of road detection without a large hit in accuracy by introducing an underlying shared feature space that is propagated over time, which serves as a flowing dynamic memory. By utilizing the gist of previous frames, we train the network to predict the current road with a greater accuracy and lesser deviation from previous frames.
    Learning stochastic object models from medical imaging measurements by use of advanced AmbientGANs. (arXiv:2106.14324v1 [eess.IV])
    (2 min) In order to objectively assess new medical imaging technologies via computer-simulations, it is important to account for all sources of variability that contribute to image data. One important source of variability that can significantly limit observer performance is associated with the variability in the ensemble of objects to-be-imaged. This source of variability can be described by stochastic object models (SOMs), which are generative models that can be employed to sample from a distribution of to-be-virtually-imaged objects. It is generally desirable to establish SOMs from experimental imaging measurements acquired by use of a well-characterized imaging system, but this task has remained challenging. Deep generative neural networks, such as generative adversarial networks (GANs) hold potential for such tasks. To establish SOMs from imaging measurements, an AmbientGAN has been proposed that augments a GAN with a measurement operator. However, the original AmbientGAN could not immediately benefit from modern training procedures and GAN architectures, which limited its ability to be applied to realistically sized medical image data. To circumvent this, in this work, a modified AmbientGAN training strategy is proposed that is suitable for modern progressive or multi-resolution training approaches such as employed in the Progressive Growing of GANs and Style-based GANs. AmbientGANs established by use of the proposed training procedure are systematically validated in a controlled way by use of computer-simulated measurement data corresponding to a stylized imaging system. Finally, emulated single-coil experimental magnetic resonance imaging data are employed to demonstrate the methods under less stylized conditions.
    Deep Learning Image Recognition for Non-images. (arXiv:2106.14350v1 [cs.LG])
    (2 min) Powerful deep learning algorithms open an opportunity for solving non-image Machine Learning (ML) problems by transforming these problems to into the image recognition problems. The CPC-R algorithm presented in this chapter converts non-image data into images by visualizing non-image data. Then deep learning CNN algorithms solve the learning problems on these images. The design of the CPC-R algorithm allows preserving all high-dimensional information in 2-D images. The use of pair values mapping instead of single value mapping used in the alternative approaches allows encoding each n-D point with 2 times fewer visual elements. The attributes of an n-D point are divided into pairs of its values and each pair is visualized as 2-D points in the same 2-D Cartesian coordinates. Next, grey scale or color intensity values are assigned to each pair to encode the order of pairs. This is resulted in the heatmap image. The computational experiments with CPC-R are conducted for different CNN architectures, and methods to optimize the CPC-R images showing that the combined CPC-R and deep learning CNN algorithms are able to solve non-image ML problems reaching high accuracy on the benchmark datasets. This chapter expands our prior work by adding more experiments to test accuracy of classification, exploring saliency and informativeness of discovered features to test their interpretability, and generalizing the approach.
    The Deep Neural Network based Photometry Framework for Wide Field Small Aperture Telescopes. (arXiv:2106.14349v1 [astro-ph.IM])
    (2 min) Wide field small aperture telescopes (WFSATs) are mainly used to obtain scientific information of point--like and streak--like celestial objects. However, qualities of images obtained by WFSATs are seriously affected by the background noise and variable point spread functions. Developing high speed and high efficiency data processing method is of great importance for further scientific research. In recent years, deep neural networks have been proposed for detection and classification of celestial objects and have shown better performance than classical methods. In this paper, we further extend abilities of the deep neural network based astronomical target detection framework to make it suitable for photometry and astrometry. We add new branches into the deep neural network to obtain types, magnitudes and positions of different celestial objects at the same time. Tested with simulated data, we find that our neural network has better performance in photometry than classical methods. Because photometry and astrometry are regression algorithms, which would obtain high accuracy measurements instead of rough classification results, the accuracy of photometry and astrometry results would be affected by different observation conditions. To solve this problem, we further propose to use reference stars to train our deep neural network with transfer learning strategy when observation conditions change. The photometry framework proposed in this paper could be used as an end--to--end quick data processing framework for WFSATs, which can further increase response speed and scientific outputs of WFSATs.
    Representation Based Regression for Object Distance Estimation. (arXiv:2106.14208v1 [cs.CV])
    (2 min) In this study, we propose a novel approach to predict the distances of the detected objects in an observed scene. The proposed approach modifies the recently proposed Convolutional Support Estimator Networks (CSENs). CSENs are designed to compute a direct mapping for the Support Estimation (SE) task in a representation-based classification problem. We further propose and demonstrate that representation-based methods (sparse or collaborative representation) can be used in well-designed regression problems. To the best of our knowledge, this is the first representation-based method proposed for performing a regression task by utilizing the modified CSENs; and hence, we name this novel approach as Representation-based Regression (RbR). The initial version of CSENs has a proxy mapping stage (i.e., a coarse estimation for the support set) that is required for the input. In this study, we improve the CSEN model by proposing Compressive Learning CSEN (CL-CSEN) that has the ability to jointly optimize the so-called proxy mapping stage along with convolutional layers. The experimental evaluations using the KITTI 3D Object Detection distance estimation dataset show that the proposed method can achieve a significantly improved distance estimation performance over all competing methods. Finally, the software implementations of the methods are publicly shared at https://github.com/meteahishali/CSENDistance.
    Learning Mesh Representations via Binary Space Partitioning Tree Networks. (arXiv:2106.14274v1 [cs.CV])
    (2 min) Polygonal meshes are ubiquitous, but have only played a relatively minor role in the deep learning revolution. State-of-the-art neural generative models for 3D shapes learn implicit functions and generate meshes via expensive iso-surfacing. We overcome these challenges by employing a classical spatial data structure from computer graphics, Binary Space Partitioning (BSP), to facilitate 3D learning. The core operation of BSP involves recursive subdivision of 3D space to obtain convex sets. By exploiting this property, we devise BSP-Net, a network that learns to represent a 3D shape via convex decomposition without supervision. The network is trained to reconstruct a shape using a set of convexes obtained from a BSP-tree built over a set of planes, where the planes and convexes are both defined by learned network weights. BSP-Net directly outputs polygonal meshes from the inferred convexes. The generated meshes are watertight, compact (i.e., low-poly), and well suited to represent sharp geometry. We show that the reconstruction quality by BSP-Net is competitive with those from state-of-the-art methods while using much fewer primitives. We also explore variations to BSP-Net including using a more generic decoder for reconstruction, more general primitives than planes, as well as training a generative model with variational auto-encoders. Code is available at https://github.com/czq142857/BSP-NET-original.
    Image content dependent semi-fragile watermarking with localized tamper detection. (arXiv:2106.14150v1 [cs.CV])
    (2 min) Content-independent watermarks and block-wise independency can be considered as vulnerabilities in semi-fragile watermarking methods. In this paper to achieve the objectives of semi-fragile watermarking techniques, a method is proposed to not have the mentioned shortcomings. In the proposed method, the watermark is generated by relying on image content and a key. Furthermore, the embedding scheme causes the watermarked blocks to become dependent on each other, using a key. In the embedding phase, the image is partitioned into non-overlapping blocks. In order to detect and separate the different types of attacks more precisely, the proposed method embeds three copies of each watermark bit into LWT coefficients of each 4x4 block. In the authentication phase, by voting between the extracted bits the error maps are created; these maps indicate image authenticity and reveal the modified regions. Also, in order to automate the authentication, the images are classified into four categories using seven features. Classification accuracy in the experiments is 97.97 percent. It is noted that our experiments demonstrate that the proposed method is robust against JPEG compression and is competitive with a state-of-the-art semi-fragile watermarking method, in terms of robustness and semi-fragility.
    Geometric Processing for Image-based 3D Object Modeling. (arXiv:2106.14307v1 [cs.CV])
    (2 min) Image-based 3D object modeling refers to the process of converting raw optical images to 3D digital representations of the objects. Very often, such models are desired to be dimensionally true, semantically labeled with photorealistic appearance (reality-based modeling). Laser scanning was deemed as the standard (and direct) way to obtaining highly accurate 3D measurements of objects, while one would have to abide the high acquisition cost and its unavailability on some of the platforms. Nowadays the image-based methods backboned by the recently developed advanced dense image matching algorithms and geo-referencing paradigms, are becoming the dominant approaches, due to its high flexibility, availability and low cost. The largely automated geometric processing of images in a 3D object reconstruction workflow, from ordered/unordered raw imagery to textured meshes, is becoming a critical part of the reality-based 3D modeling. This article summarizes the overall geometric processing workflow, with focuses on introducing the state-of-the-art methods of three major components of geometric processing: 1) geo-referencing; 2) Image dense matching 3) texture mapping. Finally, we will draw conclusions and share our outlooks of the topics discussed in this article.
    Multi-Compound Transformer for Accurate Biomedical Image Segmentation. (arXiv:2106.14385v1 [cs.CV])
    (2 min) The recent vision transformer(i.e.for image classification) learns non-local attentive interaction of different patch tokens. However, prior arts miss learning the cross-scale dependencies of different pixels, the semantic correspondence of different labels, and the consistency of the feature representations and semantic embeddings, which are critical for biomedical segmentation. In this paper, we tackle the above issues by proposing a unified transformer network, termed Multi-Compound Transformer (MCTrans), which incorporates rich feature learning and semantic structure mining into a unified framework. Specifically, MCTrans embeds the multi-scale convolutional features as a sequence of tokens and performs intra- and inter-scale self-attention, rather than single-scale attention in previous works. In addition, a learnable proxy embedding is also introduced to model semantic relationship and feature enhancement by using self-attention and cross-attention, respectively. MCTrans can be easily plugged into a UNet-like network and attains a significant improvement over the state-of-the-art methods in biomedical image segmentation in six standard benchmarks. For example, MCTrans outperforms UNet by 3.64%, 3.71%, 4.34%, 2.8%, 1.88%, 1.57% in Pannuke, CVC-Clinic, CVC-Colon, Etis, Kavirs, ISIC2018 dataset, respectively. Code is available at https://github.com/JiYuanFeng/MCTrans.
    Rail-5k: a Real-World Dataset for Rail Surface Defects Detection. (arXiv:2106.14366v1 [cs.CV])
    (2 min) This paper presents the Rail-5k dataset for benchmarking the performance of visual algorithms in a real-world application scenario, namely the rail surface defects detection task. We collected over 5k high-quality images from railways across China, and annotated 1100 images with the help from railway experts to identify the most common 13 types of rail defects. The dataset can be used for two settings both with unique challenges, the first is the fully-supervised setting using the 1k+ labeled images for training, fine-grained nature and long-tailed distribution of defect classes makes it hard for visual algorithms to tackle. The second is the semi-supervised learning setting facilitated by the 4k unlabeled images, these 4k images are uncurated containing possible image corruptions and domain shift with the labeled images, which can not be easily tackle by previous semi-supervised learning methods. We believe our dataset could be a valuable benchmark for evaluating robustness and reliability of visual algorithms.
    Generalized Zero-Shot Learning using Multimodal Variational Auto-Encoder with Semantic Concepts. (arXiv:2106.14082v1 [cs.CV])
    (2 min) With the ever-increasing amount of data, the central challenge in multimodal learning involves limitations of labelled samples. For the task of classification, techniques such as meta-learning, zero-shot learning, and few-shot learning showcase the ability to learn information about novel classes based on prior knowledge. Recent techniques try to learn a cross-modal mapping between the semantic space and the image space. However, they tend to ignore the local and global semantic knowledge. To overcome this problem, we propose a Multimodal Variational Auto-Encoder (M-VAE) which can learn the shared latent space of image features and the semantic space. In our approach we concatenate multimodal data to a single embedding before passing it to the VAE for learning the latent space. We propose the use of a multi-modal loss during the reconstruction of the feature embedding through the decoder. Our approach is capable to correlating modalities and exploit the local and global semantic knowledge for novel sample predictions. Our experimental results using a MLP classifier on four benchmark datasets show that our proposed model outperforms the current state-of-the-art approaches for generalized zero-shot learning.
    The Story in Your Eyes: An Individual-difference-aware Model for Cross-person Gaze Estimation. (arXiv:2106.14183v1 [cs.CV])
    (2 min) We propose a novel method on refining cross-person gaze prediction task with eye/face images only by explicitly modelling the person-specific differences. Specifically, we first assume that we can obtain some initial gaze prediction results with existing method, which we refer to as InitNet, and then introduce three modules, the Validity Module (VM), Self-Calibration (SC) and Person-specific Transform (PT)) Module. By predicting the reliability of current eye/face images, our VM is able to identify invalid samples, e.g. eye blinking images, and reduce their effects in our modelling process. Our SC and PT module then learn to compensate for the differences on valid samples only. The former models the translation offsets by bridging the gap between initial predictions and dataset-wise distribution. And the later learns more general person-specific transformation by incorporating the information from existing initial predictions of the same person. We validate our ideas on three publicly available datasets, EVE, XGaze and MPIIGaze and demonstrate that our proposed method outperforms the SOTA methods significantly on all of them, e.g. respectively 21.7%, 36.0% and 32.9% relative performance improvements. We won the GAZE 2021 Competition on the EVE dataset. Our code can be found here https://github.com/bjj9/EVE_SCPT.
    Knee Osteoarthritis Severity Prediction using an Attentive Multi-Scale Deep Convolutional Neural Network. (arXiv:2106.14292v1 [eess.IV])
    (2 min) Knee Osteoarthritis (OA) is a destructive joint disease identified by joint stiffness, pain, and functional disability concerning millions of lives across the globe. It is generally assessed by evaluating physical symptoms, medical history, and other joint screening tests like radiographs, Magnetic Resonance Imaging (MRI), and Computed Tomography (CT) scans. Unfortunately, the conventional methods are very subjective, which forms a barrier in detecting the disease progression at an early stage. This paper presents a deep learning-based framework, namely OsteoHRNet, that automatically assesses the Knee OA severity in terms of Kellgren and Lawrence (KL) grade classification from X-rays. As a primary novelty, the proposed approach is built upon one of the most recent deep models, called the High-Resolution Network (HRNet), to capture the multi-scale features of knee X-rays. In addition, we have also incorporated an attention mechanism to filter out the counterproductive features and boost the performance further. Our proposed model has achieved the best multiclass accuracy of 71.74% and MAE of 0.311 on the baseline cohort of the OAI dataset, which is a remarkable gain over the existing best-published works. We have also employed the Gradient-based Class Activation Maps (Grad-CAMs) visualization to justify the proposed network learning.
    3D Reconstruction through Fusion of Cross-View Images. (arXiv:2106.14306v1 [cs.CV])
    (2 min) 3D recovery from multi-stereo and stereo images, as an important application of the image-based perspective geometry, serves many applications in computer vision, remote sensing and Geomatics. In this chapter, the authors utilize the imaging geometry and present approaches that perform 3D reconstruction from cross-view images that are drastically different in their viewpoints. We introduce our framework that takes ground-view images and satellite images for full 3D recovery, which includes necessary methods in satellite and ground-based point cloud generation from images, 3D data co-registration, fusion and mesh generation. We demonstrate our proposed framework on a dataset consisting of twelve satellite images and 150k video frames acquired through a vehicle-mounted Go-pro camera and demonstrate the reconstruction results. We have also compared our results with results generated from an intuitive processing pipeline that involves typical geo-registration and meshing methods.
    Co$^2$L: Contrastive Continual Learning. (arXiv:2106.14413v1 [cs.LG])
    (2 min) Recent breakthroughs in self-supervised learning show that such algorithms learn visual representations that can be transferred better to unseen tasks than joint-training methods relying on task-specific supervision. In this paper, we found that the similar holds in the continual learning con-text: contrastively learned representations are more robust against the catastrophic forgetting than jointly trained representations. Based on this novel observation, we propose a rehearsal-based continual learning algorithm that focuses on continually learning and maintaining transferable representations. More specifically, the proposed scheme (1) learns representations using the contrastive learning objective, and (2) preserves learned representations using a self-supervised distillation step. We conduct extensive experimental validations under popular benchmark image classification datasets, where our method sets the new state-of-the-art performance.
    Disentangling semantic features of macromolecules in Cryo-Electron Tomography. (arXiv:2106.14192v1 [q-bio.BM])
    (2 min) Cryo-electron tomography (Cryo-ET) is a 3D imaging technique that enables the systemic study of shape, abundance, and distribution of macromolecular structures in single cells in near-atomic resolution. However, the systematic and efficient $\textit{de novo}$ recognition and recovery of macromolecular structures captured by Cryo-ET are very challenging due to the structural complexity and imaging limits. Even macromolecules with identical structures have various appearances due to different orientations and imaging limits, such as noise and the missing wedge effect. Explicitly disentangling the semantic features of macromolecules is crucial for performing several downstream analyses on the macromolecules. This paper has addressed the problem by proposing a 3D Spatial Variational Autoencoder that explicitly disentangle the structure, orientation, and shift of macromolecules. Extensive experiments on both synthesized and real cryo-ET datasets and cross-domain evaluations demonstrate the efficacy of our method.
    Semi-Supervised Raw-to-Raw Mapping. (arXiv:2106.13883v1 [cs.CV])
    (2 min) The raw-RGB colors of a camera sensor vary due to the spectral sensitivity differences across different sensor makes and models. This paper focuses on the task of mapping between different sensor raw-RGB color spaces. Prior work addressed this problem using a pairwise calibration to achieve accurate color mapping. Although being accurate, this approach is less practical as it requires: (1) capturing pair of images by both camera devices with a color calibration object placed in each new scene; (2) accurate image alignment or manual annotation of the color calibration object. This paper aims to tackle color mapping in the raw space through a more practical setup. Specifically, we present a semi-supervised raw-to-raw mapping method trained on a small set of paired images alongside an unpaired set of images captured by each camera device. Through extensive experiments, we show that our method achieves better results compared to other domain adaptation alternatives in addition to the single-calibration solution. We have generated a new dataset of raw images from two different smartphone cameras as part of this effort. Our dataset includes unpaired and paired sets for our semi-supervised training and evaluation.
    Deep Learning for Technical Document Classification. (arXiv:2106.14269v1 [cs.LG])
    (2 min) In large technology companies, the requirements for managing and organizing technical documents created by engineers and managers in supporting relevant decision making have increased dramatically in recent years, which has led to a higher demand for more scalable, accurate, and automated document classification. Prior studies have primarily focused on processing text for classification and small-scale databases. This paper describes a novel multimodal deep learning architecture, called TechDoc, for technical document classification, which utilizes both natural language and descriptive images to train hierarchical classifiers. The architecture synthesizes convolutional neural networks and recurrent neural networks through an integrated training process. We applied the architecture to a large multimodal technical document database and trained the model for classifying documents based on the hierarchical International Patent Classification system. Our results show that the trained neural network presents a greater classification accuracy than those using a single modality and several earlier text classification methods. The trained model can potentially be scaled to millions of real-world technical documents with both text and figures, which is useful for data and knowledge management in large technology companies and organizations.
    Learning without Forgetting for 3D Point Cloud Objects. (arXiv:2106.14275v1 [cs.CV])
    (2 min) When we fine-tune a well-trained deep learning model for a new set of classes, the network learns new concepts but gradually forgets the knowledge of old training. In some real-life applications, we may be interested in learning new classes without forgetting the capability of previous experience. Such learning without forgetting problem is often investigated using 2D image recognition tasks. In this paper, considering the growth of depth camera technology, we address the same problem for the 3D point cloud object data. This problem becomes more challenging in the 3D domain than 2D because of the unavailability of large datasets and powerful pretrained backbone models. We investigate knowledge distillation techniques on 3D data to reduce catastrophic forgetting of the previous training. Moreover, we improve the distillation process by using semantic word vectors of object classes. We observe that exploring the interrelation of old and new knowledge during training helps to learn new concepts without forgetting old ones. Experimenting on three 3D point cloud recognition backbones (PointNet, DGCNN, and PointConv) and synthetic (ModelNet40, ModelNet10) and real scanned (ScanObjectNN) datasets, we establish new baseline results on learning without forgetting for 3D data. This research will instigate many future works in this area.
    Learning to solve geometric construction problems from images. (arXiv:2106.14195v1 [cs.CV])
    (2 min) We describe a purely image-based method for finding geometric constructions with a ruler and compass in the Euclidea geometric game. The method is based on adapting the Mask R-CNN state-of-the-art image processing neural architecture and adding a tree-based search procedure to it. In a supervised setting, the method learns to solve all 68 kinds of geometric construction problems from the first six level packs of Euclidea with an average 92% accuracy. When evaluated on new kinds of problems, the method can solve 31 of the 68 kinds of Euclidea problems. We believe that this is the first time that a purely image-based learning has been trained to solve geometric construction problems of this difficulty.
    Dual-Stream Reciprocal Disentanglement Learning for Domain Adaption Person Re-Identification. (arXiv:2106.13929v1 [cs.CV])
    (2 min) Since human-labeled samples are free for the target set, unsupervised person re-identification (Re-ID) has attracted much attention in recent years, by additionally exploiting the source set. However, due to the differences on camera styles, illumination and backgrounds, there exists a large gap between source domain and target domain, introducing a great challenge on cross-domain matching. To tackle this problem, in this paper we propose a novel method named Dual-stream Reciprocal Disentanglement Learning (DRDL), which is quite efficient in learning domain-invariant features. In DRDL, two encoders are first constructed for id-related and id-unrelated feature extractions, which are respectively measured by their associated classifiers. Furthermore, followed by an adversarial learning strategy, both streams reciprocally and positively effect each other, so that the id-related features and id-unrelated features are completely disentangled from a given image, allowing the encoder to be powerful enough to obtain the discriminative but domain-invariant features. In contrast to existing approaches, our proposed method is free from image generation, which not only reduces the computational complexity remarkably, but also removes redundant information from id-related features. Extensive experiments substantiate the superiority of our proposed method compared with the state-of-the-arts. The source code has been released in https://github.com/lhf12278/DRDL.
    EARLIN: Early Out-of-Distribution Detection for Resource-efficient Collaborative Inference. (arXiv:2106.13842v1 [cs.CV])
    (2 min) Collaborative inference enables resource-constrained edge devices to make inferences by uploading inputs (e.g., images) to a server (i.e., cloud) where the heavy deep learning models run. While this setup works cost-effectively for successful inferences, it severely underperforms when the model faces input samples on which the model was not trained (known as Out-of-Distribution (OOD) samples). If the edge devices could, at least, detect that an input sample is an OOD, that could potentially save communication and computation resources by not uploading those inputs to the server for inference workload. In this paper, we propose a novel lightweight OOD detection approach that mines important features from the shallow layers of a pretrained CNN model and detects an input sample as ID (In-Distribution) or OOD based on a distance function defined on the reduced feature space. Our technique (a) works on pretrained models without any retraining of those models, and (b) does not expose itself to any OOD dataset (all detection parameters are obtained from the ID training dataset). To this end, we develop EARLIN (EARLy OOD detection for Collaborative INference) that takes a pretrained model and partitions the model at the OOD detection layer and deploys the considerably small OOD part on an edge device and the rest on the cloud. By experimenting using real datasets and a prototype implementation, we show that our technique achieves better results than other approaches in terms of overall accuracy and cost when tested against popular OOD datasets on top of popular deep learning models pretrained on benchmark datasets.
    DenseTNT: Waymo Open Dataset Motion Prediction Challenge 1st Place Solution. (arXiv:2106.14160v1 [cs.CV])
    (2 min) In autonomous driving, goal-based multi-trajectory prediction methods are proved to be effective recently, where they first score goal candidates, then select a final set of goals, and finally complete trajectories based on the selected goals. However, these methods usually involve goal predictions based on sparse predefined anchors. In this work, we propose an anchor-free model, named DenseTNT, which performs dense goal probability estimation for trajectory prediction. Our model achieves state-of-the-art performance, and ranks 1st on the Waymo Open Dataset Motion Prediction Challenge.
    Change Detection for Geodatabase Updating. (arXiv:2106.14309v1 [cs.CV])
    (2 min) The geodatabase (vectorized data) nowadays becomes a rather standard digital city infrastructure; however, updating geodatabase efficiently and economically remains a fundamental and practical issue in the geospatial industry. The cost of building a geodatabase is extremely high and labor intensive, and very often the maps we use have several months and even years of latency. One solution is to develop more automated methods for (vectorized) geospatial data generation, which has been proven a difficult task in the past decades. An alternative solution is to first detect the differences between the new data and the existing geospatial data, and then only update the area identified as changes. The second approach is becoming more favored due to its high practicality and flexibility. A highly relevant technique is change detection. This article aims to provide an overview the state-of-the-art change detection methods in the field of Remote Sensing and Geomatics to support the task of updating geodatabases. Data used for change detection are highly disparate, we therefore structure our review intuitively based on the dimension of the data, being 1) change detection with 2D data; 2) change detection with 3D data. Conclusions will be drawn based on the reviewed efforts in the field, and we will share our outlooks of the topic of updating geodatabases.
    Real-time 3D Object Detection using Feature Map Flow. (arXiv:2106.14101v1 [cs.CV])
    (2 min) In this paper, we present a real-time 3D detection approach considering time-spatial feature map aggregation from different time steps of deep neural model inference (named feature map flow, FMF). Proposed approach improves the quality of 3D detection center-based baseline and provides real-time performance on the nuScenes and Waymo benchmark. Code is available at https://github.com/YoushaaMurhij/FMFNet
    Interflow: Aggregating Multi-layer Feature Mappings with Attention Mechanism. (arXiv:2106.14073v1 [cs.CV])
    (2 min) Traditionally, CNN models possess hierarchical structures and utilize the feature mapping of the last layer to obtain the prediction output. However, it can be difficulty to settle the optimal network depth and make the middle layers learn distinguished features. This paper proposes the Interflow algorithm specially for traditional CNN models. Interflow divides CNNs into several stages according to the depth and makes predictions by the feature mappings in each stage. Subsequently, we input these prediction branches into a well-designed attention module, which learns the weights of these prediction branches, aggregates them and obtains the final output. Interflow weights and fuses the features learned in both shallower and deeper layers, making the feature information at each stage processed reasonably and effectively, enabling the middle layers to learn more distinguished features, and enhancing the model representation ability. In addition, Interflow can alleviate gradient vanishing problem, lower the difficulty of network depth selection, and lighten possible over-fitting problem by introducing attention mechanism. Besides, it can avoid network degradation as a byproduct. Compared with the original model, the CNN model with Interflow achieves higher test accuracy on multiple benchmark datasets.
    A Machine Learning Model for Early Detection of Diabetic Foot using Thermogram Images. (arXiv:2106.14207v1 [eess.IV])
    (2 min) Diabetes foot ulceration (DFU) and amputation are a cause of significant morbidity. The prevention of DFU may be achieved by the identification of patients at risk of DFU and the institution of preventative measures through education and offloading. Several studies have reported that thermogram images may help to detect an increase in plantar temperature prior to DFU. However, the distribution of plantar temperature may be heterogeneous, making it difficult to quantify and utilize to predict outcomes. We have compared a machine learning-based scoring technique with feature selection and optimization techniques and learning classifiers to several state-of-the-art Convolutional Neural Networks (CNNs) on foot thermogram images and propose a robust solution to identify the diabetic foot. A comparatively shallow CNN model, MobilenetV2 achieved an F1 score of ~95% for a two-feet thermogram image-based classification and the AdaBoost Classifier used 10 features and achieved an F1 score of 97 %. A comparison of the inference time for the best-performing networks confirmed that the proposed algorithm can be deployed as a smartphone application to allow the user to monitor the progression of the DFU in a home setting.
    Semi-Supervised Deep Ensembles for Blind Image Quality Assessment. (arXiv:2106.14008v1 [cs.CV])
    (2 min) Ensemble methods are generally regarded to be better than a single model if the base learners are deemed to be "accurate" and "diverse." Here we investigate a semi-supervised ensemble learning strategy to produce generalizable blind image quality assessment models. We train a multi-head convolutional network for quality prediction by maximizing the accuracy of the ensemble (as well as the base learners) on labeled data, and the disagreement (i.e., diversity) among them on unlabeled data, both implemented by the fidelity loss. We conduct extensive experiments to demonstrate the advantages of employing unlabeled data for BIQA, especially in model generalization and failure identification.
    Machine Learning Detection Algorithm for Large Barkhausen Jumps in Cluttered Environment. (arXiv:2106.14148v1 [cs.LG])
    (2 min) Modern magnetic sensor arrays conventionally utilize state of the art low power magnetometers such as parallel and orthogonal fluxgates. Low power fluxgates tend to have large Barkhausen jumps that appear as a dc jump in the fluxgate output. This phenomenon deteriorates the signal fidelity and effectively increases the internal sensor noise. Even if sensors that are more prone to dc jumps can be screened during production, the conventional noise measurement does not always catch the dc jump because of its sparsity. Moreover, dc jumps persist in almost all the sensor cores although at a slower but still intolerable rate. Even if dc jumps can be easily detected in a shielded environment, when deployed in presence of natural noise and clutter, it can be hard to positively detect them. This work fills this gap and presents algorithms that distinguish dc jumps embedded in natural magnetic field data. To improve robustness to noise, we developed two machine learning algorithms that employ temporal and statistical physical-based features of a pre-acquired and well-known experimental data set. The first algorithm employs a support vector machine classifier, while the second is based on a neural network architecture. We compare these new approaches to a more classical kernel-based method. To that purpose, the receiver operating characteristic curve is generated, which allows diagnosis ability of the different classifiers by comparing their performances across various operation points. The accuracy of the machine learning-based algorithms over the classic method is highly emphasized. In addition, high generalization and robustness of the neural network can be concluded, based on the rapid convergence of the corresponding receiver operating characteristic curves.
    DONet: Learning Category-Level 6D Object Pose and Size Estimation from Depth Observation. (arXiv:2106.14193v1 [cs.CV])
    (2 min) We propose a method of Category-level 6D Object Pose and Size Estimation (COPSE) from a single depth image, without external pose-annotated real-world training data. While previous works exploit visual cues in RGB(D) images, our method makes inferences based on the rich geometric information of the object in the depth channel alone. Essentially, our framework explores such geometric information by learning the unified 3D Orientation-Consistent Representations (3D-OCR) module, and further enforced by the property of Geometry-constrained Reflection Symmetry (GeoReS) module. The magnitude information of object size and the center point is finally estimated by Mirror-Paired Dimensional Estimation (MPDE) module. Extensive experiments on the category-level NOCS benchmark demonstrate that our framework competes with state-of-the-art approaches that require labeled real-world images. We also deploy our approach to a physical Baxter robot to perform manipulation tasks on unseen but category-known instances, and the results further validate the efficacy of our proposed model. Our videos are available in the supplementary material.
    Identifying High Accuracy Regions in Traffic Camera Images to Enhance the Estimation of Road Traffic Metrics: A Quadtree Based Method. (arXiv:2106.14049v1 [cs.CV])
    (2 min) The growing number of real-time camera feeds in urban areas has made it possible to provide high-quality traffic data for effective transportation planning, operations, and management. However, deriving reliable traffic metrics from these camera feeds has been a challenge due to the limitations of current vehicle detection techniques, as well as the various camera conditions such as height and resolution. In this work, a quadtree based algorithm is developed to continuously partition the image extent until only regions with high detection accuracy are remained. These regions are referred to as the high-accuracy identification regions (HAIR) in this paper. We demonstrate how the use of the HAIR can improve the accuracy of traffic density estimates using images from traffic cameras at different heights and resolutions in Central Ohio. Our experiments show that the proposed algorithm can be used to derive robust HAIR where vehicle detection accuracy is 41 percent higher than that in the original image extent. The use of the HAIR also significantly improves the traffic density estimation with an overall decrease of 49 percent in root mean squared error.
    Post-Training Quantization for Vision Transformer. (arXiv:2106.14156v1 [cs.CV])
    (2 min) Recently, transformer has achieved remarkable performance on a variety of computer vision applications. Compared with mainstream convolutional neural networks, vision transformers are often of sophisticated architectures for extracting powerful feature representations, which are more difficult to be developed on mobile devices. In this paper, we present an effective post-training quantization algorithm for reducing the memory storage and computational costs of vision transformers. Basically, the quantization task can be regarded as finding the optimal low-bit quantization intervals for weights and inputs, respectively. To preserve the functionality of the attention mechanism, we introduce a ranking loss into the conventional quantization objective that aims to keep the relative order of the self-attention results after quantization. Moreover, we thoroughly analyze the relationship between quantization loss of different layers and the feature diversity, and explore a mixed-precision quantization scheme by exploiting the nuclear norm of each attention map and output feature. The effectiveness of the proposed method is verified on several benchmark models and datasets, which outperforms the state-of-the-art post-training quantization algorithms. For instance, we can obtain an 81.29\% top-1 accuracy using DeiT-B model on ImageNet dataset with about 8-bit quantization.
    An Image Classifier Can Suffice Video Understanding. (arXiv:2106.14104v1 [cs.CV])
    (2 min) We propose a new perspective on video understanding by casting the video recognition problem as an image recognition task. We show that an image classifier alone can suffice for video understanding without temporal modeling. Our approach is simple and universal. It composes input frames into a super image to train an image classifier to fulfill the task of action recognition, in exactly the same way as classifying an image. We prove the viability of such an idea by demonstrating strong and promising performance on four public datasets including Kinetics400, Something-to-something (V2), MiT and Jester, using a recently developed vision transformer. We also experiment with the prevalent ResNet image classifiers in computer vision to further validate our idea. The results on Kinetics400 are comparable to some of the best-performed CNN approaches based on spatio-temporal modeling. our code and models will be made available at https://github.com/IBM/sifar-pytorch.
    CAMS: Color-Aware Multi-Style Transfer. (arXiv:2106.13920v1 [cs.CV])
    (2 min) Image style transfer aims to manipulate the appearance of a source image, or "content" image, to share similar texture and colors of a target "style" image. Ideally, the style transfer manipulation should also preserve the semantic content of the source image. A commonly used approach to assist in transferring styles is based on Gram matrix optimization. One problem of Gram matrix-based optimization is that it does not consider the correlation between colors and their styles. Specifically, certain textures or structures should be associated with specific colors. This is particularly challenging when the target style image exhibits multiple style types. In this work, we propose a color-aware multi-style transfer method that generates aesthetically pleasing results while preserving the style-color correlation between style and generated images. We achieve this desired outcome by introducing a simple but efficient modification to classic Gram matrix-based style transfer optimization. A nice feature of our method is that it enables the users to manually select the color associations between the target style and content image for more transfer flexibility. We validated our method with several qualitative comparisons, including a user study conducted with 30 participants. In comparison with prior work, our method is simple, easy to implement, and achieves visually appealing results when targeting images that have multiple styles. Source code is available at https://github.com/mahmoudnafifi/color-aware-style-transfer.
    Radar Voxel Fusion for 3D Object Detection. (arXiv:2106.14087v1 [cs.CV])
    (2 min) Automotive traffic scenes are complex due to the variety of possible scenarios, objects, and weather conditions that need to be handled. In contrast to more constrained environments, such as automated underground trains, automotive perception systems cannot be tailored to a narrow field of specific tasks but must handle an ever-changing environment with unforeseen events. As currently no single sensor is able to reliably perceive all relevant activity in the surroundings, sensor data fusion is applied to perceive as much information as possible. Data fusion of different sensors and sensor modalities on a low abstraction level enables the compensation of sensor weaknesses and misdetections among the sensors before the information-rich sensor data are compressed and thereby information is lost after a sensor-individual object detection. This paper develops a low-level sensor fusion network for 3D object detection, which fuses lidar, camera, and radar data. The fusion network is trained and evaluated on the nuScenes data set. On the test set, fusion of radar data increases the resulting AP (Average Precision) detection score by about 5.1% in comparison to the baseline lidar network. The radar sensor fusion proves especially beneficial in inclement conditions such as rain and night scenes. Fusing additional camera data contributes positively only in conjunction with the radar fusion, which shows that interdependencies of the sensors are important for the detection result. Additionally, the paper proposes a novel loss to handle the discontinuity of a simple yaw representation for object detection. Our updated loss increases the detection and orientation estimation performance for all sensor input configurations. The code for this research has been made available on GitHub.
    Image Classification with CondenseNeXt for ARM-Based Computing Platforms. (arXiv:2106.14102v1 [cs.CV])
    (2 min) In this paper, we demonstrate the implementation of our ultra-efficient deep convolutional neural network architecture: CondenseNeXt on NXP BlueBox, an autonomous driving development platform developed for self-driving vehicles. We show that CondenseNeXt is remarkably efficient in terms of FLOPs, designed for ARM-based embedded computing platforms with limited computational resources and can perform image classification without the need of a CUDA enabled GPU. CondenseNeXt utilizes the state-of-the-art depthwise separable convolution and model compression techniques to achieve a remarkable computational efficiency. Extensive analyses are conducted on CIFAR-10, CIFAR-100 and ImageNet datasets to verify the performance of CondenseNeXt Convolutional Neural Network (CNN) architecture. It achieves state-of-the-art image classification performance on three benchmark datasets including CIFAR-10 (4.79% top-1 error), CIFAR-100 (21.98% top-1 error) and ImageNet (7.91% single model, single crop top-5 error). CondenseNeXt achieves final trained model size improvement of 2.9+ MB and up to 59.98% reduction in forward FLOPs compared to CondenseNet and can perform image classification on ARM-Based computing platforms without needing a CUDA enabled GPU support, with outstanding efficiency.
    Saying the Unseen: Video Descriptions via Dialog Agents. (arXiv:2106.14069v1 [cs.CV])
    (2 min) Current vision and language tasks usually take complete visual data (e.g., raw images or videos) as input, however, practical scenarios may often consist the situations where part of the visual information becomes inaccessible due to various reasons e.g., restricted view with fixed camera or intentional vision block for security concerns. As a step towards the more practical application scenarios, we introduce a novel task that aims to describe a video using the natural language dialog between two agents as a supplementary information source given incomplete visual data. Different from most existing vision-language tasks where AI systems have full access to images or video clips, which may reveal sensitive information such as recognizable human faces or voices, we intentionally limit the visual input for AI systems and seek a more secure and transparent information medium, i.e., the natural language dialog, to supplement the missing visual information. Specifically, one of the intelligent agents - Q-BOT - is given two semantic segmented frames from the beginning and the end of the video, as well as a finite number of opportunities to ask relevant natural language questions before describing the unseen video. A-BOT, the other agent who has access to the entire video, assists Q-BOT to accomplish the goal by answering the asked questions. We introduce two different experimental settings with either a generative (i.e., agents generate questions and answers freely) or a discriminative (i.e., agents select the questions and answers from candidates) internal dialog generation process. With the proposed unified QA-Cooperative networks, we experimentally demonstrate the knowledge transfer process between the two dialog agents and the effectiveness of using the natural language dialog as a supplement for incomplete implicit visions.
    A Graph-based approach to derive the geodesic distance on Statistical manifolds: Application to Multimedia Information Retrieval. (arXiv:2106.14060v1 [cs.CV])
    (2 min) In this paper, we leverage the properties of non-Euclidean Geometry to define the Geodesic distance (GD) on the space of statistical manifolds. The Geodesic distance is a real and intuitive similarity measure that is a good alternative to the purely statistical and extensively used Kullback-Leibler divergence (KLD). Despite the effectiveness of the GD, a closed-form does not exist for many manifolds, since the geodesic equations are hard to solve. This explains that the major studies have been content to use numerical approximations. Nevertheless, most of those do not take account of the manifold properties, which leads to a loss of information and thus to low performances. We propose an approximation of the Geodesic distance through a graph-based method. This latter permits to well represent the structure of the statistical manifold, and respects its geometrical properties. Our main aim is to compare the graph-based approximation to the state of the art approximations. Thus, the proposed approach is evaluated for two statistical manifolds, namely the Weibull manifold and the Gamma manifold, considering the Content-Based Texture Retrieval application on different databases.
    Building a Video-and-Language Dataset with Human Actions for Multimodal Logical Inference. (arXiv:2106.14137v1 [cs.CV])
    (2 min) This paper introduces a new video-and-language dataset with human actions for multimodal logical inference, which focuses on intentional and aspectual expressions that describe dynamic human actions. The dataset consists of 200 videos, 5,554 action labels, and 1,942 action triplets of the form that can be translated into logical semantic representations. The dataset is expected to be useful for evaluating multimodal inference systems between videos and semantically complicated sentences including negation and quantification.
    Vision-driven Compliant Manipulation for Reliable, High-Precision Assembly Tasks. (arXiv:2106.14070v1 [cs.RO])
    (2 min) Highly constrained manipulation tasks continue to be challenging for autonomous robots as they require high levels of precision, typically less than 1mm, which is often incompatible with what can be achieved by traditional perception systems. This paper demonstrates that the combination of state-of-the-art object tracking with passively adaptive mechanical hardware can be leveraged to complete precision manipulation tasks with tight, industrially-relevant tolerances (0.25mm). The proposed control method closes the loop through vision by tracking the relative 6D pose of objects in the relevant workspace. It adjusts the control reference of both the compliant manipulator and the hand to complete object insertion tasks via within-hand manipulation. Contrary to previous efforts for insertion, our method does not require expensive force sensors, precision manipulators, or time-consuming, online learning, which is data hungry. Instead, this effort leverages mechanical compliance and utilizes an object agnostic manipulation model of the hand learned offline, off-the-shelf motion planning, and an RGBD-based object tracker trained solely with synthetic data. These features allow the proposed system to easily generalize and transfer to new tasks and environments. This paper describes in detail the system components and showcases its efficacy with extensive experiments involving tight tolerance peg-in-hole insertion tasks of various geometries as well as open-world constrained placement tasks.
    Mining atmospheric data. (arXiv:2106.13992v1 [cs.CV])
    (2 min) This paper overviews two interdependent issues important for mining remote sensing data (e.g. images) obtained from atmospheric monitoring missions. The first issue relates the building new public datasets and benchmarks, which are hot priority of the remote sensing community. The second issue is the investigation of deep learning methodologies for atmospheric data classification based on vast amount of data without annotations and with localized annotated data provided by sparse observing networks at the surface. The targeted application is air quality assessment and prediction. Air quality is defined as the pollution level linked with several atmospheric constituents such as gases and aerosols. There are dependency relationships between the bad air quality, caused by air pollution, and the public health. The target application is the development of a fast prediction model for local and regional air quality assessment and tracking. The results of mining data will have significant implication for citizen and decision makers by providing a fast prediction and reliable air quality monitoring system able to cover the local and regional scale through intelligent extrapolation of sparse ground-based in situ measurement networks.
    Attention-guided Progressive Mapping for Profile Face Recognition. (arXiv:2106.14124v1 [cs.CV])
    (2 min) The past few years have witnessed great progress in the domain of face recognition thanks to advances in deep learning. However, cross pose face recognition remains a significant challenge. It is difficult for many deep learning algorithms to narrow the performance gap caused by pose variations; the main reasons for this relate to the intra-class discrepancy between face images in different poses and the pose imbalances of training datasets. Learning pose-robust features by traversing to the feature space of frontal faces provides an effective and cheap way to alleviate this problem. In this paper, we present a method for progressively transforming profile face representations to the canonical pose with an attentive pair-wise loss. Firstly, to reduce the difficulty of directly transforming the profile face features into a frontal pose, we propose to learn the feature residual between the source pose and its nearby pose in a block-byblock fashion, and thus traversing to the feature space of a smaller pose by adding the learned residual. Secondly, we propose an attentive pair-wise loss to guide the feature transformation progressing in the most effective direction. Finally, our proposed progressive module and attentive pair-wise loss are light-weight and easy to implement, adding only about 7:5% extra parameters. Evaluations on the CFP and CPLFW datasets demonstrate the superiority of our proposed method. Code is available at https://github.com/hjy1312/AGPM.
    Visual Conceptual Blending with Large-scale Language and Vision Models. (arXiv:2106.14127v1 [cs.CL])
    (2 min) We ask the question: to what extent can recent large-scale language and image generation models blend visual concepts? Given an arbitrary object, we identify a relevant object and generate a single-sentence description of the blend of the two using a language model. We then generate a visual depiction of the blend using a text-based image generation model. Quantitative and qualitative evaluations demonstrate the superiority of language models over classical methods for conceptual blending, and of recent large-scale image generation models over prior models for the visual depiction.
    Spectral-Spatial Graph Reasoning Network for Hyperspectral Image Classification. (arXiv:2106.13952v1 [cs.CV])
    (2 min) In this paper, we propose a spectral-spatial graph reasoning network (SSGRN) for hyperspectral image (HSI) classification. Concretely, this network contains two parts that separately named spatial graph reasoning subnetwork (SAGRN) and spectral graph reasoning subnetwork (SEGRN) to capture the spatial and spectral graph contexts, respectively. Different from the previous approaches implementing superpixel segmentation on the original image or attempting to obtain the category features under the guide of label image, we perform the superpixel segmentation on intermediate features of the network to adaptively produce the homogeneous regions to get the effective descriptors. Then, we adopt a similar idea in spectral part that reasonably aggregating the channels to generate spectral descriptors for spectral graph contexts capturing. All graph reasoning procedures in SAGRN and SEGRN are achieved through graph convolution. To guarantee the global perception ability of the proposed methods, all adjacent matrices in graph reasoning are obtained with the help of non-local self-attention mechanism. At last, by combining the extracted spatial and spectral graph contexts, we obtain the SSGRN to achieve a high accuracy classification. Extensive quantitative and qualitative experiments on three public HSI benchmarks demonstrate the competitiveness of the proposed methods compared with other state-of-the-art approaches.
    BiX-NAS: Searching Efficient Bi-directional Architecture for Medical Image Segmentation. (arXiv:2106.14033v1 [eess.IV])
    (2 min) The recurrent mechanism has recently been introduced into U-Net in various medical image segmentation tasks. Existing studies have focused on promoting network recursion via reusing building blocks. Although network parameters could be greatly saved, computational costs still increase inevitably in accordance with the pre-set iteration time. In this work, we study a multi-scale upgrade of a bi-directional skip connected network and then automatically discover an efficient architecture by a novel two-phase Neural Architecture Search (NAS) algorithm, namely BiX-NAS. Our proposed method reduces the network computational cost by sifting out ineffective multi-scale features at different levels and iterations. We evaluate BiX-NAS on two segmentation tasks using three different medical image datasets, and the experimental results show that our BiX-NAS searched architecture achieves the state-of-the-art performance with significantly lower computational cost.
    Functional Classwise Principal Component Analysis: A Novel Classification Framework. (arXiv:2106.13959v1 [stat.ML])
    (2 min) In recent times, functional data analysis (FDA) has been successfully applied in the field of high dimensional data classification. In this paper, we present a novel classification framework using functional data and classwise Principal Component Analysis (PCA). Our proposed method can be used in high dimensional time series data which typically suffers from small sample size problem. Our method extracts a piece wise linear functional feature space and is particularly suitable for hard classification problems.The proposed framework converts time series data into functional data and uses classwise functional PCA for feature extraction followed by classification using a Bayesian linear classifier. We demonstrate the efficacy of our proposed method by applying it to both synthetic data sets and real time series data from diverse fields including but not limited to neuroscience, food science, medical sciences and chemometrics.
    Robust Pose Transfer with Dynamic Details using Neural Video Rendering. (arXiv:2106.14132v1 [cs.CV])
    (2 min) Pose transfer of human videos aims to generate a high fidelity video of a target person imitating actions of a source person. A few studies have made great progress either through image translation with deep latent features or neural rendering with explicit 3D features. However, both of them rely on large amounts of training data to generate realistic results, and the performance degrades on more accessible internet videos due to insufficient training frames. In this paper, we demonstrate that the dynamic details can be preserved even trained from short monocular videos. Overall, we propose a neural video rendering framework coupled with an image-translation-based dynamic details generation network (D2G-Net), which fully utilizes both the stability of explicit 3D features and the capacity of learning components. To be specific, a novel texture representation is presented to encode both the static and pose-varying appearance characteristics, which is then mapped to the image space and rendered as a detail-rich frame in the neural rendering stage. Moreover, we introduce a concise temporal loss in the training stage to suppress the detail flickering that is made more visible due to high-quality dynamic details generated by our method. Through extensive comparisons, we demonstrate that our neural human video renderer is capable of achieving both clearer dynamic details and more robust performance even on accessible short videos with only 2k - 4k frames.
    Hear Me Out: Fusional Approaches for Audio Augmented Temporal Action Localization. (arXiv:2106.14118v1 [cs.CV])
    (2 min) State of the art architectures for untrimmed video Temporal Action Localization (TAL) have only considered RGB and Flow modalities, leaving the information-rich audio modality totally unexploited. Audio fusion has been explored for the related but arguably easier problem of trimmed (clip-level) action recognition. However, TAL poses a unique set of challenges. In this paper, we propose simple but effective fusion-based approaches for TAL. To the best of our knowledge, our work is the first to jointly consider audio and video modalities for supervised TAL. We experimentally show that our schemes consistently improve performance for state of the art video-only TAL approaches. Specifically, they help achieve new state of the art performance on large-scale benchmark datasets - ActivityNet-1.3 (52.73 mAP@0.5) and THUMOS14 (57.18 mAP@0.5). Our experiments include ablations involving multiple fusion schemes, modality combinations and TAL architectures. Our code, models and associated data will be made available.
    Semantics-aware Multi-modal Domain Translation:From LiDAR Point Clouds to Panoramic Color Images. (arXiv:2106.13974v1 [cs.CV])
    (2 min) In this work, we present a simple yet effective framework to address the domain translation problem between different sensor modalities with unique data formats. By relying only on the semantics of the scene, our modular generative framework can, for the first time, synthesize a panoramic color image from a given full 3D LiDAR point cloud. The framework starts with semantic segmentation of the point cloud, which is initially projected onto a spherical surface. The same semantic segmentation is applied to the corresponding camera image. Next, our new conditional generative model adversarially learns to translate the predicted LiDAR segment maps to the camera image counterparts. Finally, generated image segments are processed to render the panoramic scene images. We provide a thorough quantitative evaluation on the SemanticKitti dataset and show that our proposed framework outperforms other strong baseline models. Our source code is available at https://github.com/halmstad-University/TITAN-NET
    Descriptive Modeling of Textiles using FE Simulations and Deep Learning. (arXiv:2106.13982v1 [cs.CV])
    (2 min) In this work we propose a novel and fully automated method for extracting the yarn geometrical features in woven composites so that a direct parametrization of the textile reinforcement is achieved (e.g., FE mesh). Thus, our aim is not only to perform yarn segmentation from tomographic images but rather to provide a complete descriptive modeling of the fabric. As such, this direct approach improves on previous methods that use voxel-wise masks as intermediate representations followed by re-meshing operations (yarn envelope estimation). The proposed approach employs two deep neural network architectures (U-Net and Mask RCNN). First, we train the U-Net to generate synthetic CT images from the corresponding FE simulations. This allows to generate large quantities of annotated data without requiring costly manual annotations. This data is then used to train the Mask R-CNN, which is focused on predicting contour points around each of the yarns in the image. Experimental results show that our method is accurate and robust for performing yarn instance segmentation on CT images, this is further validated by quantitative and qualitative analyses.
    UMIC: An Unreferenced Metric for Image Captioning via Contrastive Learning. (arXiv:2106.14019v1 [cs.CL])
    (2 min) Despite the success of various text generation metrics such as BERTScore, it is still difficult to evaluate the image captions without enough reference captions due to the diversity of the descriptions. In this paper, we introduce a new metric UMIC, an Unreferenced Metric for Image Captioning which does not require reference captions to evaluate image captions. Based on Vision-and-Language BERT, we train UMIC to discriminate negative captions via contrastive learning. Also, we observe critical problems of the previous benchmark dataset (i.e., human annotations) on image captioning metric, and introduce a new collection of human annotations on the generated captions. We validate UMIC on four datasets, including our new dataset, and show that UMIC has a higher correlation than all previous metrics that require multiple references. We release the benchmark dataset and pre-trained models to compute the UMIC.
    ShapeEditer: a StyleGAN Encoder for Face Swapping. (arXiv:2106.13984v1 [cs.CV])
    (2 min) In this paper, we propose a novel encoder, called ShapeEditor, for high-resolution, realistic and high-fidelity face exchange. First of all, in order to ensure sufficient clarity and authenticity, our key idea is to use an advanced pretrained high-quality random face image generator, i.e. StyleGAN, as backbone. Secondly, we design ShapeEditor, a two-step encoder, to make the swapped face integrate the identity and attribute of the input faces. In the first step, we extract the identity vector of the source image and the attribute vector of the target image respectively; in the second step, we map the concatenation of identity vector and attribute vector into the $\mathcal{W+}$ potential space. In addition, for learning to map into the latent space of StyleGAN, we propose a set of self-supervised loss functions with which the training data do not need to be labeled manually. Extensive experiments on the test dataset show that the results of our method not only have a great advantage in clarity and authenticity than other state-of-the-art methods, but also reflect the sufficient integration of identity and attribute.
    In-N-Out: Towards Good Initialization for Inpainting and Outpainting. (arXiv:2106.13953v1 [cs.CV])
    (2 min) In computer vision, recovering spatial information by filling in masked regions, e.g., inpainting, has been widely investigated for its usability and wide applicability to other various applications: image inpainting, image extrapolation, and environment map estimation. Most of them are studied separately depending on the applications. Our focus, however, is on accommodating the opposite task, e.g., image outpainting, which would benefit the target applications, e.g., image inpainting. Our self-supervision method, In-N-Out, is summarized as a training approach that leverages the knowledge of the opposite task into the target model. We empirically show that In-N-Out -- which explores the complementary information -- effectively takes advantage over the traditional pipelines where only task-specific learning takes place in training. In experiments, we compare our method to the traditional procedure and analyze the effectiveness of our method on different applications: image inpainting, image extrapolation, and environment map estimation. For these tasks, we demonstrate that In-N-Out consistently improves the performance of the recent works with In-N-Out self-supervision to their training procedure. Also, we show that our approach achieves better results than an existing training approach for outpainting.
    Multimodal Few-Shot Learning with Frozen Language Models. (arXiv:2106.13884v1 [cs.CV])
    (2 min) When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
    Semi-supervised Semantic Segmentation with Directional Context-aware Consistency. (arXiv:2106.14133v1 [cs.CV])
    (2 min) Semantic segmentation has made tremendous progress in recent years. However, satisfying performance highly depends on a large number of pixel-level annotations. Therefore, in this paper, we focus on the semi-supervised segmentation problem where only a small set of labeled data is provided with a much larger collection of totally unlabeled images. Nevertheless, due to the limited annotations, models may overly rely on the contexts available in the training data, which causes poor generalization to the scenes unseen before. A preferred high-level representation should capture the contextual information while not losing self-awareness. Therefore, we propose to maintain the context-aware consistency between features of the same identity but with different contexts, making the representations robust to the varying environments. Moreover, we present the Directional Contrastive Loss (DC Loss) to accomplish the consistency in a pixel-to-pixel manner, only requiring the feature with lower quality to be aligned towards its counterpart. In addition, to avoid the false-negative samples and filter the uncertain positive samples, we put forward two sampling strategies. Extensive experiments show that our simple yet effective method surpasses current state-of-the-art methods by a large margin and also generalizes well with extra image-level annotations.
    Exploring Temporal Context and Human Movement Dynamics for Online Action Detection in Videos. (arXiv:2106.13967v1 [cs.CV])
    (2 min) Nowadays, the interaction between humans and robots is constantly expanding, requiring more and more human motion recognition applications to operate in real time. However, most works on temporal action detection and recognition perform these tasks in offline manner, i.e. temporally segmented videos are classified as a whole. In this paper, based on the recently proposed framework of Temporal Recurrent Networks, we explore how temporal context and human movement dynamics can be effectively employed for online action detection. Our approach uses various state-of-the-art architectures and appropriately combines the extracted features in order to improve action detection. We evaluate our method on a challenging but widely used dataset for temporal action localization, THUMOS'14. Our experiments show significant improvement over the baseline method, achieving state-of-the art results on THUMOS'14.
    Inverting and Understanding Object Detectors. (arXiv:2106.13933v1 [cs.CV])
    (2 min) As a core problem in computer vision, the performance of object detection has improved drastically in the past few years. Despite their impressive performance, object detectors suffer from a lack of interpretability. Visualization techniques have been developed and widely applied to introspect the decisions made by other kinds of deep learning models; however, visualizing object detectors has been underexplored. In this paper, we propose using inversion as a primary tool to understand modern object detectors and develop an optimization-based approach to layout inversion, allowing us to generate synthetic images recognized by trained detectors as containing a desired configuration of objects. We reveal intriguing properties of detectors by applying our layout inversion technique to a variety of modern object detectors, and further investigate them via validation experiments: they rely on qualitatively different features for classification and regression; they learn canonical motifs of commonly co-occurring objects; they use diff erent visual cues to recognize objects of varying sizes. We hope our insights can help practitioners improve object detectors.
    OffRoadTranSeg: Semi-Supervised Segmentation using Transformers on OffRoad environments. (arXiv:2106.13963v1 [cs.CV])
    (2 min) We present OffRoadTranSeg, the first end-to-end framework for semi-supervised segmentation in unstructured outdoor environment using transformers and automatic data selection for labelling. The offroad segmentation is a scene understanding approach that is widely used in autonomous driving. The popular offroad segmentation method is to use fully connected convolution layers and large labelled data, however, due to class imbalance, there will be several mismatches and also some classes may not be detected. Our approach is to do the task of offroad segmentation in a semi-supervised manner. The aim is to provide a model where self supervised vision transformer is used to fine-tune offroad datasets with self-supervised data collection for labelling using depth estimation. The proposed method is validated on RELLIS-3D and RUGD offroad datasets. The experiments show that OffRoadTranSeg outperformed other state of the art models, and also solves the RELLIS-3D class imbalance problem.
    Midpoint Regularization: from High Uncertainty Training to Conservative Classification. (arXiv:2106.13913v1 [cs.LG])
    (2 min) Label Smoothing (LS) improves model generalization through penalizing models from generating overconfident output distributions. For each training sample the LS strategy smooths the one-hot encoded training signal by distributing its distribution mass over the non-ground truth classes. We extend this technique by considering example pairs, coined PLS. PLS first creates midpoint samples by averaging random sample pairs and then learns a smoothing distribution during training for each of these midpoint samples, resulting in midpoints with high uncertainty labels for training. We empirically show that PLS significantly outperforms LS, achieving up to 30% of relative classification error reduction. We also visualize that PLS produces very low winning softmax scores for both in and out of distribution samples.
    Self-paced Principal Component Analysis. (arXiv:2106.13880v1 [cs.LG])
    (2 min) Principal Component Analysis (PCA) has been widely used for dimensionality reduction and feature extraction. Robust PCA (RPCA), under different robust distance metrics, such as l1-norm and l2, p-norm, can deal with noise or outliers to some extent. However, real-world data may display structures that can not be fully captured by these simple functions. In addition, existing methods treat complex and simple samples equally. By contrast, a learning pattern typically adopted by human beings is to learn from simple to complex and less to more. Based on this principle, we propose a novel method called Self-paced PCA (SPCA) to further reduce the effect of noise and outliers. Notably, the complexity of each sample is calculated at the beginning of each iteration in order to integrate samples from simple to more complex into training. Based on an alternating optimization, SPCA finds an optimal projection matrix and filters out outliers iteratively. Theoretical analysis is presented to show the rationality of SPCA. Extensive experiments on popular data sets demonstrate that the proposed method can improve the state of-the-art results considerably.
    Core Challenges in Embodied Vision-Language Planning. (arXiv:2106.13948v1 [cs.LG])
    (2 min) Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.
    Nonuniform Defocus Removal for Image Classification. (arXiv:2106.13864v1 [cs.CV])
    (2 min) We propose and study the single-frame anisoplanatic deconvolution problem associated with image classification using machine learning algorithms, named the nonuniform defocus removal (NDR) problem. Mathematical analysis of the NDR problem is done and the so-called defocus removal (DR) algorithm for solving it is proposed. Global convergence of the DR algorithm is established without imposing any unverifiable assumption. Numerical results on simulation data show significant features of DR including solvability, noise robustness, convergence, model insensitivity and computational efficiency. Physical relevance of the NDR problem and practicability of the DR algorithm are tested on experimental data. Back to the application that originally motivated the investigation of the NDR problem, we show that the DR algorithm can improve the accuracy of classifying distorted images using convolutional neural networks. The key difference of this paper compared to most existing works on single-frame anisoplanatic deconvolution is that the new method does not require the data image to be decomposable into isoplanatic subregions. Therefore, solution approaches partitioning the image into isoplanatic zones are not applicable to the NDR problem and those handling the entire image such as the DR algorithm need to be developed and analyzed.
    Domain Conditional Predictors for Domain Adaptation. (arXiv:2106.13899v1 [cs.LG])
    (2 min) Learning guarantees often rely on assumptions of i.i.d. data, which will likely be violated in practice once predictors are deployed to perform real-world tasks. Domain adaptation approaches thus appeared as a useful framework yielding extra flexibility in that distinct train and test data distributions are supported, provided that other assumptions are satisfied such as covariate shift, which expects the conditional distributions over labels to be independent of the underlying data distribution. Several approaches were introduced in order to induce generalization across varying train and test data sources, and those often rely on the general idea of domain-invariance, in such a way that the data-generating distributions are to be disregarded by the prediction model. In this contribution, we tackle the problem of generalizing across data sources by approaching it from the opposite direction: we consider a conditional modeling approach in which predictions, in addition to being dependent on the input data, use information relative to the underlying data-generating distribution. For instance, the model has an explicit mechanism to adapt to changing environments and/or new data sources. We argue that such an approach is more generally applicable than current domain adaptation methods since it does not require extra assumptions such as covariate shift and further yields simpler training algorithms that avoid a common source of training instabilities caused by minimax formulations, often employed in domain-invariant methods.
    A CNN Segmentation-Based Approach to Object Detection and Tracking in Ultrasound Scans with Application to the Vagus Nerve Detection. (arXiv:2106.13849v1 [cs.CV])
    (2 min) Ultrasound scanning is essential in several medical diagnostic and therapeutic applications. It is used to visualize and analyze anatomical features and structures that influence treatment plans. However, it is both labor intensive, and its effectiveness is operator dependent. Real-time accurate and robust automatic detection and tracking of anatomical structures while scanning would significantly impact diagnostic and therapeutic procedures to be consistent and efficient. In this paper, we propose a deep learning framework to automatically detect and track a specific anatomical target structure in ultrasound scans. Our framework is designed to be accurate and robust across subjects and imaging devices, to operate in real-time, and to not require a large training set. It maintains a localization precision and recall higher than 90% when trained on training sets that are as small as 20% in size of the original training set. The framework backbone is a weakly trained segmentation neural network based on U-Net. We tested the framework on two different ultrasound datasets with the aim to detect and track the Vagus nerve, where it outperformed current state-of-the-art real-time object detection networks.
    Domain Adaptive YOLO for One-Stage Cross-Domain Detection. (arXiv:2106.13939v1 [cs.CV])
    (2 min) Domain shift is a major challenge for object detectors to generalize well to real world applications. Emerging techniques of domain adaptation for two-stage detectors help to tackle this problem. However, two-stage detectors are not the first choice for industrial applications due to its long time consumption. In this paper, a novel Domain Adaptive YOLO (DA-YOLO) is proposed to improve cross-domain performance for one-stage detectors. Image level features alignment is used to strictly match for local features like texture, and loosely match for global features like illumination. Multi-scale instance level features alignment is presented to reduce instance domain shift effectively , such as variations in object appearance and viewpoint. A consensus regularization to these domain classifiers is employed to help the network generate domain-invariant detections. We evaluate our proposed method on popular datasets like Cityscapes, KITTI, SIM10K and etc.. The results demonstrate significant improvement when tested under different cross-domain scenarios.
    Fully Steerable 3D Spherical Neurons. (arXiv:2106.13863v1 [cs.CV])
    (2 min) Emerging from low-level vision theory, steerable filters found their counterpart in deep learning. Earlier works used the steering theorems and presented convolutional networks equivariant to rigid transformations. In our work, we propose a steerable feed-forward learning-based approach that consists of spherical decision surfaces and operates on point clouds. Due to the inherent geometric 3D structure of our theory, we derive a 3D steerability constraint for its atomic parts, the hypersphere neurons. Exploiting the rotational equivariance, we show how the model parameters are fully steerable at inference time. The proposed spherical filter banks enable to make equivariant and, after online optimization, invariant class predictions for known synthetic point sets in unknown orientations.
    Scene Uncertainty and the Wellington Posterior of Deterministic Image Classifiers. (arXiv:2106.13870v1 [cs.CV])
    (2 min) We propose a method to estimate the uncertainty of the outcome of an image classifier on a given input datum. Deep neural networks commonly used for image classification are deterministic maps from an input image to an output class. As such, their outcome on a given datum involves no uncertainty, so we must specify what variability we are referring to when defining, measuring and interpreting "confidence." To this end, we introduce the Wellington Posterior, which is the distribution of outcomes that would have been obtained in response to data that could have been generated by the same scene that produced the given image. Since there are infinitely many scenes that could have generated the given image, the Wellington Posterior requires induction from scenes other than the one portrayed. We explore alternate methods using data augmentation, ensembling, and model linearization. Additional alternatives include generative adversarial networks, conditional prior networks, and supervised single-view reconstruction. We test these alternatives against the empirical posterior obtained by inferring the class of temporally adjacent frames in a video. These developments are only a small step towards assessing the reliability of deep network classifiers in a manner that is compatible with safety-critical applications.
  • cs.IR updates on arXiv.org

    IITP at AILA 2019: System Report for Artificial Intelligence for Legal Assistance Shared Task. (arXiv:2105.11347v2 [cs.CL] UPDATED)
    (2 min) In this article, we present a description of our systems as a part of our participation in the shared task namely Artificial Intelligence for Legal Assistance (AILA 2019). This is an integral event of Forum for Information Retrieval Evaluation-2019. The outcomes of this track would be helpful for the automation of the working process of the Indian Judiciary System. The manual working procedures and documentation at any level (from lower to higher court) of the judiciary system are very complex in nature. The systems produced as a part of this track would assist the law practitioners. It would be helpful for common men too. This kind of track also opens the path of research of Natural Language Processing (NLP) in the judicial domain. This track defined two problems such as Task 1: Identifying relevant prior cases for a given situation and Task 2: Identifying the most relevant statutes for a given situation. We tackled both of them. Our proposed approaches are based on BM25 and Doc2Vec. As per the results declared by the task organizers, we are in 3rd and a modest position in Task 1 and Task 2 respectively.
    Intent Disentanglement and Feature Self-supervision for Novel Recommendation. (arXiv:2106.14388v1 [cs.IR])
    (2 min) One key property in recommender systems is the long-tail distribution in user-item interactions where most items only have few user feedback. Improving the recommendation of tail items can promote novelty and bring positive effects to both users and providers, and thus is a desirable property of recommender systems. Current novel recommendation studies over-emphasize the importance of tail items without differentiating the degree of users' intent on popularity and often incur a sharp decline of accuracy. Moreover, none of existing methods has ever taken the extreme case of tail items, i.e., cold-start items without any interaction, into consideration. In this work, we first disclose the mechanism that drives a user's interaction towards popular or niche items by disentangling her intent into conformity influence (popularity) and personal interests (preference). We then present a unified end-to-end framework to simultaneously optimize accuracy and novelty targets based on the disentangled intent of popularity and that of preference. We further develop a new paradigm for novel recommendation of cold-start items which exploits the self-supervised learning technique to model the correlation between collaborative features and content features. We conduct extensive experimental results on three real-world datasets. The results demonstrate that our proposed model yields significant improvements over the state-of-the-art baselines in terms of accuracy, novelty, coverage, and trade-off.
    Deep Learning for Technical Document Classification. (arXiv:2106.14269v1 [cs.LG])
    (2 min) In large technology companies, the requirements for managing and organizing technical documents created by engineers and managers in supporting relevant decision making have increased dramatically in recent years, which has led to a higher demand for more scalable, accurate, and automated document classification. Prior studies have primarily focused on processing text for classification and small-scale databases. This paper describes a novel multimodal deep learning architecture, called TechDoc, for technical document classification, which utilizes both natural language and descriptive images to train hierarchical classifiers. The architecture synthesizes convolutional neural networks and recurrent neural networks through an integrated training process. We applied the architecture to a large multimodal technical document database and trained the model for classifying documents based on the hierarchical International Patent Classification system. Our results show that the trained neural network presents a greater classification accuracy than those using a single modality and several earlier text classification methods. The trained model can potentially be scaled to millions of real-world technical documents with both text and figures, which is useful for data and knowledge management in large technology companies and organizations.
    RadGraph: Extracting Clinical Entities and Relations from Radiology Reports. (arXiv:2106.14463v1 [cs.CL])
    (2 min) Extracting structured clinical information from free-text radiology reports can enable the use of radiology report information for a variety of critical healthcare applications. In our work, we present RadGraph, a dataset of entities and relations in full-text chest X-ray radiology reports based on a novel information extraction schema we designed to structure radiology reports. We release a development dataset, which contains board-certified radiologist annotations for 500 radiology reports from the MIMIC-CXR dataset (14,579 entities and 10,889 relations), and a test dataset, which contains two independent sets of board-certified radiologist annotations for 100 radiology reports split equally across the MIMIC-CXR and CheXpert datasets. Using these datasets, we train and test a deep learning model, RadGraph Benchmark, that achieves a micro F1 of 0.82 and 0.73 on relation extraction on the MIMIC-CXR and CheXpert test sets respectively. Additionally, we release an inference dataset, which contains annotations automatically generated by RadGraph Benchmark across 220,763 MIMIC-CXR reports (around 6 million entities and 4 million relations) and 500 CheXpert reports (13,783 entities and 9,908 relations) with mappings to associated chest radiographs. Our freely available dataset can facilitate a wide range of research in medical natural language processing, as well as computer vision and multi-modal learning when linked to chest radiographs.
    Interplay between Upsampling and Regularization for Provider Fairness in Recommender Systems. (arXiv:2006.04279v3 [cs.IR] UPDATED)
    (2 min) Considering the impact of recommendations on item providers is one of the duties of multi-sided recommender systems. Item providers are key stakeholders in online platforms, and their earnings and plans are influenced by the exposure their items receive in recommended lists. Prior work showed that certain minority groups of providers, characterized by a common sensitive attribute (e.g., gender or race), are being disproportionately affected by indirect and unintentional discrimination. Our study in this paper handles a situation where ($i$) the same provider is associated with multiple items of a list suggested to a user, ($ii$) an item is created by more than one provider jointly, and ($iii$) predicted user-item relevance scores are biasedly estimated for items of provider groups. Under this scenario, we assess disparities in relevance, visibility, and exposure, by simulating diverse representations of the minority group in the catalog and the interactions. Based on emerged unfair outcomes, we devise a treatment that combines observation upsampling and loss regularization, while learning user-item relevance scores. Experiments on real-world data demonstrate that our treatment leads to lower disparate relevance. The resulting recommended lists show fairer visibility and exposure, higher minority item coverage, and negligible loss in recommendation utility.
    Improving Sequential Recommendation Consistency with Self-Supervised Imitation. (arXiv:2106.14031v1 [cs.IR])
    (2 min) Most sequential recommendation models capture the features of consecutive items in a user-item interaction history. Though effective, their representation expressiveness is still hindered by the sparse learning signals. As a result, the sequential recommender is prone to make inconsistent predictions. In this paper, we propose a model, \textbf{SSI}, to improve sequential recommendation consistency with Self-Supervised Imitation. Precisely, we extract the consistency knowledge by utilizing three self-supervised pre-training tasks, where temporal consistency and persona consistency capture user-interaction dynamics in terms of the chronological order and persona sensitivities, respectively. Furthermore, to provide the model with a global perspective, global session consistency is introduced by maximizing the mutual information among global and local interaction sequences. Finally, to comprehensively take advantage of all three independent aspects of consistency-enhanced knowledge, we establish an integrated imitation learning framework. The consistency knowledge is effectively internalized and transferred to the student model by imitating the conventional prediction logit as well as the consistency-enhanced item representations. In addition, the flexible self-supervised imitation framework can also benefit other student recommenders. Experiments on four real-world datasets show that SSI effectively outperforms the state-of-the-art sequential recommendation methods.
    Sequential Recommendation with Graph Neural Networks. (arXiv:2106.14226v1 [cs.IR])
    (2 min) Sequential recommendation aims to leverage users' historical behaviors to predict their next interaction. Existing works have not yet addressed two main challenges in sequential recommendation. First, user behaviors in their rich historical sequences are often implicit and noisy preference signals, they cannot sufficiently reflect users' actual preferences. In addition, users' dynamic preferences often change rapidly over time, and hence it is difficult to capture user patterns in their historical sequences. In this work, we propose a graph neural network model called SURGE (short for SeqUential Recommendation with Graph neural nEtworks) to address these two issues. Specifically, SURGE integrates different types of preferences in long-term user behaviors into clusters in the graph by re-constructing loose item sequences into tight item-item interest graphs based on metric learning. This helps explicitly distinguish users' core interests, by forming dense clusters in the interest graph. Then, we perform cluster-aware and query-aware graph convolutional propagation and graph pooling on the constructed graph. It dynamically fuses and extracts users' current activated core interests from noisy user behavior sequences. We conduct extensive experiments on both public and proprietary industrial datasets. Experimental results demonstrate significant performance gains of our proposed method compared to state-of-the-art methods. Further studies on sequence length confirm that our method can model long behavioral sequences effectively and efficiently.
    Unifying Remote Sensing Image Retrieval and Classification with Robust Fine-tuning. (arXiv:2102.13392v2 [cs.CV] UPDATED)
    (2 min) Advances in high resolution remote sensing image analysis are currently hampered by the difficulty of gathering enough annotated data for training deep learning methods, giving rise to a variety of small datasets and associated dataset-specific methods. Moreover, typical tasks such as classification and retrieval lack a systematic evaluation on standard benchmarks and training datasets, which make it hard to identify durable and generalizable scientific contributions. We aim at unifying remote sensing image retrieval and classification with a new large-scale training and testing dataset, SF300, including both vertical and oblique aerial images and made available to the research community, and an associated fine-tuning method. We additionally propose a new adversarial fine-tuning method for global descriptors. We show that our framework systematically achieves a boost of retrieval and classification performance on nine different datasets compared to an ImageNet pretrained baseline, with currently no other method to compare to.
    AI based Presentation Creator With Customized Audio Content Delivery. (arXiv:2106.14213v1 [cs.LG])
    (3 min) In this paper, we propose an architecture to solve a novel problem statement that has stemmed more so in recent times with an increase in demand for virtual content delivery due to the COVID-19 pandemic. All educational institutions, workplaces, research centers, etc. are trying to bridge the gap of communication during these socially distanced times with the use of online content delivery. The trend now is to create presentations, and then subsequently deliver the same using various virtual meeting platforms. The time being spent in such creation of presentations and delivering is what we try to reduce and eliminate through this paper which aims to use Machine Learning (ML) algorithms and Natural Language Processing (NLP) modules to automate the process of creating a slides-based presentation from a document, and then use state-of-the-art voice cloning models to deliver the content in the desired author's voice. We consider a structured document such as a research paper to be the content that has to be presented. The research paper is first summarized using BERT summarization techniques and condensed into bullet points that go into the slides. Tacotron inspired architecture with Encoder, Synthesizer, and a Generative Adversarial Network (GAN) based vocoder, is used to convey the contents of the slides in the author's voice (or any customized voice). Almost all learning has now been shifted to online mode, and professionals are now working from the comfort of their homes. Due to the current situation, teachers and professionals have shifted to presentations to help them in imparting information. In this paper, we aim to reduce the considerable amount of time that is taken in creating a presentation by automating this process and subsequently delivering this presentation in a customized voice, using a content delivery mechanism that can clone any voice using a short audio clip.
    Detecting race and gender bias in visual representation of AI on web search engines. (arXiv:2106.14072v1 [cs.IR])
    (2 min) Web search engines influence perception of social reality by filtering and ranking information. However, their outputs are often subjected to bias that can lead to skewed representation of subjects such as professional occupations or gender. In our paper, we use a mixed-method approach to investigate presence of race and gender bias in representation of artificial intelligence (AI) in image search results coming from six different search engines. Our findings show that search engines prioritize anthropomorphic images of AI that portray it as white, whereas non-white images of AI are present only in non-Western search engines. By contrast, gender representation of AI is more diverse and less skewed towards a specific gender that can be attributed to higher awareness about gender bias in search outputs. Our observations indicate both the the need and the possibility for addressing bias in representation of societally relevant subjects, such as technological innovation, and emphasize the importance of designing new approaches for detecting bias in information retrieval systems.
    Transfer-based adaptive tree for multimodal sentiment analysis based on user latent aspects. (arXiv:2106.14174v1 [cs.LG])
    (2 min) Multimodal sentiment analysis benefits various applications such as human-computer interaction and recommendation systems. It aims to infer the users' bipolar ideas using visual, textual, and acoustic signals. Although researchers affirm the association between cognitive cues and emotional manifestations, most of the current multimodal approaches in sentiment analysis disregard user-specific aspects. To tackle this issue, we devise a novel method to perform multimodal sentiment prediction using cognitive cues, such as personality. Our framework constructs an adaptive tree by hierarchically dividing users and trains the LSTM-based submodels, utilizing an attention-based fusion to transfer cognitive-oriented knowledge within the tree. Subsequently, the framework consumes the conclusive agglomerative knowledge from the adaptive tree to predict final sentiments. We also devise a dynamic dropout method to facilitate data sharing between neighboring nodes, reducing data sparsity. The empirical results on real-world datasets determine that our proposed model for sentiment prediction can surpass trending rivals. Moreover, compared to other ensemble approaches, the proposed transfer-based algorithm can better utilize the latent cognitive cues and foster the prediction outcomes. Based on the given extrinsic and intrinsic analysis results, we note that compared to other theoretical-based techniques, the proposed hierarchical clustering approach can better group the users within the adaptive tree.
  • cs.LG updates on arXiv.org

    LRG at SemEval-2021 Task 4: Improving Reading Comprehension with Abstract Words using Augmentation, Linguistic Features and Voting. (arXiv:2102.12255v2 [cs.CL] UPDATED)
    (2 min) In this article, we present our methodologies for SemEval-2021 Task-4: Reading Comprehension of Abstract Meaning. Given a fill-in-the-blank-type question and a corresponding context, the task is to predict the most suitable word from a list of 5 options. There are three sub-tasks within this task: Imperceptibility (subtask-I), Non-Specificity (subtask-II), and Intersection (subtask-III). We use encoders of transformers-based models pre-trained on the masked language modelling (MLM) task to build our Fill-in-the-blank (FitB) models. Moreover, to model imperceptibility, we define certain linguistic features, and to model non-specificity, we leverage information from hypernyms and hyponyms provided by a lexical database. Specifically, for non-specificity, we try out augmentation techniques, and other statistical techniques. We also propose variants, namely Chunk Voting and Max Context, to take care of input length restrictions for BERT, etc. Additionally, we perform a thorough ablation study, and use Integrated Gradients to explain our predictions on a few samples. Our best submissions achieve accuracies of 75.31% and 77.84%, on the test sets for subtask-I and subtask-II, respectively. For subtask-III, we achieve accuracies of 65.64% and 62.27%.
    Quasiconformal model with CNN features for large deformation image registration. (arXiv:2011.00731v3 [cs.CV] UPDATED)
    (2 min) Image registration has been widely studied over the past several decades, with numerous applications in science, engineering and medicine. Most of the conventional mathematical models for large deformation image registration rely on prescribed landmarks, which usually require tedious manual labeling and are prone to error. In recent years, there has been a surge of interest in the use of machine learning for image registration. In this paper, we develop a novel method for large deformation image registration by a fusion of quasiconformal theory and convolutional neural network (CNN). More specifically, we propose a quasiconformal energy model with a novel fidelity term that incorporates the features extracted using a pre-trained CNN, thereby allowing us to obtain meaningful registration results without any guidance of prescribed landmarks. Moreover, unlike many prior image registration methods, the bijectivity of our method is guaranteed by quasiconformal theory. Experimental results are presented to demonstrate the effectiveness of the proposed method. More broadly, our work sheds light on how rigorous mathematical theories and practical machine learning approaches can be integrated for developing computational methods with improved performance.
    Graph Contrastive Learning Automated. (arXiv:2106.07594v2 [cs.LG] UPDATED)
    (2 min) Self-supervised learning on graph-structured data has drawn recent interest for learning generalizable, transferable and robust representations from unlabeled graphs. Among many, graph contrastive learning (GraphCL) has emerged with promising representation learning performance. Unfortunately, unlike its counterpart on image data, the effectiveness of GraphCL hinges on ad-hoc data augmentations, which have to be manually picked per dataset, by either rules of thumb or trial-and-errors, owing to the diverse nature of graph data. That significantly limits the more general applicability of GraphCL. Aiming to fill in this crucial gap, this paper proposes a unified bi-level optimization framework to automatically, adaptively and dynamically select data augmentations when performing GraphCL on specific graph data. The general framework, dubbed JOint Augmentation Optimization (JOAO), is instantiated as min-max optimization. The selections of augmentations made by JOAO are shown to be in general aligned with previous "best practices" observed from handcrafted tuning: yet now being automated, more flexible and versatile. Moreover, we propose a new augmentation-aware projection head mechanism, which will route output features through different projection heads corresponding to different augmentations chosen at each training step. Extensive experiments demonstrate that JOAO performs on par with or sometimes better than the state-of-the-art competitors including GraphCL, on multiple graph datasets of various scales and types, yet without resorting to any laborious dataset-specific tuning on augmentation selection. We release the code at https://github.com/Shen-Lab/GraphCL_Automated.
    MLPerf Tiny Benchmark. (arXiv:2106.07597v2 [cs.LG] UPDATED)
    (2 min) Advancements in ultra-low-power tiny machine learning (TinyML) systems promise to unlock an entirely new class of smart applications. However, continued progress is limited by the lack of a widely accepted and easily reproducible benchmark for these systems. To meet this need, we present MLPerf Tiny, the first industry-standard benchmark suite for ultra-low-power tiny machine learning systems. The benchmark suite is the collaborative effort of more than 50 organizations from industry and academia and reflects the needs of the community. MLPerf Tiny measures the accuracy, latency, and energy of machine learning inference to properly evaluate the tradeoffs between systems. Additionally, MLPerf Tiny implements a modular design that enables benchmark submitters to show the benefits of their product, regardless of where it falls on the ML deployment stack, in a fair and reproducible manner. The suite features four benchmarks: keyword spotting, visual wake words, image classification, and anomaly detection.
    A Reinforcement Learning Approach for Sequential Spatial Transformer Networks. (arXiv:2106.14295v1 [cs.LG])
    (2 min) Spatial Transformer Networks (STN) can generate geometric transformations which modify input images to improve the classifier's performance. In this work, we combine the idea of STN with Reinforcement Learning (RL). To this end, we break the affine transformation down into a sequence of simple and discrete transformations. We formulate the task as a Markovian Decision Process (MDP) and use RL to solve this sequential decision-making problem. STN architectures learn the transformation parameters by minimizing the classification error and backpropagating the gradients through a sub-differentiable sampling module. In our method, we are not bound to the differentiability of the sampling modules. Moreover, we have freedom in designing the objective rather than only minimizing the error; e.g., we can directly set the target as maximizing the accuracy. We design multiple experiments to verify the effectiveness of our method using cluttered MNIST and Fashion-MNIST datasets and show that our method outperforms STN with a proper definition of MDP components.
    Neural tensor contractions and the expressive power of deep neural quantum states. (arXiv:2103.10293v2 [quant-ph] UPDATED)
    (2 min) We establish a direct connection between general tensor networks and deep feed-forward artificial neural networks. The core of our results is the construction of neural-network layers that efficiently perform tensor contractions, and that use commonly adopted non-linear activation functions. The resulting deep networks feature a number of edges that closely matches the contraction complexity of the tensor networks to be approximated. In the context of many-body quantum states, this result establishes that neural-network states have strictly the same or higher expressive power than practically usable variational tensor networks. As an example, we show that all matrix product states can be efficiently written as neural-network states with a number of edges polynomial in the bond dimension and depth logarithmic in the system size. The opposite instead does not hold true, and our results imply that there exist quantum states that are not efficiently expressible in terms of matrix product states or practically usable PEPS, but that are instead efficiently expressible with neural network states.
    Consistency Regularization for Adversarial Robustness. (arXiv:2103.04623v2 [cs.LG] UPDATED)
    (2 min) Adversarial training (AT) is currently one of the most successful methods to obtain the adversarial robustness of deep neural networks. However, the phenomenon of robust overfitting, i.e., the robustness starts to decrease significantly during AT, has been problematic, not only making practitioners consider a bag of tricks for a successful training, e.g., early stopping, but also incurring a significant generalization gap in the robustness. In this paper, we propose an effective regularization technique that prevents robust overfitting by optimizing an auxiliary 'consistency' regularization loss during AT. Specifically, it forces the predictive distributions after attacking from two different augmentations of the same instance to be similar with each other. Our experimental results demonstrate that such a simple regularization technique brings significant improvements in the test robust accuracy of a wide range of AT methods. More remarkably, we also show that our method could significantly help the model to generalize its robustness against unseen adversaries, e.g., other types or larger perturbations compared to those used during training. Code is available at https://github.com/alinlab/consistency-adversarial.
    Detection of Adversarial Supports in Few-shot Classifiers Using Self-Similarity and Filtering. (arXiv:2012.06330v2 [cs.CR] UPDATED)
    (2 min) Few-shot classifiers excel under limited training samples, making them useful in applications with sparsely user-provided labels. Their unique relative prediction setup offers opportunities for novel attacks, such as targeting support sets required to categorise unseen test samples, which are not available in other machine learning setups. In this work, we propose a detection strategy to identify adversarial support sets, aimed at destroying the understanding of a few-shot classifier for a certain class. We achieve this by introducing the concept of self-similarity of a support set and by employing filtering of supports. Our method is attack-agnostic, and we are the first to explore adversarial detection for support sets of few-shot classifiers to the best of our knowledge. Our evaluation of the miniImagenet (MI) and CUB datasets exhibits good attack detection performance despite conceptual simplicity, showing high AUROC scores. We show that self-similarity and filtering for adversarial detection can be paired with other filtering functions, constituting a generalisable concept.
    Decaying Clipping Range in Proximal Policy Optimization. (arXiv:2102.10456v2 [cs.LG] UPDATED)
    (2 min) Proximal Policy Optimization (PPO) is among the most widely used algorithms in reinforcement learning, which achieves state-of-the-art performance in many challenging problems. The keys to its success are the reliable policy updates through the clipping mechanism and the multiple epochs of minibatch updates. The aim of this research is to give new simple but effective alternatives to the former. For this, we propose linearly and exponentially decaying clipping range approaches throughout the training. With these, we would like to provide higher exploration at the beginning and stronger restrictions at the end of the learning phase. We investigate their performance in several classical control and locomotive robotic environments. During the analysis, we found that they influence the achieved rewards and are effective alternatives to the constant clipping method in many reinforcement learning tasks.
    Continual Learning with Echo State Networks. (arXiv:2105.07674v2 [cs.LG] UPDATED)
    (2 min) Continual Learning (CL) refers to a learning setup where data is non stationary and the model has to learn without forgetting existing knowledge. The study of CL for sequential patterns revolves around trained recurrent networks. In this work, instead, we introduce CL in the context of Echo State Networks (ESNs), where the recurrent component is kept fixed. We provide the first evaluation of catastrophic forgetting in ESNs and we highlight the benefits in using CL strategies which are not applicable to trained recurrent models. Our results confirm the ESN as a promising model for CL and open to its use in streaming scenarios.
    ASK: Adversarial Soft k-Nearest Neighbor Attack and Defense. (arXiv:2106.14300v1 [cs.LG])
    (2 min) K-Nearest Neighbor (kNN)-based deep learning methods have been applied to many applications due to their simplicity and geometric interpretability. However, the robustness of kNN-based classification models has not been thoroughly explored and kNN attack strategies are underdeveloped. In this paper, we propose an Adversarial Soft kNN (ASK) loss to both design more effective kNN attack strategies and to develop better defenses against them. Our ASK loss approach has two advantages. First, ASK loss can better approximate the kNN's probability of classification error than objectives proposed in previous works. Second, the ASK loss is interpretable: it preserves the mutual information between the perturbed input and the kNN of the unperturbed input. We use the ASK loss to generate a novel attack method called the ASK-Attack (ASK-Atk), which shows superior attack efficiency and accuracy degradation relative to previous kNN attacks. Based on the ASK-Atk, we then derive an ASK-Defense (ASK-Def) method that optimizes the worst-case training loss induced by ASK-Atk.
    Efficient Sparse Coding using Hierarchical Riemannian Pursuit. (arXiv:2104.10314v2 [cs.LG] UPDATED)
    (2 min) Sparse coding is a class of unsupervised methods for learning a sparse representation of the input data in the form of a linear combination of a dictionary and a sparse code. This learning framework has led to state-of-the-art results in various image and video processing tasks. However, classical methods learn the dictionary and the sparse code based on alternative optimizations, usually without theoretical guarantees for either optimality or convergence due to non-convexity of the problem. Recent works on sparse coding with a complete dictionary provide strong theoretical guarantees thanks to the development of the non-convex optimization. However, initial non-convex approaches learn the dictionary in the sparse coding problem sequentially in an atom-by-atom manner, which leads to a long execution time. More recent works seek to directly learn the entire dictionary at once, which substantially reduces the execution time. However, the associated recovery performance is degraded with a finite number of data samples. In this paper, we propose an efficient sparse coding scheme with a two-stage optimization. The proposed scheme leverages the global and local Riemannian geometry of the two-stage optimization problem and facilitates fast implementation for superb dictionary recovery performance by a finite number of samples without atom-by-atom calculation. We further prove that, with high probability, the proposed scheme can exactly recover any atom in the target dictionary with a finite number of samples if it is adopted to recover one atom of the dictionary. An application on wireless sensor data compression is also proposed. Experiments on both synthetic and real-world data verify the efficiency and effectiveness of the proposed scheme.
    Good and Bad Optimization Models: Insights from Rockafellians. (arXiv:2105.06073v2 [math.OC] UPDATED)
    (2 min) A basic requirement for a mathematical model is often that its solution (output) shouldn't change much if the model's parameters (input) are perturbed. This is important because the exact values of parameters may not be known and one would like to avoid being mislead by an output obtained using incorrect values. Thus, it's rarely enough to address an application by formulating a model, solving the resulting optimization problem and presenting the solution as the answer. One would need to confirm that the model is suitable, i.e., "good," and this can, at least in part, be achieved by considering a family of optimization problems constructed by perturbing parameters of concern. The resulting sensitivity analysis uncovers troubling situations with unstable solutions, which we referred to as "bad" models, and indicates better model formulations. Embedding an actual problem of interest within a family of problems is also a primary path to optimality conditions as well as computationally attractive, alternative problems, which under ideal circumstances, and when properly tuned, may even furnish the minimum value of the actual problem. The tuning of these alternative problems turns out to be intimately tied to finding multipliers in optimality conditions and thus emerges as a main component of several optimization algorithms. In fact, the tuning amounts to solving certain dual optimization problems. In this tutorial, we'll discuss the opportunities and insights afforded by this broad perspective.
    Algorithm is Experiment: Machine Learning, Market Design, and Policy Eligibility Rules. (arXiv:2104.12909v2 [econ.EM] UPDATED)
    (2 min) Algorithms produce a growing portion of decisions and recommendations both in policy and business. Such algorithmic decisions are natural experiments (conditionally quasi-randomly assigned instruments) since the algorithms make decisions based only on observable input variables. We use this observation to develop a treatment-effect estimator for a class of stochastic and deterministic decision-making algorithms. Our estimator is shown to be consistent and asymptotically normal for well-defined causal effects. A key special case of our estimator is a multidimensional regression discontinuity design. We apply our estimator to evaluate the effect of the Coronavirus Aid, Relief, and Economic Security (CARES) Act, where more than \$175 billion worth of relief funding is allocated to hospitals via an algorithmic rule. Our estimates suggest that the relief funding has little effect on COVID-19-related hospital activity levels. Naive OLS and IV estimates exhibit substantial selection bias.
    Regularizing towards Causal Invariance: Linear Models with Proxies. (arXiv:2103.02477v2 [cs.LG] UPDATED)
    (2 min) We propose a method for learning linear models whose predictive performance is robust to causal interventions on unobserved variables, when noisy proxies of those variables are available. Our approach takes the form of a regularization term that trades off between in-distribution performance and robustness to interventions. Under the assumption of a linear structural causal model, we show that a single proxy can be used to create estimators that are prediction optimal under interventions of bounded strength. This strength depends on the magnitude of the measurement noise in the proxy, which is, in general, not identifiable. In the case of two proxy variables, we propose a modified estimator that is prediction optimal under interventions up to a known strength. We further show how to extend these estimators to scenarios where additional information about the "test time" intervention is available during training. We evaluate our theoretical findings in synthetic experiments and using real data of hourly pollution levels across several cities in China.
    Automated Detection of Abnormalities from an EEG Recording of Epilepsy Patients With a Compact Convolutional Neural Network. (arXiv:2105.10358v2 [eess.SP] UPDATED)
    (2 min) Electroencephalography (EEG) is essential for the diagnosis of epilepsy, but it requires expertise and experience to identify abnormalities. It is thus crucial to develop automated models for the detection of abnormalities in EEGs related to epilepsy. This paper describes the development of a novel class of compact convolutional neural networks (CNNs) for detecting abnormal patterns and electrodes in EEGs for epilepsy. The designed model is inspired by a CNN developed for brain-computer interfacing called multichannel EEGNet (mEEGNet). Unlike the EEGNet, the proposed model, mEEGNet, has the same number of electrode inputs and outputs to detect abnormal patterns. The mEEGNet was evaluated with a clinical dataset consisting of 29 cases of juvenile and childhood absence epilepsy labeled by a clinical expert. The labels were given to paroxysmal discharges visually observed in both ictal (seizure) and interictal (nonseizure) durations. Results showed that the mEEGNet detected abnormalities with the area under the curve, F1-values, and sensitivity equivalent to or higher than those of existing CNNs. Moreover, the number of parameters is much smaller than other CNN models. To our knowledge, the dataset of absence epilepsy validated with machine learning through this research is the largest in the literature.
    Learning swimming escape patterns for larval fish under energy constraints. (arXiv:2105.00771v2 [physics.flu-dyn] UPDATED)
    (2 min) Swimming organisms can escape their predators by creating and harnessing unsteady flow fields through their body motions. Stochastic optimization and flow simulations have identified escape patterns that are consistent with those observed in natural larval swimmers. However, these patterns have been limited by the specification of a particular cost function and depend on a prescribed functional form of the body motion. Here, we deploy reinforcement learning to discover swimmer escape patterns for larval fish under energy constraints. The identified patterns include the C-start mechanism, in addition to more energetically efficient escapes. We find that maximizing distance with limited energy requires swimming via short bursts of accelerating motion interlinked with phases of gliding. The present, data efficient, reinforcement learning algorithm results in an array of patterns that reveal practical flow optimization principles for efficient swimming and the methodology can be transferred to the control of aquatic robotic devices operating under energy constraints.
    A Survey on Deep Learning for Human Mobility. (arXiv:2012.02825v2 [cs.LG] UPDATED)
    (2 min) The study of human mobility is crucial due to its impact on several aspects of our society, such as disease spreading, urban planning, well-being, pollution, and more. The proliferation of digital mobility data, such as phone records, GPS traces, and social media posts, combined with the predictive power of artificial intelligence, triggered the application of deep learning to human mobility. Existing surveys focus on single tasks, data sources, mechanistic or traditional machine learning approaches, while a comprehensive description of deep learning solutions is missing. This survey provides a taxonomy of mobility tasks, a discussion on the challenges related to each task and how deep learning may overcome the limitations of traditional models, a description of the most relevant solutions to the mobility tasks described above and the relevant challenges for the future. Our survey is a guide to the leading deep learning solutions to next-location prediction, crowd flow prediction, trajectory generation, and flow generation. At the same time, it helps deep learning scientists and practitioners understand the fundamental concepts and the open challenges of the study of human mobility.
    Accelerating Recurrent Neural Networks for Gravitational Wave Experiments. (arXiv:2106.14089v1 [cs.LG])
    (2 min) This paper presents novel reconfigurable architectures for reducing the latency of recurrent neural networks (RNNs) that are used for detecting gravitational waves. Gravitational interferometers such as the LIGO detectors capture cosmic events such as black hole mergers which happen at unknown times and of varying durations, producing time-series data. We have developed a new architecture capable of accelerating RNN inference for analyzing time-series data from LIGO detectors. This architecture is based on optimizing the initiation intervals (II) in a multi-layer LSTM (Long Short-Term Memory) network, by identifying appropriate reuse factors for each layer. A customizable template for this architecture has been designed, which enables the generation of low-latency FPGA designs with efficient resource utilization using high-level synthesis tools. The proposed approach has been evaluated based on two LSTM models, targeting a ZYNQ 7045 FPGA and a U250 FPGA. Experimental results show that with balanced II, the number of DSPs can be reduced up to 42% while achieving the same IIs. When compared to other FPGA-based LSTM designs, our design can achieve about 4.92 to 12.4 times lower latency.
    Sanity Simulations for Saliency Methods. (arXiv:2105.06506v2 [cs.LG] UPDATED)
    (2 min) Saliency methods are a popular class of feature attribution tools that aim to capture a model's predictive reasoning by identifying "important" pixels in an input image. However, the development and adoption of saliency methods are currently hindered by the lack of access to underlying model reasoning, which prevents accurate method evaluation. In this work, we design a synthetic evaluation framework, SMERF, that allows us to perform ground-truth-based evaluation of saliency methods while controlling the underlying complexity of model reasoning. Experimental evaluations via SMERF reveal significant limitations in existing saliency methods, especially given the relative simplicity of SMERF's synthetic evaluation tasks. Moreover, the SMERF benchmarking suite represents a useful tool in the development of new saliency methods to potentially overcome these limitations.
    Lambda Learner: Fast Incremental Learning on Data Streams. (arXiv:2010.05154v2 [cs.LG] UPDATED)
    (2 min) One of the most well-established applications of machine learning is in deciding what content to show website visitors. When observation data comes from high-velocity, user-generated data streams, machine learning methods perform a balancing act between model complexity, training time, and computational costs. Furthermore, when model freshness is critical, the training of models becomes time-constrained. Parallelized batch offline training, although horizontally scalable, is often not time-considerate or cost-effective. In this paper, we propose Lambda Learner, a new framework for training models by incremental updates in response to mini-batches from data streams. We show that the resulting model of our framework closely estimates a periodically updated model trained on offline data and outperforms it when model updates are time-sensitive. We provide theoretical proof that the incremental learning updates improve the loss-function over a stale batch model. We present a large-scale deployment on the sponsored content platform for a large social network, serving hundreds of millions of users across different channels (e.g., desktop, mobile). We address challenges and complexities from both algorithms and infrastructure perspectives, and illustrate the system details for computation, storage, and streaming production of training data.
    SymbolicGPT: A Generative Transformer Model for Symbolic Regression. (arXiv:2106.14131v1 [cs.LG])
    (2 min) Symbolic regression is the task of identifying a mathematical expression that best fits a provided dataset of input and output values. Due to the richness of the space of mathematical expressions, symbolic regression is generally a challenging problem. While conventional approaches based on genetic evolution algorithms have been used for decades, deep learning-based methods are relatively new and an active research area. In this work, we present SymbolicGPT, a novel transformer-based language model for symbolic regression. This model exploits the advantages of probabilistic language models like GPT, including strength in performance and flexibility. Through comprehensive experiments, we show that our model performs strongly compared to competing models with respect to the accuracy, running time, and data efficiency.
    Adversarial Feature Desensitization. (arXiv:2006.04621v2 [cs.LG] UPDATED)
    (2 min) Neural networks are known to be vulnerable to adversarial attacks -- slight but carefully constructed perturbations of the inputs which can drastically impair the network's performance. Many defense methods have been proposed for improving robustness of deep networks by training them on adversarially perturbed inputs. However, these models often remain vulnerable to new types of attacks not seen during training, and even to slightly stronger versions of previously seen attacks. In this work, we propose a novel approach to adversarial robustness, which builds upon the insights from the domain adaptation field. Our method, called Adversarial Feature Desensitization (AFD), aims at learning features that are invariant towards adversarial perturbations of the inputs. This is achieved through a game where we learn features that are both predictive and robust (insensitive to adversarial attacks), i.e. cannot be used to discriminate between natural and adversarial data. Empirical results on several benchmarks demonstrate the effectiveness of the proposed approach against a wide range of attack types and attack strengths.
    Online Multi-Armed Bandits with Adaptive Inference. (arXiv:2102.13202v2 [cs.LG] UPDATED)
    (2 min) During online decision making in Multi-Armed Bandits (MAB), one needs to conduct inference on the true mean reward of each arm based on data collected so far at each step. However, since the arms are adaptively selected--thereby yielding non-iid data--conducting inference accurately is not straightforward. In particular, sample averaging, which is used in the family of UCB and Thompson sampling (TS) algorithms, does not provide a good choice as it suffers from bias and a lack of good statistical properties (e.g. asymptotic normality). Our thesis in this paper is that more sophisticated inference schemes that take into account the adaptive nature of the sequentially collected data can unlock further performance gains, even though both UCB and TS type algorithms are optimal in the worst case. In particular, we propose a variant of TS-style algorithms--which we call doubly adaptive TS--that leverages recent advances in causal inference and adaptively reweights the terms of a doubly robust estimator on the true mean reward of each arm. Through 20 synthetic domain experiments and a semi-synthetic experiment based on data from an A/B test of a web service, we demonstrate that using an adaptive inferential scheme (while still retaining the exploration efficacy of TS) provides clear benefits in online decision making: the proposed DATS algorithm has superior empirical performance to existing baselines (UCB and TS) in terms of regret and sample complexity in identifying the best arm. In addition, we also provide a finite-time regret bound of doubly adaptive TS that matches (up to log factors) those of UCB and TS algorithms, thereby establishing that its improved practical benefits do not come at the expense of worst-case suboptimality.
    Differentially Private SGD with Non-Smooth Losses. (arXiv:2101.08925v2 [stat.ML] UPDATED)
    (2 min) In this paper, we are concerned with differentially private {stochastic gradient descent (SGD)} algorithms in the setting of stochastic convex optimization (SCO). Most of the existing work requires the loss to be Lipschitz continuous and strongly smooth, and the model parameter to be uniformly bounded. However, these assumptions are restrictive as many popular losses violate these conditions including the hinge loss for SVM, the absolute loss in robust regression, and even the least square loss in an unbounded domain. We significantly relax these restrictive assumptions and establish privacy and generalization (utility) guarantees for private SGD algorithms using output and gradient perturbations associated with non-smooth convex losses. Specifically, the loss function is relaxed to have an $\alpha$-H\"{o}lder continuous gradient (referred to as $\alpha$-H\"{o}lder smoothness) which instantiates the Lipschitz continuity ($\alpha=0$) and the strong smoothness ($\alpha=1$). We prove that noisy SGD with $\alpha$-H\"older smooth losses using gradient perturbation can guarantee $(\epsilon,\delta)$-differential privacy (DP) and attain optimal excess population risk $\mathcal{O}\Big(\frac{\sqrt{d\log(1/\delta)}}{n\epsilon}+\frac{1}{\sqrt{n}}\Big)$, up to logarithmic terms, with the gradient complexity $ \mathcal{O}( n^{2-\alpha\over 1+\alpha}+ n).$ This shows an important trade-off between $\alpha$-H\"older smoothness of the loss and the computational complexity for private SGD with statistically optimal performance. In particular, our results indicate that $\alpha$-H\"older smoothness with $\alpha\ge {1/2}$ is sufficient to guarantee $(\epsilon,\delta)$-DP of noisy SGD algorithms while achieving optimal excess risk with the linear gradient complexity $\mathcal{O}(n).$
    Evaluating adversarial robustness in simulated cerebellum. (arXiv:2012.02976v2 [cs.NE] UPDATED)
    (2 min) It is well known that artificial neural networks are vulnerable to adversarial examples, in which great efforts have been made to improve the robustness. However, such examples are usually imperceptible to humans, and thus their effect on biological neural circuits is largely unknown. This paper will investigate the adversarial robustness in a simulated cerebellum, a well-studied supervised learning system in computational neuroscience. Specifically, we propose to study three unique characteristics revealed in the cerebellum: (i) network width; (ii) long-term depression on the parallel fiber-Purkinje cell synapses; (iii) sparse connectivity in the granule layer, and hypothesize that they will be beneficial for improving robustness. To the best of our knowledge, this is the first attempt to examine the adversarial robustness in simulated cerebellum models. The results are negative in the experimental phase -- no significant improvements in robustness are discovered from the proposed three mechanisms. Consequently, the cerebellum is expected to be vulnerable to adversarial examples as the deep neural networks under batch training. Neuroscientists are encouraged to fool the biological system in experiments with adversarial attacks.
    MSAF: Multimodal Split Attention Fusion. (arXiv:2012.07175v2 [cs.CV] UPDATED)
    (2 min) Multimodal learning mimics the reasoning process of the human multi-sensory system, which is used to perceive the surrounding world. While making a prediction, the human brain tends to relate crucial cues from multiple sources of information. In this work, we propose a novel multimodal fusion module that learns to emphasize more contributive features across all modalities. Specifically, the proposed Multimodal Split Attention Fusion (MSAF) module splits each modality into channel-wise equal feature blocks and creates a joint representation that is used to generate soft attention for each channel across the feature blocks. Further, the MSAF module is designed to be compatible with features of various spatial dimensions and sequence lengths, suitable for both CNNs and RNNs. Thus, MSAF can be easily added to fuse features of any unimodal networks and utilize existing pretrained unimodal model weights. To demonstrate the effectiveness of our fusion module, we design three multimodal networks with MSAF for emotion recognition, sentiment analysis, and action recognition tasks. Our approach achieves competitive results in each task and outperforms other application-specific networks and multimodal fusion benchmarks.
    Learning Adaptive Classifiers Synthesis for Generalized Few-Shot Learning. (arXiv:1906.02944v5 [cs.CV] UPDATED)
    (2 min) Object recognition in the real-world requires handling long-tailed or even open-ended data. An ideal visual system needs to recognize the populated head visual concepts reliably and meanwhile efficiently learn about emerging new tail categories with a few training instances. Class-balanced many-shot learning and few-shot learning tackle one side of this problem, by either learning strong classifiers for head or learning to learn few-shot classifiers for the tail. In this paper, we investigate the problem of generalized few-shot learning (GFSL) -- a model during the deployment is required to learn about tail categories with few shots and simultaneously classify the head classes. We propose the ClAssifier SynThesis LEarning (CASTLE), a learning framework that learns how to synthesize calibrated few-shot classifiers in addition to the multi-class classifiers of head classes with a shared neural dictionary, shedding light upon the inductive GFSL. Furthermore, we propose an adaptive version of CASTLE (ACASTLE) that adapts the head classifiers conditioned on the incoming tail training examples, yielding a framework that allows effective backward knowledge transfer. As a consequence, ACASTLE can handle GFSL with classes from heterogeneous domains effectively. CASTLE and ACASTLE demonstrate superior performances than existing GFSL algorithms and strong baselines on MiniImageNet as well as TieredImageNet datasets. More interestingly, they outperform previous state-of-the-art methods when evaluated with standard few-shot learning criteria.
    Score-Based Change Detection for Gradient-Based Learning Machines. (arXiv:2106.14122v1 [stat.ML])
    (2 min) The widespread use of machine learning algorithms calls for automatic change detection algorithms to monitor their behavior over time. As a machine learning algorithm learns from a continuous, possibly evolving, stream of data, it is desirable and often critical to supplement it with a companion change detection algorithm to facilitate its monitoring and control. We present a generic score-based change detection method that can detect a change in any number of components of a machine learning model trained via empirical risk minimization. This proposed statistical hypothesis test can be readily implemented for such models designed within a differentiable programming framework. We establish the consistency of the hypothesis test and show how to calibrate it to achieve a prescribed false alarm rate. We illustrate the versatility of the approach on synthetic and real data.
    LSMI-Sinkhorn: Semi-supervised Mutual Information Estimation with Optimal Transport. (arXiv:1909.02373v3 [stat.ML] UPDATED)
    (2 min) Estimating mutual information is an important statistics and machine learning problem. To estimate the mutual information from data, a common practice is preparing a set of paired samples $\{(\mathbf{x}_i,\mathbf{y}_i)\}_{i=1}^n \stackrel{\mathrm{i.i.d.}}{\sim} p(\mathbf{x},\mathbf{y})$. However, in many situations, it is difficult to obtain a large number of data pairs. To address this problem, we propose the semi-supervised Squared-loss Mutual Information (SMI) estimation method using a small number of paired samples and the available unpaired ones. We first represent SMI through the density ratio function, where the expectation is approximated by the samples from marginals and its assignment parameters. The objective is formulated using the optimal transport problem and quadratic programming. Then, we introduce the Least-Squares Mutual Information with Sinkhorn (LSMI-Sinkhorn) algorithm for efficient optimization. Through experiments, we first demonstrate that the proposed method can estimate the SMI without a large number of paired samples. Then, we show the effectiveness of the proposed LSMI-Sinkhorn algorithm on various types of machine learning problems such as image matching and photo album summarization. Code can be found at https://github.com/csyanbin/LSMI-Sinkhorn.
    Instance-optimality in optimal value estimation: Adaptivity via variance-reduced Q-learning. (arXiv:2106.14352v1 [stat.ML])
    (2 min) Various algorithms in reinforcement learning exhibit dramatic variability in their convergence rates and ultimate accuracy as a function of the problem structure. Such instance-specific behavior is not captured by existing global minimax bounds, which are worst-case in nature. We analyze the problem of estimating optimal $Q$-value functions for a discounted Markov decision process with discrete states and actions and identify an instance-dependent functional that controls the difficulty of estimation in the $\ell_\infty$-norm. Using a local minimax framework, we show that this functional arises in lower bounds on the accuracy on any estimation procedure. In the other direction, we establish the sharpness of our lower bounds, up to factors logarithmic in the state and action spaces, by analyzing a variance-reduced version of $Q$-learning. Our theory provides a precise way of distinguishing "easy" problems from "hard" ones in the context of $Q$-learning, as illustrated by an ensemble with a continuum of difficulty.
    The KL-Divergence between a Graph Model and its Fair I-Projection as a Fairness Regularizer. (arXiv:2103.01846v2 [cs.LG] UPDATED)
    (2 min) Learning and reasoning over graphs is increasingly done by means of probabilistic models, e.g. exponential random graph models, graph embedding models, and graph neural networks. When graphs are modeling relations between people, however, they will inevitably reflect biases, prejudices, and other forms of inequity and inequality. An important challenge is thus to design accurate graph modeling approaches while guaranteeing fairness according to the specific notion of fairness that the problem requires. Yet, past work on the topic remains scarce, is limited to debiasing specific graph modeling methods, and often aims to ensure fairness in an indirect manner. We propose a generic approach applicable to most probabilistic graph modeling approaches. Specifically, we first define the class of fair graph models corresponding to a chosen set of fairness criteria. Given this, we propose a fairness regularizer defined as the KL-divergence between the graph model and its I-projection onto the set of fair models. We demonstrate that using this fairness regularizer in combination with existing graph modeling approaches efficiently trades-off fairness with accuracy, whereas the state-of-the-art models can only make this trade-off for the fairness criterion that they were specifically designed for.
    Mode-wise Tensor Decompositions: Multi-dimensional Generalizations of CUR Decompositions. (arXiv:2103.11037v2 [math.NA] UPDATED)
    (2 min) Low rank tensor approximation is a fundamental tool in modern machine learning and data science. In this paper, we study the characterization, perturbation analysis, and an efficient sampling strategy for two primary tensor CUR approximations, namely Chidori and Fiber CUR. We characterize exact tensor CUR decompositions for low multilinear rank tensors. We also present theoretical error bounds of the tensor CUR approximations when (adversarial or Gaussian) noise appears. Moreover, we show that low cost uniform sampling is sufficient for tensor CUR approximations if the tensor has an incoherent structure. Empirical performance evaluations, with both synthetic and real-world datasets, establish the speed advantage of the tensor CUR approximations over other state-of-the-art low multilinear rank tensor approximations.
    Habitat 2.0: Training Home Assistants to Rearrange their Habitat. (arXiv:2106.14405v1 [cs.LG])
    (2 min) We introduce Habitat 2.0 (H2.0), a simulation platform for training virtual robots in interactive 3D environments and complex physics-enabled scenarios. We make comprehensive contributions to all levels of the embodied AI stack - data, simulation, and benchmark tasks. Specifically, we present: (i) ReplicaCAD: an artist-authored, annotated, reconfigurable 3D dataset of apartments (matching real spaces) with articulated objects (e.g. cabinets and drawers that can open/close); (ii) H2.0: a high-performance physics-enabled 3D simulator with speeds exceeding 25,000 simulation steps per second (850x real-time) on an 8-GPU node, representing 100x speed-ups over prior work; and, (iii) Home Assistant Benchmark (HAB): a suite of common tasks for assistive robots (tidy the house, prepare groceries, set the table) that test a range of mobile manipulation capabilities. These large-scale engineering contributions allow us to systematically compare deep reinforcement learning (RL) at scale and classical sense-plan-act (SPA) pipelines in long-horizon structured tasks, with an emphasis on generalization to new objects, receptacles, and layouts. We find that (1) flat RL policies struggle on HAB compared to hierarchical ones; (2) a hierarchy with independent skills suffers from 'hand-off problems', and (3) SPA pipelines are more brittle than RL policies.
    The Implications of the No-Free-Lunch Theorems for Meta-induction. (arXiv:2103.11956v2 [cs.LG] UPDATED)
    (2 min) The important recent book by G. Schurz appreciates that the no-free-lunch theorems (NFL) have major implications for the problem of (meta) induction. Here I review the NFL theorems, emphasizing that they do not only concern the case where there is a uniform prior -- they prove that there are "as many priors" (loosely speaking) for which any induction algorithm $A$ out-generalizes some induction algorithm $B$ as vice-versa. Importantly though, in addition to the NFL theorems, there are many \textit{free lunch} theorems. In particular, the NFL theorems can only be used to compare the \textit{marginal} expected performance of an induction algorithm $A$ with the marginal expected performance of an induction algorithm $B$. There is a rich set of free lunches which instead concern the statistical correlations among the generalization errors of induction algorithms. As I describe, the meta-induction algorithms that Schurz advocate as a "solution to Hume's problem" are just an example of such a free lunch based on correlations among the generalization errors of induction algorithms. I end by pointing out that the prior that Schurz advocates, which is uniform over bit frequencies rather than bit patterns, is contradicted by thousands of experiments in statistical physics and by the great success of the maximum entropy procedure in inductive inference.
    Interpretable Network Representation Learning with Principal Component Analysis. (arXiv:2106.14238v1 [stat.ML])
    (2 min) We consider the problem of interpretable network representation learning for samples of network-valued data. We propose the Principal Component Analysis for Networks (PCAN) algorithm to identify statistically meaningful low-dimensional representations of a network sample via subgraph count statistics. The PCAN procedure provides an interpretable framework for which one can readily visualize, explore, and formulate predictive models for network samples. We furthermore introduce a fast sampling-based algorithm, sPCAN, which is significantly more computationally efficient than its counterpart, but still enjoys advantages of interpretability. We investigate the relationship between these two methods and analyze their large-sample properties under the common regime where the sample of networks is a collection of kernel-based random graphs. We show that under this regime, the embeddings of the sPCAN method enjoy a central limit theorem and moreover that the population level embeddings of PCAN and sPCAN are equivalent. We assess PCAN's ability to visualize, cluster, and classify observations in network samples arising in nature, including functional connectivity network samples and dynamic networks describing the political co-voting habits of the U.S. Senate. Our analyses reveal that our proposed algorithm provides informative and discriminatory features describing the networks in each sample. The PCAN and sPCAN methods build on the current literature of network representation learning and set the stage for a new line of research in interpretable learning on network-valued data. Publicly available software for the PCAN and sPCAN methods are available at https://www.github.com/jihuilee/.
    Collective Intelligence: Decentralized Learning for Android Malware Detection in IoT with Blockchain. (arXiv:2102.13376v2 [cs.CR] UPDATED)
    (2 min) The widespread significance of Android IoT devices is due to its flexibility and hardware support features which revolutionized the digital world by introducing exciting applications almost in all walks of daily life, such as healthcare, smart cities, smart environments, safety, remote sensing, and many more. Such versatile applicability gives incentive for more malware attacks. In this paper, we propose a framework which continuously aggregates multiple user trained models on non-overlapping data into single model. Specifically for malware detection task, (i) we propose a novel user (local) neural network (LNN) which trains on local distribution and (ii) then to assure the model authenticity and quality, we propose a novel smart contract which enable aggregation process over blokchain platform. The LNN model analyzes various static and dynamic features of both malware and benign whereas the smart contract verifies the malicious applications both for uploading and downloading processes in the network using stored aggregated features of local models. In this way, the proposed model not only improves malware detection accuracy using decentralized model network but also model efficacy with blockchain. We evaluate our approach with three state-of-the-art models and performed deep analyses of extracted features of the relative model.
    Importance of Environment Design in Reinforcement Learning: A Study of a Robotic Environment. (arXiv:2102.10447v2 [cs.LG] UPDATED)
    (2 min) An in-depth understanding of the particular environment is crucial in reinforcement learning (RL). To address this challenge, the decision-making process of a mobile collaborative robotic assistant modeled by the Markov decision process (MDP) framework is studied in this paper. The optimal state-action combinations of the MDP are calculated with the non-linear Bellman optimality equations. This system of equations can be solved with relative ease by the computational power of Wolfram Mathematica, where the obtained optimal action-values point to the optimal policy. Unlike other RL algorithms, this methodology does not approximate the optimal behavior, it gives the exact, explicit solution, which provides a strong foundation for our study. With this, we offer new insights into understanding the action selection mechanisms in RL by presenting various small modifications on the very same schema that lead to different optimal policies.
    High-Dimensional Uncertainty Quantification via Tensor Regression with Rank Determination and Adaptive Sampling. (arXiv:2103.17236v2 [stat.ML] UPDATED)
    (2 min) Fabrication process variations can significantly influence the performance and yield of nano-scale electronic and photonic circuits. Stochastic spectral methods have achieved great success in quantifying the impact of process variations, but they suffer from the curse of dimensionality. Recently, low-rank tensor methods have been developed to mitigate this issue, but two fundamental challenges remain open: how to automatically determine the tensor rank and how to adaptively pick the informative simulation samples. This paper proposes a novel tensor regression method to address these two challenges. We use a $\ell_{q}/ \ell_{2}$ group-sparsity regularization to determine the tensor rank. The resulting optimization problem can be efficiently solved via an alternating minimization solver. We also propose a two-stage adaptive sampling method to reduce the simulation cost. Our method considers both exploration and exploitation via the estimated Voronoi cell volume and nonlinearity measurement respectively. The proposed model is verified with synthetic and some realistic circuit benchmarks, on which our method can well capture the uncertainty caused by 19 to 100 random variables with only 100 to 600 simulation samples.
    Query Complexity of Least Absolute Deviation Regression via Robust Uniform Convergence. (arXiv:2102.02322v2 [cs.LG] UPDATED)
    (2 min) Consider a regression problem where the learner is given a large collection of $d$-dimensional data points, but can only query a small subset of the real-valued labels. How many queries are needed to obtain a $1+\epsilon$ relative error approximation of the optimum? While this problem has been extensively studied for least squares regression, little is known for other losses. An important example is least absolute deviation regression ($\ell_1$ regression) which enjoys superior robustness to outliers compared to least squares. We develop a new framework for analyzing importance sampling methods in regression problems, which enables us to show that the query complexity of least absolute deviation regression is $\Theta(d/\epsilon^2)$ up to logarithmic factors. We further extend our techniques to show the first bounds on the query complexity for any $\ell_p$ loss with $p\in(1,2)$. As a key novelty in our analysis, we introduce the notion of robust uniform convergence, which is a new approximation guarantee for the empirical loss. While it is inspired by uniform convergence in statistical learning, our approach additionally incorporates a correction term to avoid unnecessary variance due to outliers. This can be viewed as a new connection between statistical learning theory and variance reduction techniques in stochastic optimization, which should be of independent interest.
    Fully Three-dimensional Radial Visualization. (arXiv:1904.06366v3 [stat.ML] UPDATED)
    (2 min) We develop methodology for three-dimensional (3D) radial visualization (RadViz) of multidimensional datasets. Our tool is called RadViz3D and extends the classical two-dimensional (2D) RadViz that visualizes multivariate data in the 2D plane by mapping every observation to a point inside the unit circle. We show that distributing anchor points uniformly on the 3D unit sphere provides the best visualization with minimal artificial visual correlation for data with uncorrelated variables. However, anchor points can be placed exactly equi-distant from each other only for the five Platonic solids. We provide equi-distant anchor points for these five settings, and approximately equi-distant anchor points via a Fibonacci grid for the other cases. Our methodology, implemented in the R package $radviz3d$, makes fully 3D RadViz possible and is shown to improve clarity of this nonlinear display technique on simulated and real datasets.
    Poisoning the Search Space in Neural Architecture Search. (arXiv:2106.14406v1 [cs.LG])
    (2 min) Deep learning has proven to be a highly effective problem-solving tool for object detection and image segmentation across various domains such as healthcare and autonomous driving. At the heart of this performance lies neural architecture design which relies heavily on domain knowledge and prior experience on the researchers' behalf. More recently, this process of finding the most optimal architectures, given an initial search space of possible operations, was automated by Neural Architecture Search (NAS). In this paper, we evaluate the robustness of one such algorithm known as Efficient NAS (ENAS) against data agnostic poisoning attacks on the original search space with carefully designed ineffective operations. By evaluating algorithm performance on the CIFAR-10 dataset, we empirically demonstrate how our novel search space poisoning (SSP) approach and multiple-instance poisoning attacks exploit design flaws in the ENAS controller to result in inflated prediction error rates for child networks. Our results provide insights into the challenges to surmount in using NAS for more adversarially robust architecture search.
    BAM: A Lightweight and Efficient Balanced Attention Mechanism for Single Image Super Resolution. (arXiv:2104.07566v2 [eess.IV] UPDATED)
    (2 min) Attention mechanism has shown enormous potential for single image super-resolution (SISR). However, existing works only proposed some attention mechanism for a specific network. A universal attention mechanism for SISR, which could further improve the performance of networks without attention and provide a baseline for networks with attention, is still lacking. To fit this gap, we propose a lightweight and efficient Balanced Attention Mechanism (BAM), which consists of Avgpool Channel Attention Module (ACAM) and Maxpool Spatial Attention Module (MSAM) in parallel. The information extraction mechanism of ACAM and MSAM effectively filters redundant information, making the overall structure of BAM very lightweight. Owing to the parallel structure, during the gradient backpropagation process of BAM, ACAM and MSAM not only conduct self-optimization, but also mutual optimization so as to generate more balanced attention information. To verify the effectiveness and robustness of BAM, we applied it to 12 state-ofthe-art SISR networks. The results on 4 benchmark datasets demonstrate that BAM can efficiently improve the networks' performance, and for those with attention, the substitution with BAM further reduces the amount of parameters and increase the inference speed. Moreover, ablation experiments were conducted to prove the minimalism of BAM.
    Multi-target normal behaviour models for wind farm condition monitoring. (arXiv:2012.03074v3 [cs.LG] UPDATED)
    (2 min) The trend towards larger wind turbines and remote locations of wind farms fuels the demand for automated condition monitoring strategies that can reduce the operating cost and avoid unplanned downtime. Normal behaviour modelling has been introduced to detect anomalous deviations from normal operation based on the turbine's SCADA data. A growing number of machine learning models of the normal behaviour of turbine subsystems are being developed by wind farm managers to this end. However, these models need to be kept track of, be maintained and require frequent updates. This research explores multi-target models as a new approach to capturing a wind turbine's normal behaviour. We present an overview of multi-target regression methods, motivate their application and benefits in wind turbine condition monitoring, and assess their performance in a wind farm case study. We find that multi-target models are advantageous in comparison to single-target modelling in that they can reduce the cost and effort of practical condition monitoring without compromising on the accuracy. We also outline some areas of future research.
    A Hypothesis Testing Approach to Nonstationary Source Separation. (arXiv:2105.06958v2 [eess.SP] UPDATED)
    (2 min) The extraction of nonstationary signals from blind and semi-blind multivariate observations is a recurrent problem. Numerous algorithms have been developed for this problem, which are based on the exact or approximate joint diagonalization of second or higher order cumulant matrices/tensors of multichannel data. While a great body of research has been dedicated to joint diagonalization algorithms, the selection of the diagonalized matrix/tensor set remains highly problem-specific. Herein, various methods for nonstationarity identification are reviewed and a new general framework based on hypothesis testing is proposed, which results in a classification/clustering perspective to semi-blind source separation of nonstationary components. The proposed method is applied to noninvasive fetal ECG extraction, as case study.
    Algorithmic Recourse in the Wild: Understanding the Impact of Data and Model Shifts. (arXiv:2012.11788v3 [cs.LG] UPDATED)
    (2 min) As predictive models are increasingly being deployed to make a variety of consequential decisions, there is a growing emphasis on designing algorithms that can provide recourse to affected individuals. Existing recourse algorithms function under the assumption that the underlying predictive model does not change. However, models are regularly updated in practice for several reasons including data distribution shifts. In this work, we make the first attempt at understanding how model updates resulting from data distribution shifts impact the algorithmic recourses generated by state-of-the-art algorithms. We carry out a rigorous theoretical and empirical analysis to address the above question. Our theoretical results establish a lower bound on the probability of recourse invalidation due to model shifts, and show the existence of a tradeoff between this invalidation probability and typical notions of "cost" minimized by modern recourse generation algorithms. We experiment with multiple synthetic and real world datasets, capturing different kinds of distribution shifts including temporal shifts, geospatial shifts, and shifts due to data correction. These experiments demonstrate that model updation due to all the aforementioned distribution shifts can potentially invalidate recourses generated by state-of-the-art algorithms. Our findings thus not only expose previously unknown flaws in the current recourse generation paradigm, but also pave the way for fundamentally rethinking the design and development of recourse generation algorithms.
    Rejoinder: Gaussian Differential Privacy. (arXiv:2104.01987v2 [cs.CR] UPDATED)
    (2 min) In this rejoinder, we aim to address two broad issues that cover most comments made in the discussion. First, we discuss some theoretical aspects of our work and comment on how this work might impact the theoretical foundation of privacy-preserving data analysis. Taking a practical viewpoint, we next discuss how f-differential privacy (f-DP) and Gaussian differential privacy (GDP) can make a difference in a range of applications.
    Learning and Information in Stochastic Networks and Queues. (arXiv:2105.08769v3 [cs.LG] UPDATED)
    (2 min) We review the role of information and learning in the stability and optimization of queueing systems. In recent years, techniques from supervised learning, bandit learning and reinforcement learning have been applied to queueing systems supported by increasing role of information in decision making. We present observations and new results that help rationalize the application of these areas to queueing systems. We prove that the MaxWeight and BackPressure policies are an application of Blackwell's Approachability Theorem. This connects queueing theoretic results with adversarial learning. We then discuss the requirements of statistical learning for service parameter estimation. As an example, we show how queue size regret can be bounded when applying a perceptron algorithm to classify service. Next, we discuss the role of state information in improved decision making. Here we contrast the roles of epistemic information (information on uncertain parameters) and aleatoric information (information on an uncertain state). Finally we review recent advances in the theory of reinforcement learning and queueing, as well as, provide discussion on current research challenges.
    NLRG at SemEval-2021 Task 5: Toxic Spans Detection Leveraging BERT-based Token Classification and Span Prediction Techniques. (arXiv:2102.12254v2 [cs.CL] UPDATED)
    (2 min) Toxicity detection of text has been a popular NLP task in the recent years. In SemEval-2021 Task-5 Toxic Spans Detection, the focus is on detecting toxic spans within passages. Most state-of-the-art span detection approaches employ various techniques, each of which can be broadly classified into Token Classification or Span Prediction approaches. In our paper, we explore simple versions of both of these approaches and their performance on the task. Specifically, we use BERT-based models -- BERT, RoBERTa, and SpanBERT for both approaches. We also combine these approaches and modify them to bring improvements for Toxic Spans prediction. To this end, we investigate results on four hybrid approaches -- Multi-Span, Span+Token, LSTM-CRF, and a combination of predicted offsets using union/intersection. Additionally, we perform a thorough ablative analysis and analyze our observed results. Our best submission -- a combination of SpanBERT Span Predictor and RoBERTa Token Classifier predictions -- achieves an F1 score of 0.6753 on the test set. Our best post-eval F1 score is 0.6895 on intersection of predicted offsets from top-3 RoBERTa Token Classification checkpoints. These approaches improve the performance by 3% on average than those of the shared baseline models -- RNNSL and SpaCy NER.
    Model-Based Deep Learning. (arXiv:2012.08405v2 [eess.SP] UPDATED)
    (2 min) Signal processing, communications, and control have traditionally relied on classical statistical modeling techniques. Such model-based methods utilize mathematical formulations that represent the underlying physics, prior information and additional domain knowledge. Simple classical models are useful but sensitive to inaccuracies and may lead to poor performance when real systems display complex or dynamic behavior. On the other hand, purely data-driven approaches that are model-agnostic are becoming increasingly popular as datasets become abundant and the power of modern deep learning pipelines increases. Deep neural networks (DNNs) use generic architectures which learn to operate from data, and demonstrate excellent performance, especially for supervised problems. However, DNNs typically require massive amounts of data and immense computational resources, limiting their applicability for some signal processing scenarios. We are interested in hybrid techniques that combine principled mathematical models with data-driven systems to benefit from the advantages of both approaches. Such model-based deep learning methods exploit both partial domain knowledge, via mathematical structures designed for specific problems, as well as learning from limited data. In this article we survey the leading approaches for studying and designing model-based deep learning systems. We divide hybrid model-based/data-driven systems into categories based on their inference mechanism. We provide a comprehensive review of the leading approaches for combining model-based algorithms with deep learning in a systematic manner, along with concrete guidelines and detailed signal processing oriented examples from recent literature. Our aim is to facilitate the design and study of future systems on the intersection of signal processing and machine learning that incorporate the advantages of both domains.
    Nonlinear Independent Component Analysis for Continuous-Time Signals. (arXiv:2102.02876v2 [stat.ML] UPDATED)
    (2 min) We study the classical problem of recovering a multidimensional source process from observations of nonlinear mixtures of this process. Assuming statistical independence of the coordinate processes of the source, we show that this recovery is possible for many popular models of stochastic processes (up to order and monotone scaling of their coordinates) if the mixture is given by a sufficiently differentiable, invertible function. Key to our approach is the combination of tools from stochastic analysis and recent contrastive learning approaches to nonlinear ICA. This yields a scalable method with widely applicable theoretical guarantees for which our experiments indicate good performance.
    Random Forests for dependent data. (arXiv:2007.15421v2 [stat.ML] UPDATED)
    (2 min) Random forest (RF) is one of the most popular methods for estimating regression functions. The local nature of the RF algorithm, based on intra-node means and variances, is ideal when errors are i.i.d. For dependent error processes like time series and spatial settings where data in all the nodes will be correlated, operating locally ignores this dependence. Also, RF will involve resampling of correlated data, violating the principles of bootstrap. Theoretically, consistency of RF has been established for i.i.d. errors, but little is known about the case of dependent errors. We propose RF-GLS, a novel extension of RF for dependent error processes in the same way Generalized Least Squares (GLS) fundamentally extends Ordinary Least Squares (OLS) for linear models under dependence. The key to this extension is the equivalent representation of the local decision-making in a regression tree as a global OLS optimization which is then replaced with a GLS loss to create a GLS-style regression tree. This also synergistically addresses the resampling issue, as the use of GLS loss amounts to resampling uncorrelated contrasts (pre-whitened data) instead of the correlated data. For spatial settings, RF-GLS can be used in conjunction with Gaussian Process correlated errors to generate kriging predictions at new locations. RF becomes a special case of RF-GLS with an identity working covariance matrix. We establish consistency of RF-GLS under beta- (absolutely regular) mixing error processes and show that this general result subsumes important cases like autoregressive time series and spatial Matern Gaussian Processes. As a byproduct, we also establish consistency of RF for beta-mixing processes, which to our knowledge, is the first such result for RF under dependence. We empirically demonstrate the improvement achieved by RF-GLS over RF for both estimation and prediction under dependence.
    Mitigating severe over-parameterization in deep convolutional neural networks through forced feature abstraction and compression with an entropy-based heuristic. (arXiv:2106.14190v1 [cs.CV])
    (2 min) Convolutional Neural Networks (CNNs) such as ResNet-50, DenseNet-40 and ResNeXt-56 are severely over-parameterized, necessitating a consequent increase in the computational resources required for model training which scales exponentially for increments in model depth. In this paper, we propose an Entropy-Based Convolutional Layer Estimation (EBCLE) heuristic which is robust and simple, yet effective in resolving the problem of over-parameterization with regards to network depth of CNN model. The EBCLE heuristic employs a priori knowledge of the entropic data distribution of input datasets to determine an upper bound for convolutional network depth, beyond which identity transformations are prevalent offering insignificant contributions for enhancing model performance. Restricting depth redundancies by forcing feature compression and abstraction restricts over-parameterization while decreasing training time by 24.99% - 78.59% without degradation in model performance. We present empirical evidence to emphasize the relative effectiveness of broader, yet shallower models trained using the EBCLE heuristic, which maintains or outperforms baseline classification accuracies of narrower yet deeper models. The EBCLE heuristic is architecturally agnostic and EBCLE based CNN models restrict depth redundancies resulting in enhanced utilization of the available computational resources. The proposed EBCLE heuristic is a compelling technique for researchers to analytically justify their HyperParameter (HP) choices for CNNs. Empirical validation of the EBCLE heuristic in training CNN models was established on five benchmarking datasets (ImageNet32, CIFAR-10/100, STL-10, MNIST) and four network architectures (DenseNet, ResNet, ResNeXt and EfficientNet B0-B2) with appropriate statistical tests employed to infer any conclusive claims presented in this paper.
    Intrinsically Motivated Self-supervised Learning in Reinforcement Learning. (arXiv:2106.13970v1 [cs.LG])
    (2 min) In vision-based reinforcement learning (RL) tasks, it is prevalent to assign the auxiliary task with a surrogate self-supervised loss so as to obtain more semantic representations and improve sample efficiency. However, abundant information in self-supervised auxiliary tasks has been disregarded, since the representation learning part and the decision-making part are separated. To sufficiently utilize information in the auxiliary task, we present a simple yet effective idea to employ self-supervised loss as an intrinsic reward, called Intrinsically Motivated Self-Supervised learning in Reinforcement learning (IM-SSR). We formally show that the self-supervised loss can be decomposed as exploration for novel states and robustness improvement from nuisance elimination. IM-SSR can be effortlessly plugged into any reinforcement learning with self-supervised auxiliary objectives with nearly no additional cost. Combined with IM-SSR, the previous underlying algorithms achieve salient improvements on both sample efficiency and generalization in various vision-based robotics tasks from the DeepMind Control Suite, especially when the reward signal is sparse.
    Seismic Facies Analysis: A Deep Domain Adaptation Approach. (arXiv:2011.10510v2 [physics.geo-ph] UPDATED)
    (2 min) Deep neural networks (DNNs) can learn accurately from large quantities of labeled input data, but DNNs sometimes fail to generalize to test data sampled from different input distributions. Unsupervised Deep Domain Adaptation (DDA) proves useful when no input labels are available, and distribution shifts are observed in the target domain (TD). Experiments are performed on seismic images of the F3 block 3D dataset from offshore Netherlands (source domain; SD) and Penobscot 3D survey data from Canada (target domain; TD). Three geological classes from SD and TD that have similar reflection patterns are considered. In the present study, an improved deep neural network architecture named EarthAdaptNet (EAN) is proposed to semantically segment the seismic images. We specifically use a transposed residual unit to replace the traditional dilated convolution in the decoder block. The EAN achieved a pixel-level accuracy >84% and an accuracy of ~70% for the minority classes, showing improved performance compared to existing architectures. In addition, we introduced the CORAL (Correlation Alignment) method to the EAN to create an unsupervised deep domain adaptation network (EAN-DDA) for the classification of seismic reflections fromF3 and Penobscot. Maximum class accuracy achieved was ~99% for class 2 of Penobscot with >50% overall accuracy. Taken together, EAN-DDA has the potential to classify target domain seismic facies classes with high accuracy.
    Model-assisted Learning-based Framework for Sensor Fault-Tolerant Building HVAC Control. (arXiv:2106.14144v1 [eess.SY])
    (2 min) As people spend up to 87% of their time indoors, intelligent Heating, Ventilation, and Air Conditioning (HVAC) systems in buildings are essential for maintaining occupant comfort and reducing energy consumption. Those HVAC systems in modern smart buildings rely on real-time sensor readings, which in practice often suffer from various faults and could also be vulnerable to malicious attacks. Such faulty sensor inputs may lead to the violation of indoor environment requirements (e.g., temperature, humidity, etc.) and the increase of energy consumption. While many model-based approaches have been proposed in the literature for building HVAC control, it is costly to develop accurate physical models for ensuring their performance and even more challenging to address the impact of sensor faults. In this work, we present a novel learning-based framework for sensor fault-tolerant HVAC control, which includes three deep learning based components for 1) generating temperature proposals with the consideration of possible sensor faults, 2) selecting one of the proposals based on the assessment of their accuracy, and 3) applying reinforcement learning with the selected temperature proposal. Moreover, to address the challenge of training data insufficiency in building-related tasks, we propose a model-assisted learning method leveraging an abstract model of building physical dynamics. Through extensive numerical experiments, we demonstrate that the proposed fault-tolerant HVAC control framework can significantly reduce building temperature violations under a variety of sensor fault patterns while maintaining energy efficiency.
    The Feasibility and Inevitability of Stealth Attacks. (arXiv:2106.13997v1 [cs.CR])
    (2 min) We develop and study new adversarial perturbations that enable an attacker to gain control over decisions in generic Artificial Intelligence (AI) systems including deep learning neural networks. In contrast to adversarial data modification, the attack mechanism we consider here involves alterations to the AI system itself. Such a stealth attack could be conducted by a mischievous, corrupt or disgruntled member of a software development team. It could also be made by those wishing to exploit a "democratization of AI" agenda, where network architectures and trained parameter sets are shared publicly. Building on work by [Tyukin et al., International Joint Conference on Neural Networks, 2020], we develop a range of new implementable attack strategies with accompanying analysis, showing that with high probability a stealth attack can be made transparent, in the sense that system performance is unchanged on a fixed validation set which is unknown to the attacker, while evoking any desired output on a trigger input of interest. The attacker only needs to have estimates of the size of the validation set and the spread of the AI's relevant latent space. In the case of deep learning neural networks, we show that a one neuron attack is possible - a modification to the weights and bias associated with a single neuron - revealing a vulnerability arising from over-parameterization. We illustrate these concepts in a realistic setting. Guided by the theory and computational results, we also propose strategies to guard against stealth attacks.
    Towards Model-informed Precision Dosing with Expert-in-the-loop Machine Learning. (arXiv:2106.14384v1 [stat.AP])
    (2 min) Machine Learning (ML) and its applications have been transforming our lives but it is also creating issues related to the development of fair, accountable, transparent, and ethical Artificial Intelligence. As the ML models are not fully comprehensible yet, it is obvious that we still need humans to be part of algorithmic decision-making processes. In this paper, we consider a ML framework that may accelerate model learning and improve its interpretability by incorporating human experts into the model learning loop. We propose a novel human-in-the-loop ML framework aimed at dealing with learning problems that the cost of data annotation is high and the lack of appropriate data to model the association between the target tasks and the input features. With an application to precision dosing, our experimental results show that the approach can learn interpretable rules from data and may potentially lower experts' workload by replacing data annotation with rule representation editing. The approach may also help remove algorithmic bias by introducing experts' feedback into the iterative model learning process.
    Unsupervised Skill Discovery with Bottleneck Option Learning. (arXiv:2106.14305v1 [cs.LG])
    (2 min) Having the ability to acquire inherent skills from environments without any external rewards or supervision like humans is an important problem. We propose a novel unsupervised skill discovery method named Information Bottleneck Option Learning (IBOL). On top of the linearization of environments that promotes more various and distant state transitions, IBOL enables the discovery of diverse skills. It provides the abstraction of the skills learned with the information bottleneck framework for the options with improved stability and encouraged disentanglement. We empirically demonstrate that IBOL outperforms multiple state-of-the-art unsupervised skill discovery methods on the information-theoretic evaluations and downstream tasks in MuJoCo environments, including Ant, HalfCheetah, Hopper and D'Kitty.
    Representational aspects of depth and conditioning in normalizing flows. (arXiv:2010.01155v2 [cs.LG] UPDATED)
    (2 min) Normalizing flows are among the most popular paradigms in generative modeling, especially for images, primarily because we can efficiently evaluate the likelihood of a data point. This is desirable both for evaluating the fit of a model, and for ease of training, as maximizing the likelihood can be done by gradient descent. However, training normalizing flows comes with difficulties as well: models which produce good samples typically need to be extremely deep -- which comes with accompanying vanishing/exploding gradient problems. A very related problem is that they are often poorly conditioned: since they are parametrized as invertible maps from $\mathbb{R}^d \to \mathbb{R}^d$, and typical training data like images intuitively is lower-dimensional, the learned maps often have Jacobians that are close to being singular. In our paper, we tackle representational aspects around depth and conditioning of normalizing flows: both for general invertible architectures, and for a particular common architecture, affine couplings. We prove that $\Theta(1)$ affine coupling layers suffice to exactly represent a permutation or $1 \times 1$ convolution, as used in GLOW, showing that representationally the choice of partition is not a bottleneck for depth. We also show that shallow affine coupling networks are universal approximators in Wasserstein distance if ill-conditioning is allowed, and experimentally investigate related phenomena involving padding. Finally, we show a depth lower bound for general flow architectures with few neurons per layer and bounded Lipschitz constant.
    Smart Choices and the Selection Monad. (arXiv:2007.08926v5 [cs.LO] UPDATED)
    (2 min) Describing systems in terms of choices and their resulting costs and rewards offers the promise of freeing algorithm designers and programmers from specifying how those choices should be made; in implementations, the choices can be realized by optimization techniques and,increasingly, by machine-learning methods. We study this approach from a programming-language perspective. We define two small languages that support decision-making abstractions: one with choices and rewards, and the other additionally with probabilities. We give both operational and denotational semantics. In the case of the second language we consider three denotational semantics, with varying degrees of correlation between possible program values and expected rewards. The operational semantics combine the usual semantics of standard constructs with optimization over spaces of possible execution strategies. The denotational semantics, which are compositional rely on the selection monad, to handle choice, augmented with an auxiliary monad to handle other effects, such as rewards or probability. We establish adequacy theorems that the two semantics coincide in all cases. We also prove full abstraction at base types, with varying notions of observation in the probabilistic case corresponding to the various degrees of correlation. We present axioms for choice combined with rewards and probability, establishing completeness at base types for the case of rewards without probability.
    Two Sides of Meta-Learning Evaluation: In vs. Out of Distribution. (arXiv:2102.11503v2 [cs.LG] UPDATED)
    (2 min) We categorize meta-learning evaluation into two settings: $\textit{in-distribution}$ [ID], in which the train and test tasks are sampled $\textit{iid}$ from the same underlying task distribution, and $\textit{out-of-distribution}$ [OOD], in which they are not. While most meta-learning theory and some FSL applications follow the ID setting, we identify that most existing few-shot classification benchmarks instead reflect OOD evaluation, as they use disjoint sets of train (base) and test (novel) classes for task generation. This discrepancy is problematic because -- as we show on numerous benchmarks -- meta-learning methods that perform better on existing OOD datasets may perform significantly worse in the ID setting. In addition, in the OOD setting, even though current FSL benchmarks seem befitting, our study highlights concerns in 1) reliably performing model selection for a given meta-learning method, and 2) consistently comparing the performance of different methods. To address these concerns, we provide suggestions on how to construct FSL benchmarks to allow for ID evaluation as well as more reliable OOD evaluation. Our work aims to inform the meta-learning community about the importance and distinction of ID vs. OOD evaluation, as well as the subtleties of OOD evaluation with current benchmarks.
    Center Smoothing: Provable Robustness for Functions with Metric-Space Outputs. (arXiv:2102.09701v2 [cs.LG] UPDATED)
    (2 min) Randomized smoothing has been successfully applied to classification tasks on high-dimensional inputs, such as images, to obtain models that are provably robust against adversarial perturbations of the input. We extend this technique to produce provable robustness for functions that map inputs into an arbitrary metric space rather than discrete classes. Such functions are used in many machine learning problems like image reconstruction, dimensionality reduction, facial recognition, etc. Our robustness certificates guarantee that the change in the output of the smoothed model as measured by the distance metric remains small for any norm-bounded perturbation of the input. We can certify robustness under a variety of different output metrics, such as total variation distance, Jaccard distance, perceptual metrics, etc. In our experiments, we apply our procedure to create certifiably robust models with disparate output spaces -- from sets to images -- and show that it yields meaningful certificates without significantly degrading the performance of the base model. The code for our experiments is available at: https://github.com/aounon/center-smoothing.
    Learning stochastic object models from medical imaging measurements by use of advanced AmbientGANs. (arXiv:2106.14324v1 [eess.IV])
    (2 min) In order to objectively assess new medical imaging technologies via computer-simulations, it is important to account for all sources of variability that contribute to image data. One important source of variability that can significantly limit observer performance is associated with the variability in the ensemble of objects to-be-imaged. This source of variability can be described by stochastic object models (SOMs), which are generative models that can be employed to sample from a distribution of to-be-virtually-imaged objects. It is generally desirable to establish SOMs from experimental imaging measurements acquired by use of a well-characterized imaging system, but this task has remained challenging. Deep generative neural networks, such as generative adversarial networks (GANs) hold potential for such tasks. To establish SOMs from imaging measurements, an AmbientGAN has been proposed that augments a GAN with a measurement operator. However, the original AmbientGAN could not immediately benefit from modern training procedures and GAN architectures, which limited its ability to be applied to realistically sized medical image data. To circumvent this, in this work, a modified AmbientGAN training strategy is proposed that is suitable for modern progressive or multi-resolution training approaches such as employed in the Progressive Growing of GANs and Style-based GANs. AmbientGANs established by use of the proposed training procedure are systematically validated in a controlled way by use of computer-simulated measurement data corresponding to a stylized imaging system. Finally, emulated single-coil experimental magnetic resonance imaging data are employed to demonstrate the methods under less stylized conditions.
    Benchmarking Differential Privacy and Federated Learning for BERT Models. (arXiv:2106.13973v1 [cs.CL])
    (2 min) Natural Language Processing (NLP) techniques can be applied to help with the diagnosis of medical conditions such as depression, using a collection of a person's utterances. Depression is a serious medical illness that can have adverse effects on how one feels, thinks, and acts, which can lead to emotional and physical problems. Due to the sensitive nature of such data, privacy measures need to be taken for handling and training models with such data. In this work, we study the effects that the application of Differential Privacy (DP) has, in both a centralized and a Federated Learning (FL) setup, on training contextualized language models (BERT, ALBERT, RoBERTa and DistilBERT). We offer insights on how to privately train NLP models and what architectures and setups provide more desirable privacy utility trade-offs. We envisage this work to be used in future healthcare and mental health studies to keep medical history private. Therefore, we provide an open-source implementation of this work.
    On a novel training algorithm for sequence-to-sequence predictive recurrent networks. (arXiv:2106.14120v1 [cs.LG])
    (2 min) Neural networks mapping sequences to sequences (seq2seq) lead to significant progress in machine translation and speech recognition. Their traditional architecture includes two recurrent networks (RNs) followed by a linear predictor. In this manuscript we perform analysis of a corresponding algorithm and show that the parameters of the RNs of the well trained predictive network are not independent of each other. Their dependence can be used to significantly improve the network effectiveness. The traditional seq2seq algorithms require short term memory of a size proportional to the predicted sequence length. This requirement is quite difficult to implement in a neuroscience context. We present a novel memoryless algorithm for seq2seq predictive networks and compare it to the traditional one in the context of time series prediction. We show that the new algorithm is more robust and makes predictions with higher accuracy than the traditional one.
    Spectral-Spatial Graph Reasoning Network for Hyperspectral Image Classification. (arXiv:2106.13952v1 [cs.CV])
    (2 min) In this paper, we propose a spectral-spatial graph reasoning network (SSGRN) for hyperspectral image (HSI) classification. Concretely, this network contains two parts that separately named spatial graph reasoning subnetwork (SAGRN) and spectral graph reasoning subnetwork (SEGRN) to capture the spatial and spectral graph contexts, respectively. Different from the previous approaches implementing superpixel segmentation on the original image or attempting to obtain the category features under the guide of label image, we perform the superpixel segmentation on intermediate features of the network to adaptively produce the homogeneous regions to get the effective descriptors. Then, we adopt a similar idea in spectral part that reasonably aggregating the channels to generate spectral descriptors for spectral graph contexts capturing. All graph reasoning procedures in SAGRN and SEGRN are achieved through graph convolution. To guarantee the global perception ability of the proposed methods, all adjacent matrices in graph reasoning are obtained with the help of non-local self-attention mechanism. At last, by combining the extracted spatial and spectral graph contexts, we obtain the SSGRN to achieve a high accuracy classification. Extensive quantitative and qualitative experiments on three public HSI benchmarks demonstrate the competitiveness of the proposed methods compared with other state-of-the-art approaches.
    Functional Classwise Principal Component Analysis: A Novel Classification Framework. (arXiv:2106.13959v1 [stat.ML])
    (2 min) In recent times, functional data analysis (FDA) has been successfully applied in the field of high dimensional data classification. In this paper, we present a novel classification framework using functional data and classwise Principal Component Analysis (PCA). Our proposed method can be used in high dimensional time series data which typically suffers from small sample size problem. Our method extracts a piece wise linear functional feature space and is particularly suitable for hard classification problems.The proposed framework converts time series data into functional data and uses classwise functional PCA for feature extraction followed by classification using a Bayesian linear classifier. We demonstrate the efficacy of our proposed method by applying it to both synthetic data sets and real time series data from diverse fields including but not limited to neuroscience, food science, medical sciences and chemometrics.
    Continual Learning of Context-dependent Processing in Neural Networks. (arXiv:1810.01256v3 [cs.LG] UPDATED)
    (2 min) Deep neural networks (DNNs) are powerful tools in learning sophisticated but fixed mapping rules between inputs and outputs, thereby limiting their application in more complex and dynamic situations in which the mapping rules are not kept the same but changing according to different contexts. To lift such limits, we developed a novel approach involving a learning algorithm, called orthogonal weights modification (OWM), with the addition of a context-dependent processing (CDP) module. We demonstrated that with OWM to overcome the problem of catastrophic forgetting, and the CDP module to learn how to reuse a feature representation and a classifier for different contexts, a single network can acquire numerous context-dependent mapping rules in an online and continual manner, with as few as $\sim$10 samples to learn each. This should enable highly compact systems to gradually learn myriad regularities of the real world and eventually behave appropriately within it.
    Pairing Conceptual Modeling with Machine Learning. (arXiv:2106.14251v1 [cs.SE])
    (2 min) Both conceptual modeling and machine learning have long been recognized as important areas of research. With the increasing emphasis on digitizing and processing large amounts of data for business and other applications, it would be helpful to consider how these areas of research can complement each other. To understand how they can be paired, we provide an overview of machine learning foundations and development cycle. We then examine how conceptual modeling can be applied to machine learning and propose a framework for incorporating conceptual modeling into data science projects. The framework is illustrated by applying it to a healthcare application. For the inverse pairing, machine learning can impact conceptual modeling through text and rule mining, as well as knowledge graphs. The pairing of conceptual modeling and machine learning in this this way should help lay the foundations for future research.
    Advanced Stationary and Non-Stationary Kernel Designs for Domain-Aware Gaussian Processes. (arXiv:2102.03432v2 [stat.ML] UPDATED)
    (2 min) Gaussian process regression is a widely-applied method for function approximation and uncertainty quantification. The technique has gained popularity recently in the machine learning community due to its robustness and interpretability. The mathematical methods we discuss in this paper are an extension of the Gaussian-process framework. We are proposing advanced kernel designs that only allow for functions with certain desirable characteristics to be elements of the reproducing kernel Hilbert space (RKHS) that underlies all kernel methods and serves as the sample space for Gaussian process regression. These desirable characteristics reflect the underlying physics; two obvious examples are symmetry and periodicity constraints. In addition, non-stationary kernel designs can be defined in the same framework to yield flexible multi-task Gaussian processes. We will show the impact of advanced kernel designs on Gaussian processes using several synthetic and two scientific data sets. The results show that including domain knowledge, communicated through advanced kernel designs, has a significant impact on the accuracy and relevance of the function approximation.
    Going Beyond Saliency Maps: Training Deep Models to Interpret Deep Models. (arXiv:2102.08239v2 [eess.IV] UPDATED)
    (2 min) Interpretability is a critical factor in applying complex deep learning models to advance the understanding of brain disorders in neuroimaging studies. To interpret the decision process of a trained classifier, existing techniques typically rely on saliency maps to quantify the voxel-wise or feature-level importance for classification through partial derivatives. Despite providing some level of localization, these maps are not human-understandable from the neuroscience perspective as they do not inform the specific meaning of the alteration linked to the brain disorder. Inspired by the image-to-image translation scheme, we propose to train simulator networks that can warp a given image to inject or remove patterns of the disease. These networks are trained such that the classifier produces consistently increased or decreased prediction logits for the simulated images. Moreover, we propose to couple all the simulators into a unified model based on conditional convolution. We applied our approach to interpreting classifiers trained on a synthetic dataset and two neuroimaging datasets to visualize the effect of the Alzheimer's disease and alcohol use disorder. Compared to the saliency maps generated by baseline approaches, our simulations and visualizations based on the Jacobian determinants of the warping field reveal meaningful and understandable patterns related to the diseases.
    Low-Precision Training in Logarithmic Number System using Multiplicative Weight Update. (arXiv:2106.13914v1 [cs.LG])
    (2 min) Training large-scale deep neural networks (DNNs) currently requires a significant amount of energy, leading to serious environmental impacts. One promising approach to reduce the energy costs is representing DNNs with low-precision numbers. While it is common to train DNNs with forward and backward propagation in low-precision, training directly over low-precision weights, without keeping a copy of weights in high-precision, still remains to be an unsolved problem. This is due to complex interactions between learning algorithms and low-precision number systems. To address this, we jointly design a low-precision training framework involving a logarithmic number system (LNS) and a multiplicative weight update training method, termed LNS-Madam. LNS has a high dynamic range even in a low-bitwidth setting, leading to high energy efficiency and making it relevant for on-board training in energy-constrained edge devices. We design LNS to have the flexibility of choosing different bases for weights and gradients, as they usually require different quantization gaps and dynamic ranges during training. By drawing the connection between LNS and multiplicative update, LNS-Madam ensures low quantization error during weight update, leading to a stable convergence even if the bitwidth is limited. Compared to using a fixed-point or floating-point number system and training with popular learning algorithms such as SGD and Adam, our joint design with LNS and LNS-Madam optimizer achieves better accuracy while requiring smaller bitwidth. Notably, with only 5-bit for gradients, the proposed training framework achieves accuracy comparable to full-precision state-of-the-art models such as ResNet-50 and BERT. After conducting energy estimations by analyzing the math datapath units during training, the results show that our design achieves over 60x energy reduction compared to FP32 on BERT models.
    Self-Attentive Ensemble Transformer: Representing Ensemble Interactions in Neural Networks for Earth System Models. (arXiv:2106.13924v1 [cs.LG])
    (2 min) Ensemble data from Earth system models has to be calibrated and post-processed. I propose a novel member-by-member post-processing approach with neural networks. I bridge ideas from ensemble data assimilation with self-attention, resulting into the self-attentive ensemble transformer. Here, interactions between ensemble members are represented as additive and dynamic self-attentive part. As proof-of-concept, global ECMWF ensemble forecasts are regressed to 2-metre-temperature fields from the ERA5 reanalysis. I demonstrate that the ensemble transformer can calibrate the ensemble spread and extract additional information from the ensemble. Furthermore, the ensemble transformer directly outputs multivariate and spatially-coherent ensemble members. Therefore, self-attention and the transformer technique can be a missing piece for a member-by-member post-processing of ensemble data with neural networks.
    Adaptive Universal Generalized PageRank Graph Neural Network. (arXiv:2006.07988v4 [cs.LG] UPDATED)
    (2 min) In many important graph data processing applications the acquired information includes both node features and observations of the graph topology. Graph neural networks (GNNs) are designed to exploit both sources of evidence but they do not optimally trade-off their utility and integrate them in a manner that is also universal. Here, universality refers to independence on homophily or heterophily graph assumptions. We address these issues by introducing a new Generalized PageRank (GPR) GNN architecture that adaptively learns the GPR weights so as to jointly optimize node feature and topological information extraction, regardless of the extent to which the node labels are homophilic or heterophilic. Learned GPR weights automatically adjust to the node label pattern, irrelevant on the type of initialization, and thereby guarantee excellent learning performance for label patterns that are usually hard to handle. Furthermore, they allow one to avoid feature over-smoothing, a process which renders feature information nondiscriminative, without requiring the network to be shallow. Our accompanying theoretical analysis of the GPR-GNN method is facilitated by novel synthetic benchmark datasets generated by the so-called contextual stochastic block model. We also compare the performance of our GNN architecture with that of several state-of-the-art GNNs on the problem of node-classification, using well-known benchmark homophilic and heterophilic datasets. The results demonstrate that GPR-GNN offers significant performance improvement compared to existing techniques on both synthetic and benchmark data.
    Anomaly Detection for Aggregated Data Using Multi-Graph Autoencoder. (arXiv:2101.04053v3 [cs.LG] UPDATED)
    (2 min) In data systems, activities or events are continuously collected in the field to trace their proper executions. Logging, which means recording sequences of events, can be used for analyzing system failures and malfunctions, and identifying the causes and locations of such issues. In our research we focus on creating an Anomaly detection models for system logs. The task of anomaly detection is identifying unexpected events in dataset, which differ from the normal behavior. Anomaly detection models also assist in data systems analysis tasks. Modern systems may produce such a large amount of events monitoring every individual event is not feasible. In such cases, the events are often aggregated over a fixed period of time, reporting the number of times every event has occurred in that time period. This aggregation facilitates scaling, but requires a different approach for anomaly detection. In this research, we present a thorough analysis of the aggregated data and the relationships between aggregated events. Based on the initial phase of our research we present graphs representations of our aggregated dataset, which represent the different relationships between aggregated instances in the same context. Using the graph representation, we propose Multiple-graphs autoencoder MGAE, a novel convolutional graphs-autoencoder model which exploits the relationships of the aggregated instances in our unique dataset. MGAE outperforms standard graph-autoencoder models and the different experiments. With our novel MGAE we present 60% decrease in reconstruction error in comparison to standard graph autoencoder, which is expressed in reconstructing high-degree relationships.
    AI based Presentation Creator With Customized Audio Content Delivery. (arXiv:2106.14213v1 [cs.LG])
    (3 min) In this paper, we propose an architecture to solve a novel problem statement that has stemmed more so in recent times with an increase in demand for virtual content delivery due to the COVID-19 pandemic. All educational institutions, workplaces, research centers, etc. are trying to bridge the gap of communication during these socially distanced times with the use of online content delivery. The trend now is to create presentations, and then subsequently deliver the same using various virtual meeting platforms. The time being spent in such creation of presentations and delivering is what we try to reduce and eliminate through this paper which aims to use Machine Learning (ML) algorithms and Natural Language Processing (NLP) modules to automate the process of creating a slides-based presentation from a document, and then use state-of-the-art voice cloning models to deliver the content in the desired author's voice. We consider a structured document such as a research paper to be the content that has to be presented. The research paper is first summarized using BERT summarization techniques and condensed into bullet points that go into the slides. Tacotron inspired architecture with Encoder, Synthesizer, and a Generative Adversarial Network (GAN) based vocoder, is used to convey the contents of the slides in the author's voice (or any customized voice). Almost all learning has now been shifted to online mode, and professionals are now working from the comfort of their homes. Due to the current situation, teachers and professionals have shifted to presentations to help them in imparting information. In this paper, we aim to reduce the considerable amount of time that is taken in creating a presentation by automating this process and subsequently delivering this presentation in a customized voice, using a content delivery mechanism that can clone any voice using a short audio clip.
    Concentration of Contractive Stochastic Approximation and Reinforcement Learning. (arXiv:2106.14308v1 [cs.LG])
    (2 min) Using a martingale concentration inequality, concentration bounds `from time $n_0$ on' are derived for stochastic approximation algorithms with contractive maps and both martingale difference and Markov noises. These are applied to reinforcement learning algorithms, in particular to asynchronous Q-learning and TD(0).
    POLAR: A Polynomial Arithmetic Framework for Verifying Neural-Network Controlled Systems. (arXiv:2106.13867v1 [eess.SY])
    (2 min) We propose POLAR, a \textbf{pol}ynomial \textbf{ar}ithmetic framework that leverages polynomial overapproximations with interval remainders for bounded-time reachability analysis of neural network-controlled systems (NNCSs). Compared with existing arithmetic approaches that use standard Taylor models, our framework uses a novel approach to iteratively overapproximate the neuron output ranges layer-by-layer with a combination of Bernstein polynomial interpolation for continuous activation functions and Taylor model arithmetic for the other operations. This approach can overcome the main drawback in the standard Taylor model arithmetic, i.e. its inability to handle functions that cannot be well approximated by Taylor polynomials, and significantly improve the accuracy and efficiency of reachable states computation for NNCSs. To further tighten the overapproximation, our method keeps the Taylor model remainders symbolic under the linear mappings when estimating the output range of a neural network. We show that POLAR can be seamlessly integrated with existing Taylor model flowpipe construction techniques, and demonstrate that POLAR significantly outperforms the current state-of-the-art techniques on a suite of benchmarks.
    Statistical Query Algorithms and Low-Degree Tests Are Almost Equivalent. (arXiv:2009.06107v3 [cs.CC] UPDATED)
    (2 min) Researchers currently use a number of approaches to predict and substantiate information-computation gaps in high-dimensional statistical estimation problems. A prominent approach is to characterize the limits of restricted models of computation, which on the one hand yields strong computational lower bounds for powerful classes of algorithms and on the other hand helps guide the development of efficient algorithms. In this paper, we study two of the most popular restricted computational models, the statistical query framework and low-degree polynomials, in the context of high-dimensional hypothesis testing. Our main result is that under mild conditions on the testing problem, the two classes of algorithms are essentially equivalent in power. As corollaries, we obtain new statistical query lower bounds for sparse PCA, tensor PCA and several variants of the planted clique problem.
    Closed-form Continuous-Depth Models. (arXiv:2106.13898v1 [cs.LG])
    (2 min) Continuous-depth neural models, where the derivative of the model's hidden state is defined by a neural network, have enabled strong sequential data processing capabilities. However, these models rely on advanced numerical differential equation (DE) solvers resulting in a significant overhead both in terms of computational cost and model complexity. In this paper, we present a new family of models, termed Closed-form Continuous-depth (CfC) networks, that are simple to describe and at least one order of magnitude faster while exhibiting equally strong modeling abilities compared to their ODE-based counterparts. The models are hereby derived from the analytical closed-form solution of an expressive subset of time-continuous models, thus alleviating the need for complex DE solvers all together. In our experimental evaluations, we demonstrate that CfC networks outperform advanced, recurrent models over a diverse set of time-series prediction tasks, including those with long-term dependencies and irregularly sampled data. We believe our findings open new opportunities to train and deploy rich, continuous neural models in resource-constrained settings, which demand both performance and efficiency.
    Sparse recovery by reduced variance stochastic approximation. (arXiv:2006.06365v2 [stat.ML] UPDATED)
    (2 min) In this paper, we discuss application of iterative Stochastic Optimization routines to the problem of sparse signal recovery from noisy observation. Using Stochastic Mirror Descent algorithm as a building block, we develop a multistage procedure for recovery of sparse solutions to Stochastic Optimization problem under assumption of smoothness and quadratic minoration on the expected objective. An interesting feature of the proposed algorithm is linear convergence of the approximate solution during the preliminary phase of the routine when the component of stochastic error in the gradient observation which is due to bad initial approximation of the optimal solution is larger than the "ideal" asymptotic error component owing to observation noise "at the optimal solution." We also show how one can straightforwardly enhance reliability of the corresponding solution by using Median-of-Means like techniques. We illustrate the performance of the proposed algorithms in application to classical problems of recovery of sparse and low rank signals in linear regression framework. We show, under rather weak assumption on the regressor and noise distributions, how they lead to parameter estimates which obey (up to factors which are logarithmic in problem dimension and confidence level) the best known to us accuracy bounds.
    PGST: a Polyglot Gender Style Transfer method. (arXiv:2009.01040v2 [cs.CL] UPDATED)
    (2 min) Recent developments in Text Style Transfer have led this field to be more highlighted than ever. The task of transferring an input's style to another is accompanied by plenty of challenges (e.g., fluency and content preservation) that need to be taken care of. In this research, we introduce PGST, a novel polyglot text style transfer approach in the gender domain, composed of different constitutive elements. In contrast to prior studies, it is feasible to apply a style transfer method in multiple languages by fulfilling our method's predefined elements. We have proceeded with a pre-trained word embedding for token replacement purposes, a character-based token classifier for gender exchange purposes, and a beam search algorithm for extracting the most fluent combination. Since different approaches are introduced in our research, we determine a trade-off value for evaluating different models' success in faking our gender identification model with transferred text. To demonstrate our method's multilingual applicability, we applied our method on both English and Persian corpora and ended up defeating our proposed gender identification model by 45.6% and 39.2%, respectively. While this research's focus is not limited to a specific language, our obtained evaluation results are highly competitive in an analogy among English state of the art methods.
    A Neural-symbolic Approach for Ontology-mediated Query Answering. (arXiv:2106.14052v1 [cs.AI])
    (2 min) Recently, low-dimensional vector space representations of knowledge graphs (KGs) have been applied to find answers to conjunctive queries (CQs) over incomplete KGs. However, the current methods only focus on inductive reasoning, i.e. answering CQs by predicting facts based on patterns learned from the data, and lack the ability of deductive reasoning by applying external domain knowledge. Such (expert or commonsense) domain knowledge is an invaluable resource which can be used to advance machine intelligence. To address this shortcoming, we introduce a neural-symbolic method for ontology-mediated CQ answering over incomplete KGs that operates in the embedding space. More specifically, we propose various data augmentation strategies to generate training queries using query-rewriting based methods and then exploit a novel loss function for training the model. The experimental results demonstrate the effectiveness of our training strategies and the new loss function, i.e., our method significantly outperforms the baseline in the settings that require both inductive and deductive reasoning.
    The Role of Contextual Information in Best Arm Identification. (arXiv:2106.14077v1 [cs.LG])
    (2 min) We study the best-arm identification problem with fixed confidence when contextual (covariate) information is available in stochastic bandits. Although we can use contextual information in each round, we are interested in the marginalized mean reward over the contextual distribution. Our goal is to identify the best arm with a minimal number of samplings under a given value of the error rate. We show the instance-specific sample complexity lower bounds for the problem. Then, we propose a context-aware version of the "Track-and-Stop" strategy, wherein the proportion of the arm draws tracks the set of optimal allocations and prove that the expected number of arm draws matches the lower bound asymptotically. We demonstrate that contextual information can be used to improve the efficiency of the identification of the best marginalized mean reward compared with the results of Garivier & Kaufmann (2016). We experimentally confirm that context information contributes to faster best-arm identification.
    Autonomous Deep Quality Monitoring in Streaming Environments. (arXiv:2106.13955v1 [cs.LG])
    (2 min) The common practice of quality monitoring in industry relies on manual inspection well-known to be slow, error-prone and operator-dependent. This issue raises strong demand for automated real-time quality monitoring developed from data-driven approaches thus alleviating from operator dependence and adapting to various process uncertainties. Nonetheless, current approaches do not take into account the streaming nature of sensory information while relying heavily on hand-crafted features making them application-specific. This paper proposes the online quality monitoring methodology developed from recently developed deep learning algorithms for data streams, Neural Networks with Dynamically Evolved Capacity (NADINE), namely NADINE++. It features the integration of 1-D and 2-D convolutional layers to extract natural features of time-series and visual data streams captured from sensors and cameras of the injection molding machines from our own project. Real-time experiments have been conducted where the online quality monitoring task is simulated on the fly under the prequential test-then-train fashion - the prominent data stream evaluation protocol. Comparison with the state-of-the-art techniques clearly exhibits the advantage of NADINE++ with 4.68\% improvement on average for the quality monitoring task in streaming environments. To support the reproducible research initiative, codes, results of NADINE++ along with supplementary materials and injection molding dataset are made available in \url{https://github.com/ContinualAL/NADINE-IJCNN2021}.
    Hierarchical Large-scale Graph Similarity Computation via Graph Coarsening and Matching. (arXiv:2005.07115v5 [cs.LG] UPDATED)
    (3 min) In this work, we focus on large graph similarity computation problem and propose a novel "embedding-coarsening-matching" learning framework, which outperforms state-of-the-art methods in this task and has significant improvement in time efficiency. Graph similarity computation for metrics such as Graph Edit Distance (GED) is typically NP-hard, and existing heuristics-based algorithms usually achieves a unsatisfactory trade-off between accuracy and efficiency. Recently the development of deep learning techniques provides a promising solution for this problem by a data-driven approach which trains a network to encode graphs to their own feature vectors and computes similarity based on feature vectors. These deep-learning methods can be classified to two categories, embedding models and matching models. Embedding models such as GCN-Mean and GCN-Max, which directly map graphs to respective feature vectors, run faster but the performance is usually poor due to the lack of interactions across graphs. Matching models such as GMN, whose encoding process involves interaction across the two graphs, are more accurate but interaction between whole graphs brings a significant increase in time consumption (at least quadratic time complexity over number of nodes). Inspired by large biological molecular identification where the whole molecular is first mapped to functional groups and then identified based on these functional groups, our "embedding-coarsening-matching" learning framework first embeds and coarsens large graphs to coarsened graphs with denser local topology and then matching mechanism is deployed on the coarsened graphs for the final similarity scores. Detailed experiments have been conducted and the results demonstrate the efficiency and effectiveness of our proposed framework.
    Semantic Labeling of Large-Area Geographic Regions Using Multi-View and Multi-Date Satellite Images and Noisy OSM Training Labels. (arXiv:2008.10271v5 [cs.CV] UPDATED)
    (3 min) We present a novel multi-view training framework and CNN architecture for combining information from multiple overlapping satellite images and noisy training labels derived from OpenStreetMap (OSM) to semantically label buildings and roads across large geographic regions (100 km$^2$). Our approach to multi-view semantic segmentation yields a 4-7% improvement in the per-class IoU scores compared to the traditional approaches that use the views independently of one another. A unique (and, perhaps, surprising) property of our system is that modifications that are added to the tail-end of the CNN for learning from the multi-view data can be discarded at the time of inference with a relatively small penalty in the overall performance. This implies that the benefits of training using multiple views are absorbed by all the layers of the network. Additionally, our approach only adds a small overhead in terms of the GPU-memory consumption even when training with as many as 32 views per scene. The system we present is end-to-end automated, which facilitates comparing the classifiers trained directly on true orthophotos vis-a-vis first training them on the off-nadir images and subsequently translating the predicted labels to geographical coordinates. With no human supervision, our IoU scores for the buildings and roads classes are 0.8 and 0.64 respectively which are better than state-of-the-art approaches that use OSM labels and that are not completely automated.
    Reducing numerical precision preserves classification accuracy in Mondrian Forests. (arXiv:2106.14340v1 [cs.LG])
    (2 min) Mondrian Forests are a powerful data stream classification method, but their large memory footprint makes them ill-suited for low-resource platforms such as connected objects. We explored using reduced-precision floating-point representations to lower memory consumption and evaluated its effect on classification performance. We applied the Mondrian Forest implementation provided by OrpailleCC, a C++ collection of data stream algorithms, to two canonical datasets in human activity recognition: Recofit and Banos \emph{et al}. Results show that the precision of floating-point values used by tree nodes can be reduced from 64 bits to 8 bits with no significant difference in F1 score. In some cases, reduced precision was shown to improve classification performance, presumably due to its regularization effect. We conclude that numerical precision is a relevant hyperparameter in the Mondrian Forest, and that commonly-used double precision values may not be necessary for optimal performance. Future work will evaluate the generalizability of these findings to other data stream classifiers.
    We Should at Least Be Able to Design Molecules That Dock Well. (arXiv:2006.16955v4 [q-bio.BM] UPDATED)
    (2 min) Designing compounds with desired properties is a key element of the drug discovery process. However, measuring progress in the field has been challenging due to the lack of realistic retrospective benchmarks, and the large cost of prospective validation. To close this gap, we propose a benchmark based on docking, a popular computational method for assessing molecule binding to a protein. Concretely, the goal is to generate drug-like molecules that are scored highly by SMINA, a popular docking software. We observe that popular graph-based generative models fail to generate molecules with a high docking score when trained using a realistically sized training set. This suggests a limitation of the current incarnation of models for de novo drug design. Finally, we propose a simplified version of the benchmark based on a simpler scoring function, and show that the tested models are able to partially solve it. We release the benchmark as an easy to use package available at https://github.com/cieplinski-tobiasz/smina-docking-benchmark. We hope that our benchmark will serve as a stepping stone towards the goal of automatically generating promising drug candidates.
    The mbsts package: Multivariate Bayesian Structural Time Series Models in R. (arXiv:2106.14045v1 [stat.ME])
    (2 min) The multivariate Bayesian structural time series (MBSTS) model \citep{qiu2018multivariate,Jammalamadaka2019Predicting} as a generalized version of many structural time series models, deals with inference and prediction for multiple correlated time series, where one also has the choice of using a different candidate pool of contemporaneous predictors for each target series. The MBSTS model has wide applications and is ideal for feature selection, time series forecasting, nowcasting, inferring causal impact, and others. This paper demonstrates how to use the R package \pkg{mbsts} for MBSTS modeling, establishing a bridge between user-friendly and developer-friendly functions in package and the corresponding methodology. A simulated dataset and object-oriented functions in the \pkg{mbsts} package are explained in the way that enables users to flexibly add or deduct some components, as well as to simplify or complicate some settings.
    High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails. (arXiv:2106.14343v1 [cs.LG])
    (2 min) We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails. We show that a combination of gradient clipping, momentum, and normalized gradient descent yields convergence to critical points in high-probability with best-known rates for smooth losses when the gradients only have bounded $\mathfrak{p}$th moments for some $\mathfrak{p}\in(1,2]$. We then consider the case of second-order smooth losses, which to our knowledge have not been studied in this setting, and again obtain high-probability bounds for any $\mathfrak{p}$. Moreover, our results hold for arbitrary smooth norms, in contrast to the typical SGD analysis which requires a Hilbert space norm. Further, we show that after a suitable "burn-in" period, the objective value will monotonically decrease for every iteration until a critical point is identified, which provides intuition behind the popular practice of learning rate "warm-up" and also yields a last-iterate guarantee.
    Truncated Gaussian-Mixture Variational AutoEncoder. (arXiv:1902.03717v3 [cs.LG] UPDATED)
    (2 min) Variation Autoencoder (VAE) has become a powerful tool in modeling the non-linear generative process of data from a low-dimensional latent space. Recently, several studies have proposed to use VAE for unsupervised clustering by using mixture models to capture the multi-modal structure of latent representations. This strategy, however, is ineffective when there are outlier data samples whose latent representations are meaningless, yet contaminating the estimation of key major clusters in the latent space. This exact problem arises in the context of resting-state fMRI (rs-fMRI) analysis, where clustering major functional connectivity patterns is often hindered by heavy noise of rs-fMRI and many minor clusters (rare connectivity patterns) of no interest to analysis. In this paper we propose a novel generative process, in which we use a Gaussian-mixture to model a few major clusters in the data, and use a non-informative uniform distribution to capture the remaining data. We embed this truncated Gaussian-Mixture model in a Variational AutoEncoder framework to obtain a general joint clustering and outlier detection approach, called tGM-VAE. We demonstrated the applicability of tGM-VAE on the MNIST dataset and further validated it in the context of rs-fMRI connectivity analysis.
    Revelio: ML-Generated Debugging Queries for Distributed Systems. (arXiv:2106.14347v1 [cs.DC])
    (2 min) A major difficulty in debugging distributed systems lies in manually determining which of the many available debugging tools to use and how to query its logs. Our own study of a production debugging workflow confirms the magnitude of this burden. This paper explores whether a machine-learning model can assist developers in distributed systems debugging. We present Revelio, a debugging assistant which takes user reports and system logs as input, and outputs debugging queries that developers can use to find a bug's root cause. The key challenges lie in (1) combining inputs of different types (e.g., natural language reports and quantitative logs) and (2) generalizing to unseen faults. Revelio addresses these by employing deep neural networks to uniformly embed diverse input sources and potential queries into a high-dimensional vector space. In addition, it exploits observations from production systems to factorize query generation into two computationally and statistically simpler learning tasks. To evaluate Revelio, we built a testbed with multiple distributed applications and debugging tools. By injecting faults and training on logs and reports from 800 Mechanical Turkers, we show that Revelio includes the most helpful query in its predicted list of top-3 relevant queries 96% of the time. Our developer study confirms the utility of Revelio.
    Two-stage framework for short-term wind power forecasting using different feature-learning models. (arXiv:2006.00413v2 [eess.SP] UPDATED)
    (2 min) Two-stage ensemble-based forecasting methods have been studied extensively in the wind power forecasting field. However, deep learning-based wind power forecasting studies have not investigated two aspects. In the first stage, different learning structures considering multiple inputs and multiple outputs have not been discussed. In the second stage, the model extrapolation issue has not been investigated. Therefore, we develop four deep neural networks for the first stage to learn data features considering the input-and-output structure. We then explore the model extrapolation issue in the second stage using different modeling methods. Considering the overfitting issue, we propose a new moving window-based algorithm using a validation set in the first stage to update the training data in both stages with two different moving window processes.Experiments were conducted at three wind farms, and the results demonstrate that the model with single input multiple output structure obtains better forecasting accuracy compared to existing models. In addition, the ridge regression method results in a better ensemble model that can further improve forecasting accuracy compared to existing machine learning methods. Finally, the proposed two-stage forecasting algorithm can generate more accurate and stable results than existing algorithms.
    Midpoint Regularization: from High Uncertainty Training to Conservative Classification. (arXiv:2106.13913v1 [cs.LG])
    (2 min) Label Smoothing (LS) improves model generalization through penalizing models from generating overconfident output distributions. For each training sample the LS strategy smooths the one-hot encoded training signal by distributing its distribution mass over the non-ground truth classes. We extend this technique by considering example pairs, coined PLS. PLS first creates midpoint samples by averaging random sample pairs and then learns a smoothing distribution during training for each of these midpoint samples, resulting in midpoints with high uncertainty labels for training. We empirically show that PLS significantly outperforms LS, achieving up to 30% of relative classification error reduction. We also visualize that PLS produces very low winning softmax scores for both in and out of distribution samples.
    Locally Private k-Means Clustering. (arXiv:1907.02513v2 [cs.LG] UPDATED)
    (2 min) We design a new algorithm for the Euclidean $k$-means problem that operates in the local model of differential privacy. Unlike in the non-private literature, differentially private algorithms for the $k$-means objective incur both additive and multiplicative errors. Our algorithm significantly reduces the additive error while keeping the multiplicative error the same as in previous state-of-the-art results. Specifically, on a database of size $n$, our algorithm guarantees $O(1)$ multiplicative error and $\approx n^{1/2+a}$ additive error for an arbitrarily small constant $a>0$. All previous algorithms in the local model had additive error $\approx n^{2/3+a}$. Our techniques extend to $k$-median clustering. We show that the additive error we obtain is almost optimal in terms of its dependency on the database size $n$. Specifically, we give a simple lower bound showing that every locally-private algorithm for the $k$-means objective must have additive error at least $\approx\sqrt{n}$.
    Semi-Supervised Deep Ensembles for Blind Image Quality Assessment. (arXiv:2106.14008v1 [cs.CV])
    (2 min) Ensemble methods are generally regarded to be better than a single model if the base learners are deemed to be "accurate" and "diverse." Here we investigate a semi-supervised ensemble learning strategy to produce generalizable blind image quality assessment models. We train a multi-head convolutional network for quality prediction by maximizing the accuracy of the ensemble (as well as the base learners) on labeled data, and the disagreement (i.e., diversity) among them on unlabeled data, both implemented by the fidelity loss. We conduct extensive experiments to demonstrate the advantages of employing unlabeled data for BIQA, especially in model generalization and failure identification.
    Residual Moment Loss for Medical Image Segmentation. (arXiv:2106.14178v1 [eess.IV])
    (2 min) Location information is proven to benefit the deep learning models on capturing the manifold structure of target objects, and accordingly boosts the accuracy of medical image segmentation. However, most existing methods encode the location information in an implicit way, e.g. the distance transform maps, which describe the relative distance from each pixel to the contour boundary, for the network to learn. These implicit approaches do not fully exploit the position information (i.e. absolute location) of targets. In this paper, we propose a novel loss function, namely residual moment (RM) loss, to explicitly embed the location information of segmentation targets during the training of deep learning networks. Particularly, motivated by image moments, the segmentation prediction map and ground-truth map are weighted by coordinate information. Then our RM loss encourages the networks to maintain the consistency between the two weighted maps, which promotes the segmentation networks to easily locate the targets and extract manifold-structure-related features. We validate the proposed RM loss by conducting extensive experiments on two publicly available datasets, i.e., 2D optic cup and disk segmentation and 3D left atrial segmentation. The experimental results demonstrate the effectiveness of our RM loss, which significantly boosts the accuracy of segmentation networks.
    Compositional Reinforcement Learning from Logical Specifications. (arXiv:2106.13906v1 [cs.LG])
    (2 min) We study the problem of learning control policies for complex tasks given by logical specifications. Recent approaches automatically generate a reward function from a given specification and use a suitable reinforcement learning algorithm to learn a policy that maximizes the expected reward. These approaches, however, scale poorly to complex tasks that require high-level planning. In this work, we develop a compositional learning approach, called DiRL, that interleaves high-level planning and reinforcement learning. First, DiRL encodes the specification as an abstract graph; intuitively, vertices and edges of the graph correspond to regions of the state space and simpler sub-tasks, respectively. Our approach then incorporates reinforcement learning to learn neural network policies for each edge (sub-task) within a Dijkstra-style planning algorithm to compute a high-level plan in the graph. An evaluation of the proposed approach on a set of challenging control benchmarks with continuous state and action spaces demonstrates that it outperforms state-of-the-art baselines.
    Confounder-Aware Visualization of ConvNets. (arXiv:1907.12727v2 [cs.LG] UPDATED)
    (2 min) With recent advances in deep learning, neuroimaging studies increasingly rely on convolutional networks (ConvNets) to predict diagnosis based on MR images. To gain a better understanding of how a disease impacts the brain, the studies visualize the salience maps of the ConvNet highlighting voxels within the brain majorly contributing to the prediction. However, these salience maps are generally confounded, i.e., some salient regions are more predictive of confounding variables (such as age) than the diagnosis. To avoid such misinterpretation, we propose in this paper an approach that aims to visualize confounder-free saliency maps that only highlight voxels predictive of the diagnosis. The approach incorporates univariate statistical tests to identify confounding effects within the intermediate features learned by ConvNet. The influence from the subset of confounded features is then removed by a novel partial back-propagation procedure. We use this two-step approach to visualize confounder-free saliency maps extracted from synthetic and two real datasets. These experiments reveal the potential of our visualization in producing unbiased model-interpretation.
    Non-Exhaustive Learning Using Gaussian Mixture Generative Adversarial Networks. (arXiv:2106.14344v1 [cs.LG])
    (2 min) Supervised learning, while deployed in real-life scenarios, often encounters instances of unknown classes. Conventional algorithms for training a supervised learning model do not provide an option to detect such instances, so they miss-classify such instances with 100% probability. Open Set Recognition (OSR) and Non-Exhaustive Learning (NEL) are potential solutions to overcome this problem. Most existing methods of OSR first classify members of existing classes and then identify instances of new classes. However, many of the existing methods of OSR only makes a binary decision, i.e., they only identify the existence of the unknown class. Hence, such methods cannot distinguish test instances belonging to incremental unseen classes. On the other hand, the majority of NEL methods often make a parametric assumption over the data distribution, which either fail to return good results, due to the reason that real-life complex datasets may not follow a well-known data distribution. In this paper, we propose a new online non-exhaustive learning model, namely, Non-Exhaustive Gaussian Mixture Generative Adversarial Networks (NE-GM-GAN) to address these issues. Our proposed model synthesizes Gaussian mixture based latent representation over a deep generative model, such as GAN, for incremental detection of instances of emerging classes in the test data. Extensive experimental results on several benchmark datasets show that NE-GM-GAN significantly outperforms the state-of-the-art methods in detecting instances of novel classes in streaming data.
    Deep Learning for Technical Document Classification. (arXiv:2106.14269v1 [cs.LG])
    (2 min) In large technology companies, the requirements for managing and organizing technical documents created by engineers and managers in supporting relevant decision making have increased dramatically in recent years, which has led to a higher demand for more scalable, accurate, and automated document classification. Prior studies have primarily focused on processing text for classification and small-scale databases. This paper describes a novel multimodal deep learning architecture, called TechDoc, for technical document classification, which utilizes both natural language and descriptive images to train hierarchical classifiers. The architecture synthesizes convolutional neural networks and recurrent neural networks through an integrated training process. We applied the architecture to a large multimodal technical document database and trained the model for classifying documents based on the hierarchical International Patent Classification system. Our results show that the trained neural network presents a greater classification accuracy than those using a single modality and several earlier text classification methods. The trained model can potentially be scaled to millions of real-world technical documents with both text and figures, which is useful for data and knowledge management in large technology companies and organizations.
    Regret Analysis in Deterministic Reinforcement Learning. (arXiv:2106.14338v1 [cs.LG])
    (2 min) We consider Markov Decision Processes (MDPs) with deterministic transitions and study the problem of regret minimization, which is central to the analysis and design of optimal learning algorithms. We present logarithmic problem-specific regret lower bounds that explicitly depend on the system parameter (in contrast to previous minimax approaches) and thus, truly quantify the fundamental limit of performance achievable by any learning algorithm. Deterministic MDPs can be interpreted as graphs and analyzed in terms of their cycles, a fact which we leverage in order to identify a class of deterministic MDPs whose regret lower bound can be determined numerically. We further exemplify this result on a deterministic line search problem, and a deterministic MDP with state-dependent rewards, whose regret lower bounds we can state explicitly. These bounds share similarities with the known problem-specific bound of the multi-armed bandit problem and suggest that navigation on a deterministic MDP need not have an effect on the performance of a learning algorithm.
    Feature Combination Meets Attention: Baidu Soccer Embeddings and Transformer based Temporal Detection. (arXiv:2106.14447v1 [cs.CV])
    (2 min) With rapidly evolving internet technologies and emerging tools, sports related videos generated online are increasing at an unprecedentedly fast pace. To automate sports video editing/highlight generation process, a key task is to precisely recognize and locate the events in the long untrimmed videos. In this tech report, we present a two-stage paradigm to detect what and when events happen in soccer broadcast videos. Specifically, we fine-tune multiple action recognition models on soccer data to extract high-level semantic features, and design a transformer based temporal detection module to locate the target events. This approach achieved the state-of-the-art performance in both two tasks, i.e., action spotting and replay grounding, in the SoccerNet-v2 Challenge, under CVPR 2021 ActivityNet workshop. Our soccer embedding features are released at https://github.com/baidu-research/vidpress-sports. By sharing these features with the broader community, we hope to accelerate the research into soccer video understanding.
    Domain Adaptation Broad Learning System Based on Locally Linear Embedding. (arXiv:2106.14367v1 [cs.LG])
    (2 min) Broad learning system (BLS) has been proposed for a few years. It demonstrates an effective learning capability for many classification and regression problems. However, BLS and its improved versions are mainly used to deal with unsupervised, supervised and semi-supervised learning problems in a single domain. As far as we know, a little attention is paid to the cross-domain learning ability of BLS. Therefore, we introduce BLS into the field of transfer learning and propose a novel algorithm called domain adaptation broad learning system based on locally linear embedding (DABLS-LLE). The proposed algorithm can learn a robust classification model by using a small part of labeled data from the target domain and all labeled data from the source domain. The proposed algorithm inherits the computational efficiency and learning capability of BLS. Experiments on benchmark dataset (Office-Caltech-10) verify the effectiveness of our approach. The results show that our approach can get better classification accuracy with less running time than many existing transfer learning approaches. It shows that our approach can bring a new superiority for BLS.
    Introduction to Multi-Armed Bandits. (arXiv:1904.07272v6 [cs.LG] UPDATED)
    (2 min) Multi-armed bandits a simple but very powerful framework for algorithms that make decisions over time under uncertainty. An enormous body of work has accumulated over the years, covered in several books and surveys. This book provides a more introductory, textbook-like treatment of the subject. Each chapter tackles a particular line of work, providing a self-contained, teachable technical introduction and a brief review of the further developments; many of the chapters conclude with exercises. The book is structured as follows. The first four chapters are on IID rewards, from the basic model to impossibility results to Bayesian priors to Lipschitz rewards. The next three chapters cover adversarial rewards, from the full-feedback version to adversarial bandits to extensions with linear rewards and combinatorially structured actions. Chapter 8 is on contextual bandits, a middle ground between IID and adversarial bandits in which the change in reward distributions is completely explained by observable contexts. The last three chapters cover connections to economics, from learning in repeated games to bandits with supply/budget constraints to exploration in the presence of incentives. The appendix provides sufficient background on concentration and KL-divergence. The chapters on "bandits with similarity information", "bandits with knapsacks" and "bandits and agents" can also be consumed as standalone surveys on the respective topics.
    Causal Reinforcement Learning using Observational and Interventional Data. (arXiv:2106.14421v1 [cs.LG])
    (2 min) Learning efficiently a causal model of the environment is a key challenge of model-based RL agents operating in POMDPs. We consider here a scenario where the learning agent has the ability to collect online experiences through direct interactions with the environment (interventional data), but has also access to a large collection of offline experiences, obtained by observing another agent interacting with the environment (observational data). A key ingredient, that makes this situation non-trivial, is that we allow the observed agent to interact with the environment based on hidden information, which is not observed by the learning agent. We then ask the following questions: can the online and offline experiences be safely combined for learning a causal model ? And can we expect the offline experiences to improve the agent's performances ? To answer these questions, we import ideas from the well-established causal framework of do-calculus, and we express model-based reinforcement learning as a causal inference problem. Then, we propose a general yet simple methodology for leveraging offline data during learning. In a nutshell, the method relies on learning a latent-based causal transition model that explains both the interventional and observational regimes, and then using the recovered latent variable to infer the standard POMDP transition model via deconfounding. We prove our method is correct and efficient in the sense that it attains better generalization guarantees due to the offline data (in the asymptotic case), and we illustrate its effectiveness empirically on synthetic toy problems. Our contribution aims at bridging the gap between the fields of reinforcement learning and causality.
    Error analysis for physics informed neural networks (PINNs) approximating Kolmogorov PDEs. (arXiv:2106.14473v1 [math.NA])
    (2 min) Physics informed neural networks approximate solutions of PDEs by minimizing pointwise residuals. We derive rigorous bounds on the error, incurred by PINNs in approximating the solutions of a large class of linear parabolic PDEs, namely Kolmogorov equations that include the heat equation and Black-Scholes equation of option pricing, as examples. We construct neural networks, whose PINN residual (generalization error) can be made as small as desired. We also prove that the total $L^2$-error can be bounded by the generalization error, which in turn is bounded in terms of the training error, provided that a sufficient number of randomly chosen training (collocation) points is used. Moreover, we prove that the size of the PINNs and the number of training samples only grow polynomially with the underlying dimension, enabling PINNs to overcome the curse of dimensionality in this context. These results enable us to provide a comprehensive error analysis for PINNs in approximating Kolmogorov PDEs.
    Co$^2$L: Contrastive Continual Learning. (arXiv:2106.14413v1 [cs.LG])
    (2 min) Recent breakthroughs in self-supervised learning show that such algorithms learn visual representations that can be transferred better to unseen tasks than joint-training methods relying on task-specific supervision. In this paper, we found that the similar holds in the continual learning con-text: contrastively learned representations are more robust against the catastrophic forgetting than jointly trained representations. Based on this novel observation, we propose a rehearsal-based continual learning algorithm that focuses on continually learning and maintaining transferable representations. More specifically, the proposed scheme (1) learns representations using the contrastive learning objective, and (2) preserves learned representations using a self-supervised distillation step. We conduct extensive experimental validations under popular benchmark image classification datasets, where our method sets the new state-of-the-art performance.
    Integrating topic modeling and word embedding to characterize violent deaths. (arXiv:2106.14365v1 [cs.CL])
    (2 min) There is an escalating need for methods to identify latent patterns in text data from many domains. We introduce a new method to identify topics in a corpus and represent documents as topic sequences. Discourse Atom Topic Modeling draws on advances in theoretical machine learning to integrate topic modeling and word embedding, capitalizing on the distinct capabilities of each. We first identify a set of vectors ("discourse atoms") that provide a sparse representation of an embedding space. Atom vectors can be interpreted as latent topics: Through a generative model, atoms map onto distributions over words; one can also infer the topic that generated a sequence of words. We illustrate our method with a prominent example of underutilized text: the U.S. National Violent Death Reporting System (NVDRS). The NVDRS summarizes violent death incidents with structured variables and unstructured narratives. We identify 225 latent topics in the narratives (e.g., preparation for death and physical aggression); many of these topics are not captured by existing structured variables. Motivated by known patterns in suicide and homicide by gender, and recent research on gender biases in semantic space, we identify the gender bias of our topics (e.g., a topic about pain medication is feminine). We then compare the gender bias of topics to their prevalence in narratives of female versus male victims. Results provide a detailed quantitative picture of reporting about lethal violence and its gendered nature. Our method offers a flexible and broadly applicable approach to model topics in text data.
    Continual Learning via Inter-Task Synaptic Mapping. (arXiv:2106.13954v1 [cs.LG])
    (2 min) Learning from streaming tasks leads a model to catastrophically erase unique experiences it absorbs from previous episodes. While regularization techniques such as LWF, SI, EWC have proven themselves as an effective avenue to overcome this issue by constraining important parameters of old tasks from changing when accepting new concepts, these approaches do not exploit common information of each task which can be shared to existing neurons. As a result, they do not scale well to large-scale problems since the parameter importance variables quickly explode. An Inter-Task Synaptic Mapping (ISYANA) is proposed here to underpin knowledge retention for continual learning. ISYANA combines task-to-neuron relationship as well as concept-to-concept relationship such that it prevents a neuron to embrace distinct concepts while merely accepting relevant concept. Numerical study in the benchmark continual learning problems has been carried out followed by comparison against prominent continual learning algorithms. ISYANA exhibits competitive performance compared to state of the arts. Codes of ISYANA is made available in \url{https://github.com/ContinualAL/ISYANAKBS}.
    DGL-LifeSci: An Open-Source Toolkit for Deep Learning on Graphs in Life Science. (arXiv:2106.14232v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) constitute a class of deep learning methods for graph data. They have wide applications in chemistry and biology, such as molecular property prediction, reaction prediction and drug-target interaction prediction. Despite the interest, GNN-based modeling is challenging as it requires graph data pre-processing and modeling in addition to programming and deep learning. Here we present DGL-LifeSci, an open-source package for deep learning on graphs in life science. DGL-LifeSci is a python toolkit based on RDKit, PyTorch and Deep Graph Library (DGL). DGL-LifeSci allows GNN-based modeling on custom datasets for molecular property prediction, reaction prediction and molecule generation. With its command-line interfaces, users can perform modeling without any background in programming and deep learning. We test the command-line interfaces using standard benchmarks MoleculeNet, USPTO, and ZINC. Compared with previous implementations, DGL-LifeSci achieves a speed up by up to 6x. For modeling flexibility, DGL-LifeSci provides well-optimized modules for various stages of the modeling pipeline. In addition, DGL-LifeSci provides pre-trained models for reproducing the test experiment results and applying models without training. The code is distributed under an Apache-2.0 License and is freely accessible at https://github.com/awslabs/dgl-lifesci.
    Learning to solve geometric construction problems from images. (arXiv:2106.14195v1 [cs.CV])
    (2 min) We describe a purely image-based method for finding geometric constructions with a ruler and compass in the Euclidea geometric game. The method is based on adapting the Mask R-CNN state-of-the-art image processing neural architecture and adding a tree-based search procedure to it. In a supervised setting, the method learns to solve all 68 kinds of geometric construction problems from the first six level packs of Euclidea with an average 92% accuracy. When evaluated on new kinds of problems, the method can solve 31 of the 68 kinds of Euclidea problems. We believe that this is the first time that a purely image-based learning has been trained to solve geometric construction problems of this difficulty.
    Time-Series Representation Learning via Temporal and Contextual Contrasting. (arXiv:2106.14112v1 [cs.LG])
    (2 min) Learning decent representations from unlabeled time-series data with temporal dynamics is a very challenging task. In this paper, we propose an unsupervised Time-Series representation learning framework via Temporal and Contextual Contrasting (TS-TCC), to learn time-series representation from unlabeled data. First, the raw time-series data are transformed into two different yet correlated views by using weak and strong augmentations. Second, we propose a novel temporal contrasting module to learn robust temporal representations by designing a tough cross-view prediction task. Last, to further learn discriminative representations, we propose a contextual contrasting module built upon the contexts from the temporal contrasting module. It attempts to maximize the similarity among different contexts of the same sample while minimizing similarity among contexts of different samples. Experiments have been carried out on three real-world time-series datasets. The results manifest that training a linear classifier on top of the features learned by our proposed TS-TCC performs comparably with the supervised training. Additionally, our proposed TS-TCC shows high efficiency in few-labeled data and transfer learning scenarios. The code is publicly available at https://github.com/emadeldeen24/TS-TCC.
    Nonparametric estimation of continuous DPPs with kernel methods. (arXiv:2106.14210v1 [cs.LG])
    (2 min) Determinantal Point Process (DPPs) are statistical models for repulsive point patterns. Both sampling and inference are tractable for DPPs, a rare feature among models with negative dependence that explains their popularity in machine learning and spatial statistics. Parametric and nonparametric inference methods have been proposed in the finite case, i.e. when the point patterns live in a finite ground set. In the continuous case, only parametric methods have been investigated, while nonparametric maximum likelihood for DPPs -- an optimization problem over trace-class operators -- has remained an open question. In this paper, we show that a restricted version of this maximum likelihood (MLE) problem falls within the scope of a recent representer theorem for nonnegative functions in an RKHS. This leads to a finite-dimensional problem, with strong statistical ties to the original MLE. Moreover, we propose, analyze, and demonstrate a fixed point algorithm to solve this finite-dimensional problem. Finally, we also provide a controlled estimate of the correlation kernel of the DPP, thus providing more interpretability.
    Recurrently Predicting Hypergraphs. (arXiv:2106.13919v1 [cs.LG])
    (2 min) This work considers predicting the relational structure of a hypergraph for a given set of vertices, as common for applications in particle physics, biological systems and other complex combinatorial problems. A problem arises from the number of possible multi-way relationships, or hyperedges, scaling in $\mathcal{O}(2^n)$ for a set of $n$ elements. Simply storing an indicator tensor for all relationships is already intractable for moderately sized $n$, prompting previous approaches to restrict the number of vertices a hyperedge connects. Instead, we propose a recurrent hypergraph neural network that predicts the incidence matrix by iteratively refining an initial guess of the solution. We leverage the property that most hypergraphs of interest are sparsely connected and reduce the memory requirement to $\mathcal{O}(nk)$, where $k$ is the maximum number of positive edges, i.e., edges that actually exist. In order to counteract the linearly growing memory cost from training a lengthening sequence of refinement steps, we further propose an algorithm that applies backpropagation through time on randomly sampled subsequences. We empirically show that our method can match an increase in the intrinsic complexity without a performance decrease and demonstrate superior performance compared to state-of-the-art models.
    Graph Convolutional Memory for Deep Reinforcement Learning. (arXiv:2106.14117v1 [cs.LG])
    (2 min) Solving partially-observable Markov decision processes (POMDPs) is critical when applying deep reinforcement learning (DRL) to real-world robotics problems, where agents have an incomplete view of the world. We present graph convolutional memory (GCM) for solving POMDPs using deep reinforcement learning. Unlike recurrent neural networks (RNNs) or transformers, GCM embeds domain-specific priors into the memory recall process via a knowledge graph. By encapsulating priors in the graph, GCM adapts to specific tasks but remains applicable to any DRL task. Using graph convolutions, GCM extracts hierarchical graph features, analogous to image features in a convolutional neural network (CNN). We show GCM outperforms long short-term memory (LSTM), gated transformers for reinforcement learning (GTrXL), and differentiable neural computers (DNCs) on control, long-term non-sequential recall, and 3D navigation tasks while using significantly fewer parameters.
    A Machine Learning Model for Early Detection of Diabetic Foot using Thermogram Images. (arXiv:2106.14207v1 [eess.IV])
    (2 min) Diabetes foot ulceration (DFU) and amputation are a cause of significant morbidity. The prevention of DFU may be achieved by the identification of patients at risk of DFU and the institution of preventative measures through education and offloading. Several studies have reported that thermogram images may help to detect an increase in plantar temperature prior to DFU. However, the distribution of plantar temperature may be heterogeneous, making it difficult to quantify and utilize to predict outcomes. We have compared a machine learning-based scoring technique with feature selection and optimization techniques and learning classifiers to several state-of-the-art Convolutional Neural Networks (CNNs) on foot thermogram images and propose a robust solution to identify the diabetic foot. A comparatively shallow CNN model, MobilenetV2 achieved an F1 score of ~95% for a two-feet thermogram image-based classification and the AdaBoost Classifier used 10 features and achieved an F1 score of 97 %. A comparison of the inference time for the best-performing networks confirmed that the proposed algorithm can be deployed as a smartphone application to allow the user to monitor the progression of the DFU in a home setting.
    Deep Learning Image Recognition for Non-images. (arXiv:2106.14350v1 [cs.LG])
    (2 min) Powerful deep learning algorithms open an opportunity for solving non-image Machine Learning (ML) problems by transforming these problems to into the image recognition problems. The CPC-R algorithm presented in this chapter converts non-image data into images by visualizing non-image data. Then deep learning CNN algorithms solve the learning problems on these images. The design of the CPC-R algorithm allows preserving all high-dimensional information in 2-D images. The use of pair values mapping instead of single value mapping used in the alternative approaches allows encoding each n-D point with 2 times fewer visual elements. The attributes of an n-D point are divided into pairs of its values and each pair is visualized as 2-D points in the same 2-D Cartesian coordinates. Next, grey scale or color intensity values are assigned to each pair to encode the order of pairs. This is resulted in the heatmap image. The computational experiments with CPC-R are conducted for different CNN architectures, and methods to optimize the CPC-R images showing that the combined CPC-R and deep learning CNN algorithms are able to solve non-image ML problems reaching high accuracy on the benchmark datasets. This chapter expands our prior work by adding more experiments to test accuracy of classification, exploring saliency and informativeness of discovered features to test their interpretability, and generalizing the approach.
    Legendre Deep Neural Network (LDNN) and its application for approximation of nonlinear Volterra Fredholm Hammerstein integral equations. (arXiv:2106.14320v1 [math.NA])
    (2 min) Various phenomena in biology, physics, and engineering are modeled by differential equations. These differential equations including partial differential equations and ordinary differential equations can be converted and represented as integral equations. In particular, Volterra Fredholm Hammerstein integral equations are the main type of these integral equations and researchers are interested in investigating and solving these equations. In this paper, we propose Legendre Deep Neural Network (LDNN) for solving nonlinear Volterra Fredholm Hammerstein integral equations (VFHIEs). LDNN utilizes Legendre orthogonal polynomials as activation functions of the Deep structure. We present how LDNN can be used to solve nonlinear VFHIEs. We show using the Gaussian quadrature collocation method in combination with LDNN results in a novel numerical solution for nonlinear VFHIEs. Several examples are given to verify the performance and accuracy of LDNN.
    Learning Mesh Representations via Binary Space Partitioning Tree Networks. (arXiv:2106.14274v1 [cs.CV])
    (2 min) Polygonal meshes are ubiquitous, but have only played a relatively minor role in the deep learning revolution. State-of-the-art neural generative models for 3D shapes learn implicit functions and generate meshes via expensive iso-surfacing. We overcome these challenges by employing a classical spatial data structure from computer graphics, Binary Space Partitioning (BSP), to facilitate 3D learning. The core operation of BSP involves recursive subdivision of 3D space to obtain convex sets. By exploiting this property, we devise BSP-Net, a network that learns to represent a 3D shape via convex decomposition without supervision. The network is trained to reconstruct a shape using a set of convexes obtained from a BSP-tree built over a set of planes, where the planes and convexes are both defined by learned network weights. BSP-Net directly outputs polygonal meshes from the inferred convexes. The generated meshes are watertight, compact (i.e., low-poly), and well suited to represent sharp geometry. We show that the reconstruction quality by BSP-Net is competitive with those from state-of-the-art methods while using much fewer primitives. We also explore variations to BSP-Net including using a more generic decoder for reconstruction, more general primitives than planes, as well as training a generative model with variational auto-encoders. Code is available at https://github.com/czq142857/BSP-NET-original.
    How many moments does MMD compare?. (arXiv:2106.14277v1 [cs.LG])
    (2 min) We present a new way of study of Mercer kernels, by corresponding to a special kernel $K$ a pseudo-differential operator $p({\mathbf x}, D)$ such that $\mathcal{F} p({\mathbf x}, D)^\dag p({\mathbf x}, D) \mathcal{F}^{-1}$ acts on smooth functions in the same way as an integral operator associated with $K$ (where $\mathcal{F}$ is the Fourier transform). We show that kernels defined by pseudo-differential operators are able to approximate uniformly any continuous Mercer kernel on a compact set. The symbol $p({\mathbf x}, {\mathbf y})$ encapsulates a lot of useful information about the structure of the Maximum Mean Discrepancy distance defined by the kernel $K$. We approximate $p({\mathbf x}, {\mathbf y})$ with the sum of the first $r$ terms of the Singular Value Decomposition of $p$, denoted by $p_r({\mathbf x}, {\mathbf y})$. If ordered singular values of the integral operator associated with $p({\mathbf x}, {\mathbf y})$ die down rapidly, the MMD distance defined by the new symbol $p_r$ differs from the initial one only slightly. Moreover, the new MMD distance can be interpreted as an aggregated result of comparing $r$ local moments of two probability distributions. The latter results holds under the condition that right singular vectors of the integral operator associated with $p$ are uniformly bounded. But even if this is not satisfied we can still hold that the Hilbert-Schmidt distance between $p$ and $p_r$ vanishes. Thus, we report an interesting phenomenon: the MMD distance measures the difference of two probability distributions with respect to a certain number of local moments, $r^\ast$, and this number $r^\ast$ depends on the speed with which singular values of $p$ die down.
    Machine Learning Detection Algorithm for Large Barkhausen Jumps in Cluttered Environment. (arXiv:2106.14148v1 [cs.LG])
    (2 min) Modern magnetic sensor arrays conventionally utilize state of the art low power magnetometers such as parallel and orthogonal fluxgates. Low power fluxgates tend to have large Barkhausen jumps that appear as a dc jump in the fluxgate output. This phenomenon deteriorates the signal fidelity and effectively increases the internal sensor noise. Even if sensors that are more prone to dc jumps can be screened during production, the conventional noise measurement does not always catch the dc jump because of its sparsity. Moreover, dc jumps persist in almost all the sensor cores although at a slower but still intolerable rate. Even if dc jumps can be easily detected in a shielded environment, when deployed in presence of natural noise and clutter, it can be hard to positively detect them. This work fills this gap and presents algorithms that distinguish dc jumps embedded in natural magnetic field data. To improve robustness to noise, we developed two machine learning algorithms that employ temporal and statistical physical-based features of a pre-acquired and well-known experimental data set. The first algorithm employs a support vector machine classifier, while the second is based on a neural network architecture. We compare these new approaches to a more classical kernel-based method. To that purpose, the receiver operating characteristic curve is generated, which allows diagnosis ability of the different classifiers by comparing their performances across various operation points. The accuracy of the machine learning-based algorithms over the classic method is highly emphasized. In addition, high generalization and robustness of the neural network can be concluded, based on the rapid convergence of the corresponding receiver operating characteristic curves.
    AdaptCL: Efficient Collaborative Learning with Dynamic and Adaptive Pruning. (arXiv:2106.14126v1 [cs.LG])
    (2 min) In multi-party collaborative learning, the parameter server sends a global model to each data holder for local training and then aggregates committed models globally to achieve privacy protection. However, both the dragger issue of synchronous collaborative learning and the staleness issue of asynchronous collaborative learning make collaborative learning inefficient in real-world heterogeneous environments. We propose a novel and efficient collaborative learning framework named AdaptCL, which generates an adaptive sub-model dynamically from the global base model for each data holder, without any prior information about worker capability. All workers (data holders) achieve approximately identical update time as the fastest worker by equipping them with capability-adapted pruned models. Thus the training process can be dramatically accelerated. Besides, we tailor the efficient pruned rate learning algorithm and pruning approach for AdaptCL. Meanwhile, AdaptCL provides a mechanism for handling the trade-off between accuracy and time overhead and can be combined with other techniques to accelerate training further. Empirical results show that AdaptCL introduces little computing and communication overhead. AdaptCL achieves time savings of more than 41\% on average and improves accuracy in a low heterogeneous environment. In a highly heterogeneous environment, AdaptCL achieves a training speedup of 6.2x with a slight loss of accuracy.
    Global Convergence of Gradient Descent for Asymmetric Low-Rank Matrix Factorization. (arXiv:2106.14289v1 [math.OC])
    (2 min) We study the asymmetric low-rank factorization problem: \[\min_{\mathbf{U} \in \mathbb{R}^{m \times d}, \mathbf{V} \in \mathbb{R}^{n \times d}} \frac{1}{2}\|\mathbf{U}\mathbf{V}^\top -\mathbf{\Sigma}\|_F^2\] where $\mathbf{\Sigma}$ is a given matrix of size $m \times n$ and rank $d$. This is a canonical problem that admits two difficulties in optimization: 1) non-convexity and 2) non-smoothness (due to unbalancedness of $\mathbf{U}$ and $\mathbf{V}$). This is also a prototype for more complex problems such as asymmetric matrix sensing and matrix completion. Despite being non-convex and non-smooth, it has been observed empirically that the randomly initialized gradient descent algorithm can solve this problem in polynomial time. Existing theories to explain this phenomenon all require artificial modifications of the algorithm, such as adding noise in each iteration and adding a balancing regularizer to balance the $\mathbf{U}$ and $\mathbf{V}$. This paper presents the first proof that shows randomly initialized gradient descent converges to a global minimum of the asymmetric low-rank factorization problem with a polynomial rate. For the proof, we develop 1) a new symmetrization technique to capture the magnitudes of the symmetry and asymmetry, and 2) a quantitative perturbation analysis to approximate matrix derivatives. We believe both are useful for other related non-convex problems.
    Hyperbolic Busemann Learning with Ideal Prototypes. (arXiv:2106.14472v1 [cs.LG])
    (2 min) Hyperbolic space has become a popular choice of manifold for representation learning of arbitrary data, from tree-like structures and text to graphs. Building on the success of deep learning with prototypes in Euclidean and hyperspherical spaces, a few recent works have proposed hyperbolic prototypes for classification. Such approaches enable effective learning in low-dimensional output spaces and can exploit hierarchical relations amongst classes, but require privileged information about class labels to position the hyperbolic prototypes. In this work, we propose Hyperbolic Busemann Learning. The main idea behind our approach is to position prototypes on the ideal boundary of the Poincare ball, which does not require prior label knowledge. To be able to compute proximities to ideal prototypes, we introduce the penalised Busemann loss. We provide theory supporting the use of ideal prototypes and the proposed loss by proving its equivalence to logistic regression in the one-dimensional case. Empirically, we show that our approach provides a natural interpretation of classification confidence, while outperforming recent hyperspherical and hyperbolic prototype approaches.
    An XAI Approach to Deep Learning Models in the Detection of Ductal Carcinoma in Situ. (arXiv:2106.14186v1 [eess.IV])
    (2 min) During the last decade or so, there has been an insurgence in the deep learning community to solve health-related issues, particularly breast cancer. Following the Camelyon-16 challenge in 2016, several researchers have dedicated their time to build Convolutional Neural Networks (CNNs) to help radiologists and other clinicians diagnose breast cancer. In particular, there has been an emphasis on Ductal Carcinoma in Situ (DCIS); the clinical term for early-stage breast cancer. Large companies have given their fair share of research into this subject, among these Google Deepmind who developed a model in 2020 that has proven to be better than radiologists themselves to diagnose breast cancer correctly. We found that among the issues which exist, there is a need for an explanatory system that goes through the hidden layers of a CNN to highlight those pixels that contributed to the classification of a mammogram. We then chose an open-source, reasonably successful project developed by Prof. Shen, using the CBIS-DDSM image database to run our experiments on. It was later improved using the Resnet-50 and VGG-16 patch-classifiers, analytically comparing the outcome of both. The results showed that the Resnet-50 one converged earlier in the experiments. Following the research by Montavon and Binder, we used the DeepTaylor Layer-wise Relevance Propagation (LRP) model to highlight those pixels and regions within a mammogram which contribute most to its classification. This is represented as a map of those pixels in the original image, which contribute to the diagnosis and the extent to which they contribute to the final classification. The most significant advantage of this algorithm is that it performs exceptionally well with the Resnet-50 patch classifier architecture.
    Reward-Based 1-bit Compressed Federated Distillation on Blockchain. (arXiv:2106.14265v1 [cs.LG])
    (2 min) The recent advent of various forms of Federated Knowledge Distillation (FD) paves the way for a new generation of robust and communication-efficient Federated Learning (FL), where mere soft-labels are aggregated, rather than whole gradients of Deep Neural Networks (DNN) as done in previous FL schemes. This security-per-design approach in combination with increasingly performant Internet of Things (IoT) and mobile devices opens up a new realm of possibilities to utilize private data from industries as well as from individuals as input for artificial intelligence model training. Yet in previous FL systems, lack of trust due to the imbalance of power between workers and a central authority, the assumption of altruistic worker participation and the inability to correctly measure and compare contributions of workers hinder this technology from scaling beyond small groups of already entrusted entities towards mass adoption. This work aims to mitigate the aforementioned issues by introducing a novel decentralized federated learning framework where heavily compressed 1-bit soft-labels, resembling 1-hot label predictions, are aggregated on a smart contract. In a context where workers' contributions are now easily comparable, we modify the Peer Truth Serum for Crowdsourcing mechanism (PTSC) for FD to reward honest participation based on peer consistency in an incentive compatible fashion. Due to heavy reductions of both computational complexity and storage, our framework is a fully on-blockchain FL system that is feasible on simple smart contracts and therefore blockchain agnostic. We experimentally test our new framework and validate its theoretical properties.
    Use of Machine Learning Technique to maximize the signal over background for $H \rightarrow \tau \tau$. (arXiv:2106.14257v1 [physics.data-an])
    (2 min) In recent years, artificial neural networks (ANNs) have won numerous contests in pattern recognition and machine learning. ANNS have been applied to problems ranging from speech recognition to prediction of protein secondary structure, classification of cancers, and gene prediction. Here, we intend to maximize the chances of finding the Higgs boson decays to two $\tau$ leptons in the pseudo dataset using a Machine Learning technique to classify the recorded events as signal or background.
    Model-Advantage Optimization for Model-Based Reinforcement Learning. (arXiv:2106.14080v1 [cs.LG])
    (2 min) Model-based Reinforcement Learning (MBRL) algorithms have been traditionally designed with the goal of learning accurate dynamics of the environment. This introduces a mismatch between the objectives of model-learning and the overall learning problem of finding an optimal policy. Value-aware model learning, an alternative model-learning paradigm to maximum likelihood, proposes to inform model-learning through the value function of the learnt policy. While this paradigm is theoretically sound, it does not scale beyond toy settings. In this work, we propose a novel value-aware objective that is an upper bound on the absolute performance difference of a policy across two models. Further, we propose a general purpose algorithm that modifies the standard MBRL pipeline -- enabling learning with value aware objectives. Our proposed objective, in conjunction with this algorithm, is the first successful instantiation of value-aware MBRL on challenging continuous control environments, outperforming previous value-aware objectives and with competitive performance w.r.t. MLE-based MBRL approaches.
    Transfer-based adaptive tree for multimodal sentiment analysis based on user latent aspects. (arXiv:2106.14174v1 [cs.LG])
    (2 min) Multimodal sentiment analysis benefits various applications such as human-computer interaction and recommendation systems. It aims to infer the users' bipolar ideas using visual, textual, and acoustic signals. Although researchers affirm the association between cognitive cues and emotional manifestations, most of the current multimodal approaches in sentiment analysis disregard user-specific aspects. To tackle this issue, we devise a novel method to perform multimodal sentiment prediction using cognitive cues, such as personality. Our framework constructs an adaptive tree by hierarchically dividing users and trains the LSTM-based submodels, utilizing an attention-based fusion to transfer cognitive-oriented knowledge within the tree. Subsequently, the framework consumes the conclusive agglomerative knowledge from the adaptive tree to predict final sentiments. We also devise a dynamic dropout method to facilitate data sharing between neighboring nodes, reducing data sparsity. The empirical results on real-world datasets determine that our proposed model for sentiment prediction can surpass trending rivals. Moreover, compared to other ensemble approaches, the proposed transfer-based algorithm can better utilize the latent cognitive cues and foster the prediction outcomes. Based on the given extrinsic and intrinsic analysis results, we note that compared to other theoretical-based techniques, the proposed hierarchical clustering approach can better group the users within the adaptive tree.
    Improving Sequential Recommendation Consistency with Self-Supervised Imitation. (arXiv:2106.14031v1 [cs.IR])
    (2 min) Most sequential recommendation models capture the features of consecutive items in a user-item interaction history. Though effective, their representation expressiveness is still hindered by the sparse learning signals. As a result, the sequential recommender is prone to make inconsistent predictions. In this paper, we propose a model, \textbf{SSI}, to improve sequential recommendation consistency with Self-Supervised Imitation. Precisely, we extract the consistency knowledge by utilizing three self-supervised pre-training tasks, where temporal consistency and persona consistency capture user-interaction dynamics in terms of the chronological order and persona sensitivities, respectively. Furthermore, to provide the model with a global perspective, global session consistency is introduced by maximizing the mutual information among global and local interaction sequences. Finally, to comprehensively take advantage of all three independent aspects of consistency-enhanced knowledge, we establish an integrated imitation learning framework. The consistency knowledge is effectively internalized and transferred to the student model by imitating the conventional prediction logit as well as the consistency-enhanced item representations. In addition, the flexible self-supervised imitation framework can also benefit other student recommenders. Experiments on four real-world datasets show that SSI effectively outperforms the state-of-the-art sequential recommendation methods.
    Improved Approximation Algorithms for Individually Fair Clustering. (arXiv:2106.14043v1 [cs.DS])
    (2 min) We consider the $k$-clustering problem with $\ell_p$-norm cost, which includes $k$-median, $k$-means and $k$-center cost functions, under an individual notion of fairness proposed by Jung et al. [2020]: given a set of points $P$ of size $n$, a set of $k$ centers induces a fair clustering if for every point $v\in P$, $v$ can find a center among its $n/k$ closest neighbors. Recently, Mahabadi and Vakilian [2020] showed how to get a $(p^{O(p)},7)$-bicriteria approximation for the problem of fair $k$-clustering with $\ell_p$-norm cost: every point finds a center within distance at most $7$ times its distance to its $(n/k)$-th closest neighbor and the $\ell_p$-norm cost of the solution is at most $p^{O(p)}$ times the cost of an optimal fair solution. In this work, for any $\varepsilon>0$, we present an improved $(16^p +\varepsilon,3)$-bicriteria approximation for the fair $k$-clustering with $\ell_p$-norm cost. To achieve our guarantees, we extend the framework of [Charikar et al., 2002, Swamy, 2016] and devise a $16^p$-approximation algorithm for the facility location with $\ell_p$-norm cost under matroid constraint which might be of an independent interest. Besides, our approach suggests a reduction from our individually fair clustering to a clustering with a group fairness requirement proposed by Kleindessner et al. [2019], which is essentially the median matroid problem [Krishnaswamy et al., 2011].
    Predictive Control Using Learned State Space Models via Rolling Horizon Evolution. (arXiv:2106.13911v1 [cs.LG])
    (2 min) A large part of the interest in model-based reinforcement learning derives from the potential utility to acquire a forward model capable of strategic long term decision making. Assuming that an agent succeeds in learning a useful predictive model, it still requires a mechanism to harness it to generate and select among competing simulated plans. In this paper, we explore this theme combining evolutionary algorithmic planning techniques with models learned via deep learning and variational inference. We demonstrate the approach with an agent that reliably performs online planning in a set of visual navigation tasks.
    Core Challenges in Embodied Vision-Language Planning. (arXiv:2106.13948v1 [cs.LG])
    (2 min) Recent advances in the areas of multimodal machine learning and artificial intelligence (AI) have led to the development of challenging tasks at the intersection of Computer Vision, Natural Language Processing, and Embodied AI. Whereas many approaches and previous survey pursuits have characterised one or two of these dimensions, there has not been a holistic analysis at the center of all three. Moreover, even when combinations of these topics are considered, more focus is placed on describing, e.g., current architectural methods, as opposed to also illustrating high-level challenges and opportunities for the field. In this survey paper, we discuss Embodied Vision-Language Planning (EVLP) tasks, a family of prominent embodied navigation and manipulation problems that jointly use computer vision and natural language. We propose a taxonomy to unify these tasks and provide an in-depth analysis and comparison of the new and current algorithmic approaches, metrics, simulated environments, as well as the datasets used for EVLP tasks. Finally, we present the core challenges that we believe new EVLP works should seek to address, and we advocate for task construction that enables model generalizability and furthers real-world deployment.
    Learning and Planning in Average-Reward Markov Decision Processes. (arXiv:2006.16318v3 [cs.LG] UPDATED)
    (2 min) We introduce learning and planning algorithms for average-reward MDPs, including 1) the first general proven-convergent off-policy model-free control algorithm without reference states, 2) the first proven-convergent off-policy model-free prediction algorithm, and 3) the first off-policy learning algorithm that converges to the actual value function rather than to the value function plus an offset. All of our algorithms are based on using the temporal-difference error rather than the conventional error when updating the estimate of the average reward. Our proof techniques are a slight generalization of those by Abounadi, Bertsekas, and Borkar (2001). In experiments with an Access-Control Queuing Task, we show some of the difficulties that can arise when using methods that rely on reference states and argue that our new algorithms can be significantly easier to use.
    Surrogate Model Based Hyperparameter Tuning for Deep Learning with SPOT. (arXiv:2105.14625v2 [cs.LG] UPDATED)
    (2 min) A surrogate model based hyperparameter tuning approach for deep learning is presented. This article demonstrates how the architecture-level parameters (hyperparameters) of deep learning models that were implemented in Keras/tensorflow can be optimized. The implementation of the tuning procedure is 100% accessible from R, the software environment for statistical computing. With a few lines of code, existing R packages (tfruns and SPOT) can be combined to perform hyperparameter tuning. An elementary hyperparameter tuning task (neural network and the MNIST data) is used to exemplify this approach
    Matching Point Sets with Quantum Circuit Learning. (arXiv:2102.06697v2 [cs.CV] UPDATED)
    (2 min) In this work, we propose a parameterised quantum circuit learning approach to point set matching problem. In contrast to previous annealing-based methods, we propose a quantum circuit-based framework whose parameters are optimised via descending the gradients w.r.t a kernel-based loss function. We formulate the shape matching problem into a distribution learning task; that is, to learn the distribution of the optimal transformation parameters. We show that this framework is able to find multiple optimal solutions for symmetric shapes and is more accurate, scalable and robust than the previous annealing-based method. Code, data and pre-trained weights are available at the project page: \href{https://hansen7.github.io/qKC}{https://hansen7.github.io/qKC}
    Discovering Generalizable Skills via Automated Generation of Diverse Tasks. (arXiv:2106.13935v1 [cs.RO])
    (2 min) The learning efficiency and generalization ability of an intelligent agent can be greatly improved by utilizing a useful set of skills. However, the design of robot skills can often be intractable in real-world applications due to the prohibitive amount of effort and expertise that it requires. In this work, we introduce Skill Learning In Diversified Environments (SLIDE), a method to discover generalizable skills via automated generation of a diverse set of tasks. As opposed to prior work on unsupervised discovery of skills which incentivizes the skills to produce different outcomes in the same environment, our method pairs each skill with a unique task produced by a trainable task generator. To encourage generalizable skills to emerge, our method trains each skill to specialize in the paired task and maximizes the diversity of the generated tasks. A task discriminator defined on the robot behaviors in the generated tasks is jointly trained to estimate the evidence lower bound of the diversity objective. The learned skills can then be composed in a hierarchical reinforcement learning algorithm to solve unseen target tasks. We demonstrate that the proposed method can effectively learn a variety of robot skills in two tabletop manipulation domains. Our results suggest that the learned skills can effectively improve the robot's performance in various unseen target tasks compared to existing reinforcement learning and skill learning methods.
    PhyCRNet: Physics-informed Convolutional-Recurrent Network for Solving Spatiotemporal PDEs. (arXiv:2106.14103v1 [cs.LG])
    (2 min) Partial differential equations (PDEs) play a fundamental role in modeling and simulating problems across a wide range of disciplines. Recent advances in deep learning have shown the great potential of physics-informed neural networks (PINNs) to solve PDEs as a basis for data-driven modeling and inverse analysis. However, the majority of existing PINN methods, based on fully-connected NNs, pose intrinsic limitations to low-dimensional spatiotemporal parameterizations. Moreover, since the initial/boundary conditions (I/BCs) are softly imposed via penalty, the solution quality heavily relies on hyperparameter tuning. To this end, we propose the novel physics-informed convolutional-recurrent learning architectures (PhyCRNet and PhyCRNet-s) for solving PDEs without any labeled data. Specifically, an encoder-decoder convolutional long short-term memory network is proposed for low-dimensional spatial feature extraction and temporal evolution learning. The loss function is defined as the aggregated discretized PDE residuals, while the I/BCs are hard-encoded in the network to ensure forcible satisfaction (e.g., periodic boundary padding). The networks are further enhanced by autoregressive and residual connections that explicitly simulate time marching. The performance of our proposed methods has been assessed by solving three nonlinear PDEs (e.g., 2D Burgers' equations, the $\lambda$-$\omega$ and FitzHugh Nagumo reaction-diffusion equations), and compared against the start-of-the-art baseline algorithms. The numerical results demonstrate the superiority of our proposed methodology in the context of solution accuracy, extrapolability and generalizability.
    Solar Irradiation Forecasting using Genetic Algorithms. (arXiv:2106.13956v1 [cs.LG])
    (2 min) Renewable energy forecasting is attaining greater importance due to its constant increase in contribution to the electrical power grids. Solar energy is one of the most significant contributors to renewable energy and is dependent on solar irradiation. For the effective management of electrical power grids, forecasting models that predict solar irradiation, with high accuracy, are needed. In the current study, Machine Learning techniques such as Linear Regression, Extreme Gradient Boosting and Genetic Algorithm Optimization are used to forecast solar irradiation. The data used for training and validation is recorded from across three different geographical stations in the United States that are part of the SURFRAD network. A Global Horizontal Index (GHI) is predicted for the models built and compared. Genetic Algorithm Optimization is applied to XGB to further improve the accuracy of solar irradiation prediction.
    Benchmarking convolutional neural networks for diagnosing Lyme disease from images. (arXiv:2106.14465v1 [eess.IV])
    (3 min) Lyme disease is one of the most common infectious vector-borne diseases in the world. In the early stage, the disease manifests itself in most cases with erythema migrans (EM) skin lesions. Better diagnosis of these early forms would allow improving the prognosis by preventing the transition to a severe late form thanks to appropriate antibiotic therapy. Recent studies show that convolutional neural networks (CNNs) perform very well to identify skin lesions from the image but, there is not much work for Lyme disease prediction from EM lesion images. The main objective of this study is to extensively analyze the effectiveness of CNNs for diagnosing Lyme disease from images and to find out the best CNN architecture for the purpose. There is no publicly available EM image dataset for Lyme dis…
    Learning Nash Equilibria in Zero-Sum Stochastic Games via Entropy-Regularized Policy Approximation. (arXiv:2009.00162v2 [cs.LG] UPDATED)
    (2 min) We explore the use of policy approximations to reduce the computational cost of learning Nash equilibria in zero-sum stochastic games. We propose a new Q-learning type algorithm that uses a sequence of entropy-regularized soft policies to approximate the Nash policy during the Q-function updates. We prove that under certain conditions, by updating the regularized Q-function, the algorithm converges to a Nash equilibrium. We also demonstrate the proposed algorithm's ability to transfer previous training experiences, enabling the agents to adapt quickly to new environments. We provide a dynamic hyper-parameter scheduling scheme to further expedite convergence. Empirical results applied to a number of stochastic games verify that the proposed algorithm converges to the Nash equilibrium, while exhibiting a major speed-up over existing algorithms.
    Touch-based Curiosity for Sparse-Reward Tasks. (arXiv:2104.00442v2 [cs.LG] UPDATED)
    (2 min) Robots in many real-world settings have access to force/torque sensors in their gripper and tactile sensing is often necessary in tasks that involve contact-rich motion. In this work, we leverage surprise from mismatches in touch feedback to guide exploration in hard sparse-reward reinforcement learning tasks. Our approach, Touch-based Curiosity (ToC), learns what visible objects interactions are supposed to "feel" like. We encourage exploration by rewarding interactions where the expectation and the experience don't match. In our proposed method, an initial task-independent exploration phase is followed by an on-task learning phase, in which the original interactions are relabeled with on-task rewards. We test our approach on a range of touch-intensive robot arm tasks (e.g. pushing objects, opening doors), which we also release as part of this work. Across multiple experiments in a simulated setting, we demonstrate that our method is able to learn these difficult tasks through sparse reward and curiosity alone. We compare our cross-modal approach to single-modality (touch- or vision-only) approaches as well as other curiosity-based methods and find that our method performs better and is more sample-efficient.
    Building population models for large-scale neural recordings: opportunities and pitfalls. (arXiv:2102.01807v3 [q-bio.NC] UPDATED)
    (2 min) Modern recording technologies now enable simultaneous recording from large numbers of neurons. This has driven the development of new statistical models for analyzing and interpreting neural population activity. Here we provide a broad overview of recent developments in this area. We compare and contrast different approaches, highlight strengths and limitations, and discuss biological and mechanistic insights that these methods provide.
    Prior Flow Variational Autoencoder: A density estimation model for Non-Intrusive Load Monitoring. (arXiv:2011.14870v2 [cs.LG] UPDATED)
    (2 min) Non-Intrusive Load Monitoring (NILM) is a computational technique to estimate the power loads' appliance-by-appliance from the whole consumption measured by a single meter. In this paper, we propose a conditional density estimation model, based on deep neural networks, that joins a Conditional Variational Autoencoder with a Conditional Invertible Normalizing Flow model to estimate the individual appliance's power demand. The resulting model is called Prior Flow Variational Autoencoder or, for simplicity PFVAE. Thus, instead of having one model per appliance, the resulting model is responsible for estimating the power demand, appliance-by-appliance, at once. We train and evaluate our proposed model in a publicly available dataset composed of power demand measures from a poultry feed factory located in Brazil. The proposed model's quality is evaluated by comparing the obtained normalized disaggregation error (NDE) and signal aggregated error (SAE) with the previous work values on the same dataset. Our proposal achieves highly competitive results, and for six of the eight machines belonging to the dataset, we observe consistent improvements that go from 28% up to 81% in NDE and from 27% up to 86% in SAE.
    TimeSHAP: Explaining Recurrent Models through Sequence Perturbations. (arXiv:2012.00073v2 [cs.LG] UPDATED)
    (2 min) Although recurrent neural networks (RNNs) are state-of-the-art in numerous sequential decision-making tasks, there has been little research on explaining their predictions. In this work, we present TimeSHAP, a model-agnostic recurrent explainer that builds upon KernelSHAP and extends it to the sequential domain. TimeSHAP computes feature-, timestep-, and cell-level attributions. As sequences may be arbitrarily long, we further propose a pruning method that is shown to dramatically decrease both its computational cost and the variance of its attributions. We use TimeSHAP to explain the predictions of a real-world bank account takeover fraud detection RNN model, and draw key insights from its explanations: i) the model identifies important features and events aligned with what fraud analysts consider cues for account takeover; ii) positive predicted sequences can be pruned to only 10% of the original length, as older events have residual attribution values; iii) the most recent input event of positive predictions only contributes on average to 41% of the model's score; iv) notably high attribution to client's age, suggesting a potential discriminatory reasoning, later confirmed as higher false positive rates for older clients.
    Learning transferable and discriminative features for unsupervised domain adaptation. (arXiv:2003.11723v2 [cs.LG] UPDATED)
    (2 min) Although achieving remarkable progress, it is very difficult to induce a supervised classifier without any labeled data. Unsupervised domain adaptation is able to overcome this challenge by transferring knowledge from a labeled source domain to an unlabeled target domain. Transferability and discriminability are two key criteria for characterizing the superiority of feature representations to enable successful domain adaptation. In this paper, a novel method called \textit{learning TransFerable and Discriminative Features for unsupervised domain adaptation} (TFDF) is proposed to optimize these two objectives simultaneously. On the one hand, distribution alignment is performed to reduce domain discrepancy and learn more transferable representations. Instead of adopting \textit{Maximum Mean Discrepancy} (MMD) which only captures the first-order statistical information to measure distribution discrepancy, we adopt a recently proposed statistic called \textit{Maximum Mean and Covariance Discrepancy} (MMCD), which can not only capture the first-order statistical information but also capture the second-order statistical information in the reproducing kernel Hilbert space (RKHS). On the other hand, we propose to explore both local discriminative information via manifold regularization and global discriminative information via minimizing the proposed \textit{class confusion} objective to learn more discriminative features, respectively. We integrate these two objectives into the \textit{Structural Risk Minimization} (RSM) framework and learn a domain-invariant classifier. Comprehensive experiments are conducted on five real-world datasets and the results verify the effectiveness of the proposed method.
    Use of Variational Inference in Music Emotion Recognition. (arXiv:2106.14323v1 [stat.ML])
    (2 min) This work was developed aiming to employ Statistical techniques to the field of Music Emotion Recognition, a well-recognized area within the Signal Processing world, but hardly explored from the statistical point of view. Here, we opened several possibilities within the field, applying modern Bayesian Statistics techniques and developing efficient algorithms, focusing on the applicability of the results obtained. Although the motivation for this project was the development of a emotion-based music recommendation system, its main contribution is a highly adaptable multivariate model that can be useful interpreting any database where there is an interest in applying regularization in an efficient manner. Broadly speaking, we will explore what role a sound theoretical statistical analysis can play in the modeling of an algorithm that is able to understand a well-known database and what can be gained with this kind of approach.
    Certified Robustness via Randomized Smoothing over Multiplicative Parameters. (arXiv:2106.14432v1 [cs.LG])
    (2 min) We propose a novel approach of randomized smoothing over multiplicative parameters. Using this method we construct certifiably robust classifiers with respect to a gamma-correction perturbation and compare the result with classifiers obtained via Gaussian smoothing. To the best of our knowledge it is the first work concerning certified robustness against the multiplicative gamma-correction transformation.
    Implicit Gradient Alignment in Distributed and Federated Learning. (arXiv:2106.13897v1 [cs.LG])
    (2 min) A major obstacle to achieving global convergence in distributed and federated learning is the misalignment of gradients across clients, or mini-batches due to heterogeneity and stochasticity of the distributed data. One way to alleviate this problem is to encourage the alignment of gradients across different clients throughout training. Our analysis reveals that this goal can be accomplished by utilizing the right optimization method that replicates the implicit regularization effect of SGD, leading to gradient alignment as well as improvements in test accuracies. Since the existence of this regularization in SGD completely relies on the sequential use of different mini-batches during training, it is inherently absent when training with large mini-batches. To obtain the generalization benefits of this regularization while increasing parallelism, we propose a novel GradAlign algorithm that induces the same implicit regularization while allowing the use of arbitrarily large batches in each update. We experimentally validate the benefit of our algorithm in different distributed and federated learning settings.
    Error bound of critical points and KL property of exponent $1/2$ for squared F-norm regularized factorization. (arXiv:1911.04293v2 [math.OC] UPDATED)
    (2 min) This paper is concerned with the squared F(robenius)-norm regularized factorization form for noisy low-rank matrix recovery problems. Under a suitable assumption on the restricted condition number of the Hessian for the loss function, we derive an error bound to the true matrix for the non-strict critical points with rank not more than that of the true matrix. Then, for the squared F-norm regularized factorized least squares loss function, under the noisy and full sample setting we establish its KL property of exponent $1/2$ on its global minimizer set, and under the noisy and partial sample setting achieve this property for a class of critical points. These theoretical findings are also confirmed by solving the squared F-norm regularized factorization problem with an accelerated alternating minimization method.
    Individual Privacy Accounting via a Renyi Filter. (arXiv:2008.11193v3 [cs.CR] UPDATED)
    (2 min) We consider a sequential setting in which a single dataset of individuals is used to perform adaptively-chosen analyses, while ensuring that the differential privacy loss of each participant does not exceed a pre-specified privacy budget. The standard approach to this problem relies on bounding a worst-case estimate of the privacy loss over all individuals and all possible values of their data, for every single analysis. Yet, in many scenarios this approach is overly conservative, especially for "typical" data points which incur little privacy loss by participation in most of the analyses. In this work, we give a method for tighter privacy loss accounting based on the value of a personalized privacy loss estimate for each individual in each analysis. To implement the accounting method we design a filter for R\'enyi differential privacy. A filter is a tool that ensures that the privacy parameter of a composed sequence of algorithms with adaptively-chosen privacy parameters does not exceed a pre-specified budget. Our filter is simpler and tighter than the known filter for $(\epsilon,\delta)$-differential privacy by Rogers et al. We apply our results to the analysis of noisy gradient descent and show that personalized accounting can be practical, easy to implement, and can only make the privacy-utility tradeoff tighter.
    Knowledge Infused Policy Gradients with Upper Confidence Bound for Relational Bandits. (arXiv:2106.13895v1 [cs.LG])
    (2 min) Contextual Bandits find important use cases in various real-life scenarios such as online advertising, recommendation systems, healthcare, etc. However, most of the algorithms use flat feature vectors to represent context whereas, in the real world, there is a varying number of objects and relations among them to model in the context. For example, in a music recommendation system, the user context contains what music they listen to, which artists create this music, the artist albums, etc. Adding richer relational context representations also introduces a much larger context space making exploration-exploitation harder. To improve the efficiency of exploration-exploitation knowledge about the context can be infused to guide the exploration-exploitation strategy. Relational context representations allow a natural way for humans to specify knowledge owing to their descriptive nature. We propose an adaptation of Knowledge Infused Policy Gradients to the Contextual Bandit setting and a novel Knowledge Infused Policy Gradients Upper Confidence Bound algorithm and perform an experimental analysis of a simulated music recommendation dataset and various real-life datasets where expert knowledge can drastically reduce the total regret and where it cannot.
    Capturing Dynamics of Time-Varying Data via Topology. (arXiv:2010.05780v2 [cs.LG] UPDATED)
    (2 min) One approach to understanding complex data is to study its shape through the lens of algebraic topology. While the early development of topological data analysis focused primarily on static data, in recent years, theoretical and applied studies have turned to data that varies in time. A time-varying collection of metric spaces as formed, for example, by a moving school of fish or flock of birds, can contain a vast amount of information. There is often a need to simplify or summarize the dynamic behavior. We provide an introduction to topological summaries of time-varying metric spaces including vineyards [19], crocker plots [56], and multiparameter rank functions [37]. We then introduce a new tool to summarize time-varying metric spaces: a crocker stack. Crocker stacks are convenient for visualization, amenable to machine learning, and satisfy a desirable continuity property which we prove. We demonstrate the utility of crocker stacks for a parameter identification task involving an influential model of biological aggregations [58]. Altogether, we aim to bring the broader applied mathematics community up-to-date on topological summaries of time-varying metric spaces.
    Quantum Data Compression and Quantum Cross Entropy. (arXiv:2106.13823v1 [quant-ph])
    (2 min) Quantum machine learning is an emerging field at the intersection of machine learning and quantum computing. A central quantity for the theoretical foundation of quantum machine learning is the quantum cross entropy. In this paper, we present one operational interpretation of this quantity, that the quantum cross entropy is the compression rate for sub-optimal quantum source coding. To do so, we give a simple, universal quantum data compression protocol, which is developed based on quantum generalization of variable-length coding, as well as quantum strong typicality.
    On Power Laws in Deep Ensembles. (arXiv:2007.08483v2 [cs.LG] UPDATED)
    (2 min) Ensembles of deep neural networks are known to achieve state-of-the-art performance in uncertainty estimation and lead to accuracy improvement. In this work, we focus on a classification problem and investigate the behavior of both non-calibrated and calibrated negative log-likelihood (CNLL) of a deep ensemble as a function of the ensemble size and the member network size. We indicate the conditions under which CNLL follows a power law w.r.t. ensemble size or member network size, and analyze the dynamics of the parameters of the discovered power laws. Our important practical finding is that one large network may perform worse than an ensemble of several medium-size networks with the same total number of parameters (we call this ensemble a memory split). Using the detected power law-like dependencies, we can predict (1) the possible gain from the ensembling of networks with given structure, (2) the optimal memory split given a memory budget, based on a relatively small number of trained networks. We describe the memory split advantage effect in more details in arXiv:2005.07292
    Multimodal Few-Shot Learning with Frozen Language Models. (arXiv:2106.13884v1 [cs.CV])
    (2 min) When trained at sufficient scale, auto-regressive language models exhibit the notable ability to learn a new language task after being prompted with just a few examples. Here, we present a simple, yet effective, approach for transferring this few-shot learning ability to a multimodal setting (vision and language). Using aligned image and caption data, we train a vision encoder to represent each image as a sequence of continuous embeddings, such that a pre-trained, frozen language model prompted with this prefix generates the appropriate caption. The resulting system is a multimodal few-shot learner, with the surprising ability to learn a variety of new tasks when conditioned on examples, represented as a sequence of multiple interleaved image and text embeddings. We demonstrate that it can rapidly learn words for new objects and novel visual categories, do visual question-answering with only a handful of examples, and make use of outside knowledge, by measuring a single model on a variety of established and new benchmarks.
    AutoPipeline: Synthesize Data Pipelines By-Target Using Reinforcement Learning and Search. (arXiv:2106.13861v1 [cs.DB])
    (2 min) Recent work has made significant progress in helping users to automate single data preparation steps, such as string-transformations and table-manipulation operators (e.g., Join, GroupBy, Pivot, etc.). We in this work propose to automate multiple such steps end-to-end, by synthesizing complex data pipelines with both string transformations and table-manipulation operators. We propose a novel "by-target" paradigm that allows users to easily specify the desired pipeline, which is a significant departure from the traditional by-example paradigm. Using by-target, users would provide input tables (e.g., csv or json files), and point us to a "target table" (e.g., an existing database table or BI dashboard) to demonstrate how the output from the desired pipeline would schematically "look like". While the problem is seemingly underspecified, our unique insight is that implicit table constraints such as FDs and keys can be exploited to significantly constrain the space to make the problem tractable. We develop an Auto-Pipeline system that learns to synthesize pipelines using reinforcement learning and search. Experiments on large numbers of real pipelines crawled from GitHub suggest that Auto-Pipeline can successfully synthesize 60-70% of these complex pipelines (up to 10 steps) in 10-20 seconds on average.
    R-Drop: Regularized Dropout for Neural Networks. (arXiv:2106.14448v1 [cs.LG])
    (2 min) Dropout is a powerful and widely used technique to regularize the training of deep neural networks. In this paper, we introduce a simple regularization strategy upon dropout in model training, namely R-Drop, which forces the output distributions of different sub models generated by dropout to be consistent with each other. Specifically, for each training sample, R-Drop minimizes the bidirectional KL-divergence between the output distributions of two sub models sampled by dropout. Theoretical analysis reveals that R-Drop reduces the freedom of the model parameters and complements dropout. Experiments on $\bf{5}$ widely used deep learning tasks ($\bf{18}$ datasets in total), including neural machine translation, abstractive summarization, language understanding, language modeling, and image classification, show that R-Drop is universally effective. In particular, it yields substantial improvements when applied to fine-tune large-scale pre-trained models, e.g., ViT, RoBERTa-large, and BART, and achieves state-of-the-art (SOTA) performances with the vanilla Transformer model on WMT14 English$\to$German translation ($\bf{30.91}$ BLEU) and WMT14 English$\to$French translation ($\bf{43.95}$ BLEU), even surpassing models trained with extra large-scale data and expert-designed advanced variants of Transformer models. Our code is available at GitHub{\url{https://github.com/dropreg/R-Drop}}.
    Scene Uncertainty and the Wellington Posterior of Deterministic Image Classifiers. (arXiv:2106.13870v1 [cs.CV])
    (2 min) We propose a method to estimate the uncertainty of the outcome of an image classifier on a given input datum. Deep neural networks commonly used for image classification are deterministic maps from an input image to an output class. As such, their outcome on a given datum involves no uncertainty, so we must specify what variability we are referring to when defining, measuring and interpreting "confidence." To this end, we introduce the Wellington Posterior, which is the distribution of outcomes that would have been obtained in response to data that could have been generated by the same scene that produced the given image. Since there are infinitely many scenes that could have generated the given image, the Wellington Posterior requires induction from scenes other than the one portrayed. We explore alternate methods using data augmentation, ensembling, and model linearization. Additional alternatives include generative adversarial networks, conditional prior networks, and supervised single-view reconstruction. We test these alternatives against the empirical posterior obtained by inferring the class of temporally adjacent frames in a video. These developments are only a small step towards assessing the reliability of deep network classifiers in a manner that is compatible with safety-critical applications.
    Adversarial Generation of Continuous Images. (arXiv:2011.12026v2 [cs.CV] UPDATED)
    (2 min) In most existing learning systems, images are typically viewed as 2D pixel arrays. However, in another paradigm gaining popularity, a 2D image is represented as an implicit neural representation (INR) - an MLP that predicts an RGB pixel value given its (x,y) coordinate. In this paper, we propose two novel architectural techniques for building INR-based image decoders: factorized multiplicative modulation and multi-scale INRs, and use them to build a state-of-the-art continuous image GAN. Previous attempts to adapt INRs for image generation were limited to MNIST-like datasets and do not scale to complex real-world data. Our proposed INR-GAN architecture improves the performance of continuous image generators by several times, greatly reducing the gap between continuous image GANs and pixel-based ones. Apart from that, we explore several exciting properties of the INR-based decoders, like out-of-the-box superresolution, meaningful image-space interpolation, accelerated inference of low-resolution images, an ability to extrapolate outside of image boundaries, and strong geometric prior. The project page is located at https://universome.github.io/inr-gan.
    Aggregating Incomplete and Noisy Rankings. (arXiv:2011.00810v2 [cs.LG] UPDATED)
    (2 min) We consider the problem of learning the true ordering of a set of alternatives from largely incomplete and noisy rankings. We introduce a natural generalization of both the classical Mallows model of ranking distributions and the extensively studied model of noisy pairwise comparisons. Our selective Mallows model outputs a noisy ranking on any given subset of alternatives, based on an underlying Mallows distribution. Assuming a sequence of subsets where each pair of alternatives appears frequently enough, we obtain strong asymptotically tight upper and lower bounds on the sample complexity of learning the underlying complete ranking and the (identities and the) ranking of the top-k alternatives from selective Mallows rankings. Moreover, building on the work of (Braverman and Mossel, 2009), we show how to efficiently compute the maximum likelihood complete ranking from selective Mallows rankings.
    A multi-stage machine learning model on diagnosis of esophageal manometry. (arXiv:2106.13869v1 [cs.LG])
    (2 min) High-resolution manometry (HRM) is the primary procedure used to diagnose esophageal motility disorders. Its interpretation and classification includes an initial evaluation of swallow-level outcomes and then derivation of a study-level diagnosis based on Chicago Classification (CC), using a tree-like algorithm. This diagnostic approach on motility disordered using HRM was mirrored using a multi-stage modeling framework developed using a combination of various machine learning approaches. Specifically, the framework includes deep-learning models at the swallow-level stage and feature-based machine learning models at the study-level stage. In the swallow-level stage, three models based on convolutional neural networks (CNNs) were developed to predict swallow type, swallow pressurization, and integrated relaxation pressure (IRP). At the study-level stage, model selection from families of the expert-knowledge-based rule models, xgboost models and artificial neural network(ANN) models were conducted, with the latter two model designed and augmented with motivation from the export knowledge. A simple model-agnostic strategy of model balancing motivated by Bayesian principles was utilized, which gave rise to model averaging weighted by precision scores. The averaged (blended) models and individual models were compared and evaluated, of which the best performance on test dataset is 0.81 in top-1 prediction, 0.92 in top-2 predictions. This is the first artificial-intelligence-style model to automatically predict CC diagnosis of HRM study from raw multi-swallow data. Moreover, the proposed modeling framework could be easily extended to multi-modal tasks, such as diagnosis of esophageal patients based on clinical data from both HRM and functional luminal imaging probe panometry (FLIP).
    Transflower: probabilistic autoregressive dance generation with multimodal attention. (arXiv:2106.13871v1 [cs.SD])
    (2 min) Dance requires skillful composition of complex movements that follow rhythmic, tonal and timbral features of music. Formally, generating dance conditioned on a piece of music can be expressed as a problem of modelling a high-dimensional continuous motion signal, conditioned on an audio signal. In this work we make two contributions to tackle this problem. First, we present a novel probabilistic autoregressive architecture that models the distribution over future poses with a normalizing flow conditioned on previous poses as well as music context, using a multimodal transformer encoder. Second, we introduce the currently largest 3D dance-motion dataset, obtained with a variety of motion-capture technologies, and including both professional and casual dancers. Using this dataset, we compare our new model against two baselines, via objective metrics and a user study, and show that both the ability to model a probability distribution, as well as being able to attend over a large motion and music context are necessary to produce interesting, diverse, and realistic dance that matches the music.
    Self-paced Principal Component Analysis. (arXiv:2106.13880v1 [cs.LG])
    (2 min) Principal Component Analysis (PCA) has been widely used for dimensionality reduction and feature extraction. Robust PCA (RPCA), under different robust distance metrics, such as l1-norm and l2, p-norm, can deal with noise or outliers to some extent. However, real-world data may display structures that can not be fully captured by these simple functions. In addition, existing methods treat complex and simple samples equally. By contrast, a learning pattern typically adopted by human beings is to learn from simple to complex and less to more. Based on this principle, we propose a novel method called Self-paced PCA (SPCA) to further reduce the effect of noise and outliers. Notably, the complexity of each sample is calculated at the beginning of each iteration in order to integrate samples from simple to more complex into training. Based on an alternating optimization, SPCA finds an optimal projection matrix and filters out outliers iteratively. Theoretical analysis is presented to show the rationality of SPCA. Extensive experiments on popular data sets demonstrate that the proposed method can improve the state of-the-art results considerably.
    Domain Conditional Predictors for Domain Adaptation. (arXiv:2106.13899v1 [cs.LG])
    (2 min) Learning guarantees often rely on assumptions of i.i.d. data, which will likely be violated in practice once predictors are deployed to perform real-world tasks. Domain adaptation approaches thus appeared as a useful framework yielding extra flexibility in that distinct train and test data distributions are supported, provided that other assumptions are satisfied such as covariate shift, which expects the conditional distributions over labels to be independent of the underlying data distribution. Several approaches were introduced in order to induce generalization across varying train and test data sources, and those often rely on the general idea of domain-invariance, in such a way that the data-generating distributions are to be disregarded by the prediction model. In this contribution, we tackle the problem of generalizing across data sources by approaching it from the opposite direction: we consider a conditional modeling approach in which predictions, in addition to being dependent on the input data, use information relative to the underlying data-generating distribution. For instance, the model has an explicit mechanism to adapt to changing environments and/or new data sources. We argue that such an approach is more generally applicable than current domain adaptation methods since it does not require extra assumptions such as covariate shift and further yields simpler training algorithms that avoid a common source of training instabilities caused by minimax formulations, often employed in domain-invariant methods.
    Pastprop-RNN: improved predictions of the future by correcting the past. (arXiv:2106.13881v1 [cs.LG])
    (2 min) Forecasting accuracy is reliant on the quality of available past data. Data disruptions can adversely affect the quality of the generated model (e.g. unexpected events such as out-of-stock products when forecasting demand). We address this problem by pastcasting: predicting how data should have been in the past to explain the future better. We propose Pastprop-LSTM, a data-centric backpropagation algorithm that assigns part of the responsibility for errors to the training data and changes it accordingly. We test three variants of Pastprop-LSTM on forecasting competition datasets, M4 and M5, plus the Numenta Anomaly Benchmark. Empirical evaluation indicates that the proposed method can improve forecasting accuracy, especially when the prediction errors of standard LSTM are high. It also demonstrates the potential of the algorithm on datasets containing anomalies.
    Longitudinal Self-Supervised Learning. (arXiv:2006.06930v2 [cs.LG] UPDATED)
    (2 min) Machine learning analysis of longitudinal neuroimaging data is typically based on supervised learning, which requires a large number of ground-truth labels to be informative. As ground-truth labels are often missing or expensive to obtain in neuroscience, we avoid them in our analysis by combing factor disentanglement with self-supervised learning to identify changes and consistencies across the multiple MRIs acquired of each individual over time. Specifically, we propose a new definition of disentanglement by formulating a multivariate mapping between factors (e.g., brain age) associated with an MRI and a latent image representation. Then, factors that evolve across acquisitions of longitudinal sequences are disentangled from that mapping by self-supervised learning in such a way that changes in a single factor induce change along one direction in the representation space. We implement this model, named Longitudinal Self-Supervised Learning (LSSL), via a standard autoencoding structure with a cosine loss to disentangle brain age from the image representation. We apply LSSL to two longitudinal neuroimaging studies to highlight its strength in extracting the brain-age information from MRI and revealing informative characteristics associated with neurodegenerative and neuropsychological disorders. Moreover, the representations learned by LSSL facilitate supervised classification by recording faster convergence and higher (or similar) prediction accuracy compared to several other representation learning techniques.
    Training Saturation in Layerwise Quantum Approximate Optimisation. (arXiv:2106.13814v1 [quant-ph])
    (2 min) Quantum Approximate Optimisation (QAOA) is the most studied gate based variational quantum algorithm today. We train QAOA one layer at a time to maximize overlap with an $n$ qubit target state. Doing so we discovered that such training always saturates -- called \textit{training saturation} -- at some depth $p^*$, meaning that past a certain depth, overlap can not be improved by adding subsequent layers. We formulate necessary conditions for saturation. Numerically, we find layerwise QAOA reaches its maximum overlap at depth $p^*=n$. The addition of coherent dephasing errors to training removes saturation, recovering robustness to layerwise training. This study sheds new light on the performance limitations and prospects of QAOA.
    A Photonic-Circuits-Inspired Compact Network: Toward Real-Time Wireless Signal Classification at the Edge. (arXiv:2106.13865v1 [eess.SP])
    (2 min) Machine learning (ML) methods are ubiquitous in wireless communication systems and have proven powerful for applications including radio-frequency (RF) fingerprinting, automatic modulation classification, and cognitive radio. However, the large size of ML models can make them difficult to implement on edge devices for latency-sensitive downstream tasks. In wireless communication systems, ML data processing at a sub-millisecond scale will enable real-time network monitoring to improve security and prevent infiltration. In addition, compact and integratable hardware platforms which can implement ML models at the chip scale will find much broader application to wireless communication networks. Toward real-time wireless signal classification at the edge, we propose a novel compact deep network that consists of a photonic-hardware-inspired recurrent neural network model in combination with a simplified convolutional classifier, and we demonstrate its application to the identification of RF emitters by their random transmissions. With the proposed model, we achieve 96.32% classification accuracy over a set of 30 identical ZigBee devices when using 50 times fewer training parameters than an existing state-of-the-art CNN classifier. Thanks to the large reduction in network size, we demonstrate real-time RF fingerprinting with 0.219 ms latency using a small-scale FPGA board, the PYNQ-Z1.
    Fully Steerable 3D Spherical Neurons. (arXiv:2106.13863v1 [cs.CV])
    (2 min) Emerging from low-level vision theory, steerable filters found their counterpart in deep learning. Earlier works used the steering theorems and presented convolutional networks equivariant to rigid transformations. In our work, we propose a steerable feed-forward learning-based approach that consists of spherical decision surfaces and operates on point clouds. Due to the inherent geometric 3D structure of our theory, we derive a 3D steerability constraint for its atomic parts, the hypersphere neurons. Exploiting the rotational equivariance, we show how the model parameters are fully steerable at inference time. The proposed spherical filter banks enable to make equivariant and, after online optimization, invariant class predictions for known synthetic point sets in unknown orientations.
    Approximate Maximum Halfspace Discrepancy. (arXiv:2106.13851v1 [cs.CG])
    (2 min) Consider the geometric range space $(X, \mathcal{H}_d)$ where $X \subset \mathbb{R}^d$ and $\mathcal{H}_d$ is the set of ranges defined by $d$-dimensional halfspaces. In this setting we consider that $X$ is the disjoint union of a red and blue set. For each halfspace $h \in \mathcal{H}_d$ define a function $\Phi(h)$ that measures the "difference" between the fraction of red and fraction of blue points which fall in the range $h$. In this context the maximum discrepancy problem is to find the $h^* = \arg \max_{h \in (X, \mathcal{H}_d)} \Phi(h)$. We aim to instead find an $\hat{h}$ such that $\Phi(h^*) - \Phi(\hat{h}) \le \varepsilon$. This is the central problem in linear classification for machine learning, in spatial scan statistics for spatial anomaly detection, and shows up in many other areas. We provide a solution for this problem in $O(|X| + (1/\varepsilon^d) \log^4 (1/\varepsilon))$ time, which improves polynomially over the previous best solutions. For $d=2$ we show that this is nearly tight through conditional lower bounds. For different classes of $\Phi$ we can either provide a $\Omega(|X|^{3/2 - o(1)})$ time lower bound for the exact solution with a reduction to APSP, or an $\Omega(|X| + 1/\varepsilon^{2-o(1)})$ lower bound for the approximate solution with a reduction to 3SUM. A key technical result is a $\varepsilon$-approximate halfspace range counting data structure of size $O(1/\varepsilon^d)$ with $O(\log (1/\varepsilon))$ query time, which we can build in $O(|X| + (1/\varepsilon^d) \log^4 (1/\varepsilon))$ time.
    Ladder Polynomial Neural Networks. (arXiv:2106.13834v1 [cs.LG])
    (2 min) Polynomial functions have plenty of useful analytical properties, but they are rarely used as learning models because their function class is considered to be restricted. This work shows that when trained properly polynomial functions can be strong learning models. Particularly this work constructs polynomial feedforward neural networks using the product activation, a new activation function constructed from multiplications. The new neural network is a polynomial function and provides accurate control of its polynomial order. It can be trained by standard training techniques such as batch normalization and dropout. This new feedforward network covers several previous polynomial models as special cases. Compared with common feedforward neural networks, the polynomial feedforward network has closed-form calculations of a few interesting quantities, which are very useful in Bayesian learning. In a series of regression and classification tasks in the empirical study, the proposed model outperforms previous polynomial models.
    Unified Source-Filter GAN: Unified Source-filter Network Based On Factorization of Quasi-Periodic Parallel WaveGAN. (arXiv:2104.04668v3 [cs.SD] UPDATED)
    (2 min) We propose a unified approach to data-driven source-filter modeling using a single neural network for developing a neural vocoder capable of generating high-quality synthetic speech waveforms while retaining flexibility of the source-filter model to control their voice characteristics. Our proposed network called unified source-filter generative adversarial networks (uSFGAN) is developed by factorizing quasi-periodic parallel WaveGAN (QPPWG), one of the neural vocoders based on a single neural network, into a source excitation generation network and a vocal tract resonance filtering network by additionally implementing a regularization loss. Moreover, inspired by neural source filter (NSF), only a sinusoidal waveform is additionally used as the simplest clue to generate a periodic source excitation waveform while minimizing the effect of approximations in the source filter model. The experimental results demonstrate that uSFGAN outperforms conventional neural vocoders, such as QPPWG and NSF in both speech quality and pitch controllability.
    Assessment Modeling: Fundamental Pre-training Tasks for Interactive Educational Systems. (arXiv:2002.05505v6 [cs.LG] UPDATED)
    (2 min) Like many other domains in Artificial Intelligence (AI), there are specific tasks in the field of AI in Education (AIEd) for which labels are scarce and expensive, such as predicting exam score or review correctness. A common way of circumventing label-scarce problems is pre-training a model to learn representations of the contents of learning items. However, such methods fail to utilize the full range of student interaction data available and do not model student learning behavior. To this end, we propose Assessment Modeling, a class of fundamental pre-training tasks for general interactive educational systems. An assessment is a feature of student-system interactions which can serve as a pedagogical evaluation. Examples include the correctness and timeliness of a student's answer. Assessment Modeling is the prediction of assessments conditioned on the surrounding context of interactions. Although it is natural to pre-train on interactive features available in large amounts, limiting the prediction targets to assessments focuses the tasks' relevance to the label-scarce educational problems and reduces less-relevant noise. While the effectiveness of different combinations of assessments is open for exploration, we suggest Assessment Modeling as a first-order guiding principle for selecting proper pre-training tasks for label-scarce educational problems.
    Image Classification with CondenseNeXt for ARM-Based Computing Platforms. (arXiv:2106.14102v1 [cs.CV])
    (2 min) In this paper, we demonstrate the implementation of our ultra-efficient deep convolutional neural network architecture: CondenseNeXt on NXP BlueBox, an autonomous driving development platform developed for self-driving vehicles. We show that CondenseNeXt is remarkably efficient in terms of FLOPs, designed for ARM-based embedded computing platforms with limited computational resources and can perform image classification without the need of a CUDA enabled GPU. CondenseNeXt utilizes the state-of-the-art depthwise separable convolution and model compression techniques to achieve a remarkable computational efficiency. Extensive analyses are conducted on CIFAR-10, CIFAR-100 and ImageNet datasets to verify the performance of CondenseNeXt Convolutional Neural Network (CNN) architecture. It achieves state-of-the-art image classification performance on three benchmark datasets including CIFAR-10 (4.79% top-1 error), CIFAR-100 (21.98% top-1 error) and ImageNet (7.91% single model, single crop top-5 error). CondenseNeXt achieves final trained model size improvement of 2.9+ MB and up to 59.98% reduction in forward FLOPs compared to CondenseNet and can perform image classification on ARM-Based computing platforms without needing a CUDA enabled GPU support, with outstanding efficiency.
    RadGraph: Extracting Clinical Entities and Relations from Radiology Reports. (arXiv:2106.14463v1 [cs.CL])
    (2 min) Extracting structured clinical information from free-text radiology reports can enable the use of radiology report information for a variety of critical healthcare applications. In our work, we present RadGraph, a dataset of entities and relations in full-text chest X-ray radiology reports based on a novel information extraction schema we designed to structure radiology reports. We release a development dataset, which contains board-certified radiologist annotations for 500 radiology reports from the MIMIC-CXR dataset (14,579 entities and 10,889 relations), and a test dataset, which contains two independent sets of board-certified radiologist annotations for 100 radiology reports split equally across the MIMIC-CXR and CheXpert datasets. Using these datasets, we train and test a deep learning model, RadGraph Benchmark, that achieves a micro F1 of 0.82 and 0.73 on relation extraction on the MIMIC-CXR and CheXpert test sets respectively. Additionally, we release an inference dataset, which contains annotations automatically generated by RadGraph Benchmark across 220,763 MIMIC-CXR reports (around 6 million entities and 4 million relations) and 500 CheXpert reports (13,783 entities and 9,908 relations) with mappings to associated chest radiographs. Our freely available dataset can facilitate a wide range of research in medical natural language processing, as well as computer vision and multi-modal learning when linked to chest radiographs.
    Learning Gaussian Networks. (arXiv:1302.6808v3 [cs.AI] UPDATED)
    (2 min) We describe algorithms for learning Bayesian networks from a combination of user knowledge and statistical data. The algorithms have two components: a scoring metric and a search procedure. The scoring metric takes a network structure, statistical data, and a user's prior knowledge, and returns a score proportional to the posterior probability of the network structure given the data. The search procedure generates networks for evaluation by the scoring metric. Previous work has concentrated on metrics for domains containing only discrete variables, under the assumption that data represents a multinomial sample. In this paper, we extend this work, developing scoring metrics for domains containing all continuous variables or a mixture of discrete and continuous variables, under the assumption that continuous data is sampled from a multivariate normal distribution. Our work extends traditional statistical approaches for identifying vanishing regression coefficients in that we identify two important assumptions, called event equivalence and parameter modularity, that when combined allow the construction of prior distributions for multivariate normal parameters from a single prior Bayesian network specified by a user.
    Stabilizing Equilibrium Models by Jacobian Regularization. (arXiv:2106.14342v1 [cs.LG])
    (2 min) Deep equilibrium networks (DEQs) are a new class of models that eschews traditional depth in favor of finding the fixed point of a single nonlinear layer. These models have been shown to achieve performance competitive with the state-of-the-art deep networks while using significantly less memory. Yet they are also slower, brittle to architectural choices, and introduce potential instability to the model. In this paper, we propose a regularization scheme for DEQ models that explicitly regularizes the Jacobian of the fixed-point update equations to stabilize the learning of equilibrium models. We show that this regularization adds only minimal computational cost, significantly stabilizes the fixed-point convergence in both forward and backward passes, and scales well to high-dimensional, realistic domains (e.g., WikiText-103 language modeling and ImageNet classification). Using this method, we demonstrate, for the first time, an implicit-depth model that runs with approximately the same speed and level of performance as popular conventional deep networks such as ResNet-101, while still maintaining the constant memory footprint and architectural simplicity of DEQs. Code is available at https://github.com/locuslab/deq .
    Multi-task Over-the-Air Federated Learning: A Non-Orthogonal Transmission Approach. (arXiv:2106.14229v1 [cs.LG])
    (2 min) In this letter, we propose a multi-task over-theair federated learning (MOAFL) framework, where multiple learning tasks share edge devices for data collection and learning models under the coordination of a edge server (ES). Specially, the model updates for all the tasks are transmitted and superpositioned concurrently over a non-orthogonal uplink channel via over-the-air computation, and the aggregation results of all the tasks are reconstructed at the ES through an extended version of the turbo compressed sensing algorithm. Both the convergence analysis and numerical results demonstrate that the MOAFL framework can significantly reduce the uplink bandwidth consumption of multiple tasks without causing substantial learning performance degradation.
    Contextual Inverse Optimization: Offline and Online Learning. (arXiv:2106.14015v1 [cs.LG])
    (2 min) We study the problems of offline and online contextual optimization with feedback information, where instead of observing the loss, we observe, after-the-fact, the optimal action an oracle with full knowledge of the objective function would have taken. We aim to minimize regret, which is defined as the difference between our losses and the ones incurred by an all-knowing oracle. In the offline setting, the decision-maker has information available from past periods and needs to make one decision, while in the online setting, the decision-maker optimizes decisions dynamically over time based a new set of feasible actions and contextual functions in each period. For the offline setting, we characterize the optimal minimax policy, establishing the performance that can be achieved as a function of the underlying geometry of the information induced by the data. In the online setting, we leverage this geometric characterization to optimize the cumulative regret. We develop an algorithm that yields the first regret bound for this problem that is logarithmic in the time horizon.
    Last-iterate Convergence in Extensive-Form Games. (arXiv:2106.14326v1 [cs.LG])
    (2 min) Regret-based algorithms are highly efficient at finding approximate Nash equilibria in sequential games such as poker games. However, most regret-based algorithms, including counterfactual regret minimization (CFR) and its variants, rely on iterate averaging to achieve convergence. Inspired by recent advances on last-iterate convergence of optimistic algorithms in zero-sum normal-form games, we study this phenomenon in sequential games, and provide a comprehensive study of last-iterate convergence for zero-sum extensive-form games with perfect recall (EFGs), using various optimistic regret-minimization algorithms over treeplexes. This includes algorithms using the vanilla entropy or squared Euclidean norm regularizers, as well as their dilated versions which admit more efficient implementation. In contrast to CFR, we show that all of these algorithms enjoy last-iterate convergence, with some of them even converging exponentially fast. We also provide experiments to further support our theoretical results.
    Rationale-Inspired Natural Language Explanations with Commonsense. (arXiv:2106.13876v1 [cs.CL])
    (2 min) Explainable machine learning models primarily justify predicted labels using either extractive rationales (i.e., subsets of input features) or free-text natural language explanations (NLEs) as abstractive justifications. While NLEs can be more comprehensive than extractive rationales, machine-generated NLEs have been shown to sometimes lack commonsense knowledge. Here, we show that commonsense knowledge can act as a bridge between extractive rationales and NLEs, rendering both types of explanations better. More precisely, we introduce a unified framework, called RExC (Rationale-Inspired Explanations with Commonsense), that (1) extracts rationales as a set of features responsible for machine predictions, (2) expands the extractive rationales using available commonsense resources, and (3) uses the expanded knowledge to generate natural language explanations. Our framework surpasses by a large margin the previous state-of-the-art in generating NLEs across five tasks in both natural language processing and vision-language understanding, with human annotators consistently rating the explanations generated by RExC to be more comprehensive, grounded in commonsense, and overall preferred compared to previous state-of-the-art models. Moreover, our work shows that commonsense-grounded explanations can enhance both task performance and rationales extraction capabilities.

2021-06-28

  • cs.CL updates on arXiv.org

    Topic-Oriented Spoken Dialogue Summarization for Customer Service with Saliency-Aware Topic Modeling. (arXiv:2012.07311v2 [cs.CL] UPDATED)
    (2 min) In a customer service system, dialogue summarization can boost service efficiency by automatically creating summaries for long spoken dialogues in which customers and agents try to address issues about specific topics. In this work, we focus on topic-oriented dialogue summarization, which generates highly abstractive summaries that preserve the main ideas from dialogues. In spoken dialogues, abundant dialogue noise and common semantics could obscure the underlying informative content, making the general topic modeling approaches difficult to apply. In addition, for customer service, role-specific information matters and is an indispensable part of a summary. To effectively perform topic modeling on dialogues and capture multi-role information, in this work we propose a novel topic-augmented two-stage dialogue summarizer (TDS) jointly with a saliency-aware neural topic model (SATM) for topic-oriented summarization of customer service dialogues. Comprehensive studies on a real-world Chinese customer service dataset demonstrated the superiority of our method against several strong baselines.
    An Exploratory Analysis of the Relation Between Offensive Language and Mental Health. (arXiv:2105.14888v2 [cs.CL] UPDATED)
    (2 min) In this paper, we analyze the interplay between the use of offensive language and mental health. We acquired publicly available datasets created for offensive language identification and depression detection and we train computational models to compare the use of offensive language in social media posts written by groups of individuals with and without self-reported depression diagnosis. We also look at samples written by groups of individuals whose posts show signs of depression according to recent related studies. Our analysis indicates that offensive language is more frequently used in the samples written by individuals with self-reported depression as well as individuals showing signs of depression. The results discussed here open new avenues in research in politeness/offensiveness and mental health.
    Unsupervised Summarization for Chat Logs with Topic-Oriented Ranking and Context-Aware Auto-Encoders. (arXiv:2012.07300v2 [cs.CL] UPDATED)
    (2 min) Automatic chat summarization can help people quickly grasp important information from numerous chat messages. Unlike conventional documents, chat logs usually have fragmented and evolving topics. In addition, these logs contain a quantity of elliptical and interrogative sentences, which make the chat summarization highly context dependent. In this work, we propose a novel unsupervised framework called RankAE to perform chat summarization without employing manually labeled data. RankAE consists of a topic-oriented ranking strategy that selects topic utterances according to centrality and diversity simultaneously, as well as a denoising auto-encoder that is carefully designed to generate succinct but context-informative summaries based on the selected utterances. To evaluate the proposed method, we collect a large-scale dataset of chat logs from a customer service environment and build an annotated set only for model evaluation. Experimental results show that RankAE significantly outperforms other unsupervised methods and is able to generate high-quality summaries in terms of relevance and topic coverage.
    STYLER: Style Factor Modeling with Rapidity and Robustness via Speech Decomposition for Expressive and Controllable Neural Text to Speech. (arXiv:2103.09474v4 [eess.AS] UPDATED)
    (2 min) Previous works on neural text-to-speech (TTS) have been addressed on limited speed in training and inference time, robustness for difficult synthesis conditions, expressiveness, and controllability. Although several approaches resolve some limitations, there has been no attempt to solve all weaknesses at once. In this paper, we propose STYLER, an expressive and controllable TTS framework with high-speed and robust synthesis. Our novel audio-text aligning method called Mel Calibrator and excluding autoregressive decoding enable rapid training and inference and robust synthesis on unseen data. Also, disentangled style factor modeling under supervision enlarges the controllability in synthesizing process leading to expressive TTS. On top of it, a novel noise modeling pipeline using domain adversarial training and Residual Decoding empowers noise-robust style transfer, decomposing the noise without any additional label. Various experiments demonstrate that STYLER is more effective in speed and robustness than expressive TTS with autoregressive decoding and more expressive and controllable than reading style non-autoregressive TTS. Synthesis samples and experiment results are provided via our demo page, and code is available publicly.
    TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese. (arXiv:2005.05144v3 [eess.AS] UPDATED)
    (2 min) Speech provides a natural way for human-computer interaction. In particular, speech synthesis systems are popular in different applications, such as personal assistants, GPS applications, screen readers and accessibility tools. However, not all languages are on the same level when in terms of resources and systems for speech synthesis. This work consists of creating publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. Such dataset has 10.5 hours from a single speaker, from which a Tacotron 2 model with the RTISI-LA vocoder presented the best performance, achieving a 4.03 MOS value. The obtained results are comparable to related works covering English language and the state-of-the-art in Portuguese.
    Masked Proxy Loss For Text-Independent Speaker Verification. (arXiv:2011.04491v2 [cs.SD] UPDATED)
    (2 min) Open-set speaker recognition can be regarded as a metric learning problem, which is to maximize inter-class variance and minimize intra-class variance. Supervised metric learning can be categorized into entity-based learning and proxy-based learning. Most of the existing metric learning objectives like Contrastive, Triplet, Prototypical, GE2E, etc all belong to the former division, the performance of which is either highly dependent on sample mining strategy or restricted by insufficient label information in the mini-batch. Proxy-based losses mitigate both shortcomings, however, fine-grained connections among entities are either not or indirectly leveraged. This paper proposes a Masked Proxy (MP) loss which directly incorporates both proxy-based relationships and pair-based relationships. We further propose Multinomial Masked Proxy (MMP) loss to leverage the hardness of speaker pairs. These methods have been applied to evaluate on VoxCeleb test set and reach state-of-the-art Equal Error Rate(EER).
    Deep Encoder, Shallow Decoder: Reevaluating Non-autoregressive Machine Translation. (arXiv:2006.10369v4 [cs.CL] UPDATED)
    (2 min) Much recent effort has been invested in non-autoregressive neural machine translation, which appears to be an efficient alternative to state-of-the-art autoregressive machine translation on modern GPUs. In contrast to the latter, where generation is sequential, the former allows generation to be parallelized across target token positions. Some of the latest non-autoregressive models have achieved impressive translation quality-speed tradeoffs compared to autoregressive baselines. In this work, we reexamine this tradeoff and argue that autoregressive baselines can be substantially sped up without loss in accuracy. Specifically, we study autoregressive models with encoders and decoders of varied depths. Our extensive experiments show that given a sufficiently deep encoder, a single-layer autoregressive decoder can substantially outperform strong non-autoregressive models with comparable inference speed. We show that the speed disadvantage for autoregressive baselines compared to non-autoregressive methods has been overestimated in three aspects: suboptimal layer allocation, insufficient speed measurement, and lack of knowledge distillation. Our results establish a new protocol for future research toward fast, accurate machine translation. Our code is available at https://github.com/jungokasai/deep-shallow.
    Extended Parallel Corpus for Amharic-English Machine Translation. (arXiv:2104.03543v2 [cs.CL] UPDATED)
    (2 min) This paper describes the acquisition, preprocessing, segmentation, and alignment of an Amharic-English parallel corpus. It will be useful for machine translation of an under-resourced language, Amharic. The corpus is larger than previously compiled corpora; it is released for research purposes. We trained neural machine translation and phrase-based statistical machine translation models using the corpus. In the automatic evaluation, neural machine translation models outperform phrase-based statistical machine translation models.
    Analyzing the Source and Target Contributions to Predictions in Neural Machine Translation. (arXiv:2010.10907v3 [cs.CL] UPDATED)
    (2 min) In Neural Machine Translation (and, more generally, conditional language modeling), the generation of a target token is influenced by two types of context: the source and the prefix of the target sequence. While many attempts to understand the internal workings of NMT models have been made, none of them explicitly evaluates relative source and target contributions to a generation decision. We argue that this relative contribution can be evaluated by adopting a variant of Layerwise Relevance Propagation (LRP). Its underlying 'conservation principle' makes relevance propagation unique: differently from other methods, it evaluates not an abstract quantity reflecting token importance, but the proportion of each token's influence. We extend LRP to the Transformer and conduct an analysis of NMT models which explicitly evaluates the source and target relative contributions to the generation process. We analyze changes in these contributions when conditioning on different types of prefixes, when varying the training objective or the amount of training data, and during the training process. We find that models trained with more data tend to rely on source information more and to have more sharp token contributions; the training process is non-monotonic with several stages of different nature.
    Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. (arXiv:2007.15779v5 [cs.CL] UPDATED)
    (2 min) Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.
    Sentiment Progression based Searching and Indexing of Literary Textual Artefacts. (arXiv:2106.13767v1 [cs.IR])
    (2 min) Literary artefacts are generally indexed and searched based on titles, meta data and keywords over the years. This searching and indexing works well when user/reader already knows about that particular creative textual artefact or document. This indexing and search hardly takes into account interest and emotional makeup of readers and its mapping to books. When a person is looking for a literary textual artefact, he/she might be looking for not only information but also to seek the joy of reading. In case of literary artefacts, progression of emotions across the key events could prove to be the key for indexing and searching. In this paper, we establish clusters among literary artefacts based on computational relationships among sentiment progressions using intelligent text analysis. We have created a database of 1076 English titles + 20 Marathi titles and also used database this http URL with 16559 titles and their summaries. We have proposed Sentiment Progression based Indexing for searching and recommending books. This can be used to create personalized clusters of book titles of interest to readers. The analysis clearly suggests better searching and indexing when we are targeting book lovers looking for a particular type of book or creative artefact. This indexing and searching can find many real-life applications for recommending books.
    How to marry a star: probabilistic constraints for meaning in context. (arXiv:2009.07936v2 [cs.CL] UPDATED)
    (2 min) In this paper, we derive a notion of 'word meaning in context' which accounts for the wide range of lexical shifts and ambiguities observed in utterance interpretation. We characterize the lexical comprehension process as a combination of cognitive semantics and Discourse Representation Theory, formalized as a 'situation description system': a probabilistic model which takes utterance understanding to be the mental process of describing one or more situations that would account for an observed utterance. Our model uses insights from different types of generative models to capture the interplay of local and global contexts and their joint influence upon the lexical representation of sentence constituents. We implement the system using a directed graphical model, and apply it to examples containing various contextualisation phenomena.
    Modeling Task Effects in Human Reading with Neural Attention. (arXiv:1808.00054v3 [cs.CL] UPDATED)
    (2 min) Humans read by making a sequence of fixations and saccades. They often skip words, without apparent detriment to understanding. We offer a novel explanation for skipping: readers optimize a tradeoff between performing a language-related task and fixating as few words as possible. We propose a neural architecture that combines an attention module (deciding whether to skip words) and a task module (memorizing the input). We show that our model predicts human skipping behavior, while also modeling reading times well, even though it skips 40% of the input. A key prediction of our model is that different reading tasks should result in different skipping behaviors. We confirm this prediction in an eye-tracking experiment in which participants answers questions about a text. We are able to capture these experimental results using the our model, replacing the memorization module with a task module that performs neural question answering.
    Privileged Zero-Shot AutoML. (arXiv:2106.13743v1 [cs.LG])
    (2 min) This work improves the quality of automated machine learning (AutoML) systems by using dataset and function descriptions while significantly decreasing computation time from minutes to milliseconds by using a zero-shot approach. Given a new dataset and a well-defined machine learning task, humans begin by reading a description of the dataset and documentation for the algorithms to be used. This work is the first to use these textual descriptions, which we call privileged information, for AutoML. We use a pre-trained Transformer model to process the privileged text and demonstrate that using this information improves AutoML performance. Thus, our approach leverages the progress of unsupervised representation learning in natural language processing to provide a significant boost to AutoML. We demonstrate that using only textual descriptions of the data and functions achieves reasonable classification performance, and adding textual descriptions to data meta-features improves classification across tabular datasets. To achieve zero-shot AutoML we train a graph neural network with these description embeddings and the data meta-features. Each node represents a training dataset, which we use to predict the best machine learning pipeline for a new test dataset in a zero-shot fashion. Our zero-shot approach rapidly predicts a high-quality pipeline for a supervised learning task and dataset. In contrast, most AutoML systems require tens or hundreds of pipeline evaluations. We show that zero-shot AutoML reduces running and prediction times from minutes to milliseconds, consistently across datasets. By speeding up AutoML by orders of magnitude this work demonstrates real-time AutoML.
    DeltaLM: Encoder-Decoder Pre-training for Language Generation and Translation by Augmenting Pretrained Multilingual Encoders. (arXiv:2106.13736v1 [cs.CL])
    (2 min) While pretrained encoders have achieved success in various natural language understanding (NLU) tasks, there is a gap between these pretrained encoders and natural language generation (NLG). NLG tasks are often based on the encoder-decoder framework, where the pretrained encoders can only benefit part of it. To reduce this gap, we introduce DeltaLM, a pretrained multilingual encoder-decoder model that regards the decoder as the task layer of off-the-shelf pretrained encoders. Specifically, we augment the pretrained multilingual encoder with a decoder and pre-train it in a self-supervised way. To take advantage of both the large-scale monolingual data and bilingual data, we adopt the span corruption and translation span corruption as the pre-training tasks. Experiments show that DeltaLM outperforms various strong baselines on both natural language generation and translation tasks, including machine translation, abstractive text summarization, data-to-text, and question generation.
    Manually Annotated Spelling Error Corpus for Amharic. (arXiv:2106.13521v1 [cs.CL])
    (2 min) This paper presents a manually annotated spelling error corpus for Amharic, lingua franca in Ethiopia. The corpus is designed to be used for the evaluation of spelling error detection and correction. The misspellings are tagged as non-word and real-word errors. In addition, the contextual information available in the corpus makes it useful in dealing with both types of spelling errors.
    Multimodal Emergent Fake News Detection via Meta Neural Process Networks. (arXiv:2106.13711v1 [cs.IR])
    (2 min) Fake news travels at unprecedented speeds, reaches global audiences and puts users and communities at great risk via social media platforms. Deep learning based models show good performance when trained on large amounts of labeled data on events of interest, whereas the performance of models tends to degrade on other events due to domain shift. Therefore, significant challenges are posed for existing detection approaches to detect fake news on emergent events, where large-scale labeled datasets are difficult to obtain. Moreover, adding the knowledge from newly emergent events requires to build a new model from scratch or continue to fine-tune the model, which can be challenging, expensive, and unrealistic for real-world settings. In order to address those challenges, we propose an end-to-end fake news detection framework named MetaFEND, which is able to learn quickly to detect fake news on emergent events with a few verified posts. Specifically, the proposed model integrates meta-learning and neural process methods together to enjoy the benefits of these approaches. In particular, a label embedding module and a hard attention mechanism are proposed to enhance the effectiveness by handling categorical information and trimming irrelevant posts. Extensive experiments are conducted on multimedia datasets collected from Twitter and Weibo. The experimental results show our proposed MetaFEND model can detect fake news on never-seen events effectively and outperform the state-of-the-art methods.
    Learning to Sample Replacements for ELECTRA Pre-Training. (arXiv:2106.13715v1 [cs.CL])
    (2 min) ELECTRA pretrains a discriminator to detect replaced tokens, where the replacements are sampled from a generator trained with masked language modeling. Despite the compelling performance, ELECTRA suffers from the following two issues. First, there is no direct feedback loop from discriminator to generator, which renders replacement sampling inefficient. Second, the generator's prediction tends to be over-confident along with training, making replacements biased to correct tokens. In this paper, we propose two methods to improve replacement sampling for ELECTRA pre-training. Specifically, we augment sampling with a hardness prediction mechanism, so that the generator can encourage the discriminator to learn what it has not acquired. We also prove that efficient sampling reduces the training variance of the discriminator. Moreover, we propose to use a focal loss for the generator in order to relieve oversampling of correct tokens as replacements. Experimental results show that our method improves ELECTRA pre-training on various downstream tasks.
    Fine-grained Geolocation Prediction of Tweets with Human Machine Collaboration. (arXiv:2106.13411v1 [cs.LG])
    (2 min) Twitter is a useful resource to analyze peoples' opinions on various topics. Often these topics are correlated or associated with locations from where these Tweet posts are made. For example, restaurant owners may need to know where their target customers eat with respect to the sentiment of the posts made related to food, policy planners may need to analyze citizens' opinion on relevant issues such as crime, safety, congestion, etc. with respect to specific parts of the city, or county or state. As promising as this is, less than $1\%$ of the crawled Tweet posts come with geolocation tags. That makes accurate prediction of Tweet posts for the non geo-tagged tweets very critical to analyze data in various domains. In this research, we utilized millions of Twitter posts and end-users domain expertise to build a set of deep neural network models using natural language processing (NLP) techniques, that predicts the geolocation of non geo-tagged Tweet posts at various level of granularities such as neighborhood, zipcode, and longitude with latitudes. With multiple neural architecture experiments, and a collaborative human-machine workflow design, our ongoing work on geolocation detection shows promising results that empower end-users to correlate relationship between variables of choice with the location information.
    Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature. (arXiv:2106.13375v1 [cs.IR])
    (2 min) Information overload is a prevalent challenge in many high-value domains. A prominent case in point is the explosion of the biomedical literature on COVID-19, which swelled to hundreds of thousands of papers in a matter of months. In general, biomedical literature expands by two papers every minute, totalling over a million new papers every year. Search in the biomedical realm, and many other vertical domains is challenging due to the scarcity of direct supervision from click logs. Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck. We propose a general approach for vertical search based on domain-specific pretraining and present a case study for the biomedical domain. Despite being substantially simpler and not using any relevance labels for training or development, our method performs comparably or better than the best systems in the official TREC-COVID evaluation, a COVID-related biomedical search competition. Using distributed computing in modern cloud infrastructure, our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search, a new search experience for biomedical literature: https://aka.ms/biomedsearch.
    VOGUE: Answer Verbalization through Multi-Task Learning. (arXiv:2106.13316v1 [cs.CL])
    (2 min) In recent years, there have been significant developments in Question Answering over Knowledge Graphs (KGQA). Despite all the notable advancements, current KGQA systems only focus on answer generation techniques and not on answer verbalization. However, in real-world scenarios (e.g., voice assistants such as Alexa, Siri, etc.), users prefer verbalized answers instead of a generated response. This paper addresses the task of answer verbalization for (complex) question answering over knowledge graphs. In this context, we propose a multi-task-based answer verbalization framework: VOGUE (Verbalization thrOuGh mUlti-task lEarning). The VOGUE framework attempts to generate a verbalized answer using a hybrid approach through a multi-task learning paradigm. Our framework can generate results based on using questions and queries as inputs concurrently. VOGUE comprises four modules that are trained simultaneously through multi-task learning. We evaluate our framework on existing datasets for answer verbalization, and it outperforms all current baselines on both BLEU and METEOR scores.
    Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models. (arXiv:2106.13353v1 [cs.CL])
    (2 min) Prompting language models (LMs) with training examples and task descriptions has been seen as critical to recent successes in few-shot learning. In this work, we show that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering. In fact, one can use null prompts, prompts that contain neither task-specific templates nor training examples, and achieve competitive accuracy to manually-tuned prompts across a wide range of tasks. While finetuning LMs does introduce new parameters for each downstream task, we show that this memory overhead can be substantially reduced: finetuning only the bias terms can achieve comparable or better accuracy than standard finetuning while only updating 0.1% of the parameters. All in all, we recommend finetuning LMs for few-shot learning as it is more accurate, robust to different prompts, and can be made nearly as efficient as using frozen LMs.
    ParaLaw Nets -- Cross-lingual Sentence-level Pretraining for Legal Text Processing. (arXiv:2106.13403v1 [cs.CL])
    (2 min) Ambiguity is a characteristic of natural language, which makes expression ideas flexible. However, in a domain that requires accurate statements, it becomes a barrier. Specifically, a single word can have many meanings and multiple words can have the same meaning. When translating a text into a foreign language, the translator needs to determine the exact meaning of each element in the original sentence to produce the correct translation sentence. From that observation, in this paper, we propose ParaLaw Nets, a pretrained model family using sentence-level cross-lingual information to reduce ambiguity and increase the performance in legal text processing. This approach achieved the best result in the Question Answering task of COLIEE-2021.
    Exploring the Representation of Word Meanings in Context: A Case Study on Homonymy and Synonymy. (arXiv:2106.13553v1 [cs.CL])
    (2 min) This paper presents a multilingual study of word meaning representations in context. We assess the ability of both static and contextualized models to adequately represent different lexical-semantic relations, such as homonymy and synonymy. To do so, we created a new multilingual dataset that allows us to perform a controlled evaluation of several factors such as the impact of the surrounding context or the overlap between words, conveying the same or different senses. A systematic assessment on four scenarios shows that the best monolingual models based on Transformers can adequately disambiguate homonyms in context. However, as they rely heavily on context, these models fail at representing words with different senses when occurring in similar sentences. Experiments are performed in Galician, Portuguese, English, and Spanish, and both the dataset (with more than 3,000 evaluation items) and new models are freely released with this study.
    A Source-Criticism Debiasing Method for GloVe Embeddings. (arXiv:2106.13382v1 [cs.CL])
    (2 min) It is well-documented that word embeddings trained on large public corpora consistently exhibit known human social biases. Although many methods for debiasing exist, almost all fixate on completely eliminating biased information from the embeddings and often diminish training set size in the process. In this paper, we present a simple yet effective method for debiasing GloVe word embeddings (Pennington et al., 2014) which works by incorporating explicit information about training set bias rather than removing biased data outright. Our method runs quickly and efficiently with the help of a fast bias gradient approximation method from Brunet et al. (2019). As our approach is akin to the notion of 'source criticism' in the humanities, we term our method Source-Critical GloVe (SC-GloVe). We show that SC-GloVe reduces the effect size on Word Embedding Association Test (WEAT) sets without sacrificing training data or TOP-1 performance.
    Adapt-and-Distill: Developing Small, Fast and Effective Pretrained Language Models for Domains. (arXiv:2106.13474v1 [cs.CL])
    (2 min) Large pre-trained models have achieved great success in many natural language processing tasks. However, when they are applied in specific domains, these models suffer from domain shift and bring challenges in fine-tuning and online serving for latency and capacity constraints. In this paper, we present a general approach to developing small, fast and effective pre-trained models for specific domains. This is achieved by adapting the off-the-shelf general pre-trained models and performing task-agnostic knowledge distillation in target domains. Specifically, we propose domain-specific vocabulary expansion in the adaptation stage and employ corpus level occurrence probability to choose the size of incremental vocabulary automatically. Then we systematically explore different strategies to compress the large pre-trained models for specific domains. We conduct our experiments in the biomedical and computer science domain. The experimental results demonstrate that our approach achieves better performance over the BERT BASE model in domain-specific tasks while 3.3x smaller and 5.1x faster than BERT BASE. The code and pre-trained models are available at https://aka.ms/adalm.
    JNLP Team: Deep Learning Approaches for Legal Processing Tasks in COLIEE 2021. (arXiv:2106.13405v1 [cs.CL])
    (2 min) COLIEE is an annual competition in automatic computerized legal text processing. Automatic legal document processing is an ambitious goal, and the structure and semantics of the law are often far more complex than everyday language. In this article, we survey and report our methods and experimental results in using deep learning in legal document processing. The results show the difficulties as well as potentials in this family of approaches.
    Language Models are Good Translators. (arXiv:2106.13627v1 [cs.CL])
    (2 min) Recent years have witnessed the rapid advance in neural machine translation (NMT), the core of which lies in the encoder-decoder architecture. Inspired by the recent progress of large-scale pre-trained language models on machine translation in a limited scenario, we firstly demonstrate that a single language model (LM4MT) can achieve comparable performance with strong encoder-decoder NMT models on standard machine translation benchmarks, using the same training data and similar amount of model parameters. LM4MT can also easily utilize source-side texts as additional supervision. Though modeling the source- and target-language texts with the same mechanism, LM4MT can provide unified representations for both source and target sentences, which can better transfer knowledge across languages. Extensive experiments on pivot-based and zero-shot translation tasks show that LM4MT can outperform the encoder-decoder NMT model by a large margin.
    Preliminary study on using vector quantization latent spaces for TTS/VC systems with consistent performance. (arXiv:2106.13479v1 [cs.SD])
    (2 min) Generally speaking, the main objective when training a neural speech synthesis system is to synthesize natural and expressive speech from the output layer of the neural network without much attention given to the hidden layers. However, by learning useful latent representation, the system can be used for many more practical scenarios. In this paper, we investigate the use of quantized vectors to model the latent linguistic embedding and compare it with the continuous counterpart. By enforcing different policies over the latent spaces in the training, we are able to obtain a latent linguistic embedding that takes on different properties while having a similar performance in terms of quality and speaker similarity. Our experiments show that the voice cloning system built with vector quantization has only a small degradation in terms of perceptive evaluations, but has a discrete latent space that is useful for reducing the representation bit-rate, which is desirable for data transferring, or limiting the information leaking, which is important for speaker anonymization and other tasks of that nature.
    byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings. (arXiv:2106.13302v1 [cs.CL])
    (2 min) This article introduces byteSteady -- a fast model for classification using byte-level n-gram embeddings. byteSteady assumes that each input comes as a sequence of bytes. A representation vector is produced using the averaged embedding vectors of byte-level n-grams, with a pre-defined set of n. The hashing trick is used to reduce the number of embedding vectors. This input representation vector is then fed into a linear classifier. A straightforward application of byteSteady is text classification. We also apply byteSteady to one type of non-language data -- DNA sequences for gene classification. For both problems we achieved competitive classification results against strong baselines, suggesting that byteSteady can be applied to both language and non-language data. Furthermore, we find that simple compression using Huffman coding does not significantly impact the results, which offers an accuracy-speed trade-off previously unexplored in machine learning.
  • cs.CV updates on arXiv.org

    Diversifying Semantic Image Synthesis and Editing via Class- and Layer-wise VAEs. (arXiv:2106.13416v1 [cs.CV])
    (2 min) Semantic image synthesis is a process for generating photorealistic images from a single semantic mask. To enrich the diversity of multimodal image synthesis, previous methods have controlled the global appearance of an output image by learning a single latent space. However, a single latent code is often insufficient for capturing various object styles because object appearance depends on multiple factors. To handle individual factors that determine object styles, we propose a class- and layer-wise extension to the variational autoencoder (VAE) framework that allows flexible control over each object class at the local to global levels by learning multiple latent spaces. Furthermore, we demonstrate that our method generates images that are both plausible and more diverse compared to state-of-the-art methods via extensive experiments with real and synthetic datasets inthree different domains. We also show that our method enables a wide range of applications in image synthesis and editing tasks.
    A Novel Self-Learning Framework for Bladder Cancer Grading Using Histopathological Images. (arXiv:2106.13559v1 [eess.IV])
    (2 min) Recently, bladder cancer has been significantly increased in terms of incidence and mortality. Currently, two subtypes are known based on tumour growth: non-muscle invasive (NMIBC) and muscle-invasive bladder cancer (MIBC). In this work, we focus on the MIBC subtype because it is of the worst prognosis and can spread to adjacent organs. We present a self-learning framework to grade bladder cancer from histological images stained via immunohistochemical techniques. Specifically, we propose a novel Deep Convolutional Embedded Attention Clustering (DCEAC) which allows classifying histological patches into different severity levels of the disease, according to the patterns established in the literature. The proposed DCEAC model follows a two-step fully unsupervised learning methodology to discern between non-tumour, mild and infiltrative patterns from high-resolution samples of 512x512 pixels. Our system outperforms previous clustering-based methods by including a convolutional attention module, which allows refining the features of the latent space before the classification stage. The proposed network exceeds state-of-the-art approaches by 2-3% across different metrics, achieving a final average accuracy of 0.9034 in a multi-class scenario. Furthermore, the reported class activation maps evidence that our model is able to learn by itself the same patterns that clinicians consider relevant, without incurring prior annotation steps. This fact supposes a breakthrough in muscle-invasive bladder cancer grading which bridges the gap with respect to train the model on labelled data.
    Partially fake it till you make it: mixing real and fake thermal images for improved object detection. (arXiv:2106.13603v1 [cs.CV])
    (2 min) In this paper we propose a novel data augmentation approach for visual content domains that have scarce training datasets, compositing synthetic 3D objects within real scenes. We show the performance of the proposed system in the context of object detection in thermal videos, a domain where 1) training datasets are very limited compared to visible spectrum datasets and 2) creating full realistic synthetic scenes is extremely cumbersome and expensive due to the difficulty in modeling the thermal properties of the materials of the scene. We compare different augmentation strategies, including state of the art approaches obtained through RL techniques, the injection of simulated data and the employment of a generative model, and study how to best combine our proposed augmentation with these other techniques.Experimental results demonstrate the effectiveness of our approach, and our single-modality detector achieves state-of-the-art results on the FLIR ADAS dataset.
    Bayesian Eye Tracking. (arXiv:2106.13387v1 [cs.CV])
    (2 min) Model-based eye tracking has been a dominant approach for eye gaze tracking because of its ability to generalize to different subjects, without the need of any training data and eye gaze annotations. Model-based eye tracking, however, is susceptible to eye feature detection errors, in particular for eye tracking in the wild. To address this issue, we propose a Bayesian framework for model-based eye tracking. The proposed system consists of a cascade-Bayesian Convolutional Neural Network (c-BCNN) to capture the probabilistic relationships between eye appearance and its landmarks, and a geometric eye model to estimate eye gaze from the eye landmarks. Given a testing eye image, the Bayesian framework can generate, through Bayesian inference, the eye gaze distribution without explicit landmark detection and model training, based on which it not only estimates the most likely eye gaze but also its uncertainty. Furthermore, with Bayesian inference instead of point-based inference, our model can not only generalize better to different sub-jects, head poses, and environments but also is robust to image noise and landmark detection errors. Finally, with the estimated gaze uncertainty, we can construct a cascade architecture that allows us to progressively improve gaze estimation accuracy. Compared to state-of-the-art model-based and learning-based methods, the proposed Bayesian framework demonstrates significant improvement in generalization capability across several benchmark datasets and in accuracy and robustness under challenging real-world conditions.
    "Zero Shot" Point Cloud Upsampling. (arXiv:2106.13765v1 [cs.CV])
    (0 min) Point cloud upsampling using deep learning has been paid various efforts in the past few years. Recent supervised deep learning methods are restricted to the size of training data and is limited in terms of covering all shapes of point clouds. Besides, the acquisition of such amount of data is unrealistic, and the network generally performs less powerful than expected on unseen records. In this paper, we present an unsupervised approach to upsample point clouds internally referred as "Zero Shot" Point Cloud Upsampling (ZSPU) at holistic level. Our approach is solely based on the internal information provided by a particular point cloud without patching in both self-training and testing phases. This single-stream design significantly reduces the training time of the upsampling task, by learning the relation between low-resolution (LR) point clouds and their high (original) resolution (HR) counterparts. This association will provide super-resolution (SR) outputs when original point clouds are loaded as input. We demonstrate competitive performance on benchmark point cloud datasets when compared to other upsampling methods. Furthermore, ZSPU achieves superior qualitative results on shapes with complex local details or high curvatures.
    Hierarchical Object-oriented Spatio-Temporal Reasoning for Video Question Answering. (arXiv:2106.13432v1 [cs.CV])
    (2 min) Video Question Answering (Video QA) is a powerful testbed to develop new AI capabilities. This task necessitates learning to reason about objects, relations, and events across visual and linguistic domains in space-time. High-level reasoning demands lifting from associative visual pattern recognition to symbol-like manipulation over objects, their behavior and interactions. Toward reaching this goal we propose an object-oriented reasoning approach in that video is abstracted as a dynamic stream of interacting objects. At each stage of the video event flow, these objects interact with each other, and their interactions are reasoned about with respect to the query and under the overall context of a video. This mechanism is materialized into a family of general-purpose neural units and their multi-level architecture called Hierarchical Object-oriented Spatio-Temporal Reasoning (HOSTR) networks. This neural model maintains the objects' consistent lifelines in the form of a hierarchically nested spatio-temporal graph. Within this graph, the dynamic interactive object-oriented representations are built up along the video sequence, hierarchically abstracted in a bottom-up manner, and converge toward the key information for the correct answer. The method is evaluated on multiple major Video QA datasets and establishes new state-of-the-arts in these tasks. Analysis into the model's behavior indicates that object-oriented reasoning is a reliable, interpretable and efficient approach to Video QA.
    Probing Inter-modality: Visual Parsing with Self-Attention for Vision-Language Pre-training. (arXiv:2106.13488v1 [cs.CV])
    (2 min) Vision-Language Pre-training (VLP) aims to learn multi-modal representations from image-text pairs and serves for downstream vision-language tasks in a fine-tuning fashion. The dominant VLP models adopt a CNN-Transformer architecture, which embeds images with a CNN, and then aligns images and text with a Transformer. Visual relationship between visual contents plays an important role in image understanding and is the basic for inter-modal alignment learning. However, CNNs have limitations in visual relation learning due to local receptive field's weakness in modeling long-range dependencies. Thus the two objectives of learning visual relation and inter-modal alignment are encapsulated in the same Transformer network. Such design might restrict the inter-modal alignment learning in the Transformer by ignoring the specialized characteristic of each objective. To tackle this, we propose a fully Transformer visual embedding for VLP to better learn visual relation and further promote inter-modal alignment. Specifically, we propose a metric named Inter-Modality Flow (IMF) to measure the interaction between vision and language modalities (i.e., inter-modality). We also design a novel masking optimization mechanism named Masked Feature Regression (MFR) in Transformer to further promote the inter-modality learning. To the best of our knowledge, this is the first study to explore the benefit of Transformer for visual feature learning in VLP. We verify our method on a wide range of vision-language tasks, including Visual Question Answering (VQA), Visual Entailment and Visual Reasoning. Our approach not only outperforms the state-of-the-art VLP performance, but also shows benefits on the IMF metric.
    NP-DRAW: A Non-Parametric Structured Latent Variable Modelfor Image Generation. (arXiv:2106.13435v1 [cs.CV])
    (2 min) In this paper, we present a non-parametric structured latent variable model for image generation, called NP-DRAW, which sequentially draws on a latent canvas in a part-by-part fashion and then decodes the image from the canvas. Our key contributions are as follows. 1) We propose a non-parametric prior distribution over the appearance of image parts so that the latent variable ``what-to-draw'' per step becomes a categorical random variable. This improves the expressiveness and greatly eases the learning compared to Gaussians used in the literature. 2) We model the sequential dependency structure of parts via a Transformer, which is more powerful and easier to train compared to RNNs used in the literature. 3) We propose an effective heuristic parsing algorithm to pre-train the prior. Experiments on MNIST, Omniglot, CIFAR-10, and CelebA show that our method significantly outperforms previous structured image models like DRAW and AIR and is competitive to other generic generative models. Moreover, we show that our model's inherent compositionality and interpretability bring significant benefits in the low-data learning regime and latent space editing. Code is available at \url{https://github.com/ZENGXH/NPDRAW}.
    FOVQA: Blind Foveated Video Quality Assessment. (arXiv:2106.13328v1 [eess.IV])
    (2 min) Previous blind or No Reference (NR) video quality assessment (VQA) models largely rely on features drawn from natural scene statistics (NSS), but under the assumption that the image statistics are stationary in the spatial domain. Several of these models are quite successful on standard pictures. However, in Virtual Reality (VR) applications, foveated video compression is regaining attention, and the concept of space-variant quality assessment is of interest, given the availability of increasingly high spatial and temporal resolution contents and practical ways of measuring gaze direction. Distortions from foveated video compression increase with increased eccentricity, implying that the natural scene statistics are space-variant. Towards advancing the development of foveated compression / streaming algorithms, we have devised a no-reference (NR) foveated video quality assessment model, called FOVQA, which is based on new models of space-variant natural scene statistics (NSS) and natural video statistics (NVS). Specifically, we deploy a space-variant generalized Gaussian distribution (SV-GGD) model and a space-variant asynchronous generalized Gaussian distribution (SV-AGGD) model of mean subtracted contrast normalized (MSCN) coefficients and products of neighboring MSCN coefficients, respectively. We devise a foveated video quality predictor that extracts radial basis features, and other features that capture perceptually annoying rapid quality fall-offs. We find that FOVQA achieves state-of-the-art (SOTA) performance on the new 2D LIVE-FBT-FCVR database, as compared with other leading FIQA / VQA models. we have made our implementation of FOVQA available at: this http URL
    To the Point: Efficient 3D Object Detection in the Range Image with Graph Convolution Kernels. (arXiv:2106.13381v1 [cs.CV])
    (2 min) 3D object detection is vital for many robotics applications. For tasks where a 2D perspective range image exists, we propose to learn a 3D representation directly from this range image view. To this end, we designed a 2D convolutional network architecture that carries the 3D spherical coordinates of each pixel throughout the network. Its layers can consume any arbitrary convolution kernel in place of the default inner product kernel and exploit the underlying local geometry around each pixel. We outline four such kernels: a dense kernel according to the bag-of-words paradigm, and three graph kernels inspired by recent graph neural network advances: the Transformer, the PointNet, and the Edge Convolution. We also explore cross-modality fusion with the camera image, facilitated by operating in the perspective range image view. Our method performs competitively on the Waymo Open Dataset and improves the state-of-the-art AP for pedestrian detection from 69.7% to 75.5%. It is also efficient in that our smallest model, which still outperforms the popular PointPillars in quality, requires 180 times fewer FLOPS and model parameters
    Optimal Pose and Shape Estimation for Category-level 3D Object Perception. (arXiv:2104.08383v3 [cs.CV] UPDATED)
    (0 min) We consider a category-level perception problem, where one is given 3D sensor data picturing an object of a given category (e.g. a car), and has to reconstruct the pose and shape of the object despite intra-class variability (i.e. different car models have different shapes). We consider an active shape model, where -- for an object category -- we are given a library of potential CAD models describing objects in that category, and we adopt a standard formulation where pose and shape estimation are formulated as a non-convex optimization. Our first contribution is to provide the first certifiably optimal solver for pose and shape estimation. In particular, we show that rotation estimation can be decoupled from the estimation of the object translation and shape, and we demonstrate that (i) the optimal object rotation can be computed via a tight (small-size) semidefinite relaxation, and (ii) the translation and shape parameters can be computed in closed-form given the rotation. Our second contribution is to add an outlier rejection layer to our solver, hence making it robust to a large number of misdetections. Towards this goal, we wrap our optimal solver in a robust estimation scheme based on graduated non-convexity. To further enhance robustness to outliers, we also develop the first graph-theoretic formulation to prune outliers in category-level perception, which removes outliers via convex hull and maximum clique computations; the resulting approach is robust to 70%-90% outliers. Our third contribution is an extensive experimental evaluation. Besides providing an ablation study on a simulated dataset and on the PASCAL3D+ dataset, we combine our solver with a deep-learned keypoint detector, and show that the resulting approach improves over the state of the art in vehicle pose estimation in the ApolloScape datasets.
    RSN: Range Sparse Net for Efficient, Accurate LiDAR 3D Object Detection. (arXiv:2106.13365v1 [cs.CV])
    (2 min) The detection of 3D objects from LiDAR data is a critical component in most autonomous driving systems. Safe, high speed driving needs larger detection ranges, which are enabled by new LiDARs. These larger detection ranges require more efficient and accurate detection models. Towards this goal, we propose Range Sparse Net (RSN), a simple, efficient, and accurate 3D object detector in order to tackle real time 3D object detection in this extended detection regime. RSN predicts foreground points from range images and applies sparse convolutions on the selected foreground points to detect objects. The lightweight 2D convolutions on dense range images results in significantly fewer selected foreground points, thus enabling the later sparse convolutions in RSN to efficiently operate. Combining features from the range image further enhance detection accuracy. RSN runs at more than 60 frames per second on a 150m x 150m detection region on Waymo Open Dataset (WOD) while being more accurate than previously published detectors. As of 11/2020, RSN is ranked first in the WOD leaderboard based on the APH/LEVEL 1 metrics for LiDAR-based pedestrian and vehicle detection, while being several times faster than alternatives.
    PVTv2: Improved Baselines with Pyramid Vision Transformer. (arXiv:2106.13797v1 [cs.CV])
    (0 min) Transformer in computer vision has recently shown encouraging progress. In this work, we improve the original Pyramid Vision Transformer (PVTv1) by adding three improvement designs, which include (1) locally continuous features with convolutions, (2) position encodings with zero paddings, and (3) linear complexity attention layers with average pooling. With these simple modifications, our PVTv2 significantly improves PVTv1 on classification, detection, and segmentation. Moreover, PVTv2 achieves much better performance than recent works, including Swin Transformer, under ImageNet-1K pre-training. We hope this work will make state-of-the-art vision Transformer research more accessible. Code is available at https://github.com/whai362/PVT .
    Masksembles for Uncertainty Estimation. (arXiv:2012.08334v2 [cs.LG] UPDATED)
    (0 min) Deep neural networks have amply demonstrated their prowess but estimating the reliability of their predictions remains challenging. Deep Ensembles are widely considered as being one of the best methods for generating uncertainty estimates but are very expensive to train and evaluate. MC-Dropout is another popular alternative, which is less expensive, but also less reliable. Our central intuition is that there is a continuous spectrum of ensemble-like models of which MC-Dropout and Deep Ensembles are extreme examples. The first uses an effectively infinite number of highly correlated models while the second relies on a finite number of independent models. To combine the benefits of both, we introduce Masksembles. Instead of randomly dropping parts of the network as in MC-dropout, Masksemble relies on a fixed number of binary masks, which are parameterized in a way that allows to change correlations between individual models. Namely, by controlling the overlap between the masks and their density one can choose the optimal configuration for the task at hand. This leads to a simple and easy to implement method with performance on par with Ensembles at a fraction of the cost. We experimentally validate Masksembles on two widely used datasets, CIFAR10 and ImageNet.
    Prior Image-Constrained Reconstruction using Style-Based Generative Models. (arXiv:2102.12525v2 [eess.IV] CROSS LISTED)
    (0 min) Obtaining a useful estimate of an object from highly incomplete imaging measurements remains a holy grail of imaging science. Deep learning methods have shown promise in learning object priors or constraints to improve the conditioning of an ill-posed imaging inverse problem. In this study, a framework for estimating an object of interest that is semantically related to a known prior image, is proposed. An optimization problem is formulated in the disentangled latent space of a style-based generative model, and semantically meaningful constraints are imposed using the disentangled latent representation of the prior image. Stable recovery from incomplete measurements with the help of a prior image is theoretically analyzed. Numerical experiments demonstrating the superior performance of our approach as compared to related methods are presented.
    Model-based multi-parameter mapping. (arXiv:2102.01604v2 [cs.CV] UPDATED)
    (0 min) Quantitative MR imaging is increasingly favoured for its richer information content and standardised measures. However, computing quantitative parameter maps, such as those encoding longitudinal relaxation rate (R1), apparent transverse relaxation rate (R2*) or magnetisation-transfer saturation (MTsat), involves inverting a highly non-linear function. Many methods for deriving parameter maps assume perfect measurements and do not consider how noise is propagated through the estimation procedure, resulting in needlessly noisy maps. Instead, we propose a probabilistic generative (forward) model of the entire dataset, which is formulated and inverted to jointly recover (log) parameter maps with a well-defined probabilistic interpretation (e.g., maximum likelihood or maximum a posteriori). The second order optimisation we propose for model fitting achieves rapid and stable convergence thanks to a novel approximate Hessian. We demonstrate the utility of our flexible framework in the context of recovering more accurate maps from data acquired using the popular multi-parameter mapping protocol. We also show how to incorporate a joint total variation prior to further decrease the noise in the maps, noting that the probabilistic formulation allows the uncertainty on the recovered parameter maps to be estimated. Our implementation uses a PyTorch backend and benefits from GPU acceleration. It is available at https://github.com/balbasty/nitorch.
    DONeRF: Towards Real-Time Rendering of Compact Neural Radiance Fields using Depth Oracle Networks. (arXiv:2103.03231v4 [cs.CV] UPDATED)
    (0 min) The recent research explosion around implicit neural representations, such as NeRF, shows that there is immense potential for implicitly storing high-quality scene and lighting information in compact neural networks. However, one major limitation preventing the use of NeRF in real-time rendering applications is the prohibitive computational cost of excessive network evaluations along each view ray, requiring dozens of petaFLOPS. In this work, we bring compact neural representations closer to practical rendering of synthetic content in real-time applications, such as games and virtual reality. We show that the number of samples required for each view ray can be significantly reduced when samples are placed around surfaces in the scene without compromising image quality. To this end, we propose a depth oracle network that predicts ray sample locations for each view ray with a single network evaluation. We show that using a classification network around logarithmically discretized and spherically warped depth values is essential to encode surface locations rather than directly estimating depth. The combination of these techniques leads to DONeRF, our compact dual network design with a depth oracle network as its first step and a locally sampled shading network for ray accumulation. With DONeRF, we reduce the inference costs by up to 48x compared to NeRF when conditioning on available ground truth depth information. Compared to concurrent acceleration methods for raymarching-based neural representations, DONeRF does not require additional memory for explicit caching or acceleration structures, and can render interactively (20 frames per second) on a single GPU.
    Building Intelligent Autonomous Navigation Agents. (arXiv:2106.13415v1 [cs.CV])
    (2 min) Breakthroughs in machine learning in the last decade have led to `digital intelligence', i.e. machine learning models capable of learning from vast amounts of labeled data to perform several digital tasks such as speech recognition, face recognition, machine translation and so on. The goal of this thesis is to make progress towards designing algorithms capable of `physical intelligence', i.e. building intelligent autonomous navigation agents capable of learning to perform complex navigation tasks in the physical world involving visual perception, natural language understanding, reasoning, planning, and sequential decision making. Despite several advances in classical navigation methods in the last few decades, current navigation agents struggle at long-term semantic navigation tasks. In the first part of the thesis, we discuss our work on short-term navigation using end-to-end reinforcement learning to tackle challenges such as obstacle avoidance, semantic perception, language grounding, and reasoning. In the second part, we present a new class of navigation methods based on modular learning and structured explicit map representations, which leverage the strengths of both classical and end-to-end learning methods, to tackle long-term navigation tasks. We show that these methods are able to effectively tackle challenges such as localization, mapping, long-term planning, exploration and learning semantic priors. These modular learning methods are capable of long-term spatial and semantic understanding and achieve state-of-the-art results on various navigation tasks.
    Joslim: Joint Widths and Weights Optimization for Slimmable Neural Networks. (arXiv:2007.11752v3 [cs.LG] UPDATED)
    (0 min) Slimmable neural networks provide a flexible trade-off front between prediction error and computational requirement (such as the number of floating-point operations or FLOPs) with the same storage requirement as a single model. They are useful for reducing maintenance overhead for deploying models to devices with different memory constraints and are useful for optimizing the efficiency of a system with many CNNs. However, existing slimmable network approaches either do not optimize layer-wise widths or optimize the shared-weights and layer-wise widths independently, thereby leaving significant room for improvement by joint width and weight optimization. In this work, we propose a general framework to enable joint optimization for both width configurations and weights of slimmable networks. Our framework subsumes conventional and NAS-based slimmable methods as special cases and provides flexibility to improve over existing methods. From a practical standpoint, we propose Joslim, an algorithm that jointly optimizes both the widths and weights for slimmable nets, which outperforms existing methods for optimizing slimmable networks across various networks, datasets, and objectives. Quantitatively, improvements up to 1.7% and 8% in top-1 accuracy on the ImageNet dataset can be attained for MobileNetV2 considering FLOPs and memory footprint, respectively. Our results highlight the potential of optimizing the channel counts for different layers jointly with the weights for slimmable networks. Code available at https://github.com/cmu-enyac/Joslim.
    Connecting Sphere Manifolds Hierarchically for Regularization. (arXiv:2106.13549v1 [cs.CV])
    (2 min) This paper considers classification problems with hierarchically organized classes. We force the classifier (hyperplane) of each class to belong to a sphere manifold, whose center is the classifier of its super-class. Then, individual sphere manifolds are connected based on their hierarchical relations. Our technique replaces the last layer of a neural network by combining a spherical fully-connected layer with a hierarchical layer. This regularization is shown to improve the performance of widely used deep neural network architectures (ResNet and DenseNet) on publicly available datasets (CIFAR100, CUB200, Stanford dogs, Stanford cars, and Tiny-ImageNet).
    Countering Adversarial Examples: Combining Input Transformation and Noisy Training. (arXiv:2106.13394v1 [cs.CV])
    (2 min) Recent studies have shown that neural network (NN) based image classifiers are highly vulnerable to adversarial examples, which poses a threat to security-sensitive image recognition task. Prior work has shown that JPEG compression can combat the drop in classification accuracy on adversarial examples to some extent. But, as the compression ratio increases, traditional JPEG compression is insufficient to defend those attacks but can cause an abrupt accuracy decline to the benign images. In this paper, with the aim of fully filtering the adversarial perturbations, we firstly make modifications to traditional JPEG compression algorithm which becomes more favorable for NN. Specifically, based on an analysis of the frequency coefficient, we design a NN-favored quantization table for compression. Considering compression as a data augmentation strategy, we then combine our model-agnostic preprocess with noisy training. We fine-tune the pre-trained model by training with images encoded at different compression levels, thus generating multiple classifiers. Finally, since lower (higher) compression ratio can remove both perturbations and original features slightly (aggressively), we use these trained multiple models for model ensemble. The majority vote of the ensemble of models is adopted as final predictions. Experiments results show our method can improve defense efficiency while maintaining original accuracy.
    Shape registration in the time of transformers. (arXiv:2106.13679v1 [cs.CV])
    (2 min) In this paper, we propose a transformer-based procedure for the efficient registration of non-rigid 3D point clouds. The proposed approach is data-driven and adopts for the first time the transformer architecture in the registration task. Our method is general and applies to different settings. Given a fixed template with some desired properties (e.g. skinning weights or other animation cues), we can register raw acquired data to it, thereby transferring all the template properties to the input geometry. Alternatively, given a pair of shapes, our method can register the first onto the second (or vice-versa), obtaining a high-quality dense correspondence between the two. In both contexts, the quality of our results enables us to target real applications such as texture transfer and shape interpolation. Furthermore, we also show that including an estimation of the underlying density of the surface eases the learning process. By exploiting the potential of this architecture, we can train our model requiring only a sparse set of ground truth correspondences ($10\sim20\%$ of the total points). The proposed model and the analysis that we perform pave the way for future exploration of transformer-based architectures for registration and matching applications. Qualitative and quantitative evaluations demonstrate that our pipeline outperforms state-of-the-art methods for deformable and unordered 3D data registration on different datasets and scenarios.
    HAN: An Efficient Hierarchical Self-Attention Network for Skeleton-Based Gesture Recognition. (arXiv:2106.13391v1 [cs.CV])
    (0 min) Previous methods for skeleton-based gesture recognition mostly arrange the skeleton sequence into a pseudo picture or spatial-temporal graph and apply deep Convolutional Neural Network (CNN) or Graph Convolutional Network (GCN) for feature extraction. Although achieving superior results, these methods have inherent limitations in dynamically capturing local features of interactive hand parts, and the computing efficiency still remains a serious issue. In this work, the self-attention mechanism is introduced to alleviate this problem. Considering the hierarchical structure of hand joints, we propose an efficient hierarchical self-attention network (HAN) for skeleton-based gesture recognition, which is based on pure self-attention without any CNN, RNN or GCN operators. Specifically, the joint self-attention module is used to capture spatial features of fingers, the finger self-attention module is designed to aggregate features of the whole hand. In terms of temporal features, the temporal self-attention module is utilized to capture the temporal dynamics of the fingers and the entire hand. Finally, these features are fused by the fusion self-attention module for gesture classification. Experiments show that our method achieves competitive results on three gesture recognition datasets with much lower computational complexity.
    SRPN: similarity-based region proposal networks for nuclei and cells detection in histology images. (arXiv:2106.13556v1 [cs.CV])
    (0 min) The detection of nuclei and cells in histology images is of great value in both clinical practice and pathological studies. However, multiple reasons such as morphological variations of nuclei or cells make it a challenging task where conventional object detection methods cannot obtain satisfactory performance in many cases. A detection task consists of two sub-tasks, classification and localization. Under the condition of dense object detection, classification is a key to boost the detection performance. Considering this, we propose similarity based region proposal networks (SRPN) for nuclei and cells detection in histology images. In particular, a customized convolution layer termed as embedding layer is designed for network building. The embedding layer is added into the region proposal networks, enabling the networks to learn discriminative features based on similarity learning. Features obtained by similarity learning can significantly boost the classification performance compared to conventional methods. SRPN can be easily integrated into standard convolutional neural networks architectures such as the Faster R-CNN and RetinaNet. We test the proposed approach on tasks of multi-organ nuclei detection and signet ring cells detection in histological images. Experimental results show that networks applying similarity learning achieved superior performance on both tasks when compared to their counterparts. In particular, the proposed SRPN achieve state-of-the-art performance on the MoNuSeg benchmark for nuclei segmentation and detection while compared to previous methods, and on the signet ring cell detection benchmark when compared with baselines. The sourcecode is publicly available at: https://github.com/sigma10010/nuclei_cells_det.
    On the Robustness of Pretraining and Self-Supervision for a Deep Learning-based Analysis of Diabetic Retinopathy. (arXiv:2106.13497v1 [cs.CV])
    (0 min) There is an increasing number of medical use-cases where classification algorithms based on deep neural networks reach performance levels that are competitive with human medical experts. To alleviate the challenges of small dataset sizes, these systems often rely on pretraining. In this work, we aim to assess the broader implications of these approaches. For diabetic retinopathy grading as exemplary use case, we compare the impact of different training procedures including recently established self-supervised pretraining methods based on contrastive learning. To this end, we investigate different aspects such as quantitative performance, statistics of the learned feature representations, interpretability and robustness to image distortions. Our results indicate that models initialized from ImageNet pretraining report a significant increase in performance, generalization and robustness to image distortions. In particular, self-supervised models show further benefits to supervised models. Self-supervised models with initialization from ImageNet pretraining not only report higher performance, they also reduce overfitting to large lesions along with improvements in taking into account minute lesions indicative of the progression of the disease. Understanding the effects of pretraining in a broader sense that goes beyond simple performance comparisons is of crucial importance for the broader medical imaging community beyond the use-case considered in this work.
    Animatable Neural Radiance Fields from Monocular RGB Video. (arXiv:2106.13629v1 [cs.CV])
    (0 min) We present animatable neural radiance fields for detailed human avatar creation from monocular videos. Our approach extends neural radiance fields (NeRF) to the dynamic scenes with human movements via introducing explicit pose-guided deformation while learning the scene representation network. In particular, we estimate the human pose for each frame and learn a constant canonical space for the detailed human template, which enables natural shape deformation from the observation space to the canonical space under the explicit control of the pose parameters. To compensate for inaccurate pose estimation, we introduce the pose refinement strategy that updates the initial pose during the learning process, which not only helps to learn more accurate human reconstruction but also accelerates the convergence. In experiments we show that the proposed approach achieves 1) implicit human geometry and appearance reconstruction with high-quality details, 2) photo-realistic rendering of the human from arbitrary views, and 3) animation of the human with arbitrary poses.
    Projection-wise Disentangling for Fair and Interpretable Representation Learning: Application to 3D Facial Shape Analysis. (arXiv:2106.13734v1 [cs.CV])
    (0 min) Confounding bias is a crucial problem when applying machine learning to practice, especially in clinical practice. We consider the problem of learning representations independent to multiple biases. In literature, this is mostly solved by purging the bias information from learned representations. We however expect this strategy to harm the diversity of information in the representation, and thus limiting its prospective usage (e.g., interpretation). Therefore, we propose to mitigate the bias while keeping almost all information in the latent representations, which enables us to observe and interpret them as well. To achieve this, we project latent features onto a learned vector direction, and enforce the independence between biases and projected features rather than all learned features. To interpret the mapping between projected features and input data, we propose projection-wise disentangling: a sampling and reconstruction along the learned vector direction. The proposed method was evaluated on the analysis of 3D facial shape and patient characteristics (N=5011). Experiments showed that this conceptually simple method achieved state-of-the-art fair prediction performance and interpretability, showing its great potential for clinical applications.
    MultiFace: A Generic Training Mechanism for Boosting Face Recognition Performance. (arXiv:2101.09899v3 [cs.CV] UPDATED)
    (0 min) Deep Convolutional Neural Networks (DCNNs) and their variants have been widely used in large scale face recognition(FR) recently. Existing methods have achieved good performance on many FR benchmarks. However, most of them suffer from two major problems. First, these methods converge quite slowly since they optimize the loss functions in a high-dimensional and sparse Gaussian Sphere. Second, the high dimensionality of features, despite the powerful descriptive ability, brings difficulty to the optimization, which may lead to a sub-optimal local optimum. To address these problems, we propose a simple yet efficient training mechanism called MultiFace, where we approximate the original high-dimensional features by the ensemble of low-dimensional features. The proposed mechanism is also generic and can be easily applied to many advanced FR models. Moreover, it brings the benefits of good interpretability to FR models via the clustering effect. In detail, the ensemble of these low-dimensional features can capture complementary yet discriminative information, which can increase the intra-class compactness and inter-class separability. Experimental results show that the proposed mechanism can accelerate 2-3 times with the softmax loss and 1.2-1.5 times with Arcface or Cosface, while achieving state-of-the-art performances in several benchmark datasets. Especially, the significant improvements on large-scale datasets(e.g., IJB and MageFace) demonstrate the flexibility of our new training mechanism.
    Efficient Document Image Classification Using Region-Based Graph Neural Network. (arXiv:2106.13802v1 [cs.CV])
    (0 min) Document image classification remains a popular research area because it can be commercialized in many enterprise applications across different industries. Recent advancements in large pre-trained computer vision and language models and graph neural networks has lent document image classification many tools. However using large pre-trained models usually requires substantial computing resources which could defeat the cost-saving advantages of automatic document image classification. In the paper we propose an efficient document image classification framework that uses graph convolution neural networks and incorporates textual, visual and layout information of the document. We have rigorously benchmarked our proposed algorithm against several state-of-art vision and language models on both publicly available dataset and a real-life insurance document classification dataset. Empirical results on both publicly available and real-world data show that our methods achieve near SOTA performance yet require much less computing resources and time for model training and inference. This results in solutions than offer better cost advantages, especially in scalable deployment for enterprise applications. The results showed that our algorithm can achieve classification performance quite close to SOTA. We also provide comprehensive comparisons of computing resources, model sizes, train and inference time between our proposed methods and baselines. In addition we delineate the cost per image using our method and other baselines.
    Graph Pattern Loss based Diversified Attention Network for Cross-Modal Retrieval. (arXiv:2106.13552v1 [cs.CV])
    (0 min) Cross-modal retrieval aims to enable flexible retrieval experience by combining multimedia data such as image, video, text, and audio. One core of unsupervised approaches is to dig the correlations among different object representations to complete satisfied retrieval performance without requiring expensive labels. In this paper, we propose a Graph Pattern Loss based Diversified Attention Network(GPLDAN) for unsupervised cross-modal retrieval to deeply analyze correlations among representations. First, we propose a diversified attention feature projector by considering the interaction between different representations to generate multiple representations of an instance. Then, we design a novel graph pattern loss to explore the correlations among different representations, in this graph all possible distances between different representations are considered. In addition, a modality classifier is added to explicitly declare the corresponding modalities of features before fusion and guide the network to enhance discrimination ability. We test GPLDAN on four public datasets. Compared with the state-of-the-art cross-modal retrieval methods, the experimental results demonstrate the performance and competitiveness of GPLDAN.
    Multiview Video Compression Using Advanced HEVC Screen Content Coding. (arXiv:2106.13574v1 [cs.MM])
    (2 min) The paper presents a new approach to multiview video coding using Screen Content Coding. It is assumed that for a time instant the frames corresponding to all views are packed into a single frame, i.e. the frame-compatible approach to multiview coding is applied. For such coding scenario, the paper demonstrates that Screen Content Coding can be efficiently used for multiview video coding. Two approaches are considered: the first using standard HEVC Screen Content Coding, and the second using Advanced Screen Content Coding. The latter is the original proposal of the authors that exploits quarter-pel motion vectors and other nonstandard extensions of HEVC Screen Content Coding. The experimental results demonstrate that multiview video coding even using standard HEVC Screen Content Coding is much more efficient than simulcast HEVC coding. The proposed Advanced Screen Content Coding provides virtually the same coding efficiency as MV-HEVC, which is the state-of-the-art multiview video compression technique. The authors suggest that Advanced Screen Content Coding can be efficiently used within the new Versatile Video Coding (VVC) technology. Nevertheless a reference multiview extension of VVC does not exist yet, therefore, for VVC-based coding, the experimental comparisons are left for future work.
    Multivariate Medians for Image and Shape Analysis. (arXiv:1911.00143v2 [eess.IV] UPDATED)
    (0 min) Having been studied since long by statisticians, multivariate median concepts found their way into the image processing literature in the course of the last decades, being used to construct robust and efficient denoising filters for multivariate images such as colour images but also matrix-valued images. Based on the similarities between image and geometric data as results of the sampling of continuous physical quantities, it can be expected that the understanding of multivariate median filters for images provides a starting point for the development of shape processing techniques. This paper presents an overview of multivariate median concepts relevant for image and shape processing. It focusses on their mathematical principles and discusses important properties especially in the context of image processing.
    Image-to-image Transformation with Auxiliary Condition. (arXiv:2106.13696v1 [cs.CV])
    (0 min) The performance of image recognition like human pose detection, trained with simulated images would usually get worse due to the divergence between real and simulated data. To make the distribution of a simulated image close to that of real one, there are several works applying GAN-based image-to-image transformation methods, e.g., SimGAN and CycleGAN. However, these methods would not be sensitive enough to the various change in pose and shape of subjects, especially when the training data are imbalanced, e.g., some particular poses and shapes are minor in the training data. To overcome this problem, we propose to introduce the label information of subjects, e.g., pose and type of objects in the training of CycleGAN, and lead it to obtain label-wise transforamtion models. We evaluate our proposed method called Label-CycleGAN, through experiments on the digit image transformation from SVHN to MNIST and the surveillance camera image transformation from simulated to real images.
    Interactive Multi-level Stroke Control for Neural Style Transfer. (arXiv:2106.13787v1 [cs.CV])
    (0 min) We present StyleTune, a mobile app for interactive multi-level control of neural style transfers that facilitates creative adjustments of style elements and enables high output fidelity. In contrast to current mobile neural style transfer apps, StyleTune supports users to adjust both the size and orientation of style elements, such as brushstrokes and texture patches, on a global as well as local level. To this end, we propose a novel stroke-adaptive feed-forward style transfer network, that enables control over stroke size and intensity and allows a larger range of edits than current approaches. For additional level-of-control, we propose a network agnostic method for stroke-orientation adjustment by utilizing the rotation-variance of CNNs. To achieve high output fidelity, we further add a patch-based style transfer method that enables users to obtain output resolutions of more than 20 Megapixel. Our approach empowers users to create many novel results that are not possible with current mobile neural style transfer apps.
    CausalCity: Complex Simulations with Agency for Causal Discovery and Reasoning. (arXiv:2106.13364v1 [cs.AI])
    (2 min) The ability to perform causal and counterfactual reasoning are central properties of human intelligence. Decision-making systems that can perform these types of reasoning have the potential to be more generalizable and interpretable. Simulations have helped advance the state-of-the-art in this domain, by providing the ability to systematically vary parameters (e.g., confounders) and generate examples of the outcomes in the case of counterfactual scenarios. However, simulating complex temporal causal events in multi-agent scenarios, such as those that exist in driving and vehicle navigation, is challenging. To help address this, we present a high-fidelity simulation environment that is designed for developing algorithms for causal discovery and counterfactual reasoning in the safety-critical context. A core component of our work is to introduce \textit{agency}, such that it is simple to define and create complex scenarios using high-level definitions. The vehicles then operate with agency to complete these objectives, meaning low-level behaviors need only be controlled if necessary. We perform experiments with three state-of-the-art methods to create baselines and highlight the affordances of this environment. Finally, we highlight challenges and opportunities for future work.
    Interpreting Depression From Question-wise Long-term Video Recording of SDS Evaluation. (arXiv:2106.13393v1 [cs.CV])
    (0 min) Self-Rating Depression Scale (SDS) questionnaire has frequently been used for efficient depression preliminary screening. However, the uncontrollable self-administered measure can be easily affected by insouciantly or deceptively answering, and producing the different results with the clinician-administered Hamilton Depression Rating Scale (HDRS) and the final diagnosis. Clinically, facial expression (FE) and actions play a vital role in clinician-administered evaluation, while FE and action are underexplored for self-administered evaluations. In this work, we collect a novel dataset of 200 subjects to evidence the validity of self-rating questionnaires with their corresponding question-wise video recording. To automatically interpret depression from the SDS evaluation and the paired video, we propose an end-to-end hierarchical framework for the long-term variable-length video, which is also conditioned on the questionnaire results and the answering time. Specifically, we resort to a hierarchical model which utilizes a 3D CNN for local temporal pattern exploration and a redundancy-aware self-attention (RAS) scheme for question-wise global feature aggregation. Targeting for the redundant long-term FE video processing, our RAS is able to effectively exploit the correlations of each video clip within a question set to emphasize the discriminative information and eliminate the redundancy based on feature pair-wise affinity. Then, the question-wise video feature is concatenated with the questionnaire scores for final depression detection. Our thorough evaluations also show the validity of fusing SDS evaluation and its video recording, and the superiority of our framework to the conventional state-of-the-art temporal modeling methods.
    Generative Modeling for Multi-task Visual Learning. (arXiv:2106.13409v1 [cs.CV])
    (0 min) Generative modeling has recently shown great promise in computer vision, but it has mostly focused on synthesizing visually realistic images. In this paper, motivated by multi-task learning of shareable feature representations, we consider a novel problem of learning a shared generative model that is useful across various visual perception tasks. Correspondingly, we propose a general multi-task oriented generative modeling (MGM) framework, by coupling a discriminative multi-task network with a generative network. While it is challenging to synthesize both RGB images and pixel-level annotations in multi-task scenarios, our framework enables us to use synthesized images paired with only weak annotations (i.e., image-level scene labels) to facilitate multiple visual tasks. Experimental evaluation on challenging multi-task benchmarks, including NYUv2 and Taskonomy, demonstrates that our MGM framework improves the performance of all the tasks by large margins, consistently outperforming state-of-the-art multi-task approaches.
    Video Moment Retrieval with Text Query Considering Many-to-Many Correspondence Using Potentially Relevant Pair. (arXiv:2106.13566v1 [cs.CV])
    (0 min) In this paper we undertake the task of text-based video moment retrieval from a corpus of videos. To train the model, text-moment paired datasets were used to learn the correct correspondences. In typical training methods, ground-truth text-moment pairs are used as positive pairs, whereas other pairs are regarded as negative pairs. However, aside from the ground-truth pairs, some text-moment pairs should be regarded as positive. In this case, one text annotation can be positive for many video moments. Conversely, one video moment can be corresponded to many text annotations. Thus, there are many-to-many correspondences between the text annotations and video moments. Based on these correspondences, we can form potentially relevant pairs, which are not given as ground truth yet are not negative; effectively incorporating such relevant pairs into training can improve the retrieval performance. The text query should describe what is happening in a video moment. Hence, different video moments annotated with similar texts, which contain a similar action, are likely to hold the similar action, thus these pairs can be considered as potentially relevant pairs. In this paper, we propose a novel training method that takes advantage of potentially relevant pairs, which are detected based on linguistic analysis about text annotation. Experiments on two benchmark datasets revealed that our method improves the retrieval performance both quantitatively and qualitatively.
    Vision Transformer Architecture Search. (arXiv:2106.13700v1 [cs.CV])
    (0 min) Recently, transformers have shown great superiority in solving computer vision tasks by modeling images as a sequence of manually-split patches with self-attention mechanism. However, current architectures of vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks and have not been sufficiently investigated and optimized. In this paper, we make a further step by examining the intrinsic structure of transformers for vision tasks and propose an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets. Concretely, we design a new effective yet efficient weight sharing paradigm for ViTs, such that architectures with different token embedding, sequence size, number of heads, width, and depth can be derived from a single super-transformer. Moreover, to cater for the variance of distinct architectures, we introduce \textit{private} class token and self-attention maps in the super-transformer. In addition, to adapt the searching for different budgets, we propose to search the sampling probability of identity operation. Experimental results show that our ViTAS attains excellent results compared to existing pure transformer architectures. For example, with $1.3$G FLOPs budget, our searched architecture achieves $74.7\%$ top-$1$ accuracy on ImageNet and is $2.5\%$ superior than the current baseline ViT architecture. Code is available at \url{https://github.com/xiusu/ViTAS}.
    Circumpapillary OCT-Focused Hybrid Learning for Glaucoma Grading Using Tailored Prototypical Neural Networks. (arXiv:2106.13551v1 [eess.IV])
    (0 min) Glaucoma is one of the leading causes of blindness worldwide and Optical Coherence Tomography (OCT) is the quintessential imaging technique for its detection. Unlike most of the state-of-the-art studies focused on glaucoma detection, in this paper, we propose, for the first time, a novel framework for glaucoma grading using raw circumpapillary B-scans. In particular, we set out a new OCT-based hybrid network which combines hand-driven and deep learning algorithms. An OCT-specific descriptor is proposed to extract hand-crafted features related to the retinal nerve fibre layer (RNFL). In parallel, an innovative CNN is developed using skip-connections to include tailored residual and attention modules to refine the automatic features of the latent space. The proposed architecture is used as a backbone to conduct a novel few-shot learning based on static and dynamic prototypical networks. The k-shot paradigm is redefined giving rise to a supervised end-to-end system which provides substantial improvements discriminating between healthy, early and advanced glaucoma samples. The training and evaluation processes of the dynamic prototypical network are addressed from two fused databases acquired via Heidelberg Spectralis system. Validation and testing results reach a categorical accuracy of 0.9459 and 0.8788 for glaucoma grading, respectively. Besides, the high performance reported by the proposed model for glaucoma detection deserves a special mention. The findings from the class activation maps are directly in line with the clinicians' opinion since the heatmaps pointed out the RNFL as the most relevant structure for glaucoma diagnosis.
    Single Image Texture Translation for Data Augmentation. (arXiv:2106.13804v1 [cs.CV])
    (0 min) Recent advances in image synthesis enables one to translate images by learning the mapping between a source domain and a target domain. Existing methods tend to learn the distributions by training a model on a variety of datasets, with results evaluated largely in a subjective manner. Relatively few works in this area, however, study the potential use of semantic image translation methods for image recognition tasks. In this paper, we explore the use of Single Image Texture Translation (SITT) for data augmentation. We first propose a lightweight model for translating texture to images based on a single input of source texture, allowing for fast training and testing. Based on SITT, we then explore the use of augmented data in long-tailed and few-shot image classification tasks. We find the proposed method is capable of translating input data into a target domain, leading to consistent improved image recognition performance. Finally, we examine how SITT and related image translation methods can provide a basis for a data-efficient, augmentation engineering approach to model training.
    A Picture May Be Worth a Hundred Words for Visual Question Answering. (arXiv:2106.13445v1 [cs.CV])
    (0 min) How far can we go with textual representations for understanding pictures? In image understanding, it is essential to use concise but detailed image representations. Deep visual features extracted by vision models, such as Faster R-CNN, are prevailing used in multiple tasks, and especially in visual question answering (VQA). However, conventional deep visual features may struggle to convey all the details in an image as we humans do. Meanwhile, with recent language models' progress, descriptive text may be an alternative to this problem. This paper delves into the effectiveness of textual representations for image understanding in the specific context of VQA. We propose to take description-question pairs as input, instead of deep visual features, and fed them into a language-only Transformer model, simplifying the process and the computational cost. We also experiment with data augmentation techniques to increase the diversity in the training set and avoid learning statistical bias. Extensive evaluations have shown that textual representations require only about a hundred words to compete with deep visual features on both VQA 2.0 and VQA-CP v2.
    AD-NeRF: Audio Driven Neural Radiance Fields for Talking Head Synthesis. (arXiv:2103.11078v2 [cs.CV] UPDATED)
    (0 min) Generating high-fidelity talking head video by fitting with the input audio sequence is a challenging problem that receives considerable attentions recently. In this paper, we address this problem with the aid of neural scene representation networks. Our method is completely different from existing methods that rely on intermediate representations like 2D landmarks or 3D face models to bridge the gap between audio input and video output. Specifically, the feature of input audio signal is directly fed into a conditional implicit function to generate a dynamic neural radiance field, from which a high-fidelity talking-head video corresponding to the audio signal is synthesized using volume rendering. Another advantage of our framework is that not only the head (with hair) region is synthesized as previous methods did, but also the upper body is generated via two individual neural radiance fields. Experimental results demonstrate that our novel framework can (1) produce high-fidelity and natural results, and (2) support free adjustment of audio signals, viewing directions, and background images.
    Single and Union Non-parallel Support Vector Machine Frameworks. (arXiv:1910.09734v3 [cs.LG] UPDATED)
    (0 min) Considering the classification problem, we summarize the nonparallel support vector machines with the nonparallel hyperplanes to two types of frameworks. The first type constructs the hyperplanes separately. It solves a series of small optimization problems to obtain a series of hyperplanes, but is hard to measure the loss of each sample. The other type constructs all the hyperplanes simultaneously, and it solves one big optimization problem with the ascertained loss of each sample. We give the characteristics of each framework and compare them carefully. In addition, based on the second framework, we construct a max-min distance-based nonparallel support vector machine for multiclass classification problem, called NSVM. It constructs hyperplanes with large distance margin by solving an optimization problem. Experimental results on benchmark data sets show the advantages of our NSVM.
    Re-parameterizing VAEs for stability. (arXiv:2106.13739v1 [cs.LG])
    (0 min) We propose a theoretical approach towards the training numerical stability of Variational AutoEncoders (VAE). Our work is motivated by recent studies empowering VAEs to reach state of the art generative results on complex image datasets. These very deep VAE architectures, as well as VAEs using more complex output distributions, highlight a tendency to haphazardly produce high training gradients as well as NaN losses. The empirical fixes proposed to train them despite their limitations are neither fully theoretically grounded nor generally sufficient in practice. Building on this, we localize the source of the problem at the interface between the model's neural networks and their output probabilistic distributions. We explain a common source of instability stemming from an incautious formulation of the encoded Normal distribution's variance, and apply the same approach on other, less obvious sources. We show that by implementing small changes to the way we parameterize the Normal distributions on which they rely, VAEs can securely be trained.
    Physics perception in sloshing scenes with guaranteed thermodynamic consistency. (arXiv:2106.13301v1 [cs.CV])
    (0 min) Physics perception very often faces the problem that only limited data or partial measurements on the scene are available. In this work, we propose a strategy to learn the full state of sloshing liquids from measurements of the free surface. Our approach is based on recurrent neural networks (RNN) that project the limited information available to a reduced-order manifold so as to not only reconstruct the unknown information, but also to be capable of performing fluid reasoning about future scenarios in real time. To obtain physically consistent predictions, we train deep neural networks on the reduced-order manifold that, through the employ of inductive biases, ensure the fulfillment of the principles of thermodynamics. RNNs learn from history the required hidden information to correlate the limited information with the latent space where the simulation occurs. Finally, a decoder returns data back to the high-dimensional manifold, so as to provide the user with insightful information in the form of augmented reality. This algorithm is connected to a computer vision system to test the performance of the proposed methodology with real information, resulting in a system capable of understanding and predicting future states of the observed fluid in real-time.
    Semantic annotation for computational pathology: Multidisciplinary experience and best practice recommendations. (arXiv:2106.13689v1 [eess.IV])
    (0 min) Recent advances in whole slide imaging (WSI) technology have led to the development of a myriad of computer vision and artificial intelligence (AI) based diagnostic, prognostic, and predictive algorithms. Computational Pathology (CPath) offers an integrated solution to utilize information embedded in pathology WSIs beyond what we obtain through visual assessment. For automated analysis of WSIs and validation of machine learning (ML) models, annotations at the slide, tissue and cellular levels are required. The annotation of important visual constructs in pathology images is an important component of CPath projects. Improper annotations can result in algorithms which are hard to interpret and can potentially produce inaccurate and inconsistent results. Despite the crucial role of annotations in CPath projects, there are no well-defined guidelines or best practices on how annotations should be carried out. In this paper, we address this shortcoming by presenting the experience and best practices acquired during the execution of a large-scale annotation exercise involving a multidisciplinary team of pathologists, ML experts and researchers as part of the Pathology image data Lake for Analytics, Knowledge and Education (PathLAKE) consortium. We present a real-world case study along with examples of different types of annotations, diagnostic algorithm, annotation data dictionary and annotation constructs. The analyses reported in this work highlight best practice recommendations that can be used as annotation guidelines over the lifecycle of a CPath project.
    Domain-guided Machine Learning for Remotely Sensed In-Season Crop Growth Estimation. (arXiv:2106.13323v1 [cs.LG])
    (2 min) Advanced machine learning techniques have been used in remote sensing (RS) applications such as crop mapping and yield prediction, but remain under-utilized for tracking crop progress. In this study, we demonstrate the use of agronomic knowledge of crop growth drivers in a Long Short-Term Memory-based, Domain-guided neural network (DgNN) for in-season crop progress estimation. The DgNN uses a branched structure and attention to separate independent crop growth drivers and capture their varying importance throughout the growing season. The DgNN is implemented for corn, using RS data in Iowa for the period 2003-2019, with USDA crop progress reports used as ground truth. State-wide DgNN performance shows significant improvement over sequential and dense-only NN structures, and a widely-used Hidden Markov Model method. The DgNN had a 3.5% higher Nash-Sutfliffe efficiency over all growth stages and 33% more weeks with highest cosine similarity than the other NNs during test years. The DgNN and Sequential NN were more robust during periods of abnormal crop progress, though estimating the Silking-Grainfill transition was difficult for all methods. Finally, Uniform Manifold Approximation and Projection visualizations of layer activations showed how LSTM-based NNs separate crop growth time-series differently from a dense-only structure. Results from this study exhibit both the viability of NNs in crop growth stage estimation (CGSE) and the benefits of using domain knowledge. The DgNN methodology presented here can be extended to provide near-real time CGSE of other crops.
    Generalized Unsupervised Clustering of Hyperspectral Images of Geological Targets in the Near Infrared. (arXiv:2106.13315v1 [eess.IV])
    (2 min) The application of infrared hyperspectral imagery to geological problems is becoming more popular as data become more accessible and cost-effective. Clustering and classifying spectrally similar materials is often a first step in applications ranging from economic mineral exploration on Earth to planetary exploration on Mars. Semi-manual classification guided by expertly developed spectral parameters can be time consuming and biased, while supervised methods require abundant labeled data and can be difficult to generalize. Here we develop a fully unsupervised workflow for feature extraction and clustering informed by both expert spectral geologist input and quantitative metrics. Our pipeline uses a lightweight autoencoder followed by Gaussian mixture modeling to map the spectral diversity within any image. We validate the performance of our pipeline at submillimeter-scale with expert-labelled data from the Oman ophiolite drill core and evaluate performance at meters-scale with partially classified orbital data of Jezero Crater on Mars (the landing site for the Perseverance rover). We additionally examine the effects of various preprocessing techniques used in traditional analysis of hyperspectral imagery. This pipeline provides a fast and accurate clustering map of similar geological materials and consistently identifies and separates major mineral classes in both laboratory imagery and remote sensing imagery. We refer to our pipeline as "Generalized Pipeline for Spectroscopic Unsupervised clustering of Minerals (GyPSUM)."
    DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval. (arXiv:2106.13266v1 [cs.CV])
    (2 min) In this paper, we address the problem of high performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) coarse-grained approaches representing/indexing videos as global vectors, where the spatio-temporal structure is lost, providing low performance but also having low computational cost. In this work, we propose a Knowledge Distillation framework, which we call Distill-and-Select (DnS), that starting from a well-performing fine-grained Teacher Network learns: a) Student Networks at different retrieval performance and computational efficiency trade-offs and b) a Selection Network that at test time rapidly directs samples to the appropriate student to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store index videos using binary representations. Importantly, the proposed scheme allows Knowledge Distillation in large, unlabelled datasets -- this leads to good students. We evaluate DnS on five public datasets on three different video retrieval tasks and demonstrate a) that our students achieve state-of-the-art performance in several cases and b) that our DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, our method achieves similar mAP with the teacher but is 20 times faster and requires 240 times less storage space. Our collected dataset and implementation are publicly available: https://github.com/mever-team/distill-and-select.
    Energy-Based Generative Cooperative Saliency Prediction. (arXiv:2106.13389v1 [cs.CV])
    (2 min) Conventional saliency prediction models typically learn a deterministic mapping from images to the corresponding ground truth saliency maps. In this paper, we study the saliency prediction problem from the perspective of generative models by learning a conditional probability distribution over saliency maps given an image, and treating the prediction as a sampling process. Specifically, we propose a generative cooperative saliency prediction framework based on the generative cooperative networks, where a conditional latent variable model and a conditional energy-based model are jointly trained to predict saliency in a cooperative manner. We call our model the SalCoopNets. The latent variable model serves as a fast but coarse predictor to efficiently produce an initial prediction, which is then refined by the iterative Langevin revision of the energy-based model that serves as a fine predictor. Such a coarse-to-fine cooperative saliency prediction strategy offers the best of both worlds. Moreover, we generalize our framework to the scenario of weakly supervised saliency prediction, where saliency annotation of training images is partially observed, by proposing a cooperative learning while recovering strategy. Lastly, we show that the learned energy function can serve as a refinement module that can refine the results of other pre-trained saliency prediction models. Experimental results show that our generative model can achieve state-of-the-art performance. Our code is publicly available at: \url{https://github.com/JingZhang617/SalCoopNets}.
    Generalized One-Class Learning Using Pairs of Complementary Classifiers. (arXiv:2106.13272v1 [cs.CV])
    (2 min) One-class learning is the classic problem of fitting a model to the data for which annotations are available only for a single class. In this paper, we explore novel objectives for one-class learning, which we collectively refer to as Generalized One-class Discriminative Subspaces (GODS). Our key idea is to learn a pair of complementary classifiers to flexibly bound the one-class data distribution, where the data belongs to the positive half-space of one of the classifiers in the complementary pair and to the negative half-space of the other. To avoid redundancy while allowing non-linearity in the classifier decision surfaces, we propose to design each classifier as an orthonormal frame and seek to learn these frames via jointly optimizing for two conflicting objectives, namely: i) to minimize the distance between the two frames, and ii) to maximize the margin between the frames and the data. The learned orthonormal frames will thus characterize a piecewise linear decision surface that allows for efficient inference, while our objectives seek to bound the data within a minimal volume that maximizes the decision margin, thereby robustly capturing the data distribution. We explore several variants of our formulation under different constraints on the constituent classifiers, including kernelized feature maps. We demonstrate the empirical benefits of our approach via experiments on data from several applications in computer vision, such as anomaly detection in video sequences, human poses, and human activities. We also explore the generality and effectiveness of GODS for non-vision tasks via experiments on several UCI datasets, demonstrating state-of-the-art results.
    Semi-supervised Meta-learning with Disentanglement for Domain-generalised Medical Image Segmentation. (arXiv:2106.13292v1 [cs.CV])
    (2 min) Generalising deep models to new data from new centres (termed here domains) remains a challenge. This is largely attributed to shifts in data statistics (domain shifts) between source and unseen domains. Recently, gradient-based meta-learning approaches where the training data are split into meta-train and meta-test sets to simulate and handle the domain shifts during training have shown improved generalisation performance. However, the current fully supervised meta-learning approaches are not scalable for medical image segmentation, where large effort is required to create pixel-wise annotations. Meanwhile, in a low data regime, the simulated domain shifts may not approximate the true domain shifts well across source and unseen domains. To address this problem, we propose a novel semi-supervised meta-learning framework with disentanglement. We explicitly model the representations related to domain shifts. Disentangling the representations and combining them to reconstruct the input image allows unlabeled data to be used to better approximate the true domain shifts for meta-learning. Hence, the model can achieve better generalisation performance, especially when there is a limited amount of labeled data. Experiments show that the proposed method is robust on different segmentation tasks and achieves state-of-the-art generalisation performance on two public benchmarks.
    Federated Noisy Client Learning. (arXiv:2106.13239v1 [cs.LG])
    (2 min) Federated learning (FL) collaboratively aggregates a shared global model depending on multiple local clients, while keeping the training data decentralized in order to preserve data privacy. However, standard FL methods ignore the noisy client issue, which may harm the overall performance of the aggregated model. In this paper, we first analyze the noisy client statement, and then model noisy clients with different noise distributions (e.g., Bernoulli and truncated Gaussian distributions). To learn with noisy clients, we propose a simple yet effective FL framework, named Federated Noisy Client Learning (Fed-NCL), which is a plug-and-play algorithm and contains two main components: a data quality measurement (DQM) to dynamically quantify the data quality of each participating client, and a noise robust aggregation (NRA) to adaptively aggregate the local models of each client by jointly considering the amount of local training data and the data quality of each client. Our Fed-NCL can be easily applied in any standard FL workflow to handle the noisy client issue. Experimental results on various datasets demonstrate that our algorithm boosts the performances of different state-of-the-art systems with noisy clients.
    Free-viewpoint Indoor Neural Relighting from Multi-view Stereo. (arXiv:2106.13299v1 [cs.GR])
    (2 min) We introduce a neural relighting algorithm for captured indoors scenes, that allows interactive free-viewpoint navigation. Our method allows illumination to be changed synthetically, while coherently rendering cast shadows and complex glossy materials. We start with multiple images of the scene and a 3D mesh obtained by multi-view stereo (MVS) reconstruction. We assume that lighting is well-explained as the sum of a view-independent diffuse component and a view-dependent glossy term concentrated around the mirror reflection direction. We design a convolutional network around input feature maps that facilitate learning of an implicit representation of scene materials and illumination, enabling both relighting and free-viewpoint navigation. We generate these input maps by exploiting the best elements of both image-based and physically-based rendering. We sample the input views to estimate diffuse scene irradiance, and compute the new illumination caused by user-specified light sources using path tracing. To facilitate the network's understanding of materials and synthesize plausible glossy reflections, we reproject the views and compute mirror images. We train the network on a synthetic dataset where each scene is also reconstructed with MVS. We show results of our algorithm relighting real indoor scenes and performing free-viewpoint navigation with complex and realistic glossy reflections, which so far remained out of reach for view-synthesis techniques.
  • cs.IR updates on arXiv.org

    Pre-trained Language Model based Ranking in Baidu Search. (arXiv:2105.11108v3 [cs.IR] UPDATED)
    (3 min) As the heart of a search engine, the ranking system plays a crucial role in satisfying users' information demands. More recently, neural rankers fine-tuned from pre-trained language models (PLMs) establish state-of-the-art ranking effectiveness. However, it is nontrivial to directly apply these PLM-based rankers to the large-scale web search system due to the following challenging issues:(1) the prohibitively expensive computations of massive neural PLMs, especially for long texts in the web-document, prohibit their deployments in an online ranking system that demands extremely low latency;(2) the discrepancy between existing ranking-agnostic pre-training objectives and the ad-hoc retrieval scenarios that demand comprehensive relevance modeling is another main barrier for improving the online ranking system;(3) a real-world search engine typically involves a committee of ranking components, and thus the compatibility of the individually fine-tuned ranking model is critical for a cooperative ranking system. In this work, we contribute a series of successfully applied techniques in tackling these exposed issues when deploying the state-of-the-art Chinese pre-trained language model, i.e., ERNIE, in the online search engine system. We first articulate a novel practice to cost-efficiently summarize the web document and contextualize the resultant summary content with the query using a cheap yet powerful Pyramid-ERNIE architecture. Then we endow an innovative paradigm to finely exploit the large-scale noisy and biased post-click behavioral data for relevance-oriented pre-training. We also propose a human-anchored fine-tuning strategy tailored for the online ranking system, aiming to stabilize the ranking signals across various online components. Extensive offline and online experimental results show that the proposed techniques significantly boost the search engine's performance.
    A Modern Perspective on Query Likelihood with Deep Generative Retrieval Models. (arXiv:2106.13618v1 [cs.IR])
    (2 min) Existing neural ranking models follow the text matching paradigm, where document-to-query relevance is estimated through predicting the matching score. Drawing from the rich literature of classical generative retrieval models, we introduce and formalize the paradigm of deep generative retrieval models defined via the cumulative probabilities of generating query terms. This paradigm offers a grounded probabilistic view on relevance estimation while still enabling the use of modern neural architectures. In contrast to the matching paradigm, the probabilistic nature of generative rankers readily offers a fine-grained measure of uncertainty. We adopt several current neural generative models in our framework and introduce a novel generative ranker (T-PGN), which combines the encoding capacity of Transformers with the Pointer Generator Network model. We conduct an extensive set of evaluation experiments on passage retrieval, leveraging the MS MARCO Passage Re-ranking and TREC Deep Learning 2019 Passage Re-ranking collections. Our results show the significantly higher performance of the T-PGN model when compared with other generative models. Lastly, we demonstrate that exploiting the uncertainty information of deep generative rankers opens new perspectives to query/collection understanding, and significantly improves the cut-off prediction task.
    Pre-trained Language Model for Web-scale Retrieval in Baidu Search. (arXiv:2106.03373v2 [cs.IR] UPDATED)
    (2 min) Retrieval is a crucial stage in web search that identifies a small set of query-relevant candidates from a billion-scale corpus. Discovering more semantically-related candidates in the retrieval stage is very promising to expose more high-quality results to the end users. However, it still remains non-trivial challenges of building and deploying effective retrieval models for semantic matching in real search engine. In this paper, we describe the retrieval system that we developed and deployed in Baidu Search. The system exploits the recent state-of-the-art Chinese pretrained language model, namely Enhanced Representation through kNowledge IntEgration (ERNIE), which facilitates the system with expressive semantic matching. In particular, we developed an ERNIE-based retrieval model, which is equipped with 1) expressive Transformer-based semantic encoders, and 2) a comprehensive multi-stage training paradigm. More importantly, we present a practical system workflow for deploying the model in web-scale retrieval. Eventually, the system is fully deployed into production, where rigorous offline and online experiments were conducted. The results show that the system can perform high-quality candidate retrieval, especially for those tail queries with uncommon demands. Overall, the new retrieval system facilitated by pretrained language model (i.e., ERNIE) can largely improve the usability and applicability of our search engine.
    Balancing Accuracy and Fairness for Interactive Recommendation with Reinforcement Learning. (arXiv:2106.13386v1 [cs.IR])
    (2 min) Fairness in recommendation has attracted increasing attention due to bias and discrimination possibly caused by traditional recommenders. In Interactive Recommender Systems (IRS), user preferences and the system's fairness status are constantly changing over time. Existing fairness-aware recommenders mainly consider fairness in static settings. Directly applying existing methods to IRS will result in poor recommendation. To resolve this problem, we propose a reinforcement learning based framework, FairRec, to dynamically maintain a long-term balance between accuracy and fairness in IRS. User preferences and the system's fairness status are jointly compressed into the state representation to generate recommendations. FairRec aims at maximizing our designed cumulative reward that combines accuracy and fairness. Extensive experiments validate that FairRec can improve fairness, while preserving good recommendation quality.
    Session-aware Linear Item-Item Models for Session-based Recommendation. (arXiv:2103.16104v2 [cs.IR] UPDATED)
    (2 min) Session-based recommendation aims at predicting the next item given a sequence of previous items consumed in the session, e.g., on e-commerce or multimedia streaming services. Specifically, session data exhibits some unique characteristics, i.e., session consistency and sequential dependency over items within the session, repeated item consumption, and session timeliness. In this paper, we propose simple-yet-effective linear models for considering the holistic aspects of the sessions. The comprehensive nature of our models helps improve the quality of session-based recommendation. More importantly, it provides a generalized framework for reflecting different perspectives of session data. Furthermore, since our models can be solved by closed-form solutions, they are highly scalable. Experimental results demonstrate that the proposed linear models show competitive or state-of-the-art performance in various metrics on several real-world datasets.
    DnS: Distill-and-Select for Efficient and Accurate Video Indexing and Retrieval. (arXiv:2106.13266v1 [cs.CV])
    (2 min) In this paper, we address the problem of high performance and computationally efficient content-based video retrieval in large-scale datasets. Current methods typically propose either: (i) fine-grained approaches employing spatio-temporal representations and similarity calculations, achieving high performance at a high computational cost or (ii) coarse-grained approaches representing/indexing videos as global vectors, where the spatio-temporal structure is lost, providing low performance but also having low computational cost. In this work, we propose a Knowledge Distillation framework, which we call Distill-and-Select (DnS), that starting from a well-performing fine-grained Teacher Network learns: a) Student Networks at different retrieval performance and computational efficiency trade-offs and b) a Selection Network that at test time rapidly directs samples to the appropriate student to maintain both high retrieval performance and high computational efficiency. We train several students with different architectures and arrive at different trade-offs of performance and efficiency, i.e., speed and storage requirements, including fine-grained students that store index videos using binary representations. Importantly, the proposed scheme allows Knowledge Distillation in large, unlabelled datasets -- this leads to good students. We evaluate DnS on five public datasets on three different video retrieval tasks and demonstrate a) that our students achieve state-of-the-art performance in several cases and b) that our DnS framework provides an excellent trade-off between retrieval performance, computational speed, and storage space. In specific configurations, our method achieves similar mAP with the teacher but is 20 times faster and requires 240 times less storage space. Our collected dataset and implementation are publicly available: https://github.com/mever-team/distill-and-select.
    Multimodal Emergent Fake News Detection via Meta Neural Process Networks. (arXiv:2106.13711v1 [cs.IR])
    (2 min) Fake news travels at unprecedented speeds, reaches global audiences and puts users and communities at great risk via social media platforms. Deep learning based models show good performance when trained on large amounts of labeled data on events of interest, whereas the performance of models tends to degrade on other events due to domain shift. Therefore, significant challenges are posed for existing detection approaches to detect fake news on emergent events, where large-scale labeled datasets are difficult to obtain. Moreover, adding the knowledge from newly emergent events requires to build a new model from scratch or continue to fine-tune the model, which can be challenging, expensive, and unrealistic for real-world settings. In order to address those challenges, we propose an end-to-end fake news detection framework named MetaFEND, which is able to learn quickly to detect fake news on emergent events with a few verified posts. Specifically, the proposed model integrates meta-learning and neural process methods together to enjoy the benefits of these approaches. In particular, a label embedding module and a hard attention mechanism are proposed to enhance the effectiveness by handling categorical information and trimming irrelevant posts. Extensive experiments are conducted on multimedia datasets collected from Twitter and Weibo. The experimental results show our proposed MetaFEND model can detect fake news on never-seen events effectively and outperform the state-of-the-art methods.
    Sentiment Progression based Searching and Indexing of Literary Textual Artefacts. (arXiv:2106.13767v1 [cs.IR])
    (2 min) Literary artefacts are generally indexed and searched based on titles, meta data and keywords over the years. This searching and indexing works well when user/reader already knows about that particular creative textual artefact or document. This indexing and search hardly takes into account interest and emotional makeup of readers and its mapping to books. When a person is looking for a literary textual artefact, he/she might be looking for not only information but also to seek the joy of reading. In case of literary artefacts, progression of emotions across the key events could prove to be the key for indexing and searching. In this paper, we establish clusters among literary artefacts based on computational relationships among sentiment progressions using intelligent text analysis. We have created a database of 1076 English titles + 20 Marathi titles and also used database this http URL with 16559 titles and their summaries. We have proposed Sentiment Progression based Indexing for searching and recommending books. This can be used to create personalized clusters of book titles of interest to readers. The analysis clearly suggests better searching and indexing when we are targeting book lovers looking for a particular type of book or creative artefact. This indexing and searching can find many real-life applications for recommending books.
    Recurrent Coupled Topic Modeling over Sequential Documents. (arXiv:2106.13732v1 [cs.IR])
    (2 min) The abundant sequential documents such as online archival, social media and news feeds are streamingly updated, where each chunk of documents is incorporated with smoothly evolving yet dependent topics. Such digital texts have attracted extensive research on dynamic topic modeling to infer hidden evolving topics and their temporal dependencies. However, most of the existing approaches focus on single-topic-thread evolution and ignore the fact that a current topic may be coupled with multiple relevant prior topics. In addition, these approaches also incur the intractable inference problem when inferring latent parameters, resulting in a high computational cost and performance degradation. In this work, we assume that a current topic evolves from all prior topics with corresponding coupling weights, forming the multi-topic-thread evolution. Our method models the dependencies between evolving topics and thoroughly encodes their complex multi-couplings across time steps. To conquer the intractable inference challenge, a new solution with a set of novel data augmentation techniques is proposed, which successfully discomposes the multi-couplings between evolving topics. A fully conjugate model is thus obtained to guarantee the effectiveness and efficiency of the inference technique. A novel Gibbs sampler with a backward-forward filter algorithm efficiently learns latent timeevolving parameters in a closed-form. In addition, the latent Indian Buffet Process (IBP) compound distribution is exploited to automatically infer the overall topic number and customize the sparse topic proportions for each sequential document without bias. The proposed method is evaluated on both synthetic and real-world datasets against the competitive baselines, demonstrating its superiority over the baselines in terms of the low per-word perplexity, high coherent topics, and better document time prediction.
    TableSense: Spreadsheet Table Detection with Convolutional Neural Networks. (arXiv:2106.13500v1 [cs.IR])
    (2 min) Spreadsheet table detection is the task of detecting all tables on a given sheet and locating their respective ranges. Automatic table detection is a key enabling technique and an initial step in spreadsheet data intelligence. However, the detection task is challenged by the diversity of table structures and table layouts on the spreadsheet. Considering the analogy between a cell matrix as spreadsheet and a pixel matrix as image, and encouraged by the successful application of Convolutional Neural Networks (CNN) in computer vision, we have developed TableSense, a novel end-to-end framework for spreadsheet table detection. First, we devise an effective cell featurization scheme to better leverage the rich information in each cell; second, we develop an enhanced convolutional neural network model for table detection to meet the domain-specific requirement on precise table boundary detection; third, we propose an effective uncertainty metric to guide an active learning based smart sampling algorithm, which enables the efficient build-up of a training dataset with 22,176 tables on 10,220 sheets with broad coverage of diverse table structures and layouts. Our evaluation shows that TableSense is highly effective with 91.3\% recall and 86.5\% precision in EoB-2 metric, a significant improvement over both the current detection algorithm that are used in commodity spreadsheet tools and state-of-the-art convolutional neural networks in computer vision.
    Interactive query expansion for professional search applications. (arXiv:2106.13528v1 [cs.IR])
    (2 min) Knowledge workers (such as healthcare information professionals, patent agents and recruitment professionals) undertake work tasks where search forms a core part of their duties. In these instances, the search task is often complex and time-consuming and requires specialist expert knowledge to formulate accurate search strategies. Interactive features such as query expansion can play a key role in supporting these tasks. However, generating query suggestions within a professional search context requires that consideration be given to the specialist, structured nature of the search strategies they employ. In this paper, we investigate a variety of query expansion methods applied to a collection of Boolean search strategies used in a variety of real-world professional search tasks. The results demonstrate the utility of context-free distributional language models and the value of using linguistic cues such as ngram order to optimise the balance between precision and recall.
    Domain-Specific Pretraining for Vertical Search: Case Study on Biomedical Literature. (arXiv:2106.13375v1 [cs.IR])
    (2 min) Information overload is a prevalent challenge in many high-value domains. A prominent case in point is the explosion of the biomedical literature on COVID-19, which swelled to hundreds of thousands of papers in a matter of months. In general, biomedical literature expands by two papers every minute, totalling over a million new papers every year. Search in the biomedical realm, and many other vertical domains is challenging due to the scarcity of direct supervision from click logs. Self-supervised learning has emerged as a promising direction to overcome the annotation bottleneck. We propose a general approach for vertical search based on domain-specific pretraining and present a case study for the biomedical domain. Despite being substantially simpler and not using any relevance labels for training or development, our method performs comparably or better than the best systems in the official TREC-COVID evaluation, a COVID-related biomedical search competition. Using distributed computing in modern cloud infrastructure, our system can scale to tens of millions of articles on PubMed and has been deployed as Microsoft Biomedical Search, a new search experience for biomedical literature: https://aka.ms/biomedsearch.
  • cs.LG updates on arXiv.org

    Quantitative approximation results for complex-valued neural networks. (arXiv:2102.13092v2 [math.FA] UPDATED)
    (2 min) Until recently, applications of neural networks in machine learning have almost exclusively relied on real-valued networks. It was recently observed, however, that complex-valued neural networks (CVNNs) exhibit superior performance in applications in which the input is naturally complex-valued, such as MRI fingerprinting. While the mathematical theory of real-valued networks has, by now, reached some level of maturity, this is far from true for complex-valued networks. In this paper, we analyze the expressivity of complex-valued networks by providing explicit quantitative error bounds for approximating $C^n$ functions on compact subsets of $\mathbb{C}^d$ by complex-valued neural networks that employ the modReLU activation function, given by $\sigma(z) = \mathrm{ReLU}(|z| - 1) \, \mathrm{sgn} (z)$, which is one of the most popular complex activation functions used in practice. We show that the derived approximation rates are optimal (up to log factors) in the class of modReLU networks with weights of moderate growth.
    Learning Linear Temporal Properties from Noisy Data: A MaxSAT Approach. (arXiv:2104.15083v2 [cs.LG] UPDATED)
    (2 min) We address the problem of inferring descriptions of system behavior using Linear Temporal Logic (LTL) from a finite set of positive and negative examples. Most of the existing approaches for solving such a task rely on predefined templates for guiding the structure of the inferred formula. The approaches that can infer arbitrary LTL formulas, on the other hand, are not robust to noise in the data. To alleviate such limitations, we devise two algorithms for inferring concise LTL formulas even in the presence of noise. Our first algorithm infers minimal LTL formulas by reducing the inference problem to a problem in maximum satisfiability and then using off-the-shelf MaxSAT solvers to find a solution. To the best of our knowledge, we are the first to incorporate the usage of MaxSAT solvers for inferring formulas in LTL. Our second learning algorithm relies on the first algorithm to derive a decision tree over LTL formulas based on a decision tree learning algorithm. We have implemented both our algorithms and verified that our algorithms are efficient in extracting concise LTL descriptions even in the presence of noise.
    Stochastic Recurrent Neural Network for Multistep Time Series Forecasting. (arXiv:2104.12311v3 [cs.LG] UPDATED)
    (2 min) Time series forecasting based on deep architectures has been gaining popularity in recent years due to their ability to model complex non-linear temporal dynamics. The recurrent neural network is one such model capable of handling variable-length input and output. In this paper, we leverage recent advances in deep generative models and the concept of state space models to propose a stochastic adaptation of the recurrent neural network for multistep-ahead time series forecasting, which is trained with stochastic gradient variational Bayes. In our model design, the transition function of the recurrent neural network, which determines the evolution of the hidden states, is stochastic rather than deterministic as in a regular recurrent neural network; this is achieved by incorporating a latent random variable into the transition process which captures the stochasticity of the temporal dynamics. Our model preserves the architectural workings of a recurrent neural network for which all relevant information is encapsulated in its hidden states, and this flexibility allows our model to be easily integrated into any deep architecture for sequential modelling. We test our model on a wide range of datasets from finance to healthcare; results show that the stochastic recurrent neural network consistently outperforms its deterministic counterpart.
    Deep Residual Echo Suppression with A Tunable Tradeoff Between Signal Distortion and Echo Suppression. (arXiv:2106.13531v1 [cs.SD])
    (0 min) In this paper, we propose a residual echo suppression method using a UNet neural network that directly maps the outputs of a linear acoustic echo canceler to the desired signal in the spectral domain. This system embeds a design parameter that allows a tunable tradeoff between the desired-signal distortion and residual echo suppression in double-talk scenarios. The system employs 136 thousand parameters, and requires 1.6 Giga floating-point operations per second and 10 Mega-bytes of memory. The implementation satisfies both the timing requirements of the AEC challenge and the computational and memory limitations of on-device applications. Experiments are conducted with 161~h of data from the AEC challenge database and from real independent recordings. We demonstrate the performance of the proposed system in real-life conditions and compare it with two competing methods regarding echo suppression and desired-signal distortion, generalization to various environments, and robustness to high echo levels.
    Primordial non-Gaussianity from the Completed SDSS-IV extended Baryon Oscillation Spectroscopic Survey I: Catalogue Preparation and Systematic Mitigation. (arXiv:2106.13724v1 [astro-ph.CO])
    (3 min) We investigate the large-scale clustering of the final spectroscopic sample of quasars from the recently completed extended Baryon Oscillation Spectroscopic Survey (eBOSS). The sample contains $343708$ objects in the redshift range $0.8<z<2.2$ and $72667$ objects with redshifts $2.2<z<3.5$, covering an effective area of $4699~{\rm deg}^{2}$. We develop a neural network-based approach to mitigate spurious fluctuations in the density field caused by spatial variations in the quality of the imaging data used to select targets for follow-up spectroscopy. Simulations are used with the same angular and radial distributions as the real data to estimate covariance matrices, perform error analyses, and assess residual systematic uncertainties. We measure the mean density contrast and cross-correlations of the eBOSS quasars against maps of potential sources of imaging systematics to address algorithm effectiveness, finding that the neural network-based approach outperforms standard linear regression. Stellar density is one of the most important sources of spurious fluctuations, and a new template constructed using data from the Gaia spacecraft provides the best match to the observed quasar clustering. The end-product from this work is a new value-added quasar catalogue with the improved weights to correct for nonlinear imaging systematic effects, which will be made public. Our quasar catalogue is used to measure the local-type primordial non-Gaussianity in our companion paper, Mueller et al. in preparation.
    CADDA: Class-wise Automatic Differentiable Data Augmentation for EEG Signals. (arXiv:2106.13695v1 [cs.LG])
    (0 min) Data augmentation is a key element of deep learning pipelines, as it informs the network during training about transformations of the input data that keep the label unchanged. Manually finding adequate augmentation methods and parameters for a given pipeline is however rapidly cumbersome. In particular, while intuition can guide this decision for images, the design and choice of augmentation policies remains unclear for more complex types of data, such as neuroscience signals. Moreover, label independent strategies might not be suitable for such structured data and class-dependent augmentations might be necessary. This idea has been surprisingly unexplored in the literature, while it is quite intuitive: changing the color of a car image does not change the object class to be predicted, but doing the same to the picture of an orange does. This paper aims to increase the generalization power added through class-wise data augmentation. Yet, as seeking transformations depending on the class largely increases the complexity of the task, using gradient-free optimization techniques as done by most existing automatic approaches becomes intractable for real-world datasets. For this reason we propose to use differentiable data augmentation amenable to gradient-based learning. EEG signals are a perfect example of data for which good augmentation policies are mostly unknown. In this work, we demonstrate the relevance of our approach on the clinically relevant sleep staging classification task, for which we also propose differentiable transformations.
    A Novel Self-Learning Framework for Bladder Cancer Grading Using Histopathological Images. (arXiv:2106.13559v1 [eess.IV])
    (0 min) Recently, bladder cancer has been significantly increased in terms of incidence and mortality. Currently, two subtypes are known based on tumour growth: non-muscle invasive (NMIBC) and muscle-invasive bladder cancer (MIBC). In this work, we focus on the MIBC subtype because it is of the worst prognosis and can spread to adjacent organs. We present a self-learning framework to grade bladder cancer from histological images stained via immunohistochemical techniques. Specifically, we propose a novel Deep Convolutional Embedded Attention Clustering (DCEAC) which allows classifying histological patches into different severity levels of the disease, according to the patterns established in the literature. The proposed DCEAC model follows a two-step fully unsupervised learning methodology to discern between non-tumour, mild and infiltrative patterns from high-resolution samples of 512x512 pixels. Our system outperforms previous clustering-based methods by including a convolutional attention module, which allows refining the features of the latent space before the classification stage. The proposed network exceeds state-of-the-art approaches by 2-3% across different metrics, achieving a final average accuracy of 0.9034 in a multi-class scenario. Furthermore, the reported class activation maps evidence that our model is able to learn by itself the same patterns that clinicians consider relevant, without incurring prior annotation steps. This fact supposes a breakthrough in muscle-invasive bladder cancer grading which bridges the gap with respect to train the model on labelled data.
    Circumpapillary OCT-Focused Hybrid Learning for Glaucoma Grading Using Tailored Prototypical Neural Networks. (arXiv:2106.13551v1 [eess.IV])
    (0 min) Glaucoma is one of the leading causes of blindness worldwide and Optical Coherence Tomography (OCT) is the quintessential imaging technique for its detection. Unlike most of the state-of-the-art studies focused on glaucoma detection, in this paper, we propose, for the first time, a novel framework for glaucoma grading using raw circumpapillary B-scans. In particular, we set out a new OCT-based hybrid network which combines hand-driven and deep learning algorithms. An OCT-specific descriptor is proposed to extract hand-crafted features related to the retinal nerve fibre layer (RNFL). In parallel, an innovative CNN is developed using skip-connections to include tailored residual and attention modules to refine the automatic features of the latent space. The proposed architecture is used as a backbone to conduct a novel few-shot learning based on static and dynamic prototypical networks. The k-shot paradigm is redefined giving rise to a supervised end-to-end system which provides substantial improvements discriminating between healthy, early and advanced glaucoma samples. The training and evaluation processes of the dynamic prototypical network are addressed from two fused databases acquired via Heidelberg Spectralis system. Validation and testing results reach a categorical accuracy of 0.9459 and 0.8788 for glaucoma grading, respectively. Besides, the high performance reported by the proposed model for glaucoma detection deserves a special mention. The findings from the class activation maps are directly in line with the clinicians' opinion since the heatmaps pointed out the RNFL as the most relevant structure for glaucoma diagnosis.
    Optimal Combination of Linear and Spectral Estimators for Generalized Linear Models. (arXiv:2008.03326v3 [stat.ML] UPDATED)
    (0 min) We study the problem of recovering an unknown signal $\boldsymbol x$ given measurements obtained from a generalized linear model with a Gaussian sensing matrix. Two popular solutions are based on a linear estimator $\hat{\boldsymbol x}^{\rm L}$ and a spectral estimator $\hat{\boldsymbol x}^{\rm s}$. The former is a data-dependent linear combination of the columns of the measurement matrix, and its analysis is quite simple. The latter is the principal eigenvector of a data-dependent matrix, and a recent line of work has studied its performance. In this paper, we show how to optimally combine $\hat{\boldsymbol x}^{\rm L}$ and $\hat{\boldsymbol x}^{\rm s}$. At the heart of our analysis is the exact characterization of the joint empirical distribution of $(\boldsymbol x, \hat{\boldsymbol x}^{\rm L}, \hat{\boldsymbol x}^{\rm s})$ in the high-dimensional limit. This allows us to compute the Bayes-optimal combination of $\hat{\boldsymbol x}^{\rm L}$ and $\hat{\boldsymbol x}^{\rm s}$, given the limiting distribution of the signal $\boldsymbol x$. When the distribution of the signal is Gaussian, then the Bayes-optimal combination has the form $\theta\hat{\boldsymbol x}^{\rm L}+\hat{\boldsymbol x}^{\rm s}$ and we derive the optimal combination coefficient. In order to establish the limiting distribution of $(\boldsymbol x, \hat{\boldsymbol x}^{\rm L}, \hat{\boldsymbol x}^{\rm s})$, we design and analyze an Approximate Message Passing (AMP) algorithm whose iterates give $\hat{\boldsymbol x}^{\rm L}$ and approach $\hat{\boldsymbol x}^{\rm s}$. Numerical simulations demonstrate the improvement of the proposed combination with respect to the two methods considered separately.
    Active Learning with Multifidelity Modeling for Efficient Rare Event Simulation. (arXiv:2106.13790v1 [stat.ML])
    (0 min) While multifidelity modeling provides a cost-effective way to conduct uncertainty quantification with computationally expensive models, much greater efficiency can be achieved by adaptively deciding the number of required high-fidelity (HF) simulations, depending on the type and complexity of the problem and the desired accuracy in the results. We propose a framework for active learning with multifidelity modeling emphasizing the efficient estimation of rare events. Our framework works by fusing a low-fidelity (LF) prediction with an HF-inferred correction, filtering the corrected LF prediction to decide whether to call the high-fidelity model, and for enhanced subsequent accuracy, adapting the correction for the LF prediction after every HF model call. The framework does not make any assumptions as to the LF model type or its correlations with the HF model. In addition, for improved robustness when estimating smaller failure probabilities, we propose using dynamic active learning functions that decide when to call the HF model. We demonstrate our framework using several academic case studies and two finite element (FE) model case studies: estimating Navier-Stokes velocities using the Stokes approximation and estimating stresses in a transversely isotropic model subjected to displacements via a coarsely meshed isotropic model. Across these case studies, not only did the proposed framework estimate the failure probabilities accurately, but compared with either Monte Carlo or a standard variance reduction method, it also required only a small fraction of the calls to the HF model.
    Littlestone Classes are Privately Online Learnable. (arXiv:2106.13513v1 [cs.LG])
    (0 min) We consider the problem of online classification under a privacy constraint. In this setting a learner observes sequentially a stream of labelled examples $(x_t, y_t)$, for $1 \leq t \leq T$, and returns at each iteration $t$ a hypothesis $h_t$ which is used to predict the label of each new example $x_t$. The learner's performance is measured by her regret against a known hypothesis class $\mathcal{H}$. We require that the algorithm satisfies the following privacy constraint: the sequence $h_1, \ldots, h_T$ of hypotheses output by the algorithm needs to be an $(\epsilon, \delta)$-differentially private function of the whole input sequence $(x_1, y_1), \ldots, (x_T, y_T)$. We provide the first non-trivial regret bound for the realizable setting. Specifically, we show that if the class $\mathcal{H}$ has constant Littlestone dimension then, given an oblivious sequence of labelled examples, there is a private learner that makes in expectation at most $O(\log T)$ mistakes -- comparable to the optimal mistake bound in the non-private case, up to a logarithmic factor. Moreover, for general values of the Littlestone dimension $d$, the same mistake bound holds but with a doubly-exponential in $d$ factor. A recent line of work has demonstrated a strong connection between classes that are online learnable and those that are differentially-private learnable. Our results strengthen this connection and show that an online learning algorithm can in fact be directly privatized (in the realizable setting). We also discuss an adaptive setting and provide a sublinear regret bound of $O(\sqrt{T})$.
    Sketching Datasets for Large-Scale Learning (long version). (arXiv:2008.01839v3 [stat.ML] UPDATED)
    (0 min) This article considers "compressive learning," an approach to large-scale machine learning where datasets are massively compressed before learning (e.g., clustering, classification, or regression) is performed. In particular, a "sketch" is first constructed by computing carefully chosen nonlinear random features (e.g., random Fourier features) and averaging them over the whole dataset. Parameters are then learned from the sketch, without access to the original dataset. This article surveys the current state-of-the-art in compressive learning, including the main concepts and algorithms, their connections with established signal-processing methods, existing theoretical guarantees -- on both information preservation and privacy preservation, and important open problems.
    Tensor-based framework for training flexible neural networks. (arXiv:2106.13542v1 [cs.LG])
    (0 min) Activation functions (AFs) are an important part of the design of neural networks (NNs), and their choice plays a predominant role in the performance of a NN. In this work, we are particularly interested in the estimation of flexible activation functions using tensor-based solutions, where the AFs are expressed as a weighted sum of predefined basis functions. To do so, we propose a new learning algorithm which solves a constrained coupled matrix-tensor factorization (CMTF) problem. This technique fuses the first and zeroth order information of the NN, where the first-order information is contained in a Jacobian tensor, following a constrained canonical polyadic decomposition (CPD). The proposed algorithm can handle different decomposition bases. The goal of this method is to compress large pretrained NN models, by replacing subnetworks, {\em i.e.,} one or multiple layers of the original network, by a new flexible layer. The approach is applied to a pretrained convolutional neural network (CNN) used for character classification.
    HyperNP: Interactive Visual Exploration of Multidimensional Projection Hyperparameters. (arXiv:2106.13777v1 [cs.LG])
    (0 min) Projection algorithms such as t-SNE or UMAP are useful for the visualization of high dimensional data, but depend on hyperparameters which must be tuned carefully. Unfortunately, iteratively recomputing projections to find the optimal hyperparameter value is computationally intensive and unintuitive due to the stochastic nature of these methods. In this paper we propose HyperNP, a scalable method that allows for real-time interactive hyperparameter exploration of projection methods by training neural network approximations. HyperNP can be trained on a fraction of the total data instances and hyperparameter configurations and can compute projections for new data and hyperparameters at interactive speeds. HyperNP is compact in size and fast to compute, thus allowing it to be embedded in lightweight visualization systems such as web browsers. We evaluate the performance of the HyperNP across three datasets in terms of performance and speed. The results suggest that HyperNP is accurate, scalable, interactive, and appropriate for use in real-world settings.
    MDP Playground: A Design and Debug Testbed for Reinforcement Learning. (arXiv:1909.07750v4 [cs.LG] UPDATED)
    (0 min) We present \emph{MDP Playground}, an efficient testbed for Reinforcement Learning (RL) agents with \textit{orthogonal} dimensions that can be controlled independently to challenge agents in different ways and obtain varying degrees of hardness in generated environments. We consider and allow control over a wide variety of dimensions, including \textit{delayed rewards}, \textit{rewardable sequences}, \textit{density of rewards}, \textit{stochasticity}, \textit{image representations}, \textit{irrelevant features}, \textit{time unit}, \textit{action range} and more. We define a parameterised collection of fast-to-run toy environments in \textit{OpenAI Gym} by varying these dimensions and propose to use these for the initial design and development of agents. We also provide wrappers that inject these dimensions into complex environments from \textit{Atari} and \textit{Mujoco} to allow for evaluating agent robustness. We further provide various example use-cases and instructions on how to use \textit{MDP Playground} to design and debug agents. We believe that \textit{MDP Playground} is a valuable testbed for researchers designing new, adaptive and intelligent RL agents and those wanting to unit test their agents.
    Covariance-Aware Private Mean Estimation Without Private Covariance Estimation. (arXiv:2106.13329v1 [cs.LG])
    (0 min) We present two sample-efficient differentially private mean estimators for $d$-dimensional (sub)Gaussian distributions with unknown covariance. Informally, given $n \gtrsim d/\alpha^2$ samples from such a distribution with mean $\mu$ and covariance $\Sigma$, our estimators output $\tilde\mu$ such that $\| \tilde\mu - \mu \|_{\Sigma} \leq \alpha$, where $\| \cdot \|_{\Sigma}$ is the Mahalanobis distance. All previous estimators with the same guarantee either require strong a priori bounds on the covariance matrix or require $\Omega(d^{3/2})$ samples. Each of our estimators is based on a simple, general approach to designing differentially private mechanisms, but with novel technical steps to make the estimator private and sample-efficient. Our first estimator samples a point with approximately maximum Tukey depth using the exponential mechanism, but restricted to the set of points of large Tukey depth. Proving that this mechanism is private requires a novel analysis. Our second estimator perturbs the empirical mean of the data set with noise calibrated to the empirical covariance, without releasing the covariance itself. Its sample complexity guarantees hold more generally for subgaussian distributions, albeit with a slightly worse dependence on the privacy parameter. For both estimators, careful preprocessing of the data is required to satisfy differential privacy.
    Bayesian Neural Networks: Essentials. (arXiv:2106.13594v1 [cs.LG])
    (0 min) Bayesian neural networks utilize probabilistic layers that capture uncertainty over weights and activations, and are trained using Bayesian inference. Since these probabilistic layers are designed to be drop-in replacement of their deterministic counter parts, Bayesian neural networks provide a direct and natural way to extend conventional deep neural networks to support probabilistic deep learning. However, it is nontrivial to understand, design and train Bayesian neural networks due to their complexities. We discuss the essentials of Bayesian neural networks including duality (deep neural networks, probabilistic models), approximate Bayesian inference, Bayesian priors, Bayesian posteriors, and deep variational learning. We use TensorFlow Probability APIs and code examples for illustration. The main problem with Bayesian neural networks is that the architecture of deep neural networks makes it quite redundant, and costly, to account for uncertainty for a large number of successive layers. Hybrid Bayesian neural networks, which use few probabilistic layers judicially positioned in the networks, provide a practical solution.
    Re-parameterizing VAEs for stability. (arXiv:2106.13739v1 [cs.LG])
    (0 min) We propose a theoretical approach towards the training numerical stability of Variational AutoEncoders (VAE). Our work is motivated by recent studies empowering VAEs to reach state of the art generative results on complex image datasets. These very deep VAE architectures, as well as VAEs using more complex output distributions, highlight a tendency to haphazardly produce high training gradients as well as NaN losses. The empirical fixes proposed to train them despite their limitations are neither fully theoretically grounded nor generally sufficient in practice. Building on this, we localize the source of the problem at the interface between the model's neural networks and their output probabilistic distributions. We explain a common source of instability stemming from an incautious formulation of the encoded Normal distribution's variance, and apply the same approach on other, less obvious sources. We show that by implementing small changes to the way we parameterize the Normal distributions on which they rely, VAEs can securely be trained.
    Safe Learning-based Observers for Unknown Nonlinear Systems using Bayesian Optimization. (arXiv:2005.05888v2 [eess.SY] UPDATED)
    (0 min) Data generated from dynamical systems with unknown dynamics enable the learning of state observers that are: robust to modeling error, computationally tractable to design, and capable of operating with guaranteed performance. In this paper, a modular design methodology is formulated, that consists of three design phases: (i) an initial robust observer design that enables one to learn the dynamics without allowing the state estimation error to diverge (hence, safe); (ii) a learning phase wherein the unmodeled components are estimated using Bayesian optimization and Gaussian processes; and, (iii) a re-design phase that leverages the learned dynamics to improve convergence rate of the state estimation error. The potential of our proposed learning-based observer is demonstrated on a benchmark nonlinear system. Additionally, certificates of guaranteed estimation performance are provided.
    Transfer Learning in Bandits with Latent Continuity. (arXiv:2102.02472v2 [cs.LG] UPDATED)
    (0 min) Structured stochastic multi-armed bandits provide accelerated regret rates over the standard unstructured bandit problems. Most structured bandits, however, assume the knowledge of the structural parameter such as Lipschitz continuity, which is often not available. To cope with the latent structural parameter, we consider a transfer learning setting in which an agent must learn to transfer the structural information from the prior tasks to the next task, which is inspired by practical problems such as rate adaptation in wireless link. We propose a novel framework to provably and accurately estimate the Lipschitz constant based on previous tasks and fully exploit it for the new task at hand. We analyze the efficiency of the proposed framework in two folds: (i) the sample complexity of our estimator matches with the information-theoretic fundamental limit; and (ii) our regret bound on the new task is close to that of the oracle algorithm with the full knowledge of the Lipschitz constant under mild assumptions. Our analysis reveals a set of useful insights on transfer learning for latent Lipschitzconstants such as the fundamental challenge a learner faces. Our numerical evaluations confirm our theoretical findings and show the superiority of the proposed framework compared to baselines.
    DAEMON: Dataset-Agnostic Explainable Malware Classification Using Multi-Stage Feature Mining. (arXiv:2008.01855v2 [cs.CR] UPDATED)
    (0 min) Numerous metamorphic and polymorphic malicious variants are generated automatically on a daily basis by mutation engines that transform the code of a malicious program while retaining its functionality, in order to evade signature-based detection. These automatic processes have greatly increased the number of malware variants, deeming their fully-manual analysis impossible. Malware classification is the task of determining to which family a new malicious variant belongs. Variants of the same malware family show similar behavioral patterns. Thus, classifying newly discovered malicious programs and applications helps assess the risks they pose. Moreover, malware classification facilitates determining which of the newly discovered variants should undergo manual analysis by a security expert, in order to determine whether they belong to a new family (e.g., one whose members exploit a zero-day vulnerability) or are simply the result of a concept drift within a known malicious family. This motivated intense research in recent years on devising high-accuracy automatic tools for malware classification. In this work, we present DAEMON - a novel dataset-agnostic malware classifier. A key property of DAEMON is that the type of features it uses and the manner in which they are mined facilitate understanding the distinctive behavior of malware families, making its classification decisions explainable. We've optimized DAEMON using a large-scale dataset of x86 binaries, belonging to a mix of several malware families targeting computers running Windows. We then re-trained it and applied it, without any algorithmic change, feature re-engineering or parameter tuning, to two other large-scale datasets of malicious Android applications consisting of numerous malware families. DAEMON obtained highly accurate classification results on all datasets, establishing that it is also platform-agnostic.
    Nonlinear Acoustic Echo Cancellation with Deep Learning. (arXiv:2106.13754v1 [cs.SD])
    (0 min) We propose a nonlinear acoustic echo cancellation system, which aims to model the echo path from the far-end signal to the near-end microphone in two parts. Inspired by the physical behavior of modern hands-free devices, we first introduce a novel neural network architecture that is specifically designed to model the nonlinear distortions these devices induce between receiving and playing the far-end signal. To account for variations between devices, we construct this network with trainable memory length and nonlinear activation functions that are not parameterized in advance, but are rather optimized during the training stage using the training data. Second, the network is succeeded by a standard adaptive linear filter that constantly tracks the echo path between the loudspeaker output and the microphone. During training, the network and filter are jointly optimized to learn the network parameters. This system requires 17 thousand parameters that consume 500 Million floating-point operations per second and 40 Kilo-bytes of memory. It also satisfies hands-free communication timing requirements on a standard neural processor, which renders it adequate for embedding on hands-free communication devices. Using 280 hours of real and synthetic data, experiments show advantageous performance compared to competing methods.
    A Source-Criticism Debiasing Method for GloVe Embeddings. (arXiv:2106.13382v1 [cs.CL])
    (0 min) It is well-documented that word embeddings trained on large public corpora consistently exhibit known human social biases. Although many methods for debiasing exist, almost all fixate on completely eliminating biased information from the embeddings and often diminish training set size in the process. In this paper, we present a simple yet effective method for debiasing GloVe word embeddings (Pennington et al., 2014) which works by incorporating explicit information about training set bias rather than removing biased data outright. Our method runs quickly and efficiently with the help of a fast bias gradient approximation method from Brunet et al. (2019). As our approach is akin to the notion of 'source criticism' in the humanities, we term our method Source-Critical GloVe (SC-GloVe). We show that SC-GloVe reduces the effect size on Word Embedding Association Test (WEAT) sets without sacrificing training data or TOP-1 performance.
    Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. (arXiv:2007.15779v5 [cs.CL] UPDATED)
    (0 min) Pretraining large neural language models, such as BERT, has led to impressive gains on many natural language processing (NLP) tasks. However, most pretraining efforts focus on general domain corpora, such as newswire and Web. A prevailing assumption is that even domain-specific pretraining can benefit by starting from general-domain language models. In this paper, we challenge this assumption by showing that for domains with abundant unlabeled text, such as biomedicine, pretraining language models from scratch results in substantial gains over continual pretraining of general-domain language models. To facilitate this investigation, we compile a comprehensive biomedical NLP benchmark from publicly-available datasets. Our experiments show that domain-specific pretraining serves as a solid foundation for a wide range of biomedical NLP tasks, leading to new state-of-the-art results across the board. Further, in conducting a thorough evaluation of modeling choices, both for pretraining and task-specific fine-tuning, we discover that some common practices are unnecessary with BERT models, such as using complex tagging schemes in named entity recognition (NER). To help accelerate research in biomedical NLP, we have released our state-of-the-art pretrained and task-specific models for the community, and created a leaderboard featuring our BLURB benchmark (short for Biomedical Language Understanding & Reasoning Benchmark) at https://aka.ms/BLURB.
    Multi-Goal Reinforcement Learning environments for simulated Franka Emika Panda robot. (arXiv:2106.13687v1 [cs.LG])
    (0 min) This technical report presents panda-gym, a set Reinforcement Learning (RL) environments for the Franka Emika Panda robot integrated with OpenAI Gym. Five tasks are included: reach, push, slide, pick & place and stack. They all follow a Multi-Goal RL framework, allowing to use goal-oriented RL algorithms. To foster open-research, we chose to use the open-source physics engine PyBullet. The implementation chosen for this package allows to define very easily new tasks or new robots. This report also presents a baseline of results obtained with state-of-the-art model-free off-policy algorithms. panda-gym is open-source at https://github.com/qgallouedec/panda-gym.
    Single Image Texture Translation for Data Augmentation. (arXiv:2106.13804v1 [cs.CV])
    (0 min) Recent advances in image synthesis enables one to translate images by learning the mapping between a source domain and a target domain. Existing methods tend to learn the distributions by training a model on a variety of datasets, with results evaluated largely in a subjective manner. Relatively few works in this area, however, study the potential use of semantic image translation methods for image recognition tasks. In this paper, we explore the use of Single Image Texture Translation (SITT) for data augmentation. We first propose a lightweight model for translating texture to images based on a single input of source texture, allowing for fast training and testing. Based on SITT, we then explore the use of augmented data in long-tailed and few-shot image classification tasks. We find the proposed method is capable of translating input data into a target domain, leading to consistent improved image recognition performance. Finally, we examine how SITT and related image translation methods can provide a basis for a data-efficient, augmentation engineering approach to model training.
    An Inertial Block Majorization Minimization Framework for Nonsmooth Nonconvex Optimization. (arXiv:2010.12133v2 [math.OC] UPDATED)
    (0 min) In this paper, we introduce TITAN, a novel inerTIal block majorizaTion minimizAtioN framework for non-smooth non-convex optimization problems. To the best of our knowledge, TITAN is the first framework of block-coordinate update method that relies on the majorization-minimization framework while embedding inertial force to each step of the block updates. The inertial force is obtained via an extrapolation operator that subsumes heavy-ball and Nesterov-type accelerations for block proximal gradient methods as special cases. By choosing various surrogate functions, such as proximal, Lipschitz gradient, Bregman, quadratic, and composite surrogate functions, and by varying the extrapolation operator, TITAN produces a rich set of inertial block-coordinate update methods. We study sub-sequential convergence as well as global convergence for the generated sequence of TITAN. We illustrate the effectiveness of TITAN on two important machine learning problems, namely sparse non-negative matrix factorization and matrix completion.
    A hybrid model-based and learning-based approach for classification using limited number of training samples. (arXiv:2106.13436v1 [cs.LG])
    (2 min) The fundamental task of classification given a limited number of training data samples is considered for physical systems with known parametric statistical models. The standalone learning-based and statistical model-based classifiers face major challenges towards the fulfillment of the classification task using a small training set. Specifically, classifiers that solely rely on the physics-based statistical models usually suffer from their inability to properly tune the underlying unobservable parameters, which leads to a mismatched representation of the system's behaviors. Learning-based classifiers, on the other hand, typically rely on a large number of training data from the underlying physical process, which might not be feasible in most practical scenarios. In this paper, a hybrid classification method -- termed HyPhyLearn -- is proposed that exploits both the physics-based statistical models and the learning-based classifiers. The proposed solution is based on the conjecture that HyPhyLearn would alleviate the challenges associated with the individual approaches of learning-based and statistical model-based classifiers by fusing their respective strengths. The proposed hybrid approach first estimates the unobservable model parameters using the available (suboptimal) statistical estimation procedures, and subsequently use the physics-based statistical models to generate synthetic data. Then, the training data samples are incorporated with the synthetic data in a learning-based classifier that is based on domain-adversarial training of neural networks. Specifically, in order to address the mismatch problem, the classifier learns a mapping from the training data and the synthetic data to a common feature space. Simultaneously, the classifier is trained to find discriminative features within this space in order to fulfill the classification task.
    Assessing the Lockdown Effects on Air Quality during COVID-19 Era. (arXiv:2106.13750v1 [cs.LG])
    (0 min) In this work we investigate the short-term variations in air quality emissions, attributed to the prevention measures, applied in different cities, to mitigate the COVID-19 spread. In particular, we emphasize on the concentration effects regarding specific pollutant gases, such as carbon monoxide (CO), ozone (O3), nitrogen dioxide (NO2) and sulphur dioxide (SO2). The assessment of the impact of lockdown on air quality focused on four European Cities (Athens, Gladsaxe, Lodz and Rome). Available data on pollutant factors were obtained using global satellite observations. The level of the employed prevention measures is employed using the Oxford COVID-19 Government Response Tracker. The second part of the analysis employed a variety of machine learning tools, utilized for estimating the concentration of each pollutant, two days ahead. The results showed that a weak to moderate correlation exists between the corresponding measures and the pollutant factors and that it is possible to create models which can predict the behaviour of the pollutant gases under daily human activities.
    Generalized Multimodal ELBO. (arXiv:2105.02470v2 [cs.LG] UPDATED)
    (0 min) Multiple data types naturally co-occur when describing real-world phenomena and learning from them is a long-standing goal in machine learning research. However, existing self-supervised generative models approximating an ELBO are not able to fulfill all desired requirements of multimodal models: their posterior approximation functions lead to a trade-off between the semantic coherence and the ability to learn the joint data distribution. We propose a new, generalized ELBO formulation for multimodal data that overcomes these limitations. The new objective encompasses two previous methods as special cases and combines their benefits without compromises. In extensive experiments, we demonstrate the advantage of the proposed method compared to state-of-the-art models in self-supervised, generative learning tasks.
    Ranger21: a synergistic deep learning optimizer. (arXiv:2106.13731v1 [cs.LG])
    (0 min) As optimizers are critical to the performances of neural networks, every year a large number of papers innovating on the subject are published. However, while most of these publications provide incremental improvements to existing algorithms, they tend to be presented as new optimizers rather than composable algorithms. Thus, many worthwhile improvements are rarely seen out of their initial publication. Taking advantage of this untapped potential, we introduce Ranger21, a new optimizer which combines AdamW with eight components, carefully selected after reviewing and testing ideas from the literature. We found that the resulting optimizer provides significantly improved validation accuracy and training speed, smoother training curves, and is even able to train a ResNet50 on ImageNet2012 without Batch Normalization layers. A problem on which AdamW stays systematically stuck in a bad initial state.
    Universal Approximation of Residual Flows in Maximum Mean Discrepancy. (arXiv:2103.05793v2 [cs.LG] UPDATED)
    (0 min) Normalizing flows are a class of flexible deep generative models that offer easy likelihood computation. Despite their empirical success, there is little theoretical understanding of their expressiveness. In this work, we study residual flows, a class of normalizing flows composed of Lipschitz residual blocks. We prove residual flows are universal approximators in maximum mean discrepancy. We provide upper bounds on the number of residual blocks to achieve approximation under different assumptions.
    Interval and fuzzy physics-informed neural networks for uncertain fields. (arXiv:2106.13727v1 [math.NA])
    (2 min) Temporally and spatially dependent uncertain parameters are regularly encountered in engineering applications. Commonly these uncertainties are accounted for using random fields and processes which require knowledge about the appearing probability distributions functions which is not readily available. In these cases non-probabilistic approaches such as interval analysis and fuzzy set theory are helpful uncertainty measures. Partial differential equations involving fuzzy and interval fields are traditionally solved using the finite element method where the input fields are sampled using some basis function expansion methods. This approach however is problematic, as it is reliant on knowledge about the spatial correlation fields. In this work we utilize physics-informed neural networks (PINNs) to solve interval and fuzzy partial differential equations. The resulting network structures termed interval physics-informed neural networks (iPINNs) and fuzzy physics-informed neural networks (fPINNs) show promising results for obtaining bounded solutions of equations involving spatially uncertain parameter fields. In contrast to finite element approaches, no correlation length specification of the input fields as well as no averaging via Monte-Carlo simulations are necessary. In fact, information about the input interval fields is obtained directly as a byproduct of the presented solution scheme. Furthermore, all major advantages of PINNs are retained, i.e. meshfree nature of the scheme, and ease of inverse problem set-up.
    Robustness and Generalization to Nearest Categories. (arXiv:2011.08485v3 [cs.LG] UPDATED)
    (0 min) Adversarial robustness has emerged as a desirable property for neural networks. Prior work shows that robust networks perform well in some out-of-distribution generalization tasks, such as transfer learning and outlier detection. We uncover a different kind of out-of-distribution generalization property of such networks, and find that they also do well in a task that we call nearest category generalization (NCG) - given an out-of-distribution input, they tend to predict the same label as that of the closest training example. We empirically show that this happens even when the out-of-distribution inputs lie outside the robustness radius of the training data, which suggests that these networks may generalize better along unseen directions on the natural image manifold than arbitrary unseen directions. We examine how performance changes when we change the robustness regions during training. We then design experiments to investigate the connection between out-of-distribution detection and nearest category generalization. Taken together, our work provides evidence that robust neural networks may resemble nearest neighbor classifiers in their behavior on out-of-distribution data. The code is available at https://github.com/yangarbiter/nearest-category-generalization
    Post Selections Using Test Sets (PSUTS) and How Developmental Networks Avoid Them. (arXiv:2106.13233v1 [cs.LG])
    (2 min) This paper raises a rarely reported practice in Artificial Intelligence (AI) called Post Selection Using Test Sets (PSUTS). Consequently, the popular error-backprop methodology in deep learning lacks an acceptable generalization power. All AI methods fall into two broad schools, connectionist and symbolic. The PSUTS fall into two kinds, machine PSUTS and human PSUTS. The connectionist school received criticisms for its "scruffiness" due to a huge number of network parameters and now the worse machine PSUTS; but the seemingly "clean" symbolic school seems more brittle because of a weaker generalization power using human PSUTS. This paper formally defines what PSUTS is, analyzes why error-backprop methods with random initial weights suffer from severe local minima, why PSUTS violates well-established research ethics, and how every paper that used PSUTS should have at least transparently reported PSUTS. For improved transparency in future publications, this paper proposes a new standard for performance evaluation of AI, called developmental errors for all networks trained, along with Three Learning Conditions: (1) an incremental learning architecture, (2) a training experience and (3) a limited amount of computational resources. Developmental Networks avoid PSUTS and are not "scruffy" because they drive Emergent Turing Machines and are optimal in the sense of maximum-likelihood across lifetime.
    Phoneme-aware and Channel-wise Attentive Learning for Text DependentSpeaker Verification. (arXiv:2106.13514v1 [cs.SD])
    (2 min) This paper proposes a multi-task learning network with phoneme-aware and channel-wise attentive learning strategies for text-dependent Speaker Verification (SV). In the proposed structure, the frame-level multi-task learning along with the segment-level adversarial learning is adopted for speaker embedding extraction. The phoneme-aware attentive pooling is exploited on frame-level features in the main network for speaker classifier, with the corresponding posterior probability for the phoneme distribution in the auxiliary subnet. Further, the introduction of Squeeze and Excitation (SE-block) performs dynamic channel-wise feature recalibration, which improves the representational ability. The proposed method exploits speaker idiosyncrasies associated with pass-phrases, and is further improved by the phoneme-aware attentive pooling and SE-block from temporal and channel-wise aspects, respectively. The experiments conducted on RSR2015 Part 1 database confirm that the proposed system achieves outstanding results for textdependent SV.
    An Empirical Study of Graph-Based Approaches for Semi-Supervised Time Series Classification. (arXiv:2104.08153v2 [cs.LG] UPDATED)
    (2 min) Time series data play an important role in many applications and their analysis reveals crucial information for understanding the underlying processes. Among the many time series learning tasks of great importance, we here focus on semi-supervised learning based on a graph representation of the data. Two main aspects are involved in this task. A suitable distance measure to evaluate the similarities between time series, and a learning method to make predictions based on these distances. However, the relationship between the two aspects has never been studied systematically in the context of graph-based learning. We describe four different distance measures, including (Soft) DTW and MPDist, a distance measure based on the Matrix Profile, as well as four successful semi-supervised learning methods, including the graph Allen--Cahn method and a Graph Convolutional Neural Network. We then compare the performance of the algorithms on binary classification data sets. In our findings we compare the chosen graph-based methods using all distance measures and observe that the results vary strongly with respect to the accuracy. As predicted by the ``no free lunch'' theorem, no clear best combination to employ in all cases is found. Our study provides a reproducible framework for future work in the direction of semi-supervised learning for time series with a focus on graph representations.
    Joslim: Joint Widths and Weights Optimization for Slimmable Neural Networks. (arXiv:2007.11752v3 [cs.LG] UPDATED)
    (2 min) Slimmable neural networks provide a flexible trade-off front between prediction error and computational requirement (such as the number of floating-point operations or FLOPs) with the same storage requirement as a single model. They are useful for reducing maintenance overhead for deploying models to devices with different memory constraints and are useful for optimizing the efficiency of a system with many CNNs. However, existing slimmable network approaches either do not optimize layer-wise widths or optimize the shared-weights and layer-wise widths independently, thereby leaving significant room for improvement by joint width and weight optimization. In this work, we propose a general framework to enable joint optimization for both width configurations and weights of slimmable networks. Our framework subsumes conventional and NAS-based slimmable methods as special cases and provides flexibility to improve over existing methods. From a practical standpoint, we propose Joslim, an algorithm that jointly optimizes both the widths and weights for slimmable nets, which outperforms existing methods for optimizing slimmable networks across various networks, datasets, and objectives. Quantitatively, improvements up to 1.7% and 8% in top-1 accuracy on the ImageNet dataset can be attained for MobileNetV2 considering FLOPs and memory footprint, respectively. Our results highlight the potential of optimizing the channel counts for different layers jointly with the weights for slimmable networks. Code available at https://github.com/cmu-enyac/Joslim.
    Non-stationary Online Learning with Memory and Non-stochastic Control. (arXiv:2102.03758v2 [cs.LG] UPDATED)
    (0 min) We study the problem of Online Convex Optimization (OCO) with memory, which allows loss functions to depend on past decisions and thus captures temporal effects of learning problems. In this paper, we introduce dynamic policy regret as the performance measure to design algorithms robust to non-stationary environments, which competes algorithms' decisions with a sequence of changing comparators. We propose a novel algorithm for OCO with memory that provably enjoys an optimal dynamic policy regret. The key technical challenge is how to control the switching cost, the cumulative movements of player's decisions, which is neatly addressed by a novel decomposition of dynamic policy regret and an appropriate meta-expert structure. Furthermore, we apply the results to the problem of online non-stochastic control, i.e., controlling a linear dynamical system with adversarial disturbance and convex loss functions. We derive a novel gradient-based controller with dynamic policy regret guarantees, which is the first controller competitive to a sequence of changing policies.
    Fed-NILM: A Federated Learning-based Non-Intrusive Load Monitoring Method for Privacy-Protection. (arXiv:2105.11085v2 [cs.LG] UPDATED)
    (2 min) Non-intrusive load monitoring (NILM) is essential for understanding customer's power consumption patterns and may find wide applications like carbon emission reduction and energy conservation. The training of NILM models requires massive load data containing different types of appliances. However, inadequate load data and the risk of power consumer privacy breaches may be encountered by local data owners during the NILM model training. To prevent such potential risks, a novel NILM method named Fed-NILM which is based on Federated Learning (FL) is proposed in this paper. In Fed-NILM, local model parameters instead of local load data are shared among multiple data owners. The global model is obtained by weighted averaging the parameters. Experiments based on two measured load datasets are conducted to explore the generalization ability of Fed-NILM. Besides, a comparison of Fed-NILM with locally-trained NILMs and the centrally-trained NILM is conducted. The experimental results show that Fed-NILM has superior performance in scalability and convergence. Fed-NILM outperforms locally-trained NILMs operated by local data owners and approximates the centrally-trained NILM which is trained on the entire load dataset without privacy protection. The proposed Fed-NILM significantly improves the co-modeling capabilities of local data owners while protecting power consumers' privacy.
    Bayesian Inference in High-Dimensional Time-Serieswith the Orthogonal Stochastic Linear Mixing Model. (arXiv:2106.13379v1 [cs.LG])
    (0 min) Many modern time-series datasets contain large numbers of output response variables sampled for prolonged periods of time. For example, in neuroscience, the activities of 100s-1000's of neurons are recorded during behaviors and in response to sensory stimuli. Multi-output Gaussian process models leverage the nonparametric nature of Gaussian processes to capture structure across multiple outputs. However, this class of models typically assumes that the correlations between the output response variables are invariant in the input space. Stochastic linear mixing models (SLMM) assume the mixture coefficients depend on input, making them more flexible and effective to capture complex output dependence. However, currently, the inference for SLMMs is intractable for large datasets, making them inapplicable to several modern time-series problems. In this paper, we propose a new regression framework, the orthogonal stochastic linear mixing model (OSLMM) that introduces an orthogonal constraint amongst the mixing coefficients. This constraint reduces the computational burden of inference while retaining the capability to handle complex output dependence. We provide Markov chain Monte Carlo inference procedures for both SLMM and OSLMM and demonstrate superior model scalability and reduced prediction error of OSLMM compared with state-of-the-art methods on several real-world applications. In neurophysiology recordings, we use the inferred latent functions for compact visualization of population responses to auditory stimuli, and demonstrate superior results compared to a competing method (GPFA). Together, these results demonstrate that OSLMM will be useful for the analysis of diverse, large-scale time-series datasets.
    Conjugate Energy-Based Models. (arXiv:2106.13798v1 [cs.LG])
    (2 min) In this paper, we propose conjugate energy-based models (CEBMs), a new class of energy-based models that define a joint density over data and latent variables. The joint density of a CEBM decomposes into an intractable distribution over data and a tractable posterior over latent variables. CEBMs have similar use cases as variational autoencoders, in the sense that they learn an unsupervised mapping from data to latent variables. However, these models omit a generator network, which allows them to learn more flexible notions of similarity between data points. Our experiments demonstrate that conjugate EBMs achieve competitive results in terms of image modelling, predictive power of latent space, and out-of-domain detection on a variety of datasets.
    BlockGNN: Towards Efficient GNN Acceleration Using Block-Circulant Weight Matrices. (arXiv:2104.06214v2 [cs.AI] UPDATED)
    (2 min) In recent years, Graph Neural Networks (GNNs) appear to be state-of-the-art algorithms for analyzing non-euclidean graph data. By applying deep-learning to extract high-level representations from graph structures, GNNs achieve extraordinary accuracy and great generalization ability in various tasks. However, with the ever-increasing graph sizes, more and more complicated GNN layers, and higher feature dimensions, the computational complexity of GNNs grows exponentially. How to inference GNNs in real time has become a challenging problem, especially for some resource-limited edge-computing platforms. To tackle this challenge, we propose BlockGNN, a software-hardware co-design approach to realize efficient GNN acceleration. At the algorithm level, we propose to leverage block-circulant weight matrices to greatly reduce the complexity of various GNN models. At the hardware design level, we propose a pipelined CirCore architecture, which supports efficient block-circulant matrices computation. Basing on CirCore, we present a novel BlockGNN accelerator to compute various GNNs with low latency. Moreover, to determine the optimal configurations for diverse deployed tasks, we also introduce a performance and resource model that helps choose the optimal hardware parameters automatically. Comprehensive experiments on the ZC706 FPGA platform demonstrate that on various GNN tasks, BlockGNN achieves up to $8.3\times$ speedup compared to the baseline HyGCN architecture and $111.9\times$ energy reduction compared to the Intel Xeon CPU platform.
    Accelerated Computation of a High Dimensional Kolmogorov-Smirnov Distance. (arXiv:2106.13706v1 [stat.CO])
    (2 min) Statistical testing is widespread and critical for a variety of scientific disciplines. The advent of machine learning and the increase of computing power has increased the interest in the analysis and statistical testing of multidimensional data. We extend the powerful Kolmogorov-Smirnov two sample test to a high dimensional form in a similar manner to Fasano (Fasano, 1987). We call our result the d-dimensional Kolmogorov-Smirnov test (ddKS) and provide three novel contributions therewith: we develop an analytical equation for the significance of a given ddKS score, we provide an algorithm for computation of ddKS on modern computing hardware that is of constant time complexity for small sample sizes and dimensions, and we provide two approximate calculations of ddKS: one that reduces the time complexity to linear at larger sample sizes, and another that reduces the time complexity to linear with increasing dimension. We perform power analysis of ddKS and its approximations on a corpus of datasets and compare to other common high dimensional two sample tests and distances: Hotelling's T^2 test and Kullback-Leibler divergence. Our ddKS test performs well for all datasets, dimensions, and sizes tested, whereas the other tests and distances fail to reject the null hypothesis on at least one dataset. We therefore conclude that ddKS is a powerful multidimensional two sample test for general use, and can be calculated in a fast and efficient manner using our parallel or approximate methods. Open source implementations of all methods described in this work are located at https://github.com/pnnl/ddks.
    Private Adaptive Gradient Methods for Convex Optimization. (arXiv:2106.13756v1 [cs.LG])
    (2 min) We study adaptive methods for differentially private convex optimization, proposing and analyzing differentially private variants of a Stochastic Gradient Descent (SGD) algorithm with adaptive stepsizes, as well as the AdaGrad algorithm. We provide upper bounds on the regret of both algorithms and show that the bounds are (worst-case) optimal. As a consequence of our development, we show that our private versions of AdaGrad outperform adaptive SGD, which in turn outperforms traditional SGD in scenarios with non-isotropic gradients where (non-private) Adagrad provably outperforms SGD. The major challenge is that the isotropic noise typically added for privacy dominates the signal in gradient geometry for high-dimensional problems; approaches to this that effectively optimize over lower-dimensional subspaces simply ignore the actual problems that varying gradient geometries introduce. In contrast, we study non-isotropic clipping and noise addition, developing a principled theoretical approach; the consequent procedures also enjoy significantly stronger empirical performance than prior approaches.
    FastICARL: Fast Incremental Classifier and Representation Learning with Efficient Budget Allocation in Audio Sensing Applications. (arXiv:2106.07268v2 [cs.SD] UPDATED)
    (2 min) Various incremental learning (IL) approaches have been proposed to help deep learning models learn new tasks/classes continuously without forgetting what was learned previously (i.e., avoid catastrophic forgetting). With the growing number of deployed audio sensing applications that need to dynamically incorporate new tasks and changing input distribution from users, the ability of IL on-device becomes essential for both efficiency and user privacy. However, prior works suffer from high computational costs and storage demands which hinders the deployment of IL on-device. In this work, to overcome these limitations, we develop an end-to-end and on-device IL framework, FastICARL, that incorporates an exemplar-based IL and quantization in the context of audio-based applications. We first employ k-nearest-neighbor to reduce the latency of IL. Then, we jointly utilize a quantization technique to decrease the storage requirements of IL. We implement FastICARL on two types of mobile devices and demonstrate that FastICARL remarkably decreases the IL time up to 78-92% and the storage requirements by 2-4 times without sacrificing its performance. FastICARL enables complete on-device IL, ensuring user privacy as the user data does not need to leave the device.
    Deep Interpretable Criminal Charge Prediction and Algorithmic Bias. (arXiv:2106.13456v1 [cs.LG])
    (2 min) While predictive policing has become increasingly common in assisting with decisions in the criminal justice system, the use of these results is still controversial. Some software based on deep learning lacks accuracy (e.g., in F-1), and many decision processes are not transparent causing doubt about decision bias, such as perceived racial, age, and gender disparities. This paper addresses bias issues with post-hoc explanations to provide a trustable prediction of whether a person will receive future criminal charges given one's previous criminal records by learning temporal behavior patterns over twenty years. Bi-LSTM relieves the vanishing gradient problem, and attentional mechanisms allows learning and interpretation of feature importance. Our approach shows consistent and reliable prediction precision and recall on a real-life dataset. Our analysis of the importance of each input feature shows the critical causal impact on decision-making, suggesting that criminal histories are statistically significant factors, while identifiers, such as race, gender, and age, are not. Finally, our algorithm indicates that a suspect tends to gradually rather than suddenly increase crime severity level over time.
    Using Machine Learning and Data Mining to Leverage Community Knowledge for the Engineering of Stable Metal-Organic Frameworks. (arXiv:2106.13327v1 [cond-mat.mtrl-sci])
    (2 min) Although the tailored metal active sites and porous architectures of MOFs hold great promise for engineering challenges ranging from gas separations to catalysis, a lack of understanding of how to improve their stability limits their use in practice. To overcome this limitation, we extract thousands of published reports of the key aspects of MOF stability necessary for their practical application: the ability to withstand high temperatures without degrading and the capacity to be activated by removal of solvent molecules. From nearly 4,000 manuscripts, we use natural language processing and automated image analysis to obtain over 2,000 solvent-removal stability measures and 3,000 thermal degradation temperatures. We analyze the relationships between stability properties and the chemical and geometric structures in this set to identify limits of prior heuristics derived from smaller sets of MOFs. By training predictive machine learning (ML, i.e., Gaussian process and artificial neural network) models to encode the structure-property relationships with graph- and pore-structure-based representations, we are able to make predictions of stability orders of magnitude faster than conventional physics-based modeling or experiment. Interpretation of important features in ML models provides insights that we use to identify strategies to engineer increased stability into typically unstable 3d-containing MOFs that are frequently targeted for catalytic applications. We expect our approach to accelerate the time to discovery of stable, practical MOF materials for a wide range of applications.
    Audio Attacks and Defenses against AED Systems -- A Practical Study. (arXiv:2106.07428v2 [cs.SD] UPDATED)
    (2 min) Audio Event Detection (AED) Systems capture audio from the environment and employ some deep learning algorithms for detecting the presence of a specific sound of interest. In this paper, we evaluate deep learning-based AED systems against evasion attacks through adversarial examples. We run multiple security critical AED tasks, implemented as CNNs classifiers, and then generate audio adversarial examples using two different types of noise, namely background and white noise, that can be used by the adversary to evade detection. We also examine the robustness of existing third-party AED capable devices, such as Nest devices manufactured by Google, which run their own black-box deep learning models. We show that an adversary can focus on audio adversarial inputs to cause AED systems to misclassify, similarly to what has been previously done by works focusing on adversarial examples from the image domain. We then, seek to improve classifiers' robustness through countermeasures to the attacks. We employ adversarial training and a custom denoising technique. We show that these countermeasures, when applied to audio input, can be successful, either in isolation or in combination, generating relevant increases of nearly fifty percent in the performance of the classifiers when these are under attack.
    Data efficiency in graph networks through equivariance. (arXiv:2106.13786v1 [cs.LG])
    (2 min) We introduce a novel architecture for graph networks which is equivariant to any transformation in the coordinate embeddings that preserves the distance between neighbouring nodes. In particular, it is equivariant to the Euclidean and conformal orthogonal groups in $n$-dimensions. Thanks to its equivariance properties, the proposed model is extremely more data efficient with respect to classical graph architectures and also intrinsically equipped with a better inductive bias. We show that, learning on a minimal amount of data, the architecture we propose can perfectly generalise to unseen data in a synthetic problem, while much more training data are required from a standard model to reach comparable performance.
    Unsupervised embedding of trajectories captures the latent structure of mobility. (arXiv:2012.02785v2 [cs.LG] UPDATED)
    (2 min) Human mobility drives major societal phenomena including epidemics, economies, and innovation. Historically, mobility was constrained by geographic distance, however, in the globalizing world, language, culture, and history are increasingly important. We propose using the neural embedding model word2vec for studying mobility and capturing its complexity. Word2ec is shown to be mathematically equivalent to the gravity model of mobility, and using three human trajectory datasets, we demonstrate that it encodes nuanced relationships between locations into a vector-space, providing a measure of effective distance that outperforms baselines. Focusing on the case of scientific mobility, we show that embeddings uncover cultural, linguistic, and hierarchical relationships at multiple levels of granularity. Connecting neural embeddings to the gravity model opens up new avenues for the study of mobility.
    Masksembles for Uncertainty Estimation. (arXiv:2012.08334v2 [cs.LG] UPDATED)
    (2 min) Deep neural networks have amply demonstrated their prowess but estimating the reliability of their predictions remains challenging. Deep Ensembles are widely considered as being one of the best methods for generating uncertainty estimates but are very expensive to train and evaluate. MC-Dropout is another popular alternative, which is less expensive, but also less reliable. Our central intuition is that there is a continuous spectrum of ensemble-like models of which MC-Dropout and Deep Ensembles are extreme examples. The first uses an effectively infinite number of highly correlated models while the second relies on a finite number of independent models. To combine the benefits of both, we introduce Masksembles. Instead of randomly dropping parts of the network as in MC-dropout, Masksemble relies on a fixed number of binary masks, which are parameterized in a way that allows to change correlations between individual models. Namely, by controlling the overlap between the masks and their density one can choose the optimal configuration for the task at hand. This leads to a simple and easy to implement method with performance on par with Ensembles at a fraction of the cost. We experimentally validate Masksembles on two widely used datasets, CIFAR10 and ImageNet.
    Multifidelity Modeling for Physics-Informed Neural Networks (PINNs). (arXiv:2106.13361v1 [physics.comp-ph])
    (2 min) Multifidelity simulation methodologies are often used in an attempt to judiciously combine low-fidelity and high-fidelity simulation results in an accuracy-increasing, cost-saving way. Candidates for this approach are simulation methodologies for which there are fidelity differences connected with significant computational cost differences. Physics-informed Neural Networks (PINNs) are candidates for these types of approaches due to the significant difference in training times required when different fidelities (expressed in terms of architecture width and depth as well as optimization criteria) are employed. In this paper, we propose a particular multifidelity approach applied to PINNs that exploits low-rank structure. We demonstrate that width, depth, and optimization criteria can be used as parameters related to model fidelity, and show numerical justification of cost differences in training due to fidelity parameter choices. We test our multifidelity scheme on various canonical forward PDE models that have been presented in the emerging PINNs literature.
    Jitter: Random Jittering Loss Function. (arXiv:2106.13749v1 [cs.LG])
    (0 min) Regularization plays a vital role in machine learning optimization. One novel regularization method called flooding makes the training loss fluctuate around the flooding level. It intends to make the model continue to random walk until it comes to a flat loss landscape to enhance generalization. However, the hyper-parameter flooding level of the flooding method fails to be selected properly and uniformly. We propose a novel method called Jitter to improve it. Jitter is essentially a kind of random loss function. Before training, we randomly sample the Jitter Point from a specific probability distribution. The flooding level should be replaced by Jitter point to obtain a new target function and train the model accordingly. As Jitter point acting as a random factor, we actually add some randomness to the loss function, which is consistent with the fact that there exists innumerable random behaviors in the learning process of the machine learning model and is supposed to make the model more robust. In addition, Jitter performs random walk randomly which divides the loss curve into small intervals and then flipping them over, ideally making the loss curve much flatter and enhancing generalization ability. Moreover, Jitter can be a domain-, task-, and model-independent regularization method and train the model effectively after the training error reduces to zero. Our experimental results show that Jitter method can improve model performance more significantly than the previous flooding method and make the test loss curve descend twice.
    Session-aware Linear Item-Item Models for Session-based Recommendation. (arXiv:2103.16104v2 [cs.IR] UPDATED)
    (0 min) Session-based recommendation aims at predicting the next item given a sequence of previous items consumed in the session, e.g., on e-commerce or multimedia streaming services. Specifically, session data exhibits some unique characteristics, i.e., session consistency and sequential dependency over items within the session, repeated item consumption, and session timeliness. In this paper, we propose simple-yet-effective linear models for considering the holistic aspects of the sessions. The comprehensive nature of our models helps improve the quality of session-based recommendation. More importantly, it provides a generalized framework for reflecting different perspectives of session data. Furthermore, since our models can be solved by closed-form solutions, they are highly scalable. Experimental results demonstrate that the proposed linear models show competitive or state-of-the-art performance in various metrics on several real-world datasets.
    Vision Transformer Architecture Search. (arXiv:2106.13700v1 [cs.CV])
    (2 min) Recently, transformers have shown great superiority in solving computer vision tasks by modeling images as a sequence of manually-split patches with self-attention mechanism. However, current architectures of vision transformers (ViTs) are simply inherited from natural language processing (NLP) tasks and have not been sufficiently investigated and optimized. In this paper, we make a further step by examining the intrinsic structure of transformers for vision tasks and propose an architecture search method, dubbed ViTAS, to search for the optimal architecture with similar hardware budgets. Concretely, we design a new effective yet efficient weight sharing paradigm for ViTs, such that architectures with different token embedding, sequence size, number of heads, width, and depth can be derived from a single super-transformer. Moreover, to cater for the variance of distinct architectures, we introduce \textit{private} class token and self-attention maps in the super-transformer. In addition, to adapt the searching for different budgets, we propose to search the sampling probability of identity operation. Experimental results show that our ViTAS attains excellent results compared to existing pure transformer architectures. For example, with $1.3$G FLOPs budget, our searched architecture achieves $74.7\%$ top-$1$ accuracy on ImageNet and is $2.5\%$ superior than the current baseline ViT architecture. Code is available at \url{https://github.com/xiusu/ViTAS}.
    Fast quantum state reconstruction via accelerated non-convex programming. (arXiv:2104.07006v3 [quant-ph] UPDATED)
    (0 min) We propose a new quantum state reconstruction method that combines ideas from compressed sensing, non-convex optimization, and acceleration methods. The algorithm, called Momentum-Inspired Factored Gradient Descent (\texttt{MiFGD}), extends the applicability of quantum tomography for larger systems. Despite being a non-convex method, \texttt{MiFGD} converges \emph{provably} to the true density matrix at a linear rate, in the absence of experimental and statistical noise, and under common assumptions. With this manuscript, we present the method, prove its convergence property and provide Frobenius norm bound guarantees with respect to the true density matrix. From a practical point of view, we benchmark the algorithm performance with respect to other existing methods, in both synthetic and real experiments performed on an IBM's quantum processing unit. We find that the proposed algorithm performs orders of magnitude faster than state of the art approaches, with the same or better accuracy. In both synthetic and real experiments, we observed accurate and robust reconstruction, despite experimental and statistical noise in the tomographic data. Finally, we provide a ready-to-use code for state tomography of multi-qubit systems.
    Data-based Design of Inferential Sensors for Petrochemical Industry. (arXiv:2106.13503v1 [cs.LG])
    (0 min) Inferential (or soft) sensors are used in industry to infer the values of imprecisely and rarely measured (or completely unmeasured) variables from variables measured online (e.g., pressures, temperatures). The main challenge, akin to classical model overfitting, in designing an effective inferential sensor is the selection of a correct structure of the sensor. The sensor structure is represented by the number of inputs to the sensor, which correspond to the variables measured online and their (simple) combinations. This work is focused on the design of inferential sensors for product composition of an industrial distillation column in two oil refinery units, a Fluid Catalytic Cracking unit and a Vacuum Gasoil Hydrogenation unit. As the first design step, we use several well-known data pre-treatment (gross error detection) methods and compare the ability of these approaches to indicate systematic errors and outliers in the available industrial data. We then study effectiveness of various methods for design of the inferential sensors taking into account the complexity and accuracy of the resulting model. The effectiveness analysis indicates that the improvements achieved over the current inferential sensors are up to 19 %.
    Assessing Generalization of SGD via Disagreement. (arXiv:2106.13799v1 [cs.LG])
    (0 min) We empirically show that the test error of deep networks can be estimated by simply training the same architecture on the same training set but with a different run of Stochastic Gradient Descent (SGD), and measuring the disagreement rate between the two networks on unlabeled test data. This builds on -- and is a stronger version of -- the observation in Nakkiran & Bansal '20, which requires the second run to be on an altogether fresh training set. We further theoretically show that this peculiar phenomenon arises from the \emph{well-calibrated} nature of \emph{ensembles} of SGD-trained models. This finding not only provides a simple empirical measure to directly predict the test error using unlabeled test data, but also establishes a new conceptual connection between generalization and calibration.
    Bringing Light Into the Dark: A Large-scale Evaluation of Knowledge Graph Embedding Models Under a Unified Framework. (arXiv:2006.13365v4 [cs.LG] UPDATED)
    (0 min) The heterogeneity in recently published knowledge graph embedding models' implementations, training, and evaluation has made fair and thorough comparisons difficult. In order to assess the reproducibility of previously published results, we re-implemented and evaluated 21 interaction models in the PyKEEN software package. Here, we outline which results could be reproduced with their reported hyper-parameters, which could only be reproduced with alternate hyper-parameters, and which could not be reproduced at all as well as provide insight as to why this might be the case. We then performed a large-scale benchmarking on four datasets with several thousands of experiments and 24,804 GPU hours of computation time. We present insights gained as to best practices, best configurations for each model, and where improvements could be made over previously published best configurations. Our results highlight that the combination of model architecture, training approach, loss function, and the explicit modeling of inverse relations is crucial for a model's performances, and not only determined by the model architecture. We provide evidence that several architectures can obtain results competitive to the state-of-the-art when configured carefully. We have made all code, experimental configurations, results, and analyses that lead to our interpretations available at https://github.com/pykeen/pykeen and https://github.com/pykeen/benchmarking
    NSL: Hybrid Interpretable Learning From Noisy Raw Data. (arXiv:2012.05023v2 [cs.LG] UPDATED)
    (0 min) Inductive Logic Programming (ILP) systems learn generalised, interpretable rules in a data-efficient manner utilising existing background knowledge. However, current ILP systems require training examples to be specified in a structured logical format. Neural networks learn from unstructured data, although their learned models may be difficult to interpret and are vulnerable to data perturbations at run-time. This paper introduces a hybrid neural-symbolic learning framework, called NSL, that learns interpretable rules from labelled unstructured data. NSL combines pre-trained neural networks for feature extraction with FastLAS, a state-of-the-art ILP system for rule learning under the answer set semantics. Features extracted by the neural components define the structured context of labelled examples and the confidence of the neural predictions determines the level of noise of the examples. Using the scoring function of FastLAS, NSL searches for short, interpretable rules that generalise over such noisy examples. We evaluate our framework on propositional and first-order classification tasks using the MNIST dataset as raw data. Specifically, we demonstrate that NSL is able to learn robust rules from perturbed MNIST data and achieve comparable or superior accuracy when compared to neural network and random forest baselines whilst being more general and interpretable.
    Task-Driven Out-of-Distribution Detection with Statistical Guarantees for Robot Learning. (arXiv:2106.13703v1 [cs.RO])
    (2 min) Our goal is to perform out-of-distribution (OOD) detection, i.e., to detect when a robot is operating in environments that are drawn from a different distribution than the environments used to train the robot. We leverage Probably Approximately Correct (PAC)-Bayes theory in order to train a policy with a guaranteed bound on performance on the training distribution. Our key idea for OOD detection then relies on the following intuition: violation of the performance bound on test environments provides evidence that the robot is operating OOD. We formalize this via statistical techniques based on p-values and concentration inequalities. The resulting approach (i) provides guaranteed confidence bounds on OOD detection, and (ii) is task-driven and sensitive only to changes that impact the robot's performance. We demonstrate our approach on a simulated example of grasping objects with unfamiliar poses or shapes. We also present both simulation and hardware experiments for a drone performing vision-based obstacle avoidance in unfamiliar environments (including wind disturbances and different obstacle densities). Our examples demonstrate that we can perform task-driven OOD detection within just a handful of trials. Comparisons with baselines also demonstrate the advantages of our approach in terms of providing statistical guarantees and being insensitive to task-irrelevant distribution shifts.
    A proximal-proximal majorization-minimization algorithm for nonconvex tuning-free robust regression problems. (arXiv:2106.13683v1 [math.OC])
    (2 min) In this paper, we introduce a proximal-proximal majorization-minimization (PPMM) algorithm for nonconvex tuning-free robust regression problems. The basic idea is to apply the proximal majorization-minimization algorithm to solve the nonconvex problem with the inner subproblems solved by a sparse semismooth Newton (SSN) method based proximal point algorithm (PPA). We must emphasize that the main difficulty in the design of the algorithm lies in how to overcome the singular difficulty of the inner subproblem. Furthermore, we also prove that the PPMM algorithm converges to a d-stationary point. Due to the Kurdyka-Lojasiewicz (KL) property of the problem, we present the convergence rate of the PPMM algorithm. Numerical experiments demonstrate that our proposed algorithm outperforms the existing state-of-the-art algorithms.
    Subgraph Federated Learning with Missing Neighbor Generation. (arXiv:2106.13430v1 [cs.LG])
    (0 min) Graphs have been widely used in data mining and machine learning due to their unique representation of real-world objects and their interactions. As graphs are getting bigger and bigger nowadays, it is common to see their subgraphs separately collected and stored in multiple local systems. Therefore, it is natural to consider the subgraph federated learning setting, where each local system holding a small subgraph that may be biased from the distribution of the whole graph. Hence, the subgraph federated learning aims to collaboratively train a powerful and generalizable graph mining model without directly sharing their graph data. In this work, towards the novel yet realistic setting of subgraph federated learning, we propose two major techniques: (1) FedSage, which trains a GraphSage model based on FedAvg to integrate node features, link structures, and task labels on multiple local subgraphs; (2) FedSage+, which trains a missing neighbor generator along FedSage to deal with missing links across local subgraphs. Empirical results on four real-world graph datasets with synthesized subgraph federated learning settings demonstrate the effectiveness and efficiency of our proposed techniques. At the same time, consistent theoretical implications are made towards their generalization ability on the global graphs.
    Effects of boundary conditions in fully convolutional networks for learning spatio-temporal dynamics. (arXiv:2106.11160v2 [cs.LG] UPDATED)
    (2 min) Accurate modeling of boundary conditions is crucial in computational physics. The ever increasing use of neural networks as surrogates for physics-related problems calls for an improved understanding of boundary condition treatment, and its influence on the network accuracy. In this paper, several strategies to impose boundary conditions (namely padding, improved spatial context, and explicit encoding of physical boundaries) are investigated in the context of fully convolutional networks applied to recurrent tasks. These strategies are evaluated on two spatio-temporal evolving problems modeled by partial differential equations: the 2D propagation of acoustic waves (hyperbolic PDE) and the heat equation (parabolic PDE). Results reveal a high sensitivity of both accuracy and stability on the boundary implementation in such recurrent tasks. It is then demonstrated that the choice of the optimal padding strategy is directly linked to the data semantics. Furthermore, the inclusion of additional input spatial context or explicit physics-based rules allows a better handling of boundaries in particular for large number of recurrences, resulting in more robust and stable neural networks, while facilitating the design and versatility of such networks.
    Understanding the Origin of Information-Seeking Exploration in Probabilistic Objectives for Control. (arXiv:2103.06859v4 [cs.LG] UPDATED)
    (0 min) The exploration-exploitation trade-off is central to the description of adaptive behaviour in fields ranging from machine learning, to biology, to economics. While many approaches have been taken, one approach to solving this trade-off has been to equip or propose that agents possess an intrinsic 'exploratory drive' which is often implemented in terms of maximizing the agents information gain about the world -- an approach which has been widely studied in machine learning and cognitive science. In this paper we mathematically investigate the nature and meaning of such approaches and demonstrate that this combination of utility maximizing and information-seeking behaviour arises from the minimization of an entirely difference class of objectives we call divergence objectives. We propose a dichotomy in the objective functions underlying adaptive behaviour between \emph{evidence} objectives, which correspond to well-known reward or utility maximizing objectives in the literature, and \emph{divergence} objectives which instead seek to minimize the divergence between the agent's expected and desired futures, and argue that this new class of divergence objectives could form the mathematical foundation for a much richer understanding of the exploratory components of adaptive and intelligent action, beyond simply greedy utility maximization.
    Learning a Probabilistic Relaxation of Discrete Variables for Soft Detection with Low Complexity: CMDNet. (arXiv:2102.12756v2 [eess.SP] UPDATED)
    (2 min) Following the great success of Machine Learning (ML), especially Deep Neural Networks (DNNs), in many research domains in 2010s, several ML-based approaches were proposed for detection in large inverse linear problems, e.g., massive MIMO systems. The main motivation behind is that the complexity of Maximum A-Posteriori (MAP) detection grows exponentially with system dimensions. Instead of using DNNs, essentially being a black-box, we take a slightly different approach and introduce a probabilistic Continuous relaxation of disCrete variables to MAP detection. Enabling close approximation and continuous optimization, we derive an iterative detection algorithm: Concrete MAP Detection (CMD). Furthermore, extending CMD by the idea of deep unfolding into CMDNet, we allow for (online) optimization of a small number of parameters to different working points while limiting complexity. In contrast to recent DNN-based approaches, we select the optimization criterion and output of CMDNet based on information theory and are thus able to learn approximate probabilities of the individual optimal detector. This is crucial for soft decoding in today's communication systems. Numerical simulation results in MIMO systems reveal CMDNet to feature a promising accuracy complexity trade-off compared to State of the Art. Notably, we demonstrate CMDNet's soft outputs to be reliable for decoders.
    Fine-grained Geolocation Prediction of Tweets with Human Machine Collaboration. (arXiv:2106.13411v1 [cs.LG])
    (0 min) Twitter is a useful resource to analyze peoples' opinions on various topics. Often these topics are correlated or associated with locations from where these Tweet posts are made. For example, restaurant owners may need to know where their target customers eat with respect to the sentiment of the posts made related to food, policy planners may need to analyze citizens' opinion on relevant issues such as crime, safety, congestion, etc. with respect to specific parts of the city, or county or state. As promising as this is, less than $1\%$ of the crawled Tweet posts come with geolocation tags. That makes accurate prediction of Tweet posts for the non geo-tagged tweets very critical to analyze data in various domains. In this research, we utilized millions of Twitter posts and end-users domain expertise to build a set of deep neural network models using natural language processing (NLP) techniques, that predicts the geolocation of non geo-tagged Tweet posts at various level of granularities such as neighborhood, zipcode, and longitude with latitudes. With multiple neural architecture experiments, and a collaborative human-machine workflow design, our ongoing work on geolocation detection shows promising results that empower end-users to correlate relationship between variables of choice with the location information.
    A mechanistic-based data-driven approach to accelerate structural topology optimization through finite element convolutional neural network (FE-CNN). (arXiv:2106.13652v1 [cs.LG])
    (0 min) In this paper, a mechanistic data-driven approach is proposed to accelerate structural topology optimization, employing an in-house developed finite element convolutional neural network (FE-CNN). Our approach can be divided into two stages: offline training, and online optimization. During offline training, a mapping function is built between high and low resolution representations of a given design domain. The mapping is expressed by a FE-CNN, which targets a common objective function value (e.g., structural compliance) across design domains of differing resolutions. During online optimization, an arbitrary design domain of high resolution is reduced to low resolution through the trained mapping function. The original high-resolution domain is thus designed by computations performed on only the low-resolution version, followed by an inverse mapping back to the high-resolution domain. Numerical examples demonstrate that this approach can accelerate optimization by up to an order of magnitude in computational time. Our proposed approach therefore shows great potential to overcome the curse-of-dimensionality incurred by density-based structural topology optimization. The limitation of our present approach is also discussed.
    Dealing with Expert Bias in Collective Decision-Making. (arXiv:2106.13539v1 [cs.AI])
    (2 min) Quite some real-world problems can be formulated as decision-making problems wherein one must repeatedly make an appropriate choice from a set of alternatives. Expert judgements, whether human or artificial, can help in taking correct decisions, especially when exploration of alternative solutions is costly. As expert opinions might deviate, the problem of finding the right alternative can be approached as a collective decision making problem (CDM). Current state-of-the-art approaches to solve CDM are limited by the quality of the best expert in the group, and perform poorly if experts are not qualified or if they are overly biased, thus potentially derailing the decision-making process. In this paper, we propose a new algorithmic approach based on contextual multi-armed bandit problems (CMAB) to identify and counteract such biased expertises. We explore homogeneous, heterogeneous and polarised expert groups and show that this approach is able to effectively exploit the collective expertise, irrespective of whether the provided advice is directly conducive to good performance, outperforming state-of-the-art methods, especially when the quality of the provided expertise degrades. Our novel CMAB-inspired approach achieves a higher final performance and does so while converging more rapidly than previous adaptive algorithms, especially when heterogeneous expertise is readily available.
    Prior Image-Constrained Reconstruction using Style-Based Generative Models. (arXiv:2102.12525v2 [eess.IV] CROSS LISTED)
    (2 min) Obtaining a useful estimate of an object from highly incomplete imaging measurements remains a holy grail of imaging science. Deep learning methods have shown promise in learning object priors or constraints to improve the conditioning of an ill-posed imaging inverse problem. In this study, a framework for estimating an object of interest that is semantically related to a known prior image, is proposed. An optimization problem is formulated in the disentangled latent space of a style-based generative model, and semantically meaningful constraints are imposed using the disentangled latent representation of the prior image. Stable recovery from incomplete measurements with the help of a prior image is theoretically analyzed. Numerical experiments demonstrating the superior performance of our approach as compared to related methods are presented.
    Privileged Zero-Shot AutoML. (arXiv:2106.13743v1 [cs.LG])
    (2 min) This work improves the quality of automated machine learning (AutoML) systems by using dataset and function descriptions while significantly decreasing computation time from minutes to milliseconds by using a zero-shot approach. Given a new dataset and a well-defined machine learning task, humans begin by reading a description of the dataset and documentation for the algorithms to be used. This work is the first to use these textual descriptions, which we call privileged information, for AutoML. We use a pre-trained Transformer model to process the privileged text and demonstrate that using this information improves AutoML performance. Thus, our approach leverages the progress of unsupervised representation learning in natural language processing to provide a significant boost to AutoML. We demonstrate that using only textual descriptions of the data and functions achieves reasonable classification performance, and adding textual descriptions to data meta-features improves classification across tabular datasets. To achieve zero-shot AutoML we train a graph neural network with these description embeddings and the data meta-features. Each node represents a training dataset, which we use to predict the best machine learning pipeline for a new test dataset in a zero-shot fashion. Our zero-shot approach rapidly predicts a high-quality pipeline for a supervised learning task and dataset. In contrast, most AutoML systems require tens or hundreds of pipeline evaluations. We show that zero-shot AutoML reduces running and prediction times from minutes to milliseconds, consistently across datasets. By speeding up AutoML by orders of magnitude this work demonstrates real-time AutoML.
    Vulnerability and Transaction behavior based detection of Malicious Smart Contracts. (arXiv:2106.13422v1 [cs.CR])
    (2 min) Smart Contracts (SCs) in Ethereum can automate tasks and provide different functionalities to a user. Such automation is enabled by the `Turing-complete' nature of the programming language (Solidity) in which SCs are written. This also opens up different vulnerabilities and bugs in SCs that malicious actors exploit to carry out malicious or illegal activities on the cryptocurrency platform. In this work, we study the correlation between malicious activities and the vulnerabilities present in SCs and find that some malicious activities are correlated with certain types of vulnerabilities. We then develop and study the feasibility of a scoring mechanism that corresponds to the severity of the vulnerabilities present in SCs to determine if it is a relevant feature to identify suspicious SCs. We analyze the utility of severity score towards detection of suspicious SCs using unsupervised machine learning (ML) algorithms across different temporal granularities and identify behavioral changes. In our experiments with on-chain SCs, we were able to find a total of 1094 benign SCs across different granularities which behave similar to malicious SCs, with the inclusion of the smart contract vulnerability scores in the feature set.
    TTS-Portuguese Corpus: a corpus for speech synthesis in Brazilian Portuguese. (arXiv:2005.05144v3 [eess.AS] UPDATED)
    (0 min) Speech provides a natural way for human-computer interaction. In particular, speech synthesis systems are popular in different applications, such as personal assistants, GPS applications, screen readers and accessibility tools. However, not all languages are on the same level when in terms of resources and systems for speech synthesis. This work consists of creating publicly available resources for Brazilian Portuguese in the form of a novel dataset along with deep learning models for end-to-end speech synthesis. Such dataset has 10.5 hours from a single speaker, from which a Tacotron 2 model with the RTISI-LA vocoder presented the best performance, achieving a 4.03 MOS value. The obtained results are comparable to related works covering English language and the state-of-the-art in Portuguese.
    Voice Activity Detection for Transient Noisy Environment Based on Diffusion Nets. (arXiv:2106.13763v1 [cs.SD])
    (0 min) We address voice activity detection in acoustic environments of transients and stationary noises, which often occur in real life scenarios. We exploit unique spatial patterns of speech and non-speech audio frames by independently learning their underlying geometric structure. This process is done through a deep encoder-decoder based neural network architecture. This structure involves an encoder that maps spectral features with temporal information to their low-dimensional representations, which are generated by applying the diffusion maps method. The encoder feeds a decoder that maps the embedded data back into the high-dimensional space. A deep neural network, which is trained to separate speech from non-speech frames, is obtained by concatenating the decoder to the encoder, resembling the known Diffusion nets architecture. Experimental results show enhanced performance compared to competing voice activity detection methods. The improvement is achieved in both accuracy, robustness and generalization ability. Our model performs in a real-time manner and can be integrated into audio-based communication systems. We also present a batch algorithm which obtains an even higher accuracy for off-line applications.
    Tighter Analysis of Alternating Stochastic Gradient Method for Stochastic Nested Problems. (arXiv:2106.13781v1 [stat.ML])
    (0 min) Stochastic nested optimization, including stochastic compositional, min-max and bilevel optimization, is gaining popularity in many machine learning applications. While the three problems share the nested structure, existing works often treat them separately, and thus develop problem-specific algorithms and their analyses. Among various exciting developments, simple SGD-type updates (potentially on multiple variables) are still prevalent in solving this class of nested problems, but they are believed to have slower convergence rate compared to that of the non-nested problems. This paper unifies several SGD-type updates for stochastic nested problems into a single SGD approach that we term ALternating Stochastic gradient dEscenT (ALSET) method. By leveraging the hidden smoothness of the problem, this paper presents a tighter analysis of ALSET for stochastic nested problems. Under the new analysis, to achieve an $\epsilon$-stationary point of the nested problem, it requires ${\cal O}(\epsilon^{-2})$ samples. Under certain regularity conditions, applying our results to stochastic compositional, min-max and reinforcement learning problems either improves or matches the best-known sample complexity in the respective cases. Our results explain why simple SGD-type algorithms in stochastic nested problems all work very well in practice without the need for further modifications.
    Building Intelligent Autonomous Navigation Agents. (arXiv:2106.13415v1 [cs.CV])
    (0 min) Breakthroughs in machine learning in the last decade have led to `digital intelligence', i.e. machine learning models capable of learning from vast amounts of labeled data to perform several digital tasks such as speech recognition, face recognition, machine translation and so on. The goal of this thesis is to make progress towards designing algorithms capable of `physical intelligence', i.e. building intelligent autonomous navigation agents capable of learning to perform complex navigation tasks in the physical world involving visual perception, natural language understanding, reasoning, planning, and sequential decision making. Despite several advances in classical navigation methods in the last few decades, current navigation agents struggle at long-term semantic navigation tasks. In the first part of the thesis, we discuss our work on short-term navigation using end-to-end reinforcement learning to tackle challenges such as obstacle avoidance, semantic perception, language grounding, and reasoning. In the second part, we present a new class of navigation methods based on modular learning and structured explicit map representations, which leverage the strengths of both classical and end-to-end learning methods, to tackle long-term navigation tasks. We show that these methods are able to effectively tackle challenges such as localization, mapping, long-term planning, exploration and learning semantic priors. These modular learning methods are capable of long-term spatial and semantic understanding and achieve state-of-the-art results on various navigation tasks.
    Boolean learning under noise-perturbations in hardware neural networks. (arXiv:2003.12319v2 [cs.NE] UPDATED)
    (0 min) A high efficiency hardware integration of neural networks benefits from realizing nonlinearity, network connectivity and learning fully in a physical substrate. Multiple systems have recently implemented some or all of these operations, yet the focus was placed on addressing technological challenges. Fundamental questions regarding learning in hardware neural networks remain largely unexplored. Noise in particular is unavoidable in such architectures, and here we investigate its interaction with a learning algorithm using an opto-electronic recurrent neural network. We find that noise strongly modifies the system's path during convergence, and surprisingly fully decorrelates the final readout weight matrices. This highlights the importance of understanding architecture, noise and learning algorithm as interacting players, and therefore identifies the need for mathematical tools for noisy, analogue system optimization.
    Recurrent Coupled Topic Modeling over Sequential Documents. (arXiv:2106.13732v1 [cs.IR])
    (0 min) The abundant sequential documents such as online archival, social media and news feeds are streamingly updated, where each chunk of documents is incorporated with smoothly evolving yet dependent topics. Such digital texts have attracted extensive research on dynamic topic modeling to infer hidden evolving topics and their temporal dependencies. However, most of the existing approaches focus on single-topic-thread evolution and ignore the fact that a current topic may be coupled with multiple relevant prior topics. In addition, these approaches also incur the intractable inference problem when inferring latent parameters, resulting in a high computational cost and performance degradation. In this work, we assume that a current topic evolves from all prior topics with corresponding coupling weights, forming the multi-topic-thread evolution. Our method models the dependencies between evolving topics and thoroughly encodes their complex multi-couplings across time steps. To conquer the intractable inference challenge, a new solution with a set of novel data augmentation techniques is proposed, which successfully discomposes the multi-couplings between evolving topics. A fully conjugate model is thus obtained to guarantee the effectiveness and efficiency of the inference technique. A novel Gibbs sampler with a backward-forward filter algorithm efficiently learns latent timeevolving parameters in a closed-form. In addition, the latent Indian Buffet Process (IBP) compound distribution is exploited to automatically infer the overall topic number and customize the sparse topic proportions for each sequential document without bias. The proposed method is evaluated on both synthetic and real-world datasets against the competitive baselines, demonstrating its superiority over the baselines in terms of the low per-word perplexity, high coherent topics, and better document time prediction.
    Robust Matrix Factorization with Grouping Effect. (arXiv:2106.13681v1 [cs.LG])
    (0 min) Although many techniques have been applied to matrix factorization (MF), they may not fully exploit the feature structure. In this paper, we incorporate the grouping effect into MF and propose a novel method called Robust Matrix Factorization with Grouping effect (GRMF). The grouping effect is a generalization of the sparsity effect, which conducts denoising by clustering similar values around multiple centers instead of just around 0. Compared with existing algorithms, the proposed GRMF can automatically learn the grouping structure and sparsity in MF without prior knowledge, by introducing a naturally adjustable non-convex regularization to achieve simultaneous sparsity and grouping effect. Specifically, GRMF uses an efficient alternating minimization framework to perform MF, in which the original non-convex problem is first converted into a convex problem through Difference-of-Convex (DC) programming, and then solved by Alternating Direction Method of Multipliers (ADMM). In addition, GRMF can be easily extended to the Non-negative Matrix Factorization (NMF) settings. Extensive experiments have been conducted using real-world data sets with outliers and contaminated noise, where the experimental results show that GRMF has promoted performance and robustness, compared to five benchmark algorithms.
    Multi-Robot Deep Reinforcement Learning for Mobile Navigation. (arXiv:2106.13280v1 [cs.RO])
    (0 min) Deep reinforcement learning algorithms require large and diverse datasets in order to learn successful policies for perception-based mobile navigation. However, gathering such datasets with a single robot can be prohibitively expensive. Collecting data with multiple different robotic platforms with possibly different dynamics is a more scalable approach to large-scale data collection. But how can deep reinforcement learning algorithms leverage such heterogeneous datasets? In this work, we propose a deep reinforcement learning algorithm with hierarchically integrated models (HInt). At training time, HInt learns separate perception and dynamics models, and at test time, HInt integrates the two models in a hierarchical manner and plans actions with the integrated model. This method of planning with hierarchically integrated models allows the algorithm to train on datasets gathered by a variety of different platforms, while respecting the physical capabilities of the deployment robot at test time. Our mobile navigation experiments show that HInt outperforms conventional hierarchical policies and single-source approaches.
    InteL-VAEs: Adding Inductive Biases to Variational Auto-Encoders via Intermediary Latents. (arXiv:2106.13746v1 [stat.ML])
    (2 min) We introduce a simple and effective method for learning VAEs with controllable inductive biases by using an intermediary set of latent variables. This allows us to overcome the limitations of the standard Gaussian prior assumption. In particular, it allows us to impose desired properties like sparsity or clustering on learned representations, and incorporate prior information into the learned model. Our approach, which we refer to as the Intermediary Latent Space VAE (InteL-VAE), is based around controlling the stochasticity of the encoding process with the intermediary latent variables, before deterministically mapping them forward to our target latent representation, from which reconstruction is performed. This allows us to maintain all the advantages of the traditional VAE framework, while incorporating desired prior information, inductive biases, and even topological information through the latent mapping. We show that this, in turn, allows InteL-VAEs to learn both better generative models and representations.
    Chebyshev-Cantelli PAC-Bayes-Bennett Inequality for the Weighted Majority Vote. (arXiv:2106.13624v1 [cs.LG])
    (2 min) We present a new second-order oracle bound for the expected risk of a weighted majority vote. The bound is based on a novel parametric form of the Chebyshev-Cantelli inequality (a.k.a.\ one-sided Chebyshev's), which is amenable to efficient minimization. The new form resolves the optimization challenge faced by prior oracle bounds based on the Chebyshev-Cantelli inequality, the C-bounds [Germain et al., 2015], and, at the same time, it improves on the oracle bound based on second order Markov's inequality introduced by Masegosa et al. [2020]. We also derive the PAC-Bayes-Bennett inequality, which we use for empirical estimation of the oracle bound. The PAC-Bayes-Bennett inequality improves on the PAC-Bayes-Bernstein inequality by Seldin et al. [2012]. We provide an empirical evaluation demonstrating that the new bounds can improve on the work by Masegosa et al. [2020]. Both the parametric form of the Chebyshev-Cantelli inequality and the PAC-Bayes-Bennett inequality may be of independent interest for the study of concentration of measure in other domains.
    Multi-Domain Active Learning: A Comparative Study. (arXiv:2106.13516v1 [cs.LG])
    (0 min) Building classifiers on multiple domains is a practical problem in the real life. Instead of building classifiers one by one, multi-domain learning (MDL) simultaneously builds classifiers on multiple domains. MDL utilizes the information shared among the domains to improve the performance. As a supervised learning problem, the labeling effort is still high in MDL problems. Usually, this high labeling cost issue could be relieved by using active learning. Thus, it is natural to utilize active learning to reduce the labeling effort in MDL, and we refer this setting as multi-domain active learning (MDAL). However, there are only few works which are built on this setting. And when the researches have to face this problem, there is no off-the-shelf solutions. Under this circumstance, combining the current multi-domain learning models and single-domain active learning strategies might be a preliminary solution for MDAL problem. To find out the potential of this preliminary solution, a comparative study over 5 models and 4 selection strategies is made in this paper. To the best of our knowledge, this is the first work provides the formal definition of MDAL. Besides, this is the first comparative work for MDAL problem. From the results, the Multinomial Adversarial Networks (MAN) model with a simple best vs second best (BvSB) uncertainty strategy shows its superiority in most cases. We take this combination as our off-the-shelf recommendation for the MDAL problem.
    Prediction of Hereditary Cancers Using Neural Networks. (arXiv:2106.13682v1 [stat.ML])
    (0 min) Family history is a major risk factor for many types of cancer. Mendelian risk prediction models translate family histories into cancer risk predictions based on knowledge of cancer susceptibility genes. These models are widely used in clinical practice to help identify high-risk individuals. Mendelian models leverage the entire family history, but they rely on many assumptions about cancer susceptibility genes that are either unrealistic or challenging to validate due to low mutation prevalence. Training more flexible models, such as neural networks, on large databases of pedigrees can potentially lead to accuracy gains. In this paper, we develop a framework to apply neural networks to family history data and investigate their ability to learn inherited susceptibility to cancer. While there is an extensive literature on neural networks and their state-of-the-art performance in many tasks, there is little work applying them to family history data. We propose adaptations of fully-connected neural networks and convolutional neural networks to pedigrees. In data simulated under Mendelian inheritance, we demonstrate that our proposed neural network models are able to achieve nearly optimal prediction performance. Moreover, when the observed family history includes misreported cancer diagnoses, neural networks are able to outperform the Mendelian BRCAPRO model embedding the correct inheritance laws. Using a large dataset of over 200,000 family histories, the Risk Service cohort, we train prediction models for future risk of breast cancer. We validate the models using data from the Cancer Genetics Network.
    Interpreting Depression From Question-wise Long-term Video Recording of SDS Evaluation. (arXiv:2106.13393v1 [cs.CV])
    (0 min) Self-Rating Depression Scale (SDS) questionnaire has frequently been used for efficient depression preliminary screening. However, the uncontrollable self-administered measure can be easily affected by insouciantly or deceptively answering, and producing the different results with the clinician-administered Hamilton Depression Rating Scale (HDRS) and the final diagnosis. Clinically, facial expression (FE) and actions play a vital role in clinician-administered evaluation, while FE and action are underexplored for self-administered evaluations. In this work, we collect a novel dataset of 200 subjects to evidence the validity of self-rating questionnaires with their corresponding question-wise video recording. To automatically interpret depression from the SDS evaluation and the paired video, we propose an end-to-end hierarchical framework for the long-term variable-length video, which is also conditioned on the questionnaire results and the answering time. Specifically, we resort to a hierarchical model which utilizes a 3D CNN for local temporal pattern exploration and a redundancy-aware self-attention (RAS) scheme for question-wise global feature aggregation. Targeting for the redundant long-term FE video processing, our RAS is able to effectively exploit the correlations of each video clip within a question set to emphasize the discriminative information and eliminate the redundancy based on feature pair-wise affinity. Then, the question-wise video feature is concatenated with the questionnaire scores for final depression detection. Our thorough evaluations also show the validity of fusing SDS evaluation and its video recording, and the superiority of our framework to the conventional state-of-the-art temporal modeling methods.
    Binary Matrix Factorisation and Completion via Integer Programming. (arXiv:2106.13434v1 [math.OC])
    (0 min) Binary matrix factorisation is an essential tool for identifying discrete patterns in binary data. In this paper we consider the rank-k binary matrix factorisation problem (k-BMF) under Boolean arithmetic: we are given an n x m binary matrix X with possibly missing entries and need to find two binary matrices A and B of dimension n x k and k x m respectively, which minimise the distance between X and the Boolean product of A and B in the squared Frobenius distance. We present a compact and two exponential size integer programs (IPs) for k-BMF and show that the compact IP has a weak LP relaxation, while the exponential size LPs have a stronger equivalent LP relaxation. We introduce a new objective function, which differs from the traditional squared Frobenius objective in attributing a weight to zero entries of the input matrix that is proportional to the number of times the zero is erroneously covered in a rank-k factorisation. For one of the exponential size IPs we describe a computational approach based on column generation. Experimental results on synthetic and real word datasets suggest that our integer programming approach is competitive against available methods for k-BMF and provides accurate low-error factorisations.
    DeepLoc: A Ubiquitous Accurate and Low-Overhead Outdoor Cellular Localization System. (arXiv:2106.13632v1 [cs.LG])
    (0 min) Recent years have witnessed fast growth in outdoor location-based services. While GPS is considered a ubiquitous localization system, it is not supported by low-end phones, requires direct line of sight to the satellites, and can drain the phone battery quickly. In this paper, we propose DeepLoc: a deep learning-based outdoor localization system that obtains GPS-like localization accuracy without its limitations. In particular, DeepLoc leverages the ubiquitous cellular signals received from the different cell towers heard by the mobile device as hints to localize it. To do that, crowd-sensed geo-tagged received signal strength information coming from different cell towers is used to train a deep model that is used to infer the user's position. As part of DeepLoc design, we introduce modules to address a number of practical challenges including scaling the data collection to large areas, handling the inherent noise in the cellular signal and geo-tagged data, as well as providing enough data that is required for deep learning models with low-overhead. We implemented DeepLoc on different Android devices. Evaluation results in realistic urban and rural environments show that DeepLoc can achieve a median localization accuracy within 18.8m in urban areas and within 15.7m in rural areas. This accuracy outperforms the state-of-the-art cellular-based systems by more than 470% and comes with 330% savings in power compared to the GPS. This highlights the promise of DeepLoc as a ubiquitous accurate and low-overhead localization system.
    Shape registration in the time of transformers. (arXiv:2106.13679v1 [cs.CV])
    (0 min) In this paper, we propose a transformer-based procedure for the efficient registration of non-rigid 3D point clouds. The proposed approach is data-driven and adopts for the first time the transformer architecture in the registration task. Our method is general and applies to different settings. Given a fixed template with some desired properties (e.g. skinning weights or other animation cues), we can register raw acquired data to it, thereby transferring all the template properties to the input geometry. Alternatively, given a pair of shapes, our method can register the first onto the second (or vice-versa), obtaining a high-quality dense correspondence between the two. In both contexts, the quality of our results enables us to target real applications such as texture transfer and shape interpolation. Furthermore, we also show that including an estimation of the underlying density of the surface eases the learning process. By exploiting the potential of this architecture, we can train our model requiring only a sparse set of ground truth correspondences ($10\sim20\%$ of the total points). The proposed model and the analysis that we perform pave the way for future exploration of transformer-based architectures for registration and matching applications. Qualitative and quantitative evaluations demonstrate that our pipeline outperforms state-of-the-art methods for deformable and unordered 3D data registration on different datasets and scenarios.
    Scalable Perception-Action-Communication Loops with Convolutional and Graph Neural Networks. (arXiv:2106.13358v1 [cs.RO])
    (0 min) In this paper, we present a perception-action-communication loop design using Vision-based Graph Aggregation and Inference (VGAI). This multi-agent decentralized learning-to-control framework maps raw visual observations to agent actions, aided by local communication among neighboring agents. Our framework is implemented by a cascade of a convolutional and a graph neural network (CNN / GNN), addressing agent-level visual perception and feature learning, as well as swarm-level communication, local information aggregation and agent action inference, respectively. By jointly training the CNN and GNN, image features and communication messages are learned in conjunction to better address the specific task. We use imitation learning to train the VGAI controller in an offline phase, relying on a centralized expert controller. This results in a learned VGAI controller that can be deployed in a distributed manner for online execution. Additionally, the controller exhibits good scaling properties, with training in smaller teams and application in larger teams. Through a multi-agent flocking application, we demonstrate that VGAI yields performance comparable to or better than other decentralized controllers, using only the visual input modality and without accessing precise location or motion state information.
    Semantic annotation for computational pathology: Multidisciplinary experience and best practice recommendations. (arXiv:2106.13689v1 [eess.IV])
    (2 min) Recent advances in whole slide imaging (WSI) technology have led to the development of a myriad of computer vision and artificial intelligence (AI) based diagnostic, prognostic, and predictive algorithms. Computational Pathology (CPath) offers an integrated solution to utilize information embedded in pathology WSIs beyond what we obtain through visual assessment. For automated analysis of WSIs and validation of machine learning (ML) models, annotations at the slide, tissue and cellular levels are required. The annotation of important visual constructs in pathology images is an important component of CPath projects. Improper annotations can result in algorithms which are hard to interpret and can potentially produce inaccurate and inconsistent results. Despite the crucial role of annotations in CPath projects, there are no well-defined guidelines or best practices on how annotations should be carried out. In this paper, we address this shortcoming by presenting the experience and best practices acquired during the execution of a large-scale annotation exercise involving a multidisciplinary team of pathologists, ML experts and researchers as part of the Pathology image data Lake for Analytics, Knowledge and Education (PathLAKE) consortium. We present a real-world case study along with examples of different types of annotations, diagnostic algorithm, annotation data dictionary and annotation constructs. The analyses reported in this work highlight best practice recommendations that can be used as annotation guidelines over the lifecycle of a CPath project.
    Disease Progression Modeling Workbench 360. (arXiv:2106.13265v1 [cs.LG])
    (2 min) In this work we introduce Disease Progression Modeling workbench 360 (DPM360) opensource clinical informatics framework for collaborative research and delivery of healthcare AI. DPM360, when fully developed, will manage the entire modeling life cycle, from data analysis (e.g., cohort identification) to machine learning algorithm development and prototyping. DPM360 augments the advantages of data model standardization and tooling (OMOP-CDM, Athena, ATLAS) provided by the widely-adopted OHDSI initiative with a powerful machine learning training framework, and a mechanism for rapid prototyping through automatic deployment of models as containerized services to a cloud environment.
    Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models. (arXiv:2106.13353v1 [cs.CL])
    (2 min) Prompting language models (LMs) with training examples and task descriptions has been seen as critical to recent successes in few-shot learning. In this work, we show that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering. In fact, one can use null prompts, prompts that contain neither task-specific templates nor training examples, and achieve competitive accuracy to manually-tuned prompts across a wide range of tasks. While finetuning LMs does introduce new parameters for each downstream task, we show that this memory overhead can be substantially reduced: finetuning only the bias terms can achieve comparable or better accuracy than standard finetuning while only updating 0.1% of the parameters. All in all, we recommend finetuning LMs for few-shot learning as it is more accurate, robust to different prompts, and can be made nearly as efficient as using frozen LMs.
    What will it take to generate fairness-preserving explanations?. (arXiv:2106.13346v1 [cs.LG])
    (2 min) In situations where explanations of black-box models may be useful, the fairness of the black-box is also often a relevant concern. However, the link between the fairness of the black-box model and the behavior of explanations for the black-box is unclear. We focus on explanations applied to tabular datasets, suggesting that explanations do not necessarily preserve the fairness properties of the black-box algorithm. In other words, explanation algorithms can ignore or obscure critical relevant properties, creating incorrect or misleading explanations. More broadly, we propose future research directions for evaluating and generating explanations such that they are informative and relevant from a fairness perspective.
    Identifying malicious accounts in Blockchains using Domain Names and associated temporal properties. (arXiv:2106.13420v1 [cs.CR])
    (2 min) The rise in the adoption of blockchain technology has led to increased illegal activities by cyber-criminals costing billions of dollars. Many machine learning algorithms are applied to detect such illegal behavior. These algorithms are often trained on the transaction behavior and, in some cases, trained on the vulnerabilities that exist in the system. In our approach, we study the feasibility of using metadata such as Domain Name (DN) associated with the account in the blockchain and identify whether an account should be tagged malicious or not. Here, we leverage the temporal aspects attached to the DNs. Our results identify 144930 DNs that show malicious behavior, and out of these, 54114 DNs show persistent malicious behavior over time. Nonetheless, none of these identified malicious DNs were reported in new officially tagged malicious blockchain DNs.
    Multi-player Multi-armed Bandits with Collision-Dependent Reward Distributions. (arXiv:2106.13669v1 [cs.IT])
    (2 min) We study a new stochastic multi-player multi-armed bandits (MP-MAB) problem, where the reward distribution changes if a collision occurs on the arm. Existing literature always assumes a zero reward for involved players if collision happens, but for applications such as cognitive radio, the more realistic scenario is that collision reduces the mean reward but not necessarily to zero. We focus on the more practical no-sensing setting where players do not perceive collisions directly, and propose the Error-Correction Collision Communication (EC3) algorithm that models implicit communication as a reliable communication over noisy channel problem, for which random coding error exponent is used to establish the optimal regret that no communication protocol can beat. Finally, optimizing the tradeoff between code length and decoding error rate leads to a regret that approaches the centralized MP-MAB regret, which represents a natural lower bound. Experiments with practical error-correction codes on both synthetic and real-world datasets demonstrate the superiority of EC3. In particular, the results show that the choice of coding schemes has a profound impact on the regret performance.
    Continual Competitive Memory: A Neural System for Online Task-Free Lifelong Learning. (arXiv:2106.13300v1 [cs.LG])
    (2 min) In this article, we propose a novel form of unsupervised learning, continual competitive memory (CCM), as well as a computational framework to unify related neural models that operate under the principles of competition. The resulting neural system is shown to offer an effective approach for combating catastrophic forgetting in online continual classification problems. We demonstrate that the proposed CCM system not only outperforms other competitive learning neural models but also yields performance that is competitive with several modern, state-of-the-art lifelong learning approaches on benchmarks such as Split MNIST and Split NotMNIST. CCM yields a promising path forward for acquiring representations that are robust to interference from data streams, especially when the task is unknown to the model and must be inferred without external guidance.
    Branch Prediction as a Reinforcement Learning Problem: Why, How and Case Studies. (arXiv:2106.13429v1 [cs.LG])
    (2 min) Recent years have seen stagnating improvements to branch predictor (BP) efficacy and a dearth of fresh ideas in branch predictor design, calling for fresh thinking in this area. This paper argues that looking at BP from the viewpoint of Reinforcement Learning (RL) facilitates systematic reasoning about, and exploration of, BP designs. We describe how to apply the RL formulation to branch predictors, show that existing predictors can be succinctly expressed in this formulation, and study two RL-based variants of conventional BPs.
    CausalCity: Complex Simulations with Agency for Causal Discovery and Reasoning. (arXiv:2106.13364v1 [cs.AI])
    (2 min) The ability to perform causal and counterfactual reasoning are central properties of human intelligence. Decision-making systems that can perform these types of reasoning have the potential to be more generalizable and interpretable. Simulations have helped advance the state-of-the-art in this domain, by providing the ability to systematically vary parameters (e.g., confounders) and generate examples of the outcomes in the case of counterfactual scenarios. However, simulating complex temporal causal events in multi-agent scenarios, such as those that exist in driving and vehicle navigation, is challenging. To help address this, we present a high-fidelity simulation environment that is designed for developing algorithms for causal discovery and counterfactual reasoning in the safety-critical context. A core component of our work is to introduce \textit{agency}, such that it is simple to define and create complex scenarios using high-level definitions. The vehicles then operate with agency to complete these objectives, meaning low-level behaviors need only be controlled if necessary. We perform experiments with three state-of-the-art methods to create baselines and highlight the affordances of this environment. Finally, we highlight challenges and opportunities for future work.
    Understanding Clipping for Federated Learning: Convergence and Client-Level Differential Privacy. (arXiv:2106.13673v1 [cs.LG])
    (2 min) Providing privacy protection has been one of the primary motivations of Federated Learning (FL). Recently, there has been a line of work on incorporating the formal privacy notion of differential privacy with FL. To guarantee the client-level differential privacy in FL algorithms, the clients' transmitted model updates have to be clipped before adding privacy noise. Such clipping operation is substantially different from its counterpart of gradient clipping in the centralized differentially private SGD and has not been well-understood. In this paper, we first empirically demonstrate that the clipped FedAvg can perform surprisingly well even with substantial data heterogeneity when training neural networks, which is partly because the clients' updates become similar for several popular deep architectures. Based on this key observation, we provide the convergence analysis of a differential private (DP) FedAvg algorithm and highlight the relationship between clipping bias and the distribution of the clients' updates. To the best of our knowledge, this is the first work that rigorously investigates theoretical and empirical issues regarding the clipping operation in FL algorithms.
    Fostering Diversity in Spatial Evolutionary Generative Adversarial Networks. (arXiv:2106.13590v1 [cs.LG])
    (2 min) Generative adversary networks (GANs) suffer from training pathologies such as instability and mode collapse, which mainly arise from a lack of diversity in their adversarial interactions. Co-evolutionary GAN (CoE-GAN) training algorithms have shown to be resilient to these pathologies. This article introduces Mustangs, a spatially distributed CoE-GAN, which fosters diversity by using different loss functions during the training. Experimental analysis on MNIST and CelebA demonstrated that Mustangs trains statistically more accurate generators.
    Temporal Graph Signal Decomposition. (arXiv:2106.13517v1 [cs.LG])
    (2 min) Temporal graph signals are multivariate time series with individual components associated with nodes of a fixed graph structure. Data of this kind arises in many domains including activity of social network users, sensor network readings over time, and time course gene expression within the interaction network of a model organism. Traditional matrix decomposition methods applied to such data fall short of exploiting structural regularities encoded in the underlying graph and also in the temporal patterns of the signal. How can we take into account such structure to obtain a succinct and interpretable representation of temporal graph signals? We propose a general, dictionary-based framework for temporal graph signal decomposition (TGSD). The key idea is to learn a low-rank, joint encoding of the data via a combination of graph and time dictionaries. We propose a highly scalable decomposition algorithm for both complete and incomplete data, and demonstrate its advantage for matrix decomposition, imputation of missing values, temporal interpolation, clustering, period estimation, and rank estimation in synthetic and real-world data ranging from traffic patterns to social media activity. Our framework achieves 28% reduction in RMSE compared to baselines for temporal interpolation when as many as 75% of the observations are missing. It scales best among baselines taking under 20 seconds on 3.5 million data points and produces the most parsimonious models. To the best of our knowledge, TGSD is the first framework to jointly model graph signals by temporal and graph dictionaries.
    Federated Graph Classification over Non-IID Graphs. (arXiv:2106.13423v1 [cs.LG])
    (2 min) Federated learning has emerged as an important paradigm for training machine learning models in different domains. For graph-level tasks such as graph classification, graphs can also be regarded as a special type of data samples, which can be collected and stored in separate local systems. Similar to other domains, multiple local systems, each holding a small set of graphs, may benefit from collaboratively training a powerful graph mining model, such as the popular graph neural networks (GNNs). To provide more motivation towards such endeavors, we analyze real-world graphs from different domains to confirm that they indeed share certain graph properties that are statistically significant compared with random graphs. However, we also find that different sets of graphs, even from the same domain or same dataset, are non-IID regarding both graph structures and node features. To handle this, we propose a graph clustering federated learning (GCFL) framework that dynamically finds clusters of local systems based on the gradients of GNNs, and theoretically justify that such clusters can reduce the structure and feature heterogeneity among graphs owned by the local systems. Moreover, we observe the gradients of GNNs to be rather fluctuating in GCFL which impedes high-quality clustering, and design a gradient sequence-based clustering mechanism based on dynamic time warping (GCFL+). Extensive experimental results and in-depth analysis demonstrate the effectiveness of our proposed frameworks.
    Reinforcement Learning for Mean Field Games, with Applications to Economics. (arXiv:2106.13755v1 [math.OC])
    (2 min) Mean field games (MFG) and mean field control problems (MFC) are frameworks to study Nash equilibria or social optima in games with a continuum of agents. These problems can be used to approximate competitive or cooperative games with a large finite number of agents and have found a broad range of applications, in particular in economics. In recent years, the question of learning in MFG and MFC has garnered interest, both as a way to compute solutions and as a way to model how large populations of learners converge to an equilibrium. Of particular interest is the setting where the agents do not know the model, which leads to the development of reinforcement learning (RL) methods. After reviewing the literature on this topic, we present a two timescale approach with RL for MFG and MFC, which relies on a unified Q-learning algorithm. The main novelty of this method is to simultaneously update an action-value function and a distribution but with different rates, in a model-free fashion. Depending on the ratio of the two learning rates, the algorithm learns either the MFG or the MFC solution. To illustrate this method, we apply it to a mean field problem of accumulated consumption in finite horizon with HARA utility function, and to a trader's optimal liquidation problem.
    Learning Gradual Argumentation Frameworks using Genetic Algorithms. (arXiv:2106.13585v1 [cs.LG])
    (2 min) Gradual argumentation frameworks represent arguments and their relationships in a weighted graph. Their graphical structure and intuitive semantics makes them a potentially interesting tool for interpretable machine learning. It has been noted recently that their mechanics are closely related to neural networks, which allows learning their weights from data by standard deep learning frameworks. As a first proof of concept, we propose a genetic algorithm to simultaneously learn the structure of argumentative classification models. To obtain a well interpretable model, the fitness function balances sparseness and accuracy of the classifier. We discuss our algorithm and present first experimental results on standard benchmarks from the UCI machine learning repository. Our prototype learns argumentative classification models that are comparable to decision trees in terms of learning performance and interpretability.
    Limitations of machine learning for building energy prediction: ASHRAE Great Energy Predictor III Kaggle competition error analysis. (arXiv:2106.13475v1 [cs.LG])
    (2 min) Machine learning for building energy prediction has exploded in popularity in recent years, yet understanding its limitations and potential for improvement are lacking. The ASHRAE Great Energy Predictor III (GEPIII) Kaggle competition was the largest building energy meter machine learning competition ever held with 4,370 participants who submitted 39,403 predictions. The test data set included two years of hourly electricity, hot water, chilled water, and steam readings from 2,380 meters in 1,448 buildings at 16 locations. This paper analyzes the various sources and types of residual model error from an aggregation of the competition's top 50 solutions. This analysis reveals the limitations for machine learning using the standard model inputs of historical meter, weather, and basic building metadata. The types of error are classified according to the amount of time errors occur in each instance, abrupt versus gradual behavior, the magnitude of error, and whether the error existed on single buildings or several buildings at once from a single location. The results show machine learning models have errors within a range of acceptability on 79.1% of the test data. Lower magnitude model errors occur in 16.1% of the test data. These discrepancies can likely be addressed through additional training data sources or innovations in machine learning. Higher magnitude errors occur in 4.8% of the test data and are unlikely to be accurately predicted regardless of innovation. There is a diversity of error behavior depending on the energy meter type (electricity prediction models have unacceptable error in under 10% of test data, while hot water is over 60%) and building use type (public service less than 14%, while technology/science is just over 46%).
    Physics perception in sloshing scenes with guaranteed thermodynamic consistency. (arXiv:2106.13301v1 [cs.CV])
    (2 min) Physics perception very often faces the problem that only limited data or partial measurements on the scene are available. In this work, we propose a strategy to learn the full state of sloshing liquids from measurements of the free surface. Our approach is based on recurrent neural networks (RNN) that project the limited information available to a reduced-order manifold so as to not only reconstruct the unknown information, but also to be capable of performing fluid reasoning about future scenarios in real time. To obtain physically consistent predictions, we train deep neural networks on the reduced-order manifold that, through the employ of inductive biases, ensure the fulfillment of the principles of thermodynamics. RNNs learn from history the required hidden information to correlate the limited information with the latent space where the simulation occurs. Finally, a decoder returns data back to the high-dimensional manifold, so as to provide the user with insightful information in the form of augmented reality. This algorithm is connected to a computer vision system to test the performance of the proposed methodology with real information, resulting in a system capable of understanding and predicting future states of the observed fluid in real-time.
    Fairness in the Eyes of the Data: Certifying Machine-Learning Models. (arXiv:2009.01534v3 [cs.AI] UPDATED)
    (2 min) We present a framework that allows to certify the fairness degree of a model based on an interactive and privacy-preserving test. The framework verifies any trained model, regardless of its training process and architecture. Thus, it allows us to evaluate any deep learning model on multiple fairness definitions empirically. We tackle two scenarios, where either the test data is privately available only to the tester or is publicly known in advance, even to the model creator. We investigate the soundness of the proposed approach using theoretical analysis and present statistical guarantees for the interactive test. Finally, we provide a cryptographic technique to automate fairness testing and certified inference with only black-box access to the model at hand while hiding the participants' sensitive data.
    Decomposed Mutual Information Estimation for Contrastive Representation Learning. (arXiv:2106.13401v1 [cs.LG])
    (2 min) Recent contrastive representation learning methods rely on estimating mutual information (MI) between multiple views of an underlying context. E.g., we can derive multiple views of a given image by applying data augmentation, or we can split a sequence into views comprising the past and future of some step in the sequence. Contrastive lower bounds on MI are easy to optimize, but have a strong underestimation bias when estimating large amounts of MI. We propose decomposing the full MI estimation problem into a sum of smaller estimation problems by splitting one of the views into progressively more informed subviews and by applying the chain rule on MI between the decomposed views. This expression contains a sum of unconditional and conditional MI terms, each measuring modest chunks of the total MI, which facilitates approximation via contrastive bounds. To maximize the sum, we formulate a contrastive lower bound on the conditional MI which can be approximated efficiently. We refer to our general approach as Decomposed Estimation of Mutual Information (DEMI). We show that DEMI can capture a larger amount of MI than standard non-decomposed contrastive bounds in a synthetic setting, and learns better representations in a vision domain and for dialogue generation.
    On the (Un-)Avoidability of Adversarial Examples. (arXiv:2106.13326v1 [cs.LG])
    (2 min) The phenomenon of adversarial examples in deep learning models has caused substantial concern over their reliability. While many deep neural networks have shown impressive performance in terms of predictive accuracy, it has been shown that in many instances an imperceptible perturbation can falsely flip the network's prediction. Most research has then focused on developing defenses against adversarial attacks or learning under a worst-case adversarial loss. In this work, we take a step back and aim to provide a framework for determining whether a model's label change under small perturbation is justified (and when it is not). We carefully argue that adversarial robustness should be defined as a locally adaptive measure complying with the underlying distribution. We then suggest a definition for an adaptive robust loss, derive an empirical version of it, and develop a resulting data-augmentation framework. We prove that our adaptive data-augmentation maintains consistency of 1-nearest neighbor classification under deterministic labels and provide illustrative empirical evaluations.
    VEGN: Variant Effect Prediction with Graph Neural Networks. (arXiv:2106.13642v1 [cs.LG])
    (2 min) Genetic mutations can cause disease by disrupting normal gene function. Identifying the disease-causing mutations from millions of genetic variants within an individual patient is a challenging problem. Computational methods which can prioritize disease-causing mutations have, therefore, enormous applications. It is well-known that genes function through a complex regulatory network. However, existing variant effect prediction models only consider a variant in isolation. In contrast, we propose VEGN, which models variant effect prediction using a graph neural network (GNN) that operates on a heterogeneous graph with genes and variants. The graph is created by assigning variants to genes and connecting genes with an gene-gene interaction network. In this context, we explore an approach where a gene-gene graph is given and another where VEGN learns the gene-gene graph and therefore operates both on given and learnt edges. The graph neural network is trained to aggregate information between genes, and between genes and variants. Variants can exchange information via the genes they connect to. This approach improves the performance of existing state-of-the-art models.
    Single and Union Non-parallel Support Vector Machine Frameworks. (arXiv:1910.09734v3 [cs.LG] UPDATED)
    (2 min) Considering the classification problem, we summarize the nonparallel support vector machines with the nonparallel hyperplanes to two types of frameworks. The first type constructs the hyperplanes separately. It solves a series of small optimization problems to obtain a series of hyperplanes, but is hard to measure the loss of each sample. The other type constructs all the hyperplanes simultaneously, and it solves one big optimization problem with the ascertained loss of each sample. We give the characteristics of each framework and compare them carefully. In addition, based on the second framework, we construct a max-min distance-based nonparallel support vector machine for multiclass classification problem, called NSVM. It constructs hyperplanes with large distance margin by solving an optimization problem. Experimental results on benchmark data sets show the advantages of our NSVM.
    Federated Noisy Client Learning. (arXiv:2106.13239v1 [cs.LG])
    (2 min) Federated learning (FL) collaboratively aggregates a shared global model depending on multiple local clients, while keeping the training data decentralized in order to preserve data privacy. However, standard FL methods ignore the noisy client issue, which may harm the overall performance of the aggregated model. In this paper, we first analyze the noisy client statement, and then model noisy clients with different noise distributions (e.g., Bernoulli and truncated Gaussian distributions). To learn with noisy clients, we propose a simple yet effective FL framework, named Federated Noisy Client Learning (Fed-NCL), which is a plug-and-play algorithm and contains two main components: a data quality measurement (DQM) to dynamically quantify the data quality of each participating client, and a noise robust aggregation (NRA) to adaptively aggregate the local models of each client by jointly considering the amount of local training data and the data quality of each client. Our Fed-NCL can be easily applied in any standard FL workflow to handle the noisy client issue. Experimental results on various datasets demonstrate that our algorithm boosts the performances of different state-of-the-art systems with noisy clients.
    Connecting Sphere Manifolds Hierarchically for Regularization. (arXiv:2106.13549v1 [cs.CV])
    (2 min) This paper considers classification problems with hierarchically organized classes. We force the classifier (hyperplane) of each class to belong to a sphere manifold, whose center is the classifier of its super-class. Then, individual sphere manifolds are connected based on their hierarchical relations. Our technique replaces the last layer of a neural network by combining a spherical fully-connected layer with a hierarchical layer. This regularization is shown to improve the performance of widely used deep neural network architectures (ResNet and DenseNet) on publicly available datasets (CIFAR100, CUB200, Stanford dogs, Stanford cars, and Tiny-ImageNet).
    Multitask Learning for Citation Purpose Classification. (arXiv:2106.13275v1 [cs.LG])
    (2 min) We present our entry into the 2021 3C Shared Task Citation Context Classification based on Purpose competition. The goal of the competition is to classify a citation in a scientific article based on its purpose. This task is important because it could potentially lead to more comprehensive ways of summarizing the purpose and uses of scientific articles, but it is also difficult, mainly due to the limited amount of available training data in which the purposes of each citation have been hand-labeled, along with the subjectivity of these labels. Our entry in the competition is a multi-task model that combines multiple modules designed to handle the problem from different perspectives, including hand-generated linguistic features, TF-IDF features, and an LSTM-with-attention model. We also provide an ablation study and feature analysis whose insights could lead to future work.
    A variational autoencoder approach for choice set generation and implicit perception of alternatives in choice modeling. (arXiv:2106.13319v1 [cs.AI])
    (2 min) This paper derives the generalized extreme value (GEV) model with implicit availability/perception (IAP) of alternatives and proposes a variational autoencoder (VAE) approach for choice set generation and implicit perception of alternatives. Specifically, the cross-nested logit (CNL) model with IAP is derived as an example of IAP-GEV models. The VAE approach is adapted to model the choice set generation process, in which the likelihood of perceiving chosen alternatives in the choice set is maximized. The VAE approach for route choice set generation is exemplified using a real dataset. IAP- CNL model estimated has the best performance in terms of goodness-of-fit and prediction performance, compared to multinomial logit models and conventional choice set generation methods.
    Prediction of geophysical properties of rocks on rare well data and attributes of seismic waves by machine learning methods on the example of the Achimov formation. (arXiv:2106.13274v1 [physics.geo-ph])
    (2 min) Purpose of this research is to forecast the development of sand bodies in productive sediments based on well log data and seismic attributes. The object of the study is the productive intervals of Achimov sedimentary complex in the part of oil field located in Western Siberia. The research shows a technological stack of machine learning algorithms, methods for enriching the source data with synthetic ones and algorithms for creating new features. The result was the model of regression relationship between the values of natural radioactivity of rocks and seismic wave field attributes with an acceptable prediction quality. Acceptable quality of the forecast is confirmed both by model cross validation, and by the data obtained following the results of new well.
    Transient Stability Analysis with Physics-Informed Neural Networks. (arXiv:2106.13638v1 [cs.LG])
    (2 min) Solving the ordinary differential equations that govern the power system is an indispensable part in transient stability analysis. However, the traditionally applied methods either carry a significant computational burden, require model simplifications, or use overly conservative surrogate models. Neural networks can circumvent these limitations but are faced with high demands on the used datasets. Furthermore, they are agnostic to the underlying governing equations. Physics-informed neural network tackle this problem and we explore their advantages and challenges in this paper. We illustrate the findings on the Kundur two-area system and highlight possible pathways forward in developing this method further.
    Self-training Converts Weak Learners to Strong Learners in Mixture Models. (arXiv:2106.13805v1 [cs.LG])
    (2 min) We consider a binary classification problem when the data comes from a mixture of two isotropic distributions satisfying concentration and anti-concentration properties enjoyed by log-concave distributions among others. We show that there exists a universal constant $C_{\mathrm{err}}>0$ such that if a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ can achieve classification error at most $C_{\mathrm{err}}$, then for any $\varepsilon>0$, an iterative self-training algorithm initialized at $\boldsymbol{\beta}_0 := \boldsymbol{\beta}_{\mathrm{pl}}$ using pseudolabels $\hat y = \mathrm{sgn}(\langle \boldsymbol{\beta}_t, \mathbf{x}\rangle)$ and using at most $\tilde O(d/\varepsilon^2)$ unlabeled examples suffices to learn the Bayes-optimal classifier up to $\varepsilon$ error, where $d$ is the ambient dimension. That is, self-training converts weak learners to strong learners using only unlabeled examples. We additionally show that by running gradient descent on the logistic loss one can obtain a pseudolabeler $\boldsymbol{\beta}_{\mathrm{pl}}$ with classification error $C_{\mathrm{err}}$ using only $O(d)$ labeled examples (i.e., independent of $\varepsilon$). Together our results imply that mixture models can be learned to within $\varepsilon$ of the Bayes-optimal accuracy using at most $O(d)$ labeled examples and $\tilde O(d/\varepsilon^2)$ unlabeled examples by way of a semi-supervised self-training algorithm.
    Hate Speech Detection in Clubhouse. (arXiv:2106.13238v1 [cs.LG])
    (2 min) With high prevalence of offensive language against the minorities in social media, counter hate speech generation is considered as an automatic way to tackle this challenge. The counter hate speeches are supposed to appear as a third voice to educate people and keep the social red lines bold without limiting the freedom of speech principles. The counter hate speech generation is based on the optimistic assumption that, any attempt to intervene the hate speeches in social media can play a positive role in this context. Beyond that, previous works ignored to investigate the sequence of comments before and after counter speech. To the best of our knowledge, no attempt has been made to measure the counter hate speech impact from statistical point of view. In this paper, we take the first step in this direction by measuring the counter hate speech impact on the next comments in terms of Google Perspective Scores. Furthermore, our experiments show that, counter hate speech can cause negative impacts, a phenomena which is called aggression in social media.
    Online Self-Attentive Gated RNNs for Real-Time Speaker Separation. (arXiv:2106.13493v1 [eess.AS])
    (2 min) Deep neural networks have recently shown great success in the task of blind source separation, both under monaural and binaural settings. Although these methods were shown to produce high-quality separations, they were mainly applied under offline settings, in which the model has access to the full input signal while separating the signal. In this study, we convert a non-causal state-of-the-art separation model into a causal and real-time model and evaluate its performance under both online and offline settings. We compare the performance of the proposed model to several baseline methods under anechoic, noisy, and noisy-reverberant recording conditions while exploring both monaural and binaural inputs and outputs. Our findings shed light on the relative difference between causal and non-causal models when performing separation. Our stateful implementation for online separation leads to a minor drop in performance compared to the offline model; 0.8dB for monaural inputs and 0.3dB for binaural inputs while reaching a real-time factor of 0.65. Samples can be found under the following link: https://kwanum.github.io/sagrnnc-stream-results/.
    Proxy Convexity: A Unified Framework for the Analysis of Neural Networks Trained by Gradient Descent. (arXiv:2106.13792v1 [cs.LG])
    (2 min) Although the optimization objectives for learning neural networks are highly non-convex, gradient-based methods have been wildly successful at learning neural networks in practice. This juxtaposition has led to a number of recent studies on provable guarantees for neural networks trained by gradient descent. Unfortunately, the techniques in these works are often highly specific to the problem studied in each setting, relying on different assumptions on the distribution, optimization parameters, and network architectures, making it difficult to generalize across different settings. In this work, we propose a unified non-convex optimization framework for the analysis of neural network training. We introduce the notions of proxy convexity and proxy Polyak-Lojasiewicz (PL) inequalities, which are satisfied if the original objective function induces a proxy objective function that is implicitly minimized when using gradient methods. We show that stochastic gradient descent (SGD) on objectives satisfying proxy convexity or the proxy PL inequality leads to efficient guarantees for proxy objective functions. We further show that many existing guarantees for neural networks trained by gradient descent can be unified through proxy convexity and proxy PL inequalities.
    Reliable Graph Neural Network Explanations Through Adversarial Training. (arXiv:2106.13427v1 [cs.LG])
    (2 min) Graph neural network (GNN) explanations have largely been facilitated through post-hoc introspection. While this has been deemed successful, many post-hoc explanation methods have been shown to fail in capturing a model's learned representation. Due to this problem, it is worthwhile to consider how one might train a model so that it is more amenable to post-hoc analysis. Given the success of adversarial training in the computer vision domain to train models with more reliable representations, we propose a similar training paradigm for GNNs and analyze the respective impact on a model's explanations. In instances without ground truth labels, we also determine how well an explanation method is utilizing a model's learned representation through a new metric and demonstrate adversarial training can help better extract domain-relevant insights in chemistry.
    Geometric learning of the conformational dynamics of molecules using dynamic graph neural networks. (arXiv:2106.13277v1 [cs.LG])
    (2 min) We apply a temporal edge prediction model for weighted dynamic graphs to predict time-dependent changes in molecular structure. Each molecule is represented as a complete graph in which each atom is a vertex and all vertex pairs are connected by an edge weighted by the Euclidean distance between atom pairs. We ingest a sequence of complete molecular graphs into a dynamic graph neural network (GNN) to predict the graph at the next time step. Our dynamic GNN predicts atom-to-atom distances with a mean absolute error of 0.017 \r{A}, which is considered ``chemically accurate'' for molecular simulations. We also explored the transferability of a trained network to new molecular systems and found that finetuning with less than 10% of the total trajectory provides a mean absolute error of the same order of magnitude as that when training from scratch on the full molecular trajectory.
    Generalized One-Class Learning Using Pairs of Complementary Classifiers. (arXiv:2106.13272v1 [cs.CV])
    (2 min) One-class learning is the classic problem of fitting a model to the data for which annotations are available only for a single class. In this paper, we explore novel objectives for one-class learning, which we collectively refer to as Generalized One-class Discriminative Subspaces (GODS). Our key idea is to learn a pair of complementary classifiers to flexibly bound the one-class data distribution, where the data belongs to the positive half-space of one of the classifiers in the complementary pair and to the negative half-space of the other. To avoid redundancy while allowing non-linearity in the classifier decision surfaces, we propose to design each classifier as an orthonormal frame and seek to learn these frames via jointly optimizing for two conflicting objectives, namely: i) to minimize the distance between the two frames, and ii) to maximize the margin between the frames and the data. The learned orthonormal frames will thus characterize a piecewise linear decision surface that allows for efficient inference, while our objectives seek to bound the data within a minimal volume that maximizes the decision margin, thereby robustly capturing the data distribution. We explore several variants of our formulation under different constraints on the constituent classifiers, including kernelized feature maps. We demonstrate the empirical benefits of our approach via experiments on data from several applications in computer vision, such as anomaly detection in video sequences, human poses, and human activities. We also explore the generality and effectiveness of GODS for non-vision tasks via experiments on several UCI datasets, demonstrating state-of-the-art results.
    Realistic molecule optimization on a learned graph manifold. (arXiv:2106.13318v1 [physics.chem-ph])
    (2 min) Deep learning based molecular graph generation and optimization has recently been attracting attention due to its great potential for de novo drug design. On the one hand, recent models are able to efficiently learn a given graph distribution, and many approaches have proven very effective to produce a molecule that maximizes a given score. On the other hand, it was shown by previous studies that generated optimized molecules are often unrealistic, even with the inclusion of mechanics to enforce similarity to a dataset of real drug molecules. In this work we use a hybrid approach, where the dataset distribution is learned using an autoregressive model while the score optimization is done using the Metropolis algorithm, biased toward the learned distribution. We show that the resulting method, that we call learned realism sampling (LRS), produces empirically more realistic molecules and outperforms all recent baselines in the task of molecule optimization with similarity constraints.
    Domain-guided Machine Learning for Remotely Sensed In-Season Crop Growth Estimation. (arXiv:2106.13323v1 [cs.LG])
    (2 min) Advanced machine learning techniques have been used in remote sensing (RS) applications such as crop mapping and yield prediction, but remain under-utilized for tracking crop progress. In this study, we demonstrate the use of agronomic knowledge of crop growth drivers in a Long Short-Term Memory-based, Domain-guided neural network (DgNN) for in-season crop progress estimation. The DgNN uses a branched structure and attention to separate independent crop growth drivers and capture their varying importance throughout the growing season. The DgNN is implemented for corn, using RS data in Iowa for the period 2003-2019, with USDA crop progress reports used as ground truth. State-wide DgNN performance shows significant improvement over sequential and dense-only NN structures, and a widely-used Hidden Markov Model method. The DgNN had a 3.5% higher Nash-Sutfliffe efficiency over all growth stages and 33% more weeks with highest cosine similarity than the other NNs during test years. The DgNN and Sequential NN were more robust during periods of abnormal crop progress, though estimating the Silking-Grainfill transition was difficult for all methods. Finally, Uniform Manifold Approximation and Projection visualizations of layer activations showed how LSTM-based NNs separate crop growth time-series differently from a dense-only structure. Results from this study exhibit both the viability of NNs in crop growth stage estimation (CGSE) and the benefits of using domain knowledge. The DgNN methodology presented here can be extended to provide near-real time CGSE of other crops.
    Deep Learning for High-Impedance Fault Detection: Convolutional Autoencoders. (arXiv:2106.13276v1 [cs.LG])
    (2 min) High-impedance faults (HIF) are difficult to detect because of their low current amplitude and highly diverse characteristics. In recent years, machine learning (ML) has been gaining popularity in HIF detection because ML techniques learn patterns from data and successfully detect HIFs. However, as these methods are based on supervised learning, they fail to reliably detect any scenario, fault or non-fault, not present in the training data. Consequently, this paper takes advantage of unsupervised learning and proposes a convolutional autoencoder framework for HIF detection (CAE-HIFD). Contrary to the conventional autoencoders that learn from normal behavior, the convolutional autoencoder (CAE) in CAE-HIFD learns only from the HIF signals eliminating the need for presence of diverse non-HIF scenarios in the CAE training. CAE distinguishes HIFs from non-HIF operating conditions by employing cross-correlation. To discriminate HIFs from transient disturbances such as capacitor or load switching, CAE-HIFD uses kurtosis, a statistical measure of the probability distribution shape. The performance evaluation studies conducted using the IEEE 13-node test feeder indicate that the CAE-HIFD reliably detects HIFs, outperforms the state-of-the-art HIF detection techniques, and is robust against noise.
    Evaluation of Deep-Learning-Based Voice Activity Detectors and Room Impulse Response Models in Reverberant Environments. (arXiv:2106.13511v1 [cs.SD])
    (2 min) State-of-the-art deep-learning-based voice activity detectors (VADs) are often trained with anechoic data. However, real acoustic environments are generally reverberant, which causes the performance to significantly deteriorate. To mitigate this mismatch between training data and real data, we simulate an augmented training set that contains nearly five million utterances. This extension comprises of anechoic utterances and their reverberant modifications, generated by convolutions of the anechoic utterances with a variety of room impulse responses (RIRs). We consider five different models to generate RIRs, and five different VADs that are trained with the augmented training set. We test all trained systems in three different real reverberant environments. Experimental results show $20\%$ increase on average in accuracy, precision and recall for all detectors and response models, compared to anechoic training. Furthermore, one of the RIR models consistently yields better performance than the other models, for all the tested VADs. Additionally, one of the VADs consistently outperformed the other VADs in all experiments.
    Promises and Pitfalls of Black-Box Concept Learning Models. (arXiv:2106.13314v1 [cs.LG])
    (2 min) Machine learning models that incorporate concept learning as an intermediate step in their decision making process can match the performance of black-box predictive models while retaining the ability to explain outcomes in human understandable terms. However, we demonstrate that the concept representations learned by these models encode information beyond the pre-defined concepts, and that natural mitigation strategies do not fully work, rendering the interpretation of the downstream prediction misleading. We describe the mechanism underlying the information leakage and suggest recourse for mitigating its effects.
    You are AllSet: A Multiset Function Framework for Hypergraph Neural Networks. (arXiv:2106.13264v1 [cs.LG])
    (0 min) Hypergraphs are used to model higher-order interactions amongst agents and there exist many practically relevant instances of hypergraph datasets. To enable efficient processing of hypergraph-structured data, several hypergraph neural network platforms have been proposed for learning hypergraph properties and structure, with a special focus on node classification. However, almost all existing methods use heuristic propagation rules and offer suboptimal performance on many datasets. We propose AllSet, a new hypergraph neural network paradigm that represents a highly general framework for (hyper)graph neural networks and for the first time implements hypergraph neural network layers as compositions of two multiset functions that can be efficiently learned for each task and each dataset. Furthermore, AllSet draws on new connections between hypergraph neural networks and recent advances in deep learning of multiset functions. In particular, the proposed architecture utilizes Deep Sets and Set Transformer architectures that allow for significant modeling flexibility and offer high expressive power. To evaluate the performance of AllSet, we conduct the most extensive experiments to date involving ten known benchmarking datasets and three newly curated datasets that represent significant challenges for hypergraph node classification. The results demonstrate that AllSet has the unique ability to consistently either match or outperform all other hypergraph neural networks across the tested datasets. Our implementation and dataset will be released upon acceptance.
    byteSteady: Fast Classification Using Byte-Level n-Gram Embeddings. (arXiv:2106.13302v1 [cs.CL])
    (2 min) This article introduces byteSteady -- a fast model for classification using byte-level n-gram embeddings. byteSteady assumes that each input comes as a sequence of bytes. A representation vector is produced using the averaged embedding vectors of byte-level n-grams, with a pre-defined set of n. The hashing trick is used to reduce the number of embedding vectors. This input representation vector is then fed into a linear classifier. A straightforward application of byteSteady is text classification. We also apply byteSteady to one type of non-language data -- DNA sequences for gene classification. For both problems we achieved competitive classification results against strong baselines, suggesting that byteSteady can be applied to both language and non-language data. Furthermore, we find that simple compression using Huffman coding does not significantly impact the results, which offers an accuracy-speed trade-off previously unexplored in machine learning.

2021-06-25

  • cs.CL updates on arXiv.org

    Context Transformer with Stacked Pointer Networks for Conversational Question Answering over Knowledge Graphs. (arXiv:2103.07766v2 [cs.CL] UPDATED)
    (2 min) Neural semantic parsing approaches have been widely used for Question Answering (QA) systems over knowledge graphs. Such methods provide the flexibility to handle QA datasets with complex queries and a large number of entities. In this work, we propose a novel framework named CARTON, which performs multi-task semantic parsing for handling the problem of conversational question answering over a large-scale knowledge graph. Our framework consists of a stack of pointer networks as an extension of a context transformer model for parsing the input question and the dialog history. The framework generates a sequence of actions that can be executed on the knowledge graph. We evaluate CARTON on a standard dataset for complex sequential question answering on which CARTON outperforms all baselines. Specifically, we observe performance improvements in F1-score on eight out of ten question types compared to the previous state of the art. For logical reasoning questions, an improvement of 11 absolute points is reached.
    Empirical Study of Transformers for Source Code. (arXiv:2010.07987v2 [cs.LG] UPDATED)
    (2 min) Initially developed for natural language processing (NLP), Transformers are now widely used for source code processing, due to the format similarity between source code and text. In contrast to natural language, source code is strictly structured, i.e., it follows the syntax of the programming language. Several recent works develop Transformer modifications for capturing syntactic information in source code. The drawback of these works is that they do not compare to each other and consider different tasks. In this work, we conduct a thorough empirical study of the capabilities of Transformers to utilize syntactic information in different tasks. We consider three tasks (code completion, function naming and bug fixing) and re-implement different syntax-capturing modifications in a unified framework. We show that Transformers are able to make meaningful predictions based purely on syntactic information and underline the best practices of taking the syntactic information into account for improving the performance of the model.
    On the Influence of Machine Translation on Language Origin Obfuscation. (arXiv:2106.12830v1 [cs.CL])
    (2 min) In the last decade, machine translation has become a popular means to deal with multilingual digital content. By providing higher quality translations, obfuscating the source language of a text becomes more attractive. In this paper, we analyze the ability to detect the source language from the translated output of two widely used commercial machine translation systems by utilizing machine-learning algorithms with basic textual features like n-grams. Evaluations show that the source language can be reconstructed with high accuracy for documents that contain a sufficient amount of translated text. In addition, we analyze how the document size influences the performance of the prediction, as well as how limiting the set of possible source languages improves the classification accuracy.
    TagRuler: Interactive Tool for Span-Level Data Programming by Demonstration. (arXiv:2106.12767v1 [cs.CL])
    (2 min) Despite rapid developments in the field of machine learning research, collecting high-quality labels for supervised learning remains a bottleneck for many applications. This difficulty is exacerbated by the fact that state-of-the-art models for NLP tasks are becoming deeper and more complex, often increasing the amount of training data required even for fine-tuning. Weak supervision methods, including data programming, address this problem and reduce the cost of label collection by using noisy label sources for supervision. However, until recently, data programming was only accessible to users who knew how to program. To bridge this gap, the Data Programming by Demonstration framework was proposed to facilitate the automatic creation of labeling functions based on a few examples labeled by a domain expert. This framework has proven successful for generating high-accuracy labeling models for document classification. In this work, we extend the DPBD framework to span-level annotation tasks, arguably one of the most time-consuming NLP labeling tasks. We built a novel tool, TagRuler, that makes it easy for annotators to build span-level labeling functions without programming and encourages them to explore trade-offs between different labeling models and active learning strategies. We empirically demonstrated that an annotator could achieve a higher F1 score using the proposed tool compared to manual labeling for different span-level annotation tasks.
    Where are we in semantic concept extraction for Spoken Language Understanding?. (arXiv:2106.13045v1 [cs.CL])
    (2 min) Spoken language understanding (SLU) topic has seen a lot of progress these last three years, with the emergence of end-to-end neural approaches. Spoken language understanding refers to natural language processing tasks related to semantic extraction from speech signal, like named entity recognition from speech or slot filling task in a context of human-machine dialogue. Classically, SLU tasks were processed through a cascade approach that consists in applying, firstly, an automatic speech recognition process, followed by a natural language processing module applied to the automatic transcriptions. These three last years, end-to-end neural approaches, based on deep neural networks, have been proposed in order to directly extract the semantics from speech signal, by using a single neural model. More recent works on self-supervised training with unlabeled data open new perspectives in term of performance for automatic speech recognition and natural language processing. In this paper, we present a brief overview of the recent advances on the French MEDIA benchmark dataset for SLU, with or without the use of additional data. We also present our last results that significantly outperform the current state-of-the-art with a Concept Error Rate (CER) of 11.2%, instead of 13.6% for the last state-of-the-art system presented this year.
    Learning Language and Multimodal Privacy-Preserving Markers of Mood from Mobile Data. (arXiv:2106.13213v1 [cs.LG])
    (2 min) Mental health conditions remain underdiagnosed even in countries with common access to advanced medical care. The ability to accurately and efficiently predict mood from easily collectible data has several important implications for the early detection, intervention, and treatment of mental health disorders. One promising data source to help monitor human behavior is daily smartphone usage. However, care must be taken to summarize behaviors without identifying the user through personal (e.g., personally identifiable information) or protected (e.g., race, gender) attributes. In this paper, we study behavioral markers of daily mood using a recent dataset of mobile behaviors from adolescent populations at high risk of suicidal behaviors. Using computational models, we find that language and multimodal representations of mobile typed text (spanning typed characters, words, keystroke timings, and app usage) are predictive of daily mood. However, we find that models trained to predict mood often also capture private user identities in their intermediate representations. To tackle this problem, we evaluate approaches that obfuscate user identity while remaining predictive. By combining multimodal representations with privacy-preserving learning, we are able to push forward the performance-privacy frontier.
    Unsupervised Topic Segmentation of Meetings with BERT Embeddings. (arXiv:2106.12978v1 [cs.LG])
    (2 min) Topic segmentation of meetings is the task of dividing multi-person meeting transcripts into topic blocks. Supervised approaches to the problem have proven intractable due to the difficulties in collecting and accurately annotating large datasets. In this paper we show how previous unsupervised topic segmentation methods can be improved using pre-trained neural architectures. We introduce an unsupervised approach based on BERT embeddings that achieves a 15.5% reduction in error rate over existing unsupervised approaches applied to two popular datasets for meeting transcripts.
    Explaining NLP Models via Minimal Contrastive Editing (MiCE). (arXiv:2012.13985v2 [cs.CL] UPDATED)
    (2 min) Humans have been shown to give contrastive explanations, which explain why an observed event happened rather than some other counterfactual event (the contrast case). Despite the influential role that contrastivity plays in how humans explain, this property is largely missing from current methods for explaining NLP models. We present Minimal Contrastive Editing (MiCE), a method for producing contrastive explanations of model predictions in the form of edits to inputs that change model outputs to the contrast case. Our experiments across three tasks--binary sentiment classification, topic classification, and multiple-choice question answering--show that MiCE is able to produce edits that are not only contrastive, but also minimal and fluent, consistent with human contrastive edits. We demonstrate how MiCE edits can be used for two use cases in NLP system development--debugging incorrect model outputs and uncovering dataset artifacts--and thereby illustrate that producing contrastive explanations is a promising research direction for model interpretability.
    Towards Understanding and Mitigating Social Biases in Language Models. (arXiv:2106.13219v1 [cs.CL])
    (2 min) As machine learning methods are deployed in real-world settings such as healthcare, legal systems, and social science, it is crucial to recognize how they shape social biases and stereotypes in these sensitive decision-making processes. Among such real-world deployments are large-scale pretrained language models (LMs) that can be potentially dangerous in manifesting undesirable representational biases - harmful biases resulting from stereotyping that propagate negative generalizations involving gender, race, religion, and other social constructs. As a step towards improving the fairness of LMs, we carefully define several sources of representational biases before proposing new benchmarks and metrics to measure them. With these tools, we propose steps towards mitigating social biases during text generation. Our empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information for high-fidelity text generation, thereby pushing forward the performance-fairness Pareto frontier.
    Introducing Orthogonal Constraint in Structural Probes. (arXiv:2012.15228v2 [cs.CL] UPDATED)
    (2 min) With the recent success of pre-trained models in NLP, a significant focus was put on interpreting their representations. One of the most prominent approaches is structural probing (Hewitt and Manning, 2019), where a linear projection of word embeddings is performed in order to approximate the topology of dependency structures. In this work, we introduce a new type of structural probing, where the linear projection is decomposed into 1. isomorphic space rotation; 2. linear scaling that identifies and scales the most relevant dimensions. In addition to syntactic dependency, we evaluate our method on novel tasks (lexical hypernymy and position in a sentence). We jointly train the probes for multiple tasks and experimentally show that lexical and syntactic information is separated in the representations. Moreover, the orthogonal constraint makes the Structural Probes less vulnerable to memorization.
    Conversational Question Answering over Knowledge Graphs with Transformer and Graph Attention Networks. (arXiv:2104.01569v2 [cs.CL] UPDATED)
    (2 min) This paper addresses the task of (complex) conversational question answering over a knowledge graph. For this task, we propose LASAGNE (muLti-task semAntic parSing with trAnsformer and Graph atteNtion nEtworks). It is the first approach, which employs a transformer architecture extended with Graph Attention Networks for multi-task neural semantic parsing. LASAGNE uses a transformer model for generating the base logical forms, while the Graph Attention model is used to exploit correlations between (entity) types and predicates to produce node representations. LASAGNE also includes a novel entity recognition module which detects, links, and ranks all relevant entities in the question context. We evaluate LASAGNE on a standard dataset for complex sequential question answering, on which it outperforms existing baseline averages on all question types. Specifically, we show that LASAGNE improves the F1-score on eight out of ten question types; in some cases, the increase in F1-score is more than 20% compared to the state of the art.
    UXLA: A Robust Unsupervised Data Augmentation Framework for {Zero-Resource} Cross-Lingual NLP. (arXiv:2004.13240v3 [cs.CL] UPDATED)
    (2 min) Transfer learning has yielded state-of-the-art (SoTA) results in many supervised NLP tasks. However, annotated data for every target task in every target language is rare, especially for low-resource languages. We propose UXLA, a novel unsupervised data augmentation framework for zero-resource transfer learning scenarios. In particular, UXLA aims to solve cross-lingual adaptation problems from a source language task distribution to an unknown target language task distribution, assuming no training label in the target language. At its core, UXLA performs simultaneous self-training with data augmentation and unsupervised sample selection. To show its effectiveness, we conduct extensive experiments on three diverse zero-resource cross-lingual transfer tasks. UXLA achieves SoTA results in all the tasks, outperforming the baselines by a good margin. With an in-depth framework dissection, we demonstrate the cumulative contributions of different components to its success.
    Emotion Carrier Recognition from Personal Narratives. (arXiv:2008.07481v2 [cs.CL] UPDATED)
    (2 min) Personal Narratives (PN) - recollections of facts, events, and thoughts from one's own experience - are often used in everyday conversations. So far, PNs have mainly been explored for tasks such as valence prediction or emotion classification (e.g. happy, sad). However, these tasks might overlook more fine-grained information that could prove to be relevant for understanding PNs. In this work, we propose a novel task for Narrative Understanding: Emotion Carrier Recognition (ECR). Emotion carriers, the text fragments that carry the emotions of the narrator (e.g. loss of a grandpa, high school reunion), provide a fine-grained description of the emotion state. We explore the task of ECR in a corpus of PNs manually annotated with emotion carriers and investigate different machine learning models for the task. We propose evaluation strategies for ECR including metrics that can be appropriate for different tasks.
    A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021. (arXiv:2106.13033v1 [cs.CV])
    (2 min) In this paper, inspired by the successes of visionlanguage pre-trained models and the benefits from training with adversarial attacks, we present a novel transformerbased cross-modal fusion modeling by incorporating the both notions for VQA challenge 2021. Specifically, the proposed model is on top of the architecture of VinVL model [19], and the adversarial training strategy [4] is applied to make the model robust and generalized. Moreover, two implementation tricks are also used in our system to obtain better results. The experiments demonstrate that the novel framework can achieve 76.72% on VQAv2 test-std set.
    QASR: QCRI Aljazeera Speech Resource -- A Large Scale Annotated Arabic Speech Corpus. (arXiv:2106.13000v1 [cs.CL])
    (2 min) We introduce the largest transcribed Arabic speech corpus, QASR, collected from the broadcast domain. This multi-dialect speech dataset contains 2,000 hours of speech sampled at 16kHz crawled from Aljazeera news channel. The dataset is released with lightly supervised transcriptions, aligned with the audio segments. Unlike previous datasets, QASR contains linguistically motivated segmentation, punctuation, speaker information among others. QASR is suitable for training and evaluating speech recognition systems, acoustics- and/or linguistics- based Arabic dialect identification, punctuation restoration, speaker identification, speaker linking, and potentially other NLP modules for spoken data. In addition to QASR transcription, we release a dataset of 130M words to aid in designing and training a better language model. We show that end-to-end automatic speech recognition trained on QASR reports a competitive word error rate compared to the previous MGB-2 corpus. We report baseline results for downstream natural language processing tasks such as named entity recognition using speech transcript. We also report the first baseline for Arabic punctuation restoration. We make the corpus available for the research community.
    Exploring Self-Identified Counseling Expertise in Online Support Forums. (arXiv:2106.12976v1 [cs.CL])
    (2 min) A growing number of people engage in online health forums, making it important to understand the quality of the advice they receive. In this paper, we explore the role of expertise in responses provided to help-seeking posts regarding mental health. We study the differences between (1) interactions with peers; and (2) interactions with self-identified mental health professionals. First, we show that a classifier can distinguish between these two groups, indicating that their language use does in fact differ. To understand this difference, we perform several analyses addressing engagement aspects, including whether their comments engage the support-seeker further as well as linguistic aspects, such as dominant language and linguistic style matching. Our work contributes toward the developing efforts of understanding how health experts engage with health information- and support-seekers in social networks. More broadly, it is a step toward a deeper understanding of the styles of interactions that cultivate supportive engagement in online communities.
    Splitting EUD graphs into trees: A quick and clatty approach. (arXiv:2106.13155v1 [cs.CL])
    (2 min) We present the system submission from the FASTPARSE team for the EUD Shared Task at IWPT 2021. We engaged in the task last year by focusing on efficiency. This year we have focused on experimenting with new ideas on a limited time budget. Our system is based on splitting the EUD graph into several trees, based on linguistic criteria. We predict these trees using a sequence-labelling parser and combine them into an EUD graph. The results were relatively poor, although not a total disaster and could probably be improved with some polishing of the system's rough edges.
    Discovering novel drug-supplement interactions using a dietary supplements knowledge graph generated from the biomedical literature. (arXiv:2106.12741v1 [cs.IR])
    (2 min) OBJECTIVE: Leverage existing biomedical NLP tools and DS domain terminology to produce a novel and comprehensive knowledge graph containing dietary supplement (DS) information for discovering interactions between DS and drugs, or Drug-Supplement Interactions (DSI). MATERIALS AND METHODS: We created SemRepDS (an extension of SemRep), capable of extracting semantic relations from abstracts by leveraging a DS-specific terminology (iDISK) containing 28,884 DS terms not found in the UMLS. PubMed abstracts were processed using SemRepDS to generate semantic relations, which were then filtered using a PubMedBERT-based model to remove incorrect relations before generating our knowledge graph (SuppKG). Two pathways are used to identify potential DS-Drug interactions which are then evaluated by medical professionals for mechanistic plausibility. RESULTS: Comparison analysis found that SemRepDS returned 206.9% more DS relations and 158.5% more DS entities than SemRep. The fine-tuned BERT model obtained an F1 score of 0.8605 and removed 43.86% of the relations, improving the precision of the relations by 26.4% compared to pre-filtering. SuppKG consists of 2,928 DS-specific nodes. Manual review of findings identified 44 (88%) proposed DS-Gene-Drug and 32 (64%) proposed DS-Gene1-Function-Gene2-Drug pathways to be mechanistically plausible. DISCUSSION: The additional relations extracted using SemRepDS generated SuppKG that was used to find plausible DSI not found in the current literature. By the nature of the SuppKG, these interactions are unlikely to have been found using SemRep without the expanded DS terminology. CONCLUSION: We successfully extend SemRep to include DS information and produce SuppKG which can be used to find potential DS-Drug interactions.
    Bidding via Clustering Ads Intentions: an Efficient Search Engine Marketing System for E-commerce. (arXiv:2106.12700v1 [cs.CL])
    (2 min) With the increasing scale of search engine marketing, designing an efficient bidding system is becoming paramount for the success of e-commerce companies. The critical challenges faced by a modern industrial-level bidding system include: 1. the catalog is enormous, and the relevant bidding features are of high sparsity; 2. the large volume of bidding requests induces significant computation burden to both the offline and online serving. Leveraging extraneous user-item information proves essential to mitigate the sparsity issue, for which we exploit the natural language signals from the users' query and the contextual knowledge from the products. In particular, we extract the vector representations of ads via the Transformer model and leverage their geometric relation to building collaborative bidding predictions via clustering. The two-step procedure also significantly reduces the computation stress of bid evaluation and optimization. In this paper, we introduce the end-to-end structure of the bidding system for search engine marketing for Walmart e-commerce, which successfully handles tens of millions of bids each day. We analyze the online and offline performances of our approach and discuss how we find it as a production-efficient solution.
    Evaluation of Representation Models for Text Classification with AutoML Tools. (arXiv:2106.12798v1 [cs.CL])
    (2 min) Automated Machine Learning (AutoML) has gained increasing success on tabular data in recent years. However, processing unstructured data like text is a challenge and not widely supported by open-source AutoML tools. This work compares three manually created text representations and text embeddings automatically created by AutoML tools. Our benchmark includes four popular open-source AutoML tools and eight datasets for text classification purposes. The results show that straightforward text representations perform better than AutoML tools with automatically created text embeddings.
    Multilingual transfer of acoustic word embeddings improves when training on languages related to the target zero-resource language. (arXiv:2106.12834v1 [cs.CL])
    (2 min) Acoustic word embedding models map variable duration speech segments to fixed dimensional vectors, enabling efficient speech search and discovery. Previous work explored how embeddings can be obtained in zero-resource settings where no labelled data is available in the target language. The current best approach uses transfer learning: a single supervised multilingual model is trained using labelled data from multiple well-resourced languages and then applied to a target zero-resource language (without fine-tuning). However, it is still unclear how the specific choice of training languages affect downstream performance. Concretely, here we ask whether it is beneficial to use training languages related to the target. Using data from eleven languages spoken in Southern Africa, we experiment with adding data from different language families while controlling for the amount of data per language. In word discrimination and query-by-example search evaluations, we show that training on languages from the same family gives large improvements. Through finer-grained analysis, we show that training on even just a single related language gives the largest gain. We also find that adding data from unrelated languages generally doesn't hurt performance.
    OKGIT: Open Knowledge Graph Link Prediction with Implicit Types. (arXiv:2106.12806v1 [cs.CL])
    (2 min) Open Knowledge Graphs (OpenKG) refer to a set of (head noun phrase, relation phrase, tail noun phrase) triples such as (tesla, return to, new york) extracted from a corpus using OpenIE tools. While OpenKGs are easy to bootstrap for a domain, they are very sparse and far from being directly usable in an end task. Therefore, the task of predicting new facts, i.e., link prediction, becomes an important step while using these graphs in downstream tasks such as text comprehension, question answering, and web search query recommendation. Learning embeddings for OpenKGs is one approach for link prediction that has received some attention lately. However, on careful examination, we found that current OpenKG link prediction algorithms often predict noun phrases (NPs) with incompatible types for given noun and relation phrases. We address this problem in this work and propose OKGIT that improves OpenKG link prediction using novel type compatibility score and type regularization. With extensive experiments on multiple datasets, we show that the proposed method achieves state-of-the-art performance while producing type compatible NPs in the link prediction task.
    Modeling Diagnostic Label Correlation for Automatic ICD Coding. (arXiv:2106.12800v1 [cs.CL])
    (2 min) Given the clinical notes written in electronic health records (EHRs), it is challenging to predict the diagnostic codes which is formulated as a multi-label classification task. The large set of labels, the hierarchical dependency, and the imbalanced data make this prediction task extremely hard. Most existing work built a binary prediction for each label independently, ignoring the dependencies between labels. To address this problem, we propose a two-stage framework to improve automatic ICD coding by capturing the label correlation. Specifically, we train a label set distribution estimator to rescore the probability of each label set candidate generated by a base predictor. This paper is the first attempt at learning the label set distribution as a reranking module for medical code prediction. In the experiments, our proposed framework is able to improve upon best-performing predictors on the benchmark MIMIC datasets. The source code of this project is available at https://github.com/MiuLab/ICD-Correlation.
    A comprehensive empirical analysis on cross-domain semantic enrichment for detection of depressive language. (arXiv:2106.12797v1 [cs.CL])
    (2 min) We analyze the process of creating word embedding feature representations designed for a learning task when annotated data is scarce, for example, in depressive language detection from Tweets. We start with a rich word embedding pre-trained from a large general dataset, which is then augmented with embeddings learned from a much smaller and more specific domain dataset through a simple non-linear mapping mechanism. We also experimented with several other more sophisticated methods of such mapping including, several auto-encoder based and custom loss-function based methods that learn embedding representations through gradually learning to be close to the words of similar semantics and distant to dissimilar semantics. Our strengthened representations better capture the semantics of the depression domain, as it combines the semantics learned from the specific domain coupled with word coverage from the general language. We also present a comparative performance analyses of our word embedding representations with a simple bag-of-words model, well known sentiment and psycholinguistic lexicons, and a general pre-trained word embedding. When used as feature representations for several different machine learning methods, including deep learning models in a depressive Tweets identification task, we show that our augmented word embedding representations achieve a significantly better F1 score than the others, specially when applied to a high quality dataset. Also, we present several data ablation tests which confirm the efficacy of our augmentation techniques.
    Comparative Error Analysis in Neural and Finite-state Models for Unsupervised Character-level Transduction. (arXiv:2106.12698v1 [cs.CL])
    (2 min) Traditionally, character-level transduction problems have been solved with finite-state models designed to encode structural and linguistic knowledge of the underlying process, whereas recent approaches rely on the power and flexibility of sequence-to-sequence models with attention. Focusing on the less explored unsupervised learning scenario, we compare the two model classes side by side and find that they tend to make different types of errors even when achieving comparable performance. We analyze the distributions of different error classes using two unsupervised tasks as testbeds: converting informally romanized text into the native script of its language (for Russian, Arabic, and Kannada) and translating between a pair of closely related languages (Serbian and Bosnian). Finally, we investigate how combining finite-state and sequence-to-sequence models at decoding time affects the output quantitatively and qualitatively.
    Clinical Named Entity Recognition using Contextualized Token Representations. (arXiv:2106.12608v1 [cs.CL])
    (2 min) The clinical named entity recognition (CNER) task seeks to locate and classify clinical terminologies into predefined categories, such as diagnostic procedure, disease disorder, severity, medication, medication dosage, and sign symptom. CNER facilitates the study of side-effect on medications including identification of novel phenomena and human-focused information extraction. Existing approaches in extracting the entities of interests focus on using static word embeddings to represent each word. However, one word can have different interpretations that depend on the context of the sentences. Evidently, static word embeddings are insufficient to integrate the diverse interpretation of a word. To overcome this challenge, the technique of contextualized word embedding has been introduced to better capture the semantic meaning of each word based on its context. Two of these language models, ELMo and Flair, have been widely used in the field of Natural Language Processing to generate the contextualized word embeddings on domain-generic documents. However, these embeddings are usually too general to capture the proximity among vocabularies of specific domains. To facilitate various downstream applications using clinical case reports (CCRs), we pre-train two deep contextualized language models, Clinical Embeddings from Language Model (C-ELMo) and Clinical Contextual String Embeddings (C-Flair) using the clinical-related corpus from the PubMed Central. Explicit experiments show that our models gain dramatic improvements compared to both static word embeddings and domain-generic language models.
    AIT-QA: Question Answering Dataset over Complex Tables in the Airline Industry. (arXiv:2106.12944v1 [cs.CL])
    (2 min) Recent advances in transformers have enabled Table Question Answering (Table QA) systems to achieve high accuracy and SOTA results on open domain datasets like WikiTableQuestions and WikiSQL. Such transformers are frequently pre-trained on open-domain content such as Wikipedia, where they effectively encode questions and corresponding tables from Wikipedia as seen in Table QA dataset. However, web tables in Wikipedia are notably flat in their layout, with the first row as the sole column header. The layout lends to a relational view of tables where each row is a tuple. Whereas, tables in domain-specific business or scientific documents often have a much more complex layout, including hierarchical row and column headers, in addition to having specialized vocabulary terms from that domain. To address this problem, we introduce the domain-specific Table QA dataset AIT-QA (Airline Industry Table QA). The dataset consists of 515 questions authored by human annotators on 116 tables extracted from public U.S. SEC filings (publicly available at: https://www.sec.gov/edgar.shtml) of major airline companies for the fiscal years 2017-2019. We also provide annotations pertaining to the nature of questions, marking those that require hierarchical headers, domain-specific terminology, and paraphrased forms. Our zero-shot baseline evaluation of three transformer-based SOTA Table QA methods - TaPAS (end-to-end), TaBERT (semantic parsing-based), and RCI (row-column encoding-based) - clearly exposes the limitation of these methods in this practical setting, with the best accuracy at just 51.8\% (RCI). We also present pragmatic table preprocessing steps used to pivot and project these complex tables into a layout suitable for the SOTA Table QA models.
    Dealing with training and test segmentation mismatch: FBK@IWSLT2021. (arXiv:2106.12607v1 [cs.CL])
    (2 min) This paper describes FBK's system submission to the IWSLT 2021 Offline Speech Translation task. We participated with a direct model, which is a Transformer-based architecture trained to translate English speech audio data into German texts. The training pipeline is characterized by knowledge distillation and a two-step fine-tuning procedure. Both knowledge distillation and the first fine-tuning step are carried out on manually segmented real and synthetic data, the latter being generated with an MT system trained on the available corpora. Differently, the second fine-tuning step is carried out on a random segmentation of the MuST-C v2 En-De dataset. Its main goal is to reduce the performance drops occurring when a speech translation model trained on manually segmented data (i.e. an ideal, sentence-like segmentation) is evaluated on automatically segmented audio (i.e. actual, more realistic testing conditions). For the same purpose, a custom hybrid segmentation procedure that accounts for both audio content (pauses) and for the length of the produced segments is applied to the test data before passing them to the system. At inference time, we compared this procedure with a baseline segmentation method based on Voice Activity Detection (VAD). Our results indicate the effectiveness of the proposed hybrid approach, shown by a reduction of the gap with manual segmentation from 8.3 to 1.4 BLEU points.
    An Automated Knowledge Mining and Document Classification System with Multi-model Transfer Learning. (arXiv:2106.12744v1 [cs.CL])
    (2 min) Service manual documents are crucial to the engineering company as they provide guidelines and knowledge to service engineers. However, it has become inconvenient and inefficient for service engineers to retrieve specific knowledge from documents due to the complexity of resources. In this research, we propose an automated knowledge mining and document classification system with novel multi-model transfer learning approaches. Particularly, the classification performance of the system has been improved with three effective techniques: fine-tuning, pruning, and multi-model method. The fine-tuning technique optimizes a pre-trained BERT model by adding a feed-forward neural network layer and the pruning technique is used to retrain the BERT model with new data. The multi-model method initializes and trains multiple BERT models to overcome the randomness of data ordering during the fine-tuning process. In the first iteration of the training process, multiple BERT models are being trained simultaneously. The best model is then selected for the next phase of the training process with another two iterations and the training processes for other BERT models will be terminated. The performance of the proposed system has been evaluated by comparing with two robust baseline methods, BERT and BERT-CNN. Experimental results on a widely used Corpus of Linguistic Acceptability (CoLA) dataset have shown that the proposed techniques perform better than these baseline methods in terms of accuracy and MCC score.
    Charformer: Fast Character Transformers via Gradient-based Subword Tokenization. (arXiv:2106.12672v1 [cs.CL])
    (2 min) State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.
  • cs.CV updates on arXiv.org

    A Global Appearance and Local Coding Distortion based Fusion Framework for CNN based Filtering in Video Coding. (arXiv:2106.12746v1 [eess.IV])
    (2 min) In-loop filtering is used in video coding to process the reconstructed frame in order to remove blocking artifacts. With the development of convolutional neural networks (CNNs), CNNs have been explored for in-loop filtering considering it can be treated as an image de-noising task. However, in addition to being a distorted image, the reconstructed frame is also obtained by a fixed line of block based encoding operations in video coding. It carries coding-unit based coding distortion of some similar characteristics. Therefore, in this paper, we address the filtering problem from two aspects, global appearance restoration for disrupted texture and local coding distortion restoration caused by fixed pipeline of coding. Accordingly, a three-stream global appearance and local coding distortion based fusion network is developed with a high-level global feature stream, a high-level local feature stream and a low-level local feature stream. Ablation study is conducted to validate the necessity of different features, demonstrating that the global features and local features can complement each other in filtering and achieve better performance when combined. To the best of our knowledge, we are the first one that clearly characterizes the video filtering process from the above global appearance and local coding distortion restoration aspects with experimental verification, providing a clear pathway to developing filter techniques. Experimental results demonstrate that the proposed method significantly outperforms the existing single-frame based methods and achieves 13.5%, 11.3%, 11.7% BD-Rate saving on average for AI, LDP and RA configurations, respectively, compared with the HEVC reference software.
    Handwritten Digit Recognition using Machine and Deep Learning Algorithms. (arXiv:2106.12614v1 [cs.CV])
    (2 min) The reliance of humans over machines has never been so high such that from object classification in photographs to adding sound to silent movies everything can be performed with the help of deep learning and machine learning algorithms. Likewise, Handwritten text recognition is one of the significant areas of research and development with a streaming number of possibilities that could be attained. Handwriting recognition (HWR), also known as Handwritten Text Recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices [1]. Apparently, in this paper, we have performed handwritten digit recognition with the help of MNIST datasets using Support Vector Machines (SVM), Multi-Layer Perceptron (MLP) and Convolution Neural Network (CNN) models. Our main objective is to compare the accuracy of the models stated above along with their execution time to get the best possible model for digit recognition.
    IA-RED$^2$: Interpretability-Aware Redundancy Reduction for Vision Transformers. (arXiv:2106.12620v1 [cs.CV])
    (2 min) The self-attention-based model, transformer, is recently becoming the leading backbone in the field of computer vision. In spite of the impressive success made by transformers in a variety of vision tasks, it still suffers from heavy computation and intensive memory cost. To address this limitation, this paper presents an Interpretability-Aware REDundancy REDuction framework (IA-RED$^2$). We start by observing a large amount of redundant computation, mainly spent on uncorrelated input patches, and then introduce an interpretable module to dynamically and gracefully drop these redundant patches. This novel framework is then extended to a hierarchical structure, where uncorrelated tokens at different stages are gradually removed, resulting in a considerable shrinkage of computational cost. We include extensive experiments on both image and video tasks, where our method could deliver up to 1.4X speed-up for state-of-the-art models like DeiT and TimeSformer, by only sacrificing less than 0.7% accuracy. More importantly, contrary to other acceleration approaches, our method is inherently interpretable with substantial visual evidence, making vision transformer closer to a more human-understandable architecture while being lighter. We demonstrate that the interpretability that naturally emerged in our framework can outperform the raw attention learned by the original visual transformer, as well as those generated by off-the-shelf interpretation methods, with both qualitative and quantitative results. Project Page: this http URL
    Graceful Degradation and Related Fields. (arXiv:2106.11119v2 [cs.LG] UPDATED)
    (2 min) When machine learning models encounter data which is out of the distribution on which they were trained they have a tendency to behave poorly, most prominently over-confidence in erroneous predictions. Such behaviours will have disastrous effects on real-world machine learning systems. In this field graceful degradation refers to the optimisation of model performance as it encounters this out-of-distribution data. This work presents a definition and discussion of graceful degradation and where it can be applied in deployed visual systems. Following this a survey of relevant areas is undertaken, novelly splitting the graceful degradation problem into active and passive approaches. In passive approaches, graceful degradation is handled and achieved by the model in a self-contained manner, in active approaches the model is updated upon encountering epistemic uncertainties. This work communicates the importance of the problem and aims to prompt the development of machine learning strategies that are aware of graceful degradation.
    Towards Automatic Speech to Sign Language Generation. (arXiv:2106.12790v1 [cs.CV])
    (2 min) We aim to solve the highly challenging task of generating continuous sign language videos solely from speech segments for the first time. Recent efforts in this space have focused on generating such videos from human-annotated text transcripts without considering other modalities. However, replacing speech with sign language proves to be a practical solution while communicating with people suffering from hearing loss. Therefore, we eliminate the need of using text as input and design techniques that work for more natural, continuous, freely uttered speech covering an extensive vocabulary. Since the current datasets are inadequate for generating sign language directly from speech, we collect and release the first Indian sign language dataset comprising speech-level annotations, text transcripts, and the corresponding sign-language videos. Next, we propose a multi-tasking transformer network trained to generate signer's poses from speech segments. With speech-to-text as an auxiliary task and an additional cross-modal discriminator, our model learns to generate continuous sign pose sequences in an end-to-end manner. Extensive experiments and comparisons with other baselines demonstrate the effectiveness of our approach. We also conduct additional ablation studies to analyze the effect of different modules of our network. A demo video containing several results is attached to the supplementary material.
    When Differential Privacy Meets Interpretability: A Case Study. (arXiv:2106.13203v1 [cs.CV])
    (2 min) Given the increase in the use of personal data for training Deep Neural Networks (DNNs) in tasks such as medical imaging and diagnosis, differentially private training of DNNs is surging in importance and there is a huge body of work focusing on providing better privacy-utility trade-off. However, little attention is given to the interpretability of these models, and how the application of DP affects the quality of interpretations. We propose an extensive study into the effects of DP training on DNNs, especially on medical imaging applications, on the APTOS dataset.
    A Sparse and Locally Coherent Morphable Face Model for Dense Semantic Correspondence Across Heterogeneous 3D Faces. (arXiv:2006.03840v3 [cs.CV] UPDATED)
    (2 min) The 3D Morphable Model (3DMM) is a powerful statistical tool for representing 3D face shapes. To build a 3DMM, a training set of face scans in full point-to-point correspondence is required, and its modeling capabilities directly depend on the variability contained in the training data. Thus, to increase the descriptive power of the 3DMM, establishing a dense correspondence across heterogeneous scans with sufficient diversity in terms of identities, ethnicities, or expressions becomes essential. In this manuscript, we present a fully automatic approach that leverages a 3DMM to transfer its dense semantic annotation across raw 3D faces, establishing a dense correspondence between them. We propose a novel formulation to learn a set of sparse deformation components with local support on the face that, together with an original non-rigid deformation algorithm, allow the 3DMM to precisely fit unseen faces and transfer its semantic annotation. We extensively experimented our approach, showing it can effectively generalize to highly diverse samples and accurately establish a dense correspondence even in presence of complex facial expressions. The accuracy of the dense registration is demonstrated by building a heterogeneous, large-scale 3DMM from more than 9,000 fully registered scans obtained by joining three large datasets together.
    Differential Morph Face Detection using Discriminative Wavelet Sub-bands. (arXiv:2106.13178v1 [cs.CV])
    (2 min) Face recognition systems are extremely vulnerable to morphing attacks, in which a morphed facial reference image can be successfully verified as two or more distinct identities. In this paper, we propose a morph attack detection algorithm that leverages an undecimated 2D Discrete Wavelet Transform (DWT) for identifying morphed face images. The core of our framework is that artifacts resulting from the morphing process that are not discernible in the image domain can be more easily identified in the spatial frequency domain. A discriminative wavelet sub-band can accentuate the disparity between a real and a morphed image. To this end, multi-level DWT is applied to all images, yielding 48 mid and high-frequency sub-bands each. The entropy distributions for each sub-band are calculated separately for both bona fide and morph images. For some of the sub-bands, there is a marked difference between the entropy of the sub-band in a bona fide image and the identical sub-band's entropy in a morphed image. Consequently, we employ Kullback-Liebler Divergence (KLD) to exploit these differences and isolate the sub-bands that are the most discriminative. We measure how discriminative a sub-band is by its KLD value and the 22 sub-bands with the highest KLD values are chosen for network training. Then, we train a deep Siamese neural network using these 22 selected sub-bands for differential morph attack detection. We examine the efficacy of discriminative wavelet sub-bands for morph attack detection and show that a deep neural network trained on these sub-bands can accurately identify morph imagery.
    Boosting Semi-supervised Image Segmentation with Global and Local Mutual Information Regularization. (arXiv:2103.04813v2 [cs.CV] UPDATED)
    (2 min) The scarcity of labeled data often impedes the application of deep learning to the segmentation of medical images. Semi-supervised learning seeks to overcome this limitation by exploiting unlabeled examples in the learning process. In this paper, we present a novel semi-supervised segmentation method that leverages mutual information (MI) on categorical distributions to achieve both global representation invariance and local smoothness. In this method, we maximize the MI for intermediate feature embeddings that are taken from both the encoder and decoder of a segmentation network. We first propose a global MI loss constraining the encoder to learn an image representation that is invariant to geometric transformations. Instead of resorting to computationally-expensive techniques for estimating the MI on continuous feature embeddings, we use projection heads to map them to a discrete cluster assignment where MI can be computed efficiently. Our method also includes a local MI loss to promote spatial consistency in the feature maps of the decoder and provide a smoother segmentation. Since mutual information does not require a strict ordering of clusters in two different assignments, we incorporate a final consistency regularization loss on the output which helps align the cluster labels throughout the network. We evaluate the method on four challenging publicly-available datasets for medical image segmentation. Experimental results show our method to outperform recently-proposed approaches for semi-supervised segmentation and provide an accuracy near to full supervision while training with very few annotated images.
    Q-space Conditioned Translation Networks for Directional Synthesis of Diffusion Weighted Images from Multi-modal Structural MRI. (arXiv:2106.13188v1 [eess.IV])
    (2 min) Current deep learning approaches for diffusion MRI modeling circumvent the need for densely-sampled diffusion-weighted images (DWIs) by directly predicting microstructural indices from sparsely-sampled DWIs. However, they implicitly make unrealistic assumptions of static $q$-space sampling during training and reconstruction. Further, such approaches can restrict downstream usage of variably sampled DWIs for usages including the estimation of microstructural indices or tractography. We propose a generative adversarial translation framework for high-quality DWI synthesis with arbitrary $q$-space sampling given commonly acquired structural images (e.g., B0, T1, T2). Our translation network linearly modulates its internal representations conditioned on continuous $q$-space information, thus removing the need for fixed sampling schemes. Moreover, this approach enables downstream estimation of high-quality microstructural maps from arbitrarily subsampled DWIs, which may be particularly important in cases with sparsely sampled DWIs. Across several recent methodologies, the proposed approach yields improved DWI synthesis accuracy and fidelity with enhanced downstream utility as quantified by the accuracy of scalar microstructure indices estimated from the synthesized images. Code is available at https://github.com/mengweiren/q-space-conditioned-dwi-synthesis.
    Unsupervised Learning of Depth and Depth-of-Field Effect from Natural Images with Aperture Rendering Generative Adversarial Networks. (arXiv:2106.13041v1 [cs.CV])
    (2 min) Understanding the 3D world from 2D projected natural images is a fundamental challenge in computer vision and graphics. Recently, an unsupervised learning approach has garnered considerable attention owing to its advantages in data collection. However, to mitigate training limitations, typical methods need to impose assumptions for viewpoint distribution (e.g., a dataset containing various viewpoint images) or object shape (e.g., symmetric objects). These assumptions often restrict applications; for instance, the application to non-rigid objects or images captured from similar viewpoints (e.g., flower or bird images) remains a challenge. To complement these approaches, we propose aperture rendering generative adversarial networks (AR-GANs), which equip aperture rendering on top of GANs, and adopt focus cues to learn the depth and depth-of-field (DoF) effect of unlabeled natural images. To address the ambiguities triggered by unsupervised setting (i.e., ambiguities between smooth texture and out-of-focus blurs, and between foreground and background blurs), we develop DoF mixture learning, which enables the generator to learn real image distribution while generating diverse DoF images. In addition, we devise a center focus prior to guiding the learning direction. In the experiments, we demonstrate the effectiveness of AR-GANs in various datasets, such as flower, bird, and face images, demonstrate their portability by incorporating them into other 3D representation learning GANs, and validate their applicability in shallow DoF rendering.
    Handling Data Heterogeneity with Generative Replay in Collaborative Learning for Medical Imaging. (arXiv:2106.13208v1 [cs.CV])
    (2 min) Collaborative learning, which enables collaborative and decentralized training of deep neural networks at multiple institutions in a privacy-preserving manner, is rapidly emerging as a valuable technique in healthcare applications. However, its distributed nature often leads to significant heterogeneity in data distributions across institutions. Existing collaborative learning approaches generally do not account for the presence of heterogeneity in data among institutions, or only mildly skewed label distributions are studied. In this paper, we present a novel generative replay strategy to address the challenge of data heterogeneity in collaborative learning methods. Instead of directly training a model for task performance, we leverage recent image synthesis techniques to develop a novel dual model architecture: a primary model learns the desired task, and an auxiliary "generative replay model" either synthesizes images that closely resemble the input images or helps extract latent variables. The generative replay strategy is flexible to use, can either be incorporated into existing collaborative learning methods to improve their capability of handling data heterogeneity across institutions, or be used as a novel and individual collaborative learning framework (termed FedReplay) to reduce communication cost. Experimental results demonstrate the capability of the proposed method in handling heterogeneous data across institutions. On highly heterogeneous data partitions, our model achieves ~4.88% improvement in the prediction accuracy on a diabetic retinopathy classification dataset, and ~49.8% reduction of mean absolution value on a Bone Age prediction dataset, respectively, compared to the state-of-the art collaborative learning methods.
    PocketNet: A Smaller Neural Network for Medical Image Analysis. (arXiv:2104.10745v2 [eess.IV] UPDATED)
    (2 min) Medical imaging deep learning models are often large and complex, requiring specialized hardware to train and evaluate these models. To address such issues, we propose the PocketNet paradigm to reduce the size of deep learning models by throttling the growth of the number of channels in convolutional neural networks. We demonstrate that, for a range of segmentation and classification tasks, PocketNet architectures produce results comparable to that of conventional neural networks while reducing the number of parameters by multiple orders of magnitude, using up to 90% less GPU memory, and speeding up training times by up to 40%, thereby allowing such models to be trained and deployed in resource-constrained settings.
    Improving Network Slimming with Nonconvex Regularization. (arXiv:2010.01242v3 [cs.CV] UPDATED)
    (2 min) Convolutional neural networks (CNNs) have developed to become powerful models for various computer vision tasks ranging from object detection to semantic segmentation. However, most of state-of-the-art CNNs can not be deployed directly on edge devices such as smartphones and drones, which need low latency under limited power and memory bandwidth. One popular, straightforward approach to compressing CNNs is network slimming, which imposes $\ell_1$ regularization on the channel-associated scaling factors via the batch normalization layers during training. Network slimming thereby identifies insignificant channels that can be pruned for inference. In this paper, we propose replacing the $\ell_1$ penalty with an alternative sparse, nonconvex penalty in order to yield a more compressed and/or accurate CNN architecture. We investigate $\ell_p (0 < p < 1)$, transformed $\ell_1$ (T$\ell_1$), minimax concave penalty (MCP), and smoothly clipped absolute deviation (SCAD) due to their recent successes and popularity in solving sparse optimization problems, such as compressed sensing and variable selection. We demonstrate the effectiveness of network slimming with nonconvex penalties on VGGNet, Densenet, and Resnet on standard image classification datasets. Based on the numerical experiments, T$\ell_1$ preserves model accuracy against channel pruning, $\ell_{1/2, 3/4}$ yield better compressed models with similar accuracies after retraining as $\ell_1$, and MCP and SCAD provide more accurate models after retraining with similar compression as $\ell_1$. Network slimming with T$\ell_1$ regularization also outperforms the latest Bayesian modification of network slimming in compressing a CNN architecture in terms of memory storage while preserving its model accuracy after channel pruning.
    Long-term Cross Adversarial Training: A Robust Meta-learning Method for Few-shot Classification Tasks. (arXiv:2106.12900v1 [cs.LG])
    (2 min) Meta-learning model can quickly adapt to new tasks using few-shot labeled data. However, despite achieving good generalization on few-shot classification tasks, it is still challenging to improve the adversarial robustness of the meta-learning model in few-shot learning. Although adversarial training (AT) methods such as Adversarial Query (AQ) can improve the adversarially robust performance of meta-learning models, AT is still computationally expensive training. On the other hand, meta-learning models trained with AT will drop significant accuracy on the original clean images. This paper proposed a meta-learning method on the adversarially robust neural network called Long-term Cross Adversarial Training (LCAT). LCAT will update meta-learning model parameters cross along the natural and adversarial sample distribution direction with long-term to improve both adversarial and clean few-shot classification accuracy. Due to cross-adversarial training, LCAT only needs half of the adversarial training epoch than AQ, resulting in a low adversarial training computation. Experiment results show that LCAT achieves superior performance both on the clean and adversarial few-shot classification accuracy than SOTA adversarial training methods for meta-learning models.
    Towards Fully Interpretable Deep Neural Networks: Are We There Yet?. (arXiv:2106.13164v1 [cs.LG])
    (2 min) Despite the remarkable performance, Deep Neural Networks (DNNs) behave as black-boxes hindering user trust in Artificial Intelligence (AI) systems. Research on opening black-box DNN can be broadly categorized into post-hoc methods and inherently interpretable DNNs. While many surveys have been conducted on post-hoc interpretation methods, little effort is devoted to inherently interpretable DNNs. This paper provides a review of existing methods to develop DNNs with intrinsic interpretability, with a focus on Convolutional Neural Networks (CNNs). The aim is to understand the current progress towards fully interpretable DNNs that can cater to different interpretation requirements. Finally, we identify gaps in current work and suggest potential research directions.
    ChaLearn Looking at People: Inpainting and Denoising challenges. (arXiv:2106.13071v1 [cs.CV])
    (2 min) Dealing with incomplete information is a well studied problem in the context of machine learning and computational intelligence. However, in the context of computer vision, the problem has only been studied in specific scenarios (e.g., certain types of occlusions in specific types of images), although it is common to have incomplete information in visual data. This chapter describes the design of an academic competition focusing on inpainting of images and video sequences that was part of the competition program of WCCI2018 and had a satellite event collocated with ECCV2018. The ChaLearn Looking at People Inpainting Challenge aimed at advancing the state of the art on visual inpainting by promoting the development of methods for recovering missing and occluded information from images and video. Three tracks were proposed in which visual inpainting might be helpful but still challenging: human body pose estimation, text overlays removal and fingerprint denoising. This chapter describes the design of the challenge, which includes the release of three novel datasets, and the description of evaluation metrics, baselines and evaluation protocol. The results of the challenge are analyzed and discussed in detail and conclusions derived from this event are outlined.
    Exploring Stronger Feature for Temporal Action Localization. (arXiv:2106.13014v1 [cs.CV])
    (2 min) Temporal action localization aims to localize starting and ending time with action category. Limited by GPU memory, mainstream methods pre-extract features for each video. Therefore, feature quality determines the upper bound of detection performance. In this technical report, we explored classic convolution-based backbones and the recent surge of transformer-based backbones. We found that the transformer-based methods can achieve better classification performance than convolution-based, but they cannot generate accuracy action proposals. In addition, extracting features with larger frame resolution to reduce the loss of spatial information can also effectively improve the performance of temporal action localization. Finally, we achieve 42.42% in terms of mAP on validation set with a single SlowFast feature by a simple combination: BMN+TCANet, which is 1.87% higher than the result of 2020's multi-model ensemble. Finally, we achieve Rank 1st on the CVPR2021 HACS supervised Temporal Action Localization Challenge.
    MatchVIE: Exploiting Match Relevancy between Entities for Visual Information Extraction. (arXiv:2106.12940v1 [cs.CV])
    (2 min) Visual Information Extraction (VIE) task aims to extract key information from multifarious document images (e.g., invoices and purchase receipts). Most previous methods treat the VIE task simply as a sequence labeling problem or classification problem, which requires models to carefully identify each kind of semantics by introducing multimodal features, such as font, color, layout. But simply introducing multimodal features couldn't work well when faced with numeric semantic categories or some ambiguous texts. To address this issue, in this paper we propose a novel key-value matching model based on a graph neural network for VIE (MatchVIE). Through key-value matching based on relevancy evaluation, the proposed MatchVIE can bypass the recognitions to various semantics, and simply focuses on the strong relevancy between entities. Besides, we introduce a simple but effective operation, Num2Vec, to tackle the instability of encoded values, which helps model converge more smoothly. Comprehensive experiments demonstrate that the proposed MatchVIE can significantly outperform previous methods. Notably, to the best of our knowledge, MatchVIE may be the first attempt to tackle the VIE task by modeling the relevancy between keys and values and it is a good complement to the existing methods.
    Symmetric Wasserstein Autoencoders. (arXiv:2106.13024v1 [cs.LG])
    (2 min) Leveraging the framework of Optimal Transport, we introduce a new family of generative autoencoders with a learnable prior, called Symmetric Wasserstein Autoencoders (SWAEs). We propose to symmetrically match the joint distributions of the observed data and the latent representation induced by the encoder and the decoder. The resulting algorithm jointly optimizes the modelling losses in both the data and the latent spaces with the loss in the data space leading to the denoising effect. With the symmetric treatment of the data and the latent representation, the algorithm implicitly preserves the local structure of the data in the latent space. To further improve the quality of the latent representation, we incorporate a reconstruction loss into the objective, which significantly benefits both the generation and reconstruction. We empirically show the superior performance of SWAEs over the state-of-the-art generative autoencoders in terms of classification, reconstruction, and generation.
    Driver-centric Risk Object Identification. (arXiv:2106.13201v1 [cs.CV])
    (2 min) A massive number of traffic fatalities are due to driver errors. To reduce fatalities, developing intelligent driving systems assisting drivers to identify potential risks is in urgent need. Risky situations are generally defined based on collision prediction in existing research. However, collisions are only one type of risk in traffic scenarios. We believe a more generic definition is required. In this work, we propose a novel driver-centric definition of risk, i.e., risky objects influence driver behavior. Based on this definition, a new task called risk object identification is introduced. We formulate the task as a cause-effect problem and present a novel two-stage risk object identification framework, taking inspiration from models of situation awareness and causal inference. A driver-centric Risk Object Identification (ROI) dataset is curated to evaluate the proposed system. We demonstrate state-of-the-art risk object identification performance compared with strong baselines on the ROI dataset. In addition, we conduct extensive ablative studies to justify our design choices.
    VOLO: Vision Outlooker for Visual Recognition. (arXiv:2106.13112v1 [cs.CV])
    (2 min) Visual recognition has been dominated by convolutionalneural networks (CNNs) for years. Though recently the pre-vailing vision transformers (ViTs) have shown great poten-tial of self-attention based models in ImageNet classifica-tion, their performance is still inferior to latest SOTA CNNsif no extra data are provided. In this work, we aim to closethe performance gap and demonstrate that attention-basedmodels are indeed able to outperform CNNs. We found thatthe main factor limiting the performance of ViTs for Ima-geNet classification is their low efficacy in encoding fine-level features into the token representations. To resolvethis, we introduce a noveloutlook attentionand present asimple and general architecture, termed Vision Outlooker(VOLO). Unlike self-attention that focuses on global depen-dency modeling at a coarse level, the outlook attention aimsto efficiently encode finer-level features and contexts intotokens, which are shown to be critical for recognition per-formance but largely ignored by the self-attention. Experi-ments show that our VOLO achieves 87.1% top-1 accuracyon ImageNet-1K classification, being the first model exceed-ing 87% accuracy on this competitive benchmark, withoutusing any extra training data. In addition, the pre-trainedVOLO transfers well to downstream tasks, such as seman-tic segmentation. We achieve 84.3% mIoU score on thecityscapes validation set and 54.3% on the ADE20K valida-tion set. Code is available at https://github.com/sail-sg/volo.
    Attention Toward Neighbors: A Context Aware Framework for High Resolution Image Segmentation. (arXiv:2106.12902v1 [cs.CV])
    (2 min) High-resolution image segmentation remains challenging and error-prone due to the enormous size of intermediate feature maps. Conventional methods avoid this problem by using patch based approaches where each patch is segmented independently. However, independent patch segmentation induces errors, particularly at the patch boundary due to the lack of contextual information in very high-resolution images where the patch size is much smaller compared to the full image. To overcome these limitations, in this paper, we propose a novel framework to segment a particular patch by incorporating contextual information from its neighboring patches. This allows the segmentation network to see the target patch with a wider field of view without the need of larger feature maps. Comparative analysis from a number of experiments shows that our proposed framework is able to segment high resolution images with significantly improved mean Intersection over Union and overall accuracy.
    AudioCLIP: Extending CLIP to Image, Text and Audio. (arXiv:2106.13043v1 [cs.SD])
    (2 min) In the past, the rapidly evolving field of sound classification greatly benefited from the application of methods from other domains. Today, we observe the trend to fuse domain-specific tasks and approaches together, which provides the community with new outstanding models. In this work, we present an extension of the CLIP model that handles audio in addition to text and images. Our proposed model incorporates the ESResNeXt audio-model into the CLIP framework using the AudioSet dataset. Such a combination enables the proposed model to perform bimodal and unimodal classification and querying, while keeping CLIP's ability to generalize to unseen datasets in a zero-shot inference fashion. AudioCLIP achieves new state-of-the-art results in the Environmental Sound Classification (ESC) task, out-performing other approaches by reaching accuracies of 90.07% on the UrbanSound8K and 97.15% on the ESC-50 datasets. Further it sets new baselines in the zero-shot ESC-task on the same datasets 68.78% and 69.40%, respectively). Finally, we also assess the cross-modal querying performance of the proposed model as well as the influence of full and partial training on the results. For the sake of reproducibility, our code is published.
    A Systematic Collection of Medical Image Datasets for Deep Learning. (arXiv:2106.12864v1 [eess.IV])
    (2 min) The astounding success made by artificial intelligence (AI) in healthcare and other fields proves that AI can achieve human-like performance. However, success always comes with challenges. Deep learning algorithms are data-dependent and require large datasets for training. The lack of data in the medical imaging field creates a bottleneck for the application of deep learning to medical image analysis. Medical image acquisition, annotation, and analysis are costly, and their usage is constrained by ethical restrictions. They also require many resources, such as human expertise and funding. That makes it difficult for non-medical researchers to have access to useful and large medical data. Thus, as comprehensive as possible, this paper provides a collection of medical image datasets with their associated challenges for deep learning research. We have collected information of around three hundred datasets and challenges mainly reported between 2013 and 2020 and categorized them into four categories: head & neck, chest & abdomen, pathology & blood, and ``others''. Our paper has three purposes: 1) to provide a most up to date and complete list that can be used as a universal reference to easily find the datasets for clinical image analysis, 2) to guide researchers on the methodology to test and evaluate their methods' performance and robustness on relevant datasets, 3) to provide a ``route'' to relevant algorithms for the relevant medical topics, and challenge leaderboards.
    GaussiGAN: Controllable Image Synthesis with 3D Gaussians from Unposed Silhouettes. (arXiv:2106.13215v1 [cs.CV])
    (2 min) We present an algorithm that learns a coarse 3D representation of objects from unposed multi-view 2D mask supervision, then uses it to generate detailed mask and image texture. In contrast to existing voxel-based methods for unposed object reconstruction, our approach learns to represent the generated shape and pose with a set of self-supervised canonical 3D anisotropic Gaussians via a perspective camera, and a set of per-image transforms. We show that this approach can robustly estimate a 3D space for the camera and object, while recent baselines sometimes struggle to reconstruct coherent 3D spaces in this setting. We show results on synthetic datasets with realistic lighting, and demonstrate object insertion with interactive posing. With our work, we help move towards structured representations that handle more real-world variation in learning-based object reconstruction.
    CAGAN: Text-To-Image Generation with Combined Attention GANs. (arXiv:2104.12663v2 [cs.CV] UPDATED)
    (2 min) Generating images according to natural language descriptions is a challenging task. Prior research has mainly focused to enhance the quality of generation by investigating the use of spatial attention and/or textual attention thereby neglecting the relationship between channels. In this work, we propose the Combined Attention Generative Adversarial Network (CAGAN) to generate photo-realistic images according to textual descriptions. The proposed CAGAN utilises two attention models: word attention to draw different sub-regions conditioned on related words; and squeeze-and-excitation attention to capture non-linear interaction among channels. With spectral normalisation to stabilise training, our proposed CAGAN improves the state of the art on the IS and FID on the CUB dataset and the FID on the more challenging COCO dataset. Furthermore, we demonstrate that judging a model by a single evaluation metric can be misleading by developing an additional model adding local self-attention which scores a higher IS, outperforming the state of the art on the CUB dataset, but generates unrealistic images through feature repetition.
    All You Need is a Second Look: Towards Arbitrary-Shaped Text Detection. (arXiv:2106.12720v1 [cs.CV])
    (2 min) Arbitrary-shaped text detection is a challenging task since curved texts in the wild are of the complex geometric layouts. Existing mainstream methods follow the instance segmentation pipeline to obtain the text regions. However, arbitraryshaped texts are difficult to be depicted through one single segmentation network because of the varying scales. In this paper, we propose a two-stage segmentation-based detector, termed as NASK (Need A Second looK), for arbitrary-shaped text detection. Compared to the traditional single-stage segmentation network, our NASK conducts the detection in a coarse-to-fine manner with the first stage segmentation spotting the rectangle text proposals and the second one retrieving compact representations. Specifically, NASK is composed of a Text Instance Segmentation (TIS) network (1st stage), a Geometry-aware Text RoI Alignment (GeoAlign) module, and a Fiducial pOint eXpression (FOX) module (2nd stage). Firstly, TIS extracts the augmented features with a novel Group Spatial and Channel Attention (GSCA) module and conducts instance segmentation to obtain rectangle proposals. Then, GeoAlign converts these rectangles into the fixed size and encodes RoI-wise feature representation. Finally, FOX disintegrates the text instance into serval pivotal geometrical attributes to refine the detection results. Extensive experimental results on three public benchmarks including Total-Text, SCUTCTW1500, and ICDAR 2015 verify that our NASK outperforms recent state-of-the-art methods.
    ST-HOI: A Spatial-Temporal Baseline for Human-Object Interaction Detection in Videos. (arXiv:2105.11731v2 [cs.CV] UPDATED)
    (2 min) Detecting human-object interactions (HOI) is an important step toward a comprehensive visual understanding of machines. While detecting non-temporal HOIs (e.g., sitting on a chair) from static images is feasible, it is unlikely even for humans to guess temporal-related HOIs (e.g., opening/closing a door) from a single video frame, where the neighboring frames play an essential role. However, conventional HOI methods operating on only static images have been used to predict temporal-related interactions, which is essentially guessing without temporal contexts and may lead to sub-optimal performance. In this paper, we bridge this gap by detecting video-based HOIs with explicit temporal information. We first show that a naive temporal-aware variant of a common action detection baseline does not work on video-based HOIs due to a feature-inconsistency issue. We then propose a simple yet effective architecture named Spatial-Temporal HOI Detection (ST-HOI) utilizing temporal information such as human and object trajectories, correctly-localized visual features, and spatial-temporal masking pose features. We construct a new video HOI benchmark dubbed VidHOI where our proposed approach serves as a solid baseline.
    The effectiveness of feature attribution methods and its correlation with automatic evaluation scores. (arXiv:2105.14944v2 [cs.CV] UPDATED)
    (2 min) Explaining the decisions of an Artificial Intelligence (AI) model is increasingly critical in many real-world, high-stake applications. Hundreds of papers have either proposed new feature attribution methods, discussed or harnessed these tools in their work. However, despite humans being the target end-users, most attribution methods were only evaluated on proxy automatic-evaluation metrics. In this paper, we conduct the first, large-scale user study on 320 lay and 11 expert users to shed light on the effectiveness of state-of-the-art attribution methods in assisting humans in ImageNet classification, Stanford Dogs fine-grained classification, and these two tasks but when the input image contains adversarial perturbations. We found that, in overall, feature attribution is surprisingly not more effective than showing humans nearest training-set examples. On a hard task of fine-grained dog categorization, presenting attribution maps to humans does not help, but instead hurts the performance of human-AI teams compared to AI alone. Importantly, we found automatic attribution-map evaluation measures to correlate poorly with the actual human-AI team performance. Our findings encourage the community to rigorously test their methods on the downstream human-in-the-loop applications and to rethink the existing evaluation metrics.
    Fast Monte Carlo Rendering via Multi-Resolution Sampling. (arXiv:2106.12802v1 [cs.CV])
    (2 min) Monte Carlo rendering algorithms are widely used to produce photorealistic computer graphics images. However, these algorithms need to sample a substantial amount of rays per pixel to enable proper global illumination and thus require an immense amount of computation. In this paper, we present a hybrid rendering method to speed up Monte Carlo rendering algorithms. Our method first generates two versions of a rendering: one at a low resolution with a high sample rate (LRHS) and the other at a high resolution with a low sample rate (HRLS). We then develop a deep convolutional neural network to fuse these two renderings into a high-quality image as if it were rendered at a high resolution with a high sample rate. Specifically, we formulate this fusion task as a super resolution problem that generates a high resolution rendering from a low resolution input (LRHS), assisted with the HRLS rendering. The HRLS rendering provides critical high frequency details which are difficult to recover from the LRHS for any super resolution methods. Our experiments show that our hybrid rendering algorithm is significantly faster than the state-of-the-art Monte Carlo denoising methods while rendering high-quality images when tested on both our own BCR dataset and the Gharbi dataset. \url{https://github.com/hqqxyy/msspl}
    Learning by Planning: Language-Guided Global Image Editing. (arXiv:2106.13156v1 [cs.CV])
    (2 min) Recently, language-guided global image editing draws increasing attention with growing application potentials. However, previous GAN-based methods are not only confined to domain-specific, low-resolution data but also lacking in interpretability. To overcome the collective difficulties, we develop a text-to-operation model to map the vague editing language request into a series of editing operations, e.g., change contrast, brightness, and saturation. Each operation is interpretable and differentiable. Furthermore, the only supervision in the task is the target image, which is insufficient for a stable training of sequential decisions. Hence, we propose a novel operation planning algorithm to generate possible editing sequences from the target image as pseudo ground truth. Comparison experiments on the newly collected MA5k-Req dataset and GIER dataset show the advantages of our methods. Code is available at https://jshi31.github.io/T2ONet.
    EfficientNetV2: Smaller Models and Faster Training. (arXiv:2104.00298v3 [cs.CV] UPDATED)
    (2 min) This paper introduces EfficientNetV2, a new family of convolutional networks that have faster training speed and better parameter efficiency than previous models. To develop this family of models, we use a combination of training-aware neural architecture search and scaling, to jointly optimize training speed and parameter efficiency. The models were searched from the search space enriched with new ops such as Fused-MBConv. Our experiments show that EfficientNetV2 models train much faster than state-of-the-art models while being up to 6.8x smaller. Our training can be further sped up by progressively increasing the image size during training, but it often causes a drop in accuracy. To compensate for this accuracy drop, we propose to adaptively adjust regularization (e.g., dropout and data augmentation) as well, such that we can achieve both fast training and good accuracy. With progressive learning, our EfficientNetV2 significantly outperforms previous models on ImageNet and CIFAR/Cars/Flowers datasets. By pretraining on the same ImageNet21k, our EfficientNetV2 achieves 87.3% top-1 accuracy on ImageNet ILSVRC2012, outperforming the recent ViT by 2.0% accuracy while training 5x-11x faster using the same computing resources. Code will be available at https://github.com/google/automl/tree/master/efficientnetv2.
    Three-stream network for enriched Action Recognition. (arXiv:2104.13051v2 [cs.CV] UPDATED)
    (2 min) Understanding accurate information on human behaviours is one of the most important tasks in machine intelligence. Human Activity Recognition that aims to understand human activities from a video is a challenging task due to various problems including background, camera motion and dataset variations. This paper proposes two CNN based architectures with three streams which allow the model to exploit the dataset under different settings. The three pathways are differentiated in frame rates. The single pathway, operates at a single frame rate captures spatial information, the slow pathway operates at low frame rates captures the spatial information and the fast pathway operates at high frame rates that capture fine temporal information. Post CNN encoders, we add bidirectional LSTM and attention heads respectively to capture the context and temporal features. By experimenting with various algorithms on UCF-101, Kinetics-600 and AVA dataset, we observe that the proposed models achieve state-of-art performance for human action recognition task.
    FDRN: A Fast Deformable Registration Network for Medical Images. (arXiv:2011.02307v4 [cs.CV] UPDATED)
    (2 min) Deformable image registration is a fundamental task in medical imaging. Due to the large computational complexity of deformable registration of volumetric images, conventional iterative methods usually face the tradeoff between the registration accuracy and the computation time in practice. In order to boost the registration performance in both accuracy and runtime, we propose a fast convolutional neural network. Specially, to efficiently utilize the memory resources and enlarge the model capacity, we adopt additive forwarding instead of channel concatenation and deepen the network in each encoder and decoder stage. To facilitate the learning efficiency, we leverage skip connection within the encoder and decoder stages to enable residual learning and employ an auxiliary loss at the bottom layer with lowest resolution to involve deep supervision. Particularly, the low-resolution auxiliary loss is weighted by an exponentially decayed parameter during the training phase. In conjunction with the main loss in high-resolution grid, a coarse-to-fine learning strategy is achieved. Last but not least, we introduce an auxiliary loss based on the segmentation prior to improve the registration performance in Dice score. Comparing to the auxiliary loss using average Dice score, the proposed multi-label segmentation loss does not induce additional memory cost in the training phase and can be employed on images with arbitrary amount of categories. In the experiments, we show FDRN outperforms the existing state-of-the-art registration methods for brain MR images by resorting to the compact network structure and efficient learning. Besides, FDRN is a generalized framework for image registration which is not confined to a particular type of medical images or anatomy.
    Embracing Uncertainty: Decoupling and De-bias for Robust Temporal Grounding. (arXiv:2103.16848v2 [cs.CV] UPDATED)
    (2 min) Temporal grounding aims to localize temporal boundaries within untrimmed videos by language queries, but it faces the challenge of two types of inevitable human uncertainties: query uncertainty and label uncertainty. The two uncertainties stem from human subjectivity, leading to limited generalization ability of temporal grounding. In this work, we propose a novel DeNet (Decoupling and De-bias) to embrace human uncertainty: Decoupling - We explicitly disentangle each query into a relation feature and a modified feature. The relation feature, which is mainly based on skeleton-like words (including nouns and verbs), aims to extract basic and consistent information in the presence of query uncertainty. Meanwhile, modified feature assigned with style-like words (including adjectives, adverbs, etc) represents the subjective information, and thus brings personalized predictions; De-bias - We propose a de-bias mechanism to generate diverse predictions, aim to alleviate the bias caused by single-style annotations in the presence of label uncertainty. Moreover, we put forward new multi-label metrics to diversify the performance evaluation. Extensive experiments show that our approach is more effective and robust than state-of-the-arts on Charades-STA and ActivityNet Captions datasets.
    Where can I drive? A System Approach: Deep Ego Corridor Estimation for Robust Automated Driving. (arXiv:2004.07639v2 [cs.CV] UPDATED)
    (2 min) Lane detection is an essential part of the perception sub-architecture of any automated driving (AD) or advanced driver assistance system (ADAS). When focusing on low-cost, large scale products for automated driving, model-driven approaches for the detection of lane markings have proven good performance. More recently, data-driven approaches have been proposed that target the drivable area / freespace mainly in inner-city applications. Focus of these approaches is less on lane-based driving due to the fact that the lane concept does not fully apply in unstructured, residential inner-city environments. So-far the concept of drivable area is seldom used for highway and inter-urban applications due to the specific requirements of these scenarios that require clear lane associations of all traffic participants. We believe that lane-based, mapless driving in inter-urban and highway scenarios is still not fully handled with sufficient robustness and availability. Especially for challenging weather situations such as heavy rain, fog, low-standing sun, darkness or reflections in puddles, the mapless detection of lane markings decreases significantly or completely fails. We see potential in applying specifically designed data-driven freespace approaches in more lane-based driving applications for highways and inter-urban use. Therefore, we propose to classify specifically a drivable corridor of the ego lane on pixel level with a deep learning approach. Our approach is kept computationally efficient with only 0.66 million parameters allowing its application in large scale products. Thus, we were able to easily integrate into an online AD system of a test vehicle. We demonstrate the performance of our approach under challenging conditions qualitatively and quantitatively in comparison to a state-of-the-art model-driven approach.
    ShapeFlow: Learnable Deformations Among 3D Shapes. (arXiv:2006.07982v2 [cs.CV] UPDATED)
    (2 min) We present ShapeFlow, a flow-based model for learning a deformation space for entire classes of 3D shapes with large intra-class variations. ShapeFlow allows learning a multi-template deformation space that is agnostic to shape topology, yet preserves fine geometric details. Different from a generative space where a latent vector is directly decoded into a shape, a deformation space decodes a vector into a continuous flow that can advect a source shape towards a target. Such a space naturally allows the disentanglement of geometric style (coming from the source) and structural pose (conforming to the target). We parametrize the deformation between geometries as a learned continuous flow field via a neural network and show that such deformations can be guaranteed to have desirable properties, such as be bijectivity, freedom from self-intersections, or volume preservation. We illustrate the effectiveness of this learned deformation space for various downstream applications, including shape generation via deformation, geometric style transfer, unsupervised learning of a consistent parameterization for entire classes of shapes, and shape interpolation.
    Self-Supervised Monocular Depth Estimation of Untextured Indoor Rotated Scenes. (arXiv:2106.12958v1 [cs.CV])
    (2 min) Self-supervised deep learning methods have leveraged stereo images for training monocular depth estimation. Although these methods show strong results on outdoor datasets such as KITTI, they do not match performance of supervised methods on indoor environments with camera rotation. Indoor, rotated scenes are common for less constrained applications and pose problems for two reasons: abundance of low texture regions and increased complexity of depth cues for images under rotation. In an effort to extend self-supervised learning to more generalised environments we propose two additions. First, we propose a novel Filled Disparity Loss term that corrects for ambiguity of image reconstruction error loss in textureless regions. Specifically, we interpolate disparity in untextured regions, using the estimated disparity from surrounding textured areas, and use L1 loss to correct the original estimation. Our experiments show that depth estimation is substantially improved on low-texture scenes, without any loss on textured scenes, when compared to Monodepth by Godard et al. Secondly, we show that training with an application's representative rotations, in both pitch and roll, is sufficient to significantly improve performance over the entire range of expected rotation. We demonstrate that depth estimation is successfully generalised as performance is not lost when evaluated on test sets with no camera rotation. Together these developments enable a broader use of self-supervised learning of monocular depth estimation for complex environments.
    FedFace: Collaborative Learning of Face Recognition Model. (arXiv:2104.03008v2 [cs.CV] UPDATED)
    (2 min) DNN-based face recognition models require large centrally aggregated face datasets for training. However, due to the growing data privacy concerns and legal restrictions, accessing and sharing face datasets has become exceedingly difficult. We propose FedFace, a federated learning (FL) framework for collaborative learning of face recognition models in a privacy-aware manner. FedFace utilizes the face images available on multiple clients to learn an accurate and generalizable face recognition model where the face images stored at each client are neither shared with other clients nor the central host and each client is a mobile device containing face images pertaining to only the owner of the device (one identity per client). Our experiments show the effectiveness of FedFace in enhancing the verification performance of pre-trained face recognition system on standard face verification benchmarks namely LFW, IJB-A, and IJB-C.
    Topological Semantic Mapping by Consolidation of Deep Visual Features. (arXiv:2106.12709v1 [cs.CV])
    (2 min) Many works in the recent literature introduce semantic mapping methods that use CNNs (Convolutional Neural Networks) to recognize semantic properties in images. The types of properties (eg.: room size, place category, and objects) and their classes (eg.: kitchen and bathroom, for place category) are usually predefined and restricted to a specific task. Thus, all the visual data acquired and processed during the construction of the maps are lost and only the recognized semantic properties remain on the maps. In contrast, this work introduces a topological semantic mapping method that uses deep visual features extracted by a CNN, the GoogLeNet, from 2D images captured in multiple views of the environment as the robot operates, to create consolidated representations of visual features acquired in the regions covered by each topological node. These consolidated representations allow flexible recognition of semantic properties of the regions and use in a range of visual tasks. The experiments, performed using a real-world indoor dataset, showed that the method is able to consolidate the visual features of regions and use them to recognize objects and place categories as semantic properties, and to indicate the topological location of images, with very promising results. The objects are classified using the classification layer of GoogLeNet, without retraining, and the place categories are recognized using a shallow Multilayer Perceptron.
    Relationship between pulmonary nodule malignancy and surrounding pleurae, airways and vessels: a quantitative study using the public LIDC-IDRI dataset. (arXiv:2106.12991v1 [cs.CV])
    (3 min) To investigate whether the pleurae, airways and vessels surrounding a nodule on non-contrast computed tomography (CT) can discriminate benign and malignant pulmonary nodules. The LIDC-IDRI dataset, one of the largest publicly available CT database, was exploited for study. A total of 1556 nodules from 694 patients were involved in statistical analysis, where nodules with average scorings 3 were respectively denoted as benign and malignant. Besides, 339 nodules from 113 patients with diagnosis ground-truth were independently evaluated. Computer algorithms were developed to segment pulmonary structures and quantify the distances to pleural surface, airways and vessels, as well as the counting number and normalized volume of airways and vessels near a nodule. Odds ratio (OR) and Chi-square (\chi^2) testing were performed to demonstrate the correlation between features of surrounding structures and nodule malignancy. A non-parametric receiver operating characteristic (ROC) analysis was conducted in logistic regression to evaluate discrimination ability of each structure. For benign and malignant groups, the average distances from nodules to pleural surface, airways and vessels are respectively (6.56, 5.19), (37.08, 26.43) and (1.42, 1.07) mm. The correlation between nodules and the counting number of airways and vessels that contact or project towards nodules are respectively (OR=22.96, \chi^2=105.04) and (OR=7.06, \chi^2=290.11). The correlation between nodules and the volume of airways and vessels are (OR=9.19, \chi^2=159.02) and (OR=2.29, \chi^2=55.89). The areas-under-curves (AUCs) for pleurae, airways and vessels are respectively 0.5202, 0.6943 and 0.6529. Our results show that malignant nodules are often surrounded by more pulmonary structures compared with benign ones, suggesting that features of these structures could be viewed as lung cancer biomarkers.
    Unsupervised Deep Image Stitching: Reconstructing Stitched Features to Images. (arXiv:2106.12859v1 [cs.CV])
    (2 min) Traditional feature-based image stitching technologies rely heavily on feature detection quality, often failing to stitch images with few features or low resolution. The learning-based image stitching solutions are rarely studied due to the lack of labeled data, making the supervised methods unreliable. To address the above limitations, we propose an unsupervised deep image stitching framework consisting of two stages: unsupervised coarse image alignment and unsupervised image reconstruction. In the first stage, we design an ablation-based loss to constrain an unsupervised homography network, which is more suitable for large-baseline scenes. Moreover, a transformer layer is introduced to warp the input images in the stitching-domain space. In the second stage, motivated by the insight that the misalignments in pixel-level can be eliminated to a certain extent in feature-level, we design an unsupervised image reconstruction network to eliminate the artifacts from features to pixels. Specifically, the reconstruction network can be implemented by a low-resolution deformation branch and a high-resolution refined branch, learning the deformation rules of image stitching and enhancing the resolution simultaneously. To establish an evaluation benchmark and train the learning framework, a comprehensive real-world image dataset for unsupervised deep image stitching is presented and released. Extensive experiments well demonstrate the superiority of our method over other state-of-the-art solutions. Even compared with the supervised solutions, our image stitching quality is still preferred by users.
    Real-time Semantic Segmentation via Spatial-detail Guided Context Propagation. (arXiv:2005.11034v4 [cs.CV] UPDATED)
    (2 min) Nowadays, vision-based computing tasks play an important role in various real-world applications. However, many vision computing tasks, e.g. semantic segmentation, are usually computationally expensive, posing a challenge to the computing systems that are resource-constrained but require fast response speed. Therefore, it is valuable to develop accurate and real-time vision processing models that only require limited computational resources. To this end, we propose the Spatial-detail Guided Context Propagation Network (SGCPNet) for achieving real-time semantic segmentation. In SGCPNet, we propose the strategy of spatial-detail guided context propagation. It uses the spatial details of shallow layers to guide the propagation of the low-resolution global contexts, in which the lost spatial information can be effectively reconstructed. In this way, the need for maintaining high-resolution features along the network is freed, therefore largely improving the model efficiency. On the other hand, due to the effective reconstruction of spatial details, the segmentation accuracy can be still preserved. In the experiments, we validate the effectiveness and efficiency of the proposed SGCPNet model. On the Citysacpes dataset, for example, our SGCPNet achieves 69.5 % mIoU segmentation accuracy, while its speed reaches 178.5 FPS on 768x1536 images on a GeForce GTX 1080 Ti GPU card.
    Composition of Saliency Metrics for Channel Pruning with a Myopic Oracle. (arXiv:2004.03376v2 [cs.CV] UPDATED)
    (2 min) The computation and memory needed for Convolutional Neural Network (CNN) inference can be reduced by pruning weights from the trained network. Pruning is guided by a pruning saliency, which heuristically approximates the change in the loss function associated with the removal of specific weights. Many pruning signals have been proposed, but the performance of each heuristic depends on the particular trained network. This leaves the data scientist with a difficult choice. When using any one saliency metric for the entire pruning process, we run the risk of the metric assumptions being invalidated, leading to poor decisions being made by the metric. Ideally we could combine the best aspects of different saliency metrics. However, despite an extensive literature review, we are unable to find any prior work on composing different saliency metrics. The chief difficulty lies in combining the numerical output of different saliency metrics, which are not directly comparable. We propose a method to compose several primitive pruning saliencies, to exploit the cases where each saliency measure does well. Our experiments show that the composition of saliencies avoids many poor pruning choices identified by individual saliencies. In most cases our method finds better selections than even the best individual pruning saliency.
    Rate Distortion Characteristic Modeling for Neural Image Compression. (arXiv:2106.12954v1 [eess.IV])
    (2 min) End-to-end optimization capability offers neural image compression (NIC) superior lossy compression performance. However, distinct models are required to be trained to reach different points in the rate-distortion (R-D) space. In this paper, we consider the problem of R-D characteristic analysis and modeling for NIC. We make efforts to formulate the essential mathematical functions to describe the R-D behavior of NIC using deep network and statistical modeling. Thus continuous bit-rate points could be elegantly realized by leveraging such model via a single trained network. In this regard, we propose a plugin-in module to learn the relationship between the target bit-rate and the binary representation for the latent variable of auto-encoder. Furthermore, we model the rate and distortion characteristic of NIC as a function of the coding parameter $\lambda$ respectively. Our experiments show our proposed method is easy to adopt and obtains competitive coding performance with fixed-rate coding approaches, which would benefit the practical deployment of NIC. In addition, the proposed model could be applied to NIC rate control with limited bit-rate error using a single network.
    An Efficient $k$-modes Algorithm for Clustering Categorical Datasets. (arXiv:2006.03936v3 [stat.ME] UPDATED)
    (2 min) Mining clusters from data is an important endeavor in many applications. The $k$-means method is a popular, efficient, and distribution-free approach for clustering numerical-valued data, but does not apply for categorical-valued observations. The $k$-modes method addresses this lacuna by replacing the Euclidean with the Hamming distance and the means with the modes in the $k$-means objective function. We provide a novel, computationally efficient implementation of $k$-modes, called OTQT. We prove that OTQT finds updates to improve the objective function that are undetectable to existing $k$-modes algorithms. Although slightly slower per iteration due to algorithmic complexity, OTQT is always more accurate per iteration and almost always faster (and only barely slower on some datasets) to the final optimum. Thus, we recommend OTQT as the preferred, default algorithm for $k$-modes optimization.
    HyperNeRF: A Higher-Dimensional Representation for Topologically Varying Neural Radiance Fields. (arXiv:2106.13228v1 [cs.CV])
    (2 min) Neural Radiance Fields (NeRF) are able to reconstruct scenes with unprecedented fidelity, and various recent works have extended NeRF to handle dynamic scenes. A common approach to reconstruct such non-rigid scenes is through the use of a learned deformation field mapping from coordinates in each input image into a canonical template coordinate space. However, these deformation-based approaches struggle to model changes in topology, as topological changes require a discontinuity in the deformation field, but these deformation fields are necessarily continuous. We address this limitation by lifting NeRFs into a higher dimensional space, and by representing the 5D radiance field corresponding to each individual input image as a slice through this "hyper-space". Our method is inspired by level set methods, which model the evolution of surfaces as slices through a higher dimensional surface. We evaluate our method on two tasks: (i) interpolating smoothly between "moments", i.e., configurations of the scene, seen in the input images while maintaining visual plausibility, and (ii) novel-view synthesis at fixed moments. We show that our method, which we dub HyperNeRF, outperforms existing methods on both tasks by significant margins. Compared to Nerfies, HyperNeRF reduces average error rates by 8.6% for interpolation and 8.8% for novel-view synthesis, as measured by LPIPS.
    AutoAdapt: Automated Segmentation Network Search for Unsupervised Domain Adaptation. (arXiv:2106.13227v1 [cs.CV])
    (2 min) Neural network-based semantic segmentation has achieved remarkable results when large amounts of annotated data are available, that is, in the supervised case. However, such data is expensive to collect and so methods have been developed to adapt models trained on related, often synthetic data for which labels are readily available. Current adaptation approaches do not consider the dependence of the generalization/transferability of these models on network architecture. In this paper, we perform neural architecture search (NAS) to provide architecture-level perspective and analysis for domain adaptation. We identify the optimization gap that exists when searching architectures for unsupervised domain adaptation which makes this NAS problem uniquely difficult. We propose bridging this gap by using maximum mean discrepancy and regional weighted entropy to estimate the accuracy metric. Experimental results on several widely adopted benchmarks show that our proposed AutoAdapt framework indeed discovers architectures that improve the performance of a number of existing adaptation techniques.
    Evaluation of deep lift pose models for 3D rodent pose estimation based on geometrically triangulated data. (arXiv:2106.12993v1 [cs.CV])
    (2 min) The assessment of laboratory animal behavior is of central interest in modern neuroscience research. Behavior is typically studied in terms of pose changes, which are ideally captured in three dimensions. This requires triangulation over a multi-camera system which view the animal from different angles. However, this is challenging in realistic laboratory setups due to occlusions and other technical constrains. Here we propose the usage of lift-pose models that allow for robust 3D pose estimation of freely moving rodents from a single view camera view. To obtain high-quality training data for the pose-lifting, we first perform geometric calibration in a camera setup involving bottom as well as side views of the behaving animal. We then evaluate the performance of two previously proposed model architectures under given inference perspectives and conclude that reliable 3D pose inference can be obtained using temporal convolutions. With this work we would like to contribute to a more robust and diverse behavior tracking of freely moving rodents for a wide range of experiments and setups in the neuroscience community.
    FitVid: Overfitting in Pixel-Level Video Prediction. (arXiv:2106.13195v1 [cs.CV])
    (2 min) An agent that is capable of predicting what happens next can perform a variety of tasks through planning with no additional training. Furthermore, such an agent can internally represent the complex dynamics of the real-world and therefore can acquire a representation useful for a variety of visual perception tasks. This makes predicting the future frames of a video, conditioned on the observed past and potentially future actions, an interesting task which remains exceptionally challenging despite many recent advances. Existing video prediction models have shown promising results on simple narrow benchmarks but they generate low quality predictions on real-life datasets with more complicated dynamics or broader domain. There is a growing body of evidence that underfitting on the training data is one of the primary causes for the low quality predictions. In this paper, we argue that the inefficient use of parameters in the current video models is the main reason for underfitting. Therefore, we introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks while having similar parameter count as the current state-of-the-art models. We analyze the consequences of overfitting, illustrating how it can produce unexpected outcomes such as generating high quality output by repeating the training data, and how it can be mitigated using existing image augmentation techniques. As a result, FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
    High-resolution Image Registration of Consecutive and Re-stained Sections in Histopathology. (arXiv:2106.13150v1 [eess.IV])
    (2 min) We compare variational image registration in consectutive and re-stained sections from histopathology. We present a fully-automatic algorithm for non-parametric (nonlinear) image registration and apply it to a previously existing dataset from the ANHIR challenge (230 slide pairs, consecutive sections) and a new dataset (hybrid re-stained and consecutive, 81 slide pairs, ca. 3000 landmarks) which is made publicly available. Registration hyperparameters are obtained in the ANHIR dataset and applied to the new dataset without modification. In the new dataset, landmark errors after registration range from 13.2 micrometers for consecutive sections to 1 micrometer for re-stained sections. We observe that non-parametric registration leads to lower landmark errors in both cases, even though the effect is smaller in re-stained sections. The nucleus-level alignment after non-parametric registration of re-stained sections provides a valuable tool to generate automatic ground-truth for machine learning applications in histopathology.
    Video Swin Transformer. (arXiv:2106.13230v1 [cs.CV])
    (2 min) The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including on action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2). The code and models will be made publicly available at https://github.com/SwinTransformer/Video-Swin-Transformer.
    A Transformer-based Cross-modal Fusion Model with Adversarial Training for VQA Challenge 2021. (arXiv:2106.13033v1 [cs.CV])
    (2 min) In this paper, inspired by the successes of visionlanguage pre-trained models and the benefits from training with adversarial attacks, we present a novel transformerbased cross-modal fusion modeling by incorporating the both notions for VQA challenge 2021. Specifically, the proposed model is on top of the architecture of VinVL model [19], and the adversarial training strategy [4] is applied to make the model robust and generalized. Moreover, two implementation tricks are also used in our system to obtain better results. The experiments demonstrate that the novel framework can achieve 76.72% on VQAv2 test-std set.
    Advancing biological super-resolution microscopy through deep learning: a brief review. (arXiv:2106.13064v1 [physics.bio-ph])
    (2 min) Super-resolution microscopy overcomes the diffraction limit of conventional light microscopy in spatial resolution. By providing novel spatial or spatio-temporal information on biological processes at nanometer resolution with molecular specificity, it plays an increasingly important role in life sciences. However, its technical limitations require trade-offs to balance its spatial resolution, temporal resolution, and light exposure of samples. Recently, deep learning has achieved breakthrough performance in many image processing and computer vision tasks. It has also shown great promise in pushing the performance envelope of super-resolution microscopy. In this brief Review, we survey recent advances in using deep learning to enhance performance of super-resolution microscopy. We focus primarily on how deep learning ad-vances reconstruction of super-resolution images. Related key technical challenges are discussed. Despite the challenges, deep learning is set to play an indispensable and transformative role in the development of super-resolution microscopy. We conclude with an outlook on how deep learning could shape the future of this new generation of light microscopy technology.
    Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers. (arXiv:2106.13122v1 [cs.CV])
    (2 min) Recently, vision transformers and MLP-based models have been developed in order to address some of the prevalent weaknesses in convolutional neural networks. Due to the novelty of transformers being used in this domain along with the self-attention mechanism, it remains unclear to what degree these architectures are robust to corruptions. Despite some works proposing that data augmentation remains essential for a model to be robust against corruptions, we propose to explore the impact that the architecture has on corruption robustness. We find that vision transformer architectures are inherently more robust to corruptions than the ResNet-50 and MLP-Mixers. We also find that vision transformers with 5 times fewer parameters than a ResNet-50 have more shape bias. Our code is available to reproduce.
    A Simple and Strong Baseline: Progressively Region-based Scene Text Removal Networks. (arXiv:2106.13029v1 [cs.CV])
    (2 min) Existing scene text removal methods mainly train an elaborate network with paired images to realize the function of text localization and background reconstruction simultaneously, but there exists two problems: 1) lacking the exhaustive erasure of text region and 2) causing the excessive erasure to text-free areas. To handle these issues, this paper provides a novel ProgrEssively Region-based scene Text eraser (PERT), which introduces region-based modification strategy to progressively erase the pixels in only text region. Firstly, PERT decomposes the STR task to several erasing stages. As each stage aims to take a further step toward the text-removed image rather than directly regress to the final result, the decomposed operation reduces the learning difficulty in each stage, and an exhaustive erasure result can be obtained by iterating over lightweight erasing blocks with shared parameters. Then, PERT introduces a region-based modification strategy to ensure the integrity of text-free areas by decoupling text localization from erasure process to guide the removal. Benefiting from the simplicity architecture, PERT is a simple and strong baseline, and is easy to be followed and developed. Extensive experiments demonstrate that PERT obtains the state-of-the-art results on both synthetic and real-world datasets. Code is available athttps://github.com/wangyuxin87/PERT.
    Sparse Needlets for Lighting Estimation with Spherical Transport Loss. (arXiv:2106.13090v1 [cs.CV])
    (2 min) Accurate lighting estimation is challenging yet critical to many computer vision and computer graphics tasks such as high-dynamic-range (HDR) relighting. Existing approaches model lighting in either frequency domain or spatial domain which is insufficient to represent the complex lighting conditions in scenes and tends to produce inaccurate estimation. This paper presents NeedleLight, a new lighting estimation model that represents illumination with needlets and allows lighting estimation in both frequency domain and spatial domain jointly. An optimal thresholding function is designed to achieve sparse needlets which trims redundant lighting parameters and demonstrates superior localization properties for illumination representation. In addition, a novel spherical transport loss is designed based on optimal transport theory which guides to regress lighting representation parameters with consideration of the spatial information. Furthermore, we propose a new metric that is concise yet effective by directly evaluating the estimated illumination maps rather than rendered images. Extensive experiments show that NeedleLight achieves superior lighting estimation consistently across multiple evaluation metrics as compared with state-of-the-art methods.
    Depth Confidence-aware Camouflaged Object Detection. (arXiv:2106.13217v1 [cs.CV])
    (2 min) Camouflaged object detection (COD) aims to segment camouflaged objects hiding in the environment, which is challenging due to the similar appearance of camouflaged objects and their surroundings. Research in biology suggests that depth can provide useful object localization cues for camouflaged object discovery, as all the animals have 3D perception ability. However, the depth information has not been exploited for camouflaged object detection. To explore the contribution of depth for camouflage detection, we present a depth-guided camouflaged object detection network with pre-computed depth maps from existing monocular depth estimation methods. Due to the domain gap between the depth estimation dataset and our camouflaged object detection dataset, the generated depth may not be accurate enough to be directly used in our framework. We then introduce a depth quality assessment module to evaluate the quality of depth based on the model prediction from both RGB COD branch and RGB-D COD branch. During training, only high-quality depth is used to update the modal interaction module for multi-modal learning. During testing, our depth quality assessment module can effectively determine the contribution of depth and select the RGB branch or RGB-D branch for camouflage prediction. Extensive experiments on various camouflaged object detection datasets prove the effectiveness of our solution in exploring the depth information for camouflaged object detection. Our code and data is publicly available at: \url{https://github.com/JingZhang617/RGBD-COD}.
    FaDIV-Syn: Fast Depth-Independent View Synthesis. (arXiv:2106.13139v1 [cs.CV])
    (2 min) We introduce FaDIV-Syn, a fast depth-independent view synthesis method. Our multi-view approach addresses the problem that view synthesis methods are often limited by their depth estimation stage, where incorrect depth predictions can lead to large projection errors. To avoid this issue, we efficiently warp multiple input images into the target frame for a range of assumed depth planes. The resulting tensor representation is fed into a U-Net-like CNN with gated convolutions, which directly produces the novel output view. We therefore side-step explicit depth estimation. This improves efficiency and performance on transparent, reflective, and feature-less scene parts. FaDIV-Syn can handle both interpolation and extrapolation tasks and outperforms state-of-the-art extrapolation methods on the large-scale RealEstate10k dataset. In contrast to comparable methods, it is capable of real-time operation due to its lightweight architecture. We further demonstrate data efficiency of FaDIV-Syn by training from fewer examples as well as its generalization to higher resolutions and arbitrary depth ranges under severe depth discretization.
    SGTBN: Generating Dense Depth Maps from Single-Line LiDAR. (arXiv:2106.12994v1 [cs.CV])
    (2 min) Depth completion aims to generate a dense depth map from the sparse depth map and aligned RGB image. However, current depth completion methods use extremely expensive 64-line LiDAR(about $100,000) to obtain sparse depth maps, which will limit their application scenarios. Compared with the 64-line LiDAR, the single-line LiDAR is much less expensive and much more robust. Therefore, we propose a method to tackle the problem of single-line depth completion, in which we aim to generate a dense depth map from the single-line LiDAR info and the aligned RGB image. A single-line depth completion dataset is proposed based on the existing 64-line depth completion dataset(KITTI). A network called Semantic Guided Two-Branch Network(SGTBN) which contains global and local branches to extract and fuse global and local info is proposed for this task. A Semantic guided depth upsampling module is used in our network to make full use of the semantic info in RGB images. Except for the usual MSE loss, we add the virtual normal loss to increase the constraint of high-order 3D geometry in our network. Our network outperforms the state-of-the-art in the single-line depth completion task. Besides, compared with the monocular depth estimation, our method also has significant advantages in precision and model size.
    Detection of Deepfake Videos Using Long Distance Attention. (arXiv:2106.12832v1 [cs.CV])
    (2 min) With the rapid progress of deepfake techniques in recent years, facial video forgery can generate highly deceptive video contents and bring severe security threats. And detection of such forgery videos is much more urgent and challenging. Most existing detection methods treat the problem as a vanilla binary classification problem. In this paper, the problem is treated as a special fine-grained classification problem since the differences between fake and real faces are very subtle. It is observed that most existing face forgery methods left some common artifacts in the spatial domain and time domain, including generative defects in the spatial domain and inter-frame inconsistencies in the time domain. And a spatial-temporal model is proposed which has two components for capturing spatial and temporal forgery traces in global perspective respectively. The two components are designed using a novel long distance attention mechanism. The one component of the spatial domain is used to capture artifacts in a single frame, and the other component of the time domain is used to capture artifacts in consecutive frames. They generate attention maps in the form of patches. The attention method has a broader vision which contributes to better assembling global information and extracting local statistic information. Finally, the attention maps are used to guide the network to focus on pivotal parts of the face, just like other fine-grained classification methods. The experimental results on different public datasets demonstrate that the proposed method achieves the state-of-the-art performance, and the proposed long distance attention method can effectively capture pivotal parts for face forgery.
    Class agnostic moving target detection by color and location prediction of moving area. (arXiv:2106.12966v1 [cs.CV])
    (2 min) Moving target detection plays an important role in computer vision. However, traditional algorithms such as frame difference and optical flow usually suffer from low accuracy or heavy computation. Recent algorithms such as deep learning-based convolutional neural networks have achieved high accuracy and real-time performance, but they usually need to know the classes of targets in advance, which limits the practical applications. Therefore, we proposed a model free moving target detection algorithm. This algorithm extracts the moving area through the difference of image features. Then, the color and location probability map of the moving area will be calculated through maximum a posteriori probability. And the target probability map can be obtained through the dot multiply between the two maps. Finally, the optimal moving target area can be solved by stochastic gradient descent on the target probability map. Results show that the proposed algorithm achieves the highest accuracy compared with state-of-the-art algorithms, without needing to know the classes of targets. Furthermore, as the existing datasets are not suitable for moving target detection, we proposed a method for producing evaluation dataset. Besides, we also proved the proposed algorithm can be used to assist target tracking.
    Florida Wildlife Camera Trap Dataset. (arXiv:2106.12628v1 [cs.CV])
    (2 min) Trail camera imagery has increasingly gained popularity amongst biologists for conservation and ecological research. Minimal human interference required to operate camera traps allows capturing unbiased species activities. Several studies - based on human and wildlife interactions, migratory patterns of various species, risk of extinction in endangered populations - are limited by the lack of rich data and the time-consuming nature of manually annotating trail camera imagery. We introduce a challenging wildlife camera trap classification dataset collected from two different locations in Southwestern Florida, consisting of 104,495 images featuring visually similar species, varying illumination conditions, skewed class distribution, and including samples of endangered species, i.e. Florida panthers. Experimental evaluations with ResNet-50 architecture indicate that this image classification-based dataset can further push the advancements in wildlife statistical modeling. We will make the dataset publicly available.
    Frequency Domain Convolutional Neural Network: Accelerated CNN for Large Diabetic Retinopathy Image Classification. (arXiv:2106.12736v1 [cs.CV])
    (2 min) The conventional spatial convolution layers in the Convolutional Neural Networks (CNNs) are computationally expensive at the point where the training time could take days unless the number of layers, the number of training images or the size of the training images are reduced. The image size of 256x256 pixels is commonly used for most of the applications of CNN, but this image size is too small for applications like Diabetic Retinopathy (DR) classification where the image details are important for accurate classification. This research proposed Frequency Domain Convolution (FDC) and Frequency Domain Pooling (FDP) layers which were built with RFFT, kernel initialization strategy, convolution artifact removal and Channel Independent Convolution (CIC) to replace the conventional convolution and pooling layers. The FDC and FDP layers are used to build a Frequency Domain Convolutional Neural Network (FDCNN) to accelerate the training of large images for DR classification. The Full FDC layer is an extension of the FDC layer to allow direct use in conventional CNNs, it is also used to modify the VGG16 architecture. FDCNN is shown to be at least 54.21% faster and 70.74% more memory efficient compared to an equivalent CNN architecture. The modified VGG16 architecture with Full FDC layer is reported to achieve a shorter training time and a higher accuracy at 95.63% compared to the original VGG16 architecture for DR classification.
    High Performance Hyperspectral Image Classification using Graphics Processing Units. (arXiv:2106.12942v1 [cs.DC])
    (2 min) Real-time remote sensing applications like search and rescue missions, military target detection, environmental monitoring, hazard prevention and other time-critical applications require onboard real time processing capabilities or autonomous decision making. Some unmanned remote systems like satellites are physically remote from their operators, and all control of the spacecraft and data returned by the spacecraft must be transmitted over a wireless radio link. This link may not be available for extended periods when the satellite is out of line of sight of its ground station. Therefore, lightweight, small size and low power consumption hardware is essential for onboard real time processing systems. With increasing dimensionality, size and resolution of recent hyperspectral imaging sensors, additional challenges are posed upon remote sensing processing systems and more capable computing architectures are needed. Graphical Processing Units (GPUs) emerged as promising architecture for light weight high performance computing that can address these computational requirements for onboard systems. The goal of this study is to build high performance methods for onboard hyperspectral analysis. We propose accelerated methods for the well-known recursive hierarchical segmentation (RHSEG) clustering method, using GPUs, hybrid multicore CPU with a GPU and hybrid multi-core CPU/GPU clusters. RHSEG is a method developed by the National Aeronautics and Space Administration (NASA), which is designed to provide rich classification information with several output levels. The achieved speedups by parallel solutions compared to CPU sequential implementations are 21x for parallel single GPU and 240x for hybrid multi-node computer clusters with 16 computing nodes. The energy consumption is reduced to 74% using a single GPU compared to the equivalent parallel CPU cluster.
    Human Activity Recognition using Continuous Wavelet Transform and Convolutional Neural Networks. (arXiv:2106.12666v1 [cs.CV])
    (2 min) Quite a few people in the world have to stay under permanent surveillance for health reasons; they include diabetic people or people with some other chronic conditions, the elderly and the disabled.These groups may face heightened risk of having life-threatening falls or of being struck by a syncope. Due to limited availability of resources a substantial part of people at risk can not receive necessary monitoring and thus are exposed to excessive danger. Nowadays, this problem is usually solved via applying Human Activity Recognition (HAR) methods. HAR is a perspective and fast-paced Data Science field, which has a wide range of application areas such as healthcare, sport, security etc. However, the currently techniques of recognition are markedly lacking in accuracy, hence, the present paper suggests a highly accurate method for human activity classification. Wepropose a new workflow to address the HAR problem and evaluate it on the UniMiB SHAR dataset, which consists of the accelerometer signals. The model we suggest is based on continuous wavelet transform (CWT) and convolutional neural networks (CNNs). Wavelet transform localizes signal features both in time and frequency domains and after that a CNN extracts these features and recognizes activity. It is also worth noting that CWT converts 1D accelerometer signal into 2D images and thus enables to obtain better results as 2D networks have a significantly higher predictive capacity. In the course of the work we build a convolutional neural network and vary such model parameters as number of spatial axes, number of layers, number of neurons in each layer, image size, type of mother wavelet, the order of zero moment of mother wavelet etc. Besides, we also apply models with residual blocks which resulted in significantly higher metric values. Finally, we succeed to reach 99.26 % accuracy and it is a worthy performance for this problem.
    Regularisation for PCA- and SVD-type matrix factorisations. (arXiv:2106.12955v1 [cs.CV])
    (2 min) Singular Value Decomposition (SVD) and its close relative, Principal Component Analysis (PCA), are well-known linear matrix decomposition techniques that are widely used in applications such as dimension reduction and clustering. However, an important limitation of SVD/PCA is its sensitivity to noise in the input data. In this paper, we take another look at the problem of regularisation and show that different formulations of the minimisation problem lead to qualitatively different solutions.
    Planetary UAV localization based on Multi-modal Registration with Pre-existing Digital Terrain Model. (arXiv:2106.12738v1 [cs.CV])
    (2 min) The autonomous real-time optical navigation of planetary UAV is of the key technologies to ensure the success of the exploration. In such a GPS denied environment, vision-based localization is an optimal approach. In this paper, we proposed a multi-modal registration based SLAM algorithm, which estimates the location of a planet UAV using a nadir view camera on the UAV compared with pre-existing digital terrain model. To overcome the scale and appearance difference between on-board UAV images and pre-installed digital terrain model, a theoretical model is proposed to prove that topographic features of UAV image and DEM can be correlated in frequency domain via cross power spectrum. To provide the six-DOF of the UAV, we also developed an optimization approach which fuses the geo-referencing result into a SLAM system via LBA (Local Bundle Adjustment) to achieve robust and accurate vision-based navigation even in featureless planetary areas. To test the robustness and effectiveness of the proposed localization algorithm, a new cross-source drone-based localization dataset for planetary exploration is proposed. The proposed dataset includes 40200 synthetic drone images taken from nine planetary scenes with related DEM query images. Comparison experiments carried out demonstrate that over the flight distance of 33.8km, the proposed method achieved average localization error of 0.45 meters, compared to 1.31 meters by ORB-SLAM, with the processing speed of 12hz which will ensure a real-time performance. We will make our datasets available to encourage further work on this promising topic.
    Continuous-Time Deep Glioma Growth Models. (arXiv:2106.12917v1 [eess.IV])
    (2 min) The ability to estimate how a tumor might evolve in the future could have tremendous clinical benefits, from improved treatment decisions to better dose distribution in radiation therapy. Recent work has approached the glioma growth modeling problem via deep learning and variational inference, thus learning growth dynamics entirely from a real patient data distribution. So far, this approach was constrained to predefined image acquisition intervals and sequences of fixed length, which limits its applicability in more realistic scenarios. We overcome these limitations by extending Neural Processes, a class of conditional generative models for stochastic time series, with a hierarchical multi-scale representation encoding including a spatio-temporal attention mechanism. The result is a learned growth model that can be conditioned on an arbitrary number of observations, and that can produce a distribution of temporally consistent growth trajectories on a continuous time axis. On a dataset of 379 patients, the approach successfully captures both global and finer-grained variations in the images, exhibiting superior performance compared to other learned growth models.
    VinDr-SpineXR: A deep learning framework for spinal lesions detection and classification from radiographs. (arXiv:2106.12930v1 [eess.IV])
    (2 min) Radiographs are used as the most important imaging tool for identifying spine anomalies in clinical practice. The evaluation of spinal bone lesions, however, is a challenging task for radiologists. This work aims at developing and evaluating a deep learning-based framework, named VinDr-SpineXR, for the classification and localization of abnormalities from spine X-rays. First, we build a large dataset, comprising 10,468 spine X-ray images from 5,000 studies, each of which is manually annotated by an experienced radiologist with bounding boxes around abnormal findings in 13 categories. Using this dataset, we then train a deep learning classifier to determine whether a spine scan is abnormal and a detector to localize 7 crucial findings amongst the total 13. The VinDr-SpineXR is evaluated on a test set of 2,078 images from 1,000 studies, which is kept separate from the training set. It demonstrates an area under the receiver operating characteristic curve (AUROC) of 88.61% (95% CI 87.19%, 90.02%) for the image-level classification task and a mean average precision (mAP@0.5) of 33.56% for the lesion-level localization task. These results serve as a proof of concept and set a baseline for future research in this direction. To encourage advances, the dataset, codes, and trained deep learning models are made publicly available.
    Continual Novelty Detection. (arXiv:2106.12964v1 [cs.CV])
    (2 min) Novelty Detection methods identify samples that are not representative of a model's training set thereby flagging misleading predictions and bringing a greater flexibility and transparency at deployment time. However, research in this area has only considered Novelty Detection in the offline setting. Recently, there has been a growing realization in the computer vision community that applications demand a more flexible framework - Continual Learning - where new batches of data representing new domains, new classes or new tasks become available at different points in time. In this setting, Novelty Detection becomes more important, interesting and challenging. This work identifies the crucial link between the two problems and investigates the Novelty Detection problem under the Continual Learning setting. We formulate the Continual Novelty Detection problem and present a benchmark, where we compare several Novelty Detection methods under different Continual Learning settings. We show that Continual Learning affects the behaviour of novelty detection algorithms, while novelty detection can pinpoint insights in the behaviour of a continual learner. We further propose baselines and discuss possible research directions. We believe that the coupling of the two problems is a promising direction to bring vision models into practice.
    DCoM: A Deep Column Mapper for Semantic Data Type Detection. (arXiv:2106.12871v1 [cs.LG])
    (2 min) Detection of semantic data types is a very crucial task in data science for automated data cleaning, schema matching, data discovery, semantic data type normalization and sensitive data identification. Existing methods include regular expression-based or dictionary lookup-based methods that are not robust to dirty as well unseen data and are limited to a very less number of semantic data types to predict. Existing Machine Learning methods extract large number of engineered features from data and build logistic regression, random forest or feedforward neural network for this purpose. In this paper, we introduce DCoM, a collection of multi-input NLP-based deep neural networks to detect semantic data types where instead of extracting large number of features from the data, we feed the raw values of columns (or instances) to the model as texts. We train DCoM on 686,765 data columns extracted from VizNet corpus with 78 different semantic data types. DCoM outperforms other contemporary results with a quite significant margin on the same dataset.
    Feature Completion for Occluded Person Re-Identification. (arXiv:2106.12733v1 [cs.CV])
    (2 min) Person re-identification (reID) plays an important role in computer vision. However, existing methods suffer from performance degradation in occluded scenes. In this work, we propose an occlusion-robust block, Region Feature Completion (RFC), for occluded reID. Different from most previous works that discard the occluded regions, RFC block can recover the semantics of occluded regions in feature space. Firstly, a Spatial RFC (SRFC) module is developed. SRFC exploits the long-range spatial contexts from non-occluded regions to predict the features of occluded regions. The unit-wise prediction task leads to an encoder/decoder architecture, where the region-encoder models the correlation between non-occluded and occluded region, and the region-decoder utilizes the spatial correlation to recover occluded region features. Secondly, we introduce Temporal RFC (TRFC) module which captures the long-term temporal contexts to refine the prediction of SRFC. RFC block is lightweight, end-to-end trainable and can be easily plugged into existing CNNs to form RFCnet. Extensive experiments are conducted on occluded and commonly holistic reID benchmarks. Our method significantly outperforms existing methods on the occlusion datasets, while remains top even superior performance on holistic datasets. The source code is available at https://github.com/blue-blue272/OccludedReID-RFCnet.
    Multi-Modal 3D Object Detection in Autonomous Driving: a Survey. (arXiv:2106.12735v1 [cs.CV])
    (2 min) In the past few years, we have witnessed rapid development of autonomous driving. However, achieving full autonomy remains a daunting task due to the complex and dynamic driving environment. As a result, self-driving cars are equipped with a suite of sensors to conduct robust and accurate environment perception. As the number and type of sensors keep increasing, combining them for better perception is becoming a natural trend. So far, there has been no indepth review that focuses on multi-sensor fusion based perception. To bridge this gap and motivate future research, this survey devotes to review recent fusion-based 3D detection deep learning models that leverage multiple sensor data sources, especially cameras and LiDARs. In this survey, we first introduce the background of popular sensors for autonomous cars, including their common data representations as well as object detection networks developed for each type of sensor data. Next, we discuss some popular datasets for multi-modal 3D object detection, with a special focus on the sensor data included in each dataset. Then we present in-depth reviews of recent multi-modal 3D detection networks by considering the following three aspects of the fusion: fusion location, fusion data representation, and fusion granularity. After a detailed review, we discuss open challenges and point out possible solutions. We hope that our detailed review can help researchers to embark investigations in the area of multi-modal 3D object detection.
    Video Super-Resolution with Long-Term Self-Exemplars. (arXiv:2106.12778v1 [cs.CV])
    (2 min) Existing video super-resolution methods often utilize a few neighboring frames to generate a higher-resolution image for each frame. However, the redundant information between distant frames has not been fully exploited in these methods: corresponding patches of the same instance appear across distant frames at different scales. Based on this observation, we propose a video super-resolution method with long-term cross-scale aggregation that leverages similar patches (self-exemplars) across distant frames. Our model also consists of a multi-reference alignment module to fuse the features derived from similar patches: we fuse the features of distant references to perform high-quality super-resolution. We also propose a novel and practical training strategy for referenced-based super-resolution. To evaluate the performance of our proposed method, we conduct extensive experiments on our collected CarCam dataset and the Waymo Open dataset, and the results demonstrate our method outperforms state-of-the-art methods. Our source code will be publicly available.
    AVHYAS: A Free and Open Source QGIS Plugin for Advanced Hyperspectral Image Analysis. (arXiv:2106.12776v1 [eess.IV])
    (2 min) Advanced Hyperspectral Data Analysis Software (AVHYAS) plugin is a python3 based quantum GIS (QGIS) plugin designed to process and analyse hyperspectral (Hx) images. It is developed to guarantee full usage of present and future Hx airborne or spaceborne sensors and provides access to advanced algorithms for Hx data processing. The software is freely available and offers a range of basic and advanced tools such as atmospheric correction (for airborne AVIRISNG image), standard processing tools as well as powerful machine learning and Deep Learning interfaces for Hx data analysis.
    ATP-Net: An Attention-based Ternary Projection Network For Compressed Sensing. (arXiv:2106.12728v1 [eess.SP])
    (2 min) Compressed Sensing (CS) theory simultaneously realizes the signal sampling and compression process, and can use fewer observations to achieve accurate signal recovery, providing a solution for better and faster transmission of massive data. In this paper, a ternary sampling matrix-based method with attention mechanism is proposed with the purpose to solve the problem that the CS sampling matrices in most cases are random matrices, which are irrelative to the sampled signal and need a large storage space. The proposed method consists of three components, i.e., ternary sampling, initial reconstruction and deep reconstruction, with the emphasis on the ternary sampling. The main idea of the ternary method (-1, 0, +1) is to introduce the attention mechanism to evaluate the importance of parameters at the sampling layer after the sampling matrix is binarized (-1, +1), followed by pruning weight of parameters, whose importance is below a predefined threshold, to achieve ternarization. Furthermore, a compressed sensing algorithm especially for image reconstruction is implemented, on the basis of the ternary sampling matrix, which is called ATP-Net, i.e., Attention-based ternary projection network. Experimental results show that the quality of image reconstruction by means of ATP-Net maintains a satisfactory level with the employment of the ternary sampling matrix, i.e., the average PSNR on Set11 is 30.4 when the sampling rate is 0.25, approximately 6% improvement compared with that of DR2-Net.
    What makes visual place recognition easy or hard?. (arXiv:2106.12671v1 [cs.CV])
    (2 min) Visual place recognition is a fundamental capability for the localization of mobile robots. It places image retrieval in the practical context of physical agents operating in a physical world. It is an active field of research and many different approaches have been proposed and evaluated in many different experiments. In the following, we argue that due to variations of this practical context and individual design decisions, place recognition experiments are barely comparable across different papers and that there is a variety of properties that can change from one experiment to another. We provide an extensive list of such properties and give examples how they can be used to setup a place recognition experiment easier or harder. This might be interesting for different involved parties: (1) people who just want to select a place recognition approach that is suitable for the properties of their particular task at hand, (2) researchers that look for open research questions and are interested in particularly difficult instances, (3) authors that want to create reproducible papers on this topic, and (4) also reviewers that have the task to identify potential problems in papers under review.
    Deep Fake Detection: Survey of Facial Manipulation Detection Solutions. (arXiv:2106.12605v1 [cs.CV])
    (2 min) Deep Learning as a field has been successfully used to solve a plethora of complex problems, the likes of which we could not have imagined a few decades back. But as many benefits as it brings, there are still ways in which it can be used to bring harm to our society. Deep fakes have been proven to be one such problem, and now more than ever, when any individual can create a fake image or video simply using an application on the smartphone, there need to be some countermeasures, with which we can detect if the image or video is a fake or real and dispose of the problem threatening the trustworthiness of online information. Although the Deep fakes created by neural networks, may seem to be as real as a real image or video, it still leaves behind spatial and temporal traces or signatures after moderation, these signatures while being invisible to a human eye can be detected with the help of a neural network trained to specialize in Deep fake detection. In this paper, we analyze several such states of the art neural networks (MesoNet, ResNet-50, VGG-19, and Xception Net) and compare them against each other, to find an optimal solution for various scenarios like real-time deep fake detection to be deployed in online social media platforms where the classification should be made as fast as possible or for a small news agency where the classification need not be in real-time but requires utmost accuracy.
    Conditional Deformable Image Registration with Convolutional Neural Network. (arXiv:2106.12673v1 [cs.CV])
    (2 min) Recent deep learning-based methods have shown promising results and runtime advantages in deformable image registration. However, analyzing the effects of hyperparameters and searching for optimal regularization parameters prove to be too prohibitive in deep learning-based methods. This is because it involves training a substantial number of separate models with distinct hyperparameter values. In this paper, we propose a conditional image registration method and a new self-supervised learning paradigm for deep deformable image registration. By learning the conditional features that correlated with the regularization hyperparameter, we demonstrate that optimal solutions with arbitrary hyperparameters can be captured by a single deep convolutional neural network. In addition, the smoothness of the resulting deformation field can be manipulated with arbitrary strength of smoothness regularization during inference. Extensive experiments on a large-scale brain MRI dataset show that our proposed method enables the precise control of the smoothness of the deformation field without sacrificing the runtime advantage or registration accuracy.
  • cs.IR updates on arXiv.org

    Pattern-based Visualization of Knowledge Graphs. (arXiv:2106.12857v1 [cs.HC])
    (2 min) We present a novel approach to knowledge graph visualization based on ontology design patterns. This approach relies on OPLa (Ontology Pattern Language) annotations and on a catalogue of visual frames, which are associated with foundational ontology design patterns. We demonstrate that this approach significantly reduces the cognitive load required to users for visualizing and interpreting a knowledge graph and guides the user in exploring it through meaningful thematic paths provided by ontology patterns.
    Leveraging semantically similar queries for ranking via combining representations. (arXiv:2106.12621v1 [cs.LG])
    (2 min) In modern ranking problems, different and disparate representations of the items to be ranked are often available. It is sensible, then, to try to combine these representations to improve ranking. Indeed, learning to rank via combining representations is both principled and practical for learning a ranking function for a particular query. In extremely data-scarce settings, however, the amount of labeled data available for a particular query can lead to a highly variable and ineffective ranking function. One way to mitigate the effect of the small amount of data is to leverage information from semantically similar queries. Indeed, as we demonstrate in simulation settings and real data examples, when semantically similar queries are available it is possible to gainfully use them when ranking with respect to a particular query. We describe and explore this phenomenon in the context of the bias-variance trade off and apply it to the data-scarce settings of a Bing navigational graph and the Drosophila larva connectome.
    RikoNet: A Novel Anime Recommendation Engine. (arXiv:2106.12970v1 [cs.IR])
    (2 min) Anime is quite well-received today, especially among the younger generations. With many genres of available shows, more and more people are increasingly getting attracted to this niche section of the entertainment industry. As anime has recently garnered mainstream attention, we have insufficient information regarding users' penchant and watching habits. Therefore, it is an uphill task to build a recommendation engine for this relatively obscure entertainment medium. In this attempt, we have built a novel hybrid recommendation system that could act both as a recommendation system and as a means of exploring new anime genres and titles. We have analyzed the general trends in this field and the users' watching habits for coming up with our efficacious solution. Our solution employs deep autoencoders for the tasks of predicting ratings and generating embeddings. Following this, we formed clusters using the embeddings of the anime titles. These clusters form the search space for anime with similarities and are used to find anime similar to the ones liked and disliked by the user. This method, combined with the predicted ratings, forms the novel hybrid filter. In this article, we have demonstrated this idea and compared the performance of our implemented model with the existing state-of-the-art techniques.
    Discovering novel drug-supplement interactions using a dietary supplements knowledge graph generated from the biomedical literature. (arXiv:2106.12741v1 [cs.IR])
    (2 min) OBJECTIVE: Leverage existing biomedical NLP tools and DS domain terminology to produce a novel and comprehensive knowledge graph containing dietary supplement (DS) information for discovering interactions between DS and drugs, or Drug-Supplement Interactions (DSI). MATERIALS AND METHODS: We created SemRepDS (an extension of SemRep), capable of extracting semantic relations from abstracts by leveraging a DS-specific terminology (iDISK) containing 28,884 DS terms not found in the UMLS. PubMed abstracts were processed using SemRepDS to generate semantic relations, which were then filtered using a PubMedBERT-based model to remove incorrect relations before generating our knowledge graph (SuppKG). Two pathways are used to identify potential DS-Drug interactions which are then evaluated by medical professionals for mechanistic plausibility. RESULTS: Comparison analysis found that SemRepDS returned 206.9% more DS relations and 158.5% more DS entities than SemRep. The fine-tuned BERT model obtained an F1 score of 0.8605 and removed 43.86% of the relations, improving the precision of the relations by 26.4% compared to pre-filtering. SuppKG consists of 2,928 DS-specific nodes. Manual review of findings identified 44 (88%) proposed DS-Gene-Drug and 32 (64%) proposed DS-Gene1-Function-Gene2-Drug pathways to be mechanistically plausible. DISCUSSION: The additional relations extracted using SemRepDS generated SuppKG that was used to find plausible DSI not found in the current literature. By the nature of the SuppKG, these interactions are unlikely to have been found using SemRep without the expanded DS terminology. CONCLUSION: We successfully extend SemRep to include DS information and produce SuppKG which can be used to find potential DS-Drug interactions.
    Extreme Multi-label Learning for Semantic Matching in Product Search. (arXiv:2106.12657v1 [cs.IR])
    (2 min) We consider the problem of semantic matching in product search: given a customer query, retrieve all semantically related products from a huge catalog of size 100 million, or more. Because of large catalog spaces and real-time latency constraints, semantic matching algorithms not only desire high recall but also need to have low latency. Conventional lexical matching approaches (e.g., Okapi-BM25) exploit inverted indices to achieve fast inference time, but fail to capture behavioral signals between queries and products. In contrast, embedding-based models learn semantic representations from customer behavior data, but the performance is often limited by shallow neural encoders due to latency constraints. Semantic product search can be viewed as an eXtreme Multi-label Classification (XMC) problem, where customer queries are input instances and products are output labels. In this paper, we aim to improve semantic product search by using tree-based XMC models where inference time complexity is logarithmic in the number of products. We consider hierarchical linear models with n-gram features for fast real-time inference. Quantitatively, our method maintains a low latency of 1.25 milliseconds per query and achieves a 65% improvement of Recall@100 (60.9% v.s. 36.8%) over a competing embedding-based DSSM model. Our model is robust to weight pruning with varying thresholds, which can flexibly meet different system requirements for online deployments. Qualitatively, our method can retrieve products that are complementary to existing product search system and add diversity to the match set.
    A Novel Approach to Discover Switch Behaviours in Process Mining. (arXiv:2106.12765v1 [cs.DB])
    (2 min) Process mining is a relatively new subject which builds a bridge between process modelling and data mining. An exclusive choice in a process model usually splits the process into different branches. However, in some processes, it is possible to switch from one branch to another. The inductive miner guarantees to return sound process models, but fails to return a precise model when there are switch behaviours between different exclusive choice branches due to the limitation of process trees. In this paper, we present a novel extension to the process tree model to support switch behaviours between different branches of the exclusive choice operator and propose a novel extension to the inductive miner to discover sound process models with switch behaviours. The proposed discovery technique utilizes the theory of a previous study to detect possible switch behaviours. We apply both artificial and publicly-available datasets to evaluate our approach. Our results show that our approach can improve the precision of discovered models by 36% while maintaining high fitness values compared to the original inductive miner.
    The Stereotyping Problem in Collaboratively Filtered Recommender Systems. (arXiv:2106.12622v1 [cs.IR])
    (2 min) Recommender systems -- and especially matrix factorization-based collaborative filtering algorithms -- play a crucial role in mediating our access to online information. We show that such algorithms induce a particular kind of stereotyping: if preferences for a \textit{set} of items are anti-correlated in the general user population, then those items may not be recommended together to a user, regardless of that user's preferences and ratings history. First, we introduce a notion of \textit{joint accessibility}, which measures the extent to which a set of items can jointly be accessed by users. We then study joint accessibility under the standard factorization-based collaborative filtering framework, and provide theoretical necessary and sufficient conditions when joint accessibility is violated. Moreover, we show that these conditions can easily be violated when the users are represented by a single feature vector. To improve joint accessibility, we further propose an alternative modelling fix, which is designed to capture the diverse multiple interests of each user using a multi-vector representation. We conduct extensive experiments on real and simulated datasets, demonstrating the stereotyping problem with standard single-vector matrix factorization models.
    Detection, Analysis, and Prediction of Research Topics with Scientific Knowledge Graphs. (arXiv:2106.12875v1 [cs.DL])
    (2 min) Analysing research trends and predicting their impact on academia and industry is crucial to gain a deeper understanding of the advances in a research field and to inform critical decisions about research funding and technology adoption. In the last years, we saw the emergence of several publicly-available and large-scale Scientific Knowledge Graphs fostering the development of many data-driven approaches for performing quantitative analyses of research trends. This chapter presents an innovative framework for detecting, analysing, and forecasting research topics based on a large-scale knowledge graph characterising research articles according to the research topics from the Computer Science Ontology. We discuss the advantages of a solution based on a formal representation of topics and describe how it was applied to produce bibliometric studies and innovative tools for analysing and predicting research dynamics.
    An Automated Knowledge Mining and Document Classification System with Multi-model Transfer Learning. (arXiv:2106.12744v1 [cs.CL])
    (2 min) Service manual documents are crucial to the engineering company as they provide guidelines and knowledge to service engineers. However, it has become inconvenient and inefficient for service engineers to retrieve specific knowledge from documents due to the complexity of resources. In this research, we propose an automated knowledge mining and document classification system with novel multi-model transfer learning approaches. Particularly, the classification performance of the system has been improved with three effective techniques: fine-tuning, pruning, and multi-model method. The fine-tuning technique optimizes a pre-trained BERT model by adding a feed-forward neural network layer and the pruning technique is used to retrain the BERT model with new data. The multi-model method initializes and trains multiple BERT models to overcome the randomness of data ordering during the fine-tuning process. In the first iteration of the training process, multiple BERT models are being trained simultaneously. The best model is then selected for the next phase of the training process with another two iterations and the training processes for other BERT models will be terminated. The performance of the proposed system has been evaluated by comparing with two robust baseline methods, BERT and BERT-CNN. Experimental results on a widely used Corpus of Linguistic Acceptability (CoLA) dataset have shown that the proposed techniques perform better than these baseline methods in terms of accuracy and MCC score.
  • cs.LG updates on arXiv.org

    Composition of Saliency Metrics for Channel Pruning with a Myopic Oracle. (arXiv:2004.03376v2 [cs.CV] UPDATED)
    (2 min) The computation and memory needed for Convolutional Neural Network (CNN) inference can be reduced by pruning weights from the trained network. Pruning is guided by a pruning saliency, which heuristically approximates the change in the loss function associated with the removal of specific weights. Many pruning signals have been proposed, but the performance of each heuristic depends on the particular trained network. This leaves the data scientist with a difficult choice. When using any one saliency metric for the entire pruning process, we run the risk of the metric assumptions being invalidated, leading to poor decisions being made by the metric. Ideally we could combine the best aspects of different saliency metrics. However, despite an extensive literature review, we are unable to find any prior work on composing different saliency metrics. The chief difficulty lies in combining the numerical output of different saliency metrics, which are not directly comparable. We propose a method to compose several primitive pruning saliencies, to exploit the cases where each saliency measure does well. Our experiments show that the composition of saliencies avoids many poor pruning choices identified by individual saliencies. In most cases our method finds better selections than even the best individual pruning saliency.
    Semi-supervised Vector-valued Learning: From Theory to Algorithm. (arXiv:1909.04883v3 [cs.LG] UPDATED)
    (2 min) Vector-valued learning, where the output space admits a vector-valued structure, is an important problem that covers a broad family of important domains, e.g. multi-label learning and multi-class classification. Using local Rademacher complexity and unlabeled data, we derive novel data-dependent excess risk bounds for learning vector-valued functions in both the kernel space and linear space. The derived bounds are much sharper than existing ones, where convergence rates are improved from $\mathcal{O}(1/\sqrt{n})$ to $\mathcal{O}(1/\sqrt{n+u}),$ and $\mathcal{O}(1/n)$ in special cases. Motivated by our theoretical analysis, we propose a unified framework for learning vector-valued functions, incorporating both local Rademacher complexity and Laplacian regularization. Empirical results on a wide number of benchmark datasets show that the proposed algorithm significantly outperforms baseline methods, which coincides with our theoretical findings.
    Faster Policy Learning with Continuous-Time Gradients. (arXiv:2012.06684v2 [cs.LG] UPDATED)
    (2 min) We study the estimation of policy gradients for continuous-time systems with known dynamics. By reframing policy learning in continuous-time, we show that it is possible construct a more efficient and accurate gradient estimator. The standard back-propagation through time estimator (BPTT) computes exact gradients for a crude discretization of the continuous-time system. In contrast, we approximate continuous-time gradients in the original system. With the explicit goal of estimating continuous-time gradients, we are able to discretize adaptively and construct a more efficient policy gradient estimator which we call the Continuous-Time Policy Gradient (CTPG). We show that replacing BPTT policy gradients with more efficient CTPG estimates results in faster and more robust learning in a variety of control tasks and simulators.
    A Federated Learning Approach to Anomaly Detection in Smart Buildings. (arXiv:2010.10293v3 [cs.LG] UPDATED)
    (2 min) Internet of Things (IoT) sensors in smart buildings are becoming increasingly ubiquitous, making buildings more livable, energy efficient, and sustainable. These devices sense the environment and generate multivariate temporal data of paramount importance for detecting anomalies and improving the prediction of energy usage in smart buildings. However, detecting these anomalies in centralized systems is often plagued by a huge delay in response time. To overcome this issue, we formulate the anomaly detection problem in a federated learning setting by leveraging the multi-task learning paradigm, which aims at solving multiple tasks simultaneously while taking advantage of the similarities and differences across tasks. We propose a novel privacy-by-design federated learning model using a stacked long short-time memory (LSTM) model, and we demonstrate that it is more than twice as fast during training convergence compared to the centralized LSTM. The effectiveness of our federated learning approach is demonstrated on three real-world datasets generated by the IoT production system at General Electric Current smart building, achieving state-of-the-art performance compared to baseline methods in both classification and regression tasks. Our experimental results demonstrate the effectiveness of the proposed framework in reducing the overall training cost without compromising the prediction performance.
    Few-Shot Bearing Fault Diagnosis Based on Model-Agnostic Meta-Learning. (arXiv:2007.12851v4 [cs.LG] UPDATED)
    (2 min) The rapid development of artificial intelligence and deep learning has provided many opportunities to further enhance the safety, stability, and accuracy of industrial Cyber-Physical Systems (CPS). As indispensable components to many mission-critical CPS assets and equipment, mechanical bearings need to be monitored to identify any trace of abnormal conditions. Most of the data-driven approaches applied to bearing fault diagnosis up-to-date are trained using a large amount of fault data collected a priori. In many practical applications, however, it can be unsafe and time-consuming to collect sufficient data samples for each fault category, making it challenging to train a robust classifier. In this paper, we propose a few-shot learning framework for bearing fault diagnosis based on model-agnostic meta-learning (MAML), which targets for training an effective fault classifier using limited data. In addition, it can leverage the training data and learn to identify new fault scenarios more efficiently. Case studies on the generalization to new artificial faults show that the proposed framework achieves an overall accuracy up to 25% higher than a Siamese network-based benchmark study. Finally, the robustness and the generalization capability of the proposed framework are further validated by applying it to identify real bearing damages using data from artificial damages, which compares favorably against 6 state-of-the-art few-shot learning algorithms using consistent test environments.
    MIxBN: library for learning Bayesian networks from mixed data. (arXiv:2106.13194v1 [stat.ML])
    (2 min) This paper describes a new library for learning Bayesian networks from data containing discrete and continuous variables (mixed data). In addition to the classical learning methods on discretized data, this library proposes its algorithm that allows structural learning and parameters learning from mixed data without discretization since data discretization leads to information loss. This algorithm based on mixed MI score function for structural learning, and also linear regression and Gaussian distribution approximation for parameters learning. The library also offers two algorithms for enumerating graph structures - the greedy Hill-Climbing algorithm and the evolutionary algorithm. Thus the key capabilities of the proposed library are as follows: (1) structural and parameters learning of a Bayesian network on discretized data, (2) structural and parameters learning of a Bayesian network on mixed data using the MI mixed score function and Gaussian approximation, (3) launching learning algorithms on one of two algorithms for enumerating graph structures - Hill-Climbing and the evolutionary algorithm. Since the need for mixed data representation comes from practical necessity, the advantages of our implementations are evaluated in the context of solving approximation and gap recovery problems on synthetic data and real datasets.
    Efficient Non-parametric Bayesian Hawkes Processes. (arXiv:1810.03730v4 [cs.LG] UPDATED)
    (2 min) In this paper, we develop an efficient nonparametric Bayesian estimation of the kernel function of Hawkes processes. The non-parametric Bayesian approach is important because it provides flexible Hawkes kernels and quantifies their uncertainty. Our method is based on the cluster representation of Hawkes processes. Utilizing the stationarity of the Hawkes process, we efficiently sample random branching structures and thus, we split the Hawkes process into clusters of Poisson processes. We derive two algorithms -- a block Gibbs sampler and a maximum a posteriori estimator based on expectation maximization -- and we show that our methods have a linear time complexity, both theoretically and empirically. On synthetic data, we show our methods to be able to infer flexible Hawkes triggering kernels. On two large-scale Twitter diffusion datasets, we show that our methods outperform the current state-of-the-art in goodness-of-fit and that the time complexity is linear in the size of the dataset. We also observe that on diffusions related to online videos, the learned kernels reflect the perceived longevity for different content types such as music or pets videos.
    Model-Based Reinforcement Learning via Latent-Space Collocation. (arXiv:2106.13229v1 [cs.LG])
    (2 min) The ability to plan into the future while utilizing only raw high-dimensional observations, such as images, can provide autonomous agents with broad capabilities. Visual model-based reinforcement learning (RL) methods that plan future actions directly have shown impressive results on tasks that require only short-horizon reasoning, however, these methods struggle on temporally extended tasks. We argue that it is easier to solve long-horizon tasks by planning sequences of states rather than just actions, as the effects of actions greatly compound over time and are harder to optimize. To achieve this, we draw on the idea of collocation, which has shown good results on long-horizon tasks in optimal control literature, and adapt it to the image-based setting by utilizing learned latent state space models. The resulting latent collocation method (LatCo) optimizes trajectories of latent states, which improves over previously proposed shooting methods for visual model-based RL on tasks with sparse rewards and long-term goals. Videos and code at https://orybkin.github.io/latco/.
    Towards Fully Interpretable Deep Neural Networks: Are We There Yet?. (arXiv:2106.13164v1 [cs.LG])
    (2 min) Despite the remarkable performance, Deep Neural Networks (DNNs) behave as black-boxes hindering user trust in Artificial Intelligence (AI) systems. Research on opening black-box DNN can be broadly categorized into post-hoc methods and inherently interpretable DNNs. While many surveys have been conducted on post-hoc interpretation methods, little effort is devoted to inherently interpretable DNNs. This paper provides a review of existing methods to develop DNNs with intrinsic interpretability, with a focus on Convolutional Neural Networks (CNNs). The aim is to understand the current progress towards fully interpretable DNNs that can cater to different interpretation requirements. Finally, we identify gaps in current work and suggest potential research directions.
    Video Swin Transformer. (arXiv:2106.13230v1 [cs.CV])
    (2 min) The vision community is witnessing a modeling shift from CNNs to Transformers, where pure Transformer architectures have attained top accuracy on the major video recognition benchmarks. These video models are all built on Transformer layers that globally connect patches across the spatial and temporal dimensions. In this paper, we instead advocate an inductive bias of locality in video Transformers, which leads to a better speed-accuracy trade-off compared to previous approaches which compute self-attention globally even with spatial-temporal factorization. The locality of the proposed video architecture is realized by adapting the Swin Transformer designed for the image domain, while continuing to leverage the power of pre-trained image models. Our approach achieves state-of-the-art accuracy on a broad range of video recognition benchmarks, including on action recognition (84.9 top-1 accuracy on Kinetics-400 and 86.1 top-1 accuracy on Kinetics-600 with ~20x less pre-training data and ~3x smaller model size) and temporal modeling (69.6 top-1 accuracy on Something-Something v2). The code and models will be made publicly available at https://github.com/SwinTransformer/Video-Swin-Transformer.
    Machine learning to tame divergent density functional approximations: a new path to consensus materials design principles. (arXiv:2106.13109v1 [cond-mat.mtrl-sci])
    (2 min) Computational virtual high-throughput screening (VHTS) with density functional theory (DFT) and machine-learning (ML)-acceleration is essential in rapid materials discovery. By necessity, efficient DFT-based workflows are carried out with a single density functional approximation (DFA). Nevertheless, properties evaluated with different DFAs can be expected to disagree for the cases with challenging electronic structure (e.g., open shell transition metal complexes, TMCs) for which rapid screening is most needed and accurate benchmarks are often unavailable. To quantify the effect of DFA bias, we introduce an approach to rapidly obtain property predictions from 23 representative DFAs spanning multiple families and "rungs" (e.g., semi-local to double hybrid) and basis sets on over 2,000 TMCs. Although computed properties (e.g., spin-state ordering and frontier orbital gap) naturally differ by DFA, high linear correlations persist across all DFAs. We train independent ML models for each DFA and observe convergent trends in feature importance; these features thus provide DFA-invariant, universal design rules. We devise a strategy to train ML models informed by all 23 DFAs and use them to predict properties (e.g., spin-splitting energy) of over 182k TMCs. By requiring consensus of the ANN-predicted DFA properties, we improve correspondence of these computational lead compounds with literature-mined, experimental compounds over the single-DFA approach typically employed. Both feature analysis and consensus-based ML provide efficient, alternative paths to overcome accuracy limitations of practical DFT.
    Exploring Corruption Robustness: Inductive Biases in Vision Transformers and MLP-Mixers. (arXiv:2106.13122v1 [cs.CV])
    (2 min) Recently, vision transformers and MLP-based models have been developed in order to address some of the prevalent weaknesses in convolutional neural networks. Due to the novelty of transformers being used in this domain along with the self-attention mechanism, it remains unclear to what degree these architectures are robust to corruptions. Despite some works proposing that data augmentation remains essential for a model to be robust against corruptions, we propose to explore the impact that the architecture has on corruption robustness. We find that vision transformer architectures are inherently more robust to corruptions than the ResNet-50 and MLP-Mixers. We also find that vision transformers with 5 times fewer parameters than a ResNet-50 have more shape bias. Our code is available to reproduce.
    Towards Understanding and Mitigating Social Biases in Language Models. (arXiv:2106.13219v1 [cs.CL])
    (2 min) As machine learning methods are deployed in real-world settings such as healthcare, legal systems, and social science, it is crucial to recognize how they shape social biases and stereotypes in these sensitive decision-making processes. Among such real-world deployments are large-scale pretrained language models (LMs) that can be potentially dangerous in manifesting undesirable representational biases - harmful biases resulting from stereotyping that propagate negative generalizations involving gender, race, religion, and other social constructs. As a step towards improving the fairness of LMs, we carefully define several sources of representational biases before proposing new benchmarks and metrics to measure them. With these tools, we propose steps towards mitigating social biases during text generation. Our empirical results and human evaluation demonstrate effectiveness in mitigating bias while retaining crucial contextual information for high-fidelity text generation, thereby pushing forward the performance-fairness Pareto frontier.
    FitVid: Overfitting in Pixel-Level Video Prediction. (arXiv:2106.13195v1 [cs.CV])
    (2 min) An agent that is capable of predicting what happens next can perform a variety of tasks through planning with no additional training. Furthermore, such an agent can internally represent the complex dynamics of the real-world and therefore can acquire a representation useful for a variety of visual perception tasks. This makes predicting the future frames of a video, conditioned on the observed past and potentially future actions, an interesting task which remains exceptionally challenging despite many recent advances. Existing video prediction models have shown promising results on simple narrow benchmarks but they generate low quality predictions on real-life datasets with more complicated dynamics or broader domain. There is a growing body of evidence that underfitting on the training data is one of the primary causes for the low quality predictions. In this paper, we argue that the inefficient use of parameters in the current video models is the main reason for underfitting. Therefore, we introduce a new architecture, named FitVid, which is capable of severe overfitting on the common benchmarks while having similar parameter count as the current state-of-the-art models. We analyze the consequences of overfitting, illustrating how it can produce unexpected outcomes such as generating high quality output by repeating the training data, and how it can be mitigated using existing image augmentation techniques. As a result, FitVid outperforms the current state-of-the-art models across four different video prediction benchmarks on four different metrics.
    Partial Maximum Correntropy Regression for Robust Trajectory Decoding from Noisy Epidural Electrocorticographic Signals. (arXiv:2106.13086v1 [eess.SP])
    (2 min) The Partial Least Square Regression (PLSR) algorithm exhibits exceptional competence for predicting continuous variables from inter-correlated brain recordings in brain-computer interfaces, which achieved successful prediction from epidural electrocorticography of macaques to three-dimensional continuous hand trajectories recently. Nevertheless, PLSR is in essence formulated based on the least square criterion, thus, being non-robust with respect to complicated noises consequently. The aim of the present study is to propose a robust version of PLSR. To this end, the maximum correntropy criterion is adopted to structure a new robust variant of PLSR, namely Partial Maximum Correntropy Regression (PMCR). Half-quadratic optimization technique is utilized to calculate the robust latent variables. We assess the proposed PMCR on a synthetic example and the public Neurotycho dataset. Compared with the conventional PLSR and the state-of-the-art variant, PMCR realized superior prediction competence on three different performance indicators with contaminated training set. The proposed PMCR was demonstrated as an effective approach for robust decoding from noisy brain measurements, which could reduce the performance degradation resulting from adverse noises, thus, improving the decoding robustness of brain-computer interfaces.
    Unifying Gradient Estimators for Meta-Reinforcement Learning via Off-Policy Evaluation. (arXiv:2106.13125v1 [cs.LG])
    (2 min) Model-agnostic meta-reinforcement learning requires estimating the Hessian matrix of value functions. This is challenging from an implementation perspective, as repeatedly differentiating policy gradient estimates may lead to biased Hessian estimates. In this work, we provide a unifying framework for estimating higher-order derivatives of value functions, based on off-policy evaluation. Our framework interprets a number of prior approaches as special cases and elucidates the bias and variance trade-off of Hessian estimates. This framework also opens the door to a new family of estimates, which can be easily implemented with auto-differentiation libraries, and lead to performance gains in practice.
    Real-time Spatio-temporal Event Detection on Geotagged Social Media. (arXiv:2106.13121v1 [cs.SI])
    (2 min) A key challenge in mining social media data streams is to identify events which are actively discussed by a group of people in a specific local or global area. Such events are useful for early warning for accident, protest, election or breaking news. However, neither the list of events nor the resolution of both event time and space is fixed or known beforehand. In this work, we propose an online spatio-temporal event detection system using social media that is able to detect events at different time and space resolutions. First, to address the challenge related to the unknown spatial resolution of events, a quad-tree method is exploited in order to split the geographical space into multiscale regions based on the density of social media data. Then, a statistical unsupervised approach is performed that involves Poisson distribution and a smoothing method for highlighting regions with unexpected density of social posts. Further, event duration is precisely estimated by merging events happening in the same region at consecutive time intervals. A post processing stage is introduced to filter out events that are spam, fake or wrong. Finally, we incorporate simple semantics by using social media entities to assess the integrity, and accuracy of detected events. The proposed method is evaluated using different social media datasets: Twitter and Flickr for different cities: Melbourne, London, Paris and New York. To verify the effectiveness of the proposed method, we compare our results with two baseline algorithms based on fixed split of geographical space and clustering method. For performance evaluation, we manually compute recall and precision. We also propose a new quality measure named strength index, which automatically measures how accurate the reported event is.
    Fast and Efficient Locomotion via Learned Gait Transitions. (arXiv:2104.04644v2 [cs.RO] UPDATED)
    (2 min) We focus on the problem of developing energy efficient controllers for quadrupedal robots. Animals can actively switch gaits at different speeds to lower their energy consumption. In this paper, we devise a hierarchical learning framework, in which distinctive locomotion gaits and natural gait transitions emerge automatically with a simple reward of energy minimization. We use reinforcement learning to train a high-level gait policy that specifies gait patterns of each foot, while the low-level whole-body controller optimizes the motor commands so that the robot can walk at a desired velocity using that gait pattern. We test our learning framework on a quadruped robot and demonstrate automatic gait transitions, from walking to trotting and to fly-trotting, as the robot increases its speed. We show that the learned hierarchical controller consumes much less energy across a wide range of locomotion speed than baseline controllers.
    Pre-training transformer-based framework on large-scale pediatric claims data for downstream population-specific tasks. (arXiv:2106.13095v1 [cs.LG])
    (2 min) The adoption of electronic health records (EHR) has become universal during the past decade, which has afforded in-depth data-based research. By learning from the large amount of healthcare data, various data-driven models have been built to predict future events for different medical tasks, such as auto diagnosis and heart-attack prediction. Although EHR is abundant, the population that satisfies specific criteria for learning population-specific tasks is scarce, making it challenging to train data-hungry deep learning models. This study presents the Claim Pre-Training (Claim-PT) framework, a generic pre-training model that first trains on the entire pediatric claims dataset, followed by a discriminative fine-tuning on each population-specific task. The semantic meaning of medical events can be captured in the pre-training stage, and the effective knowledge transfer is completed through the task-aware fine-tuning stage. The fine-tuning process requires minimal parameter modification without changing the model architecture, which mitigates the data scarcity issue and helps train the deep learning model adequately on small patient cohorts. We conducted experiments on a real-world claims dataset with more than one million patient records. Experimental results on two downstream tasks demonstrated the effectiveness of our method: our general task-agnostic pre-training framework outperformed tailored task-specific models, achieving more than 10\% higher in model performance as compared to baselines. In addition, our framework showed a great generalizability potential to transfer learned knowledge from one institution to another, paving the way for future healthcare model pre-training across institutions.
    Q-space Conditioned Translation Networks for Directional Synthesis of Diffusion Weighted Images from Multi-modal Structural MRI. (arXiv:2106.13188v1 [eess.IV])
    (2 min) Current deep learning approaches for diffusion MRI modeling circumvent the need for densely-sampled diffusion-weighted images (DWIs) by directly predicting microstructural indices from sparsely-sampled DWIs. However, they implicitly make unrealistic assumptions of static $q$-space sampling during training and reconstruction. Further, such approaches can restrict downstream usage of variably sampled DWIs for usages including the estimation of microstructural indices or tractography. We propose a generative adversarial translation framework for high-quality DWI synthesis with arbitrary $q$-space sampling given commonly acquired structural images (e.g., B0, T1, T2). Our translation network linearly modulates its internal representations conditioned on continuous $q$-space information, thus removing the need for fixed sampling schemes. Moreover, this approach enables downstream estimation of high-quality microstructural maps from arbitrarily subsampled DWIs, which may be particularly important in cases with sparsely sampled DWIs. Across several recent methodologies, the proposed approach yields improved DWI synthesis accuracy and fidelity with enhanced downstream utility as quantified by the accuracy of scalar microstructure indices estimated from the synthesized images. Code is available at https://github.com/mengweiren/q-space-conditioned-dwi-synthesis.
    Differential Privacy and Byzantine Resilience in SGD: Do They Add Up?. (arXiv:2102.08166v3 [cs.LG] UPDATED)
    (2 min) This paper addresses the problem of combining Byzantine resilience with privacy in machine learning (ML). Specifically, we study if a distributed implementation of the renowned Stochastic Gradient Descent (SGD) learning algorithm is feasible with both differential privacy (DP) and $(\alpha,f)$-Byzantine resilience. To the best of our knowledge, this is the first work to tackle this problem from a theoretical point of view. A key finding of our analyses is that the classical approaches to these two (seemingly) orthogonal issues are incompatible. More precisely, we show that a direct composition of these techniques makes the guarantees of the resulting SGD algorithm depend unfavourably upon the number of parameters of the ML model, making the training of large models practically infeasible. We validate our theoretical results through numerical experiments on publicly-available datasets; showing that it is impractical to ensure DP and Byzantine resilience simultaneously.
    Efficient Black-Box Planning Using Macro-Actions with Focused Effects. (arXiv:2004.13242v3 [cs.AI] UPDATED)
    (2 min) The difficulty of deterministic planning increases exponentially with search-tree depth. Black-box planning presents an even greater challenge, since planners must operate without an explicit model of the domain. Heuristics can make search more efficient, but goal-aware heuristics for black-box planning usually rely on goal counting, which is often quite uninformative. In this work, we show how to overcome this limitation by discovering macro-actions that make the goal-count heuristic more accurate. Our approach searches for macro-actions with focused effects (i.e. macros that modify only a small number of state variables), which align well with the assumptions made by the goal-count heuristic. Focused macros dramatically improve black-box planning efficiency across a wide range of planning domains, sometimes beating even state-of-the-art planners with access to a full domain model.
    RikoNet: A Novel Anime Recommendation Engine. (arXiv:2106.12970v1 [cs.IR])
    (2 min) Anime is quite well-received today, especially among the younger generations. With many genres of available shows, more and more people are increasingly getting attracted to this niche section of the entertainment industry. As anime has recently garnered mainstream attention, we have insufficient information regarding users' penchant and watching habits. Therefore, it is an uphill task to build a recommendation engine for this relatively obscure entertainment medium. In this attempt, we have built a novel hybrid recommendation system that could act both as a recommendation system and as a means of exploring new anime genres and titles. We have analyzed the general trends in this field and the users' watching habits for coming up with our efficacious solution. Our solution employs deep autoencoders for the tasks of predicting ratings and generating embeddings. Following this, we formed clusters using the embeddings of the anime titles. These clusters form the search space for anime with similarities and are used to find anime similar to the ones liked and disliked by the user. This method, combined with the predicted ratings, forms the novel hybrid filter. In this article, we have demonstrated this idea and compared the performance of our implemented model with the existing state-of-the-art techniques.
    On the relationship between predictive coding and backpropagation. (arXiv:2106.13082v1 [q-bio.NC])
    (2 min) In this manuscript, I review and extend recent work on the relationship between predictive coding and backpropagation for training artificial neural networks on supervised learning tasks. I also discuss some implications of these results for the interpretation of predictive coding and deep neural networks as models of biological learning and I describe a repository of functions, Torch2PC, for performing predictive coding with PyTorch neural network models.
    Abstraction of Markov Population Dynamics via Generative Adversarial Nets. (arXiv:2106.12981v1 [cs.LG])
    (2 min) Markov Population Models are a widespread formalism used to model the dynamics of complex systems, with applications in Systems Biology and many other fields. The associated Markov stochastic process in continuous time is often analyzed by simulation, which can be costly for large or stiff systems, particularly when a massive number of simulations has to be performed (e.g. in a multi-scale model). A strategy to reduce computational load is to abstract the population model, replacing it with a simpler stochastic model, faster to simulate. Here we pursue this idea, building on previous works and constructing a generator capable of producing stochastic trajectories in continuous space and discrete time. This generator is learned automatically from simulations of the original model in a Generative Adversarial setting. Compared to previous works, which rely on deep neural networks and Dirichlet processes, we explore the use of state of the art generative models, which are flexible enough to learn a full trajectory rather than a single transition kernel.
    Receiver operating characteristic (ROC) movies, universal ROC (UROC) curves, and coefficient of predictive ability (CPA). (arXiv:1912.01956v3 [stat.ML] UPDATED)
    (2 min) Throughout science and technology, receiver operating characteristic (ROC) curves and associated area under the curve (AUC) measures constitute powerful tools for assessing the predictive abilities of features, markers and tests in binary classification problems. Despite its immense popularity, ROC analysis has been subject to a fundamental restriction, in that it applies to dichotomous (yes or no) outcomes only. Here we introduce ROC movies and universal ROC (UROC) curves that apply to just any linearly ordered outcome, along with an associated coefficient of predictive ability (CPA) measure. CPA equals the area under the UROC curve, and admits appealing interpretations in terms of probabilities and rank based covariances. For binary outcomes CPA equals AUC, and for pairwise distinct outcomes CPA relates linearly to Spearman's coefficient, in the same way that the C index relates linearly to Kendall's coefficient. ROC movies, UROC curves, and CPA nest and generalize the tools of classical ROC analysis, and are bound to supersede them in a wealth of applications. Their usage is illustrated in data examples from biomedicine and meteorology, where rank based measures yield new insights in the WeatherBench comparison of the predictive performance of convolutional neural networks and physical-numerical models for weather prediction.
    Wav2vec-C: A Self-supervised Model for Speech Representation Learning. (arXiv:2103.08393v2 [eess.AS] UPDATED)
    (2 min) Wav2vec-C introduces a novel representation learning technique combining elements from wav2vec 2.0 and VQ-VAE. Our model learns to reproduce quantized representations from partially masked speech encoding using a contrastive loss in a way similar to Wav2vec 2.0. However, the quantization process is regularized by an additional consistency network that learns to reconstruct the input features to the wav2vec 2.0 network from the quantized representations in a way similar to a VQ-VAE model. The proposed self-supervised model is trained on 10k hours of unlabeled data and subsequently used as the speech encoder in a RNN-T ASR model and fine-tuned with 1k hours of labeled data. This work is one of only a few studies of self-supervised learning on speech tasks with a large volume of real far-field labeled data. The Wav2vec-C encoded representations achieves, on average, twice the error reduction over baseline and a higher codebook utilization in comparison to wav2vec 2.0
    Multiclass Disease Predictions Based on Integrated Clinical and Genomics Datasets. (arXiv:2006.07879v1 [q-bio.GN] CROSS LISTED)
    (2 min) Clinical predictions using clinical data by computational methods are common in bioinformatics. However, clinical predictions using information from genomics datasets as well is not a frequently observed phenomenon in research. Precision medicine research requires information from all available datasets to provide intelligent clinical solutions. In this paper, we have attempted to create a prediction model which uses information from both clinical and genomics datasets. We have demonstrated multiclass disease predictions based on combined clinical and genomics datasets using machine learning methods. We have created an integrated dataset, using a clinical (ClinVar) and a genomics (gene expression) dataset, and trained it using instance-based learner to predict clinical diseases. We have used an innovative but simple way for multiclass classification, where the number of output classes is as high as 75. We have used Principal Component Analysis for feature selection. The classifier predicted diseases with 73\% accuracy on the integrated dataset. The results were consistent and competent when compared with other classification models. The results show that genomics information can be reliably included in datasets for clinical predictions and it can prove to be valuable in clinical diagnostics and precision medicine.
    Universal Adversarial Perturbations for CNN Classifiers in EEG-Based BCIs. (arXiv:1912.01171v5 [cs.LG] UPDATED)
    (2 min) Multiple convolutional neural network (CNN) classifiers have been proposed for electroencephalogram (EEG) based brain-computer interfaces (BCIs). However, CNN models have been found vulnerable to universal adversarial perturbations (UAPs), which are small and example-independent, yet powerful enough to degrade the performance of a CNN model, when added to a benign example. This paper proposes a novel total loss minimization (TLM) approach to generate UAPs for EEG-based BCIs. Experimental results demonstrated the effectiveness of TLM on three popular CNN classifiers for both target and non-target attacks. We also verified the transferability of UAPs in EEG-based BCI systems. To our knowledge, this is the first study on UAPs of CNN classifiers in EEG-based BCIs. UAPs are easy to construct, and can attack BCIs in real-time, exposing a potentially critical security concern of BCIs.
    Artifact Detection and Correction in EEG data: A Review. (arXiv:2106.13081v1 [eess.SP])
    (2 min) Electroencephalography (EEG) has countless applications across many of fields. However, EEG applications are limited by low signal-to-noise ratios. Multiple types of artifacts contribute to the noisiness of EEG, and many techniques have been proposed to detect and correct these artifacts. These techniques range from simply detecting and rejecting artifact ridden segments, to extracting the noise component from the EEG signal. In this paper we review a variety of recent and classical techniques for EEG data artifact detection and correction with a focus on the last half-decade. We compare the strengths and weaknesses of the approaches and conclude with proposed future directions for the field.
    SALT: Sea lice Adaptive Lattice Tracking -- An Unsupervised Approach to Generate an Improved Ocean Model. (arXiv:2106.13202v1 [q-bio.QM])
    (2 min) Warming oceans due to climate change are leading to increased numbers of ectoparasitic copepods, also known as sea lice, which can cause significant ecological loss to wild salmon populations and major economic loss to aquaculture sites. The main transport mechanism driving the spread of sea lice populations are near-surface ocean currents. Present strategies to estimate the distribution of sea lice larvae are computationally complex and limit full-scale analysis. Motivated to address this challenge, we propose SALT: Sea lice Adaptive Lattice Tracking approach for efficient estimation of sea lice dispersion and distribution in space and time. Specifically, an adaptive spatial mesh is generated by merging nodes in the lattice graph of the Ocean Model based on local ocean properties, thus enabling highly efficient graph representation. SALT demonstrates improved efficiency while maintaining consistent results with the standard method, using near-surface current data for Hardangerfjord, Norway. The proposed SALT technique shows promise for enhancing proactive aquaculture management through predictive modelling of sea lice infestation pressure maps in a changing climate.
    Lettuce: PyTorch-based Lattice Boltzmann Framework. (arXiv:2106.12929v1 [physics.comp-ph])
    (2 min) The lattice Boltzmann method (LBM) is an efficient simulation technique for computational fluid mechanics and beyond. It is based on a simple stream-and-collide algorithm on Cartesian grids, which is easily compatible with modern machine learning architectures. While it is becoming increasingly clear that deep learning can provide a decisive stimulus for classical simulation techniques, recent studies have not addressed possible connections between machine learning and LBM. Here, we introduce Lettuce, a PyTorch-based LBM code with a threefold aim. Lettuce enables GPU accelerated calculations with minimal source code, facilitates rapid prototyping of LBM models, and enables integrating LBM simulations with PyTorch's deep learning and automatic differentiation facility. As a proof of concept for combining machine learning with the LBM, a neural collision model is developed, trained on a doubly periodic shear layer and then transferred to a different flow, a decaying turbulence. We also exemplify the added benefit of PyTorch's automatic differentiation framework in flow control and optimization. To this end, the spectrum of a forced isotropic turbulence is maintained without further constraining the velocity field. The source code is freely available from https://github.com/lettucecfd/lettuce.
    Tensor networks for unsupervised machine learning. (arXiv:2106.12974v1 [cond-mat.stat-mech])
    (2 min) Modeling the joint distribution of high-dimensional data is a central task in unsupervised machine learning. In recent years, many interests have been attracted to developing learning models based on tensor networks, which have advantages of theoretical understandings of the expressive power using entanglement properties, and as a bridge connecting the classical computation and the quantum computation. Despite the great potential, however, existing tensor-network-based unsupervised models only work as a proof of principle, as their performances are much worse than the standard models such as the restricted Boltzmann machines and neural networks. In this work, we present the Autoregressive Matrix Product States (AMPS), a tensor-network-based model combining the matrix product states from quantum many-body physics and the autoregressive models from machine learning. The model enjoys exact calculation of normalized probability and unbiased sampling, as well as a clear theoretical understanding of expressive power. We demonstrate the performance of our model using two applications, the generative modeling on synthetic and real-world data, and the reinforcement learning in statistical physics. Using extensive numerical experiments, we show that the proposed model significantly outperforms the existing tensor-network-based models and the restricted Boltzmann machines, and is competitive with the state-of-the-art neural network models.
    Optimizing Black-box Metrics with Iterative Example Weighting. (arXiv:2102.09492v2 [cs.LG] UPDATED)
    (2 min) We consider learning to optimize a classification metric defined by a black-box function of the confusion matrix. Such black-box learning settings are ubiquitous, for example, when the learner only has query access to the metric of interest, or in noisy-label and domain adaptation applications where the learner must evaluate the metric via performance evaluation using a small validation sample. Our approach is to adaptively learn example weights on the training dataset such that the resulting weighted objective best approximates the metric on the validation sample. We show how to model and estimate the example weights and use them to iteratively post-shift a pre-trained class probability estimator to construct a classifier. We also analyze the resulting procedure's statistical properties. Experiments on various label noise, domain shift, and fair classification setups confirm that our proposal compares favorably to the state-of-the-art baselines for each application.
    Personalized Federated Learning with Clustered Generalization. (arXiv:2106.13044v1 [cs.LG])
    (2 min) We study the recent emerging personalized federated learning (PFL) that aims at dealing with the challenging problem of Non-I.I.D. data in the federated learning (FL) setting. The key difference between PFL and conventional FL lies in the training target, of which the personalized models in PFL usually pursue a trade-off between personalization (i.e., usually from local models) and generalization (i.e., usually from the global model) on trained models. Conventional FL methods can hardly meet this target because of their both well-developed global and local models. The prevalent PFL approaches usually maintain a global model to guide the training process of local models and transfer a proper degree of generalization to them. However, the sole global model can only provide one direction of generalization and may even transfer negative effects to some local models when rich statistical diversity exists across multiple local datasets. Based on our observation, most real or synthetic data distributions usually tend to be clustered to some degree, of which we argue different directions of generalization can facilitate the PFL. In this paper, we propose a novel concept called clustered generalization to handle the challenge of statistical heterogeneity in FL. Specifically, we maintain multiple global (generalized) models in the server to associate with the corresponding amount of local model clusters in clients, and further formulate the PFL as a bi-level optimization problem that can be solved efficiently and robustly. We also conduct detailed theoretical analysis and provide the convergence guarantee for the smooth non-convex objectives. Experimental results on both synthetic and real datasets show that our approach surpasses the state-of-the-art by a significant margin.
    UXLA: A Robust Unsupervised Data Augmentation Framework for {Zero-Resource} Cross-Lingual NLP. (arXiv:2004.13240v3 [cs.CL] UPDATED)
    (2 min) Transfer learning has yielded state-of-the-art (SoTA) results in many supervised NLP tasks. However, annotated data for every target task in every target language is rare, especially for low-resource languages. We propose UXLA, a novel unsupervised data augmentation framework for zero-resource transfer learning scenarios. In particular, UXLA aims to solve cross-lingual adaptation problems from a source language task distribution to an unknown target language task distribution, assuming no training label in the target language. At its core, UXLA performs simultaneous self-training with data augmentation and unsupervised sample selection. To show its effectiveness, we conduct extensive experiments on three diverse zero-resource cross-lingual transfer tasks. UXLA achieves SoTA results in all the tasks, outperforming the baselines by a good margin. With an in-depth framework dissection, we demonstrate the cumulative contributions of different components to its success.
    Empirical Study of Transformers for Source Code. (arXiv:2010.07987v2 [cs.LG] UPDATED)
    (2 min) Initially developed for natural language processing (NLP), Transformers are now widely used for source code processing, due to the format similarity between source code and text. In contrast to natural language, source code is strictly structured, i.e., it follows the syntax of the programming language. Several recent works develop Transformer modifications for capturing syntactic information in source code. The drawback of these works is that they do not compare to each other and consider different tasks. In this work, we conduct a thorough empirical study of the capabilities of Transformers to utilize syntactic information in different tasks. We consider three tasks (code completion, function naming and bug fixing) and re-implement different syntax-capturing modifications in a unified framework. We show that Transformers are able to make meaningful predictions based purely on syntactic information and underline the best practices of taking the syntactic information into account for improving the performance of the model.
    Rethinking Graph Neural Architecture Search from Message-passing. (arXiv:2103.14282v4 [cs.LG] UPDATED)
    (2 min) Graph neural networks (GNNs) emerged recently as a standard toolkit for learning from data on graphs. Current GNN designing works depend on immense human expertise to explore different message-passing mechanisms, and require manual enumeration to determine the proper message-passing depth. Inspired by the strong searching capability of neural architecture search (NAS) in CNN, this paper proposes Graph Neural Architecture Search (GNAS) with novel-designed search space. The GNAS can automatically learn better architecture with the optimal depth of message passing on the graph. Specifically, we design Graph Neural Architecture Paradigm (GAP) with tree-topology computation procedure and two types of fine-grained atomic operations (feature filtering and neighbor aggregation) from message-passing mechanism to construct powerful graph network search space. Feature filtering performs adaptive feature selection, and neighbor aggregation captures structural information and calculates neighbors' statistics. Experiments show that our GNAS can search for better GNNs with multiple message-passing mechanisms and optimal message-passing depth. The searched network achieves remarkable improvement over state-of-the-art manual designed and search-based GNNs on five large-scale datasets at three classical graph tasks. Codes can be found at https://github.com/phython96/GNAS-MP.
    Safe Learning and Optimization Techniques: Towards a Survey of the State of the Art. (arXiv:2101.09505v3 [cs.LG] UPDATED)
    (2 min) Safe learning and optimization deals with learning and optimization problems that avoid, as much as possible, the evaluation of non-safe input points, which are solutions, policies, or strategies that cause an irrecoverable loss (e.g., breakage of a machine or equipment, or life threat). Although a comprehensive survey of safe reinforcement learning algorithms was published in 2015, a number of new algorithms have been proposed thereafter, and related works in active learning and in optimization were not considered. This paper reviews those algorithms from a number of domains including reinforcement learning, Gaussian process regression and classification, evolutionary algorithms, and active learning. We provide the fundamental concepts on which the reviewed algorithms are based and a characterization of the individual algorithms. We conclude by explaining how the algorithms are connected and suggestions for future research.
    Improved Regret Bounds for Tracking Experts with Memory. (arXiv:2106.13021v1 [cs.LG])
    (2 min) We address the problem of sequential prediction with expert advice in a non-stationary environment with long-term memory guarantees in the sense of Bousquet and Warmuth [4]. We give a linear-time algorithm that improves on the best known regret bounds [26]. This algorithm incorporates a relative entropy projection step. This projection is advantageous over previous weight-sharing approaches in that weight updates may come with implicit costs as in for example portfolio optimization. We give an algorithm to compute this projection step in linear time, which may be of independent interest.
    Instance-based Counterfactual Explanations for Time Series Classification. (arXiv:2009.13211v2 [cs.LG] UPDATED)
    (2 min) In recent years, there has been a rapidly expanding focus on explaining the predictions made by black-box AI systems that handle image and tabular data. However, considerably less attention has been paid to explaining the predictions of opaque AI systems handling time series data. In this paper, we advance a novel model-agnostic, case-based technique -- Native Guide -- that generates counterfactual explanations for time series classifiers. Given a query time series, $T_{q}$, for which a black-box classification system predicts class, $c$, a counterfactual time series explanation shows how $T_{q}$ could change, such that the system predicts an alternative class, $c'$. The proposed instance-based technique adapts existing counterfactual instances in the case-base by highlighting and modifying discriminative areas of the time series that underlie the classification. Quantitative and qualitative results from two comparative experiments indicate that Native Guide generates plausible, proximal, sparse and diverse explanations that are better than those produced by key benchmark counterfactual methods.
    Isotonic regression with unknown permutations: Statistics, computation, and adaptation. (arXiv:2009.02609v2 [math.ST] UPDATED)
    (2 min) Motivated by models for multiway comparison data, we consider the problem of estimating a coordinate-wise isotonic function on the domain $[0, 1]^d$ from noisy observations collected on a uniform lattice, but where the design points have been permuted along each dimension. While the univariate and bivariate versions of this problem have received significant attention, our focus is on the multivariate case $d \geq 3$. We study both the minimax risk of estimation (in empirical $L_2$ loss) and the fundamental limits of adaptation (quantified by the adaptivity index) to a family of piecewise constant functions. We provide a computationally efficient Mirsky partition estimator that is minimax optimal while also achieving the smallest adaptivity index possible for polynomial time procedures. Thus, from a worst-case perspective and in sharp contrast to the bivariate case, the latent permutations in the model do not introduce significant computational difficulties over and above vanilla isotonic regression. On the other hand, the fundamental limits of adaptation are significantly different with and without unknown permutations: Assuming a hardness conjecture from average-case complexity theory, a statistical-computational gap manifests in the former case. In a complementary direction, we show that natural modifications of existing estimators fail to satisfy at least one of the desiderata of optimal worst-case statistical performance, computational efficiency, and fast adaptation. Along the way to showing our results, we improve adaptation results in the special case $d = 2$ and establish some properties of estimators for vanilla isotonic regression, both of which may be of independent interest.
    Rank $2r$ iterative least squares: efficient recovery of ill-conditioned low rank matrices from few entries. (arXiv:2002.01849v2 [math.OC] CROSS LISTED)
    (2 min) We present a new, simple and computationally efficient iterative method for low rank matrix completion. Our method is inspired by the class of factorization-type iterative algorithms, but substantially differs from them in the way the problem is cast. Precisely, given a target rank $r$, instead of optimizing on the manifold of rank $r$ matrices, we allow our interim estimated matrix to have a specific over-parametrized rank $2r$ structure. Our algorithm, denoted R2RILS for rank $2r$ iterative least squares, has low memory requirements, and at each iteration it solves a computationally cheap sparse least-squares problem. We motivate our algorithm by its theoretical analysis for the simplified case of a rank-1 matrix. Empirically, R2RILS is able to recover ill conditioned low rank matrices from very few observations -- near the information limit, and it is stable to additive noise.
    Factors affecting the COVID-19 risk in the US counties: an innovative approach by combining unsupervised and supervised learning. (arXiv:2106.12766v1 [cs.LG])
    (2 min) The COVID-19 disease spreads swiftly, and nearly three months after the first positive case was confirmed in China, Coronavirus started to spread all over the United States. Some states and counties reported high number of positive cases and deaths, while some reported lower COVID-19 related cases and mortality. In this paper, the factors that could affect the risk of COVID-19 infection and mortality were analyzed in county level. An innovative method by using K-means clustering and several classification models is utilized to determine the most critical factors. Results showed that mean temperature, percent of people below poverty, percent of adults with obesity, air pressure, population density, wind speed, longitude, and percent of uninsured people were the most significant attributes
    Information Bottleneck: Exact Analysis of (Quantized) Neural Networks. (arXiv:2106.12912v1 [cs.LG])
    (2 min) The information bottleneck (IB) principle has been suggested as a way to analyze deep neural networks. The learning dynamics are studied by inspecting the mutual information (MI) between the hidden layers and the input and output. Notably, separate fitting and compression phases during training have been reported. This led to some controversy including claims that the observations are not reproducible and strongly dependent on the type of activation function used as well as on the way the MI is estimated. Our study confirms that different ways of binning when computing the MI lead to qualitatively different results, either supporting or refusing IB conjectures. To resolve the controversy, we study the IB principle in settings where MI is non-trivial and can be computed exactly. We monitor the dynamics of quantized neural networks, that is, we discretize the whole deep learning system so that no approximation is required when computing the MI. This allows us to quantify the information flow without measurement errors. In this setting, we observed a fitting phase for all layers and a compression phase for the output layer in all experiments; the compression in the hidden layers was dependent on the type of activation function. Our study shows that the initial IB results were not artifacts of binning when computing the MI. However, the critical claim that the compression phase may not be observed for some networks also holds true.
    Learning Language and Multimodal Privacy-Preserving Markers of Mood from Mobile Data. (arXiv:2106.13213v1 [cs.LG])
    (2 min) Mental health conditions remain underdiagnosed even in countries with common access to advanced medical care. The ability to accurately and efficiently predict mood from easily collectible data has several important implications for the early detection, intervention, and treatment of mental health disorders. One promising data source to help monitor human behavior is daily smartphone usage. However, care must be taken to summarize behaviors without identifying the user through personal (e.g., personally identifiable information) or protected (e.g., race, gender) attributes. In this paper, we study behavioral markers of daily mood using a recent dataset of mobile behaviors from adolescent populations at high risk of suicidal behaviors. Using computational models, we find that language and multimodal representations of mobile typed text (spanning typed characters, words, keystroke timings, and app usage) are predictive of daily mood. However, we find that models trained to predict mood often also capture private user identities in their intermediate representations. To tackle this problem, we evaluate approaches that obfuscate user identity while remaining predictive. By combining multimodal representations with privacy-preserving learning, we are able to push forward the performance-privacy frontier.
    Optimal Uniform OPE and Model-based Offline Reinforcement Learning in Time-Homogeneous, Reward-Free and Task-Agnostic Settings. (arXiv:2105.06029v3 [cs.LG] UPDATED)
    (2 min) This work studies the statistical limits of uniform convergence for offline policy evaluation (OPE) problems with model-based methods (for episodic MDP) and provides a unified framework towards optimal learning for several well-motivated offline tasks. Uniform OPE $\sup_\Pi|Q^\pi-\hat{Q}^\pi|<\epsilon$ is a stronger measure than the point-wise OPE and ensures offline learning when $\Pi$ contains all policies (the global class). In this paper, we establish an $\Omega(H^2 S/d_m\epsilon^2)$ lower bound (over model-based family) for the global uniform OPE and our main result establishes an upper bound of $\tilde{O}(H^2/d_m\epsilon^2)$ for the \emph{local} uniform convergence that applies to all \emph{near-empirically optimal} policies for the MDPs with \emph{stationary} transition. Here $d_m$ is the minimal marginal state-action probability. Critically, the highlight in achieving the optimal rate $\tilde{O}(H^2/d_m\epsilon^2)$ is our design of \emph{singleton absorbing MDP}, which is a new sharp analysis tool that works with the model-based approach. We generalize such a model-based framework to the new settings: offline task-agnostic and the offline reward-free with optimal complexity $\tilde{O}(H^2\log(K)/d_m\epsilon^2)$ ($K$ is the number of tasks) and $\tilde{O}(H^2S/d_m\epsilon^2)$ respectively. These results provide a unified solution for simultaneously solving different offline RL problems.
    Meta-learning for Multi-variable Non-convex Optimization Problems: Iterating Non-optimums Makes Optimum Possible. (arXiv:2009.04899v3 [cs.LG] UPDATED)
    (2 min) In this paper, we aim to address the problem of solving a non-convex optimization problem over an intersection of multiple variable sets. This kind of problems is typically solved by using an alternating minimization (AM) strategy which splits the overall problem into a set of sub-problems corresponding to each variable, and then iteratively performs minimization over each sub-problem using a fixed updating rule. However, due to the intrinsic non-convexity of the overall problem, the optimization can usually be trapped into bad local minimum even when each sub-problem can be globally optimized at each iteration. To tackle this problem, we propose a meta-learning based Global Scope Optimization (GSO) method. It adaptively generates optimizers for sub-problems via meta-learners and constantly updates these meta-learners with respect to the global loss information of the overall problem. Therefore, the sub-problems are optimized with the objective of minimizing the global loss specifically. We evaluate the proposed model on a number of simulations, including solving bi-linear inverse problems: matrix completion, and non-linear problems: Gaussian mixture models. The experimental results show that our proposed approach outperforms AM-based methods in standard settings, and is able to achieve effective optimization in some challenging cases while other methods would typically fail.
    Improving Playtesting Coverage via Curiosity Driven Reinforcement Learning Agents. (arXiv:2103.13798v2 [cs.LG] UPDATED)
    (2 min) As modern games continue growing both in size and complexity, it has become more challenging to ensure that all the relevant content is tested and that any potential issue is properly identified and fixed. Attempting to maximize testing coverage using only human participants, however, results in a tedious and hard to orchestrate process which normally slows down the development cycle. Complementing playtesting via autonomous agents has shown great promise accelerating and simplifying this process. This paper addresses the problem of automatically exploring and testing a given scenario using reinforcement learning agents trained to maximize game state coverage. Each of these agents is rewarded based on the novelty of its actions, thus encouraging a curious and exploratory behaviour on a complex 3D scenario where previously proposed exploration techniques perform poorly. The curious agents are able to learn the complex navigation mechanics required to reach the different areas around the map, thus providing the necessary data to identify potential issues. Moreover, the paper also explores different visualization strategies and evaluates how to make better use of the collected data to drive design decisions and to recognize possible problems and oversights.
    Planning with Exploration: Addressing Dynamics Bottleneck in Model-based Reinforcement Learning. (arXiv:2010.12914v3 [cs.LG] UPDATED)
    (2 min) Model-based reinforcement learning (MBRL) is believed to have higher sample efficiency compared with model-free reinforcement learning (MFRL). However, MBRL is plagued by dynamics bottleneck dilemma. Dynamics bottleneck dilemma is the phenomenon that the performance of the algorithm falls into the local optimum instead of increasing when the interaction step with the environment increases, which means more data can not bring better performance. In this paper, we find that the trajectory reward estimation error is the main reason that causes dynamics bottleneck dilemma through theoretical analysis. We give an upper bound of the trajectory reward estimation error and point out that increasing the agent's exploration ability is the key to reduce trajectory reward estimation error, thereby alleviating dynamics bottleneck dilemma. Motivated by this, a model-based control method combined with exploration named MOdel-based Progressive Entropy-based Exploration (MOPE2) is proposed. We conduct experiments on several complex continuous control benchmark tasks. The results verify that MOPE2 can effectively alleviate dynamics bottleneck dilemma and have higher sample efficiency than previous MBRL and MFRL algorithms.
    FF-NSL: Feed-Forward Neural-Symbolic Learner. (arXiv:2106.13103v1 [cs.LG])
    (2 min) Inductive Logic Programming (ILP) aims to learn generalised, interpretable hypotheses in a data-efficient manner. However, current ILP systems require training examples to be specified in a structured logical form. This paper introduces a neural-symbolic learning framework, called Feed-Forward Neural-Symbolic Learner (FF-NSL), that integrates state-of-the-art ILP systems based on the Answer Set semantics, with neural networks, in order to learn interpretable hypotheses from labelled unstructured data. FF-NSL uses a pre-trained neural network to extract symbolic facts from unstructured data and an ILP system to learn a hypothesis that performs a downstream classification task. In order to evaluate the applicability of our approach to real-world applications, the framework is evaluated on tasks where distributional shifts are introduced to unstructured input data, for which pre-trained neural networks are likely to predict incorrectly and with high confidence. Experimental results show that FF-NSL outperforms baseline approaches such as a random forest and deep neural networks by learning more accurate and interpretable hypotheses with fewer examples.
    Language for Description of Worlds. (arXiv:2010.16243v3 [cs.AI] UPDATED)
    (2 min) We will reduce the task of creating AI to the task of finding an appropriate language for description of the world. This will not be a programing language because programing languages describe only computable functions, while our language will describe a somewhat broader class of functions. Another specificity of this language will be that the description will consist of separate modules. This will enable us look for the description of the world automatically such that we discover it module after module. Our approach to the creation of this new language will be to start with a particular world and write the description of that particular world. The point is that the language which can describe this particular world will be appropriate for describing any world.
    Graceful Degradation and Related Fields. (arXiv:2106.11119v2 [cs.LG] UPDATED)
    (2 min) When machine learning models encounter data which is out of the distribution on which they were trained they have a tendency to behave poorly, most prominently over-confidence in erroneous predictions. Such behaviours will have disastrous effects on real-world machine learning systems. In this field graceful degradation refers to the optimisation of model performance as it encounters this out-of-distribution data. This work presents a definition and discussion of graceful degradation and where it can be applied in deployed visual systems. Following this a survey of relevant areas is undertaken, novelly splitting the graceful degradation problem into active and passive approaches. In passive approaches, graceful degradation is handled and achieved by the model in a self-contained manner, in active approaches the model is updated upon encountering epistemic uncertainties. This work communicates the importance of the problem and aims to prompt the development of machine learning strategies that are aware of graceful degradation.
    Variational Quantum Singular Value Decomposition. (arXiv:2006.02336v3 [quant-ph] UPDATED)
    (2 min) Singular value decomposition is central to many problems in engineering and scientific fields. Several quantum algorithms have been proposed to determine the singular values and their associated singular vectors of a given matrix. Although these algorithms are promising, the required quantum subroutines and resources are too costly on near-term quantum devices. In this work, we propose a variational quantum algorithm for singular value decomposition (VQSVD). By exploiting the variational principles for singular values and the Ky Fan Theorem, we design a novel loss function such that two quantum neural networks (or parameterized quantum circuits) could be trained to learn the singular vectors and output the corresponding singular values. Furthermore, we conduct numerical simulations of VQSVD for random matrices as well as its applications in image compression of handwritten digits. Finally, we discuss the applications of our algorithm in recommendation systems and polar decomposition. Our work explores new avenues for quantum information processing beyond the conventional protocols that only works for Hermitian data, and reveals the capability of matrix decomposition on near-term quantum devices.
    Policy Gradient Methods for the Noisy Linear Quadratic Regulator over a Finite Horizon. (arXiv:2011.10300v2 [cs.LG] UPDATED)
    (2 min) We explore reinforcement learning methods for finding the optimal policy in the linear quadratic regulator (LQR) problem. In particular, we consider the convergence of policy gradient methods in the setting of known and unknown parameters. We are able to produce a global linear convergence guarantee for this approach in the setting of finite time horizon and stochastic state dynamics under weak assumptions. The convergence of a projected policy gradient method is also established in order to handle problems with constraints. We illustrate the performance of the algorithm with two examples. The first example is the optimal liquidation of a holding in an asset. We show results for the case where we assume a model for the underlying dynamics and where we apply the method to the data directly. The empirical evidence suggests that the policy gradient method can learn the global optimal solution for a larger class of stochastic systems containing the LQR framework and that it is more robust with respect to model mis-specification when compared to a model-based approach. The second example is an LQR system in a higher dimensional setting with synthetic data.
    Privacy Threats Analysis to Secure Federated Learning. (arXiv:2106.13076v1 [cs.LG])
    (2 min) Federated learning is emerging as a machine learning technique that trains a model across multiple decentralized parties. It is renowned for preserving privacy as the data never leaves the computational devices, and recent approaches further enhance its privacy by hiding messages transferred in encryption. However, we found that despite the efforts, federated learning remains privacy-threatening, due to its interactive nature across different parties. In this paper, we analyze the privacy threats in industrial-level federated learning frameworks with secure computation, and reveal such threats widely exist in typical machine learning models such as linear regression, logistic regression and decision tree. For the linear and logistic regression, we show through theoretical analysis that it is possible for the attacker to invert the entire private input of the victim, given very few information. For the decision tree model, we launch an attack to infer the range of victim's private inputs. All attacks are evaluated on popular federated learning frameworks and real-world datasets.
    Variational Autoencoder-Based Vehicle Trajectory Prediction with an Interpretable Latent Space. (arXiv:2103.13726v2 [cs.LG] UPDATED)
    (2 min) This paper introduces the Descriptive Variational Autoencoder (DVAE), an unsupervised and end-to-end trainable neural network for predicting vehicle trajectories that provides partial interpretability. The novel approach is based on the architecture and objective of common variational autoencoders. By introducing expert knowledge within the decoder part of the autoencoder, the encoder learns to extract latent parameters that provide a graspable meaning in human terms. Such an interpretable latent space enables the validation by expert defined rule sets. The evaluation of the DVAE is performed using the publicly available highD dataset for highway traffic scenarios. In comparison to a conventional variational autoencoder with equivalent complexity, the proposed model provides a similar prediction accuracy but with the great advantage of having an interpretable latent space. For crucial decision making and assessing trustworthiness of a prediction this property is highly desirable.
    Autoencoding Under Normalization Constraints. (arXiv:2105.05735v2 [cs.LG] UPDATED)
    (2 min) Likelihood is a standard estimate for outlier detection. The specific role of the normalization constraint is to ensure that the out-of-distribution (OOD) regime has a small likelihood when samples are learned using maximum likelihood. Because autoencoders do not possess such a process of normalization, they often fail to recognize outliers even when they are obviously OOD. We propose the Normalized Autoencoder (NAE), a normalized probabilistic model constructed from an autoencoder. The probability density of NAE is defined using the reconstruction error of an autoencoder, which is differently defined in the conventional energy-based model. In our model, normalization is enforced by suppressing the reconstruction of negative samples, significantly improving the outlier detection performance. Our experimental results confirm the efficacy of NAE, both in detecting outliers and in generating in-distribution samples.
    Rate Distortion Characteristic Modeling for Neural Image Compression. (arXiv:2106.12954v1 [eess.IV])
    (2 min) End-to-end optimization capability offers neural image compression (NIC) superior lossy compression performance. However, distinct models are required to be trained to reach different points in the rate-distortion (R-D) space. In this paper, we consider the problem of R-D characteristic analysis and modeling for NIC. We make efforts to formulate the essential mathematical functions to describe the R-D behavior of NIC using deep network and statistical modeling. Thus continuous bit-rate points could be elegantly realized by leveraging such model via a single trained network. In this regard, we propose a plugin-in module to learn the relationship between the target bit-rate and the binary representation for the latent variable of auto-encoder. Furthermore, we model the rate and distortion characteristic of NIC as a function of the coding parameter $\lambda$ respectively. Our experiments show our proposed method is easy to adopt and obtains competitive coding performance with fixed-rate coding approaches, which would benefit the practical deployment of NIC. In addition, the proposed model could be applied to NIC rate control with limited bit-rate error using a single network.
    Self-Supervised Monocular Depth Estimation of Untextured Indoor Rotated Scenes. (arXiv:2106.12958v1 [cs.CV])
    (2 min) Self-supervised deep learning methods have leveraged stereo images for training monocular depth estimation. Although these methods show strong results on outdoor datasets such as KITTI, they do not match performance of supervised methods on indoor environments with camera rotation. Indoor, rotated scenes are common for less constrained applications and pose problems for two reasons: abundance of low texture regions and increased complexity of depth cues for images under rotation. In an effort to extend self-supervised learning to more generalised environments we propose two additions. First, we propose a novel Filled Disparity Loss term that corrects for ambiguity of image reconstruction error loss in textureless regions. Specifically, we interpolate disparity in untextured regions, using the estimated disparity from surrounding textured areas, and use L1 loss to correct the original estimation. Our experiments show that depth estimation is substantially improved on low-texture scenes, without any loss on textured scenes, when compared to Monodepth by Godard et al. Secondly, we show that training with an application's representative rotations, in both pitch and roll, is sufficient to significantly improve performance over the entire range of expected rotation. We demonstrate that depth estimation is successfully generalised as performance is not lost when evaluated on test sets with no camera rotation. Together these developments enable a broader use of self-supervised learning of monocular depth estimation for complex environments.
    Understanding Uncertainty in Bayesian Deep Learning. (arXiv:2106.13055v1 [stat.ML])
    (2 min) Neural Linear Models (NLM) are deep Bayesian models that produce predictive uncertainty by learning features from the data and then performing Bayesian linear regression over these features. Despite their popularity, few works have focused on formally evaluating the predictive uncertainties of these models. Furthermore, existing works point out the difficulties of encoding domain knowledge in models like NLMs, making them unsuitable for applications where interpretability is required. In this work, we show that traditional training procedures for NLMs can drastically underestimate uncertainty in data-scarce regions. We identify the underlying reasons for this behavior and propose a novel training method that can both capture useful predictive uncertainties as well as allow for incorporation of domain knowledge.
    DCoM: A Deep Column Mapper for Semantic Data Type Detection. (arXiv:2106.12871v1 [cs.LG])
    (2 min) Detection of semantic data types is a very crucial task in data science for automated data cleaning, schema matching, data discovery, semantic data type normalization and sensitive data identification. Existing methods include regular expression-based or dictionary lookup-based methods that are not robust to dirty as well unseen data and are limited to a very less number of semantic data types to predict. Existing Machine Learning methods extract large number of engineered features from data and build logistic regression, random forest or feedforward neural network for this purpose. In this paper, we introduce DCoM, a collection of multi-input NLP-based deep neural networks to detect semantic data types where instead of extracting large number of features from the data, we feed the raw values of columns (or instances) to the model as texts. We train DCoM on 686,765 data columns extracted from VizNet corpus with 78 different semantic data types. DCoM outperforms other contemporary results with a quite significant margin on the same dataset.
    Fund2Vec: Mutual Funds Similarity using Graph Learning. (arXiv:2106.12987v1 [q-fin.ST])
    (2 min) Identifying similar mutual funds with respect to the underlying portfolios has found many applications in financial services ranging from fund recommender systems, competitors analysis, portfolio analytics, marketing and sales, etc. The traditional methods are either qualitative, and hence prone to biases and often not reproducible, or, are known not to capture all the nuances (non-linearities) among the portfolios from the raw data. We propose a radically new approach to identify similar funds based on the weighted bipartite network representation of funds and their underlying assets data using a sophisticated machine learning method called Node2Vec which learns an embedded low-dimensional representation of the network. We call the embedding \emph{Fund2Vec}. Ours is the first ever study of the weighted bipartite network representation of the funds-assets network in its original form that identifies structural similarity among portfolios as opposed to merely portfolio overlaps.
    Charformer: Fast Character Transformers via Gradient-based Subword Tokenization. (arXiv:2106.12672v1 [cs.CL])
    (2 min) State-of-the-art models in natural language processing rely on separate rigid subword tokenization algorithms, which limit their generalization ability and adaptation to new settings. In this paper, we propose a new model inductive bias that learns a subword tokenization end-to-end as part of the model. To this end, we introduce a soft gradient-based subword tokenization module (GBST) that automatically learns latent subword representations from characters in a data-driven fashion. Concretely, GBST enumerates candidate subword blocks and learns to score them in a position-wise fashion using a block scoring network. We additionally introduce Charformer, a deep Transformer model that integrates GBST and operates on the byte level. Via extensive experiments on English GLUE, multilingual, and noisy text datasets, we show that Charformer outperforms a series of competitive byte-level baselines while generally performing on par and sometimes outperforming subword-based models. Additionally, Charformer is fast, improving the speed of both vanilla byte-level and subword-level Transformers by 28%-100% while maintaining competitive quality. We believe this work paves the way for highly performant token-free models that are trained completely end-to-end.
    Task-agnostic Continual Learning with Hybrid Probabilistic Models. (arXiv:2106.12772v1 [cs.LG])
    (2 min) Learning new tasks continuously without forgetting on a constantly changing data distribution is essential for real-world problems but extremely challenging for modern deep learning. In this work we propose HCL, a Hybrid generative-discriminative approach to Continual Learning for classification. We model the distribution of each task and each class with a normalizing flow. The flow is used to learn the data distribution, perform classification, identify task changes, and avoid forgetting, all leveraging the invertibility and exact likelihood which are uniquely enabled by the normalizing flow model. We use the generative capabilities of the flow to avoid catastrophic forgetting through generative replay and a novel functional regularization technique. For task identification, we use state-of-the-art anomaly detection techniques based on measuring the typicality of the model's statistics. We demonstrate the strong performance of HCL on a range of continual learning benchmarks such as split-MNIST, split-CIFAR, and SVHN-MNIST.
    A Near-Optimal Algorithm for Debiasing Trained Machine Learning Models. (arXiv:2106.12887v1 [cs.LG])
    (2 min) We present a scalable post-processing algorithm for debiasing trained models, including deep neural networks (DNNs), which we prove to be near-optimal by bounding its excess Bayes risk. We empirically validate its advantages on standard benchmark datasets across both classical algorithms as well as modern DNN architectures and demonstrate that it outperforms previous post-processing methods while performing on par with in-processing. In addition, we show that the proposed algorithm is particularly effective for models trained at scale where post-processing is a natural and practical choice.
    Minimum sharpness: Scale-invariant parameter-robustness of neural networks. (arXiv:2106.12612v1 [cs.LG])
    (2 min) Toward achieving robust and defensive neural networks, the robustness against the weight parameters perturbations, i.e., sharpness, attracts attention in recent years (Sun et al., 2020). However, sharpness is known to remain a critical issue, "scale-sensitivity." In this paper, we propose a novel sharpness measure, Minimum Sharpness. It is known that NNs have a specific scale transformation that constitutes equivalent classes where functional properties are completely identical, and at the same time, their sharpness could change unlimitedly. We define our sharpness through a minimization problem over the equivalent NNs being invariant to the scale transformation. We also develop an efficient and exact technique to make the sharpness tractable, which reduces the heavy computational costs involved with Hessian. In the experiment, we observed that our sharpness has a valid correlation with the generalization of NNs and runs with less computational cost than existing sharpness measures.
    Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech. (arXiv:2106.12896v1 [cs.SD])
    (2 min) Whilst recent neural text-to-speech (TTS) approaches produce high-quality speech, they typically require a large amount of recordings from the target speaker. In previous work, a 3-step method was proposed to generate high-quality TTS while greatly reducing the amount of data required for training. However, we have observed a ceiling effect in the level of naturalness achievable for highly expressive voices when using this approach. In this paper, we present a method for building highly expressive TTS voices with as little as 15 minutes of speech data from the target speaker. Compared to the current state-of-the-art approach, our proposed improvements close the gap to recordings by 23.3% for naturalness of speech and by 16.3% for speaker similarity. Further, we match the naturalness and speaker similarity of a Tacotron2-based full-data (~10 hours) model using only 15 minutes of target speaker data, whereas with 30 minutes or more, we significantly outperform it. The following improvements are proposed: 1) changing from an autoregressive, attention-based TTS model to a non-autoregressive model replacing attention with an external duration model and 2) an additional Conditional Generative Adversarial Network (cGAN) based fine-tuning step.
    Evaluation of Saliency-based Explainability Method. (arXiv:2106.12773v1 [cs.LG])
    (2 min) A particular class of Explainable AI (XAI) methods provide saliency maps to highlight part of the image a Convolutional Neural Network (CNN) model looks at to classify the image as a way to explain its working. These methods provide an intuitive way for users to understand predictions made by CNNs. Other than quantitative computational tests, the vast majority of evidence to highlight that the methods are valuable is anecdotal. Given that humans would be the end-users of such methods, we devise three human subject experiments through which we gauge the effectiveness of these saliency-based explainability methods.
    Dungeon and Platformer Level Blending and Generation using Conditional VAEs. (arXiv:2106.12692v1 [cs.LG])
    (2 min) Variational autoencoders (VAEs) have been used in prior works for generating and blending levels from different games. To add controllability to these models, conditional VAEs (CVAEs) were recently shown capable of generating output that can be modified using labels specifying desired content, albeit working with segments of levels and platformers exclusively. We expand these works by using CVAEs for generating whole platformer and dungeon levels, and blending levels across these genres. We show that CVAEs can reliably control door placement in dungeons and progression direction in platformer levels. Thus, by using appropriate labels, our approach can generate whole dungeons and platformer levels of interconnected rooms and segments respectively as well as levels that blend dungeons and platformers. We demonstrate our approach using The Legend of Zelda, Metroid, Mega Man and Lode Runner.
    All unconstrained strongly convex problems are weakly simplicial. (arXiv:2106.12704v1 [math.OC])
    (2 min) A multi-objective optimization problem is $C^r$ weakly simplicial if there exists a $C^r$ surjection from a simplex onto the Pareto set/front such that the image of each subsimplex is the Pareto set/front of a subproblem, where $0\leq r\leq \infty$. This property is helpful to compute a parametric-surface approximation of the entire Pareto set and Pareto front. It is known that all unconstrained strongly convex $C^r$ problems are $C^{r-1}$ weakly simplicial for $1\leq r \leq \infty$. In this paper, we show that all unconstrained strongly convex problems are $C^0$ weakly simplicial. The usefulness of this theorem is demonstrated in a sparse modeling application: we reformulate the elastic net as a non-differentiable multi-objective strongly convex problem and approximate its Pareto set (the set of all trained models with different hyper-parameters) and Pareto front (the set of performance metrics of the trained models) by using a B\'ezier simplex fitting method, which accelerates hyper-parameter search.
    Multi-objective Asynchronous Successive Halving. (arXiv:2106.12639v1 [stat.ML])
    (2 min) Hyperparameter optimization (HPO) is increasingly used to automatically tune the predictive performance (e.g., accuracy) of machine learning models. However, in a plethora of real-world applications, accuracy is only one of the multiple -- often conflicting -- performance criteria, necessitating the adoption of a multi-objective (MO) perspective. While the literature on MO optimization is rich, few prior studies have focused on HPO. In this paper, we propose algorithms that extend asynchronous successive halving (ASHA) to the MO setting. Considering multiple evaluation metrics, we assess the performance of these methods on three real world tasks: (i) Neural architecture search, (ii) algorithmic fairness and (iii) language model optimization. Our empirical analysis shows that MO ASHA enables to perform MO HPO at scale. Further, we observe that that taking the entire Pareto front into account for candidate selection consistently outperforms multi-fidelity HPO based on MO scalarization in terms of wall-clock time. Our algorithms (to be open-sourced) establish new baselines for future research in the area.
    Extreme Multi-label Learning for Semantic Matching in Product Search. (arXiv:2106.12657v1 [cs.IR])
    (2 min) We consider the problem of semantic matching in product search: given a customer query, retrieve all semantically related products from a huge catalog of size 100 million, or more. Because of large catalog spaces and real-time latency constraints, semantic matching algorithms not only desire high recall but also need to have low latency. Conventional lexical matching approaches (e.g., Okapi-BM25) exploit inverted indices to achieve fast inference time, but fail to capture behavioral signals between queries and products. In contrast, embedding-based models learn semantic representations from customer behavior data, but the performance is often limited by shallow neural encoders due to latency constraints. Semantic product search can be viewed as an eXtreme Multi-label Classification (XMC) problem, where customer queries are input instances and products are output labels. In this paper, we aim to improve semantic product search by using tree-based XMC models where inference time complexity is logarithmic in the number of products. We consider hierarchical linear models with n-gram features for fast real-time inference. Quantitatively, our method maintains a low latency of 1.25 milliseconds per query and achieves a 65% improvement of Recall@100 (60.9% v.s. 36.8%) over a competing embedding-based DSSM model. Our model is robust to weight pruning with varying thresholds, which can flexibly meet different system requirements for online deployments. Qualitatively, our method can retrieve products that are complementary to existing product search system and add diversity to the match set.
    A review of systematic selection of clustering algorithms and their evaluation. (arXiv:2106.12792v1 [cs.LG])
    (2 min) Data analysis plays an indispensable role for value creation in industry. Cluster analysis in this context is able to explore given datasets with little or no prior knowledge and to identify unknown patterns. As (big) data complexity increases in the dimensions volume, variety, and velocity, this becomes even more important. Many tools for cluster analysis have been developed from early on and the variety of different clustering algorithms is huge. As the selection of the right clustering procedure is crucial to the results of the data analysis, users are in need for support on their journey of extracting knowledge from raw data. Thus, the objective of this paper lies in the identification of a systematic selection logic for clustering algorithms and corresponding validation concepts. The goal is to enable potential users to choose an algorithm that fits best to their needs and the properties of their underlying data clustering problem. Moreover, users are supported in selecting the right validation concepts to make sense of the clustering results. Based on a comprehensive literature review, this paper provides assessment criteria for clustering method evaluation and validation concept selection. The criteria are applied to several common algorithms and the selection process of an algorithm is supported by the introduction of pseudocode-based routines that consider the underlying data structure.
    Shallow Representation is Deep: Learning Uncertainty-aware and Worst-case Random Feature Dynamics. (arXiv:2106.13066v1 [cs.LG])
    (2 min) Random features is a powerful universal function approximator that inherits the theoretical rigor of kernel methods and can scale up to modern learning tasks. This paper views uncertain system models as unknown or uncertain smooth functions in universal reproducing kernel Hilbert spaces. By directly approximating the one-step dynamics function using random features with uncertain parameters, which are equivalent to a shallow Bayesian neural network, we then view the whole dynamical system as a multi-layer neural network. Exploiting the structure of Hamiltonian dynamics, we show that finding worst-case dynamics realizations using Pontryagin's minimum principle is equivalent to performing the Frank-Wolfe algorithm on the deep net. Various numerical experiments on dynamics learning showcase the capacity of our modeling methodology.
    Leveraging semantically similar queries for ranking via combining representations. (arXiv:2106.12621v1 [cs.LG])
    (2 min) In modern ranking problems, different and disparate representations of the items to be ranked are often available. It is sensible, then, to try to combine these representations to improve ranking. Indeed, learning to rank via combining representations is both principled and practical for learning a ranking function for a particular query. In extremely data-scarce settings, however, the amount of labeled data available for a particular query can lead to a highly variable and ineffective ranking function. One way to mitigate the effect of the small amount of data is to leverage information from semantically similar queries. Indeed, as we demonstrate in simulation settings and real data examples, when semantically similar queries are available it is possible to gainfully use them when ranking with respect to a particular query. We describe and explore this phenomenon in the context of the bias-variance trade off and apply it to the data-scarce settings of a Bing navigational graph and the Drosophila larva connectome.
    A Deep Learning Approach to Private Data Sharing of Medical Images Using Conditional GANs. (arXiv:2106.13199v1 [cs.LG])
    (2 min) Sharing data from clinical studies can facilitate innovative data-driven research and ultimately lead to better public health. However, sharing biomedical data can put sensitive personal information at risk. This is usually solved by anonymization, which is a slow and expensive process. An alternative to anonymization is sharing a synthetic dataset that bears a behaviour similar to the real data but preserves privacy. As part of the collaboration between Novartis and the Oxford Big Data Institute, we generate a synthetic dataset based on COSENTYX (secukinumab) Ankylosing Spondylitis (AS) clinical study. We apply an Auxiliary Classifier GAN (ac-GAN) to generate synthetic magnetic resonance images (MRIs) of vertebral units (VUs). The images are conditioned on the VU location (cervical, thoracic and lumbar). In this paper, we present a method for generating a synthetic dataset and conduct an in-depth analysis on its properties of along three key metrics: image fidelity, sample diversity and dataset privacy.
    Understanding the Spread of COVID-19 Epidemic: A Spatio-Temporal Point Process View. (arXiv:2106.13097v1 [cs.LG])
    (2 min) Since the first coronavirus case was identified in the U.S. on Jan. 21, more than 1 million people in the U.S. have confirmed cases of COVID-19. This infectious respiratory disease has spread rapidly across more than 3000 counties and 50 states in the U.S. and have exhibited evolutionary clustering and complex triggering patterns. It is essential to understand the complex spacetime intertwined propagation of this disease so that accurate prediction or smart external intervention can be carried out. In this paper, we model the propagation of the COVID-19 as spatio-temporal point processes and propose a generative and intensity-free model to track the spread of the disease. We further adopt a generative adversarial imitation learning framework to learn the model parameters. In comparison with the traditional likelihood-based learning methods, this imitation learning framework does not need to prespecify an intensity function, which alleviates the model-misspecification. Moreover, the adversarial learning procedure bypasses the difficult-to-evaluate integral involved in the likelihood evaluation, which makes the model inference more scalable with the data and variables. We showcase the dynamic learning performance on the COVID-19 confirmed cases in the U.S. and evaluate the social distancing policy based on the learned generative model.
    Software for Dataset-wide XAI: From Local Explanations to Global Insights with Zennit, CoRelAy, and ViRelAy. (arXiv:2106.13200v1 [cs.LG])
    (2 min) Deep Neural Networks (DNNs) are known to be strong predictors, but their prediction strategies can rarely be understood. With recent advances in Explainable Artificial Intelligence, approaches are available to explore the reasoning behind those complex models' predictions. One class of approaches are post-hoc attribution methods, among which Layer-wise Relevance Propagation (LRP) shows high performance. However, the attempt at understanding a DNN's reasoning often stops at the attributions obtained for individual samples in input space, leaving the potential for deeper quantitative analyses untouched. As a manual analysis without the right tools is often unnecessarily labor intensive, we introduce three software packages targeted at scientists to explore model reasoning using attribution approaches and beyond: (1) Zennit - a highly customizable and intuitive attribution framework implementing LRP and related approaches in PyTorch, (2) CoRelAy - a framework to easily and quickly construct quantitative analysis pipelines for dataset-wide analyses of explanations, and (3) ViRelAy - a web-application to interactively explore data, attributions, and analysis results.
    Bayesian Optimization with High-Dimensional Outputs. (arXiv:2106.12997v1 [cs.LG])
    (2 min) Bayesian Optimization is a sample-efficient black-box optimization procedure that is typically applied to problems with a small number of independent objectives. However, in practice we often wish to optimize objectives defined over many correlated outcomes (or ``tasks"). For example, scientists may want to optimize the coverage of a cell tower network across a dense grid of locations. Similarly, engineers may seek to balance the performance of a robot across dozens of different environments via constrained or robust optimization. However, the Gaussian Process (GP) models typically used as probabilistic surrogates for multi-task Bayesian Optimization scale poorly with the number of outcomes, greatly limiting applicability. We devise an efficient technique for exact multi-task GP sampling that combines exploiting Kronecker structure in the covariance matrices with Matheron's identity, allowing us to perform Bayesian Optimization using exact multi-task GP models with tens of thousands of correlated outputs. In doing so, we achieve substantial improvements in sample efficiency compared to existing approaches that only model aggregate functions of the outcomes. We demonstrate how this unlocks a new class of applications for Bayesian Optimization across a range of tasks in science and engineering, including optimizing interference patterns of an optical interferometer with more than 65,000 outputs.
    Symmetric Wasserstein Autoencoders. (arXiv:2106.13024v1 [cs.LG])
    (2 min) Leveraging the framework of Optimal Transport, we introduce a new family of generative autoencoders with a learnable prior, called Symmetric Wasserstein Autoencoders (SWAEs). We propose to symmetrically match the joint distributions of the observed data and the latent representation induced by the encoder and the decoder. The resulting algorithm jointly optimizes the modelling losses in both the data and the latent spaces with the loss in the data space leading to the denoising effect. With the symmetric treatment of the data and the latent representation, the algorithm implicitly preserves the local structure of the data in the latent space. To further improve the quality of the latent representation, we incorporate a reconstruction loss into the objective, which significantly benefits both the generation and reconstruction. We empirically show the superior performance of SWAEs over the state-of-the-art generative autoencoders in terms of classification, reconstruction, and generation.
    Fold2Seq: A Joint Sequence(1D)-Fold(3D) Embedding-based Generative Model for Protein Design. (arXiv:2106.13058v1 [cs.LG])
    (2 min) Designing novel protein sequences for a desired 3D topological fold is a fundamental yet non-trivial task in protein engineering. Challenges exist due to the complex sequence--fold relationship, as well as the difficulties to capture the diversity of the sequences (therefore structures and functions) within a fold. To overcome these challenges, we propose Fold2Seq, a novel transformer-based generative framework for designing protein sequences conditioned on a specific target fold. To model the complex sequence--structure relationship, Fold2Seq jointly learns a sequence embedding using a transformer and a fold embedding from the density of secondary structural elements in 3D voxels. On test sets with single, high-resolution and complete structure inputs for individual folds, our experiments demonstrate improved or comparable performance of Fold2Seq in terms of speed, coverage, and reliability for sequence design, when compared to existing state-of-the-art methods that include data-driven deep generative models and physics-based RosettaDesign. The unique advantages of fold-based Fold2Seq, in comparison to a structure-based deep model and RosettaDesign, become more evident on three additional real-world challenges originating from low-quality, incomplete, or ambiguous input structures. Source code and data are available at https://github.com/IBM/fold2seq.
    Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. (arXiv:2106.13008v1 [cs.LG])
    (2 min) Extending the forecasting time is a critical demand for real applications, such as extreme weather early warning and long-term energy consumption planning. This paper studies the \textit{long-term forecasting} problem of time series. Prior Transformer-based models adopt various self-attention mechanisms to discover the long-range dependencies. However, intricate temporal patterns of the long-term future prohibit the model from finding reliable dependencies. Also, Transformers have to adopt the sparse versions of point-wise self-attentions for long series efficiency, resulting in the information utilization bottleneck. Towards these challenges, we propose Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism. We go beyond the pre-processing convention of series decomposition and renovate it as a basic inner block of deep models. This design empowers Autoformer with progressive decomposition capacities for complex time series. Further, inspired by the stochastic process theory, we design the Auto-Correlation mechanism based on the series periodicity, which conducts the dependencies discovery and representation aggregation at the sub-series level. Auto-Correlation outperforms self-attention in both efficiency and accuracy. In long-term forecasting, Autoformer yields state-of-the-art accuracy, with a 38% relative improvement on six benchmarks, covering five practical applications: energy, traffic, economics, weather and disease.
    Online Verification of Deep Neural Networks under Domain or Weight Shift. (arXiv:2106.12732v1 [cs.LG])
    (2 min) Although neural networks are widely used, it remains challenging to formally verify the safety and robustness of neural networks in real-world applications. Existing methods are designed to verify the network before use, which is limited to relatively simple specifications and fixed networks. These methods are not ready to be applied to real-world problems with complex and/or dynamically changing specifications and networks. To effectively handle dynamically changing specifications and networks, the verification needs to be performed online when these changes take place. However, it is still challenging to run existing verification algorithms online. Our key insight is that we can leverage the temporal dependencies of these changes to accelerate the verification process, e.g., by warm starting new online verification using previous verified results. This paper establishes a novel framework for scalable online verification to solve real-world verification problems with dynamically changing specifications and/or networks, known as domain shift and weight shift respectively. We propose three types of techniques (branch management, perturbation tolerance analysis, and incremental computation) to accelerate the online verification of deep neural networks. Experiment results show that our online verification algorithm is up to two orders of magnitude faster than existing verification algorithms, and thus can scale to real-world applications.
    Neural ODE to model and prognose thermoacoustic instability. (arXiv:2106.12758v1 [physics.flu-dyn])
    (2 min) In reacting flow systems, thermoacoustic instability characterized by high amplitude pressure fluctuations, is driven by a positive coupling between the unsteady heat release rate and the acoustic field of the combustor. When the underlying flow is turbulent, as a control parameter of the system is varied and the system approach thermoacoustic instability, the acoustic pressure oscillations synchronize with heat release rate oscillations. Consequently, during the onset of thermoacoustic instability in turbulent combustors, the system dynamics transition from chaotic oscillations to periodic oscillations via a state of intermittency. Thermoacoustic systems are traditionally modeled by coupling the model for the unsteady heat source and the acoustic subsystem, each estimated independently. The response of the unsteady heat source, the flame, to acoustic fluctuations are characterized by introducing external unsteady forcing. This necessitates a powerful excitation module to obtain the nonlinear response of the flame to acoustic perturbations. Instead of characterizing individual subsystems, we introduce a neural ordinary differential equation (neural ODE) framework to model the thermoacoustic system as a whole. The neural ODE model for the thermoacoustic system uses time series of the heat release rate and the pressure fluctuations, measured simultaneously without introducing any external perturbations, to model their coupled interaction. Further, we use the parameters of neural ODE to define an anomaly measure that represents the proximity of system dynamics to limit cycle oscillations and thus provide an early warning signal for the onset of thermoacoustic instability.
    DeepAuditor: Distributed Online Intrusion Detection System for IoT devices via Power Side-channel Auditing. (arXiv:2106.12753v1 [cs.CR])
    (2 min) As the number of IoT devices has increased rapidly, IoT botnets have exploited the vulnerabilities of IoT devices. However, it is still challenging to detect the initial intrusion on IoT devices prior to massive attacks. Recent studies have utilized power side-channel information to characterize this intrusion behavior on IoT devices but still lack real-time detection approaches. This study aimed to design an online intrusion detection system called DeepAuditor for IoT devices via power auditing. To realize the real-time system, we first proposed a lightweight power auditing device called Power Auditor. With the Power Auditor, we developed a Distributed CNN classifier for online inference in our laboratory setting. In order to protect data leakage and reduce networking redundancy, we also proposed a privacy-preserved inference protocol via Packed Homomorphic Encryption and a sliding window protocol in our system. The classification accuracy and processing time were measured in our laboratory settings. We also demonstrated that the distributed CNN design is secure against any distributed components. Overall, the measurements were shown to the feasibility of our real-time distributed system for intrusion detection on IoT devices.
    Objective discovery of dominant dynamical processes with intelligible machine learning. (arXiv:2106.12963v1 [cs.LG])
    (2 min) The advent of big data has vast potential for discovery in natural phenomena ranging from climate science to medicine, but overwhelming complexity stymies insight. Existing theory is often not able to succinctly describe salient phenomena, and progress has largely relied on ad hoc definitions of dynamical regimes to guide and focus exploration. We present a formal definition in which the identification of dynamical regimes is formulated as an optimization problem, and we propose an intelligible objective function. Furthermore, we propose an unsupervised learning framework which eliminates the need for a priori knowledge and ad hoc definitions; instead, the user need only choose appropriate clustering and dimensionality reduction algorithms, and this choice can be guided using our proposed objective function. We illustrate its applicability with example problems drawn from ocean dynamics, tumor angiogenesis, and turbulent boundary layers. Our method is a step towards unbiased data exploration that allows serendipitous discovery within dynamical systems, with the potential to propel the physical sciences forward.
    Evaluation of Representation Models for Text Classification with AutoML Tools. (arXiv:2106.12798v1 [cs.CL])
    (2 min) Automated Machine Learning (AutoML) has gained increasing success on tabular data in recent years. However, processing unstructured data like text is a challenge and not widely supported by open-source AutoML tools. This work compares three manually created text representations and text embeddings automatically created by AutoML tools. Our benchmark includes four popular open-source AutoML tools and eight datasets for text classification purposes. The results show that straightforward text representations perform better than AutoML tools with automatically created text embeddings.
    Efficient Tensor Contraction via Fast Count Sketch. (arXiv:2106.13062v1 [cs.LG])
    (2 min) Sketching uses randomized Hash functions for dimensionality reduction and acceleration. The existing sketching methods, such as count sketch (CS), tensor sketch (TS), and higher-order count sketch (HCS), either suffer from low accuracy or slow speed in some tensor based applications. In this paper, the proposed fast count sketch (FCS) applies multiple shorter Hash functions based CS to the vector form of the input tensor, which is more accurate than TS since the spatial information of the input tensor can be preserved more sufficiently. When the input tensor admits CANDECOMP/PARAFAC decomposition (CPD), FCS can accelerate CS and HCS by using fast Fourier transform, which exhibits a computational complexity asymptotically identical to TS for low-order tensors. The effectiveness of FCS is validated by CPD, tensor regression network compression, and Kronecker product compression. Experimental results show its superior performance in terms of approximation accuracy and computational efficiency.
    A Systematic Collection of Medical Image Datasets for Deep Learning. (arXiv:2106.12864v1 [eess.IV])
    (2 min) The astounding success made by artificial intelligence (AI) in healthcare and other fields proves that AI can achieve human-like performance. However, success always comes with challenges. Deep learning algorithms are data-dependent and require large datasets for training. The lack of data in the medical imaging field creates a bottleneck for the application of deep learning to medical image analysis. Medical image acquisition, annotation, and analysis are costly, and their usage is constrained by ethical restrictions. They also require many resources, such as human expertise and funding. That makes it difficult for non-medical researchers to have access to useful and large medical data. Thus, as comprehensive as possible, this paper provides a collection of medical image datasets with their associated challenges for deep learning research. We have collected information of around three hundred datasets and challenges mainly reported between 2013 and 2020 and categorized them into four categories: head & neck, chest & abdomen, pathology & blood, and ``others''. Our paper has three purposes: 1) to provide a most up to date and complete list that can be used as a universal reference to easily find the datasets for clinical image analysis, 2) to guide researchers on the methodology to test and evaluate their methods' performance and robustness on relevant datasets, 3) to provide a ``route'' to relevant algorithms for the relevant medical topics, and challenge leaderboards.
    A comprehensive empirical analysis on cross-domain semantic enrichment for detection of depressive language. (arXiv:2106.12797v1 [cs.CL])
    (2 min) We analyze the process of creating word embedding feature representations designed for a learning task when annotated data is scarce, for example, in depressive language detection from Tweets. We start with a rich word embedding pre-trained from a large general dataset, which is then augmented with embeddings learned from a much smaller and more specific domain dataset through a simple non-linear mapping mechanism. We also experimented with several other more sophisticated methods of such mapping including, several auto-encoder based and custom loss-function based methods that learn embedding representations through gradually learning to be close to the words of similar semantics and distant to dissimilar semantics. Our strengthened representations better capture the semantics of the depression domain, as it combines the semantics learned from the specific domain coupled with word coverage from the general language. We also present a comparative performance analyses of our word embedding representations with a simple bag-of-words model, well known sentiment and psycholinguistic lexicons, and a general pre-trained word embedding. When used as feature representations for several different machine learning methods, including deep learning models in a depressive Tweets identification task, we show that our augmented word embedding representations achieve a significantly better F1 score than the others, specially when applied to a high quality dataset. Also, we present several data ablation tests which confirm the efficacy of our augmentation techniques.
    Real-time gravitational-wave science with neural posterior estimation. (arXiv:2106.12594v1 [gr-qc])
    (2 min) We demonstrate unprecedented accuracy for rapid gravitational-wave parameter estimation with deep learning. Using neural networks as surrogates for Bayesian posterior distributions, we analyze eight gravitational-wave events from the first LIGO-Virgo Gravitational-Wave Transient Catalog and find very close quantitative agreement with standard inference codes, but with inference times reduced from O(day) to a minute per event. Our networks are trained using simulated data, including an estimate of the detector-noise characteristics near the event. This encodes the signal and noise models within millions of neural-network parameters, and enables inference for any observed data consistent with the training distribution, accounting for noise nonstationarity from event to event. Our algorithm -- called "DINGO" -- sets a new standard in fast-and-accurate inference of physical parameters of detected gravitational-wave events, which should enable real-time data analysis without sacrificing accuracy.
    Multi-Reference Alignment for sparse signals, Uniform Uncertainty Principles and the Beltway Problem. (arXiv:2106.12996v1 [math.ST])
    (2 min) Motivated by cutting-edge applications like cryo-electron microscopy (cryo-EM), the Multi-Reference Alignment (MRA) model entails the learning of an unknown signal from repeated measurements of its images under the latent action of a group of isometries and additive noise of magnitude $\sigma$. Despite significant interest, a clear picture for understanding rates of estimation in this model has emerged only recently, particularly in the high-noise regime $\sigma \gg 1$ that is highly relevant in applications. Recent investigations have revealed a remarkable asymptotic sample complexity of order $\sigma^6$ for certain signals whose Fourier transforms have full support, in stark contrast to the traditional $\sigma^2$ that arise in regular models. Often prohibitively large in practice, these results have prompted the investigation of variations around the MRA model where better sample complexity may be achieved. In this paper, we show that \emph{sparse} signals exhibit an intermediate $\sigma^4$ sample complexity even in the classical MRA model. Our results explore and exploit connections of the MRA estimation problem with two classical topics in applied mathematics: the \textit{beltway problem} from combinatorial optimization, and \textit{uniform uncertainty principles} from harmonic analysis.
    Stochastic Projective Splitting: Solving Saddle-Point Problems with Multiple Regularizers. (arXiv:2106.13067v1 [math.OC])
    (2 min) We present a new, stochastic variant of the projective splitting (PS) family of algorithms for monotone inclusion problems. It can solve min-max and noncooperative game formulations arising in applications such as robust ML without the convergence issues associated with gradient descent-ascent, the current de facto standard approach in such situations. Our proposal is the first version of PS able to use stochastic (as opposed to deterministic) gradient oracles. It is also the first stochastic method that can solve min-max games while easily handling multiple constraints and nonsmooth regularizers via projection and proximal operators. We close with numerical experiments on a distributionally robust sparse logistic regression problem.
    Low-Latency Federated Learning over Wireless Channels with Differential Privacy. (arXiv:2106.13039v1 [cs.DC])
    (2 min) In federated learning (FL), model training is distributed over clients and local models are aggregated by a central server. The performance of uploaded models in such situations can vary widely due to imbalanced data distributions, potential demands on privacy protections, and quality of transmissions. In this paper, we aim to minimize FL training delay over wireless channels, constrained by overall training performance as well as each client's differential privacy (DP) requirement. We solve this problem in the framework of multi-agent multi-armed bandit (MAMAB) to deal with the situation where there are multiple clients confornting different unknown transmission environments, e.g., channel fading and interferences. Specifically, we first transform the long-term constraints on both training performance and each client's DP into a virtual queue based on the Lyapunov drift technique. Then, we convert the MAMAB to a max-min bipartite matching problem at each communication round, by estimating rewards with the upper confidence bound (UCB) approach. More importantly, we propose two efficient solutions to this matching problem, i.e., modified Hungarian algorithm and greedy matching with a better alternative (GMBA), in which the first one can achieve the optimal solution with a high complexity while the second one approaches a better trade-off by enabling a verified low-complexity with little performance loss. In addition, we develop an upper bound on the expected regret of this MAMAB based FL framework, which shows a linear growth over the logarithm of communication rounds, justifying its theoretical feasibility. Extensive experimental results are conducted to validate the effectiveness of our proposed algorithms, and the impacts of various parameters on the FL performance over wireless edge networks are also discussed.
    Stock Market Analysis with Text Data: A Review. (arXiv:2106.12985v1 [q-fin.ST])
    (2 min) Stock market movements are influenced by public and private information shared through news articles, company reports, and social media discussions. Analyzing these vast sources of data can give market participants an edge to make profit. However, the majority of the studies in the literature are based on traditional approaches that come short in analyzing unstructured, vast textual data. In this study, we provide a review on the immense amount of existing literature of text-based stock market analysis. We present input data types and cover main textual data sources and variations. Feature representation techniques are then presented. Then, we cover the analysis techniques and create a taxonomy of the main stock market forecast models. Importantly, we discuss representative work in each category of the taxonomy, analyzing their respective contributions. Finally, this paper shows the findings on unaddressed open problems and gives suggestions for future work. The aim of this study is to survey the main stock market analysis models, text representation techniques for financial market prediction, shortcomings of existing techniques, and propose promising directions for future research.
    Towards Biologically Plausible Convolutional Networks. (arXiv:2106.13031v1 [cs.LG])
    (2 min) Convolutional networks are ubiquitous in deep learning. They are particularly useful for images, as they reduce the number of parameters, reduce training time, and increase accuracy. However, as a model of the brain they are seriously problematic, since they require weight sharing - something real neurons simply cannot do. Consequently, while neurons in the brain can be locally connected (one of the features of convolutional networks), they cannot be convolutional. Locally connected but non-convolutional networks, however, significantly underperform convolutional ones. This is troublesome for studies that use convolutional networks to explain activity in the visual system. Here we study plausible alternatives to weight sharing that aim at the same regularization principle, which is to make each neuron within a pool react similarly to identical inputs. The most natural way to do that is by showing the network multiple translations of the same image, akin to saccades in animal vision. However, this approach requires many translations, and doesn't remove the performance gap. We propose instead to add lateral connectivity to a locally connected network, and allow learning via Hebbian plasticity. This requires the network to pause occasionally for a sleep-like phase of "weight sharing". This method enables locally connected networks to achieve nearly convolutional performance on ImageNet, thus supporting convolutional networks as a model of the visual stream.
    Next-Day Bitcoin Price Forecast Based on Artificial intelligence Methods. (arXiv:2106.12961v1 [q-fin.ST])
    (2 min) In recent years, Bitcoin price prediction has attracted the interest of researchers and investors. However, the accuracy of previous studies is not well enough. Machine learning and deep learning methods have been proved to have strong prediction ability in this area. This paper proposed a method combined with Ensemble Empirical Mode Decomposition (EEMD) and a deep learning method called long short-term memory (LSTM) to research the problem of next-day Bitcoin price forecast.
    Fea2Fea: Exploring Structural Feature Correlations via Graph Neural Networks. (arXiv:2106.13061v1 [cs.LG])
    (2 min) Structural features are important features in graph datasets. However, although there are some correlation analysis of features based on covariance, there is no relevant research on exploring structural feature correlation on graphs with graph neural network based models. In this paper, we introduce graph feature to feature (Fea2Fea) prediction pipelines in a low dimensional space to explore some preliminary results on structural feature correlation, which is based on graph neural network. The results show that there exists high correlation between some of the structural features. A redundant feature combination with initial node features, which is filtered by graph neural network has improved its classification accuracy in some graph datasets. We compare the difference between concatenation methods on connecting embeddings between features and show that the simplest is the best. We generalize on the synthetic geometric graphs and certify the results on prediction difficulty between two structural features.
    Quantization Aware Training, ERNIE and Kurtosis Regularizer: a short empirical study. (arXiv:2106.13035v1 [stat.ML])
    (2 min) Pre-trained language models like Ernie or Bert are currently used in many applications. These models come with a set of pre-trained weights typically obtained in unsupervised/self-supervised modality on a huge amount of data. After that, they are fine-tuned on a specific task. Applications then use these models for inference, and often some additional constraints apply, like low power-budget or low latency between input and output. The main avenue to meet these additional requirements for the inference settings, is to use low precision computation (e.g. INT8 rather than FP32), but this comes with a cost of deteriorating the functional performance (e.g. accuracy) of the model. Some approaches have been developed to tackle the problem and go beyond the limitations of the PTO (Post-Training Quantization), more specifically the QAT (Quantization Aware Training, see [4]) is a procedure that interferes with the training process in order to make it affected (or simply disturbed) by the quantization phase during the training itself. Besides QAT, recently Intel-Habana Labs have proposed an additional and more direct way to make the training results more robust to subsequent quantization which uses a regularizer, therefore changing the loss function that drives the training procedure. But their proposal does not work out-of-the-box for pre-trained models like Ernie, for example. In this short paper we show why this is not happening (for the Ernie case) and we propose a very basic way to deal with it, sharing as well some initial results (increase in final INT8 accuracy) that might be of interest to practitioners willing to use Ernie in their applications, in low precision regime.
    Unsupervised Topic Segmentation of Meetings with BERT Embeddings. (arXiv:2106.12978v1 [cs.LG])
    (2 min) Topic segmentation of meetings is the task of dividing multi-person meeting transcripts into topic blocks. Supervised approaches to the problem have proven intractable due to the difficulties in collecting and accurately annotating large datasets. In this paper we show how previous unsupervised topic segmentation methods can be improved using pre-trained neural architectures. We introduce an unsupervised approach based on BERT embeddings that achieves a 15.5% reduction in error rate over existing unsupervised approaches applied to two popular datasets for meeting transcripts.
    Spatial-Temporal Graph ODE Networks for Traffic Flow Forecasting. (arXiv:2106.12931v1 [cs.LG])
    (2 min) Spatial-temporal forecasting has attracted tremendous attention in a wide range of applications, and traffic flow prediction is a canonical and typical example. The complex and long-range spatial-temporal correlations of traffic flow bring it to a most intractable challenge. Existing works typically utilize shallow graph convolution networks (GNNs) and temporal extracting modules to model spatial and temporal dependencies respectively. However, the representation ability of such models is limited due to: (1) shallow GNNs are incapable to capture long-range spatial correlations, (2) only spatial connections are considered and a mass of semantic connections are ignored, which are of great importance for a comprehensive understanding of traffic networks. To this end, we propose Spatial-Temporal Graph Ordinary Differential Equation Networks (STGODE). Specifically, we capture spatial-temporal dynamics through a tensor-based ordinary differential equation (ODE), as a result, deeper networks can be constructed and spatial-temporal features are utilized synchronously. To understand the network more comprehensively, semantical adjacency matrix is considered in our model, and a well-design temporal dialated convolution structure is used to capture long term temporal dependencies. We evaluate our model on multiple real-world traffic datasets and superior performance is achieved over state-of-the-art baselines.
    Mix and Mask Actor-Critic Methods. (arXiv:2106.13037v1 [cs.LG])
    (2 min) Shared feature spaces for actor-critic methods aims to capture generalized latent representations to be used by the policy and value function with the hopes for a more stable and sample-efficient optimization. However, such a paradigm present a number of challenges in practice, as parameters generating a shared representation must learn off two distinct objectives, resulting in competing updates and learning perturbations. In this paper, we present a novel feature-sharing framework to address these difficulties by introducing the mix and mask mechanisms and the distributional scalarization technique. These mechanisms behaves dynamically to couple and decouple connected latent features variably between the policy and value function, while the distributional scalarization standardizes the two objectives using a probabilistic standpoint. From our experimental results, we demonstrate significant performance improvements compared to alternative methods using separate networks and networks with a shared backbone.
    Learning Multiple Stock Trading Patterns with Temporal Routing Adaptor and Optimal Transport. (arXiv:2106.12950v1 [cs.LG])
    (2 min) Successful quantitative investment usually relies on precise predictions of the future movement of the stock price. Recently, machine learning based solutions have shown their capacity to give more accurate stock prediction and become indispensable components in modern quantitative investment systems. However, the i.i.d. assumption behind existing methods is inconsistent with the existence of diverse trading patterns in the stock market, which inevitably limits their ability to achieve better stock prediction performance. In this paper, we propose a novel architecture, Temporal Routing Adaptor (TRA), to empower existing stock prediction models with the ability to model multiple stock trading patterns. Essentially, TRA is a lightweight module that consists of a set of independent predictors for learning multiple patterns as well as a router to dispatch samples to different predictors. Nevertheless, the lack of explicit pattern identifiers makes it quite challenging to train an effective TRA-based model. To tackle this challenge, we further design a learning algorithm based on Optimal Transport (OT) to obtain the optimal sample to predictor assignment and effectively optimize the router with such assignment through an auxiliary loss term. Experiments on the real-world stock ranking task show that compared to the state-of-the-art baselines, e.g., Attention LSTM and Transformer, the proposed method can improve information coefficient (IC) from 0.053 to 0.059 and 0.051 to 0.056 respectively. Our dataset and code used in this work are publicly available: https://github.com/microsoft/qlib.
    Unsupervised Learning of Depth and Depth-of-Field Effect from Natural Images with Aperture Rendering Generative Adversarial Networks. (arXiv:2106.13041v1 [cs.CV])
    (2 min) Understanding the 3D world from 2D projected natural images is a fundamental challenge in computer vision and graphics. Recently, an unsupervised learning approach has garnered considerable attention owing to its advantages in data collection. However, to mitigate training limitations, typical methods need to impose assumptions for viewpoint distribution (e.g., a dataset containing various viewpoint images) or object shape (e.g., symmetric objects). These assumptions often restrict applications; for instance, the application to non-rigid objects or images captured from similar viewpoints (e.g., flower or bird images) remains a challenge. To complement these approaches, we propose aperture rendering generative adversarial networks (AR-GANs), which equip aperture rendering on top of GANs, and adopt focus cues to learn the depth and depth-of-field (DoF) effect of unlabeled natural images. To address the ambiguities triggered by unsupervised setting (i.e., ambiguities between smooth texture and out-of-focus blurs, and between foreground and background blurs), we develop DoF mixture learning, which enables the generator to learn real image distribution while generating diverse DoF images. In addition, we devise a center focus prior to guiding the learning direction. In the experiments, we demonstrate the effectiveness of AR-GANs in various datasets, such as flower, bird, and face images, demonstrate their portability by incorporating them into other 3D representation learning GANs, and validate their applicability in shallow DoF rendering.
    A Fully Problem-Dependent Regret Lower Bound for Finite-Horizon MDPs. (arXiv:2106.13013v1 [cs.LG])
    (2 min) We derive a novel asymptotic problem-dependent lower-bound for regret minimization in finite-horizon tabular Markov Decision Processes (MDPs). While, similar to prior work (e.g., for ergodic MDPs), the lower-bound is the solution to an optimization problem, our derivation reveals the need for an additional constraint on the visitation distribution over state-action pairs that explicitly accounts for the dynamics of the MDP. We provide a characterization of our lower-bound through a series of examples illustrating how different MDPs may have significantly different complexity. 1) We first consider a "difficult" MDP instance, where the novel constraint based on the dynamics leads to a larger lower-bound (i.e., a larger regret) compared to the classical analysis. 2) We then show that our lower-bound recovers results previously derived for specific MDP instances. 3) Finally, we show that, in certain "simple" MDPs, the lower bound is considerably smaller than in the general case and it does not scale with the minimum action gap at all. We show that this last result is attainable (up to $poly(H)$ terms, where $H$ is the horizon) by providing a regret upper-bound based on policy gaps for an optimistic algorithm.
    SofaMyRoom: a fast and multiplatform "shoebox" room simulator for binaural room impulse response dataset generation. (arXiv:2106.12992v1 [cs.SD])
    (2 min) This paper introduces a shoebox room simulator able to systematically generate synthetic datasets of binaural room impulse responses (BRIRs) given an arbitrary set of head-related transfer functions (HRTFs). The evaluation of machine hearing algorithms frequently requires BRIR datasets in order to simulate the acoustics of any environment. However, currently available solutions typically consider only HRTFs measured on dummy heads, which poorly characterize the high variability in spatial sound perception. Our solution allows to integrate a room impulse response (RIR) simulator with different HRTF sets represented in Spatially Oriented Format for Acoustics (SOFA). The source code and the compiled binaries for different operating systems allow to both advanced and non-expert users to benefit from our toolbox, see https://github.com/spatialaudiotools/sofamyroom/ .
    VinDr-SpineXR: A deep learning framework for spinal lesions detection and classification from radiographs. (arXiv:2106.12930v1 [eess.IV])
    (2 min) Radiographs are used as the most important imaging tool for identifying spine anomalies in clinical practice. The evaluation of spinal bone lesions, however, is a challenging task for radiologists. This work aims at developing and evaluating a deep learning-based framework, named VinDr-SpineXR, for the classification and localization of abnormalities from spine X-rays. First, we build a large dataset, comprising 10,468 spine X-ray images from 5,000 studies, each of which is manually annotated by an experienced radiologist with bounding boxes around abnormal findings in 13 categories. Using this dataset, we then train a deep learning classifier to determine whether a spine scan is abnormal and a detector to localize 7 crucial findings amongst the total 13. The VinDr-SpineXR is evaluated on a test set of 2,078 images from 1,000 studies, which is kept separate from the training set. It demonstrates an area under the receiver operating characteristic curve (AUROC) of 88.61% (95% CI 87.19%, 90.02%) for the image-level classification task and a mean average precision (mAP@0.5) of 33.56% for the lesion-level localization task. These results serve as a proof of concept and set a baseline for future research in this direction. To encourage advances, the dataset, codes, and trained deep learning models are made publicly available.
    PocketNet: A Smaller Neural Network for Medical Image Analysis. (arXiv:2104.10745v2 [eess.IV] UPDATED)
    (2 min) Medical imaging deep learning models are often large and complex, requiring specialized hardware to train and evaluate these models. To address such issues, we propose the PocketNet paradigm to reduce the size of deep learning models by throttling the growth of the number of channels in convolutional neural networks. We demonstrate that, for a range of segmentation and classification tasks, PocketNet architectures produce results comparable to that of conventional neural networks while reducing the number of parameters by multiple orders of magnitude, using up to 90% less GPU memory, and speeding up training times by up to 40%, thereby allowing such models to be trained and deployed in resource-constrained settings.
    Hierarchical Inducing Point Gaussian Process for Inter-domain Observations. (arXiv:2103.00393v2 [cs.LG] UPDATED)
    (2 min) We examine the general problem of inter-domain Gaussian Processes (GPs): problems where the GP realization and the noisy observations of that realization lie on different domains. When the mapping between those domains is linear, such as integration or differentiation, inference is still closed form. However, many of the scaling and approximation techniques that our community has developed do not apply to this setting. In this work, we introduce the hierarchical inducing point GP (HIP-GP), a scalable inter-domain GP inference method that enables us to improve the approximation accuracy by increasing the number of inducing points to the millions. HIP-GP, which relies on inducing points with grid structure and a stationary kernel assumption, is suitable for low-dimensional problems. In developing HIP-GP, we introduce (1) a fast whitening strategy, and (2) a novel preconditioner for conjugate gradients which can be helpful in general GP settings. Our code is available at https: //github.com/cunningham-lab/hipgp.
    High Performance Hyperspectral Image Classification using Graphics Processing Units. (arXiv:2106.12942v1 [cs.DC])
    (2 min) Real-time remote sensing applications like search and rescue missions, military target detection, environmental monitoring, hazard prevention and other time-critical applications require onboard real time processing capabilities or autonomous decision making. Some unmanned remote systems like satellites are physically remote from their operators, and all control of the spacecraft and data returned by the spacecraft must be transmitted over a wireless radio link. This link may not be available for extended periods when the satellite is out of line of sight of its ground station. Therefore, lightweight, small size and low power consumption hardware is essential for onboard real time processing systems. With increasing dimensionality, size and resolution of recent hyperspectral imaging sensors, additional challenges are posed upon remote sensing processing systems and more capable computing architectures are needed. Graphical Processing Units (GPUs) emerged as promising architecture for light weight high performance computing that can address these computational requirements for onboard systems. The goal of this study is to build high performance methods for onboard hyperspectral analysis. We propose accelerated methods for the well-known recursive hierarchical segmentation (RHSEG) clustering method, using GPUs, hybrid multicore CPU with a GPU and hybrid multi-core CPU/GPU clusters. RHSEG is a method developed by the National Aeronautics and Space Administration (NASA), which is designed to provide rich classification information with several output levels. The achieved speedups by parallel solutions compared to CPU sequential implementations are 21x for parallel single GPU and 240x for hybrid multi-node computer clusters with 16 computing nodes. The energy consumption is reduced to 74% using a single GPU compared to the equivalent parallel CPU cluster.
    Differentially Private Algorithms for Clustering with Stability Assumptions. (arXiv:2106.12959v1 [cs.LG])
    (2 min) We study the problem of differentially private clustering under input-stability assumptions. Despite the ever-growing volume of works on differential privacy in general and differentially private clustering in particular, only three works (Nissim et al. 2007, Wang et al. 2015, Huang et al. 2018) looked at the problem of privately clustering "nice" k-means instances, all three relying on the sample-and-aggregate framework and all three measuring utility in terms of Wasserstein distance between the true cluster centers and the centers returned by the private algorithm. In this work we improve upon this line of works on multiple axes. We present a far simpler algorithm for clustering stable inputs (not relying on the sample-and-aggregate framework), and analyze its utility in both the Wasserstein distance and the k-means cost. Moreover, our algorithm has straight-forward analogues for "nice" k-median instances and for the local-model of differential privacy.
    Understanding Modern Techniques in Optimization: Frank-Wolfe, Nesterov's Momentum, and Polyak's Momentum. (arXiv:2106.12923v1 [math.OC])
    (2 min) In the first part of this dissertation research, we develop a modular framework that can serve as a recipe for constructing and analyzing iterative algorithms for convex optimization. Specifically, our work casts optimization as iteratively playing a two-player zero-sum game. Many existing optimization algorithms including Frank-Wolfe and Nesterov's acceleration methods can be recovered from the game by pitting two online learners with appropriate strategies against each other. Furthermore, the sum of the weighted average regrets of the players in the game implies the convergence rate. As a result, our approach provides simple alternative proofs to these algorithms. Moreover, we demonstrate that our approach of optimization as iteratively playing a game leads to three new fast Frank-Wolfe-like algorithms for some constraint sets, which further shows that our framework is indeed generic, modular, and easy-to-use. In the second part, we develop a modular analysis of provable acceleration via Polyak's momentum for certain problems, which include solving the classical strongly quadratic convex problems, training a wide ReLU network under the neural tangent kernel regime, and training a deep linear network with an orthogonal initialization. We develop a meta theorem and show that when applying Polyak's momentum for these problems, the induced dynamics exhibit a form where we can directly apply our meta theorem. In the last part of the dissertation, we show another advantage of the use of Polyak's momentum -- it facilitates fast saddle point escape in smooth non-convex optimization. This result, together with those of the second part, sheds new light on Polyak's momentum in modern non-convex optimization and deep learning.
    Long-term Cross Adversarial Training: A Robust Meta-learning Method for Few-shot Classification Tasks. (arXiv:2106.12900v1 [cs.LG])
    (2 min) Meta-learning model can quickly adapt to new tasks using few-shot labeled data. However, despite achieving good generalization on few-shot classification tasks, it is still challenging to improve the adversarial robustness of the meta-learning model in few-shot learning. Although adversarial training (AT) methods such as Adversarial Query (AQ) can improve the adversarially robust performance of meta-learning models, AT is still computationally expensive training. On the other hand, meta-learning models trained with AT will drop significant accuracy on the original clean images. This paper proposed a meta-learning method on the adversarially robust neural network called Long-term Cross Adversarial Training (LCAT). LCAT will update meta-learning model parameters cross along the natural and adversarial sample distribution direction with long-term to improve both adversarial and clean few-shot classification accuracy. Due to cross-adversarial training, LCAT only needs half of the adversarial training epoch than AQ, resulting in a low adversarial training computation. Experiment results show that LCAT achieves superior performance both on the clean and adversarial few-shot classification accuracy than SOTA adversarial training methods for meta-learning models.
    Accelerating variational quantum algorithms with multiple quantum processors. (arXiv:2106.12819v1 [quant-ph])
    (2 min) Variational quantum algorithms (VQAs) have the potential of utilizing near-term quantum machines to gain certain computational advantages over classical methods. Nevertheless, modern VQAs suffer from cumbersome computational overhead, hampered by the tradition of employing a solitary quantum processor to handle large-volume data. As such, to better exert the superiority of VQAs, it is of great significance to improve their runtime efficiency. Here we devise an efficient distributed optimization scheme, called QUDIO, to address this issue. Specifically, in QUDIO, a classical central server partitions the learning problem into multiple subproblems and allocate them to multiple local nodes where each of them consists of a quantum processor and a classical optimizer. During the training procedure, all local nodes proceed parallel optimization and the classical server synchronizes optimization information among local nodes timely. In doing so, we prove a sublinear convergence rate of QUDIO in terms of the number of global iteration under the ideal scenario, while the system imperfection may incur divergent optimization. Numerical results on standard benchmarks demonstrate that QUDIO can surprisingly achieve a superlinear runtime speedup with respect to the number of local nodes. Our proposal can be readily mixed with other advanced VQAs-based techniques to narrow the gap between the state of the art and applications with quantum advantage.
    The Option Keyboard: Combining Skills in Reinforcement Learning. (arXiv:2106.13105v1 [cs.AI])
    (2 min) The ability to combine known skills to create new ones may be crucial in the solution of complex reinforcement learning problems that unfold over extended periods. We argue that a robust way of combining skills is to define and manipulate them in the space of pseudo-rewards (or "cumulants"). Based on this premise, we propose a framework for combining skills using the formalism of options. We show that every deterministic option can be unambiguously represented as a cumulant defined in an extended domain. Building on this insight and on previous results on transfer learning, we show how to approximate options whose cumulants are linear combinations of the cumulants of known options. This means that, once we have learned options associated with a set of cumulants, we can instantaneously synthesise options induced by any linear combination of them, without any learning involved. We describe how this framework provides a hierarchical interface to the environment whose abstract actions correspond to combinations of basic skills. We demonstrate the practical benefits of our approach in a resource management problem and a navigation task involving a quadrupedal simulated robot.
    L'Apprentissage Automatique dans la planification et le contr{\^o}le de la production : un {\'e}tat de l'art. (arXiv:2106.12916v1 [cs.LG])
    (2 min) Proper Production Planning and Control (PPC) is capital to have an edge over competitors, reduce costs and respect delivery dates. With regard to PPC, Machine Learning (ML) provides new opportunities to make intelligent decisions based on data. Therefore, this communication provides an initial systematic review of publications on ML applied in PPC. The research objective of this study is twofold: firstly, it aims to identify techniques and tools allowing to apply ML in PPC, and secondly, it reviews the characteristics of Industry 4.0 (I4.0) in recent research papers. Concerning the second objective, seven characteristics of I4.0 are used in the analysis framework, from which two of them are proposed by the authors. Additionally, the addressed domains of ML-aided PPC in scientific literature are identified. Finally, results are analyzed and gaps that may motivate further research are highlighted.
    Density Constrained Reinforcement Learning. (arXiv:2106.12764v1 [cs.LG])
    (2 min) We study constrained reinforcement learning (CRL) from a novel perspective by setting constraints directly on state density functions, rather than the value functions considered by previous works. State density has a clear physical and mathematical interpretation, and is able to express a wide variety of constraints such as resource limits and safety requirements. Density constraints can also avoid the time-consuming process of designing and tuning cost functions required by value function-based constraints to encode system specifications. We leverage the duality between density functions and Q functions to develop an effective algorithm to solve the density constrained RL problem optimally and the constrains are guaranteed to be satisfied. We prove that the proposed algorithm converges to a near-optimal solution with a bounded error even when the policy update is imperfect. We use a set of comprehensive experiments to demonstrate the advantages of our approach over state-of-the-art CRL methods, with a wide range of density constrained tasks as well as standard CRL benchmarks such as Safety-Gym.
    Fundamental limits for learning hidden Markov model parameters. (arXiv:2106.12936v1 [stat.ML])
    (2 min) We study the frontier between learnable and unlearnable hidden Markov models (HMMs). HMMs are flexible tools for clustering dependent data coming from unknown populations. The model parameters are known to be identifiable as soon as the clusters are distinct and the hidden chain is ergodic with a full rank transition matrix. In the limit as any one of these conditions fails, it becomes impossible to identify parameters. For a chain with two hidden states we prove nonasymptotic minimax upper and lower bounds, matching up to constants, which exhibit thresholds at which the parameters become learnable.
    Using machine learning techniques to predict hospital admission at the emergency department. (arXiv:2106.12921v1 [cs.LG])
    (2 min) Introduction: One of the most important tasks in the Emergency Department (ED) is to promptly identify the patients who will benefit from hospital admission. Machine Learning (ML) techniques show promise as diagnostic aids in healthcare. Material and methods: We investigated the following features seeking to investigate their performance in predicting hospital admission: serum levels of Urea, Creatinine, Lactate Dehydrogenase, Creatine Kinase, C-Reactive Protein, Complete Blood Count with differential, Activated Partial Thromboplastin Time, D Dimer, International Normalized Ratio, age, gender, triage disposition to ED unit and ambulance utilization. A total of 3,204 ED visits were analyzed. Results: The proposed algorithms generated models which demonstrated acceptable performance in predicting hospital admission of ED patients. The range of F-measure and ROC Area values of all eight evaluated algorithms were [0.679-0.708] and [0.734-0.774], respectively. Discussion: The main advantages of this tool include easy access, availability, yes/no result, and low cost. The clinical implications of our approach might facilitate a shift from traditional clinical decision-making to a more sophisticated model. Conclusion: Developing robust prognostic models with the utilization of common biomarkers is a project that might shape the future of emergency medicine. Our findings warrant confirmation with implementation in pragmatic ED trials.
    GNMR: A provable one-line algorithm for low rank matrix recovery. (arXiv:2106.12933v1 [math.OC])
    (2 min) Low rank matrix recovery problems, including matrix completion and matrix sensing, appear in a broad range of applications. In this work we present GNMR -- an extremely simple iterative algorithm for low rank matrix recovery, based on a Gauss-Newton linearization. On the theoretical front, we derive recovery guarantees for GNMR in both the matrix sensing and matrix completion settings. A key property of GNMR is that it implicitly keeps the factor matrices approximately balanced throughout its iterations. On the empirical front, we show that for matrix completion with uniform sampling, GNMR performs better than several popular methods, especially when given very few observations close to the information limit.
    Exploration-Exploitation in Multi-Agent Competition: Convergence with Bounded Rationality. (arXiv:2106.12928v1 [cs.GT])
    (2 min) The interplay between exploration and exploitation in competitive multi-agent learning is still far from being well understood. Motivated by this, we study smooth Q-learning, a prototypical learning model that explicitly captures the balance between game rewards and exploration costs. We show that Q-learning always converges to the unique quantal-response equilibrium (QRE), the standard solution concept for games under bounded rationality, in weighted zero-sum polymatrix games with heterogeneous learning agents using positive exploration rates. Complementing recent results about convergence in weighted potential games, we show that fast convergence of Q-learning in competitive settings is obtained regardless of the number of agents and without any need for parameter fine-tuning. As showcased by our experiments in network zero-sum games, these theoretical results provide the necessary guarantees for an algorithmic approach to the currently open problem of equilibrium selection in competitive multi-agent settings.
    A Construction Kit for Efficient Low Power Neural Network Accelerator Designs. (arXiv:2106.12810v1 [cs.AR])
    (2 min) Implementing embedded neural network processing at the edge requires efficient hardware acceleration that couples high computational performance with low power consumption. Driven by the rapid evolution of network architectures and their algorithmic features, accelerator designs are constantly updated and improved. To evaluate and compare hardware design choices, designers can refer to a myriad of accelerator implementations in the literature. Surveys provide an overview of these works but are often limited to system-level and benchmark-specific performance metrics, making it difficult to quantitatively compare the individual effect of each utilized optimization technique. This complicates the evaluation of optimizations for new accelerator designs, slowing-down the research progress. This work provides a survey of neural network accelerator optimization approaches that have been used in recent works and reports their individual effects on edge processing performance. It presents the list of optimizations and their quantitative effects as a construction kit, allowing to assess the design choices for each building block separately. Reported optimizations range from up to 10'000x memory savings to 33x energy reductions, providing chip designers an overview of design choices for implementing efficient low power neural network accelerators.
    Partial Wasserstein and Maximum Mean Discrepancy distances for bridging the gap between outlier detection and drift detection. (arXiv:2106.12893v1 [cs.LG])
    (2 min) With the rise of machine learning and deep learning based applications in practice, monitoring, i.e. verifying that these operate within specification, has become an important practical problem. An important aspect of this monitoring is to check whether the inputs (or intermediates) have strayed from the distribution they were validated for, which can void the performance assurances obtained during testing. There are two common approaches for this. The, perhaps, more classical one is outlier detection or novelty detection, where, for a single input we ask whether it is an outlier, i.e. exceedingly unlikely to have originated from a reference distribution. The second, perhaps more recent approach, is to consider a larger number of inputs and compare its distribution to a reference distribution (e.g. sampled during testing). This is done under the label drift detection. In this work, we bridge the gap between outlier detection and drift detection through comparing a given number of inputs to an automatically chosen part of the reference distribution.
    Simple Truncated SVD based Model for Node Classification on Heterophilic Graphs. (arXiv:2106.12807v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) have shown excellent performance on graphs that exhibit strong homophily with respect to the node labels i.e. connected nodes have same labels. However, they perform poorly on heterophilic graphs. Recent approaches have typically modified aggregation schemes, designed adaptive graph filters, etc. to address this limitation. In spite of this, the performance on heterophilic graphs can still be poor. We propose a simple alternative method that exploits Truncated Singular Value Decomposition (TSVD) of topological structure and node features. Our approach achieves up to ~30% improvement in performance over state-of-the-art methods on heterophilic graphs. This work is an early investigation into methods that differ from aggregation based approaches. Our experimental results suggest that it might be important to explore other alternatives to aggregation methods for heterophilic setting.
    Neural Networks for Dengue Prediction: A Systematic Review. (arXiv:2106.12905v1 [cs.LG])
    (2 min) Due to a lack of treatments and universal vaccine, early forecasts of Dengue are an important tool for disease control. Neural networks are powerful predictive models that have made contributions to many areas of public health. In this systematic review, we provide an introduction to the neural networks relevant to Dengue forecasting and review their applications in the literature. The objective is to help inform model design for future work. Following the PRISMA guidelines, we conduct a systematic search of studies that use neural networks to forecast Dengue in human populations. We summarize the relative performance of neural networks and comparator models, model architectures and hyper-parameters, as well as choices of input features. Nineteen papers were included. Most studies implement shallow neural networks using historical Dengue incidence and meteorological input features. Prediction horizons tend to be short. Building on the strengths of neural networks, most studies use granular observations at the city or sub-national level. Performance of neural networks relative to comparators such as Support Vector Machines varies across study contexts. The studies suggest that neural networks can provide good predictions of Dengue and should be included in the set of candidate models. The use of convolutional, recurrent, or deep networks is relatively unexplored but offers promising avenues for further research, as does the use of a broader set of input features such as social media or mobile phone data.
    TagRuler: Interactive Tool for Span-Level Data Programming by Demonstration. (arXiv:2106.12767v1 [cs.CL])
    (2 min) Despite rapid developments in the field of machine learning research, collecting high-quality labels for supervised learning remains a bottleneck for many applications. This difficulty is exacerbated by the fact that state-of-the-art models for NLP tasks are becoming deeper and more complex, often increasing the amount of training data required even for fine-tuning. Weak supervision methods, including data programming, address this problem and reduce the cost of label collection by using noisy label sources for supervision. However, until recently, data programming was only accessible to users who knew how to program. To bridge this gap, the Data Programming by Demonstration framework was proposed to facilitate the automatic creation of labeling functions based on a few examples labeled by a domain expert. This framework has proven successful for generating high-accuracy labeling models for document classification. In this work, we extend the DPBD framework to span-level annotation tasks, arguably one of the most time-consuming NLP labeling tasks. We built a novel tool, TagRuler, that makes it easy for annotators to build span-level labeling functions without programming and encourages them to explore trade-offs between different labeling models and active learning strategies. We empirically demonstrated that an annotator could achieve a higher F1 score using the proposed tool compared to manual labeling for different span-level annotation tasks.
    COVID-19 cases prediction using regression and novel SSM model for non-converged countries. (arXiv:2106.12888v1 [cs.LG])
    (2 min) Anticipating the quantity of new associated or affirmed cases with novel coronavirus ailment 2019 (COVID-19) is critical in the counteraction and control of the COVID-19 flare-up. The new associated cases with COVID-19 information were gathered from 20 January 2020 to 21 July 2020. We filtered out the countries which are converging and used those for training the network. We utilized the SARIMAX, Linear regression model to anticipate new suspected COVID-19 cases for the countries which did not converge yet. We predict the curve of non-converged countries with the help of proposed Statistical SARIMAX model (SSM). We present new information investigation-based forecast results that can assist governments with planning their future activities and help clinical administrations to be more ready for what's to come. Our framework can foresee peak corona cases with an R-Squared value of 0.986 utilizing linear regression and fall of this pandemic at various levels for countries like India, US, and Brazil. We found that considering more countries for training degrades the prediction process as constraints vary from nation to nation. Thus, we expect that the outcomes referenced in this work will help individuals to better understand the possibilities of this pandemic.
    rSoccer: A Framework for Studying Reinforcement Learning in Small and Very Small Size Robot Soccer. (arXiv:2106.12895v1 [cs.LG])
    (2 min) Reinforcement learning is an active research area with a vast number of applications in robotics, and the RoboCup competition is an interesting environment for studying and evaluating reinforcement learning methods. A known difficulty in applying reinforcement learning to robotics is the high number of experience samples required, being the use of simulated environments for training the agents followed by transfer learning to real-world (sim-to-real) a viable path. This article introduces an open-source simulator for the IEEE Very Small Size Soccer and the Small Size League optimized for reinforcement learning experiments. We also propose a framework for creating OpenAI Gym environments with a set of benchmarks tasks for evaluating single-agent and multi-agent robot soccer skills. We then demonstrate the learning capabilities of two state-of-the-art reinforcement learning methods as well as their limitations in certain scenarios introduced in this framework. We believe this will make it easier for more teams to compete in these categories using end-to-end reinforcement learning approaches and further develop this research area.
    InFlow: Robust outlier detection utilizing Normalizing Flows. (arXiv:2106.12894v1 [cs.LG])
    (2 min) Normalizing flows are prominent deep generative models that provide tractable probability distributions and efficient density estimation. However, they are well known to fail while detecting Out-of-Distribution (OOD) inputs as they directly encode the local features of the input representations in their latent space. In this paper, we solve this overconfidence issue of normalizing flows by demonstrating that flows, if extended by an attention mechanism, can reliably detect outliers including adversarial attacks. Our approach does not require outlier data for training and we showcase the efficiency of our method for OOD detection by reporting state-of-the-art performance in diverse experimental settings. Code available at https://github.com/ComputationalRadiationPhysics/InFlow .
    Recurrent Neural Network from Adder's Perspective: Carry-lookahead RNN. (arXiv:2106.12901v1 [cs.LG])
    (2 min) The recurrent network architecture is a widely used model in sequence modeling, but its serial dependency hinders the computation parallelization, which makes the operation inefficient. The same problem was encountered in serial adder at the early stage of digital electronics. In this paper, we discuss the similarities between recurrent neural network (RNN) and serial adder. Inspired by carry-lookahead adder, we introduce carry-lookahead module to RNN, which makes it possible for RNN to run in parallel. Then, we design the method of parallel RNN computation, and finally Carry-lookahead RNN (CL-RNN) is proposed. CL-RNN takes advantages in parallelism and flexible receptive field. Through a comprehensive set of tests, we verify that CL-RNN can perform better than existing typical RNNs in sequence modeling tasks which are specially designed for RNNs.
    Reimagining GNN Explanations with ideas from Tabular Data. (arXiv:2106.12665v1 [cs.LG])
    (2 min) Explainability techniques for Graph Neural Networks still have a long way to go compared to explanations available for both neural and decision decision tree-based models trained on tabular data. Using a task that straddles both graphs and tabular data, namely Entity Matching, we comment on key aspects of explainability that are missing in GNN model explanations.
    Visualizing Graph Neural Networks with CorGIE: Corresponding a Graph to Its Embedding. (arXiv:2106.12839v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) are a class of powerful machine learning tools that model node relations for making predictions of nodes or links. GNN developers rely on quantitative metrics of the predictions to evaluate a GNN, but similar to many other neural networks, it is difficult for them to understand if the GNN truly learns characteristics of a graph as expected. We propose an approach to corresponding an input graph to its node embedding (aka latent space), a common component of GNNs that is later used for prediction. We abstract the data and tasks, and develop an interactive multi-view interface called CorGIE to instantiate the abstraction. As the key function in CorGIE, we propose the K-hop graph layout to show topological neighbors in hops and their clustering structure. To evaluate the functionality and usability of CorGIE, we present how to use CorGIE in two usage scenarios, and conduct a case study with two GNN experts.
    Hamiltonian-based Neural ODE Networks on the SE(3) Manifold For Dynamics Learning and Control. (arXiv:2106.12782v1 [cs.RO])
    (2 min) Accurate models of robot dynamics are critical for safe and stable control and generalization to novel operational conditions. Hand-designed models, however, may be insufficiently accurate, even after careful parameter tuning. This motivates the use of machine learning techniques to approximate the robot dynamics over a training set of state-control trajectories. The dynamics of many robots, including ground, aerial, and underwater vehicles, are described in terms of their SE(3) pose and generalized velocity, and satisfy conservation of energy principles. This paper proposes a Hamiltonian formulation over the SE(3) manifold of the structure of a neural ordinary differential equation (ODE) network to approximate the dynamics of a rigid body. In contrast to a black-box ODE network, our formulation guarantees total energy conservation by construction. We develop energy shaping and damping injection control for the learned, potentially under-actuated SE(3) Hamiltonian dynamics to enable a unified approach for stabilization and trajectory tracking with various platforms, including pendulum, rigid-body, and quadrotor systems.
    Numerical influence of ReLU'(0) on backpropagation. (arXiv:2106.12915v1 [cs.LG])
    (2 min) In theory, the choice of ReLU (0) in [0, 1] for a neural network has a negligible influence both on backpropagation and training. Yet, in the real world, 32 bits default precision combined with the size of deep learning problems makes it a hyperparameter of training methods. We investigate the importance of the value of ReLU (0) for several precision levels (16, 32, 64 bits), on various networks (fully connected, VGG, ResNet) and datasets (MNIST, CIFAR10, SVHN). We observe considerable variations of backpropagation outputs which occur around half of the time in 32 bits precision. The effect disappears with double precision, while it is systematic at 16 bits. For vanilla SGD training, the choice ReLU (0) = 0 seems to be the most efficient. We also evidence that reconditioning approaches as batch-norm or ADAM tend to buffer the influence of ReLU (0)'s value. Overall, the message we want to convey is that algorithmic differentiation of nonsmooth problems potentially hides parameters that could be tuned advantageously.
    Encoding Involutory Invariance in Neural Networks. (arXiv:2106.12891v1 [cs.LG])
    (2 min) In certain situations, Neural Networks (NN) are trained upon data that obey underlying physical symmetries. However, it is not guaranteed that NNs will obey the underlying symmetry unless embedded in the network structure. In this work, we explore a special kind of symmetry where functions are invariant with respect to involutory linear/affine transformations up to parity $p=\pm 1$. We develop mathematical theorems and propose NN architectures that ensure invariance and universal approximation properties. Numerical experiments indicate that the proposed models outperform baseline networks while respecting the imposed symmetry. An adaption of our technique to convolutional NN classification tasks for datasets with inherent horizontal/vertical reflection symmetry has also been proposed.
    Learnt Sparsification for Interpretable Graph Neural Networks. (arXiv:2106.12920v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) have achieved great success on various tasks and fields that require relational modeling. GNNs aggregate node features using the graph structure as inductive biases resulting in flexible and powerful models. However, GNNs remain hard to interpret as the interplay between node features and graph structure is only implicitly learned. In this paper, we propose a novel method called Kedge for explicitly sparsifying the underlying graph by removing unnecessary neighbors. Our key idea is based on a tractable method for sparsification using the Hard Kumaraswamy distribution that can be used in conjugation with any GNN model. Kedge learns edge masks in a modular fashion trained with any GNN allowing for gradient based optimization in an end-to-end fashion. We demonstrate through extensive experiments that our model Kedge can prune a large proportion of the edges with only a minor effect on the test accuracy. Specifically, in the PubMed dataset, Kedge learns to drop more than 80% of the edges with an accuracy drop of merely 2% showing that graph structure has only a small contribution in comparison to node features. Finally, we also show that Kedge effectively counters the over-smoothing phenomena in deep GNNs by maintaining good task performance with increasing GNN layers.
    Label Disentanglement in Partition-based Extreme Multilabel Classification. (arXiv:2106.12751v1 [stat.ML])
    (2 min) Partition-based methods are increasingly-used in extreme multi-label classification (XMC) problems due to their scalability to large output spaces (e.g., millions or more). However, existing methods partition the large label space into mutually exclusive clusters, which is sub-optimal when labels have multi-modality and rich semantics. For instance, the label "Apple" can be the fruit or the brand name, which leads to the following research question: can we disentangle these multi-modal labels with non-exclusive clustering tailored for downstream XMC tasks? In this paper, we show that the label assignment problem in partition-based XMC can be formulated as an optimization problem, with the objective of maximizing precision rates. This leads to an efficient algorithm to form flexible and overlapped label clusters, and a method that can alternatively optimizes the cluster assignments and the model parameters for partition-based XMC. Experimental results on synthetic and real datasets show that our method can successfully disentangle multi-modal labels, leading to state-of-the-art (SOTA) results on four XMC benchmarks.
    Alternative Microfoundations for Strategic Classification. (arXiv:2106.12705v1 [cs.LG])
    (2 min) When reasoning about strategic behavior in a machine learning context it is tempting to combine standard microfoundations of rational agents with the statistical decision theory underlying classification. In this work, we argue that a direct combination of these standard ingredients leads to brittle solution concepts of limited descriptive and prescriptive value. First, we show that rational agents with perfect information produce discontinuities in the aggregate response to a decision rule that we often do not observe empirically. Second, when any positive fraction of agents is not perfectly strategic, desirable stable points -- where the classifier is optimal for the data it entails -- cease to exist. Third, optimal decision rules under standard microfoundations maximize a measure of negative externality known as social burden within a broad class of possible assumptions about agent behavior. Recognizing these limitations we explore alternatives to standard microfoundations for binary classification. We start by describing a set of desiderata that help navigate the space of possible assumptions about how agents respond to a decision rule. In particular, we analyze a natural constraint on feature manipulations, and discuss properties that are sufficient to guarantee the robust existence of stable points. Building on these insights, we then propose the noisy response model. Inspired by smoothed analysis and empirical observations, noisy response incorporates imperfection in the agent responses, which we show mitigates the limitations of standard microfoundations. Our model retains analytical tractability, leads to more robust insights about stable points, and imposes a lower social burden at optimality.
    The Stereotyping Problem in Collaboratively Filtered Recommender Systems. (arXiv:2106.12622v1 [cs.IR])
    (2 min) Recommender systems -- and especially matrix factorization-based collaborative filtering algorithms -- play a crucial role in mediating our access to online information. We show that such algorithms induce a particular kind of stereotyping: if preferences for a \textit{set} of items are anti-correlated in the general user population, then those items may not be recommended together to a user, regardless of that user's preferences and ratings history. First, we introduce a notion of \textit{joint accessibility}, which measures the extent to which a set of items can jointly be accessed by users. We then study joint accessibility under the standard factorization-based collaborative filtering framework, and provide theoretical necessary and sufficient conditions when joint accessibility is violated. Moreover, we show that these conditions can easily be violated when the users are represented by a single feature vector. To improve joint accessibility, we further propose an alternative modelling fix, which is designed to capture the diverse multiple interests of each user using a multi-vector representation. We conduct extensive experiments on real and simulated datasets, demonstrating the stereotyping problem with standard single-vector matrix factorization models.
    Adversarial Examples in Multi-Layer Random ReLU Networks. (arXiv:2106.12611v1 [cs.LG])
    (2 min) We consider the phenomenon of adversarial examples in ReLU networks with independent gaussian parameters. For networks of constant depth and with a large range of widths (for instance, it suffices if the width of each layer is polynomial in that of any other layer), small perturbations of input vectors lead to large changes of outputs. This generalizes results of Daniely and Schacham (2020) for networks of rapidly decreasing width and of Bubeck et al (2021) for two-layer networks. The proof shows that adversarial examples arise in these networks because the functions that they compute are very close to linear. Bottleneck layers in the network play a key role: the minimal width up to some point in the network determines scales and sensitivities of mappings computed up to that point. The main result is for networks with constant depth, but we also show that some constraint on depth is necessary for a result of this kind, because there are suitably deep networks that, with constant probability, compute a function that is close to constant.
    Machine Learning-based Orchestration of Containers: A Taxonomy and Future Directions. (arXiv:2106.12739v1 [cs.LG])
    (2 min) Containerization is a lightweight application virtualization technology, providing high environmental consistency, operating system distribution portability, and resource isolation. Existing mainstream cloud service providers have prevalently adopted container technologies in their distributed system infrastructures for automated application management. To handle the automation of deployment, maintenance, autoscaling, and networking of containerized applications, container orchestration is proposed as an essential research problem. However, the highly dynamic and diverse feature of cloud workloads and environments considerably raises the complexity of orchestration mechanisms. Machine learning algorithms are accordingly employed by container orchestration systems for behavior modelling and prediction of multi-dimensional performance metrics. Such insights could further improve the quality of resource provisioning decisions in response to the changing workloads under complex environments. In this paper, we present a comprehensive literature review of existing machine learning-based container orchestration approaches. Detailed taxonomies are proposed to classify the current researches by their common features. Moreover, the evolution of machine learning-based container orchestration technologies from the year 2016 to 2021 has been designed based on objectives and metrics. A comparative analysis of the reviewed techniques is conducted according to the proposed taxonomies, with emphasis on their key characteristics. Finally, various open research challenges and potential future directions are highlighted.
    Frequency Domain Convolutional Neural Network: Accelerated CNN for Large Diabetic Retinopathy Image Classification. (arXiv:2106.12736v1 [cs.CV])
    (2 min) The conventional spatial convolution layers in the Convolutional Neural Networks (CNNs) are computationally expensive at the point where the training time could take days unless the number of layers, the number of training images or the size of the training images are reduced. The image size of 256x256 pixels is commonly used for most of the applications of CNN, but this image size is too small for applications like Diabetic Retinopathy (DR) classification where the image details are important for accurate classification. This research proposed Frequency Domain Convolution (FDC) and Frequency Domain Pooling (FDP) layers which were built with RFFT, kernel initialization strategy, convolution artifact removal and Channel Independent Convolution (CIC) to replace the conventional convolution and pooling layers. The FDC and FDP layers are used to build a Frequency Domain Convolutional Neural Network (FDCNN) to accelerate the training of large images for DR classification. The Full FDC layer is an extension of the FDC layer to allow direct use in conventional CNNs, it is also used to modify the VGG16 architecture. FDCNN is shown to be at least 54.21% faster and 70.74% more memory efficient compared to an equivalent CNN architecture. The modified VGG16 architecture with Full FDC layer is reported to achieve a shorter training time and a higher accuracy at 95.63% compared to the original VGG16 architecture for DR classification.
    Meaningfully Explaining a Model's Mistakes. (arXiv:2106.12723v1 [cs.LG])
    (2 min) Understanding and explaining the mistakes made by trained models is critical to many machine learning objectives, such as improving robustness, addressing concept drift, and mitigating biases. However, this is often an ad hoc process that involves manually looking at the model's mistakes on many test samples and guessing at the underlying reasons for those incorrect predictions. In this paper, we propose a systematic approach, conceptual explanation scores (CES), that explains why a classifier makes a mistake on a particular test sample(s) in terms of human-understandable concepts (e.g. this zebra is misclassified as a dog because of faint stripes). We base CES on two prior ideas: counterfactual explanations and concept activation vectors, and validate our approach on well-known pretrained models, showing that it explains the models' mistakes meaningfully. We also train new models with intentional and known spurious correlations, which CES successfully identifies from a single misclassified test sample. The code for CES is publicly available and can easily be applied to new models.
    Sparse Flows: Pruning Continuous-depth Models. (arXiv:2106.12718v1 [cs.LG])
    (2 min) Continuous deep learning architectures enable learning of flexible probabilistic models for predictive modeling as neural ordinary differential equations (ODEs), and for generative modeling as continuous normalizing flows. In this work, we design a framework to decipher the internal dynamics of these continuous depth models by pruning their network architectures. Our empirical results suggest that pruning improves generalization for neural ODEs in generative modeling. Moreover, pruning finds minimal and efficient neural ODE representations with up to 98\% less parameters compared to the original network, without loss of accuracy. Finally, we show that by applying pruning we can obtain insightful information about the design of better neural ODEs.We hope our results will invigorate further research into the performance-size trade-offs of modern continuous-depth models.
    Distilling the Knowledge from Normalizing Flows. (arXiv:2106.12699v1 [cs.LG])
    (2 min) Normalizing flows are a powerful class of generative models demonstrating strong performance in several speech and vision problems. In contrast to other generative models, normalizing flows have tractable likelihoods and allow for stable training. However, they have to be carefully designed to represent invertible functions with efficient Jacobian determinant calculation. In practice, these requirements lead to overparameterized and sophisticated architectures that are inferior to alternative feed-forward models in terms of inference time and memory consumption. In this work, we investigate whether one can distill knowledge from flow-based models to more efficient alternatives. We provide a positive answer to this question by proposing a simple distillation approach and demonstrating its effectiveness on state-of-the-art conditional flow-based models for image super-resolution and speech synthesis.
    Long short-term relevance learning. (arXiv:2106.12694v1 [cs.LG])
    (2 min) To incorporate prior knowledge as well as measurement uncertainties in the traditional long short term memory (LSTM) neural networks, an efficient sparse Bayesian training algorithm is introduced to the network architecture. The proposed scheme automatically determines relevant neural connections and adapts accordingly, in contrast to the classical LSTM solution. Due to its flexibility, the new LSTM scheme is less prone to overfitting, and hence can approximate time dependent solutions by use of a smaller data set. On a structural nonlinear finite element application we show that the self-regulating framework does not require prior knowledge of a suitable network architecture and size, while ensuring satisfying accuracy at reasonable computational cost.
    Fairness via Representation Neutralization. (arXiv:2106.12674v1 [cs.LG])
    (2 min) Existing bias mitigation methods for DNN models primarily work on learning debiased encoders. This process not only requires a lot of instance-level annotations for sensitive attributes, it also does not guarantee that all fairness sensitive information has been removed from the encoder. To address these limitations, we explore the following research question: Can we reduce the discrimination of DNN models by only debiasing the classification head, even with biased representations as inputs? To this end, we propose a new mitigation technique, namely, Representation Neutralization for Fairness (RNF) that achieves fairness by debiasing only the task-specific classification head of DNN models. To this end, we leverage samples with the same ground-truth label but different sensitive attributes, and use their neutralized representations to train the classification head of the DNN model. The key idea of RNF is to discourage the classification head from capturing spurious correlation between fairness sensitive information in encoder representations with specific class labels. To address low-resource settings with no access to sensitive attribute annotations, we leverage a bias-amplified model to generate proxy annotations for sensitive attributes. Experimental results over several benchmark datasets demonstrate our RNF framework to effectively reduce discrimination of DNN models with minimal degradation in task-specific performance.
    Finite-Sample Analysis of Off-Policy TD-Learning via Generalized Bellman Operators. (arXiv:2106.12729v1 [cs.LG])
    (2 min) In temporal difference (TD) learning, off-policy sampling is known to be more practical than on-policy sampling, and by decoupling learning from data collection, it enables data reuse. It is known that policy evaluation (including multi-step off-policy importance sampling) has the interpretation of solving a generalized Bellman equation. In this paper, we derive finite-sample bounds for any general off-policy TD-like stochastic approximation algorithm that solves for the fixed-point of this generalized Bellman operator. Our key step is to show that the generalized Bellman operator is simultaneously a contraction mapping with respect to a weighted $\ell_p$-norm for each $p$ in $[1,\infty)$, with a common contraction factor. Off-policy TD-learning is known to suffer from high variance due to the product of importance sampling ratios. A number of algorithms (e.g. $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, Retrace$(\lambda)$, and $Q$-trace) have been proposed in the literature to address this issue. Our results immediately imply finite-sample bounds of these algorithms. In particular, we provide first-known finite-sample guarantees for $Q^\pi(\lambda)$, Tree-Backup$(\lambda)$, and Retrace$(\lambda)$, and improve the best known bounds of $Q$-trace in [19]. Moreover, we show the bias-variance trade-offs in each of these algorithms.
    Automated Agriculture Commodity Price Prediction System with Machine Learning Techniques. (arXiv:2106.12747v1 [cs.LG])
    (2 min) The intention of this research is to study and design an automated agriculture commodity price prediction system with novel machine learning techniques. Due to the increasing large amounts historical data of agricultural commodity prices and the need of performing accurate prediction of price fluctuations, the solution has largely shifted from statistical methods to machine learning area. However, the selection of proper set from historical data for forecasting still has limited consideration. On the other hand, when implementing machine learning techniques, finding a suitable model with optimal parameters for global solution, nonlinearity and avoiding curse of dimensionality are still biggest challenges, therefore machine learning strategies study are needed. In this research, we propose a web-based automated system to predict agriculture commodity price. In the two series experiments, five popular machine learning algorithms, ARIMA, SVR, Prophet, XGBoost and LSTM have been compared with large historical datasets in Malaysia and the most optimal algorithm, LSTM model with an average of 0.304 mean-square error has been selected as the prediction engine of the proposed system.
    Transformer-based unsupervised patient representation learning based on medical claims for risk stratification and analysis. (arXiv:2106.12658v1 [cs.LG])
    (2 min) The claims data, containing medical codes, services information, and incurred expenditure, can be a good resource for estimating an individual's health condition and medical risk level. In this study, we developed Transformer-based Multimodal AutoEncoder (TMAE), an unsupervised learning framework that can learn efficient patient representation by encoding meaningful information from the claims data. TMAE is motivated by the practical needs in healthcare to stratify patients into different risk levels for improving care delivery and management. Compared to previous approaches, TMAE is able to 1) model inpatient, outpatient, and medication claims collectively, 2) handle irregular time intervals between medical events, 3) alleviate the sparsity issue of the rare medical codes, and 4) incorporate medical expenditure information. We trained TMAE using a real-world pediatric claims dataset containing more than 600,000 patients and compared its performance with various approaches in two clustering tasks. Experimental results demonstrate that TMAE has superior performance compared to all baselines. Multiple downstream applications are also conducted to illustrate the effectiveness of our framework. The promising results confirm that the TMAE framework is scalable to large claims data and is able to generate efficient patient embeddings for risk stratification and analysis.
    Study of Robust Adaptive Beamforming Based on Low-Complexity DFT Spatial Sampling. (arXiv:2106.12663v1 [cs.IT])
    (2 min) In this paper, a novel and robust algorithm is proposed for adaptive beamforming based on the idea of reconstructing the autocorrelation sequence (ACS) of a random process from a set of measured data. This is obtained from the first column and the first row of the sample covariance matrix (SCM) after averaging along its diagonals. Then, the power spectrum of the correlation sequence is estimated using the discrete Fourier transform (DFT). The DFT coefficients corresponding to the angles within the noise-plus-interference region are used to reconstruct the noise-plus-interference covariance matrix (NPICM), while the desired signal covariance matrix (DSCM) is estimated by identifying and removing the noise-plus-interference component from the SCM. In particular, the spatial power spectrum of the estimated received signal is utilized to compute the correlation sequence corresponding to the noise-plus-interference in which the dominant DFT coefficient of the noise-plus-interference is captured. A key advantage of the proposed adaptive beamforming is that only little prior information is required. Specifically, an imprecise knowledge of the array geometry and of the angular sectors in which the interferences are located is needed. Simulation results demonstrate that compared with previous reconstruction-based beamformers, the proposed approach can achieve better overall performance in the case of multiple mismatches over a very large range of input signal-to-noise ratios.
    Best-Case Lower Bounds in Online Learning. (arXiv:2106.12688v1 [cs.LG])
    (2 min) Much of the work in online learning focuses on the study of sublinear upper bounds on the regret. In this work, we initiate the study of best-case lower bounds in online convex optimization, wherein we bound the largest improvement an algorithm can obtain relative to the single best action in hindsight. This problem is motivated by the goal of better understanding the adaptivity of a learning algorithm. Another motivation comes from fairness: it is known that best-case lower bounds are instrumental in obtaining algorithms for decision-theoretic online learning (DTOL) that satisfy a notion of group fairness. Our contributions are a general method to provide best-case lower bounds in Follow The Regularized Leader (FTRL) algorithms with time-varying regularizers, which we use to show that best-case lower bounds are of the same order as existing upper regret bounds: this includes situations with a fixed learning rate, decreasing learning rates, timeless methods, and adaptive gradient methods. In stark contrast, we show that the linearized version of FTRL can attain negative linear regret. Finally, in DTOL with two experts and binary predictions, we fully characterize the best-case sequences, which provides a finer understanding of the best-case lower bounds.
    Provably efficient machine learning for quantum many-body problems. (arXiv:2106.12627v1 [quant-ph])
    (2 min) Classical machine learning (ML) provides a potentially powerful approach to solving challenging quantum many-body problems in physics and chemistry. However, the advantages of ML over more traditional methods have not been firmly established. In this work, we prove that classical ML algorithms can efficiently predict ground state properties of gapped Hamiltonians in finite spatial dimensions, after learning from data obtained by measuring other Hamiltonians in the same quantum phase of matter. In contrast, under widely accepted complexity theory assumptions, classical algorithms that do not learn from data cannot achieve the same guarantee. We also prove that classical ML algorithms can efficiently classify a wide range of quantum phases of matter. Our arguments are based on the concept of a classical shadow, a succinct classical description of a many-body quantum state that can be constructed in feasible quantum experiments and be used to predict many properties of the state. Extensive numerical experiments corroborate our theoretical results in a variety of scenarios, including Rydberg atom systems, 2D random Heisenberg models, symmetry-protected topological phases, and topologically ordered phases.
    Deep Learning for Network Traffic Classification. (arXiv:2106.12693v1 [cs.NI])
    (2 min) Monitoring network traffic to identify content, services, and applications is an active research topic in network traffic control systems. While modern firewalls provide the capability to decrypt packets, this is not appealing for privacy advocates. Hence, identifying any information from encrypted traffic is a challenging task. Nonetheless, previous work has identified machine learning methods that may enable application and service identification. The process involves high level feature extraction from network packet data then training a robust machine learning classifier for traffic identification. We propose a classification technique using an ensemble of deep learning architectures on packet, payload, and inter-arrival time sequences. To our knowledge, this is the first time such deep learning architectures have been applied to the Server Name Indication (SNI) classification problem. Our ensemble model beats the state of the art machine learning methods and our up-to-date model can be found on github: \url{https://github.com/niloofarbayat/NetworkClassification}
    Machine learning structure preserving brackets for forecasting irreversible processes. (arXiv:2106.12619v1 [physics.comp-ph])
    (2 min) Forecasting of time-series data requires imposition of inductive biases to obtain predictive extrapolation, and recent works have imposed Hamiltonian/Lagrangian form to preserve structure for systems with reversible dynamics. In this work we present a novel parameterization of dissipative brackets from metriplectic dynamical systems appropriate for learning irreversible dynamics with unknown a priori model form. The process learns generalized Casimirs for energy and entropy guaranteed to be conserved and nondecreasing, respectively. Furthermore, for the case of added thermal noise, we guarantee exact preservation of a fluctuation-dissipation theorem, ensuring thermodynamic consistency. We provide benchmarks for dissipative systems demonstrating learned dynamics are more robust and generalize better than either "black-box" or penalty-based approaches.
    Handwritten Digit Recognition using Machine and Deep Learning Algorithms. (arXiv:2106.12614v1 [cs.CV])
    (2 min) The reliance of humans over machines has never been so high such that from object classification in photographs to adding sound to silent movies everything can be performed with the help of deep learning and machine learning algorithms. Likewise, Handwritten text recognition is one of the significant areas of research and development with a streaming number of possibilities that could be attained. Handwriting recognition (HWR), also known as Handwritten Text Recognition (HTR), is the ability of a computer to receive and interpret intelligible handwritten input from sources such as paper documents, photographs, touch-screens and other devices [1]. Apparently, in this paper, we have performed handwritten digit recognition with the help of MNIST datasets using Support Vector Machines (SVM), Multi-Layer Perceptron (MLP) and Convolution Neural Network (CNN) models. Our main objective is to compare the accuracy of the models stated above along with their execution time to get the best possible model for digit recognition.
    Deep Fake Detection: Survey of Facial Manipulation Detection Solutions. (arXiv:2106.12605v1 [cs.CV])
    (2 min) Deep Learning as a field has been successfully used to solve a plethora of complex problems, the likes of which we could not have imagined a few decades back. But as many benefits as it brings, there are still ways in which it can be used to bring harm to our society. Deep fakes have been proven to be one such problem, and now more than ever, when any individual can create a fake image or video simply using an application on the smartphone, there need to be some countermeasures, with which we can detect if the image or video is a fake or real and dispose of the problem threatening the trustworthiness of online information. Although the Deep fakes created by neural networks, may seem to be as real as a real image or video, it still leaves behind spatial and temporal traces or signatures after moderation, these signatures while being invisible to a human eye can be detected with the help of a neural network trained to specialize in Deep fake detection. In this paper, we analyze several such states of the art neural networks (MesoNet, ResNet-50, VGG-19, and Xception Net) and compare them against each other, to find an optimal solution for various scenarios like real-time deep fake detection to be deployed in online social media platforms where the classification should be made as fast as possible or for a small news agency where the classification need not be in real-time but requires utmost accuracy.
    DP-SGD vs PATE: Which Has Less Disparate Impact on Model Accuracy?. (arXiv:2106.12576v1 [cs.LG])
    (2 min) Recent advances in differentially private deep learning have demonstrated that application of differential privacy, specifically the DP-SGD algorithm, has a disparate impact on different sub-groups in the population, which leads to a significantly high drop-in model utility for sub-populations that are under-represented (minorities), compared to well-represented ones. In this work, we aim to compare PATE, another mechanism for training deep learning models using differential privacy, with DP-SGD in terms of fairness. We show that PATE does have a disparate impact too, however, it is much less severe than DP-SGD. We draw insights from this observation on what might be promising directions in achieving better fairness-privacy trade-offs.

2021-06-24

  • cs.CL updates on arXiv.org

    PO-EMO: Conceptualization, Annotation, and Modeling of Aesthetic Emotions in German and English Poetry. (arXiv:2003.07723v3 [cs.CL] UPDATED)
    (2 min) Most approaches to emotion analysis of social media, literature, news, and other domains focus exclusively on basic emotion categories as defined by Ekman or Plutchik. However, art (such as literature) enables engagement in a broader range of more complex and subtle emotions. These have been shown to also include mixed emotional responses. We consider emotions in poetry as they are elicited in the reader, rather than what is expressed in the text or intended by the author. Thus, we conceptualize a set of aesthetic emotions that are predictive of aesthetic appreciation in the reader, and allow the annotation of multiple labels per line to capture mixed emotions within their context. We evaluate this novel setting in an annotation experiment both with carefully trained experts and via crowdsourcing. Our annotation with experts leads to an acceptable agreement of kappa = .70, resulting in a consistent dataset for future large scale analysis. Finally, we conduct first emotion classification experiments based on BERT, showing that identifying aesthetic emotions is challenging in our data, with up to .52 F1-micro on the German subset. Data and resources are available at https://github.com/tnhaider/poetry-emotion
    Mixtures of Deep Neural Experts for Automated Speech Scoring. (arXiv:2106.12475v1 [cs.CL])
    (2 min) The paper copes with the task of automatic assessment of second language proficiency from the language learners' spoken responses to test prompts. The task has significant relevance to the field of computer assisted language learning. The approach presented in the paper relies on two separate modules: (1) an automatic speech recognition system that yields text transcripts of the spoken interactions involved, and (2) a multiple classifier system based on deep learners that ranks the transcripts into proficiency classes. Different deep neural network architectures (both feed-forward and recurrent) are specialized over diverse representations of the texts in terms of: a reference grammar, the outcome of probabilistic language models, several word embeddings, and two bag-of-word models. Combination of the individual classifiers is realized either via a probabilistic pseudo-joint model, or via a neural mixture of experts. Using the data of the third Spoken CALL Shared Task challenge, the highest values to date were obtained in terms of three popular evaluation metrics.
    Predicting Legal Proceedings Status: Approaches Based on Sequential Text Data. (arXiv:2003.11561v4 [cs.CL] UPDATED)
    (2 min) The objective of this paper is to develop predictive models to classify Brazilian legal proceedings in three possible classes of status: (i) archived proceedings, (ii) active proceedings, and (iii) suspended proceedings. This problem's resolution is intended to assist public and private institutions in managing large portfolios of legal proceedings, providing gains in scale and efficiency. In this paper, legal proceedings are made up of sequences of short texts called "motions." We combined several natural language processing (NLP) and machine learning techniques to solve the problem. Although working with Portuguese NLP, which can be challenging due to lack of resources, our approaches performed remarkably well in the classification task, achieving maximum accuracy of .93 and top average F1 Scores of .89 (macro) and .93 (weighted). Furthermore, we could extract and interpret the patterns learned by one of our models besides quantifying how those patterns relate to the classification task. The interpretability step is important among machine learning legal applications and gives us an exciting insight into how black-box models make decisions.
    Residual Energy-Based Models for End-to-End Speech Recognition. (arXiv:2103.14152v2 [eess.AS] UPDATED)
    (2 min) End-to-end models with auto-regressive decoders have shown impressive results for automatic speech recognition (ASR). These models formulate the sequence-level probability as a product of the conditional probabilities of all individual tokens given their histories. However, the performance of locally normalised models can be sub-optimal because of factors such as exposure bias. Consequently, the model distribution differs from the underlying data distribution. In this paper, the residual energy-based model (R-EBM) is proposed to complement the auto-regressive ASR model to close the gap between the two distributions. Meanwhile, R-EBMs can also be regarded as utterance-level confidence estimators, which may benefit many downstream tasks. Experiments on a 100hr LibriSpeech dataset show that R-EBMs can reduce the word error rates (WERs) by 8.2%/6.7% while improving areas under precision-recall curves of confidence scores by 12.6%/28.4% on test-clean/test-other sets. Furthermore, on a state-of-the-art model using self-supervised learning (wav2vec 2.0), R-EBMs still significantly improves both the WER and confidence estimation performance.
    Reducing Spelling Inconsistencies in Code-Switching ASR using Contextualized CTC Loss. (arXiv:2005.07920v3 [eess.AS] UPDATED)
    (2 min) Code-Switching (CS) remains a challenge for Automatic Speech Recognition (ASR), especially character-based models. With the combined choice of characters from multiple languages, the outcome from character-based models suffers from phoneme duplication, resulting in language-inconsistent spellings. We propose Contextualized Connectionist Temporal Classification (CCTC) loss to encourage spelling consistencies of a character-based non-autoregressive ASR which allows for faster inference. The CCTC loss conditions the main prediction on the predicted contexts to ensure language consistency in the spellings. In contrast to existing CTC-based approaches, CCTC loss does not require frame-level alignments, since the context ground truth is obtained from the model's estimated path. Compared to the same model trained with regular CTC loss, our method consistently improved the ASR performance on both CS and monolingual corpora.
    Dialectal Layers in West Iranian: a Hierarchical Dirichlet Process Approach to Linguistic Relationships. (arXiv:2001.05297v2 [cs.CL] UPDATED)
    (2 min) This paper addresses a series of complex and unresolved issues in the historical phonology of West Iranian languages. The West Iranian languages (Persian, Kurdish, Balochi, and other languages) display a high degree of non-Lautgesetzlich behavior. Most of this irregularity is undoubtedly due to language contact; we argue, however, that an oversimplified view of the processes at work has prevailed in the literature on West Iranian dialectology, with specialists assuming that deviations from an expected outcome in a given non-Persian language are due to lexical borrowing from some chronological stage of Persian. It is demonstrated that this qualitative approach yields at times problematic conclusions stemming from the lack of explicit probabilistic inferences regarding the distribution of the data: Persian may not be the sole donor language; additionally, borrowing at the lexical level is not always the mechanism that introduces irregularity. In many cases, the possibility that West Iranian languages show different reflexes in different conditioning environments remains under-explored. We employ a novel Bayesian approach designed to overcome these problems and tease apart the different determinants of irregularity in patterns of West Iranian sound change. Our methodology allows us to provisionally resolve a number of outstanding questions in the literature on West Iranian dialectology concerning the dialectal affiliation of certain sound changes. We outline future directions for work of this sort.
    Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding. (arXiv:2106.12566v1 [cs.LG])
    (2 min) The attention module, which is a crucial component in Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since in many state-of-the-art models, relative positional encoding is used as default, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of the kernelized attention. Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n\log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than standard Transformer in the long-sequence regime.
    Classifying Textual Data with Pre-trained Vision Models through Transfer Learning and Data Transformations. (arXiv:2106.12479v1 [cs.CL])
    (2 min) Knowledge is acquired by humans through experience, and no boundary is set between the kinds of knowledge or skill levels we can achieve on different tasks at the same time. When it comes to Neural Networks, that is not the case, the major breakthroughs in the field are extremely task and domain specific. Vision and language are dealt with in separate manners, using separate methods and different datasets. In this work, we propose to use knowledge acquired by benchmark Vision Models which are trained on ImageNet to help a much smaller architecture learn to classify text. After transforming the textual data contained in the IMDB dataset to gray scale images. An analysis of different domains and the Transfer Learning method is carried out. Despite the challenge posed by the very different datasets, promising results are achieved. The main contribution of this work is a novel approach which links large pretrained models on both language and vision to achieve state-of-the-art results in different sub-fields from the original task. Without needing high compute capacity resources. Specifically, Sentiment Analysis is achieved after transferring knowledge between vision and language models. BERT embeddings are transformed into grayscale images, these images are then used as training examples for pretrained vision models such as VGG16 and ResNet Index Terms: Natural language, Vision, BERT, Transfer Learning, CNN, Domain Adaptation.
    TextSETTR: Few-Shot Text Style Extraction and Tunable Targeted Restyling. (arXiv:2010.03802v3 [cs.CL] UPDATED)
    (2 min) We present a novel approach to the problem of text style transfer. Unlike previous approaches requiring style-labeled training data, our method makes use of readily-available unlabeled text by relying on the implicit connection in style between adjacent sentences, and uses labeled data only at inference time. We adapt T5 (Raffel et al., 2020), a strong pretrained text-to-text model, to extract a style vector from text and use it to condition the decoder to perform style transfer. As our label-free training results in a style vector space encoding many facets of style, we recast transfers as "targeted restyling" vector operations that adjust specific attributes of the input while preserving others. We demonstrate that training on unlabeled Amazon reviews data results in a model that is competitive on sentiment transfer, even compared to models trained fully on labeled data. Furthermore, applying our novel method to a diverse corpus of unlabeled web text results in a single model capable of transferring along multiple dimensions of style (dialect, emotiveness, formality, politeness, sentiment) despite no additional training and using only a handful of exemplars at inference time.
    Pre-trained Models for Natural Language Processing: A Survey. (arXiv:2003.08271v4 [cs.CL] UPDATED)
    (2 min) Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy with four perspectives. Next, we describe how to adapt the knowledge of PTMs to the downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is purposed to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.
    End-to-End Lexically Constrained Machine Translation for Morphologically Rich Languages. (arXiv:2106.12398v1 [cs.CL])
    (2 min) Lexically constrained machine translation allows the user to manipulate the output sentence by enforcing the presence or absence of certain words and phrases. Although current approaches can enforce terms to appear in the translation, they often struggle to make the constraint word form agree with the rest of the generated output. Our manual analysis shows that 46% of the errors in the output of a baseline constrained model for English to Czech translation are related to agreement. We investigate mechanisms to allow neural machine translation to infer the correct word inflection given lemmatized constraints. In particular, we focus on methods based on training the model with constraints provided as part of the input sequence. Our experiments on the English-Czech language pair show that this approach improves the translation of constrained terms in both automatic and manual evaluation by reducing errors in agreement. Our approach thus eliminates inflection errors, without introducing new errors or decreasing the overall quality of the translation.
    Ad Text Classification with Transformer-Based Natural Language Processing Methods. (arXiv:2106.10899v2 [cs.CL] UPDATED)
    (2 min) In this study, a natural language processing-based (NLP-based) method is proposed for the sector-wise automatic classification of ad texts created on online advertising platforms. Our data set consists of approximately 21,000 labeled advertising texts from 12 different sectors. In the study, the Bidirectional Encoder Representations from Transformers (BERT) model, which is a transformer-based language model that is recently used in fields such as text classification in the natural language processing literature, was used. The classification efficiencies obtained using a pre-trained BERT model for the Turkish language are shown in detail.
    STEP-EZ: Syntax Tree guided semantic ExPlanation for Explainable Zero-shot modeling of clinical depression symptoms from text. (arXiv:2106.10928v2 [cs.CL] UPDATED)
    (2 min) We focus on exploring various approaches of Zero-Shot Learning (ZSL) and their explainability for a challenging yet important supervised learning task notorious for training data scarcity, i.e. Depression Symptoms Detection (DSD) from text. We start with a comprehensive synthesis of different components of our ZSL modeling and analysis of our ground truth samples and Depression symptom clues curation process with the help of a practicing clinician. We next analyze the accuracy of various state-of-the-art ZSL models and their potential enhancements for our task. Further, we sketch a framework for the use of ZSL for hierarchical text-based explanation mechanism, which we call, Syntax Tree-Guided Semantic Explanation (STEP). Finally, we summarize experiments from which we conclude that we can use ZSL models and achieve reasonable accuracy and explainability, measured by a proposed Explainability Index (EI). This work is, to our knowledge, the first work to exhaustively explore the efficacy of ZSL models for DSD task, both in terms of accuracy and explainability.
    BERT Goes Shopping: Comparing Distributional Models for Product Representations. (arXiv:2012.09807v2 [cs.CL] UPDATED)
    (2 min) Word embeddings (e.g., word2vec) have been applied successfully to eCommerce products through~\textit{prod2vec}. Inspired by the recent performance improvements on several NLP tasks brought by contextualized embeddings, we propose to transfer BERT-like architectures to eCommerce: our model -- ~\textit{Prod2BERT} -- is trained to generate representations of products through masked session modeling. Through extensive experiments over multiple shops, different tasks, and a range of design choices, we systematically compare the accuracy of~\textit{Prod2BERT} and~\textit{prod2vec} embeddings: while~\textit{Prod2BERT} is found to be superior in several scenarios, we highlight the importance of resources and hyperparameters in the best performing models. Finally, we provide guidelines to practitioners for training embeddings under a variety of computational and data constraints.
    Deep Multi-Task Model for Sarcasm Detection and Sentiment Analysis in Arabic Language. (arXiv:2106.12488v1 [cs.CL])
    (2 min) The prominence of figurative language devices, such as sarcasm and irony, poses serious challenges for Arabic Sentiment Analysis (SA). While previous research works tackle SA and sarcasm detection separately, this paper introduces an end-to-end deep Multi-Task Learning (MTL) model, allowing knowledge interaction between the two tasks. Our MTL model's architecture consists of a Bidirectional Encoder Representation from Transformers (BERT) model, a multi-task attention interaction module, and two task classifiers. The overall obtained results show that our proposed model outperforms its single-task counterparts on both SA and sarcasm detection sub-tasks.
    Reinforcement Learning-based Dialogue Guided Event Extraction to Exploit Argument Relations. (arXiv:2106.12384v1 [cs.CL])
    (2 min) Event extraction is a fundamental task for natural language processing. Finding the roles of event arguments like event participants is essential for event extraction. However, doing so for real-life event descriptions is challenging because an argument's role often varies in different contexts. While the relationship and interactions between multiple arguments are useful for settling the argument roles, such information is largely ignored by existing approaches. This paper presents a better approach for event extraction by explicitly utilizing the relationships of event arguments. We achieve this through a carefully designed task-oriented dialogue system. To model the argument relation, we employ reinforcement learning and incremental learning to extract multiple arguments via a multi-turned, iterative process. Our approach leverages knowledge of the already extracted arguments of the same sentence to determine the role of arguments that would be difficult to decide individually. It then uses the newly obtained information to improve the decisions of previously extracted arguments. This two-way feedback process allows us to exploit the argument relations to effectively settle argument roles, leading to better sentence understanding and event extraction. Experimental results show that our approach consistently outperforms seven state-of-the-art event extraction methods for the classification of events and argument role and argument identification.
    BERT-based Multi-Task Model for Country and Province Level Modern Standard Arabic and Dialectal Arabic Identification. (arXiv:2106.12495v1 [cs.CL])
    (2 min) Dialect and standard language identification are crucial tasks for many Arabic natural language processing applications. In this paper, we present our deep learning-based system, submitted to the second NADI shared task for country-level and province-level identification of Modern Standard Arabic (MSA) and Dialectal Arabic (DA). The system is based on an end-to-end deep Multi-Task Learning (MTL) model to tackle both country-level and province-level MSA/DA identification. The latter MTL model consists of a shared Bidirectional Encoder Representation Transformers (BERT) encoder, two task-specific attention layers, and two classifiers. Our key idea is to leverage both the task-discriminative and the inter-task shared features for country and province MSA/DA identification. The obtained results show that our MTL model outperforms single-task models on most subtasks.
    PALRACE: Reading Comprehension Dataset with Human Data and Labeled Rationales. (arXiv:2106.12373v1 [cs.CL])
    (2 min) Pre-trained language models achieves high performance on machine reading comprehension (MRC) tasks but the results are hard to explain. An appealing approach to make models explainable is to provide rationales for its decision. To facilitate supervised learning of human rationales, here we present PALRACE (Pruned And Labeled RACE), a new MRC dataset with human labeled rationales for 800 passages selected from the RACE dataset. We further classified the question to each passage into 6 types. Each passage was read by at least 26 participants, who labeled their rationales to answer the question. Besides, we conducted a rationale evaluation session in which participants were asked to answering the question solely based on labeled rationales, confirming that the labeled rationales were of high quality and can sufficiently support question answering.
    CharacterChat: Supporting the Creation of Fictional Characters through Conversation and Progressive Manifestation with a Chatbot. (arXiv:2106.12314v1 [cs.HC])
    (2 min) We present CharacterChat, a concept and chatbot to support writers in creating fictional characters. Concretely, writers progressively turn the bot into their imagined character through conversation. We iteratively developed CharacterChat in a user-centred approach, starting with a survey on character creation with writers (N=30), followed by two qualitative user studies (N=7 and N=8). Our prototype combines two modes: (1) Guided prompts help writers define character attributes (e.g. User: "Your name is Jane."), including suggestions for attributes (e.g. Bot: "What is my main motivation?") and values, realised as a rule-based system with a concept network. (2) Open conversation with the chatbot helps writers explore their character and get inspiration, realised with a language model that takes into account the defined character attributes. Our user studies reveal benefits particularly for early stages of character creation, and challenges due to limited conversational capabilities. We conclude with lessons learned and ideas for future work.
    Zero-Shot Joint Modeling of Multiple Spoken-Text-Style Conversion Tasks using Switching Tokens. (arXiv:2106.12131v1 [cs.CL])
    (2 min) In this paper, we propose a novel spoken-text-style conversion method that can simultaneously execute multiple style conversion modules such as punctuation restoration and disfluency deletion without preparing matched datasets. In practice, transcriptions generated by automatic speech recognition systems are not highly readable because they often include many disfluencies and do not include punctuation marks. To improve their readability, multiple spoken-text-style conversion modules that individually model a single conversion task are cascaded because matched datasets that simultaneously handle multiple conversion tasks are often unavailable. However, the cascading is unstable against the order of tasks because of the chain of conversion errors. Besides, the computation cost of the cascading must be higher than the single conversion. To execute multiple conversion tasks simultaneously without preparing matched datasets, our key idea is to distinguish individual conversion tasks using the on-off switch. In our proposed zero-shot joint modeling, we switch the individual tasks using multiple switching tokens, enabling us to utilize a zero-shot learning approach to executing simultaneous conversions. Our experiments on joint modeling of disfluency deletion and punctuation restoration demonstrate the effectiveness of our method.
    NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs. (arXiv:2106.12144v1 [cs.CL])
    (2 min) Conventional representation learning algorithms for knowledge graphs (KG) map each entity to a unique embedding vector. Such a shallow lookup results in a linear growth of memory consumption for storing the embedding matrix and incurs high computational costs when working with real-world KGs. Drawing parallels with subword tokenization commonly used in NLP, we explore the landscape of more parameter-efficient node embedding strategies with possibly sublinear memory requirements. To this end, we propose NodePiece, an anchor-based approach to learn a fixed-size entity vocabulary. In NodePiece, a vocabulary of subword/sub-entity units is constructed from anchor nodes in a graph with known relation types. Given such a fixed-size vocabulary, it is possible to bootstrap an encoding and embedding for any entity, including those unseen during training. Experiments show that NodePiece performs competitively in node classification, link prediction, and relation prediction tasks while retaining less than 10% of explicit nodes in a graph as anchors and often having 10x fewer parameters.
    On Positivity Bias in Negative Reviews. (arXiv:2106.12056v1 [cs.CL])
    (2 min) Prior work has revealed that positive words occur more frequently than negative words in human expressions, which is typically attributed to positivity bias, a tendency for people to report positive views of reality. But what about the language used in negative reviews? Consistent with prior work, we show that English negative reviews tend to contain more positive words than negative words, using a variety of datasets. We reconcile this observation with prior findings on the pragmatics of negation, and show that negations are commonly associated with positive words in negative reviews. Furthermore, in negative reviews, the majority of sentences with positive words express negative opinions based on sentiment classifiers, indicating some form of negation.
    Recognising Biomedical Names: Challenges and Solutions. (arXiv:2106.12230v1 [cs.CL])
    (2 min) The growth rate in the amount of biomedical documents is staggering. Unlocking information trapped in these documents can enable researchers and practitioners to operate confidently in the information world. Biomedical NER, the task of recognising biomedical names, is usually employed as the first step of the NLP pipeline. Standard NER models, based on sequence tagging technique, are good at recognising short entity mentions in the generic domain. However, there are several open challenges of applying these models to recognise biomedical names: 1) Biomedical names may contain complex inner structure (discontinuity and overlapping) which cannot be recognised using standard sequence tagging technique; 2) The training of NER models usually requires large amount of labelled data, which are difficult to obtain in the biomedical domain; and, 3) Commonly used language representation models are pre-trained on generic data; a domain shift therefore exists between these models and target biomedical data. To deal with these challenges, we explore several research directions and make the following contributions: 1) we propose a transition-based NER model which can recognise discontinuous mentions; 2) We develop a cost-effective approach that nominates the suitable pre-training data; and, 3) We design several data augmentation methods for NER. Our contributions have obvious practical implications, especially when new biomedical applications are needed. Our proposed data augmentation methods can help the NER model achieve decent performance, requiring only a small amount of labelled data. Our investigation regarding selecting pre-training data can improve the model by incorporating language representation models, which are pre-trained using in-domain data. Finally, our proposed transition-based NER model can further improve the performance by recognising discontinuous mentions.
    ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences. (arXiv:2106.12027v1 [cs.CL])
    (2 min) Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves comparable performance as two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.
    A Simple and Practical Approach to Improve Misspellings in OCR Text. (arXiv:2106.12030v1 [cs.CL])
    (2 min) The focus of our paper is the identification and correction of non-word errors in OCR text. Such errors may be the result of incorrect insertion, deletion, or substitution of a character, or the transposition of two adjacent characters within a single word. Or, it can be the result of word boundary problems that lead to run-on errors and incorrect-split errors. The traditional N-gram correction methods can handle single-word errors effectively. However, they show limitations when dealing with split and merge errors. In this paper, we develop an unsupervised method that can handle both errors. The method we develop leads to a sizable improvement in the correction rates. This tutorial paper addresses very difficult word correction problems - namely incorrect run-on and split errors - and illustrates what needs to be considered when addressing such problems. We outline a possible approach and assess its success on a limited study.
    It's All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in Commonsense Reasoning. (arXiv:2106.12066v1 [cs.CL])
    (2 min) Commonsense reasoning is one of the key problems in natural language processing, but the relative scarcity of labeled data holds back the progress for languages other than English. Pretrained cross-lingual models are a source of powerful language-agnostic representations, yet their inherent reasoning capabilities are still actively studied. In this work, we design a simple approach to commonsense reasoning which trains a linear classifier with weights of multi-head attention as features. To evaluate this approach, we create a multilingual Winograd Schema corpus by processing several datasets from prior work within a standardized pipeline and measure cross-lingual generalization ability in terms of out-of-sample performance. The method performs competitively with recent supervised and unsupervised approaches for commonsense reasoning, even when applied to other languages in a zero-shot manner. Also, we demonstrate that most of the performance is given by the same small subset of attention heads for all studied languages, which provides evidence of universal reasoning capabilities in multilingual encoders.
    Structured in Space, Randomized in Time: Leveraging Dropout in RNNs for Efficient Training. (arXiv:2106.12089v1 [cs.LG])
    (2 min) Recurrent Neural Networks (RNNs), more specifically their Long Short-Term Memory (LSTM) variants, have been widely used as a deep learning tool for tackling sequence-based learning tasks in text and speech. Training of such LSTM applications is computationally intensive due to the recurrent nature of hidden state computation that repeats for each time step. While sparsity in Deep Neural Nets has been widely seen as an opportunity for reducing computation time in both training and inference phases, the usage of non-ReLU activation in LSTM RNNs renders the opportunities for such dynamic sparsity associated with neuron activation and gradient values to be limited or non-existent. In this work, we identify dropout induced sparsity for LSTMs as a suitable mode of computation reduction. Dropout is a widely used regularization mechanism, which randomly drops computed neuron values during each iteration of training. We propose to structure dropout patterns, by dropping out the same set of physical neurons within a batch, resulting in column (row) level hidden state sparsity, which are well amenable to computation reduction at run-time in general-purpose SIMD hardware as well as systolic arrays. We conduct our experiments for three representative NLP tasks: language modelling on the PTB dataset, OpenNMT based machine translation using the IWSLT De-En and En-Vi datasets, and named entity recognition sequence labelling using the CoNLL-2003 shared task. We demonstrate that our proposed approach can be used to translate dropout-based computation reduction into reduced training time, with improvement ranging from 1.23x to 1.64x, without sacrificing the target metric.
    On the Diversity and Limits of Human Explanations. (arXiv:2106.11988v1 [cs.CL])
    (2 min) A growing effort in NLP aims to build datasets of human explanations. However, the term explanation encompasses a broad range of notions, each with different properties and ramifications. Our goal is to provide an overview of diverse types of explanations and human limitations, and discuss implications for collecting and using explanations in NLP. Inspired by prior work in psychology and cognitive sciences, we group existing human explanations in NLP into three categories: proximal mechanism, evidence, and procedure. These three types differ in nature and have implications for the resultant explanations. For instance, procedure is not considered explanations in psychology and connects with a rich body of work on learning from instructions. The diversity of explanations is further evidenced by proxy questions that are needed for annotators to interpret and answer open-ended why questions. Finally, explanations may require different, often deeper, understandings than predictions, which casts doubt on whether humans can provide useful explanations in some tasks.
  • cs.CV updates on arXiv.org

    Variational Quanvolutional Neural Networks with enhanced image encoding. (arXiv:2106.07327v2 [cs.CV] UPDATED)
    (2 min) Image classification is an important task in various machine learning applications. In recent years, a number of classification methods based on quantum machine learning and different quantum image encoding techniques have been proposed. In this paper, we study the effect of three different quantum image encoding approaches on the performance of a convolution-inspired hybrid quantum-classical image classification algorithm called quanvolutional neural network (QNN). We furthermore examine the effect of variational - i.e. trainable - quantum circuits on the classification results. Our experiments indicate that some image encodings are better suited for variational circuits. However, our experiments show as well that there is not one best image encoding, but that the choice of the encoding depends on the specific constraints of the application.
    Exploiting Spatial Dimensions of Latent in GAN for Real-time Image Editing. (arXiv:2104.14754v2 [cs.CV] UPDATED)
    (2 min) Generative adversarial networks (GANs) synthesize realistic images from random latent vectors. Although manipulating the latent vectors controls the synthesized outputs, editing real images with GANs suffers from i) time-consuming optimization for projecting real images to the latent vectors, ii) or inaccurate embedding through an encoder. We propose StyleMapGAN: the intermediate latent space has spatial dimensions, and a spatially variant modulation replaces AdaIN. It makes the embedding through an encoder more accurate than existing optimization-based methods while maintaining the properties of GANs. Experimental results demonstrate that our method significantly outperforms state-of-the-art models in various image manipulation tasks such as local editing and image interpolation. Last but not least, conventional editing methods on GANs are still valid on our StyleMapGAN. Source code is available at https://github.com/naver-ai/StyleMapGAN.
    VariTex: Variational Neural Face Textures. (arXiv:2104.05988v2 [cs.CV] UPDATED)
    (2 min) Deep generative models have recently demonstrated the ability to synthesize photorealistic images of human faces with novel identities. A key challenge to the wide applicability of such techniques is to provide independent control over semantically meaningful parameters: appearance, head pose, face shape, and facial expressions. In this paper, we propose VariTex - to the best of our knowledge the first method that learns a variational latent feature space of neural face textures, which allows sampling of novel identities. We combine this generative model with a parametric face model and gain explicit control over head pose and facial expressions. To generate images of complete human heads, we propose an additive decoder that generates plausible additional details such as hair. A novel training scheme enforces a pose independent latent space and in consequence, allows learning of a one-to-many mapping between latent codes and pose-conditioned exterior regions. The resulting method can generate geometrically consistent images of novel identities allowing fine-grained control over head pose, face shape, and facial expressions, facilitating a broad range of downstream tasks, like sampling novel identities, re-posing, expression transfer, and more.
    Permuted AdaIN: Reducing the Bias Towards Global Statistics in Image Classification. (arXiv:2010.05785v3 [cs.CV] UPDATED)
    (2 min) Recent work has shown that convolutional neural network classifiers overly rely on texture at the expense of shape cues. We make a similar but different distinction between shape and local image cues, on the one hand, and global image statistics, on the other. Our method, called Permuted Adaptive Instance Normalization (pAdaIN), reduces the representation of global statistics in the hidden layers of image classifiers. pAdaIN samples a random permutation $\pi$ that rearranges the samples in a given batch. Adaptive Instance Normalization (AdaIN) is then applied between the activations of each (non-permuted) sample $i$ and the corresponding activations of the sample $\pi(i)$, thus swapping statistics between the samples of the batch. Since the global image statistics are distorted, this swapping procedure causes the network to rely on cues, such as shape or texture. By choosing the random permutation with probability $p$ and the identity permutation otherwise, one can control the effect's strength. With the correct choice of $p$, fixed apriori for all experiments and selected without considering test data, our method consistently outperforms baselines in multiple settings. In image classification, our method improves on both CIFAR100 and ImageNet using multiple architectures. In the setting of robustness, our method improves on both ImageNet-C and Cifar-100-C for multiple architectures. In the setting of domain adaptation and domain generalization, our method achieves state of the art results on the transfer learning task from GTAV to Cityscapes and on the PACS benchmark.
    ShaRF: Shape-conditioned Radiance Fields from a Single View. (arXiv:2102.08860v2 [cs.CV] UPDATED)
    (2 min) We present a method for estimating neural scenes representations of objects given only a single image. The core of our method is the estimation of a geometric scaffold for the object and its use as a guide for the reconstruction of the underlying radiance field. Our formulation is based on a generative process that first maps a latent code to a voxelized shape, and then renders it to an image, with the object appearance being controlled by a second latent code. During inference, we optimize both the latent codes and the networks to fit a test image of a new object. The explicit disentanglement of shape and appearance allows our model to be fine-tuned given a single image. We can then render new views in a geometrically consistent manner and they represent faithfully the input object. Additionally, our method is able to generalize to images outside of the training domain (more realistic renderings and even real photographs). Finally, the inferred geometric scaffold is itself an accurate estimate of the object's 3D shape. We demonstrate in several experiments the effectiveness of our approach in both synthetic and real images.
    Generative Adversarial Neural Architecture Search. (arXiv:2105.09356v3 [cs.LG] UPDATED)
    (2 min) Despite the empirical success of neural architecture search (NAS) in deep learning applications, the optimality, reproducibility and cost of NAS schemes remain hard to assess. In this paper, we propose Generative Adversarial NAS (GA-NAS) with theoretically provable convergence guarantees, promoting stability and reproducibility in neural architecture search. Inspired by importance sampling, GA-NAS iteratively fits a generator to previously discovered top architectures, thus increasingly focusing on important parts of a large search space. Furthermore, we propose an efficient adversarial learning approach, where the generator is trained by reinforcement learning based on rewards provided by a discriminator, thus being able to explore the search space without evaluating a large number of architectures. Extensive experiments show that GA-NAS beats the best published results under several cases on three public NAS benchmarks. In the meantime, GA-NAS can handle ad-hoc search constraints and search spaces. We show that GA-NAS can be used to improve already optimized baselines found by other NAS methods, including EfficientNet and ProxylessNAS, in terms of ImageNet accuracy or the number of parameters, in their original search space.
    S$^2$-MLP: Spatial-Shift MLP Architecture for Vision. (arXiv:2106.07477v2 [cs.CV] UPDATED)
    (2 min) Recently, visual Transformer (ViT) and its following works abandon the convolution and exploit the self-attention operation, attaining a comparable or even higher accuracy than CNNs. More recently, MLP-Mixer abandons both the convolution and the self-attention operation, proposing an architecture containing only MLP layers. To achieve cross-patch communications, it devises an additional token-mixing MLP besides the channel-mixing MLP. It achieves promising results when training on an extremely large-scale dataset. But it cannot achieve as outstanding performance as its CNN and ViT counterparts when training on medium-scale datasets such as ImageNet1K and ImageNet21K. The performance drop of MLP-Mixer motivates us to rethink the token-mixing MLP. We discover that the token-mixing MLP is a variant of the depthwise convolution with a global reception field and spatial-specific configuration. But the global reception field and the spatial-specific property make token-mixing MLP prone to over-fitting. In this paper, we propose a novel pure MLP architecture, spatial-shift MLP (S$^2$-MLP). Different from MLP-Mixer, our S$^2$-MLP only contains channel-mixing MLP. We utilize a spatial-shift operation for communications between patches. It has a local reception field and is spatial-agnostic. It is parameter-free and efficient for computation. The proposed S$^2$-MLP attains higher recognition accuracy than MLP-Mixer when training on ImageNet-1K dataset. Meanwhile, S$^2$-MLP accomplishes as excellent performance as ViT on ImageNet-1K dataset with considerably simpler architecture and fewer FLOPs and parameters.
    TetraPackNet: Four-Corner-Based Object Detection in Logistics Use-Cases. (arXiv:2104.09123v2 [cs.CV] UPDATED)
    (2 min) While common image object detection tasks focus on bounding boxes or segmentation masks as object representations, we consider the problem of finding objects based on four arbitrary vertices. We propose a novel model, named TetraPackNet, to tackle this problem. TetraPackNet is based on CornerNet and uses similar algorithms and ideas. It is designated for applications requiring high-accuracy detection of regularly shaped objects, which is the case in the logistics use-case of packaging structure recognition. We evaluate our model on our specific real-world dataset for this use-case. Baselined against a previous solution, consisting of a Mask R-CNN model and suitable post-processing steps, TetraPackNet achieves superior results (9% higher in accuracy) in the sub-task of four-corner based transport unit side detection.
    Learning Generalized Spatial-Temporal Deep Feature Representation for No-Reference Video Quality Assessment. (arXiv:2012.13936v2 [eess.IV] UPDATED)
    (2 min) In this work, we propose a no-reference video quality assessment method, aiming to achieve high-generalization capability in cross-content, -resolution and -frame rate quality prediction. In particular, we evaluate the quality of a video by learning effective feature representations in spatial-temporal domain. In the spatial domain, to tackle the resolution and content variations, we impose the Gaussian distribution constraints on the quality features. The unified distribution can significantly reduce the domain gap between different video samples, resulting in a more generalized quality feature representation. Along the temporal dimension, inspired by the mechanism of visual perception, we propose a pyramid temporal aggregation module by involving the short-term and long-term memory to aggregate the frame-level quality. Experiments show that our method outperforms the state-of-the-art methods on cross-dataset settings, and achieves comparable performance on intra-dataset configurations, demonstrating the high-generalization capability of the proposed method.
    Rethinking supervised learning: insights from biological learning and from calling it by its name. (arXiv:2012.02526v2 [cs.LG] UPDATED)
    (2 min) The renaissance of artificial neural networks was catalysed by the success of classification models, tagged by the community with the broader term supervised learning. The extraordinary results gave rise to a hype loaded with ambitious promises and overstatements. Soon the community realised that the success owed much to the availability of thousands of labelled examples and supervised learning went, for many, from glory to shame: Some criticised deep learning as a whole and others proclaimed that the way forward had to be alternatives to supervised learning: predictive, unsupervised, semi-supervised and, more recently, self-supervised learning. However, all these seem brand names, rather than actual categories of a theoretically grounded taxonomy. Moreover, the call to banish supervised learning was motivated by the questionable claim that humans learn with little or no supervision and are capable of robust out-of-distribution generalisation. Here, we review insights about learning and supervision in nature, revisit the notion that learning and generalisation are not possible without supervision or inductive biases and argue that we will make better progress if we just call it by its name.
    Blur, Noise, and Compression Robust Generative Adversarial Networks. (arXiv:2003.07849v2 [cs.CV] UPDATED)
    (2 min) Generative adversarial networks (GANs) have gained considerable attention owing to their ability to reproduce images. However, they can recreate training images faithfully despite image degradation in the form of blur, noise, and compression, generating similarly degraded images. To solve this problem, the recently proposed noise robust GAN (NR-GAN) provides a partial solution by demonstrating the ability to learn a clean image generator directly from noisy images using a two-generator model comprising image and noise generators. However, its application is limited to noise, which is relatively easy to decompose owing to its additive and reversible characteristics, and its application to irreversible image degradation, in the form of blur, compression, and combination of all, remains a challenge. To address these problems, we propose blur, noise, and compression robust GAN (BNCR-GAN) that can learn a clean image generator directly from degraded images without knowledge of degradation parameters (e.g., blur kernel types, noise amounts, or quality factor values). Inspired by NR-GAN, BNCR-GAN uses a multiple-generator model composed of image, blur-kernel, noise, and quality-factor generators. However, in contrast to NR-GAN, to address irreversible characteristics, we introduce masking architectures adjusting degradation strength values in a data-driven manner using bypasses before and after degradation. Furthermore, to suppress uncertainty caused by the combination of blur, noise, and compression, we introduce adaptive consistency losses imposing consistency between irreversible degradation processes according to the degradation strengths. We demonstrate the effectiveness of BNCR-GAN through large-scale comparative studies on CIFAR-10 and a generality analysis on FFHQ. In addition, we demonstrate the applicability of BNCR-GAN in image restoration.
    HAWQV3: Dyadic Neural Network Quantization. (arXiv:2011.10680v3 [cs.CV] UPDATED)
    (2 min) Current low-precision quantization algorithms often have the hidden cost of conversion back and forth from floating point to quantized integer values. This hidden cost limits the latency improvement realized by quantizing Neural Networks. To address this, we present HAWQV3, a novel mixed-precision integer-only quantization framework. The contributions of HAWQV3 are the following: (i) An integer-only inference where the entire computational graph is performed only with integer multiplication, addition, and bit shifting, without any floating point operations or even integer division; (ii) A novel hardware-aware mixed-precision quantization method where the bit-precision is calculated by solving an integer linear programming problem that balances the trade-off between model perturbation and other constraints, e.g., memory footprint and latency; (iii) Direct hardware deployment and open source contribution for 4-bit uniform/mixed-precision quantization in TVM, achieving an average speed up of $1.45\times$ for uniform 4-bit, as compared to uniform 8-bit for ResNet50 on T4 GPUs; and (iv) extensive evaluation of the proposed methods on ResNet18/50 and InceptionV3, for various model compression levels with/without mixed precision. For ResNet50, our INT8 quantization achieves an accuracy of $77.58\%$, which is $2.68\%$ higher than prior integer-only work, and our mixed-precision INT4/8 quantization can reduce INT8 latency by $23\%$ and still achieve $76.73\%$ accuracy. Our framework and the TVM implementation have been open sourced.
    ShapeMOD: Macro Operation Discovery for 3D Shape Programs. (arXiv:2104.06392v2 [cs.GR] UPDATED)
    (2 min) A popular way to create detailed yet easily controllable 3D shapes is via procedural modeling, i.e. generating geometry using programs. Such programs consist of a series of instructions along with their associated parameter values. To fully realize the benefits of this representation, a shape program should be compact and only expose degrees of freedom that allow for meaningful manipulation of output geometry. One way to achieve this goal is to design higher-level macro operators that, when executed, expand into a series of commands from the base shape modeling language. However, manually authoring such macros, much like shape programs themselves, is difficult and largely restricted to domain experts. In this paper, we present ShapeMOD, an algorithm for automatically discovering macros that are useful across large datasets of 3D shape programs. ShapeMOD operates on shape programs expressed in an imperative, statement-based language. It is designed to discover macros that make programs more compact by minimizing the number of function calls and free parameters required to represent an input shape collection. We run ShapeMOD on multiple collections of programs expressed in a domain-specific language for 3D shape structures. We show that it automatically discovers a concise set of macros that abstract out common structural and parametric patterns that generalize over large shape collections. We also demonstrate that the macros found by ShapeMOD improve performance on downstream tasks including shape generative modeling and inferring programs from point clouds. Finally, we conduct a user study that indicates that ShapeMOD's discovered macros make interactive shape editing more efficient.
    Information Bottleneck Attribution for Visual Explanations of Diagnosis and Prognosis. (arXiv:2104.02869v2 [eess.IV] UPDATED)
    (2 min) Visual explanation methods have an important role in the prognosis of the patients where the annotated data is limited or unavailable. There have been several attempts to use gradient-based attribution methods to localize pathology from medical scans without using segmentation labels. This research direction has been impeded by the lack of robustness and reliability. These methods are highly sensitive to the network parameters. In this study, we introduce a robust visual explanation method to address this problem for medical applications. We provide an innovative visual explanation algorithm for general purpose and as an example application, we demonstrate its effectiveness for quantifying lesions in the lungs caused by the Covid-19 with high accuracy and robustness without using dense segmentation labels. This approach overcomes the drawbacks of commonly used Grad-CAM and its extended versions. The premise behind our proposed strategy is that the information flow is minimized while ensuring the classifier prediction stays similar. Our findings indicate that the bottleneck condition provides a more stable severity estimation than the similar attribution methods.
    Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation. (arXiv:2106.12534v1 [cs.RO])
    (2 min) Reflecting on the last few years, the biggest breakthroughs in deep reinforcement learning (RL) have been in the discrete action domain. Robotic manipulation, however, is inherently a continuous control environment, but these continuous control reinforcement learning algorithms often depend on actor-critic methods that are sample-inefficient and inherently difficult to train, due to the joint optimisation of the actor and critic. To that end, we explore how we can bring the stability of discrete action RL algorithms to the robot manipulation domain. We extend the recently released ARM algorithm, by replacing the continuous next-best pose agent with a discrete next-best pose agent. Discretisation of rotation is trivial given its bounded nature, while translation is inherently unbounded, making discretisation difficult. We formulate the translation prediction as the voxel prediction problem by discretising the 3D space; however, voxelisation of a large workspace is memory intensive and would not work with a high density of voxels, crucial to obtaining the resolution needed for robotic manipulation. We therefore propose to apply this voxel prediction in a coarse-to-fine manner by gradually increasing the resolution. In each step, we extract the highest valued voxel as the predicted location, which is then used as the centre of the higher-resolution voxelisation in the next step. This coarse-to-fine prediction is applied over several steps, giving a near-lossless prediction of the translation. We show that our new coarse-to-fine algorithm is able to accomplish RLBench tasks much more efficiently than the continuous control equivalent, and even train some real-world tasks, tabular rasa, in less than 7 minutes, with only 3 demonstrations. Moreover, we show that by moving to a voxel representation, we are able to easily incorporate observations from multiple cameras.
    Test-Time Adaptation for Out-of-distributed Image Inpainting. (arXiv:2102.01360v2 [cs.CV] UPDATED)
    (2 min) Deep learning-based image inpainting algorithms have shown great performance via powerful learned prior from the numerous external natural images. However, they show unpleasant results on the test image whose distribution is far from the that of training images because their models are biased toward the training images. In this paper, we propose a simple image inpainting algorithm with test-time adaptation named AdaFill. Given a single out-of-distributed test image, our goal is to complete hole region more naturally than the pre-trained inpainting models. To achieve this goal, we treat remained valid regions of the test image as another training cues because natural images have strong internal similarities. From this test-time adaptation, our network can exploit externally learned image priors from the pre-trained features as well as the internal prior of the test image explicitly. Experimental results show that AdaFill outperforms other models on the various out-of-distribution test images. Furthermore, the model named ZeroFill, that are not pre-trained also sometimes outperforms the pre-trained models.
    Horizontal-to-Vertical Video Conversion. (arXiv:2101.04051v2 [cs.CV] UPDATED)
    (2 min) Alongside the prevalence of mobile videos, the general public leans towards consuming vertical videos on hand-held devices. To revitalize the exposure of horizontal contents, we hereby set forth the exploration of automated horizontal-to-vertical (abbreviated as H2V) video conversion with our proposed H2V framework, accompanied by an accurately annotated H2V-142K dataset. Concretely, H2V framework integrates video shot boundary detection, subject selection and multi-object tracking to facilitate the subject-preserving conversion, wherein the key is subject selection. To achieve so, we propose a Rank-SS module that detects human objects, then selects the subject-to-preserve via exploiting location, appearance, and salient cues. Afterward, the framework automatically crops the video around the subject to produce vertical contents from horizontal sources. To build and evaluate our H2V framework, H2V-142K dataset is densely annotated with subject bounding boxes for 125 videos with 132K frames and 9,500 video covers, upon which we demonstrate superior subject selection performance comparing to traditional salient approaches, and exhibit promising horizontal-to-vertical conversion performance overall. By publicizing this dataset as well as our approach, we wish to pave the way for more valuable endeavors on the horizontal-to-vertical video conversion task.
    Stronger NAS with Weaker Predictors. (arXiv:2102.10490v2 [cs.LG] UPDATED)
    (2 min) Neural Architecture Search (NAS) often trains and evaluates a large number of architectures. Recent predictor-based NAS approaches attempt to address such heavy computation costs with two key steps: sampling some architecture-performance pairs and fitting a proxy accuracy predictor. Given limited samples, these predictors, however, are far from accurate to locate top architectures due to the difficulty of fitting the huge search space. This paper reflects on a simple yet crucial question: if our final goal is to find the best architecture, do we really need to model the whole space well?. We propose a paradigm shift from fitting the whole architecture space using one strong predictor, to progressively fitting a search path towards the high-performance sub-space through a set of weaker predictors. As a key property of the proposed weak predictors, their probabilities of sampling better architectures keep increasing. Hence we only sample a few well-performed architectures guided by the previously learned predictor and estimate a new better weak predictor. This embarrassingly easy framework produces coarse-to-fine iteration to refine the ranking of sampling space gradually. Extensive experiments demonstrate that our method costs fewer samples to find top-performance architectures on NAS-Bench-101 and NAS-Bench-201, as well as achieves the state-of-the-art ImageNet performance on the NASNet search space. In particular, compared to state-of-the-art (SOTA) predictor-based NAS methods, WeakNAS outperforms all of them with notable margins, e.g., requiring at least 7.5x less samples to find global optimal on NAS-Bench-101; and WeakNAS can also absorb them for further performance boost. We further strike the new SOTA result of 81.3% in the ImageNet MobileNet Search Space. The code is available at https://github.com/VITA-Group/WeakNAS.
    Reducing Textural Bias Improves Robustness of Deep Segmentation Models. (arXiv:2011.15093v2 [eess.IV] UPDATED)
    (2 min) Despite advances in deep learning, robustness under domain shift remains a major bottleneck in medical imaging settings. Findings on natural images suggest that deep neural models can show a strong textural bias when carrying out image classification tasks. In this thorough empirical study, we draw inspiration from findings on natural images and investigate ways in which addressing the textural bias phenomenon could bring up the robustness of deep segmentation models when applied to three-dimensional (3D) medical data. To achieve this, publicly available MRI scans from the Developing Human Connectome Project are used to study ways in which simulating textural noise can help train robust models in a complex semantic segmentation task. We contribute an extensive empirical investigation consisting of 176 experiments and illustrate how applying specific types of simulated textural noise prior to training can lead to texture invariant models, resulting in improved robustness when segmenting scans corrupted by previously unseen noise types and levels.
    Visualizing Missing Surfaces In Colonoscopy Videos using Shared Latent Space Representations. (arXiv:2101.07280v2 [eess.IV] UPDATED)
    (2 min) Optical colonoscopy (OC), the most prevalent colon cancer screening tool, has a high miss rate due to a number of factors, including the geometry of the colon (haustral fold and sharp bends occlusions), endoscopist inexperience or fatigue, endoscope field of view, etc. We present a framework to visualize the missed regions per-frame during the colonoscopy, and provides a workable clinical solution. Specifically, we make use of 3D reconstructed virtual colonoscopy (VC) data and the insight that VC and OC share the same underlying geometry but differ in color, texture and specular reflections, embedded in the OC domain. A lossy unpaired image-to-image translation model is introduced with enforced shared latent space for OC and VC. This shared latent space captures the geometric information while deferring the color, texture, and specular information creation to additional Gaussian noise input. This additional noise input can be utilized to generate one-to-many mappings from VC to OC and OC to OC. The code, data and trained models will be released via our Computational Endoscopy Platform at https://github.com/nadeemlab/CEP.
    Perceiver: General Perception with Iterative Attention. (arXiv:2103.03206v2 [cs.CV] UPDATED)
    (2 min) Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.
    A System for Automatic Rice Disease Detection from Rice Paddy Images Serviced via a Chatbot. (arXiv:2011.10823v2 [eess.SY] UPDATED)
    (3 min) A LINE Bot System to diagnose rice diseases from actual paddy field images was developed and presented in this paper. It was easy-to-use and automatic system designed to help rice farmers improve the rice yield and quality. The targeted images were taken from the actual paddy environment without special sample preparation. We used a deep learning neural networks technique to detect rice diseases from the images. We developed an object detection model training and refinement process to improve the performance of our previous research on rice leave diseases detection. The process was based on analyzing the model's predictive results and could be repeatedly used to improve the quality of the database in the next training of the model. The deployment model for our LINE Bot system was created from the selected best performance technique in our previous paper, YOLOv3, trained by refined training data set. The performance of the deployment model was measured on 5 target classes and found that the Average True Positive Point improved from 91.1% in the previous paper to 95.6% in this study. Therefore, we used this deployment model for Rice Disease LINE Bot system. Our system worked automatically real-time to suggest primary diagnosis results to the users in the LINE group, which included rice farmers and rice disease specialists. They could communicate freely via chat. In the real LINE Bot deployment, the model's performance was measured by our own defined measurement Average True Positive Point and was found to be an average of 78.86%. The system was fast and took only 2-3 s for detection process in our system server.
    Taming Transformers for High-Resolution Image Synthesis. (arXiv:2012.09841v3 [cs.CV] UPDATED)
    (2 min) Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers .
    Simple Copy-Paste is a Strong Data Augmentation Method for Instance Segmentation. (arXiv:2012.07177v2 [cs.CV] UPDATED)
    (2 min) Building instance segmentation models that are data-efficient and can handle rare object categories is an important challenge in computer vision. Leveraging data augmentations is a promising direction towards addressing this challenge. Here, we perform a systematic study of the Copy-Paste augmentation ([13, 12]) for instance segmentation where we randomly paste objects onto an image. Prior studies on Copy-Paste relied on modeling the surrounding visual context for pasting the objects. However, we find that the simple mechanism of pasting objects randomly is good enough and can provide solid gains on top of strong baselines. Furthermore, we show Copy-Paste is additive with semi-supervised methods that leverage extra data through pseudo labeling (e.g. self-training). On COCO instance segmentation, we achieve 49.1 mask AP and 57.3 box AP, an improvement of +0.6 mask AP and +1.5 box AP over the previous state-of-the-art. We further demonstrate that Copy-Paste can lead to significant improvements on the LVIS benchmark. Our baseline model outperforms the LVIS 2020 Challenge winning entry by +3.6 mask AP on rare categories.
    FLOP: Federated Learning on Medical Datasets using Partial Networks. (arXiv:2102.05218v2 [cs.LG] UPDATED)
    (2 min) The outbreak of COVID-19 Disease due to the novel coronavirus has caused a shortage of medical resources. To aid and accelerate the diagnosis process, automatic diagnosis of COVID-19 via deep learning models has recently been explored by researchers across the world. While different data-driven deep learning models have been developed to mitigate the diagnosis of COVID-19, the data itself is still scarce due to patient privacy concerns. Federated Learning (FL) is a natural solution because it allows different organizations to cooperatively learn an effective deep learning model without sharing raw data. However, recent studies show that FL still lacks privacy protection and may cause data leakage. We investigate this challenging problem by proposing a simple yet effective algorithm, named \textbf{F}ederated \textbf{L}earning \textbf{o}n Medical Datasets using \textbf{P}artial Networks (FLOP), that shares only a partial model between the server and clients. Extensive experiments on benchmark data and real-world healthcare tasks show that our approach achieves comparable or better performance while reducing the privacy and security risks. Of particular interest, we conduct experiments on the COVID-19 dataset and find that our FLOP algorithm can allow different hospitals to collaboratively and effectively train a partially shared model without sharing local patients' data.
    Modality Attention and Sampling Enables Deep Learning with Heterogeneous Marker Combinations in Fluorescence Microscopy. (arXiv:2008.12380v2 [cs.CV] UPDATED)
    (3 min) Fluorescence microscopy allows for a detailed inspection of cells, cellular networks, and anatomical landmarks by staining with a variety of carefully-selected markers visualized as color channels. Quantitative characterization of structures in acquired images often relies on automatic image analysis methods. Despite the success of deep learning methods in other vision applications, their potential for fluorescence image analysis remains underexploited. One reason lies in the considerable workload required to train accurate models, which are normally specific for a given combination of markers, and therefore applicable to a very restricted number of experimental settings. We herein propose Marker Sampling and Excite, a neural network approach with a modality sampling strategy and a novel attention module that together enable (i) flexible training with heterogeneous datasets with combinations of markers and (ii) successful utility of learned models on arbitrary subsets of markers prospectively. We show that our single neural network solution performs comparably to an upper bound scenario where an ensemble of many networks is na\"ively trained for each possible marker combination separately. In addition, we demonstrate the feasibility of this framework in high-throughput biological analysis by revising a recent quantitative characterization of bone marrow vasculature in 3D confocal microscopy datasets and further confirm the validity of our approach on an additional, significantly different dataset of microvessels in fetal liver tissues. Not only can our work substantially ameliorate the use of deep learning in fluorescence microscopy analysis, but it can also be utilized in other fields with incomplete data acquisitions and missing modalities.
    Towards Automated Biometric Identification of Sea Turtles (Chelonia mydas). (arXiv:1909.11277v2 [cs.CV] UPDATED)
    (2 min) Passive biometric identification enables wildlife monitoring with minimal disturbance. Using a motion-activated camera placed at an elevated position and facing downwards, we collected images of sea turtle carapace, each belonging to one of sixteen Chelonia mydas juveniles. We then learned co-variant and robust image descriptors from these images, enabling indexing and retrieval. In this work, we presented several classification results of sea turtle carapaces using the learned image descriptors. We found that a template-based descriptor, i.e., Histogram of Oriented Gradients (HOG) performed exceedingly better during classification than keypoint-based descriptors. For our dataset, a high-dimensional descriptor is a must due to the minimal gradient and color information inside the carapace images. Using HOG, we obtained an average classification accuracy of 65%.
    Unbiased Mean Teacher for Cross-domain Object Detection. (arXiv:2003.00707v2 [cs.CV] UPDATED)
    (2 min) Cross-domain object detection is challenging, because object detection model is often vulnerable to data variance, especially to the considerable domain shift between two distinctive domains. In this paper, we propose a new Unbiased Mean Teacher (UMT) model for cross-domain object detection. We reveal that there often exists a considerable model bias for the simple mean teacher (MT) model in cross-domain scenarios, and eliminate the model bias with several simple yet highly effective strategies. In particular, for the teacher model, we propose a cross-domain distillation method for MT to maximally exploit the expertise of the teacher model. Moreover, for the student model, we alleviate its bias by augmenting training samples with pixel-level adaptation. Finally, for the teaching process, we employ an out-of-distribution estimation strategy to select samples that most fit the current model to further enhance the cross-domain distillation process. By tackling the model bias issue with these strategies, our UMT model achieves mAPs of 44.1%, 58.1%, 41.7%, and 43.1% on benchmark datasets Clipart1k, Watercolor2k, Foggy Cityscapes, and Cityscapes, respectively, which outperforms the existing state-of-the-art results in notable margins. Our implementation is available at https://github.com/kinredon/umt.
    Emergent Properties of Foveated Perceptual Systems. (arXiv:2006.07991v3 [cs.CV] UPDATED)
    (3 min) The goal of this work is to characterize the representational impact that foveation operations have for machine vision systems, inspired by the foveated human visual system, which has higher acuity at the center of gaze and texture-like encoding in the periphery. To do so, we introduce models consisting of a first-stage \textit{fixed} image transform followed by a second-stage \textit{learnable} convolutional neural network, and we varied the first stage component. The primary model has a foveated-textural input stage, which we compare to a model with foveated-blurred input and a model with spatially-uniform blurred input (both matched for perceptual compression), and a final reference model with minimal input-based compression. We find that: 1) the foveated-texture model shows similar scene classification accuracy as the reference model despite its compressed input, with greater i.i.d. generalization than the other models; 2) the foveated-texture model has greater sensitivity to high-spatial frequency information and greater robustness to occlusion, w.r.t the comparison models; 3) both the foveated systems, show a stronger center image-bias relative to the spatially-uniform systems even with a weight sharing constraint. Critically, these results are preserved over different classical CNN architectures throughout their learning dynamics. Altogether, this suggests that foveation with peripheral texture-based computations yields an efficient, distinct, and robust representational format of scene information, and provides symbiotic computational insight into the representational consequences that texture-based peripheral encoding may have for processing in the human visual system, while also potentially inspiring the next generation of computer vision models via spatially-adaptive computation. Code + Data available here: https://github.com/ArturoDeza/EmergentProperties
    Unsupervised Segmentation of Action Segments in Egocentric Videos using Gaze. (arXiv:1710.00187v2 [cs.CV] UPDATED)
    (2 min) Unsupervised segmentation of action segments in egocentric videos is a desirable feature in tasks such as activity recognition and content-based video retrieval. Reducing the search space into a finite set of action segments facilitates a faster and less noisy matching. However, there exist a substantial gap in machine understanding of natural temporal cuts during a continuous human activity. This work reports on a novel gaze-based approach for segmenting action segments in videos captured using an egocentric camera. Gaze is used to locate the region-of-interest inside a frame. By tracking two simple motion-based parameters inside successive regions-of-interest, we discover a finite set of temporal cuts. We present several results using combinations (of the two parameters) on a dataset, i.e., BRISGAZE-ACTIONS. The dataset contains egocentric videos depicting several daily-living activities. The quality of the temporal cuts is further improved by implementing two entropy measures.
    HDR Environment Map Estimation for Real-Time Augmented Reality. (arXiv:2011.10687v4 [cs.CV] UPDATED)
    (2 min) We present a method to estimate an HDR environment map from a narrow field-of-view LDR camera image in real-time. This enables perceptually appealing reflections and shading on virtual objects of any material finish, from mirror to diffuse, rendered into a real physical environment using augmented reality. Our method is based on our efficient convolutional neural network architecture, EnvMapNet, trained end-to-end with two novel losses, ProjectionLoss for the generated image, and ClusterLoss for adversarial training. Through qualitative and quantitative comparison to state-of-the-art methods, we demonstrate that our algorithm reduces the directional error of estimated light sources by more than 50%, and achieves 3.7 times lower Frechet Inception Distance (FID). We further showcase a mobile application that is able to run our neural network model in under 9 ms on an iPhone XS, and render in real-time, visually coherent virtual objects in previously unseen real-world environments.
    Unsupervised Classification of Intrusive Igneous Rock Thin Section Images using Edge Detection and Colour Analysis. (arXiv:1710.00189v2 [cs.CV] UPDATED)
    (2 min) Classification of rocks is one of the fundamental tasks in a geological study. The process requires a human expert to examine sampled thin section images under a microscope. In this study, we propose a method that uses microscope automation, digital image acquisition, edge detection and colour analysis (histogram). We collected 60 digital images from 20 standard thin sections using a digital camera mounted on a conventional microscope. Each image is partitioned into a finite number of cells that form a grid structure. Edge and colour profile of pixels inside each cell determine its classification. The individual cells then determine the thin section image classification via a majority voting scheme. Our method yielded successful results as high as 90% to 100% precision.
    FoldIt: Haustral Folds Detection and Segmentation in Colonoscopy Videos. (arXiv:2106.12522v1 [eess.IV])
    (2 min) Haustral folds are colon wall protrusions implicated for high polyp miss rate during optical colonoscopy procedures. If segmented accurately, haustral folds can allow for better estimation of missed surface and can also serve as valuable landmarks for registering pre-treatment virtual (CT) and optical colonoscopies, to guide navigation towards the anomalies found in pre-treatment scans. We present a novel generative adversarial network, FoldIt, for feature-consistent image translation of optical colonoscopy videos to virtual colonoscopy renderings with haustral fold overlays. A new transitive loss is introduced in order to leverage ground truth information between haustral fold annotations and virtual colonoscopy renderings. We demonstrate the effectiveness of our model on real challenging optical colonoscopy videos as well as on textured virtual colonoscopy videos with clinician-verified haustral fold annotations. All code and scripts to reproduce the experiments of this paper will be made available via our Computational Endoscopy Platform at https://github.com/nadeemlab/CEP.
    High-Throughput Precision Phenotyping of Left Ventricular Hypertrophy with Cardiovascular Deep Learning. (arXiv:2106.12511v1 [eess.IV])
    (2 min) Left ventricular hypertrophy (LVH) results from chronic remodeling caused by a broad range of systemic and cardiovascular disease including hypertension, aortic stenosis, hypertrophic cardiomyopathy, and cardiac amyloidosis. Early detection and characterization of LVH can significantly impact patient care but is limited by under-recognition of hypertrophy, measurement error and variability, and difficulty differentiating etiologies of LVH. To overcome this challenge, we present EchoNet-LVH - a deep learning workflow that automatically quantifies ventricular hypertrophy with precision equal to human experts and predicts etiology of LVH. Trained on 28,201 echocardiogram videos, our model accurately measures intraventricular wall thickness (mean absolute error [MAE] 1.4mm, 95% CI 1.2-1.5mm), left ventricular diameter (MAE 2.4mm, 95% CI 2.2-2.6mm), and posterior wall thickness (MAE 1.2mm, 95% CI 1.1-1.3mm) and classifies cardiac amyloidosis (area under the curve of 0.83) and hypertrophic cardiomyopathy (AUC 0.98) from other etiologies of LVH. In external datasets from independent domestic and international healthcare systems, EchoNet-LVH accurately quantified ventricular parameters (R2 of 0.96 and 0.90 respectively) and detected cardiac amyloidosis (AUC 0.79) and hypertrophic cardiomyopathy (AUC 0.89) on the domestic external validation site. Leveraging measurements across multiple heart beats, our model can more accurately identify subtle changes in LV geometry and its causal etiologies. Compared to human experts, EchoNet-LVH is fully automated, allowing for reproducible, precise measurements, and lays the foundation for precision diagnosis of cardiac hypertrophy. As a resource to promote further innovation, we also make publicly available a large dataset of 23,212 annotated echocardiogram videos.
    Measuring Human Perception to Improve Handwritten Document Transcription. (arXiv:1904.03734v5 [cs.CV] UPDATED)
    (2 min) The subtleties of human perception, as measured by vision scientists through the use of psychophysics, are important clues to the internal workings of visual recognition. For instance, measured reaction time can indicate whether a visual stimulus is easy for a subject to recognize, or whether it is hard. In this paper, we consider how to incorporate psychophysical measurements of visual perception into the loss function of a deep neural network being trained for a recognition task, under the assumption that such information can enforce consistency with human behavior. As a case study to assess the viability of this approach, we look at the problem of handwritten document transcription. While good progress has been made towards automatically transcribing modern handwriting, significant challenges remain in transcribing historical documents. Here we describe a general enhancement strategy, underpinned by the new loss formulation, which can be applied to the training regime of any deep learning-based document transcription system. Through experimentation, reliable performance improvement is demonstrated for the standard IAM and RIMES datasets for three different network architectures. Further, we go on to show feasibility for our approach on a new dataset of digitized Latin manuscripts, originally produced by scribes in the Cloister of St. Gall in the the 9th century.
    Learning Multimodal VAEs through Mutual Supervision. (arXiv:2106.12570v1 [cs.LG])
    (2 min) Multimodal VAEs seek to model the joint distribution over heterogeneous data (e.g.\ vision, language), whilst also capturing a shared representation across such modalities. Prior work has typically combined information from the modalities by reconciling idiosyncratic representations directly in the recognition model through explicit products, mixtures, or other such factorisations. Here we introduce a novel alternative, the MEME, that avoids such explicit combinations by repurposing semi-supervised VAEs to combine information between modalities implicitly through mutual supervision. This formulation naturally allows learning from partially-observed data where some modalities can be entirely missing -- something that most existing approaches either cannot handle, or do so to a limited extent. We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes on the MNIST-SVHN (image-image) and CUB (image-text) datasets. We also contrast the quality of the representations learnt by mutual supervision against standard approaches and observe interesting trends in its ability to capture relatedness between data.
    FusionPainting: Multimodal Fusion with Adaptive Attention for 3D Object Detection. (arXiv:2106.12449v1 [cs.CV])
    (2 min) Accurate detection of obstacles in 3D is an essential task for autonomous driving and intelligent transportation. In this work, we propose a general multimodal fusion framework FusionPainting to fuse the 2D RGB image and 3D point clouds at a semantic level for boosting the 3D object detection task. Especially, the FusionPainting framework consists of three main modules: a multi-modal semantic segmentation module, an adaptive attention-based semantic fusion module, and a 3D object detector. First, semantic information is obtained for 2D images and 3D Lidar point clouds based on 2D and 3D segmentation approaches. Then the segmentation results from different sensors are adaptively fused based on the proposed attention-based semantic fusion module. Finally, the point clouds painted with the fused semantic label are sent to the 3D detector for obtaining the 3D objection results. The effectiveness of the proposed framework has been verified on the large-scale nuScenes detection benchmark by comparing it with three different baselines. The experimental results show that the fusion strategy can significantly improve the detection performance compared to the methods using only point clouds, and the methods using point clouds only painted with 2D segmentation information. Furthermore, the proposed approach outperforms other state-of-the-art methods on the nuScenes testing benchmark.
    Adapting Off-the-Shelf Source Segmenter for Target Medical Image Segmentation. (arXiv:2106.12497v1 [cs.CV])
    (2 min) Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a labeled source domain to an unlabeled and unseen target domain, which is usually trained on data from both domains. Access to the source domain data at the adaptation stage, however, is often limited, due to data storage or privacy issues. To alleviate this, in this work, we target source free UDA for segmentation, and propose to adapt an ``off-the-shelf" segmentation model pre-trained in the source domain to the target domain, with an adaptive batch-wise normalization statistics adaptation framework. Specifically, the domain-specific low-order batch statistics, i.e., mean and variance, are gradually adapted with an exponential momentum decay scheme, while the consistency of domain shareable high-order batch statistics, i.e., scaling and shifting parameters, is explicitly enforced by our optimization objective. The transferability of each channel is adaptively measured first from which to balance the contribution of each channel. Moreover, the proposed source free UDA framework is orthogonal to unsupervised learning methods, e.g., self-entropy minimization, which can thus be simply added on top of our framework. Extensive experiments on the BraTS 2018 database show that our source free UDA framework outperformed existing source-relaxed UDA methods for the cross-subtype UDA segmentation task and yielded comparable results for the cross-modality UDA segmentation task, compared with a supervised UDA methods with the source data.
    Diabetic Retinopathy Detection using Ensemble Machine Learning. (arXiv:2106.12545v1 [eess.IV])
    (2 min) Diabetic Retinopathy (DR) is among the worlds leading vision loss causes in diabetic patients. DR is a microvascular disease that affects the eye retina, which causes vessel blockage and therefore cuts the main source of nutrition for the retina tissues. Treatment for this visual disorder is most effective when it is detected in its earliest stages, as severe DR can result in irreversible blindness. Nonetheless, DR identification requires the expertise of Ophthalmologists which is often expensive and time-consuming. Therefore, automatic detection systems were introduced aiming to facilitate the identification process, making it available globally in a time and cost-efficient manner. However, due to the limited reliable datasets and medical records for this particular eye disease, the obtained predictions accuracies were relatively unsatisfying for eye specialists to rely on them as diagnostic systems. Thus, we explored an ensemble-based learning strategy, merging a substantial selection of well-known classification algorithms in one sophisticated diagnostic model. The proposed framework achieved the highest accuracy rates among all other common classification algorithms in the area. 4 subdatasets were generated to contain the top 5 and top 10 features of the Messidor dataset, selected by InfoGainEval. and WrapperSubsetEval., accuracies of 70.7% and 75.1% were achieved on the InfoGainEval. top 5 and original dataset respectively. The results imply the impressive performance of the subdataset, which significantly conduces to a less complex classification process
    Multi-Class Classification of Blood Cells -- End to End Computer Vision based diagnosis case study. (arXiv:2106.12548v1 [cs.CV])
    (2 min) The diagnosis of blood-based diseases often involves identifying and characterizing patient blood samples. Automated methods to detect and classify blood cell subtypes have important medical applications. Automated medical image processing and analysis offers a powerful tool for medical diagnosis. In this work we tackle the problem of white blood cell classification based on the morphological characteristics of their outer contour, color. The work we would explore a set of preprocessing and segmentation (Color-based segmentation, Morphological processing, contouring) algorithms along with a set of features extraction methods (Corner detection algorithms and Histogram of Gradients(HOG)), dimensionality reduction algorithms (Principal Component Analysis(PCA)) that are able to recognize and classify through various Unsupervised(k-nearest neighbors) and Supervised (Support Vector Machine, Decision Trees, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Naive Bayes) algorithms different categories of white blood cells to Eosinophil, Lymphocyte, Monocyte, and Neutrophil. We even take a step forwards to explore various Deep Convolutional Neural network architecture (Sqeezent, MobilenetV1,MobilenetV2, InceptionNet etc.) without preprocessing/segmentation and with preprocessing. We would like to explore many algorithms to identify the robust algorithm with least time complexity and low resource requirement. The outcome of this work can be a cue to selection of algorithms as per requirement for automated blood cell classification.
    Fairness in Cardiac MR Image Analysis: An Investigation of Bias Due to Data Imbalance in Deep Learning Based Segmentation. (arXiv:2106.12387v1 [cs.CV])
    (3 min) The subject of "fairness" in artificial intelligence (AI) refers to assessing AI algorithms for potential bias based on demographic characteristics such as race and gender, and the development of algorithms to address this bias. Most applications to date have been in computer vision, although some work in healthcare has started to emerge. The use of deep learning (DL) in cardiac MR segmentation has led to impressive results in recent years, and such techniques are starting to be translated into clinical practice. However, no work has yet investigated the fairness of such models. In this work, we perform such an analysis for racial/gender groups, focusing on the problem of training data imbalance, using a nnU-Net model trained and evaluated on cine short axis cardiac MR data from the UK Biobank dataset, consisting of 5,903 subjects from 6 different racial groups. We find statistically significant differences in Dice performance between different racial groups. To reduce the racial bias, we investigated three strategies: (1) stratified batch sampling, in which batch sampling is stratified to ensure balance between racial groups; (2) fair meta-learning for segmentation, in which a DL classifier is trained to classify race and jointly optimized with the segmentation model; and (3) protected group models, in which a different segmentation model is trained for each racial group. We also compared the results to the scenario where we have a perfectly balanced database. To assess fairness we used the standard deviation (SD) and skewed error ratio (SER) of the average Dice values. Our results demonstrate that the racial bias results from the use of imbalanced training data, and that all proposed bias mitigation strategies improved fairness, with the best SD and SER resulting from the use of protected group models.
    CxSE: Chest X-ray Slow Encoding CNN forCOVID-19 Diagnosis. (arXiv:2106.12157v1 [eess.IV])
    (2 min) The coronavirus continues to disrupt our everyday lives as it spreads at an exponential rate. It needs to be detected quickly in order to quarantine positive patients so as to avoid further spread. This work proposes a new convolutional neural network (CNN) architecture called 'slow Encoding CNN. The proposed model's best performance wrt Sensitivity, Positive Predictive Value (PPV) found to be SP=0.67, PP=0.98, SN=0.96, and PN=0.52 on AI AGAINST COVID19 - Screening X-ray images for COVID-19 Infections competition's test data samples. SP and PP stand for the Sensitivity and PPV of the COVID-19 positive class, while PN and SN stand for the Sensitivity and PPV of the COVID-19 negative class.
    3D human tongue reconstruction from single "in-the-wild" images. (arXiv:2106.12302v1 [cs.CV])
    (2 min) 3D face reconstruction from a single image is a task that has garnered increased interest in the Computer Vision community, especially due to its broad use in a number of applications such as realistic 3D avatar creation, pose invariant face recognition and face hallucination. Since the introduction of the 3D Morphable Model in the late 90's, we witnessed an explosion of research aiming at particularly tackling this task. Nevertheless, despite the increasing level of detail in the 3D face reconstructions from single images mainly attributed to deep learning advances, finer and highly deformable components of the face such as the tongue are still absent from all 3D face models in the literature, although being very important for the realness of the 3D avatar representations. In this work we present the first, to the best of our knowledge, end-to-end trainable pipeline that accurately reconstructs the 3D face together with the tongue. Moreover, we make this pipeline robust in "in-the-wild" images by introducing a novel GAN method tailored for 3D tongue surface generation. Finally, we make publicly available to the community the first diverse tongue dataset, consisting of 1,800 raw scans of 700 individuals varying in gender, age, and ethnicity backgrounds. As we demonstrate in an extensive series of quantitative as well as qualitative experiments, our model proves to be robust and realistically captures the 3D tongue structure, even in adverse "in-the-wild" conditions.
    Multi-modal and frequency-weighted tensor nuclear norm for hyperspectral image denoising. (arXiv:2106.12489v1 [eess.IV])
    (2 min) Low-rankness is important in the hyperspectral image (HSI) denoising tasks. The tensor nuclear norm (TNN), defined based on the tensor singular value decomposition, is a state-of-the-art method to describe the low-rankness of HSI. However, TNN ignores some of the physical meanings of HSI in tackling the denoising tasks, leading to suboptimal denoising performance. In this paper, we propose the multi-modal and frequency-weighted tensor nuclear norm (MFWTNN) and the non-convex MFWTNN for HSI denoising tasks. Firstly, we investigate the physical meaning of frequency components and reconsider their weights to improve the low-rank representation ability of TNN. Meanwhile, we also consider the correlation among two spatial dimensions and the spectral dimension of HSI and combine the above improvements to TNN to propose MFWTNN. Secondly, we use non-convex functions to approximate the rank function of the frequency tensor and propose the NonMFWTNN to relax the MFWTNN better. Besides, we adaptively choose bigger weights for slices mainly containing noise information and smaller weights for slices containing profile information. Finally, we develop the efficient alternating direction method of multiplier (ADMM) based algorithm to solve the proposed models, and the effectiveness of our models are substantiated in simulated and real HSI datasets.
    A Circular-Structured Representation for Visual Emotion Distribution Learning. (arXiv:2106.12450v1 [cs.CV])
    (2 min) Visual Emotion Analysis (VEA) has attracted increasing attention recently with the prevalence of sharing images on social networks. Since human emotions are ambiguous and subjective, it is more reasonable to address VEA in a label distribution learning (LDL) paradigm rather than a single-label classification task. Different from other LDL tasks, there exist intrinsic relationships between emotions and unique characteristics within them, as demonstrated in psychological theories. Inspired by this, we propose a well-grounded circular-structured representation to utilize the prior knowledge for visual emotion distribution learning. To be specific, we first construct an Emotion Circle to unify any emotional state within it. On the proposed Emotion Circle, each emotion distribution is represented with an emotion vector, which is defined with three attributes (i.e., emotion polarity, emotion type, emotion intensity) as well as two properties (i.e., similarity, additivity). Besides, we design a novel Progressive Circular (PC) loss to penalize the dissimilarities between predicted emotion vector and labeled one in a coarse-to-fine manner, which further boosts the learning process in an emotion-specific way. Extensive experiments and comparisons are conducted on public visual emotion distribution datasets, and the results demonstrate that the proposed method outperforms the state-of-the-art methods.
    Gradient-Based Interpretability Methods and Binarized Neural Networks. (arXiv:2106.12569v1 [cs.CV])
    (2 min) Binarized Neural Networks (BNNs) have the potential to revolutionize the way that deep learning is carried out in edge computing platforms. However, the effectiveness of interpretability methods on these networks has not been assessed. In this paper, we compare the performance of several widely used saliency map-based interpretabilty techniques (Gradient, SmoothGrad and GradCAM), when applied to Binarized or Full Precision Neural Networks (FPNNs). We found that the basic Gradient method produces very similar-looking maps for both types of network. However, SmoothGrad produces significantly noisier maps for BNNs. GradCAM also produces saliency maps which differ between network types, with some of the BNNs having seemingly nonsensical explanations. We comment on possible reasons for these differences in explanations and present it as an example of why interpretability techniques should be tested on a wider range of network types.
    SketchEmbedNet: Learning Novel Concepts by Imitating Drawings. (arXiv:2009.04806v4 [cs.CV] UPDATED)
    (2 min) Sketch drawings capture the salient information of visual concepts. Previous work has shown that neural networks are capable of producing sketches of natural objects drawn from a small number of classes. While earlier approaches focus on generation quality or retrieval, we explore properties of image representations learned by training a model to produce sketches of images. We show that this generative, class-agnostic model produces informative embeddings of images from novel examples, classes, and even novel datasets in a few-shot setting. Additionally, we find that these learned representations exhibit interesting structure and compositionality.
    Feature Alignment for Approximated Reversibility in Neural Networks. (arXiv:2106.12562v1 [cs.LG])
    (2 min) We introduce feature alignment, a technique for obtaining approximate reversibility in artificial neural networks. By means of feature extraction, we can train a neural network to learn an estimated map for its reverse process from outputs to inputs. Combined with variational autoencoders, we can generate new samples from the same statistics as the training data. Improvements of the results are obtained by using concepts from generative adversarial networks. Finally, we show that the technique can be modified for training neural networks locally, saving computational memory resources. Applying these techniques, we report results for three vision generative tasks: MNIST, CIFAR-10, and celebA.
    Transformer Meets Convolution: A Bilateral Awareness Net-work for Semantic Segmentation of Very Fine Resolution Ur-ban Scene Images. (arXiv:2106.12413v1 [cs.CV])
    (2 min) Semantic segmentation from very fine resolution (VFR) urban scene images plays a significant role in several application scenarios including autonomous driving, land cover classification, and urban planning, etc. However, the tremendous details contained in the VFR image severely limit the potential of the existing deep learning approaches. More seriously, the considerable variations in scale and appearance of objects further deteriorate the representational capacity of those se-mantic segmentation methods, leading to the confusion of adjacent objects. Addressing such is-sues represents a promising research field in the remote sensing community, which paves the way for scene-level landscape pattern analysis and decision making. In this manuscript, we pro-pose a bilateral awareness network (BANet) which contains a dependency path and a texture path to fully capture the long-range relationships and fine-grained details in VFR images. Specif-ically, the dependency path is conducted based on the ResT, a novel Transformer backbone with memory-efficient multi-head self-attention, while the texture path is built on the stacked convo-lution operation. Besides, using the linear attention mechanism, a feature aggregation module (FAM) is designed to effectively fuse the dependency features and texture features. Extensive experiments conducted on the three large-scale urban scene image segmentation datasets, i.e., ISPRS Vaihingen dataset, ISPRS Potsdam dataset, and UAVid dataset, demonstrate the effective-ness of our BANet. Specifically, a 64.6% mIoU is achieved on the UAVid dataset.
    Fine-Tuning StyleGAN2 For Cartoon Face Generation. (arXiv:2106.12445v1 [cs.CV])
    (2 min) Recent studies have shown remarkable success in the unsupervised image to image (I2I) translation. However, due to the imbalance in the data, learning joint distribution for various domains is still very challenging. Although existing models can generate realistic target images, it's difficult to maintain the structure of the source image. In addition, training a generative model on large data in multiple domains requires a lot of time and computer resources. To address these limitations, we propose a novel image-to-image translation method that generates images of the target domain by finetuning a stylegan2 pretrained model. The stylegan2 model is suitable for unsupervised I2I translation on unbalanced datasets; it is highly stable, produces realistic images, and even learns properly from limited data when applied with simple fine-tuning techniques. Thus, in this paper, we propose new methods to preserve the structure of the source images and generate realistic images in the target domain. The code and results are available at https://github.com/happy-jihye/Cartoon-StyleGan2
    Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers. (arXiv:2106.12442v1 [cs.CV])
    (2 min) Accurate prediction of pedestrian and bicyclist paths is integral to the development of reliable autonomous vehicles in dense urban environments. The interactions between vehicle and pedestrian or bicyclist have a significant impact on the trajectories of traffic participants e.g. stopping or turning to avoid collisions. Although recent datasets and trajectory prediction approaches have fostered the development of autonomous vehicles yet the amount of vehicle-pedestrian (bicyclist) interactions modeled are sparse. In this work, we propose Euro-PVI, a dataset of pedestrian and bicyclist trajectories. In particular, our dataset caters more diverse and complex interactions in dense urban scenarios compared to the existing datasets. To address the challenges in predicting future trajectories with dense interactions, we develop a joint inference model that learns an expressive multi-modal shared latent space across agents in the urban scene. This enables our Joint-$\beta$-cVAE approach to better model the distribution of future trajectories. We achieve state of the art results on the nuScenes and Euro-PVI datasets demonstrating the importance of capturing interactions between ego-vehicle and pedestrians (bicyclists) for accurate predictions.
    Alias-Free Generative Adversarial Networks. (arXiv:2106.12423v1 [cs.CV])
    (2 min) We observe that despite their hierarchical convolutional nature, the synthesis process of typical generative adversarial networks depends on absolute pixel coordinates in an unhealthy manner. This manifests itself as, e.g., detail appearing to be glued to image coordinates instead of the surfaces of depicted objects. We trace the root cause to careless signal processing that causes aliasing in the generator network. Interpreting all signals in the network as continuous, we derive generally applicable, small architectural changes that guarantee that unwanted information cannot leak into the hierarchical synthesis process. The resulting networks match the FID of StyleGAN2 but differ dramatically in their internal representations, and they are fully equivariant to translation and rotation even at subpixel scales. Our results pave the way for generative models better suited for video and animation.
    Deep unsupervised 3D human body reconstruction from a sparse set of landmarks. (arXiv:2106.12282v1 [cs.CV])
    (2 min) In this paper we propose the first deep unsupervised approach in human body reconstruction to estimate body surface from a sparse set of landmarks, so called DeepMurf. We apply a denoising autoencoder to estimate missing landmarks. Then we apply an attention model to estimate body joints from landmarks. Finally, a cascading network is applied to regress parameters of a statistical generative model that reconstructs body. Our set of proposed loss functions allows us to train the network in an unsupervised way. Results on four public datasets show that our approach accurately reconstructs the human body from real world mocap data.
    Image-to-Image Translation of Synthetic Samples for Rare Classes. (arXiv:2106.12212v1 [cs.CV])
    (2 min) The natural world is long-tailed: rare classes are observed orders of magnitudes less frequently than common ones, leading to highly-imbalanced data where rare classes can have only handfuls of examples. Learning from few examples is a known challenge for deep learning based classification algorithms, and is the focus of the field of low-shot learning. One potential approach to increase the training data for these rare classes is to augment the limited real data with synthetic samples. This has been shown to help, but the domain shift between real and synthetic hinders the approaches' efficacy when tested on real data. We explore the use of image-to-image translation methods to close the domain gap between synthetic and real imagery for animal species classification in data collected from camera traps: motion-activated static cameras used to monitor wildlife. We use low-level feature alignment between source and target domains to make synthetic data for a rare species generated using a graphics engine more "realistic". Compared against a system augmented with unaligned synthetic data, our experiments show a considerable decrease in classification error rates on a rare species.
    A Label Management Mechanism for Retinal Fundus Image Classification of Diabetic Retinopathy. (arXiv:2106.12284v1 [cs.CV])
    (2 min) Diabetic retinopathy (DR) remains the most prevalent cause of vision impairment and irreversible blindness in the working-age adults. Due to the renaissance of deep learning (DL), DL-based DR diagnosis has become a promising tool for the early screening and severity grading of DR. However, training deep neural networks (DNNs) requires an enormous amount of carefully labeled data. Noisy label data may be introduced when labeling plenty of data, degrading the performance of models. In this work, we propose a novel label management mechanism (LMM) for the DNN to overcome overfitting on the noisy data. LMM utilizes maximum posteriori probability (MAP) in the Bayesian statistic and time-weighted technique to selectively correct the labels of unclean data, which gradually purify the training data and improve classification performance. Comprehensive experiments on both synthetic noise data (Messidor \& our collected DR dataset) and real-world noise data (ANIMAL-10N) demonstrated that LMM could boost performance of models and is superior to three state-of-the-art methods.
    Region-Aware Network: Model Human's Top-Down Visual Perception Mechanism for Crowd Counting. (arXiv:2106.12163v1 [cs.CV])
    (2 min) Background noise and scale variation are common problems that have been long recognized in crowd counting. Humans glance at a crowd image and instantly know the approximate number of human and where they are through attention the crowd regions and the congestion degree of crowd regions with a global receptive filed. Hence, in this paper, we propose a novel feedback network with Region-Aware block called RANet by modeling human's Top-Down visual perception mechanism. Firstly, we introduce a feedback architecture to generate priority maps that provide prior about candidate crowd regions in input images. The prior enables the RANet pay more attention to crowd regions. Then we design Region-Aware block that could adaptively encode the contextual information into input images through global receptive field. More specifically, we scan the whole input images and its priority maps in the form of column vector to obtain a relevance matrix estimating their similarity. The relevance matrix obtained would be utilized to build global relationships between pixels. Our method outperforms state-of-the-art crowd counting methods on several public datasets.
    How Well do Feature Visualizations Support Causal Understanding of CNN Activations?. (arXiv:2106.12447v1 [cs.CV])
    (2 min) One widely used approach towards understanding the inner workings of deep convolutional neural networks is to visualize unit responses via activation maximization. Feature visualizations via activation maximization are thought to provide humans with precise information about the image features that cause a unit to be activated. If this is indeed true, these synthetic images should enable humans to predict the effect of an intervention, such as whether occluding a certain patch of the image (say, a dog's head) changes a unit's activation. Here, we test this hypothesis by asking humans to predict which of two square occlusions causes a larger change to a unit's activation. Both a large-scale crowdsourced experiment and measurements with experts show that on average, the extremely activating feature visualizations by Olah et al. (2017) indeed help humans on this task ($67 \pm 4\%$ accuracy; baseline performance without any visualizations is $60 \pm 3\%$). However, they do not provide any significant advantage over other visualizations (such as e.g. dataset samples), which yield similar performance ($66 \pm 3\%$ to $67 \pm 3\%$ accuracy). Taken together, we propose an objective psychophysical task to quantify the benefit of unit-level interpretability methods for humans, and find no evidence that feature visualizations provide humans with better "causal understanding" than simple alternative visualizations.
    Vision Permutator: A Permutable MLP-Like Architecture for Visual Recognition. (arXiv:2106.12368v1 [cs.CV])
    (2 min) In this paper, we present Vision Permutator, a conceptually simple and data efficient MLP-like architecture for visual recognition. By realizing the importance of the positional information carried by 2D feature representations, unlike recent MLP-like models that encode the spatial information along the flattened spatial dimensions, Vision Permutator separately encodes the feature representations along the height and width dimensions with linear projections. This allows Vision Permutator to capture long-range dependencies along one spatial direction and meanwhile preserve precise positional information along the other direction. The resulting position-sensitive outputs are then aggregated in a mutually complementing manner to form expressive representations of the objects of interest. We show that our Vision Permutators are formidable competitors to convolutional neural networks (CNNs) and vision transformers. Without the dependence on spatial convolutions or attention mechanisms, Vision Permutator achieves 81.5% top-1 accuracy on ImageNet without extra large-scale training data (e.g., ImageNet-22k) using only 25M learnable parameters, which is much better than most CNNs and vision transformers under the same model size constraint. When scaling up to 88M, it attains 83.2% top-1 accuracy. We hope this work could encourage research on rethinking the way of encoding spatial information and facilitate the development of MLP-like models. Code is available at https://github.com/Andrew-Qibin/VisionPermutator.
    Open Images V5 Text Annotation and Yet Another Mask Text Spotter. (arXiv:2106.12326v1 [cs.CV])
    (2 min) A large scale human-labeled dataset plays an important role in creating high quality deep learning models. In this paper we present text annotation for Open Images V5 dataset. To our knowledge it is the largest among publicly available manually created text annotations. Having this annotation we trained a simple Mask-RCNN-based network, referred as Yet Another Mask Text Spotter (YAMTS), which achieves competitive performance or even outperforms current state-of-the-art approaches in some cases on ICDAR2013, ICDAR2015 and Total-Text datasets. Code for text spotting model available online at: https://github.com/openvinotoolkit/training_extensions. The model can be exported to OpenVINO-format and run on Intel CPUs.
    Sentinel-1 and Sentinel-2 Spatio-Temporal Data Fusion for Clouds Removal. (arXiv:2106.12226v1 [cs.CV])
    (2 min) The abundance of clouds, located both spatially and temporally, often makes remote sensing applications with optical images difficult or even impossible. In this manuscript, a novel method for clouds-corrupted optical image restoration has been presented and developed, based on a joint data fusion paradigm, where three deep neural networks have been combined in order to fuse spatio-temporal features extracted from Sentinel-1 and Sentinel-2 time-series of data. It is worth highlighting that both the code and the dataset have been implemented from scratch and made available to interested research for further analysis and investigation.
    Co-advise: Cross Inductive Bias Distillation. (arXiv:2106.12378v1 [cs.CV])
    (2 min) Transformers recently are adapted from the community of natural language processing as a promising substitute of convolution-based neural networks for visual learning tasks. However, its supremacy degenerates given an insufficient amount of training data (e.g., ImageNet). To make it into practical utility, we propose a novel distillation-based method to train vision transformers. Unlike previous works, where merely heavy convolution-based teachers are provided, we introduce lightweight teachers with different architectural inductive biases (e.g., convolution and involution) to co-advise the student transformer. The key is that teachers with different inductive biases attain different knowledge despite that they are trained on the same dataset, and such different knowledge compounds and boosts the student's performance during distillation. Equipped with this cross inductive bias distillation method, our vision transformers (termed as CivT) outperform all previous transformers of the same architecture on ImageNet.
    Estimating the Robustness of Classification Models by the Structure of the Learned Feature-Space. (arXiv:2106.12303v1 [cs.CV])
    (2 min) Over the last decade, the development of deep image classification networks has mostly been driven by the search for the best performance in terms of classification accuracy on standardized benchmarks like ImageNet. More recently, this focus has been expanded by the notion of model robustness, i.e. the generalization abilities of models towards previously unseen changes in the data distribution. While new benchmarks, like ImageNet-C, have been introduced to measure robustness properties, we argue that fixed testsets are only able to capture a small portion of possible data variations and are thus limited and prone to generate new overfitted solutions. To overcome these drawbacks, we suggest to estimate the robustness of a model directly from the structure of its learned feature-space. We introduce robustness indicators which are obtained via unsupervised clustering of latent representations inside a trained classifier and show very high correlations to the model performance on corrupted test data.
    Generative Self-training for Cross-domain Unsupervised Tagged-to-Cine MRI Synthesis. (arXiv:2106.12499v1 [cs.CV])
    (2 min) Self-training based unsupervised domain adaptation (UDA) has shown great potential to address the problem of domain shift, when applying a trained deep learning model in a source domain to unlabeled target domains. However, while the self-training UDA has demonstrated its effectiveness on discriminative tasks, such as classification and segmentation, via the reliable pseudo-label selection based on the softmax discrete histogram, the self-training UDA for generative tasks, such as image synthesis, is not fully investigated. In this work, we propose a novel generative self-training (GST) UDA framework with continuous value prediction and regression objective for cross-domain image synthesis. Specifically, we propose to filter the pseudo-label with an uncertainty mask, and quantify the predictive confidence of generated images with practical variational Bayes learning. The fast test-time adaptation is achieved by a round-based alternative optimization scheme. We validated our framework on the tagged-to-cine magnetic resonance imaging (MRI) synthesis problem, where datasets in the source and target domains were acquired from different scanners or centers. Extensive validations were carried out to verify our framework against popular adversarial training UDA methods. Results show that our GST, with tagged MRI of test subjects in new target domains, improved the synthesis quality by a large margin, compared with the adversarial training UDA methods.
    Collaborative Visual Inertial SLAM for Multiple Smart Phones. (arXiv:2106.12186v1 [cs.RO])
    (2 min) The efficiency and accuracy of mapping are crucial in a large scene and long-term AR applications. Multi-agent cooperative SLAM is the precondition of multi-user AR interaction. The cooperation of multiple smart phones has the potential to improve efficiency and robustness of task completion and can complete tasks that a single agent cannot do. However, it depends on robust communication, efficient location detection, robust mapping, and efficient information sharing among agents. We propose a multi-intelligence collaborative monocular visual-inertial SLAM deployed on multiple ios mobile devices with a centralized architecture. Each agent can independently explore the environment, run a visual-inertial odometry module online, and then send all the measurement information to a central server with higher computing resources. The server manages all the information received, detects overlapping areas, merges and optimizes the map, and shares information with the agents when needed. We have verified the performance of the system in public datasets and real environments. The accuracy of mapping and fusion of the proposed system is comparable to VINS-Mono which requires higher computing resources.
    Fairness for Image Generation with Uncertain Sensitive Attributes. (arXiv:2106.12182v1 [cs.LG])
    (2 min) This work tackles the issue of fairness in the context of generative procedures, such as image super-resolution, which entail different definitions from the standard classification setting. Moreover, while traditional group fairness definitions are typically defined with respect to specified protected groups -- camouflaging the fact that these groupings are artificial and carry historical and political motivations -- we emphasize that there are no ground truth identities. For instance, should South and East Asians be viewed as a single group or separate groups? Should we consider one race as a whole or further split by gender? Choosing which groups are valid and who belongs in them is an impossible dilemma and being ``fair'' with respect to Asians may require being ``unfair'' with respect to South Asians. This motivates the introduction of definitions that allow algorithms to be \emph{oblivious} to the relevant groupings. We define several intuitive notions of group fairness and study their incompatibilities and trade-offs. We show that the natural extension of demographic parity is strongly dependent on the grouping, and \emph{impossible} to achieve obliviously. On the other hand, the conceptually new definition we introduce, Conditional Proportional Representation, can be achieved obliviously through Posterior Sampling. Our experiments validate our theoretical results and achieve fair image reconstruction using state-of-the-art generative models.
    Behavior Mimics Distribution: Combining Individual and Group Behaviors for Federated Learning. (arXiv:2106.12300v1 [cs.LG])
    (2 min) Federated Learning (FL) has become an active and promising distributed machine learning paradigm. As a result of statistical heterogeneity, recent studies clearly show that the performance of popular FL methods (e.g., FedAvg) deteriorates dramatically due to the client drift caused by local updates. This paper proposes a novel Federated Learning algorithm (called IGFL), which leverages both Individual and Group behaviors to mimic distribution, thereby improving the ability to deal with heterogeneity. Unlike existing FL methods, our IGFL can be applied to both client and server optimization. As a by-product, we propose a new attention-based federated learning in the server optimization of IGFL. To the best of our knowledge, this is the first time to incorporate attention mechanisms into federated optimization. We conduct extensive experiments and show that IGFL can significantly improve the performance of existing federated learning methods. Especially when the distributions of data among individuals are diverse, IGFL can improve the classification accuracy by about 13% compared with prior baselines.
    Learning from Pseudo Lesion: A Self-supervised Framework for COVID-19 Diagnosis. (arXiv:2106.12313v1 [eess.IV])
    (2 min) The Coronavirus disease 2019 (COVID-19) has rapidly spread all over the world since its first report in December 2019 and thoracic computed tomography (CT) has become one of the main tools for its diagnosis. In recent years, deep learning-based approaches have shown impressive performance in myriad image recognition tasks. However, they usually require a large number of annotated data for training. Inspired by Ground Glass Opacity (GGO), a common finding in COIVD-19 patient's CT scans, we proposed in this paper a novel self-supervised pretraining method based on pseudo lesions generation and restoration for COVID-19 diagnosis. We used Perlin noise, a gradient noise based mathematical model, to generate lesion-like patterns, which were then randomly pasted to the lung regions of normal CT images to generate pseudo COVID-19 images. The pairs of normal and pseudo COVID-19 images were then used to train an encoder-decoder architecture based U-Net for image restoration, which does not require any labelled data. The pretrained encoder was then fine-tuned using labelled data for COVID-19 diagnosis task. Two public COVID-19 diagnosis datasets made up of CT images were employed for evaluation. Comprehensive experimental results demonstrated that the proposed self-supervised learning approach could extract better feature representation for COVID-19 diagnosis and the accuracy of the proposed method outperformed the supervised model pretrained on large scale images by 6.57% and 3.03% on SARS-CoV-2 dataset and Jinan COVID-19 dataset, respectively.
    STRESS: Super-Resolution for Dynamic Fetal MRI using Self-Supervised Learning. (arXiv:2106.12407v1 [eess.IV])
    (2 min) Fetal motion is unpredictable and rapid on the scale of conventional MR scan times. Therefore, dynamic fetal MRI, which aims at capturing fetal motion and dynamics of fetal function, is limited to fast imaging techniques with compromises in image quality and resolution. Super-resolution for dynamic fetal MRI is still a challenge, especially when multi-oriented stacks of image slices for oversampling are not available and high temporal resolution for recording the dynamics of the fetus or placenta is desired. Further, fetal motion makes it difficult to acquire high-resolution images for supervised learning methods. To address this problem, in this work, we propose STRESS (Spatio-Temporal Resolution Enhancement with Simulated Scans), a self-supervised super-resolution framework for dynamic fetal MRI with interleaved slice acquisitions. Our proposed method simulates an interleaved slice acquisition along the high-resolution axis on the originally acquired data to generate pairs of low- and high-resolution images. Then, it trains a super-resolution network by exploiting both spatial and temporal correlations in the MR time series, which is used to enhance the resolution of the original data. Evaluations on both simulated and in utero data show that our proposed method outperforms other self-supervised super-resolution methods and improves image quality, which is beneficial to other downstream tasks and evaluations.
    Multiband VAE: Latent Space Partitioning for Knowledge Consolidation in Continual Learning. (arXiv:2106.12196v1 [cs.LG])
    (2 min) We propose a new method for unsupervised continual knowledge consolidation in generative models that relies on the partitioning of Variational Autoencoder's latent space. Acquiring knowledge about new data samples without forgetting previous ones is a critical problem of continual learning. Currently proposed methods achieve this goal by extending the existing model while constraining its behavior not to degrade on the past data, which does not exploit the full potential of relations within the entire training dataset. In this work, we identify this limitation and posit the goal of continual learning as a knowledge accumulation task. We solve it by continuously re-aligning latent space partitions that we call bands which are representations of samples seen in different tasks, driven by the similarity of the information they contain. In addition, we introduce a simple yet effective method for controlled forgetting of past data that improves the quality of reconstructions encoded in latent bands and a latent space disentanglement technique that improves knowledge consolidation. On top of the standard continual learning evaluation benchmarks, we evaluate our method on a new knowledge consolidation scenario and show that the proposed approach outperforms state-of-the-art by up to twofold across all testing scenarios.
    Instance-based Vision Transformer for Subtyping of Papillary Renal Cell Carcinoma in Histopathological Image. (arXiv:2106.12265v1 [cs.CV])
    (2 min) Histological subtype of papillary (p) renal cell carcinoma (RCC), type 1 vs. type 2, is an essential prognostic factor. The two subtypes of pRCC have a similar pattern, i.e., the papillary architecture, yet some subtle differences, including cellular and cell-layer level patterns. However, the cellular and cell-layer level patterns almost cannot be captured by existing CNN-based models in large-size histopathological images, which brings obstacles to directly applying these models to such a fine-grained classification task. This paper proposes a novel instance-based Vision Transformer (i-ViT) to learn robust representations of histopathological images for the pRCC subtyping task by extracting finer features from instance patches (by cropping around segmented nuclei and assigning predicted grades). The proposed i-ViT takes top-K instances as input and aggregates them for capturing both the cellular and cell-layer level patterns by a position-embedding layer, a grade-embedding layer, and a multi-head multi-layer self-attention module. To evaluate the performance of the proposed framework, experienced pathologists are invited to selected 1162 regions of interest from 171 whole slide images of type 1 and type 2 pRCC. Experimental results show that the proposed method achieves better performance than existing CNN-based models with a significant margin.
    A new Video Synopsis Based Approach Using Stereo Camera. (arXiv:2106.12362v1 [cs.CV])
    (2 min) In today's world, the amount of data produced in every field has increased at an unexpected level. In the face of increasing data, the importance of data processing has increased remarkably. Our resource topic is on the processing of video data, which has an important place in increasing data, and the production of summary videos. Within the scope of this resource, a new method for anomaly detection with object-based unsupervised learning has been developed while creating a video summary. By using this method, the video data is processed as pixels and the result is produced as a video segment. The process flow can be briefly summarized as follows. Objects on the video are detected according to their type, and then they are tracked. Then, the tracking history data of the objects are processed, and the classifier is trained with the object type. Thanks to this classifier, anomaly behavior of objects is detected. Video segments are determined by processing video moments containing anomaly behaviors. The video summary is created by extracting the detected video segments from the original video and combining them. The model we developed has been tested and verified separately for single camera and dual camera systems.
    Neural Fashion Image Captioning : Accounting for Data Diversity. (arXiv:2106.12154v1 [cs.CV])
    (2 min) Image captioning has increasingly large domains of application, and fashion is not an exception. Having automatic item descriptions is of great interest for fashion web platforms hosting sometimes hundreds of thousands of images. This paper is one of the first tackling image captioning for fashion images. To contribute addressing dataset diversity issues, we introduced the InFashAIv1 dataset containing almost 16.000 African fashion item images with their titles, prices and general descriptions. We also used the well known DeepFashion dataset in addition to InFashAIv1. Captions are generated using the \textit{Show and Tell} model made of CNN encoder and RNN Decoder. We showed that jointly training the model on both datasets improves captions quality for African style fashion images, suggesting a transfer learning from Western style data. The InFashAIv1 dataset is released on \href{https://github.com/hgilles06/infashai}{Github} to encourage works with more diversity inclusion.
    Real-time Instance Segmentation with Discriminative Orientation Maps. (arXiv:2106.12204v1 [cs.CV])
    (2 min) Although instance segmentation has made considerable advancement over recent years, it's still a challenge to design high accuracy algorithms with real-time performance. In this paper, we propose a real-time instance segmentation framework termed OrienMask. Upon the one-stage object detector YOLOv3, a mask head is added to predict some discriminative orientation maps, which are explicitly defined as spatial offset vectors for both foreground and background pixels. Thanks to the discrimination ability of orientation maps, masks can be recovered without the need for extra foreground segmentation. All instances that match with the same anchor size share a common orientation map. This special sharing strategy reduces the amortized memory utilization for mask predictions but without loss of mask granularity. Given the surviving box predictions after NMS, instance masks can be concurrently constructed from the corresponding orientation maps with low complexity. Owing to the concise design for mask representation and its effective integration with the anchor-based object detector, our method is qualified under real-time conditions while maintaining competitive accuracy. Experiments on COCO benchmark show that OrienMask achieves 34.8 mask AP at the speed of 42.7 fps evaluated with a single RTX 2080 Ti. The code is available at https://github.com/duwt/OrienMask.
    Mutual-Information Based Few-Shot Classification. (arXiv:2106.12252v1 [cs.CV])
    (2 min) We introduce Transductive Infomation Maximization (TIM) for few-shot learning. Our method maximizes the mutual information between the query features and their label predictions for a given few-shot task, in conjunction with a supervision loss based on the support set. We motivate our transductive loss by deriving a formal relation between the classification accuracy and mutual-information maximization. Furthermore, we propose a new alternating-direction solver, which substantially speeds up transductive inference over gradient-based optimization, while yielding competitive accuracy. We also provide a convergence analysis of our solver based on Zangwill's theory and bound-optimization arguments. TIM inference is modular: it can be used on top of any base-training feature extractor. Following standard transductive few-shot settings, our comprehensive experiments demonstrate that TIM outperforms state-of-the-art methods significantly across various datasets and networks, while used on top of a fixed feature extractor trained with simple cross-entropy on the base classes, without resorting to complex meta-learning schemes. It consistently brings between 2 % and 5 % improvement in accuracy over the best performing method, not only on all the well-established few-shot benchmarks but also on more challenging scenarios, with random tasks, domain shift and larger numbers of classes, as in the recently introduced META-DATASET. Our code is publicly available at https://github.com/mboudiaf/TIM. We also publicly release a standalone PyTorch implementation of META-DATASET, along with additional benchmarking results, at https://github.com/mboudiaf/pytorch-meta-dataset.
    Reachability Analysis of Convolutional Neural Networks. (arXiv:2106.12074v1 [cs.CV])
    (2 min) Deep convolutional neural networks have been widely employed as an effective technique to handle complex and practical problems. However, one of the fundamental problems is the lack of formal methods to analyze their behavior. To address this challenge, we propose an approach to compute the exact reachable sets of a network given an input domain, where the reachable set is represented by the face lattice structure. Besides the computation of reachable sets, our approach is also capable of backtracking to the input domain given an output reachable set. Therefore, a full analysis of a network's behavior can be realized. In addition, an approach for fast analysis is also introduced, which conducts fast computation of reachable sets by considering selected sensitive neurons in each layer. The exact pixel-level reachability analysis method is evaluated on a CNN for the CIFAR10 dataset and compared to related works. The fast analysis method is evaluated over a CNN CIFAR10 dataset and VGG16 architecture for the ImageNet dataset.
    Vision-based Behavioral Recognition of Novelty Preference in Pigs. (arXiv:2106.12181v1 [cs.CV])
    (2 min) Behavioral scoring of research data is crucial for extracting domain-specific metrics but is bottlenecked on the ability to analyze enormous volumes of information using human labor. Deep learning is widely viewed as a key advancement to relieve this bottleneck. We identify one such domain, where deep learning can be leveraged to alleviate the process of manual scoring. Novelty preference paradigms have been widely used to study recognition memory in pigs, but analysis of these videos requires human intervention. We introduce a subset of such videos in the form of the 'Pig Novelty Preference Behavior' (PNPB) dataset that is fully annotated with pig actions and keypoints. In order to demonstrate the application of state-of-the-art action recognition models on this dataset, we compare LRCN, C3D, and TSM on the basis of various analytical metrics and discuss common pitfalls of the models. Our methods achieve an accuracy of 93% and a mean Average Precision of 96% in estimating piglet behavior. We open-source our code and annotated dataset at https://github.com/AIFARMS/NOR-behavior-recognition
    Deformed2Self: Self-Supervised Denoising for Dynamic Medical Imaging. (arXiv:2106.12175v1 [eess.IV])
    (2 min) Image denoising is of great importance for medical imaging system, since it can improve image quality for disease diagnosis and downstream image analyses. In a variety of applications, dynamic imaging techniques are utilized to capture the time-varying features of the subject, where multiple images are acquired for the same subject at different time points. Although signal-to-noise ratio of each time frame is usually limited by the short acquisition time, the correlation among different time frames can be exploited to improve denoising results with shared information across time frames. With the success of neural networks in computer vision, supervised deep learning methods show prominent performance in single-image denoising, which rely on large datasets with clean-vs-noisy image pairs. Recently, several self-supervised deep denoising models have been proposed, achieving promising results without needing the pairwise ground truth of clean images. In the field of multi-image denoising, however, very few works have been done on extracting correlated information from multiple slices for denoising using self-supervised deep learning methods. In this work, we propose Deformed2Self, an end-to-end self-supervised deep learning framework for dynamic imaging denoising. It combines single-image and multi-image denoising to improve image quality and use a spatial transformer network to model motion between different slices. Further, it only requires a single noisy image with a few auxiliary observations at different time frames for training and inference. Evaluations on phantom and in vivo data with different noise statistics show that our method has comparable performance to other state-of-the-art unsupervised or self-supervised denoising methods and outperforms under high noise levels.
    A Review of Assistive Technologies for Activities of Daily Living of Elderly. (arXiv:2106.12183v1 [cs.HC])
    (2 min) One of the distinct features of this century has been the population of older adults which has been on a constant rise. Elderly people have several needs and requirements due to physical disabilities, cognitive issues, weakened memory and disorganized behavior, that they face with increasing age. The extent of these limitations also differs according to the varying diversities in elderly, which include age, gender, background, experience, skills, knowledge and so on. These varying needs and challenges with increasing age, limits abilities of older adults to perform Activities of Daily Living (ADLs) in an independent manner. To add to it, the shortage of caregivers creates a looming need for technology-based services for elderly people, to assist them in performing their daily routine tasks to sustain their independent living and active aging. To address these needs, this work consists of making three major contributions in this field. First, it provides a rather comprehensive review of assisted living technologies aimed at helping elderly people to perform ADLs. Second, the work discusses the challenges identified through this review, that currently exist in the context of implementation of assisted living services for elderly care in Smart Homes and Smart Cities. Finally, the work also outlines an approach for implementation, extension and integration of the existing works in this field for development of a much-needed framework that can provide personalized assistance and user-centered behavior interventions to elderly as per their varying and ever-changing needs.
    PatentNet: A Large-Scale Incomplete Multiview, Multimodal, Multilabel Industrial Goods Image Database. (arXiv:2106.12139v1 [cs.CV])
    (2 min) In deep learning area, large-scale image datasets bring a breakthrough in the success of object recognition and retrieval. Nowadays, as the embodiment of innovation, the diversity of the industrial goods is significantly larger, in which the incomplete multiview, multimodal and multilabel are different from the traditional dataset. In this paper, we introduce an industrial goods dataset, namely PatentNet, with numerous highly diverse, accurate and detailed annotations of industrial goods images, and corresponding texts. In PatentNet, the images and texts are sourced from design patent. Within over 6M images and corresponding texts of industrial goods labeled manually checked by professionals, PatentNet is the first ongoing industrial goods image database whose varieties are wider than industrial goods datasets used previously for benchmarking. PatentNet organizes millions of images into 32 classes and 219 subclasses based on the Locarno Classification Agreement. Through extensive experiments on image classification, image retrieval and incomplete multiview clustering, we demonstrate that our PatentNet is much more diverse, complex, and challenging, enjoying higher potentials than existing industrial image datasets. Furthermore, the characteristics of incomplete multiview, multimodal and multilabel in PatentNet are able to offer unparalleled opportunities in the artificial intelligence community and beyond.
    P2T: Pyramid Pooling Transformer for Scene Understanding. (arXiv:2106.12011v1 [cs.CV])
    (2 min) This paper jointly resolves two problems in vision transformer: i) the computation of Multi-Head Self-Attention (MHSA) has high computational/space complexity; ii) recent vision transformer networks are overly tuned for image classification, ignoring the difference between image classification (simple scenarios, more similar to NLP) and downstream scene understanding tasks (complicated scenarios, rich structural and contextual information). To this end, we note that pyramid pooling has been demonstrated to be effective in various vision tasks owing to its powerful context abstraction, and its natural property of spatial invariance is suitable to address the loss of structural information (problem ii)). Hence, we propose to adapt pyramid pooling to MHSA for alleviating its high requirement on computational resources (problem i)). In this way, this pooling-based MHSA can well address the above two problems and is thus flexible and powerful for downstream scene understanding tasks. Plugged with our pooling-based MHSA, we build a downstream-task-oriented transformer network, dubbed Pyramid Pooling Transformer (P2T). Extensive experiments demonstrate that, when applied P2T as the backbone network, it shows substantial superiority in various downstream scene understanding tasks such as semantic segmentation, object detection, instance segmentation, and visual saliency detection, compared to previous CNN- and transformer-based networks. The code will be released at https://github.com/yuhuan-wu/P2T. Note that this technical report will keep updating.
    LegoFormer: Transformers for Block-by-Block Multi-view 3D Reconstruction. (arXiv:2106.12102v1 [cs.CV])
    (2 min) Most modern deep learning-based multi-view 3D reconstruction techniques use RNNs or fusion modules to combine information from multiple images after encoding them. These two separate steps have loose connections and do not consider all available information while encoding each view. We propose LegoFormer, a transformer-based model that unifies object reconstruction under a single framework and parametrizes the reconstructed occupancy grid by its decomposition factors. This reformulation allows the prediction of an object as a set of independent structures then aggregated to obtain the final reconstruction. Experiments conducted on ShapeNet display the competitive performance of our network with respect to the state-of-the-art methods. We also demonstrate how the use of self-attention leads to increased interpretability of the model output.
    APNN-TC: Accelerating Arbitrary Precision Neural Networks on Ampere GPU Tensor Cores. (arXiv:2106.12169v1 [cs.DC])
    (2 min) Over the years, accelerating neural networks with quantization has been widely studied. Unfortunately, prior efforts with diverse precisions (e.g., 1-bit weights and 2-bit activations) are usually restricted by limited precision support on GPUs (e.g., int1 and int4). To break such restrictions, we introduce the first Arbitrary Precision Neural Network framework (APNN-TC) to fully exploit quantization benefits on Ampere GPU Tensor Cores. Specifically, APNN-TC first incorporates a novel emulation algorithm to support arbitrary short bit-width computation with int1 compute primitives and XOR/AND Boolean operations. Second, APNN-TC integrates arbitrary precision layer designs to efficiently map our emulation algorithm to Tensor Cores with novel batching strategies and specialized memory organization. Third, APNN-TC embodies a novel arbitrary precision NN design to minimize memory access across layers and further improve performance. Extensive evaluations show that APNN-TC can achieve significant speedup over CUTLASS kernels and various NN models, such as ResNet and VGG.
    Exploiting Negative Learning for Implicit Pseudo Label Rectification in Source-Free Domain Adaptive Semantic Segmentation. (arXiv:2106.12123v1 [cs.CV])
    (2 min) It is desirable to transfer the knowledge stored in a well-trained source model onto non-annotated target domain in the absence of source data. However, state-of-the-art methods for source free domain adaptation (SFDA) are subject to strict limits: 1) access to internal specifications of source models is a must; and 2) pseudo labels should be clean during self-training, making critical tasks relying on semantic segmentation unreliable. Aiming at these pitfalls, this study develops a domain adaptive solution to semantic segmentation with pseudo label rectification (namely \textit{PR-SFDA}), which operates in two phases: 1) \textit{Confidence-regularized unsupervised learning}: Maximum squares loss applies to regularize the target model to ensure the confidence in prediction; and 2) \textit{Noise-aware pseudo label learning}: Negative learning enables tolerance to noisy pseudo labels in training, meanwhile positive learning achieves fast convergence. Extensive experiments have been performed on domain adaptive semantic segmentation benchmark, \textit{GTA5 $\to$ Cityscapes}. Overall, \textit{PR-SFDA} achieves a performance of 49.0 mIoU, which is very close to that of the state-of-the-art counterparts. Note that the latter demand accesses to the source model's internal specifications, whereas the \textit{PR-SFDA} solution needs none as a sharp contrast.
    Bootstrap Representation Learning for Segmentation on Medical Volumes and Sequences. (arXiv:2106.12153v1 [cs.CV])
    (2 min) In this work, we propose a novel straightforward method for medical volume and sequence segmentation with limited annotations. To avert laborious annotating, the recent success of self-supervised learning(SSL) motivates the pre-training on unlabeled data. Despite its success, it is still challenging to adapt typical SSL methods to volume/sequence segmentation, due to their lack of mining on local semantic discrimination and rare exploitation on volume and sequence structures. Based on the continuity between slices/frames and the common spatial layout of organs across volumes/sequences, we introduced a novel bootstrap self-supervised representation learning method by leveraging the predictable possibility of neighboring slices. At the core of our method is a simple and straightforward dense self-supervision on the predictions of local representations and a strategy of predicting locals based on global context, which enables stable and reliable supervision for both global and local representation mining among volumes. Specifically, we first proposed an asymmetric network with an attention-guided predictor to enforce distance-specific prediction and supervision on slices within and across volumes/sequences. Secondly, we introduced a novel prototype-based foreground-background calibration module to enhance representation consistency. The two parts are trained jointly on labeled and unlabeled data. When evaluated on three benchmark datasets of medical volumes and sequences, our model outperforms existing methods with a large margin of 4.5\% DSC on ACDC, 1.7\% on Prostate, and 2.3\% on CAMUS. Intensive evaluations reveals the effectiveness and superiority of our method.
    Team PyKale (xy9) Submission to the EPIC-Kitchens 2021 Unsupervised Domain Adaptation Challenge for Action Recognition. (arXiv:2106.12023v1 [cs.CV])
    (2 min) This report describes the technical details of our submission to the EPIC-Kitchens 2021 Unsupervised Domain Adaptation Challenge for Action Recognition. The EPIC-Kitchens dataset is more difficult than other video domain adaptation datasets due to multi-tasks with more modalities. Firstly, to participate in the challenge, we employ a transformer to capture the spatial information from each modality. Secondly, we employ a temporal attention module to model temporal-wise inter-dependency. Thirdly, we employ the adversarial domain adaptation network to learn the general features between labeled source and unlabeled target domain. Finally, we incorporate multiple modalities to improve the performance by a three-stream network with late fusion. Our network achieves the comparable performance with the state-of-the-art baseline T$A^3$N and outperforms the baseline on top-1 accuracy for verb class and top-5 accuracies for all three tasks which are verb, noun and action. Under the team name xy9, our submission achieved 5th place in terms of top-1 accuracy for verb class and all top-5 accuracies.
    Towards Consistent Predictive Confidence through Fitted Ensembles. (arXiv:2106.12070v1 [cs.LG])
    (2 min) Deep neural networks are behind many of the recent successes in machine learning applications. However, these models can produce overconfident decisions while encountering out-of-distribution (OOD) examples or making a wrong prediction. This inconsistent predictive confidence limits the integration of independently-trained learning models into a larger system. This paper introduces separable concept learning framework to realistically measure the performance of classifiers in presence of OOD examples. In this setup, several instances of a classifier are trained on different parts of a partition of the set of classes. Later, the performance of the combination of these models is evaluated on a separate test set. Unlike current OOD detection techniques, this framework does not require auxiliary OOD datasets and does not separate classification from detection performance. Furthermore, we present a new strong baseline for more consistent predictive confidence in deep models, called fitted ensembles, where overconfident predictions are rectified by transformed versions of the original classification task. Fitted ensembles can naturally detect OOD examples without requiring auxiliary data by observing contradicting predictions among its components. Experiments on MNIST, SVHN, CIFAR-10/100, and ImageNet show fitted ensemble significantly outperform conventional ensembles on OOD examples and are possible to scale.
    Volume Rendering of Neural Implicit Surfaces. (arXiv:2106.12052v1 [cs.CV])
    (2 min) Neural volume rendering became increasingly popular recently due to its success in synthesizing novel views of a scene from a sparse set of input images. So far, the geometry learned by neural volume rendering techniques was modeled using a generic density function. Furthermore, the geometry itself was extracted using an arbitrary level set of the density function leading to a noisy, often low fidelity reconstruction. The goal of this paper is to improve geometry representation and reconstruction in neural volume rendering. We achieve that by modeling the volume density as a function of the geometry. This is in contrast to previous work modeling the geometry as a function of the volume density. In more detail, we define the volume density function as Laplace's cumulative distribution function (CDF) applied to a signed distance function (SDF) representation. This simple density representation has three benefits: (i) it provides a useful inductive bias to the geometry learned in the neural volume rendering process; (ii) it facilitates a bound on the opacity approximation error, leading to an accurate sampling of the viewing ray. Accurate sampling is important to provide a precise coupling of geometry and radiance; and (iii) it allows efficient unsupervised disentanglement of shape and appearance in volume rendering. Applying this new density representation to challenging scene multiview datasets produced high quality geometry reconstructions, outperforming relevant baselines. Furthermore, switching shape and appearance between scenes is possible due to the disentanglement of the two.
    On Matrix Factorizations in Subspace Clustering. (arXiv:2106.12016v1 [cs.CV])
    (2 min) This article explores subspace clustering algorithms using CUR decompositions, and examines the effect of various hyperparameters in these algorithms on clustering performance on two real-world benchmark datasets, the Hopkins155 motion segmentation dataset and the Yale face dataset. Extensive experiments are done for a variety of sampling methods and oversampling parameters for these datasets, and some guidelines for parameter choices are given for practical applications.
    The Neurally-Guided Shape Parser: A Monte Carlo Method for Hierarchical Labeling of Over-segmented 3D Shapes. (arXiv:2106.12026v1 [cs.CV])
    (2 min) Many learning-based 3D shape semantic segmentation methods assign labels to shape atoms (e.g. points in a point cloud or faces in a mesh) with a single-pass approach trained in an end-to-end fashion. Such methods achieve impressive performance but require large amounts of labeled training data. This paradigm entangles two separable subproblems: (1) decomposing a shape into regions and (2) assigning semantic labels to these regions. We claim that disentangling these subproblems reduces the labeled data burden: (1) region decomposition requires no semantic labels and could be performed in an unsupervised fashion, and (2) labeling shape regions instead of atoms results in a smaller search space and should be learnable with less labeled training data. In this paper, we investigate this second claim by presenting the Neurally-Guided Shape Parser (NGSP), a method that learns how to assign semantic labels to regions of an over-segmented 3D shape. We solve this problem via MAP inference, modeling the posterior probability of a labeling assignment conditioned on an input shape. We employ a Monte Carlo importance sampling approach guided by a neural proposal network, a search-based approach made feasible by assuming the input shape is decomposed into discrete regions. We evaluate NGSP on the task of hierarchical semantic segmentation on manufactured 3D shapes from PartNet. We find that NGSP delivers significant performance improvements over baselines that learn to label shape atoms and then aggregate predictions for each shape region, especially in low-data regimes. Finally, we demonstrate that NGSP is robust to region granularity, as it maintains strong segmentation performance even as the regions undergo significant corruption.
    Transfer Learning of Deep Spatiotemporal Networks to Model Arbitrarily Long Videos of Seizures. (arXiv:2106.12014v1 [cs.CV])
    (2 min) Detailed analysis of seizure semiology, the symptoms and signs which occur during a seizure, is critical for management of epilepsy patients. Inter-rater reliability using qualitative visual analysis is often poor for semiological features. Therefore, automatic and quantitative analysis of video-recorded seizures is needed for objective assessment. We present GESTURES, a novel architecture combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs) to learn deep representations of arbitrarily long videos of epileptic seizures. We use a spatiotemporal CNN (STCNN) pre-trained on large human action recognition (HAR) datasets to extract features from short snippets (approx. 0.5 s) sampled from seizure videos. We then train an RNN to learn seizure-level representations from the sequence of features. We curated a dataset of seizure videos from 68 patients and evaluated GESTURES on its ability to classify seizures into focal onset seizures (FOSs) (N = 106) vs. focal to bilateral tonic-clonic seizures (TCSs) (N = 77), obtaining an accuracy of 98.9% using bidirectional long short-term memory (BLSTM) units. We demonstrate that an STCNN trained on a HAR dataset can be used in combination with an RNN to accurately represent arbitrarily long videos of seizures. GESTURES can provide accurate seizure classification by modeling sequences of semiologies.
    Automatic Head Overcoat Thickness Measure with NASNet-Large-Decoder Net. (arXiv:2106.12054v1 [cs.CV])
    (2 min) Transmission electron microscopy (TEM) is one of the primary tools to show microstructural characterization of materials as well as film thickness. However, manual determination of film thickness from TEM images is time-consuming as well as subjective, especially when the films in question are very thin and the need for measurement precision is very high. Such is the case for head overcoat (HOC) thickness measurements in the magnetic hard disk drive industry. It is therefore necessary to develop software to automatically measure HOC thickness. In this paper, for the first time, we propose a HOC layer segmentation method using NASNet-Large as an encoder and then followed by a decoder architecture, which is one of the most commonly used architectures in deep learning for image segmentation. To further improve segmentation results, we are the first to propose a post-processing layer to remove irrelevant portions in the segmentation result. To measure the thickness of the segmented HOC layer, we propose a regressive convolutional neural network (RCNN) model as well as orthogonal thickness calculation methods. Experimental results demonstrate a higher dice score for our model which has lower mean squared error and outperforms current state-of-the-art manual measurement.
    Listen to Your Favorite Melodies with img2Mxml, Producing MusicXML from Sheet Music Image by Measure-based Multimodal Deep Learning-driven Assembly. (arXiv:2106.12037v1 [cs.CV])
    (2 min) Deep learning has recently been applied to optical music recognition (OMR). However, currently OMR processing from various sheet music images still lacks precision to be widely applicable. Here, we present an MMdA (Measure-based Multimodal deep learning (DL)-driven Assembly) method allowing for end-to-end OMR processing from various images including inclined photo images. Using this method, measures are extracted by a deep learning model, aligned, and resized to be used for inference of given musical symbol components by using multiple deep learning models in sequence or in parallel. Use of each standardized measure enables efficient training of the models and accurate adjustment of five staff lines in each measure. Multiple musical symbol component category models with a small number of feature types can represent a diverse set of notes and other musical symbols including chords. This MMdA method provides a solution to end-to-end OMR processing with precision.
  • cs.IR updates on arXiv.org

    STEP-EZ: Syntax Tree guided semantic ExPlanation for Explainable Zero-shot modeling of clinical depression symptoms from text. (arXiv:2106.10928v2 [cs.CL] UPDATED)
    (2 min) We focus on exploring various approaches of Zero-Shot Learning (ZSL) and their explainability for a challenging yet important supervised learning task notorious for training data scarcity, i.e. Depression Symptoms Detection (DSD) from text. We start with a comprehensive synthesis of different components of our ZSL modeling and analysis of our ground truth samples and Depression symptom clues curation process with the help of a practicing clinician. We next analyze the accuracy of various state-of-the-art ZSL models and their potential enhancements for our task. Further, we sketch a framework for the use of ZSL for hierarchical text-based explanation mechanism, which we call, Syntax Tree-Guided Semantic Explanation (STEP). Finally, we summarize experiments from which we conclude that we can use ZSL models and achieve reasonable accuracy and explainability, measured by a proposed Explainability Index (EI). This work is, to our knowledge, the first work to exhaustively explore the efficacy of ZSL models for DSD task, both in terms of accuracy and explainability.
    A Graph-based Method for Session-based Recommendations. (arXiv:2106.12085v1 [cs.IR])
    (2 min) We present a graph-based approach for the data management tasks and the efficient operation of a system for session-based next-item recommendations. The proposed method can collect data continuously and incrementally from an ecommerce web site, thus seemingly prepare the necessary data infrastructure for the recommendation algorithm to operate without any excessive training phase. Our work aims at developing a recommender method that represents a balance between data processing and management efficiency requirements and the effectiveness of the recommendations produced. We use the Neo4j graph database to implement a prototype of such a system. Furthermore, we use an industry dataset corresponding to a typical e-commerce session-based scenario, and we report on experiments using our graph-based approach and other state-of-the-art machine learning and deep learning methods.
    Learnt Sparsity for Effective and Interpretable Document Ranking. (arXiv:2106.12460v1 [cs.IR])
    (2 min) Machine learning models for the ad-hoc retrieval of documents and passages have recently shown impressive improvements due to better language understanding using large pre-trained language models. However, these over-parameterized models are inherently non-interpretable and do not provide any information on the parts of the documents that were used to arrive at a certain prediction. In this paper we introduce the select and rank paradigm for document ranking, where interpretability is explicitly ensured when scoring longer documents. Specifically, we first select sentences in a document based on the input query and then predict the query-document score based only on the selected sentences, acting as an explanation. We treat sentence selection as a latent variable trained jointly with the ranker from the final output. We conduct extensive experiments to demonstrate that our inherently interpretable select-and-rank approach is competitive in comparison to other state-of-the-art methods and sometimes even outperforms them. This is due to our novel end-to-end training approach based on weighted reservoir sampling that manages to train the selector despite the stochastic sentence selection. We also show that our sentence selection approach can be used to provide explanations for models that operate on only parts of the document, such as BERT.
    BERT Goes Shopping: Comparing Distributional Models for Product Representations. (arXiv:2012.09807v2 [cs.CL] UPDATED)
    (2 min) Word embeddings (e.g., word2vec) have been applied successfully to eCommerce products through~\textit{prod2vec}. Inspired by the recent performance improvements on several NLP tasks brought by contextualized embeddings, we propose to transfer BERT-like architectures to eCommerce: our model -- ~\textit{Prod2BERT} -- is trained to generate representations of products through masked session modeling. Through extensive experiments over multiple shops, different tasks, and a range of design choices, we systematically compare the accuracy of~\textit{Prod2BERT} and~\textit{prod2vec} embeddings: while~\textit{Prod2BERT} is found to be superior in several scenarios, we highlight the importance of resources and hyperparameters in the best performing models. Finally, we provide guidelines to practitioners for training embeddings under a variety of computational and data constraints.
    A Novel Approach to Detect Redundant Activity Labels For More Representative Event Logs. (arXiv:2103.16061v2 [cs.DB] UPDATED)
    (2 min) The insights revealed from process mining heavily rely on the quality of event logs. Activities extracted from healthcare information systems with the free-text nature may lead to inconsistent labels. Such inconsistency would then lead to redundancy of activity labels, which refer to labels that have different syntax but share the same behaviours. The identifications of these labels from data-driven process discovery are difficult and rely heavily on resource-intensive human review. Existing work achieves low accuracy either redundant activity labels are in low occurrence frequency or the existence of numerical data values as attributes in event logs. However, these phenomena are commonly observed in healthcare information systems. In this paper, we propose an approach to detect redundant activity labels using control-flow relations and numerical data values from event logs. Natural Language Processing is also integrated into our method to assess semantic similarity between labels, which provides users with additional insights. We have evaluated our approach through synthetic logs generated from the real-life Sepsis log and a case study using the MIMIC-III data set. The results demonstrate that our approach can successfully detect redundant activity labels. This approach can add value to the preprocessing step to generate more representative event logs for process mining tasks in the healthcare domain.
    GraphConfRec: A Graph Neural Network-Based Conference Recommender System. (arXiv:2106.12340v1 [cs.IR])
    (2 min) In today's academic publishing model, especially in Computer Science, conferences commonly constitute the main platforms for releasing the latest peer-reviewed advancements in their respective fields. However, choosing a suitable academic venue for publishing one's research can represent a challenging task considering the plethora of available conferences, particularly for those at the start of their academic careers, or for those seeking to publish outside of their usual domain. In this paper, we propose GraphConfRec, a conference recommender system which combines SciGraph and graph neural networks, to infer suggestions based not only on title and abstract, but also on co-authorship and citation relationships. GraphConfRec achieves a recall@10 of up to 0.580 and a MAP of up to 0.336 with a graph attention network-based recommendation model. A user study with 25 subjects supports the positive results.
    Diversity-Robust Acoustic Feature Signatures Based on Multiscale Fractal Dimension for Similarity Search of Environmental Sounds. (arXiv:2102.02964v2 [cs.SD] UPDATED)
    (2 min) This paper proposes new acoustic feature signatures based on the multiscale fractal dimension (MFD), which are robust against the diversity of environmental sounds, for the content-based similarity search. The diversity of sound sources and acoustic compositions is a typical feature of environmental sounds. Several acoustic features have been proposed for environmental sounds. Among them is the widely-used Mel-Frequency Cepstral Coefficients (MFCCs), which describes frequency-domain features. However, in addition to these features in the frequency domain, environmental sounds have other important features in the time domain with various time scales. In our previous paper, we proposed enhanced multiscale fractal dimension signature (EMFD) for environmental sounds. This paper extends EMFD by using the kernel density estimation method, which results in better performance of the similarity search tasks. Furthermore, it newly proposes another acoustic feature signature based on MFD, namely very-long-range multiscale fractal dimension signature (MFD-VL). The MFD-VL signature describes several features of the time-varying envelope for long periods of time. The MFD-VL signature has stability and robustness against background noise and small fluctuations in the parameters of sound sources, which are produced in field recordings. We discuss the effectiveness of these signatures in the similarity sound search by comparing with acoustic features proposed in the DCASE 2018 challenges. Due to the unique descriptiveness of our proposed signatures, we confirmed the signatures are effective when they are used with other acoustic features.
    BiblioDAP: The 1st Workshop on Bibliographic Data Analysis and Processing. (arXiv:2106.12320v1 [cs.DL])
    (2 min) Automatic processing of bibliographic data becomes very important in digital libraries, data science and machine learning due to its importance in keeping pace with the significant increase of published papers every year from one side and to the inherent challenges from the other side. This processing has several aspects including but not limited to I) Automatic extraction of references from PDF documents, II) Building an accurate citation graph, III) Author name disambiguation, etc. Bibliographic data is heterogeneous by nature and occurs in both structured (e.g. citation graph) and unstructured (e.g. publications) formats. Therefore, it requires data science and machine learning techniques to be processed and analysed. Here we introduce BiblioDAP'21: The 1st Workshop on Bibliographic Data Analysis and Processing.
    Improving Transformer-based Sequential Recommenders through Preference Editing. (arXiv:2106.12120v1 [cs.IR])
    (2 min) One of the key challenges in Sequential Recommendation (SR) is how to extract and represent user preferences. Traditional SR methods rely on the next item as the supervision signal to guide preference extraction and representation. We propose a novel learning strategy, named preference editing. The idea is to force the SR model to discriminate the common and unique preferences in different sequences of interactions between users and the recommender system. By doing so, the SR model is able to learn how to identify common and unique user preferences, and thereby do better user preference extraction and representation. We propose a transformer based SR model, named MrTransformer (Multi-preference Transformer), that concatenates some special tokens in front of the sequence to represent multiple user preferences and makes sure they capture different aspects through a preference coverage mechanism. Then, we devise a preference editing-based self-supervised learning mechanism for training MrTransformer which contains two main operations: preference separation and preference recombination. The former separates the common and unique user preferences for a given pair of sequences. The latter swaps the common preferences to obtain recombined user preferences for each sequence. Based on the preference separation and preference recombination operations, we define two types of SSL loss that require that the recombined preferences are similar to the original ones, and the common preferences are close to each other. We carry out extensive experiments on two benchmark datasets. MrTransformer with preference editing significantly outperforms state-of-the-art SR methods in terms of Recall, MRR and NDCG. We find that long sequences whose user preferences are harder to extract and represent benefit most from preference editing.
    BanditMF: Multi-Armed Bandit Based Matrix Factorization Recommender System. (arXiv:2106.10898v2 [cs.IR] UPDATED)
    (2 min) Multi-armed bandits (MAB) provide a principled online learning approach to attain the balance between exploration and exploitation. Due to the superior performance and low feedback learning without the learning to act in multiple situations, Multi-armed Bandits drawing widespread attention in applications ranging such as recommender systems. Likewise, within the recommender system, collaborative filtering (CF) is arguably the earliest and most influential method in the recommender system. Crucially, new users and an ever-changing pool of recommended items are the challenges that recommender systems need to address. For collaborative filtering, the classical method is training the model offline, then perform the online testing, but this approach can no longer handle the dynamic changes in user preferences which is the so-called cold start. So how to effectively recommend items to users in the absence of effective information? To address the aforementioned problems, a multi-armed bandit based collaborative filtering recommender system has been proposed, named BanditMF. BanditMF is designed to address two challenges in the multi-armed bandits algorithm and collaborative filtering: (1) how to solve the cold start problem for collaborative filtering under the condition of scarcity of valid information, (2) how to solve the sub-optimal problem of bandit algorithms in strong social relations domains caused by independently estimating unknown parameters associated with each user and ignoring correlations between users.
  • cs.LG updates on arXiv.org

    STEP-EZ: Syntax Tree guided semantic ExPlanation for Explainable Zero-shot modeling of clinical depression symptoms from text. (arXiv:2106.10928v2 [cs.CL] UPDATED)
    (2 min) We focus on exploring various approaches of Zero-Shot Learning (ZSL) and their explainability for a challenging yet important supervised learning task notorious for training data scarcity, i.e. Depression Symptoms Detection (DSD) from text. We start with a comprehensive synthesis of different components of our ZSL modeling and analysis of our ground truth samples and Depression symptom clues curation process with the help of a practicing clinician. We next analyze the accuracy of various state-of-the-art ZSL models and their potential enhancements for our task. Further, we sketch a framework for the use of ZSL for hierarchical text-based explanation mechanism, which we call, Syntax Tree-Guided Semantic Explanation (STEP). Finally, we summarize experiments from which we conclude that we can use ZSL models and achieve reasonable accuracy and explainability, measured by a proposed Explainability Index (EI). This work is, to our knowledge, the first work to exhaustively explore the efficacy of ZSL models for DSD task, both in terms of accuracy and explainability.
    Fast and Feature-Complete Differentiable Physics for Articulated Rigid Bodies with Contact. (arXiv:2103.16021v3 [cs.RO] UPDATED)
    (2 min) We present a fast and feature-complete differentiable physics engine, Nimble (nimblephysics.org), that supports Lagrangian dynamics and hard contact constraints for articulated rigid body simulation. Our differentiable physics engine offers a complete set of features that are typically only available in non-differentiable physics simulators commonly used by robotics applications. We solve contact constraints precisely using linear complementarity problems (LCPs). We present efficient and novel analytical gradients through the LCP formulation of inelastic contact that exploit the sparsity of the LCP solution. We support complex contact geometry, and gradients approximating continuous-time elastic collision. We also introduce a novel method to compute complementarity-aware gradients that help downstream optimization tasks avoid stalling in saddle points. We show that an implementation of this combination in an existing physics engine (DART) is capable of a 87x single-core speedup over finite-differencing in computing analytical Jacobians for a single timestep, while preserving all the expressiveness of original DART.
    Differentially Private Query Release Through Adaptive Projection. (arXiv:2103.06641v2 [cs.LG] UPDATED)
    (2 min) We propose, implement, and evaluate a new algorithm for releasing answers to very large numbers of statistical queries like $k$-way marginals, subject to differential privacy. Our algorithm makes adaptive use of a continuous relaxation of the Projection Mechanism, which answers queries on the private dataset using simple perturbation, and then attempts to find the synthetic dataset that most closely matches the noisy answers. We use a continuous relaxation of the synthetic dataset domain which makes the projection loss differentiable, and allows us to use efficient ML optimization techniques and tooling. Rather than answering all queries up front, we make judicious use of our privacy budget by iteratively and adaptively finding queries for which our (relaxed) synthetic data has high error, and then repeating the projection. We perform extensive experimental evaluations across a range of parameters and datasets, and find that our method outperforms existing algorithms in many cases, especially when the privacy budget is small or the query class is large.
    TetraPackNet: Four-Corner-Based Object Detection in Logistics Use-Cases. (arXiv:2104.09123v2 [cs.CV] UPDATED)
    (2 min) While common image object detection tasks focus on bounding boxes or segmentation masks as object representations, we consider the problem of finding objects based on four arbitrary vertices. We propose a novel model, named TetraPackNet, to tackle this problem. TetraPackNet is based on CornerNet and uses similar algorithms and ideas. It is designated for applications requiring high-accuracy detection of regularly shaped objects, which is the case in the logistics use-case of packaging structure recognition. We evaluate our model on our specific real-world dataset for this use-case. Baselined against a previous solution, consisting of a Mask R-CNN model and suitable post-processing steps, TetraPackNet achieves superior results (9% higher in accuracy) in the sub-task of four-corner based transport unit side detection.
    A Bayesian Multiscale Deep Learning Framework for Flows in Random Media. (arXiv:2103.09056v2 [physics.comp-ph] UPDATED)
    (2 min) Fine-scale simulation of complex systems governed by multiscale partial differential equations (PDEs) is computationally expensive and various multiscale methods have been developed for addressing such problems. In addition, it is challenging to develop accurate surrogate and uncertainty quantification models for high-dimensional problems governed by stochastic multiscale PDEs using limited training data. In this work to address these challenges, we introduce a novel hybrid deep-learning and multiscale approach for stochastic multiscale PDEs with limited training data. For demonstration purposes, we focus on a porous media flow problem. We use an image-to-image supervised deep learning model to learn the mapping between the input permeability field and the multiscale basis functions. We introduce a Bayesian approach to this hybrid framework to allow us to perform uncertainty quantification and propagation tasks. The performance of this hybrid approach is evaluated with varying intrinsic dimensionality of the permeability field. Numerical results indicate that the hybrid network can efficiently predict well for high-dimensional inputs.
    Performance and Complexity Analysis of bi-directional Recurrent Neural Network Models vs. Volterra Nonlinear Equalizers in Digital Coherent Systems. (arXiv:2103.03832v2 [eess.SP] UPDATED)
    (2 min) We investigate the complexity and performance of recurrent neural network (RNN) models as post-processing units for the compensation of fibre nonlinearities in digital coherent systems carrying polarization multiplexed 16-QAM and 32-QAM signals. We evaluate three bi-directional RNN models, namely the bi-LSTM, bi-GRU and bi-Vanilla-RNN and show that all of them are promising nonlinearity compensators especially in dispersion unmanaged systems. Our simulations show that during inference the three models provide similar compensation performance, therefore in real-life systems the simplest scheme based on Vanilla-RNN units should be preferred. We compare bi-Vanilla-RNN with Volterra nonlinear equalizers and exhibit its superiority both in terms of performance and complexity, thus highlighting that RNN processing is a very promising pathway for the upgrade of long-haul optical communication systems utilizing coherent detection.
    Explaining Black-Box Algorithms Using Probabilistic Contrastive Counterfactuals. (arXiv:2103.11972v2 [cs.AI] UPDATED)
    (2 min) There has been a recent resurgence of interest in explainable artificial intelligence (XAI) that aims to reduce the opaqueness of AI-based decision-making systems, allowing humans to scrutinize and trust them. Prior work in this context has focused on the attribution of responsibility for an algorithm's decisions to its inputs wherein responsibility is typically approached as a purely associational concept. In this paper, we propose a principled causality-based approach for explaining black-box decision-making systems that addresses limitations of existing methods in XAI. At the core of our framework lies probabilistic contrastive counterfactuals, a concept that can be traced back to philosophical, cognitive, and social foundations of theories on how humans generate and select explanations. We show how such counterfactuals can quantify the direct and indirect influences of a variable on decisions made by an algorithm, and provide actionable recourse for individuals negatively affected by the algorithm's decision. Unlike prior work, our system, LEWIS: (1)can compute provably effective explanations and recourse at local, global and contextual levels (2)is designed to work with users with varying levels of background knowledge of the underlying causal model and (3)makes no assumptions about the internals of an algorithmic system except for the availability of its input-output data. We empirically evaluate LEWIS on three real-world datasets and show that it generates human-understandable explanations that improve upon state-of-the-art approaches in XAI, including the popular LIME and SHAP. Experiments on synthetic data further demonstrate the correctness of LEWIS's explanations and the scalability of its recourse algorithm.
    Residual Energy-Based Models for End-to-End Speech Recognition. (arXiv:2103.14152v2 [eess.AS] UPDATED)
    (2 min) End-to-end models with auto-regressive decoders have shown impressive results for automatic speech recognition (ASR). These models formulate the sequence-level probability as a product of the conditional probabilities of all individual tokens given their histories. However, the performance of locally normalised models can be sub-optimal because of factors such as exposure bias. Consequently, the model distribution differs from the underlying data distribution. In this paper, the residual energy-based model (R-EBM) is proposed to complement the auto-regressive ASR model to close the gap between the two distributions. Meanwhile, R-EBMs can also be regarded as utterance-level confidence estimators, which may benefit many downstream tasks. Experiments on a 100hr LibriSpeech dataset show that R-EBMs can reduce the word error rates (WERs) by 8.2%/6.7% while improving areas under precision-recall curves of confidence scores by 12.6%/28.4% on test-clean/test-other sets. Furthermore, on a state-of-the-art model using self-supervised learning (wav2vec 2.0), R-EBMs still significantly improves both the WER and confidence estimation performance.
    Deep ReLU Networks Preserve Expected Length. (arXiv:2102.10492v2 [stat.ML] UPDATED)
    (2 min) Assessing the complexity of functions computed by a neural network helps us understand how the network will learn and generalize. One natural measure of complexity is how the network distorts length - if the network takes a unit-length curve as input, what is the length of the resulting curve of outputs? It has been widely believed that this length grows exponentially in network depth. We prove that in fact this is not the case: the expected length distortion does not grow with depth, and indeed shrinks slightly, for ReLU networks with standard random initialization. We also generalize this result by proving upper bounds both for higher moments of the length distortion and for the distortion of higher-dimensional volumes. These theoretical results are corroborated by our experiments.
    Regret-optimal measurement-feedback control. (arXiv:2011.12785v2 [eess.SY] UPDATED)
    (2 min) We consider measurement-feedback control in linear dynamical systems from the perspective of regret minimization. Unlike most prior work in this area, we focus on the problem of designing an online controller which competes with the optimal dynamic sequence of control actions selected in hindsight, instead of the best controller in some specific class of controllers. This formulation of regret is attractive when the environment changes over time and no single controller achieves good performance over the entire time horizon. We show that in the measurement-feedback setting, unlike in the full-information setting, there is no single offline controller which outperforms every other offline controller on every disturbance, and propose a new $H_2$-optimal offline controller as a benchmark for the online controller to compete against. We show that the corresponding regret-optimal online controller can be found via a novel reduction to the classical Nehari problem from robust control and present a tight data-dependent bound on its regret.
    BanditMF: Multi-Armed Bandit Based Matrix Factorization Recommender System. (arXiv:2106.10898v2 [cs.IR] UPDATED)
    (2 min) Multi-armed bandits (MAB) provide a principled online learning approach to attain the balance between exploration and exploitation. Due to the superior performance and low feedback learning without the learning to act in multiple situations, Multi-armed Bandits drawing widespread attention in applications ranging such as recommender systems. Likewise, within the recommender system, collaborative filtering (CF) is arguably the earliest and most influential method in the recommender system. Crucially, new users and an ever-changing pool of recommended items are the challenges that recommender systems need to address. For collaborative filtering, the classical method is training the model offline, then perform the online testing, but this approach can no longer handle the dynamic changes in user preferences which is the so-called cold start. So how to effectively recommend items to users in the absence of effective information? To address the aforementioned problems, a multi-armed bandit based collaborative filtering recommender system has been proposed, named BanditMF. BanditMF is designed to address two challenges in the multi-armed bandits algorithm and collaborative filtering: (1) how to solve the cold start problem for collaborative filtering under the condition of scarcity of valid information, (2) how to solve the sub-optimal problem of bandit algorithms in strong social relations domains caused by independently estimating unknown parameters associated with each user and ignoring correlations between users.
    ShapeMOD: Macro Operation Discovery for 3D Shape Programs. (arXiv:2104.06392v2 [cs.GR] UPDATED)
    (2 min) A popular way to create detailed yet easily controllable 3D shapes is via procedural modeling, i.e. generating geometry using programs. Such programs consist of a series of instructions along with their associated parameter values. To fully realize the benefits of this representation, a shape program should be compact and only expose degrees of freedom that allow for meaningful manipulation of output geometry. One way to achieve this goal is to design higher-level macro operators that, when executed, expand into a series of commands from the base shape modeling language. However, manually authoring such macros, much like shape programs themselves, is difficult and largely restricted to domain experts. In this paper, we present ShapeMOD, an algorithm for automatically discovering macros that are useful across large datasets of 3D shape programs. ShapeMOD operates on shape programs expressed in an imperative, statement-based language. It is designed to discover macros that make programs more compact by minimizing the number of function calls and free parameters required to represent an input shape collection. We run ShapeMOD on multiple collections of programs expressed in a domain-specific language for 3D shape structures. We show that it automatically discovers a concise set of macros that abstract out common structural and parametric patterns that generalize over large shape collections. We also demonstrate that the macros found by ShapeMOD improve performance on downstream tasks including shape generative modeling and inferring programs from point clouds. Finally, we conduct a user study that indicates that ShapeMOD's discovered macros make interactive shape editing more efficient.
    Stronger NAS with Weaker Predictors. (arXiv:2102.10490v2 [cs.LG] UPDATED)
    (2 min) Neural Architecture Search (NAS) often trains and evaluates a large number of architectures. Recent predictor-based NAS approaches attempt to address such heavy computation costs with two key steps: sampling some architecture-performance pairs and fitting a proxy accuracy predictor. Given limited samples, these predictors, however, are far from accurate to locate top architectures due to the difficulty of fitting the huge search space. This paper reflects on a simple yet crucial question: if our final goal is to find the best architecture, do we really need to model the whole space well?. We propose a paradigm shift from fitting the whole architecture space using one strong predictor, to progressively fitting a search path towards the high-performance sub-space through a set of weaker predictors. As a key property of the proposed weak predictors, their probabilities of sampling better architectures keep increasing. Hence we only sample a few well-performed architectures guided by the previously learned predictor and estimate a new better weak predictor. This embarrassingly easy framework produces coarse-to-fine iteration to refine the ranking of sampling space gradually. Extensive experiments demonstrate that our method costs fewer samples to find top-performance architectures on NAS-Bench-101 and NAS-Bench-201, as well as achieves the state-of-the-art ImageNet performance on the NASNet search space. In particular, compared to state-of-the-art (SOTA) predictor-based NAS methods, WeakNAS outperforms all of them with notable margins, e.g., requiring at least 7.5x less samples to find global optimal on NAS-Bench-101; and WeakNAS can also absorb them for further performance boost. We further strike the new SOTA result of 81.3% in the ImageNet MobileNet Search Space. The code is available at https://github.com/VITA-Group/WeakNAS.
    Assessment of the influence of features on a classification problem: an application to COVID-19 patients. (arXiv:2104.14958v2 [stat.ML] UPDATED)
    (2 min) This paper deals with an important subject in classification problems addressed by machine learning techniques: the evaluation of the influence of each of the features on the classification of individuals. Specifically, a measure of that influence is introduced using the Shapley value of cooperative games. In addition, an axiomatic characterisation of the proposed measure is provided based on properties of efficiency and balanced contributions. Furthermore, some experiments have been designed in order to validate the appropriate performance of such measure. Finally, the methodology introduced is applied to a sample of COVID-19 patients to study the influence of certain demographic or risk factors on various events of interest related to the evolution of the disease.
    Meta-Cal: Well-controlled Post-hoc Calibration by Ranking. (arXiv:2105.04290v2 [stat.ML] UPDATED)
    (2 min) In many applications, it is desirable that a classifier not only makes accurate predictions, but also outputs calibrated posterior probabilities. However, many existing classifiers, especially deep neural network classifiers, tend to be uncalibrated. Post-hoc calibration is a technique to recalibrate a model by learning a calibration map. Existing approaches mostly focus on constructing calibration maps with low calibration errors, however, this quality is inadequate for a calibrator being useful. In this paper, we introduce two constraints that are worth consideration in designing a calibration map for post-hoc calibration. Then we present Meta-Cal, which is built from a base calibrator and a ranking model. Under some mild assumptions, two high-probability bounds are given with respect to these constraints. Empirical results on CIFAR-10, CIFAR-100 and ImageNet and a range of popular network architectures show our proposed method significantly outperforms the current state of the art for post-hoc multi-class classification calibration.
    Online Learning with Radial Basis Function Networks. (arXiv:2103.08414v2 [cs.CE] UPDATED)
    (2 min) We investigate the benefits of feature selection, nonlinear modelling and online learning when forecasting in financial time series. We consider the sequential and continual learning sub-genres of online learning. The experiments we conduct show that there is a benefit to online transfer learning, in the form of radial basis function networks, beyond the sequential updating of recursive least-squares models. We show that the radial basis function networks, which make use of clustering algorithms to construct a kernel Gram matrix, are more beneficial than treating each training vector as separate basis functions, as occurs with kernel Ridge regression. We demonstrate quantitative procedures to determine the very structure of the radial basis function networks. Finally, we conduct experiments on the log returns of financial time series and show that the online learning models, particularly the radial basis function networks, are able to outperform a random walk baseline, whereas the offline learning models struggle to do so.
    Generative Adversarial Neural Architecture Search. (arXiv:2105.09356v3 [cs.LG] UPDATED)
    (2 min) Despite the empirical success of neural architecture search (NAS) in deep learning applications, the optimality, reproducibility and cost of NAS schemes remain hard to assess. In this paper, we propose Generative Adversarial NAS (GA-NAS) with theoretically provable convergence guarantees, promoting stability and reproducibility in neural architecture search. Inspired by importance sampling, GA-NAS iteratively fits a generator to previously discovered top architectures, thus increasingly focusing on important parts of a large search space. Furthermore, we propose an efficient adversarial learning approach, where the generator is trained by reinforcement learning based on rewards provided by a discriminator, thus being able to explore the search space without evaluating a large number of architectures. Extensive experiments show that GA-NAS beats the best published results under several cases on three public NAS benchmarks. In the meantime, GA-NAS can handle ad-hoc search constraints and search spaces. We show that GA-NAS can be used to improve already optimized baselines found by other NAS methods, including EfficientNet and ProxylessNAS, in terms of ImageNet accuracy or the number of parameters, in their original search space.
    Unsupervised Information Obfuscation for Split Inference of Neural Networks. (arXiv:2104.11413v2 [cs.LG] UPDATED)
    (2 min) Splitting network computations between the edge device and a server enables low edge-compute inference of neural networks but might expose sensitive information about the test query to the server. To address this problem, existing techniques train the model to minimize information leakage for a given set of sensitive attributes. In practice, however, the test queries might contain attributes that are not foreseen during training. We propose instead an unsupervised obfuscation method to discard the information irrelevant to the main task. We formulate the problem via an information theoretical framework and derive an analytical solution for a given distortion to the model output. In our method, the edge device runs the model up to a split layer determined based on its computational capacity. It then obfuscates the obtained feature vector based on the first layer of the server model by removing the components in the null space as well as the low-energy components of the remaining signal. Our experimental results show that our method outperforms existing techniques in removing the information of the irrelevant attributes and maintaining the accuracy on the target label. We also show that our method reduces the communication cost and incurs only a small computational overhead.
    Machine Learning in weakly nonlinear systems: A Case study on Significant wave heights. (arXiv:2105.08583v2 [physics.ao-ph] UPDATED)
    (2 min) This paper proposes a machine learning method based on the Extra Trees (ET) algorithm for forecasting Significant Wave Heights in oceanic waters. To derive multiple features from the CDIP buoys, which make point measurements, we first nowcast various parameters and then forecast them at 30-min intervals. The proposed algorithm has Scatter Index (SI), Bias, Correlation Coefficient, Root Mean Squared Error (RMSE) of 0.130, -0.002, 0.97, and 0.14, respectively, for one day ahead prediction and 0.110, -0.001, 0.98, and 0.122, respectively, for 14-day ahead prediction on the testing dataset. While other state-of-the-art methods can only forecast up to 120 hours ahead, we extend it further to 14 days. Our proposed setup includes spectral features, hv-block cross-validation, and stringent QC criteria. The proposed algorithm performs significantly better than the state-of-the-art methods commonly used for significant wave height forecasting for one-day ahead prediction. Moreover, the improved performance of the proposed machine learning method compared to the numerical methods shows that this performance can be extended to even longer periods allowing for early prediction of significant wave heights in oceanic waters.
    S$^2$-MLP: Spatial-Shift MLP Architecture for Vision. (arXiv:2106.07477v2 [cs.CV] UPDATED)
    (2 min) Recently, visual Transformer (ViT) and its following works abandon the convolution and exploit the self-attention operation, attaining a comparable or even higher accuracy than CNNs. More recently, MLP-Mixer abandons both the convolution and the self-attention operation, proposing an architecture containing only MLP layers. To achieve cross-patch communications, it devises an additional token-mixing MLP besides the channel-mixing MLP. It achieves promising results when training on an extremely large-scale dataset. But it cannot achieve as outstanding performance as its CNN and ViT counterparts when training on medium-scale datasets such as ImageNet1K and ImageNet21K. The performance drop of MLP-Mixer motivates us to rethink the token-mixing MLP. We discover that the token-mixing MLP is a variant of the depthwise convolution with a global reception field and spatial-specific configuration. But the global reception field and the spatial-specific property make token-mixing MLP prone to over-fitting. In this paper, we propose a novel pure MLP architecture, spatial-shift MLP (S$^2$-MLP). Different from MLP-Mixer, our S$^2$-MLP only contains channel-mixing MLP. We utilize a spatial-shift operation for communications between patches. It has a local reception field and is spatial-agnostic. It is parameter-free and efficient for computation. The proposed S$^2$-MLP attains higher recognition accuracy than MLP-Mixer when training on ImageNet-1K dataset. Meanwhile, S$^2$-MLP accomplishes as excellent performance as ViT on ImageNet-1K dataset with considerably simpler architecture and fewer FLOPs and parameters.
    Oneshot Differentially Private Top-k Selection. (arXiv:2105.08233v2 [cs.LG] UPDATED)
    (2 min) Being able to efficiently and accurately select the top-$k$ elements with differential privacy is an integral component of various private data analysis tasks. In this paper, we present the oneshot Laplace mechanism, which generalizes the well-known Report Noisy Max mechanism to reporting noisy top-$k$ elements. We show that the oneshot Laplace mechanism with a noise level of $\widetilde{O}(\sqrt{k}/\eps)$ is approximately differentially private. Compared to the previous peeling approach of running Report Noisy Max $k$ times, the oneshot Laplace mechanism only adds noises and computes the top $k$ elements once, hence much more efficient for large $k$. In addition, our proof of privacy relies on a novel coupling technique that bypasses the use of composition theorems. Finally, we present a novel application of efficient top-$k$ selection in the classical problem of ranking from pairwise comparisons.
    VariTex: Variational Neural Face Textures. (arXiv:2104.05988v2 [cs.CV] UPDATED)
    (2 min) Deep generative models have recently demonstrated the ability to synthesize photorealistic images of human faces with novel identities. A key challenge to the wide applicability of such techniques is to provide independent control over semantically meaningful parameters: appearance, head pose, face shape, and facial expressions. In this paper, we propose VariTex - to the best of our knowledge the first method that learns a variational latent feature space of neural face textures, which allows sampling of novel identities. We combine this generative model with a parametric face model and gain explicit control over head pose and facial expressions. To generate images of complete human heads, we propose an additive decoder that generates plausible additional details such as hair. A novel training scheme enforces a pose independent latent space and in consequence, allows learning of a one-to-many mapping between latent codes and pose-conditioned exterior regions. The resulting method can generate geometrically consistent images of novel identities allowing fine-grained control over head pose, face shape, and facial expressions, facilitating a broad range of downstream tasks, like sampling novel identities, re-posing, expression transfer, and more.
    Posterior Meta-Replay for Continual Learning. (arXiv:2103.01133v2 [cs.LG] UPDATED)
    (2 min) Learning a sequence of tasks without access to i.i.d. observations is a widely studied form of continual learning (CL) that remains challenging. In principle, Bayesian learning directly applies to this setting, since recursive and one-off Bayesian updates yield the same result. In practice, however, recursive updating often leads to poor trade-off solutions across tasks because approximate inference is necessary for most models of interest. Here, we describe an alternative Bayesian approach where task-conditioned parameter distributions are continually inferred from data. We offer a practical deep learning implementation of our framework based on probabilistic task-conditioned hypernetworks, an approach we term "posterior meta-replay". Experiments on standard benchmarks show that our probabilistic hypernetworks compress sequences of posterior parameter distributions with virtually no forgetting. We obtain considerable performance gains compared to existing Bayesian CL methods, and identify task inference as our major limiting factor. This limitation has several causes that are independent of the considered sequential setting, opening up new avenues for progress in CL.
    A Deep Learning Approach to Anomaly Sequence Detection for High-Resolution Monitoring of Power Systems. (arXiv:2012.05163v2 [eess.SY] UPDATED)
    (2 min) A deep learning approach is proposed to detect data and system anomalies using high-resolution continuous point-on-wave (CPOW) or phasor measurements. Both the anomaly and anomaly-free measurement models are assumed to have unknown temporal dependencies and probability distributions. Historical training samples are assumed for the anomaly-free model, while no training samples are available for the anomaly measurements. By transforming the anomaly-free observations into uniform independent and identically distributed sequences via a generative adversarial network, the proposed approach deploys a uniformity test for anomaly detection at the sensor level. A distributed detection scheme that combines sensor level detections at the control center is also proposed that combines local detections to form more reliable detections. Numerical results demonstrate significant improvement over the state-of-the-art solutions for various bad-data cases using real and synthetic CPOW and PMU data sets.
    OpenML-Python: an extensible Python API for OpenML. (arXiv:1911.02490v2 [cs.LG] UPDATED)
    (2 min) OpenML is an online platform for open science collaboration in machine learning, used to share datasets and results of machine learning experiments. In this paper we introduce OpenML-Python, a client API for Python, opening up the OpenML platform for a wide range of Python-based tools. It provides easy access to all datasets, tasks and experiments on OpenML from within Python. It also provides functionality to conduct machine learning experiments, upload the results to OpenML, and reproduce results which are stored on OpenML. Furthermore, it comes with a scikit-learn plugin and a plugin mechanism to easily integrate other machine learning libraries written in Python into the OpenML ecosystem. Source code and documentation is available at https://github.com/openml/openml-python/.
    FLOP: Federated Learning on Medical Datasets using Partial Networks. (arXiv:2102.05218v2 [cs.LG] UPDATED)
    (2 min) The outbreak of COVID-19 Disease due to the novel coronavirus has caused a shortage of medical resources. To aid and accelerate the diagnosis process, automatic diagnosis of COVID-19 via deep learning models has recently been explored by researchers across the world. While different data-driven deep learning models have been developed to mitigate the diagnosis of COVID-19, the data itself is still scarce due to patient privacy concerns. Federated Learning (FL) is a natural solution because it allows different organizations to cooperatively learn an effective deep learning model without sharing raw data. However, recent studies show that FL still lacks privacy protection and may cause data leakage. We investigate this challenging problem by proposing a simple yet effective algorithm, named \textbf{F}ederated \textbf{L}earning \textbf{o}n Medical Datasets using \textbf{P}artial Networks (FLOP), that shares only a partial model between the server and clients. Extensive experiments on benchmark data and real-world healthcare tasks show that our approach achieves comparable or better performance while reducing the privacy and security risks. Of particular interest, we conduct experiments on the COVID-19 dataset and find that our FLOP algorithm can allow different hospitals to collaboratively and effectively train a partially shared model without sharing local patients' data.
    GANMEX: One-vs-One Attributions Guided by GAN-based Counterfactual Explanation Baselines. (arXiv:2011.06015v4 [cs.LG] UPDATED)
    (2 min) Attribution methods have been shown as promising approaches for identifying key features that led to learned model predictions. While most existing attribution methods rely on a baseline input for performing feature perturbations, limited research has been conducted to address the baseline selection issues. Poor choices of baselines limit the ability of one-vs-one (1-vs-1) explanations for multi-class classifiers, which means the attribution methods were not able to explain why an input belongs to its original class but not the other specified target class. 1-vs-1 explanation is crucial when certain classes are more similar than others, e.g. two bird types among multiple animals, by focusing on key differentiating features rather than shared features across classes. In this paper, we present GAN-based Model EXplainability (GANMEX), a novel approach applying Generative Adversarial Networks (GAN) by incorporating the to-be-explained classifier as part of the adversarial networks. Our approach effectively selects the counterfactual baseline as the closest realistic sample belong to the target class, which allows attribution methods to provide true 1-vs-1 explanations. We showed that GANMEX baselines improved the saliency maps and led to stronger performance on perturbation-based evaluation metrics over the existing baselines. Existing attribution results are known for being insensitive to model randomization, and we demonstrated that GANMEX baselines led to better outcome under the cascading randomization of the model.
    The Symmetry between Arms and Knapsacks: A Primal-Dual Approach for Bandits with Knapsacks. (arXiv:2102.06385v3 [cs.LG] UPDATED)
    (2 min) In this paper, we study the bandits with knapsacks (BwK) problem and develop a primal-dual based algorithm that achieves a problem-dependent logarithmic regret bound. The BwK problem extends the multi-arm bandit (MAB) problem to model the resource consumption associated with playing each arm, and the existing BwK literature has been mainly focused on deriving asymptotically optimal distribution-free regret bounds. We first study the primal and dual linear programs underlying the BwK problem. From this primal-dual perspective, we discover symmetry between arms and knapsacks, and then propose a new notion of sub-optimality measure for the BwK problem. The sub-optimality measure highlights the important role of knapsacks in determining algorithm regret and inspires the design of our two-phase algorithm. In the first phase, the algorithm identifies the optimal arms and the binding knapsacks, and in the second phase, it exhausts the binding knapsacks via playing the optimal arms through an adaptive procedure. Our regret upper bound involves the proposed sub-optimality measure and it has a logarithmic dependence on length of horizon $T$ and a polynomial dependence on $m$ (the numbers of arms) and $d$ (the number of knapsacks). To the best of our knowledge, this is the first problem-dependent logarithmic regret bound for solving the general BwK problem.
    Fine-Grained Data Selection for Improved Energy Efficiency of Federated Edge Learning. (arXiv:2106.12561v1 [cs.LG])
    (2 min) In Federated edge learning (FEEL), energy-constrained devices at the network edge consume significant energy when training and uploading their local machine learning models, leading to a decrease in their lifetime. This work proposes novel solutions for energy-efficient FEEL by jointly considering local training data, available computation, and communications resources, and deadline constraints of FEEL rounds to reduce energy consumption. This paper considers a system model where the edge server is equipped with multiple antennas employing beamforming techniques to communicate with the local users through orthogonal channels. Specifically, we consider a problem that aims to find the optimal user's resources, including the fine-grained selection of relevant training samples, bandwidth, transmission power, beamforming weights, and processing speed with the goal of minimizing the total energy consumption given a deadline constraint on the communication rounds of FEEL. Then, we devise tractable solutions by first proposing a novel fine-grained training algorithm that excludes less relevant training samples and effectively chooses only the samples that improve the model's performance. After that, we derive closed-form solutions, followed by a Golden-Section-based iterative algorithm to find the optimal computation and communication resources that minimize energy consumption. Experiments using MNIST and CIFAR-10 datasets demonstrate that our proposed algorithms considerably outperform the state-of-the-art solutions as energy consumption decreases by 79% for MNIST and 73% for CIFAR-10 datasets.
    A novel multi-classifier information fusion based on Dempster-Shafer theory: application to vibration-based fault detection. (arXiv:2012.02481v2 [cs.LG] UPDATED)
    (2 min) Achieving a high prediction rate is a crucial task in fault detection. Although various classification procedures are available, none of them can give high accuracy in all applications. Therefore, in this paper, a novel multi-classifier fusion approach is developed to boost the performance of the individual classifiers. This is acquired by using Dempster-Shafer theory (DST). However, in cases with conflicting evidences, the DST may give counter-intuitive results. In this regard, a preprocessing technique based on a new metric is devised in order to measure and mitigate the conflict between the evidences. To evaluate and validate the effectiveness of the proposed approach, the method is applied to 15 benchmarks datasets from UCI and KEEL. Further, it is applied for classifying polycrystalline Nickel alloy first-stage turbine blades based on their broadband vibrational response. Through statistical analysis with different noise levels, and by comparing with four state-of-the-art fusion techniques, it is shown that that the proposed method improves the classification accuracy and outperforms the individual classifiers.
    CIFS: Improving Adversarial Robustness of CNNs via Channel-wise Importance-based Feature Selection. (arXiv:2102.05311v2 [cs.LG] UPDATED)
    (2 min) We investigate the adversarial robustness of CNNs from the perspective of channel-wise activations. By comparing \textit{non-robust} (normally trained) and \textit{robustified} (adversarially trained) models, we observe that adversarial training (AT) robustifies CNNs by aligning the channel-wise activations of adversarial data with those of their natural counterparts. However, the channels that are \textit{negatively-relevant} (NR) to predictions are still over-activated when processing adversarial data. Besides, we also observe that AT does not result in similar robustness for all classes. For the robust classes, channels with larger activation magnitudes are usually more \textit{positively-relevant} (PR) to predictions, but this alignment does not hold for the non-robust classes. Given these observations, we hypothesize that suppressing NR channels and aligning PR ones with their relevances further enhances the robustness of CNNs under AT. To examine this hypothesis, we introduce a novel mechanism, i.e., \underline{C}hannel-wise \underline{I}mportance-based \underline{F}eature \underline{S}election (CIFS). The CIFS manipulates channels' activations of certain layers by generating non-negative multipliers to these channels based on their relevances to predictions. Extensive experiments on benchmark datasets including CIFAR10 and SVHN clearly verify the hypothesis and CIFS's effectiveness of robustifying CNNs.
    SoftNER: Mining Knowledge Graphs From Cloud Incidents. (arXiv:2101.05961v2 [cs.SE] UPDATED)
    (2 min) The move from boxed products to services and the widespread adoption of cloud computing has had a huge impact on the software development life cycle and DevOps processes. Particularly, incident management has become critical for developing and operating large-scale services. Prior work on incident management has heavily focused on the challenges with incident triaging and de-duplication. In this work, we address the fundamental problem of structured knowledge extraction from service incidents. We have built SoftNER, a framework for mining Knowledge Graphs from incident reports. First, we build a novel multi-task learning based BiLSTM-CRF model which leverages not just the semantic context but also the data-types for extracting factual information in the form of named entities. Next, we present an approach to mine relations between the named entities for automatically constructing knowledge graphs. We have deployed SoftNER at Microsoft, a major cloud service provider and have evaluated it on more than 2 months of cloud incidents. We show that the unsupervised machine learning pipeline has a high precision of 0.96. Our multi-task learning based deep learning model also outperforms the state-of-the-art NER models. Lastly, using the knowledge extracted by SoftNER, we are able to build accurate models for applications such as incident triaging and recommending entities based on their relevance to incident titles.
    PHEW: Constructing Sparse Networks that Learn Fast and Generalize Well without Training Data. (arXiv:2010.11354v2 [cs.LG] UPDATED)
    (2 min) Methods that sparsify a network at initialization are important in practice because they greatly improve the efficiency of both learning and inference. Our work is based on a recently proposed decomposition of the Neural Tangent Kernel (NTK) that has decoupled the dynamics of the training process into a data-dependent component and an architecture-dependent kernel - the latter referred to as Path Kernel. That work has shown how to design sparse neural networks for faster convergence, without any training data, using the Synflow-L2 algorithm. We first show that even though Synflow-L2 is optimal in terms of convergence, for a given network density, it results in sub-networks with "bottleneck" (narrow) layers - leading to poor performance as compared to other data-agnostic methods that use the same number of parameters. Then we propose a new method to construct sparse networks, without any training data, referred to as Paths with Higher-Edge Weights (PHEW). PHEW is a probabilistic network formation method based on biased random walks that only depends on the initial weights. It has similar path kernel properties as Synflow-L2 but it generates much wider layers, resulting in better generalization and performance. PHEW achieves significant improvements over the data-independent SynFlow and SynFlow-L2 methods at a wide range of network densities.
    HDR Environment Map Estimation for Real-Time Augmented Reality. (arXiv:2011.10687v4 [cs.CV] UPDATED)
    (2 min) We present a method to estimate an HDR environment map from a narrow field-of-view LDR camera image in real-time. This enables perceptually appealing reflections and shading on virtual objects of any material finish, from mirror to diffuse, rendered into a real physical environment using augmented reality. Our method is based on our efficient convolutional neural network architecture, EnvMapNet, trained end-to-end with two novel losses, ProjectionLoss for the generated image, and ClusterLoss for adversarial training. Through qualitative and quantitative comparison to state-of-the-art methods, we demonstrate that our algorithm reduces the directional error of estimated light sources by more than 50%, and achieves 3.7 times lower Frechet Inception Distance (FID). We further showcase a mobile application that is able to run our neural network model in under 9 ms on an iPhone XS, and render in real-time, visually coherent virtual objects in previously unseen real-world environments.
    Probing Model Signal-Awareness via Prediction-Preserving Input Minimization. (arXiv:2011.14934v2 [cs.SE] UPDATED)
    (2 min) This work explores the signal awareness of AI models for source code understanding. Using a software vulnerability detection use case, we evaluate the models' ability to capture the correct vulnerability signals to produce their predictions. Our prediction-preserving input minimization (P2IM) approach systematically reduces the original source code to a minimal snippet which a model needs to maintain its prediction. The model's reliance on incorrect signals is then uncovered when the vulnerability in the original code is missing in the minimal snippet, both of which the model however predicts as being vulnerable. We measure the signal awareness of models using a new metric we propose- Signal-aware Recall (SAR). We apply P2IM on three different neural network architectures across multiple datasets. The results show a sharp drop in the model's Recall from the high 90s to sub-60s with the new metric, highlighting that the models are presumably picking up a lot of noise or dataset nuances while learning their vulnerability detection logic. Although the drop in model performance may be perceived as an adversarial attack, but this isn't P2IM's objective. The idea is rather to uncover the signal-awareness of a black-box model in a data-driven manner via controlled queries. SAR's purpose is to measure the impact of task-agnostic model training, and not to suggest a shortcoming in the Recall metric. The expectation, in fact, is for SAR to match Recall in the ideal scenario where the model truly captures task-specific signals.
    Weisfeiler and Lehman Go Cellular: CW Networks. (arXiv:2106.12575v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) are limited in their expressive power, struggle with long-range interactions and lack a principled way to model higher-order structures. These problems can be attributed to the strong coupling between the computational graph and the input graph structure. The recently proposed Message Passing Simplicial Networks naturally decouple these elements by performing message passing on the clique complex of the graph. Nevertheless, these models are severely constrained by the rigid combinatorial structure of Simplicial Complexes (SCs). In this work, we extend recent theoretical results on SCs to regular Cell Complexes, topological objects that flexibly subsume SCs and graphs. We show that this generalisation provides a powerful set of graph ``lifting'' transformations, each leading to a unique hierarchical message passing procedure. The resulting methods, which we collectively call CW Networks (CWNs), are strictly more powerful than the WL test and, in certain cases, not less powerful than the 3-WL test. In particular, we demonstrate the effectiveness of one such scheme, based on rings, when applied to molecular graph problems. The proposed architecture benefits from provably larger expressivity than commonly used GNNs, principled modelling of higher-order signals and from compressing the distances between nodes. We demonstrate that our model achieves state-of-the-art results on a variety of molecular datasets.
    Emergent Properties of Foveated Perceptual Systems. (arXiv:2006.07991v3 [cs.CV] UPDATED)
    (3 min) The goal of this work is to characterize the representational impact that foveation operations have for machine vision systems, inspired by the foveated human visual system, which has higher acuity at the center of gaze and texture-like encoding in the periphery. To do so, we introduce models consisting of a first-stage \textit{fixed} image transform followed by a second-stage \textit{learnable} convolutional neural network, and we varied the first stage component. The primary model has a foveated-textural input stage, which we compare to a model with foveated-blurred input and a model with spatially-uniform blurred input (both matched for perceptual compression), and a final reference model with minimal input-based compression. We find that: 1) the foveated-texture model shows similar scene classification accuracy as the reference model despite its compressed input, with greater i.i.d. generalization than the other models; 2) the foveated-texture model has greater sensitivity to high-spatial frequency information and greater robustness to occlusion, w.r.t the comparison models; 3) both the foveated systems, show a stronger center image-bias relative to the spatially-uniform systems even with a weight sharing constraint. Critically, these results are preserved over different classical CNN architectures throughout their learning dynamics. Altogether, this suggests that foveation with peripheral texture-based computations yields an efficient, distinct, and robust representational format of scene information, and provides symbiotic computational insight into the representational consequences that texture-based peripheral encoding may have for processing in the human visual system, while also potentially inspiring the next generation of computer vision models via spatially-adaptive computation. Code + Data available here: https://github.com/ArturoDeza/EmergentProperties
    HILONet: Hierarchical Imitation Learning from Non-Aligned Observations. (arXiv:2011.02671v2 [cs.LG] UPDATED)
    (2 min) It is challenging learning from demonstrated observation-only trajectories in a non-time-aligned environment because most imitation learning methods aim to imitate experts by following the demonstration step-by-step. However, aligned demonstrations are seldom obtainable in real-world scenarios. In this work, we propose a new imitation learning approach called Hierarchical Imitation Learning from Observation(HILONet), which adopts a hierarchical structure to choose feasible sub-goals from demonstrated observations dynamically. Our method can solve all kinds of tasks by achieving these sub-goals, whether it has a single goal position or not. We also present three different ways to increase sample efficiency in the hierarchical structure. We conduct extensive experiments using several environments. The results show the improvement in both performance and learning efficiency.
    Sequential Model Adaptation Using Domain Agnostic Internal Distributions. (arXiv:2007.00197v4 [cs.LG] UPDATED)
    (2 min) We develop an algorithm for sequential adaptation of a classifier that is trained for a source domain to generalize in an unannotated target domain. We consider that the model has been trained on the source domain annotated data and then it needs to be adapted using the target domain unannotated data when the source domain data is not accessible. We align the distributions of the source and the target domains in a discriminative embedding space via an intermediate internal distribution. This distribution is estimated using the source data representations in the embedding. We conduct experiments on four benchmarks to demonstrate the method is effective and compares favorably against existing methods.
    Policy choice in experiments with unknown interference. (arXiv:2011.08174v4 [econ.EM] UPDATED)
    (2 min) This paper discusses experimental design to estimate welfare-maximizing policies. We consider a setting where units are organized into large, finitely many independent clusters and interact over unobserved dimensions within each cluster. The contribution of this paper is two-fold. First, we construct a test for whether a welfare-improving treatment configuration exists and hence worth learning by conducting a larger scale experiment. Second, we introduce an adaptive randomization procedure to estimate welfare-maximizing individual treatment allocation rules valid under unobserved interference. We derive asymptotic properties of the marginal effects estimators and finite-sample regret guarantees of the policy. Finally, we illustrate the method's advantage in simulations calibrated to an existing experiment on information diffusion.
    Learning Multimodal VAEs through Mutual Supervision. (arXiv:2106.12570v1 [cs.LG])
    (2 min) Multimodal VAEs seek to model the joint distribution over heterogeneous data (e.g.\ vision, language), whilst also capturing a shared representation across such modalities. Prior work has typically combined information from the modalities by reconciling idiosyncratic representations directly in the recognition model through explicit products, mixtures, or other such factorisations. Here we introduce a novel alternative, the MEME, that avoids such explicit combinations by repurposing semi-supervised VAEs to combine information between modalities implicitly through mutual supervision. This formulation naturally allows learning from partially-observed data where some modalities can be entirely missing -- something that most existing approaches either cannot handle, or do so to a limited extent. We demonstrate that MEME outperforms baselines on standard metrics across both partial and complete observation schemes on the MNIST-SVHN (image-image) and CUB (image-text) datasets. We also contrast the quality of the representations learnt by mutual supervision against standard approaches and observe interesting trends in its ability to capture relatedness between data.
    Emergent Social Learning via Multi-agent Reinforcement Learning. (arXiv:2010.00581v3 [cs.LG] UPDATED)
    (2 min) Social learning is a key component of human and animal intelligence. By taking cues from the behavior of experts in their environment, social learners can acquire sophisticated behavior and rapidly adapt to new circumstances. This paper investigates whether independent reinforcement learning (RL) agents in a multi-agent environment can learn to use social learning to improve their performance. We find that in most circumstances, vanilla model-free RL agents do not use social learning. We analyze the reasons for this deficiency, and show that by imposing constraints on the training environment and introducing a model-based auxiliary loss we are able to obtain generalized social learning policies which enable agents to: i) discover complex skills that are not learned from single-agent training, and ii) adapt online to novel environments by taking cues from experts present in the new environment. In contrast, agents trained with model-free RL or imitation learning generalize poorly and do not succeed in the transfer tasks. By mixing multi-agent and solo training, we can obtain agents that use social learning to gain skills that they can deploy when alone, even out-performing agents trained alone from the start.
    Meta-Thompson Sampling. (arXiv:2102.06129v2 [cs.LG] UPDATED)
    (2 min) Efficient exploration in bandits is a fundamental online learning problem. We propose a variant of Thompson sampling that learns to explore better as it interacts with bandit instances drawn from an unknown prior. The algorithm meta-learns the prior and thus we call it MetaTS. We propose several efficient implementations of MetaTS and analyze it in Gaussian bandits. Our analysis shows the benefit of meta-learning and is of a broader interest, because we derive a novel prior-dependent Bayes regret bound for Thompson sampling. Our theory is complemented by empirical evaluation, which shows that MetaTS quickly adapts to the unknown prior.
    Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding. (arXiv:2106.12566v1 [cs.LG])
    (2 min) The attention module, which is a crucial component in Transformer, cannot scale efficiently to long sequences due to its quadratic complexity. Many works focus on approximating the dot-then-exponentiate softmax function in the original attention, leading to sub-quadratic or even linear-complexity Transformer architectures. However, we show that these methods cannot be applied to more powerful attention modules that go beyond the dot-then-exponentiate style, e.g., Transformers with relative positional encoding (RPE). Since in many state-of-the-art models, relative positional encoding is used as default, designing efficient Transformers that can incorporate RPE is appealing. In this paper, we propose a novel way to accelerate attention calculation for Transformers with RPE on top of the kernelized attention. Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT). With FFT, our method achieves $\mathcal{O}(n\log n)$ time complexity. Interestingly, we further demonstrate that properly using relative positional encoding can mitigate the training instability problem of vanilla kernelized attention. On a wide range of tasks, we empirically show that our models can be trained from scratch without any optimization issues. The learned model performs better than many efficient Transformer variants and is faster than standard Transformer in the long-sequence regime.
    Regret Bounds for Stochastic Shortest Path Problems with Linear Function Approximation. (arXiv:2105.01593v2 [cs.LG] UPDATED)
    (2 min) We propose two algorithms that use linear function approximation (LFA) for stochastic shortest path (SSP) and bound their regret over $K$ episodes. When all stationary policies are proper, our first algorithm obtains sublinear regret ($K^{3/4}$), is computationally efficient, and uses stationary policies. This is the first LFA algorithm with these three properties, to the best of our knowledge. Our second algorithm improves the regret to $\sqrt{K}$ when the feature vectors satisfy certain assumptions. Both algorithms are special cases of a more general one, which has $\sqrt{K}$ regret for general features given access to a certain computation oracle. These algorithms and regret bounds are the first for SSP with function approximation.
    Perceiver: General Perception with Iterative Attention. (arXiv:2103.03206v2 [cs.CV] UPDATED)
    (2 min) Biological systems perceive the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture is competitive with or outperforms strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video, and video+audio. The Perceiver obtains performance comparable to ResNet-50 and ViT on ImageNet without 2D convolutions by directly attending to 50,000 pixels. It is also competitive in all modalities in AudioSet.
    Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise. (arXiv:2102.04297v3 [cs.LG] UPDATED)
    (2 min) The empirical success of deep learning is often attributed to SGD's mysterious ability to avoid sharp local minima in the loss landscape, as sharp minima are known to lead to poor generalization. Recently, empirical evidence of heavy-tailed gradient noise was reported in many deep learning tasks, and it was shown in \c{S}im\c{s}ekli (2019a,b) that SGD can escape sharp local minima under the presence of such heavy-tailed gradient noise, providing a partial solution to the mystery. In this work, we analyze a popular variant of SGD where gradients are truncated above a fixed threshold. We show that it achieves a stronger notion of avoiding sharp minima: it can effectively eliminate sharp local minima entirely from its training trajectory. We characterize the dynamics of truncated SGD driven by heavy-tailed noises. First, we show that the truncation threshold and width of the attraction field dictate the order of the first exit time from the associated local minimum. Moreover, when the objective function satisfies appropriate structural conditions, we prove that as the learning rate decreases, the dynamics of heavy-tailed truncated SGD closely resemble those of a continuous-time Markov chain that never visits any sharp minima. Real data experiments on deep learning confirm our theoretical prediction that heavy-tailed SGD with gradient clipping finds a "flatter" local minima and achieves better generalization.
    Optimal training of variational quantum algorithms without barren plateaus. (arXiv:2104.14543v3 [quant-ph] UPDATED)
    (2 min) Variational quantum algorithms (VQAs) promise efficient use of near-term quantum computers. However, training VQAs often requires an extensive amount of time and suffers from the barren plateau problem where the magnitude of the gradients vanishes with increasing number of qubits. Here, we show how to optimally train VQAs for learning quantum states. Parameterized quantum circuits can form Gaussian kernels, which we use to derive adaptive learning rates for gradient ascent. We introduce the generalized quantum natural gradient that features stability and optimized movement in parameter space. Both methods together outperform other optimization routines in training VQAs. Our methods also excel at numerically optimizing driving protocols for quantum control problems. The gradients of the VQA do not vanish when the fidelity between the initial state and the state to be learned is bounded from below. We identify a VQA for quantum simulation with such a constraint that thus can be trained free of barren plateaus. Finally, we propose the application of Gaussian kernels for quantum machine learning.
    On the Power of Localized Perceptron for Label-Optimal Learning of Halfspaces with Adversarial Noise. (arXiv:2012.10793v3 [cs.LG] UPDATED)
    (2 min) We study {\em online} active learning of homogeneous halfspaces in $\mathbb{R}^d$ with adversarial noise where the overall probability of a noisy label is constrained to be at most $\nu$. Our main contribution is a Perceptron-like online active learning algorithm that runs in polynomial time, and under the conditions that the marginal distribution is isotropic log-concave and $\nu = \Omega(\epsilon)$, where $\epsilon \in (0, 1)$ is the target error rate, our algorithm PAC learns the underlying halfspace with near-optimal label complexity of $\tilde{O}\big(d \cdot polylog(\frac{1}{\epsilon})\big)$ and sample complexity of $\tilde{O}\big(\frac{d}{\epsilon} \big)$. Prior to this work, existing online algorithms designed for tolerating the adversarial noise are subject to either label complexity polynomial in $\frac{1}{\epsilon}$, or suboptimal noise tolerance, or restrictive marginal distributions. With the additional prior knowledge that the underlying halfspace is $s$-sparse, we obtain attribute-efficient label complexity of $\tilde{O}\big( s \cdot polylog(d, \frac{1}{\epsilon}) \big)$ and sample complexity of $\tilde{O}\big(\frac{s}{\epsilon} \cdot polylog(d) \big)$. As an immediate corollary, we show that under the agnostic model where no assumption is made on the noise rate $\nu$, our active learner achieves an error rate of $O(OPT) + \epsilon$ with the same running time and label and sample complexity, where $OPT$ is the best possible error rate achievable by any homogeneous halfspace.
    Adaptive Learning of Tensor Network Structures. (arXiv:2008.05437v2 [cs.LG] UPDATED)
    (2 min) Tensor Networks (TN) offer a powerful framework to efficiently represent very high-dimensional objects. TN have recently shown their potential for machine learning applications and offer a unifying view of common tensor decomposition models such as Tucker, tensor train (TT) and tensor ring (TR). However, identifying the best tensor network structure from data for a given task is challenging. In this work, we leverage the TN formalism to develop a generic and efficient adaptive algorithm to jointly learn the structure and the parameters of a TN from data. Our method is based on a simple greedy approach starting from a rank one tensor and successively identifying the most promising tensor network edges for small rank increments. Our algorithm can adaptively identify TN structures with small number of parameters that effectively optimize any differentiable objective function. Experiments on tensor decomposition, tensor completion and model compression tasks demonstrate the effectiveness of the proposed algorithm. In particular, our method outperforms the state-of-the-art evolutionary topology search [Li and Sun, 2020] for tensor decomposition of images (while being orders of magnitude faster) and finds efficient tensor network structures to compress neural networks outperforming popular TT based approaches [Novikov et al., 2015].
    TextSETTR: Few-Shot Text Style Extraction and Tunable Targeted Restyling. (arXiv:2010.03802v3 [cs.CL] UPDATED)
    (2 min) We present a novel approach to the problem of text style transfer. Unlike previous approaches requiring style-labeled training data, our method makes use of readily-available unlabeled text by relying on the implicit connection in style between adjacent sentences, and uses labeled data only at inference time. We adapt T5 (Raffel et al., 2020), a strong pretrained text-to-text model, to extract a style vector from text and use it to condition the decoder to perform style transfer. As our label-free training results in a style vector space encoding many facets of style, we recast transfers as "targeted restyling" vector operations that adjust specific attributes of the input while preserving others. We demonstrate that training on unlabeled Amazon reviews data results in a model that is competitive on sentiment transfer, even compared to models trained fully on labeled data. Furthermore, applying our novel method to a diverse corpus of unlabeled web text results in a single model capable of transferring along multiple dimensions of style (dialect, emotiveness, formality, politeness, sentiment) despite no additional training and using only a handful of exemplars at inference time.
    Rethinking supervised learning: insights from biological learning and from calling it by its name. (arXiv:2012.02526v2 [cs.LG] UPDATED)
    (2 min) The renaissance of artificial neural networks was catalysed by the success of classification models, tagged by the community with the broader term supervised learning. The extraordinary results gave rise to a hype loaded with ambitious promises and overstatements. Soon the community realised that the success owed much to the availability of thousands of labelled examples and supervised learning went, for many, from glory to shame: Some criticised deep learning as a whole and others proclaimed that the way forward had to be alternatives to supervised learning: predictive, unsupervised, semi-supervised and, more recently, self-supervised learning. However, all these seem brand names, rather than actual categories of a theoretically grounded taxonomy. Moreover, the call to banish supervised learning was motivated by the questionable claim that humans learn with little or no supervision and are capable of robust out-of-distribution generalisation. Here, we review insights about learning and supervision in nature, revisit the notion that learning and generalisation are not possible without supervision or inductive biases and argue that we will make better progress if we just call it by its name.
    Post-hoc Uncertainty Calibration for Domain Drift Scenarios. (arXiv:2012.10988v2 [cs.LG] UPDATED)
    (2 min) We address the problem of uncertainty calibration. While standard deep neural networks typically yield uncalibrated predictions, calibrated confidence scores that are representative of the true likelihood of a prediction can be achieved using post-hoc calibration methods. However, to date the focus of these approaches has been on in-domain calibration. Our contribution is two-fold. First, we show that existing post-hoc calibration methods yield highly over-confident predictions under domain shift. Second, we introduce a simple strategy where perturbations are applied to samples in the validation set before performing the post-hoc calibration step. In extensive experiments, we demonstrate that this perturbation step results in substantially better calibration under domain shift on a wide range of architectures and modelling tasks.
    Interpretable Clustering on Dynamic Graphs with Recurrent Graph Neural Networks. (arXiv:2012.08740v2 [cs.LG] UPDATED)
    (2 min) We study the problem of clustering nodes in a dynamic graph, where the connections between nodes and nodes' cluster memberships may change over time, e.g., due to community migration. We first propose a dynamic stochastic block model that captures these changes, and a simple decay-based clustering algorithm that clusters nodes based on weighted connections between them, where the weight decreases at a fixed rate over time. This decay rate can then be interpreted as signifying the importance of including historical connection information in the clustering. However, the optimal decay rate may differ for clusters with different rates of turnover. We characterize the optimal decay rate for each cluster and propose a clustering method that achieves almost exact recovery of the true clusters. We then demonstrate the efficacy of our clustering algorithm with optimized decay rates on simulated graph data. Recurrent neural networks (RNNs), a popular algorithm for sequence learning, use a similar decay-based method, and we use this insight to propose two new RNN-GCN (graph convolutional network) architectures for semi-supervised graph clustering. We finally demonstrate that the proposed architectures perform well on real data compared to state-of-the-art graph clustering algorithms.
    A LightGBM based Forecasting of Dominant Wave Periods in Oceanic Waters. (arXiv:2105.08721v3 [physics.ao-ph] UPDATED)
    (3 min) In this paper, we propose a Light Gradient Boosting (LightGBM) to forecast dominant wave periods in oceanic waters. First, we use the data collected from CDIP buoys and apply various data filtering methods. The data filtering methods allow us to obtain a high-quality dataset for training and validation purposes. We then extract various wave-based features like wave heights, periods, skewness, kurtosis, etc., and atmospheric features like humidity, pressure, and air temperature for the buoys. Afterward, we train algorithms that use LightGBM and Extra Trees through a hv-block cross-validation scheme to forecast dominant wave periods for up to 30 days ahead. LightGBM has the R2 score of 0.94, 0.94, and 0.94 for 1-day ahead, 15-day ahead, and 30-day ahead prediction. Similarly, Extra Trees (ET) has an R2 score of 0.88, 0.86, and 0.85 for 1-day ahead, 15-day ahead, and 30 day ahead prediction. In case of the test dataset, LightGBM has R2 score of 0.94, 0.94, and 0.94 for 1-day ahead, 15-day ahead and 30-day ahead prediction. ET has R2 score of 0.88, 0.86, and 0.85 for 1-day ahead, 15-day ahead, and 30-day ahead prediction. A similar R2 score for both training and the test dataset suggests that the machine learning models developed in this paper are robust. Since the LightGBM algorithm outperforms ET for all the windows tested, it is taken as the final algorithm. Note that the performance of both methods does not decrease significantly as the forecast horizon increases. Likewise, the proposed method outperforms the numerical approaches included in this paper in the test dataset. For 1 day ahead prediction, the proposed algorithm has SI, Bias, CC, and RMSE of 0.09, 0.00, 0.97, and 1.78 compared to 0.268, 0.40, 0.63, and 2.18 for the European Centre for Medium-range Weather Forecasts (ECMWF) model, which outperforms all the other methods in the test dataset.
    Exploiting Spatial Dimensions of Latent in GAN for Real-time Image Editing. (arXiv:2104.14754v2 [cs.CV] UPDATED)
    (2 min) Generative adversarial networks (GANs) synthesize realistic images from random latent vectors. Although manipulating the latent vectors controls the synthesized outputs, editing real images with GANs suffers from i) time-consuming optimization for projecting real images to the latent vectors, ii) or inaccurate embedding through an encoder. We propose StyleMapGAN: the intermediate latent space has spatial dimensions, and a spatially variant modulation replaces AdaIN. It makes the embedding through an encoder more accurate than existing optimization-based methods while maintaining the properties of GANs. Experimental results demonstrate that our method significantly outperforms state-of-the-art models in various image manipulation tasks such as local editing and image interpolation. Last but not least, conventional editing methods on GANs are still valid on our StyleMapGAN. Source code is available at https://github.com/naver-ai/StyleMapGAN.
    Information Bottleneck Attribution for Visual Explanations of Diagnosis and Prognosis. (arXiv:2104.02869v2 [eess.IV] UPDATED)
    (2 min) Visual explanation methods have an important role in the prognosis of the patients where the annotated data is limited or unavailable. There have been several attempts to use gradient-based attribution methods to localize pathology from medical scans without using segmentation labels. This research direction has been impeded by the lack of robustness and reliability. These methods are highly sensitive to the network parameters. In this study, we introduce a robust visual explanation method to address this problem for medical applications. We provide an innovative visual explanation algorithm for general purpose and as an example application, we demonstrate its effectiveness for quantifying lesions in the lungs caused by the Covid-19 with high accuracy and robustness without using dense segmentation labels. This approach overcomes the drawbacks of commonly used Grad-CAM and its extended versions. The premise behind our proposed strategy is that the information flow is minimized while ensuring the classifier prediction stays similar. Our findings indicate that the bottleneck condition provides a more stable severity estimation than the similar attribution methods.
    Sample-Optimal PAC Learning of Halfspaces with Malicious Noise. (arXiv:2102.06247v2 [cs.LG] UPDATED)
    (2 min) We study efficient PAC learning of homogeneous halfspaces in $\mathbb{R}^d$ in the presence of malicious noise of Valiant~(1985). This is a challenging noise model and only until recently has near-optimal noise tolerance bound been established under the mild condition that the unlabeled data distribution is isotropic log-concave. However, it remains unsettled how to obtain the optimal sample complexity simultaneously. In this work, we present a new analysis for the algorithm of Awasthi~et~al.~(2017) and show that it essentially achieves the near-optimal sample complexity bound of $\tilde{O}(d)$, improving the best known result of $\tilde{O}(d^2)$. Our main ingredient is a novel incorporation of a matrix Chernoff-type inequality to bound the spectrum of an empirical covariance matrix for well-behaved distributions, in conjunction with a careful exploration of the localization schemes of Awasthi~et~al.~(2017). We further extend the algorithm and analysis to the more general and stronger nasty noise model of Bshouty~et~al.~(2002), showing that it is still possible to achieve near-optimal noise tolerance and sample complexity in polynomial time.
    Fast Certified Robust Training with Short Warmup. (arXiv:2103.17268v3 [cs.LG] UPDATED)
    (2 min) Recently, bound propagation based certified robust training methods have been proposed for training neural networks with certifiable robustness guarantees. Despite that state-of-the-art (SOTA) methods including interval bound propagation (IBP) and CROWN-IBP have per-batch training complexity similar to standard neural network training, they usually use a long warmup schedule with hundreds or thousands epochs to reach SOTA performance and are thus still costly. In this paper, we identify two important issues in existing methods, namely exploded bounds at initialization, and the imbalance in ReLU activation states. These two issues make certified training difficult and unstable, and thereby long warmup schedules were needed in prior works. To mitigate these issues and conduct certified training with shorter warmup, we propose three improvements: 1) We derive a new weight initialization method for IBP training; 2) We propose to fully add Batch Normalization (BN) to each layer in the model, since we find BN can reduce the imbalance in ReLU activation states; 3) We also design regularization to explicitly tighten certified bounds and balance ReLU activation states. In our experiments, we are able to obtain 65.03% verified error on CIFAR-10 ($\epsilon=\frac{8}{255}$) and 82.36% verified error on TinyImageNet ($\epsilon=\frac{1}{255}$) using very short training schedules (160 and 80 total epochs, respectively), outperforming literature SOTA trained with hundreds or thousands epochs under the same network architecture.
    Tractable structured natural gradient descent using local parameterizations. (arXiv:2102.07405v5 [stat.ML] UPDATED)
    (2 min) Natural-gradient descent on structured parameter spaces (e.g., low-rank covariances) is computationally challenging due to complicated inverse Fisher-matrix computations. We address this issue for optimization, inference, and search problems by using \emph{local-parameter coordinates}. Our method generalizes an existing evolutionary-strategy method, recovers Newton and Riemannian-gradient methods as special cases, and also yields new tractable natural-gradient algorithms for learning flexible covariance structures of Gaussian and Wishart-based distributions via \emph{matrix groups}. We show results on a range of applications on deep learning, variational inference, and evolution strategies. Our work opens a new direction for scalable structured geometric methods via local parameterizations.
    Dissecting Supervised Constrastive Learning. (arXiv:2102.08817v2 [stat.ML] UPDATED)
    (2 min) Minimizing cross-entropy over the softmax scores of a linear map composed with a high-capacity encoder is arguably the most popular choice for training neural networks on supervised learning tasks. However, recent works show that one can directly optimize the encoder instead, to obtain equally (or even more) discriminative representations via a supervised variant of a contrastive objective. In this work, we address the question whether there are fundamental differences in the sought-for representation geometry in the output space of the encoder at minimal loss. Specifically, we prove, under mild assumptions, that both losses attain their minimum once the representations of each class collapse to the vertices of a regular simplex, inscribed in a hypersphere. We provide empirical evidence that this configuration is attained in practice and that reaching a close-to-optimal state typically indicates good generalization performance. Yet, the two losses show remarkably different optimization behavior. The number of iterations required to perfectly fit to data scales superlinearly with the amount of randomly flipped labels for the supervised contrastive loss. This is in contrast to the approximately linear scaling previously reported for networks trained with cross-entropy.
    Stiff Neural Ordinary Differential Equations. (arXiv:2103.15341v2 [math.NA] UPDATED)
    (2 min) Neural Ordinary Differential Equations (ODE) are a promising approach to learn dynamic models from time-series data in science and engineering applications. This work aims at learning Neural ODE for stiff systems, which are usually raised from chemical kinetic modeling in chemical and biological systems. We first show the challenges of learning neural ODE in the classical stiff ODE systems of Robertson's problem and propose techniques to mitigate the challenges associated with scale separations in stiff systems. We then present successful demonstrations in stiff systems of Robertson's problem and an air pollution problem. The demonstrations show that the usage of deep networks with rectified activations, proper scaling of the network outputs as well as loss functions, and stabilized gradient calculations are the key techniques enabling the learning of stiff neural ODE. The success of learning stiff neural ODE opens up possibilities of using neural ODEs in applications with widely varying time-scales, like chemical dynamics in energy conversion, environmental engineering, and the life sciences.
    Coarse-to-Fine Q-attention: Efficient Learning for Visual Robotic Manipulation via Discretisation. (arXiv:2106.12534v1 [cs.RO])
    (2 min) Reflecting on the last few years, the biggest breakthroughs in deep reinforcement learning (RL) have been in the discrete action domain. Robotic manipulation, however, is inherently a continuous control environment, but these continuous control reinforcement learning algorithms often depend on actor-critic methods that are sample-inefficient and inherently difficult to train, due to the joint optimisation of the actor and critic. To that end, we explore how we can bring the stability of discrete action RL algorithms to the robot manipulation domain. We extend the recently released ARM algorithm, by replacing the continuous next-best pose agent with a discrete next-best pose agent. Discretisation of rotation is trivial given its bounded nature, while translation is inherently unbounded, making discretisation difficult. We formulate the translation prediction as the voxel prediction problem by discretising the 3D space; however, voxelisation of a large workspace is memory intensive and would not work with a high density of voxels, crucial to obtaining the resolution needed for robotic manipulation. We therefore propose to apply this voxel prediction in a coarse-to-fine manner by gradually increasing the resolution. In each step, we extract the highest valued voxel as the predicted location, which is then used as the centre of the higher-resolution voxelisation in the next step. This coarse-to-fine prediction is applied over several steps, giving a near-lossless prediction of the translation. We show that our new coarse-to-fine algorithm is able to accomplish RLBench tasks much more efficiently than the continuous control equivalent, and even train some real-world tasks, tabular rasa, in less than 7 minutes, with only 3 demonstrations. Moreover, we show that by moving to a voxel representation, we are able to easily incorporate observations from multiple cameras.
    Feature Alignment for Approximated Reversibility in Neural Networks. (arXiv:2106.12562v1 [cs.LG])
    (2 min) We introduce feature alignment, a technique for obtaining approximate reversibility in artificial neural networks. By means of feature extraction, we can train a neural network to learn an estimated map for its reverse process from outputs to inputs. Combined with variational autoencoders, we can generate new samples from the same statistics as the training data. Improvements of the results are obtained by using concepts from generative adversarial networks. Finally, we show that the technique can be modified for training neural networks locally, saving computational memory resources. Applying these techniques, we report results for three vision generative tasks: MNIST, CIFAR-10, and celebA.
    Real-time Outdoor Localization Using Radio Maps: A Deep Learning Approach. (arXiv:2106.12556v1 [cs.LG])
    (2 min) This paper deals with the problem of localization in a cellular network in a dense urban scenario. Global Navigation Satellite Systems typically perform poorly in urban environments, where the likelihood of line-of-sight conditions between the devices and the satellites is low, and thus alternative localization methods are required for good accuracy. We present a deep learning method for localization, based merely on pathloss, which does not require any increase in computation complexity at the user devices with respect to the device standard operations, unlike methods that rely on time of arrival or angle of arrival information. In a wireless network, user devices scan the base station beacon slots and identify the few strongest base station signals for handover and user-base station association purposes. In the proposed method, the user to be localized simply reports such received signal strengths to a central processing unit, which may be located in the cloud. For each base station we have good approximation of the pathloss at every location in a dense grid in the map. This approximation is provided by RadioUNet, a deep learning-based simulator of pathloss functions in urban environment, that we have previously proposed and published. Using the estimated pathloss radio maps of all base stations and the corresponding reported signal strengths, the proposed deep learning algorithm can extract a very accurate localization of the user. The proposed method, called LocUNet, enjoys high robustness to inaccuracies in the estimated radio maps. We demonstrate this by numerical experiments, which obtain state-of-the-art results.
    Meta-Learning Divergences of Variational Inference. (arXiv:2007.02912v2 [cs.LG] UPDATED)
    (2 min) Variational inference (VI) plays an essential role in approximate Bayesian inference due to its computational efficiency and broad applicability. Crucial to the performance of VI is the selection of the associated divergence measure, as VI approximates the intractable distribution by minimizing this divergence. In this paper we propose a meta-learning algorithm to learn the divergence metric suited for the task of interest, automating the design of VI methods. In addition, we learn the initialization of the variational parameters without additional cost when our method is deployed in the few-shot learning scenarios. We demonstrate our approach outperforms standard VI on Gaussian mixture distribution approximation, Bayesian neural network regression, image generation with variational autoencoders and recommender systems with a partial variational autoencoder.
    Modality Attention and Sampling Enables Deep Learning with Heterogeneous Marker Combinations in Fluorescence Microscopy. (arXiv:2008.12380v2 [cs.CV] UPDATED)
    (3 min) Fluorescence microscopy allows for a detailed inspection of cells, cellular networks, and anatomical landmarks by staining with a variety of carefully-selected markers visualized as color channels. Quantitative characterization of structures in acquired images often relies on automatic image analysis methods. Despite the success of deep learning methods in other vision applications, their potential for fluorescence image analysis remains underexploited. One reason lies in the considerable workload required to train accurate models, which are normally specific for a given combination of markers, and therefore applicable to a very restricted number of experimental settings. We herein propose Marker Sampling and Excite, a neural network approach with a modality sampling strategy and a novel attention module that together enable (i) flexible training with heterogeneous datasets with combinations of markers and (ii) successful utility of learned models on arbitrary subsets of markers prospectively. We show that our single neural network solution performs comparably to an upper bound scenario where an ensemble of many networks is na\"ively trained for each possible marker combination separately. In addition, we demonstrate the feasibility of this framework in high-throughput biological analysis by revising a recent quantitative characterization of bone marrow vasculature in 3D confocal microscopy datasets and further confirm the validity of our approach on an additional, significantly different dataset of microvessels in fetal liver tissues. Not only can our work substantially ameliorate the use of deep learning in fluorescence microscopy analysis, but it can also be utilized in other fields with incomplete data acquisitions and missing modalities.
    Feature Attributions and Counterfactual Explanations Can Be Manipulated. (arXiv:2106.12563v1 [cs.LG])
    (2 min) As machine learning models are increasingly used in critical decision-making settings (e.g., healthcare, finance), there has been a growing emphasis on developing methods to explain model predictions. Such \textit{explanations} are used to understand and establish trust in models and are vital components in machine learning pipelines. Though explanations are a critical piece in these systems, there is little understanding about how they are vulnerable to manipulation by adversaries. In this paper, we discuss how two broad classes of explanations are vulnerable to manipulation. We demonstrate how adversaries can design biased models that manipulate model agnostic feature attribution methods (e.g., LIME \& SHAP) and counterfactual explanations that hill-climb during the counterfactual search (e.g., Wachter's Algorithm \& DiCE) into \textit{concealing} the model's biases. These vulnerabilities allow an adversary to deploy a biased model, yet explanations will not reveal this bias, thereby deceiving stakeholders into trusting the model. We evaluate the manipulations on real world data sets, including COMPAS and Communities \& Crime, and find explanations can be manipulated in practice.
    SIGL: Securing Software Installations Through Deep Graph Learning. (arXiv:2008.11533v2 [cs.CR] UPDATED)
    (2 min) Many users implicitly assume that software can only be exploited after it is installed. However, recent supply-chain attacks demonstrate that application integrity must be ensured during installation itself. We introduce SIGL, a new tool for detecting malicious behavior during software installation. SIGL collects traces of system call activity, building a data provenance graph that it analyzes using a novel autoencoder architecture with a graph long short-term memory network (graph LSTM) for the encoder and a standard multilayer perceptron for the decoder. SIGL flags suspicious installations as well as the specific installation-time processes that are likely to be malicious. Using a test corpus of 625 malicious installers containing real-world malware, we demonstrate that SIGL has a detection accuracy of 96%, outperforming similar systems from industry and academia by up to 87% in precision and recall and 45% in accuracy. We also demonstrate that SIGL can pinpoint the processes most likely to have triggered malicious behavior, works on different audit platforms and operating systems, and is robust to training data contamination and adversarial attack. It can be used with application-specific models, even in the presence of new software versions, as well as application-agnostic meta-models that encompass a wide range of applications and installers.
    Learning Explainable Representations of Malware Behavior. (arXiv:2106.12328v1 [cs.LG])
    (2 min) We address the problems of identifying malware in network telemetry logs and providing \emph{indicators of compromise} -- comprehensible explanations of behavioral patterns that identify the threat. In our system, an array of specialized detectors abstracts network-flow data into comprehensible \emph{network events} in a first step. We develop a neural network that processes this sequence of events and identifies specific threats, malware families and broad categories of malware. We then use the \emph{integrated-gradients} method to highlight events that jointly constitute the characteristic behavioral pattern of the threat. We compare network architectures based on CNNs, LSTMs, and transformers, and explore the efficacy of unsupervised pre-training experimentally on large-scale telemetry data. We demonstrate how this system detects njRAT and other malware based on behavioral patterns.
    Synthetic Benchmarks for Scientific Research in Explainable Machine Learning. (arXiv:2106.12543v1 [cs.LG])
    (2 min) As machine learning models grow more complex and their applications become more high-stakes, tools for explaining model predictions have become increasingly important. Despite the widespread use of explainability techniques, evaluating and comparing different feature attribution methods remains challenging: evaluations ideally require human studies, and empirical evaluation metrics are often computationally prohibitive on real-world datasets. In this work, we address this issue by releasing XAI-Bench: a suite of synthetic datasets along with a library for benchmarking feature attribution algorithms. Unlike real-world datasets, synthetic datasets allow the efficient computation of conditional expected values that are needed to evaluate ground-truth Shapley values and other metrics. The synthetic datasets we release offer a wide variety of parameters that can be configured to simulate real-world data. We demonstrate the power of our library by benchmarking popular explainability techniques across several evaluation metrics and identifying failure modes for popular explainers. The efficiency of our library will help bring new explainability methods from development to deployment.
    ATOM: Robustifying Out-of-distribution Detection Using Outlier Mining. (arXiv:2006.15207v3 [cs.LG] UPDATED)
    (2 min) Detecting out-of-distribution (OOD) inputs is critical for safely deploying deep learning models in an open-world setting. However, existing OOD detection solutions can be brittle in the open world, facing various types of adversarial OOD inputs. While methods leveraging auxiliary OOD data have emerged, our analysis on illuminative examples reveals a key insight that the majority of auxiliary OOD examples may not meaningfully improve or even hurt the decision boundary of the OOD detector, which is also observed in empirical results on real data. In this paper, we provide a theoretically motivated method, Adversarial Training with informative Outlier Mining (ATOM), which improves the robustness of OOD detection. We show that, by mining informative auxiliary OOD data, one can significantly improve OOD detection performance, and somewhat surprisingly, generalize to unseen adversarial attacks. ATOM achieves state-of-the-art performance under a broad family of classic and adversarial OOD evaluation tasks. For example, on the CIFAR-10 in-distribution dataset, ATOM reduces the FPR (at TPR 95%) by up to 57.99% under adversarial OOD inputs, surpassing the previous best baseline by a large margin.
    Taming GANs with Lookahead-Minmax. (arXiv:2006.14567v3 [stat.ML] UPDATED)
    (2 min) Generative Adversarial Networks are notoriously challenging to train. The underlying minmax optimization is highly susceptible to the variance of the stochastic gradient and the rotational component of the associated game vector field. To tackle these challenges, we propose the Lookahead algorithm for minmax optimization, originally developed for single objective minimization only. The backtracking step of our Lookahead-minmax naturally handles the rotational game dynamics, a property which was identified to be key for enabling gradient ascent descent methods to converge on challenging examples often analyzed in the literature. Moreover, it implicitly handles high variance without using large mini-batches, known to be essential for reaching state of the art performance. Experimental results on MNIST, SVHN, CIFAR-10, and ImageNet demonstrate a clear advantage of combining Lookahead-minmax with Adam or extragradient, in terms of performance and improved stability, for negligible memory and computational cost. Using 30-fold fewer parameters and 16-fold smaller minibatches we outperform the reported performance of the class-dependent BigGAN on CIFAR-10 by obtaining FID of 12.19 without using the class labels, bringing state-of-the-art GAN training within reach of common computational resources.
    Pre-trained Models for Natural Language Processing: A Survey. (arXiv:2003.08271v4 [cs.CL] UPDATED)
    (2 min) Recently, the emergence of pre-trained models (PTMs) has brought natural language processing (NLP) to a new era. In this survey, we provide a comprehensive review of PTMs for NLP. We first briefly introduce language representation learning and its research progress. Then we systematically categorize existing PTMs based on a taxonomy with four perspectives. Next, we describe how to adapt the knowledge of PTMs to the downstream tasks. Finally, we outline some potential directions of PTMs for future research. This survey is purposed to be a hands-on guide for understanding, using, and developing PTMs for various NLP tasks.
    A Review of Assistive Technologies for Activities of Daily Living of Elderly. (arXiv:2106.12183v1 [cs.HC])
    (2 min) One of the distinct features of this century has been the population of older adults which has been on a constant rise. Elderly people have several needs and requirements due to physical disabilities, cognitive issues, weakened memory and disorganized behavior, that they face with increasing age. The extent of these limitations also differs according to the varying diversities in elderly, which include age, gender, background, experience, skills, knowledge and so on. These varying needs and challenges with increasing age, limits abilities of older adults to perform Activities of Daily Living (ADLs) in an independent manner. To add to it, the shortage of caregivers creates a looming need for technology-based services for elderly people, to assist them in performing their daily routine tasks to sustain their independent living and active aging. To address these needs, this work consists of making three major contributions in this field. First, it provides a rather comprehensive review of assisted living technologies aimed at helping elderly people to perform ADLs. Second, the work discusses the challenges identified through this review, that currently exist in the context of implementation of assisted living services for elderly care in Smart Homes and Smart Cities. Finally, the work also outlines an approach for implementation, extension and integration of the existing works in this field for development of a much-needed framework that can provide personalized assistance and user-centered behavior interventions to elderly as per their varying and ever-changing needs.
    Algorithm Based on One Monocular Video Delivers Highly Valid and Reliable Gait Parameters. (arXiv:2008.08045v5 [eess.SP] UPDATED)
    (2 min) Despite its paramount importance for manifold use cases (e.g., in the health care industry, sports, rehabilitation and fitness assessment), sufficiently valid and reliable gait parameter measurement is still limited to high-tech gait laboratories mostly. Here, we demonstrate the excellent validity and test-retest repeatability of a novel gait assessment system which is built upon modern convolutional neural networks to extract three-dimensional skeleton joints from monocular frontal-view videos of walking humans. The validity study is based on a comparison to the GAITRite pressure-sensitive walkway system. All measured gait parameters (gait speed, cadence, step length and step time) showed excellent concurrent validity for multiple walk trials at normal and fast gait speeds. The test-retest-repeatability is on the same level as the GAITRite system. In conclusion, we are convinced that our results can pave the way for cost, space and operationally effective gait analysis in broad mainstream applications. Most sensor-based systems are costly, must be operated by extensively trained personnel (e.g., motion capture systems) or - even if not quite as costly - still possess considerable complexity (e.g., wearable sensors). In contrast, a video sufficient for the assessment method presented here can be obtained by anyone, without much training, via a smartphone camera.
    Learning to Generalize Across Long-Horizon Tasks from Human Demonstrations. (arXiv:2003.06085v2 [cs.RO] UPDATED)
    (2 min) Imitation learning is an effective and safe technique to train robot policies in the real world because it does not depend on an expensive random exploration process. However, due to the lack of exploration, learning policies that generalize beyond the demonstrated behaviors is still an open challenge. We present a novel imitation learning framework to enable robots to 1) learn complex real world manipulation tasks efficiently from a small number of human demonstrations, and 2) synthesize new behaviors not contained in the collected demonstrations. Our key insight is that multi-task domains often present a latent structure, where demonstrated trajectories for different tasks intersect at common regions of the state space. We present Generalization Through Imitation (GTI), a two-stage offline imitation learning algorithm that exploits this intersecting structure to train goal-directed policies that generalize to unseen start and goal state combinations. In the first stage of GTI, we train a stochastic policy that leverages trajectory intersections to have the capacity to compose behaviors from different demonstration trajectories together. In the second stage of GTI, we collect a small set of rollouts from the unconditioned stochastic policy of the first stage, and train a goal-directed agent to generalize to novel start and goal configurations. We validate GTI in both simulated domains and a challenging long-horizon robotic manipulation domain in the real world. Additional results and videos are available at https://sites.google.com/view/gti2020/ .
    SketchEmbedNet: Learning Novel Concepts by Imitating Drawings. (arXiv:2009.04806v4 [cs.CV] UPDATED)
    (2 min) Sketch drawings capture the salient information of visual concepts. Previous work has shown that neural networks are capable of producing sketches of natural objects drawn from a small number of classes. While earlier approaches focus on generation quality or retrieval, we explore properties of image representations learned by training a model to produce sketches of images. We show that this generative, class-agnostic model produces informative embeddings of images from novel examples, classes, and even novel datasets in a few-shot setting. Additionally, we find that these learned representations exhibit interesting structure and compositionality.
    Dual T: Reducing Estimation Error for Transition Matrix in Label-noise Learning. (arXiv:2006.07805v3 [cs.LG] UPDATED)
    (2 min) The transition matrix, denoting the transition relationship from clean labels to noisy labels, is essential to build statistically consistent classifiers in label-noise learning. Existing methods for estimating the transition matrix rely heavily on estimating the noisy class posterior. However, the estimation error for noisy class posterior could be large due to the randomness of label noise, which would lead the transition matrix to be poorly estimated. Therefore, in this paper, we aim to solve this problem by exploiting the divide-and-conquer paradigm. Specifically, we introduce an intermediate class to avoid directly estimating the noisy class posterior. By this intermediate class, the original transition matrix can then be factorized into the product of two easy-to-estimate transition matrices. We term the proposed method the dual-T estimator. Both theoretical analyses and empirical results illustrate the effectiveness of the dual-T estimator for estimating transition matrices, leading to better classification performances.
    Learning Stochastic Majority Votes by Minimizing a PAC-Bayes Generalization Bound. (arXiv:2106.12535v1 [cs.LG])
    (2 min) We investigate a stochastic counterpart of majority votes over finite ensembles of classifiers, and study its generalization properties. While our approach holds for arbitrary distributions, we instantiate it with Dirichlet distributions: this allows for a closed-form and differentiable expression for the expected risk, which then turns the generalization bound into a tractable training objective. The resulting stochastic majority vote learning algorithm achieves state-of-the-art accuracy and benefits from (non-vacuous) tight generalization bounds, in a series of numerical experiments when compared to competing algorithms which also minimize PAC-Bayes objectives -- both with uninformed (data-independent) and informed (data-dependent) priors.
    Beyond Predictions in Neural ODEs: Identification and Interventions. (arXiv:2106.12430v1 [cs.LG])
    (2 min) Spurred by tremendous success in pattern matching and prediction tasks, researchers increasingly resort to machine learning to aid original scientific discovery. Given large amounts of observational data about a system, can we uncover the rules that govern its evolution? Solving this task holds the great promise of fully understanding the causal interactions and being able to make reliable predictions about the system's behavior under interventions. We take a step towards answering this question for time-series data generated from systems of ordinary differential equations (ODEs). While the governing ODEs might not be identifiable from data alone, we show that combining simple regularization schemes with flexible neural ODEs can robustly recover the dynamics and causal structures from time-series data. Our results on a variety of (non)-linear first and second order systems as well as real data validate our method. We conclude by showing that we can also make accurate predictions under interventions on variables or the system itself.
    Robust Compressed Sensing using Generative Models. (arXiv:2006.09461v3 [stat.ML] UPDATED)
    (2 min) The goal of compressed sensing is to estimate a high dimensional vector from an underdetermined system of noisy linear equations. In analogy to classical compressed sensing, here we assume a generative model as a prior, that is, we assume the vector is represented by a deep generative model $G: \mathbb{R}^k \rightarrow \mathbb{R}^n$. Classical recovery approaches such as empirical risk minimization (ERM) are guaranteed to succeed when the measurement matrix is sub-Gaussian. However, when the measurement matrix and measurements are heavy-tailed or have outliers, recovery may fail dramatically. In this paper we propose an algorithm inspired by the Median-of-Means (MOM). Our algorithm guarantees recovery for heavy-tailed data, even in the presence of outliers. Theoretically, our results show our novel MOM-based algorithm enjoys the same sample complexity guarantees as ERM under sub-Gaussian assumptions. Our experiments validate both aspects of our claims: other algorithms are indeed fragile and fail under heavy-tailed and/or corrupted data, while our approach exhibits the predicted robustness.
    Who Leads and Who Follows in Strategic Classification?. (arXiv:2106.12529v1 [cs.LG])
    (2 min) As predictive models are deployed into the real world, they must increasingly contend with strategic behavior. A growing body of work on strategic classification treats this problem as a Stackelberg game: the decision-maker "leads" in the game by deploying a model, and the strategic agents "follow" by playing their best response to the deployed model. Importantly, in this framing, the burden of learning is placed solely on the decision-maker, while the agents' best responses are implicitly treated as instantaneous. In this work, we argue that the order of play in strategic classification is fundamentally determined by the relative frequencies at which the decision-maker and the agents adapt to each other's actions. In particular, by generalizing the standard model to allow both players to learn over time, we show that a decision-maker that makes updates faster than the agents can reverse the order of play, meaning that the agents lead and the decision-maker follows. We observe in standard learning settings that such a role reversal can be desirable for both the decision-maker and the strategic agents. Finally, we show that a decision-maker with the freedom to choose their update frequency can induce learning dynamics that converge to Stackelberg equilibria with either order of play.
    Unsupervised Speech Enhancement using Dynamical Variational Auto-Encoders. (arXiv:2106.12271v1 [cs.SD])
    (2 min) Dynamical variational auto-encoders (DVAEs) are a class of deep generative models with latent variables, dedicated to time series data modeling. DVAEs can be considered as extensions of the variational autoencoder (VAE) that include the modeling of temporal dependencies between successive observed and/or latent vectors in data sequences. Previous work has shown the interest of DVAEs and their better performance over the VAE for speech signals (spectrogram) modeling. Independently, the VAE has been successfully applied to speech enhancement in noise, in an unsupervised noise-agnostic set-up that does not require the use of a parallel dataset of clean and noisy speech samples for training, but only requires clean speech signals. In this paper, we extend those works to DVAE-based single-channel unsupervised speech enhancement, hence exploiting both speech signals unsupervised representation learning and dynamics modeling. We propose an unsupervised speech enhancement algorithm based on the most general form of DVAEs, that we then adapt to three specific DVAE models to illustrate the versatility of the framework. More precisely, we combine DVAE-based speech priors with a noise model based on nonnegative matrix factorization, and we derive a variational expectation-maximization (VEM) algorithm to perform speech enhancement. Experimental results show that the proposed approach based on DVAEs outperforms its VAE counterpart and a supervised speech enhancement baseline.
    Sampling with Mirrored Stein Operators. (arXiv:2106.12506v1 [stat.ML])
    (2 min) We introduce a new family of particle evolution samplers suitable for constrained domains and non-Euclidean geometries. Stein Variational Mirror Descent and Mirrored Stein Variational Gradient Descent minimize the Kullback-Leibler (KL) divergence to constrained target distributions by evolving particles in a dual space defined by a mirror map. Stein Variational Natural Gradient exploits non-Euclidean geometry to more efficiently minimize the KL divergence to unconstrained targets. We derive these samplers from a new class of mirrored Stein operators and adaptive kernels developed in this work. We demonstrate that these new samplers yield accurate approximations to distributions on the simplex, deliver valid confidence intervals in post-selection inference, and converge more rapidly than prior methods in large-scale unconstrained posterior inference. Finally, we establish the convergence of our new procedures under verifiable conditions on the target distribution.
    False perfection in machine prediction: Detecting and assessing circularity problems in machine learning. (arXiv:2106.12417v1 [cs.LG])
    (2 min) Machine learning algorithms train models from patterns of input data and target outputs, with the goal of predicting correct outputs for unseen test inputs. Here we demonstrate a problem of machine learning in vital application areas such as medical informatics or patent law that consists of the inclusion of measurements on which target outputs are deterministically defined in the representations of input data. This leads to perfect, but circular predictions based on a machine reconstruction of the known target definition, but fails on real-world data where the defining measurements may not or only incompletely be available. We present a circularity test that shows, for given datasets and black-box machine learning models, whether the target functional definition can be reconstructed and has been used in training. We argue that a transfer of research results to real-world applications requires to avoid circularity by separating measurements that define target outcomes from data representations in machine learning.
    Blur, Noise, and Compression Robust Generative Adversarial Networks. (arXiv:2003.07849v2 [cs.CV] UPDATED)
    (2 min) Generative adversarial networks (GANs) have gained considerable attention owing to their ability to reproduce images. However, they can recreate training images faithfully despite image degradation in the form of blur, noise, and compression, generating similarly degraded images. To solve this problem, the recently proposed noise robust GAN (NR-GAN) provides a partial solution by demonstrating the ability to learn a clean image generator directly from noisy images using a two-generator model comprising image and noise generators. However, its application is limited to noise, which is relatively easy to decompose owing to its additive and reversible characteristics, and its application to irreversible image degradation, in the form of blur, compression, and combination of all, remains a challenge. To address these problems, we propose blur, noise, and compression robust GAN (BNCR-GAN) that can learn a clean image generator directly from degraded images without knowledge of degradation parameters (e.g., blur kernel types, noise amounts, or quality factor values). Inspired by NR-GAN, BNCR-GAN uses a multiple-generator model composed of image, blur-kernel, noise, and quality-factor generators. However, in contrast to NR-GAN, to address irreversible characteristics, we introduce masking architectures adjusting degradation strength values in a data-driven manner using bypasses before and after degradation. Furthermore, to suppress uncertainty caused by the combination of blur, noise, and compression, we introduce adaptive consistency losses imposing consistency between irreversible degradation processes according to the degradation strengths. We demonstrate the effectiveness of BNCR-GAN through large-scale comparative studies on CIFAR-10 and a generality analysis on FFHQ. In addition, we demonstrate the applicability of BNCR-GAN in image restoration.
    Imitation Learning: Progress, Taxonomies and Opportunities. (arXiv:2106.12177v1 [cs.LG])
    (2 min) Imitation learning aims to extract knowledge from human experts' demonstrations or artificially created agents in order to replicate their behaviors. Its success has been demonstrated in areas such as video games, autonomous driving, robotic simulations and object manipulation. However, this replicating process could be problematic, such as the performance is highly dependent on the demonstration quality, and most trained agents are limited to perform well in task-specific environments. In this survey, we provide a systematic review on imitation learning. We first introduce the background knowledge from development history and preliminaries, followed by presenting different taxonomies within Imitation Learning and key milestones of the field. We then detail challenges in learning strategies and present research opportunities with learning policy from suboptimal demonstration, voice instructions and other associated optimization schemes.
    Predicting Legal Proceedings Status: Approaches Based on Sequential Text Data. (arXiv:2003.11561v4 [cs.CL] UPDATED)
    (2 min) The objective of this paper is to develop predictive models to classify Brazilian legal proceedings in three possible classes of status: (i) archived proceedings, (ii) active proceedings, and (iii) suspended proceedings. This problem's resolution is intended to assist public and private institutions in managing large portfolios of legal proceedings, providing gains in scale and efficiency. In this paper, legal proceedings are made up of sequences of short texts called "motions." We combined several natural language processing (NLP) and machine learning techniques to solve the problem. Although working with Portuguese NLP, which can be challenging due to lack of resources, our approaches performed remarkably well in the classification task, achieving maximum accuracy of .93 and top average F1 Scores of .89 (macro) and .93 (weighted). Furthermore, we could extract and interpret the patterns learned by one of our models besides quantifying how those patterns relate to the classification task. The interpretability step is important among machine learning legal applications and gives us an exciting insight into how black-box models make decisions.
    Differentially Quantized Gradient Methods. (arXiv:2002.02508v3 [cs.LG] UPDATED)
    (2 min) This paper considers quantized distributed optimization algorithms in the parameter server framework of distributed training. We introduce the principle we call Differential Quantization (DQ) that prescribes that the past quantization errors should be compensated in such a way as to direct the descent trajectory of a quantized algorithm towards that of its unquantized counterpart. Assuming that the objective function is smooth and strongly convex, we prove that in the limit of large problem dimension, Differentially Quantized Gradient Descent (DQ-GD) attains a linear contraction factor of $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$, where $\sigma_{\mathrm{GD}}$ is the contraction factor of unquantized gradient descent (GD). Thus at any $R\geq\log_2 1 /\sigma_{\mathrm{GD}}$ bits, the contraction factor of DQ-GD is the same as that of unquantized GD, i.e., there is no loss due to quantization. We show a converse demonstrating that no quantized gradient descent algorithm can converge faster than $\max\{\sigma_{\mathrm{GD}}, 2^{-R}\}$. In contrast, naively quantized GD where the worker directly quantizes the gradient barely attains $\sigma_{\mathrm{GD}} + 2^{-R}$. The principle of differential quantization continues to apply to gradient methods with momentum such as Nesterov's accelerated gradient descent, and Polyak's heavy ball method. For these algorithms as well, if the rate is above a certain threshold, there is no loss in contraction factor obtained by the differentially quantized algorithm compared to its unquantized counterpart, and furthermore, the differentially quantized heavy ball method attains the optimal contraction achievable among all (even unquantized) gradient methods. Experimental results on both simulated and real-world least-squares problems validate our theoretical analysis.
    Innovations Autoencoder and its Application in Real-Time Anomaly Detection. (arXiv:2106.12382v1 [stat.ML])
    (2 min) An innovations sequence of a time series is a sequence of independent and identically distributed random variables with which the original time series has a causal representation. The innovation at a time is statistically independent of the prior history of the time series. As such, it represents the new information contained at present but not in the past. Because of its simple probability structure, an innovations sequence is the most efficient signature of the original. Unlike the principle or independent analysis (PCA/ICA) representations, an innovations sequence preserves not only the complete statistical properties but also the temporal order of the original time series. An long-standing open problem is to find a computationally tractable way to extract an innovations sequence of non-Gaussian processes. This paper presents a deep learning approach, referred to as Innovations Autoencoder (IAE), that extracts innovations sequences using a causal convolutional neural network. An application of IAE to nonparametric anomaly detection with unknown anomaly and anomaly-free models is also presented.
    Training Data Subset Selection for Regression with Controlled Generalization Error. (arXiv:2106.12491v1 [cs.LG])
    (2 min) Data subset selection from a large number of training instances has been a successful approach toward efficient and cost-effective machine learning. However, models trained on a smaller subset may show poor generalization ability. In this paper, our goal is to design an algorithm for selecting a subset of the training data, so that the model can be trained quickly, without significantly sacrificing on accuracy. More specifically, we focus on data subset selection for L2 regularized regression problems and provide a novel problem formulation which seeks to minimize the training loss with respect to both the trainable parameters and the subset of training data, subject to error bounds on the validation set. We tackle this problem using several technical innovations. First, we represent this problem with simplified constraints using the dual of the original training problem and show that the objective of this new representation is a monotone and alpha-submodular function, for a wide variety of modeling choices. Such properties lead us to develop SELCON, an efficient majorization-minimization algorithm for data subset selection, that admits an approximation guarantee even when the training provides an imperfect estimate of the trained model. Finally, our experiments on several datasets show that SELCON trades off accuracy and efficiency more effectively than the current state-of-the-art.
    Graph Universal Adversarial Attacks: A Few Bad Actors Ruin Graph Learning Models. (arXiv:2002.04784v2 [cs.LG] UPDATED)
    (2 min) Deep neural networks, while generalize well, are known to be sensitive to small adversarial perturbations. This phenomenon poses severe security threat and calls for in-depth investigation of the robustness of deep learning models. With the emergence of neural networks for graph structured data, similar investigations are urged to understand their robustness. It has been found that adversarially perturbing the graph structure and/or node features may result in a significant degradation of the model performance. In this work, we show from a different angle that such fragility similarly occurs if the graph contains a few bad-actor nodes, which compromise a trained graph neural network through flipping the connections to any targeted victim. Worse, the bad actors found for one graph model severely compromise other models as well. We call the bad actors ``anchor nodes'' and propose an algorithm, named GUA, to identify them. Thorough empirical investigations suggest an interesting finding that the anchor nodes often belong to the same class; and they also corroborate the intuitive trade-off between the number of anchor nodes and the attack success rate. For the dataset Cora which contains 2708 nodes, as few as six anchor nodes will result in an attack success rate higher than 80\% for GCN and other three models.
    Gradient-Based Interpretability Methods and Binarized Neural Networks. (arXiv:2106.12569v1 [cs.CV])
    (2 min) Binarized Neural Networks (BNNs) have the potential to revolutionize the way that deep learning is carried out in edge computing platforms. However, the effectiveness of interpretability methods on these networks has not been assessed. In this paper, we compare the performance of several widely used saliency map-based interpretabilty techniques (Gradient, SmoothGrad and GradCAM), when applied to Binarized or Full Precision Neural Networks (FPNNs). We found that the basic Gradient method produces very similar-looking maps for both types of network. However, SmoothGrad produces significantly noisier maps for BNNs. GradCAM also produces saliency maps which differ between network types, with some of the BNNs having seemingly nonsensical explanations. We comment on possible reasons for these differences in explanations and present it as an example of why interpretability techniques should be tested on a wider range of network types.
    Diabetic Retinopathy Detection using Ensemble Machine Learning. (arXiv:2106.12545v1 [eess.IV])
    (2 min) Diabetic Retinopathy (DR) is among the worlds leading vision loss causes in diabetic patients. DR is a microvascular disease that affects the eye retina, which causes vessel blockage and therefore cuts the main source of nutrition for the retina tissues. Treatment for this visual disorder is most effective when it is detected in its earliest stages, as severe DR can result in irreversible blindness. Nonetheless, DR identification requires the expertise of Ophthalmologists which is often expensive and time-consuming. Therefore, automatic detection systems were introduced aiming to facilitate the identification process, making it available globally in a time and cost-efficient manner. However, due to the limited reliable datasets and medical records for this particular eye disease, the obtained predictions accuracies were relatively unsatisfying for eye specialists to rely on them as diagnostic systems. Thus, we explored an ensemble-based learning strategy, merging a substantial selection of well-known classification algorithms in one sophisticated diagnostic model. The proposed framework achieved the highest accuracy rates among all other common classification algorithms in the area. 4 subdatasets were generated to contain the top 5 and top 10 features of the Messidor dataset, selected by InfoGainEval. and WrapperSubsetEval., accuracies of 70.7% and 75.1% were achieved on the InfoGainEval. top 5 and original dataset respectively. The results imply the impressive performance of the subdataset, which significantly conduces to a less complex classification process
    High-Throughput Precision Phenotyping of Left Ventricular Hypertrophy with Cardiovascular Deep Learning. (arXiv:2106.12511v1 [eess.IV])
    (2 min) Left ventricular hypertrophy (LVH) results from chronic remodeling caused by a broad range of systemic and cardiovascular disease including hypertension, aortic stenosis, hypertrophic cardiomyopathy, and cardiac amyloidosis. Early detection and characterization of LVH can significantly impact patient care but is limited by under-recognition of hypertrophy, measurement error and variability, and difficulty differentiating etiologies of LVH. To overcome this challenge, we present EchoNet-LVH - a deep learning workflow that automatically quantifies ventricular hypertrophy with precision equal to human experts and predicts etiology of LVH. Trained on 28,201 echocardiogram videos, our model accurately measures intraventricular wall thickness (mean absolute error [MAE] 1.4mm, 95% CI 1.2-1.5mm), left ventricular diameter (MAE 2.4mm, 95% CI 2.2-2.6mm), and posterior wall thickness (MAE 1.2mm, 95% CI 1.1-1.3mm) and classifies cardiac amyloidosis (area under the curve of 0.83) and hypertrophic cardiomyopathy (AUC 0.98) from other etiologies of LVH. In external datasets from independent domestic and international healthcare systems, EchoNet-LVH accurately quantified ventricular parameters (R2 of 0.96 and 0.90 respectively) and detected cardiac amyloidosis (AUC 0.79) and hypertrophic cardiomyopathy (AUC 0.89) on the domestic external validation site. Leveraging measurements across multiple heart beats, our model can more accurately identify subtle changes in LV geometry and its causal etiologies. Compared to human experts, EchoNet-LVH is fully automated, allowing for reproducible, precise measurements, and lays the foundation for precision diagnosis of cardiac hypertrophy. As a resource to promote further innovation, we also make publicly available a large dataset of 23,212 annotated echocardiogram videos.
    From Canonical Correlation Analysis to Self-supervised Graph Neural Networks. (arXiv:2106.12484v1 [cs.LG])
    (2 min) We introduce a conceptually simple yet effective model for self-supervised representation learning with graph data. It follows the previous methods that generate two views of an input graph through data augmentation. However, unlike contrastive methods that focus on instance-level discrimination, we optimize an innovative feature-level objective inspired by classical Canonical Correlation Analysis. Compared with other works, our approach requires none of the parameterized mutual information estimator, additional projector, asymmetric structures, and most importantly, negative samples which can be costly. We show that the new objective essentially 1) aims at discarding augmentation-variant information by learning invariant representations, and 2) can prevent degenerated solutions by decorrelating features in different dimensions. Our theoretical analysis further provides an understanding for the new objective which can be equivalently seen as an instantiation of the Information Bottleneck Principle under the self-supervised setting. Despite its simplicity, our method performs competitively on seven public graph datasets.
    Should You Go Deeper? Optimizing Convolutional Neural Network Architectures without Training by Receptive Field Analysis. (arXiv:2106.12307v1 [cs.LG])
    (2 min) Applying artificial neural networks (ANN) to specific tasks, researchers, programmers, and other specialists usually overshot the number of convolutional layers in their designs. By implication, these ANNs hold too many parameters, which needed unnecessarily trained without impacting the result. The features, a convolutional layer can process, are strictly limited by its receptive field. By layer-wise analyzing the expansion of the receptive fields, we can reliably predict sequences of layers that will not contribute qualitatively to the inference in thegiven ANN architecture. Based on these analyses, we propose design strategies to resolve these inefficiencies, optimizing the explainability and the computational performance of ANNs. Since neither the strategies nor the analysis requires training of the actual model, these insights allow for a very efficient design process of ANNs architectures which might be automated in the future.
    Generative Self-training for Cross-domain Unsupervised Tagged-to-Cine MRI Synthesis. (arXiv:2106.12499v1 [cs.CV])
    (2 min) Self-training based unsupervised domain adaptation (UDA) has shown great potential to address the problem of domain shift, when applying a trained deep learning model in a source domain to unlabeled target domains. However, while the self-training UDA has demonstrated its effectiveness on discriminative tasks, such as classification and segmentation, via the reliable pseudo-label selection based on the softmax discrete histogram, the self-training UDA for generative tasks, such as image synthesis, is not fully investigated. In this work, we propose a novel generative self-training (GST) UDA framework with continuous value prediction and regression objective for cross-domain image synthesis. Specifically, we propose to filter the pseudo-label with an uncertainty mask, and quantify the predictive confidence of generated images with practical variational Bayes learning. The fast test-time adaptation is achieved by a round-based alternative optimization scheme. We validated our framework on the tagged-to-cine magnetic resonance imaging (MRI) synthesis problem, where datasets in the source and target domains were acquired from different scanners or centers. Extensive validations were carried out to verify our framework against popular adversarial training UDA methods. Results show that our GST, with tagged MRI of test subjects in new target domains, improved the synthesis quality by a large margin, compared with the adversarial training UDA methods.
    Classifying Textual Data with Pre-trained Vision Models through Transfer Learning and Data Transformations. (arXiv:2106.12479v1 [cs.CL])
    (2 min) Knowledge is acquired by humans through experience, and no boundary is set between the kinds of knowledge or skill levels we can achieve on different tasks at the same time. When it comes to Neural Networks, that is not the case, the major breakthroughs in the field are extremely task and domain specific. Vision and language are dealt with in separate manners, using separate methods and different datasets. In this work, we propose to use knowledge acquired by benchmark Vision Models which are trained on ImageNet to help a much smaller architecture learn to classify text. After transforming the textual data contained in the IMDB dataset to gray scale images. An analysis of different domains and the Transfer Learning method is carried out. Despite the challenge posed by the very different datasets, promising results are achieved. The main contribution of this work is a novel approach which links large pretrained models on both language and vision to achieve state-of-the-art results in different sub-fields from the original task. Without needing high compute capacity resources. Specifically, Sentiment Analysis is achieved after transferring knowledge between vision and language models. BERT embeddings are transformed into grayscale images, these images are then used as training examples for pretrained vision models such as VGG16 and ResNet Index Terms: Natural language, Vision, BERT, Transfer Learning, CNN, Domain Adaptation.
    Teacher Model Fingerprinting Attacks Against Transfer Learning. (arXiv:2106.12478v1 [cs.CR])
    (2 min) Transfer learning has become a common solution to address training data scarcity in practice. It trains a specified student model by reusing or fine-tuning early layers of a well-trained teacher model that is usually publicly available. However, besides utility improvement, the transferred public knowledge also brings potential threats to model confidentiality, and even further raises other security and privacy issues. In this paper, we present the first comprehensive investigation of the teacher model exposure threat in the transfer learning context, aiming to gain a deeper insight into the tension between public knowledge and model confidentiality. To this end, we propose a teacher model fingerprinting attack to infer the origin of a student model, i.e., the teacher model it transfers from. Specifically, we propose a novel optimization-based method to carefully generate queries to probe the student model to realize our attack. Unlike existing model reverse engineering approaches, our proposed fingerprinting method neither relies on fine-grained model outputs, e.g., posteriors, nor auxiliary information of the model architecture or training dataset. We systematically evaluate the effectiveness of our proposed attack. The empirical results demonstrate that our attack can accurately identify the model origin with few probing queries. Moreover, we show that the proposed attack can serve as a stepping stone to facilitating other attacks against machine learning models, such as model stealing.
    Bayesian Deep Learning Hyperparameter Search for Robust Function Mapping to Polynomials with Noise. (arXiv:2106.12532v1 [cs.LG])
    (2 min) Advances in neural architecture search, as well as explainability and interpretability of connectionist architectures, have been reported in the recent literature. However, our understanding of how to design Bayesian Deep Learning (BDL) hyperparameters, specifically, the depth, width and ensemble size, for robust function mapping with uncertainty quantification, is still emerging. This paper attempts to further our understanding by mapping Bayesian connectionist representations to polynomials of different orders with varying noise types and ratios. We examine the noise-contaminated polynomials to search for the combination of hyperparameters that can extract the underlying polynomial signals while quantifying uncertainties based on the noise attributes. Specifically, we attempt to study the question that an appropriate neural architecture and ensemble configuration can be found to detect a signal of any n-th order polynomial contaminated with noise having different distributions and signal-to-noise (SNR) ratios and varying noise attributes. Our results suggest the possible existence of an optimal network depth as well as an optimal number of ensembles for prediction skills and uncertainty quantification, respectively. However, optimality is not discernible for width, even though the performance gain reduces with increasing width at high values of width. Our experiments and insights can be directional to understand theoretical properties of BDL representations and to design practical solutions.
    Adapting Off-the-Shelf Source Segmenter for Target Medical Image Segmentation. (arXiv:2106.12497v1 [cs.CV])
    (2 min) Unsupervised domain adaptation (UDA) aims to transfer knowledge learned from a labeled source domain to an unlabeled and unseen target domain, which is usually trained on data from both domains. Access to the source domain data at the adaptation stage, however, is often limited, due to data storage or privacy issues. To alleviate this, in this work, we target source free UDA for segmentation, and propose to adapt an ``off-the-shelf" segmentation model pre-trained in the source domain to the target domain, with an adaptive batch-wise normalization statistics adaptation framework. Specifically, the domain-specific low-order batch statistics, i.e., mean and variance, are gradually adapted with an exponential momentum decay scheme, while the consistency of domain shareable high-order batch statistics, i.e., scaling and shifting parameters, is explicitly enforced by our optimization objective. The transferability of each channel is adaptively measured first from which to balance the contribution of each channel. Moreover, the proposed source free UDA framework is orthogonal to unsupervised learning methods, e.g., self-entropy minimization, which can thus be simply added on top of our framework. Extensive experiments on the BraTS 2018 database show that our source free UDA framework outperformed existing source-relaxed UDA methods for the cross-subtype UDA segmentation task and yielded comparable results for the cross-modality UDA segmentation task, compared with a supervised UDA methods with the source data.
    Deep Neural Network Based Respiratory Pathology Classification Using Cough Sounds. (arXiv:2106.12174v1 [cs.LG])
    (2 min) Intelligent systems are transforming the world, as well as our healthcare system. We propose a deep learning-based cough sound classification model that can distinguish between children with healthy versus pathological coughs such as asthma, upper respiratory tract infection (URTI), and lower respiratory tract infection (LRTI). In order to train a deep neural network model, we collected a new dataset of cough sounds, labelled with clinician's diagnosis. The chosen model is a bidirectional long-short term memory network (BiLSTM) based on Mel Frequency Cepstral Coefficients (MFCCs) features. The resulting trained model when trained for classifying two classes of coughs -- healthy or pathology (in general or belonging to a specific respiratory pathology), reaches accuracy exceeding 84\% when classifying cough to the label provided by the physicians' diagnosis. In order to classify subject's respiratory pathology condition, results of multiple cough epochs per subject were combined. The resulting prediction accuracy exceeds 91\% for all three respiratory pathologies. However, when the model is trained to classify and discriminate among the four classes of coughs, overall accuracy dropped: one class of pathological coughs are often misclassified as other. However, if one consider the healthy cough classified as healthy and pathological cough classified to have some kind of pathologies, then the overall accuracy of four class model is above 84\%. A longitudinal study of MFCC feature space when comparing pathological and recovered coughs collected from the same subjects revealed the fact that pathological cough irrespective of the underlying conditions occupy the same feature space making it harder to differentiate only using MFCC features.
    Universal Consistency of Deep Convolutional Neural Networks. (arXiv:2106.12498v1 [cs.LG])
    (2 min) Compared with avid research activities of deep convolutional neural networks (DCNNs) in practice, the study of theoretical behaviors of DCNNs lags heavily behind. In particular, the universal consistency of DCNNs remains open. In this paper, we prove that implementing empirical risk minimization on DCNNs with expansive convolution (with zero-padding) is strongly universally consistent. Motivated by the universal consistency, we conduct a series of experiments to show that without any fully connected layers, DCNNs with expansive convolution perform not worse than the widely used deep neural networks with hybrid structure containing contracting (without zero-padding) convolution layers and several fully connected layers.
    GraphConfRec: A Graph Neural Network-Based Conference Recommender System. (arXiv:2106.12340v1 [cs.IR])
    (2 min) In today's academic publishing model, especially in Computer Science, conferences commonly constitute the main platforms for releasing the latest peer-reviewed advancements in their respective fields. However, choosing a suitable academic venue for publishing one's research can represent a challenging task considering the plethora of available conferences, particularly for those at the start of their academic careers, or for those seeking to publish outside of their usual domain. In this paper, we propose GraphConfRec, a conference recommender system which combines SciGraph and graph neural networks, to infer suggestions based not only on title and abstract, but also on co-authorship and citation relationships. GraphConfRec achieves a recall@10 of up to 0.580 and a MAP of up to 0.336 with a graph attention network-based recommendation model. A user study with 25 subjects supports the positive results.
    Learned Interpretable Residual Extragradient ISTA for Sparse Coding. (arXiv:2106.11970v1 [cs.LG])
    (2 min) Recently, the study on learned iterative shrinkage thresholding algorithm (LISTA) has attracted increasing attentions. A large number of experiments as well as some theories have proved the high efficiency of LISTA for solving sparse coding problems. However, existing LISTA methods are all serial connection. To address this issue, we propose a novel extragradient based LISTA (ELISTA), which has a residual structure and theoretical guarantees. In particular, our algorithm can also provide the interpretability for Res-Net to a certain extent. From a theoretical perspective, we prove that our method attains linear convergence. In practice, extensive empirical results verify the advantages of our method.
    Calibrating the Lee-Carter and the Poisson Lee-Carter models via Neural Networks. (arXiv:2106.12312v1 [stat.ML])
    (2 min) This paper introduces a neural network approach for fitting the Lee-Carter and the Poisson Lee-Carter model on multiple populations. We develop some neural networks that replicate the structure of the individual LC models and allow their joint fitting by analysing the mortality data of all the considered populations simultaneously. The neural network architecture is specifically designed to calibrate each individual model using all available information instead of using a population-specific subset of data as in the traditional estimation schemes. A large set of numerical experiments performed on all the countries of the Human Mortality Database (HMD) shows the effectiveness of our approach. In particular, the resulting parameter estimates appear smooth and less sensitive to the random fluctuations often present in the mortality rates' data, especially for low-population countries. In addition, the forecasting performance results significantly improved as well.
    IQ-Learn: Inverse soft-Q Learning for Imitation. (arXiv:2106.12142v1 [cs.LG])
    (2 min) In many sequential decision-making problems (e.g., robotics control, game playing, sequential prediction), human or expert data is available containing useful information about the task. However, imitation learning (IL) from a small amount of expert data can be challenging in high-dimensional environments with complex dynamics. Behavioral cloning is a simple method that is widely used due to its simplicity of implementation and stable convergence but doesn't utilize any information involving the environment's dynamics. Many existing methods that exploit dynamics information are difficult to train in practice due to an adversarial optimization process over reward and policy approximators or biased, high variance gradient estimators. We introduce a method for dynamics-aware IL which avoids adversarial training by learning a single Q-function, implicitly representing both reward and policy. On standard benchmarks, the implicitly learned rewards show a high positive correlation with the ground-truth rewards, illustrating our method can also be used for inverse reinforcement learning (IRL). Our method, Inverse soft-Q learning (IQ-Learn) obtains state-of-the-art results in offline and online imitation learning settings, surpassing existing methods both in the number of required environment interactions and scalability in high-dimensional spaces.
    Exploring the Representational Power of Graph Autoencoder. (arXiv:2106.12005v1 [cs.LG])
    (2 min) While representation learning has yielded a great success on many graph learning tasks, there is little understanding behind the structures that are being captured by these embeddings. For example, we wonder if the topological features, such as the Triangle Count, the Degree of the node and other centrality measures are concretely encoded in the embeddings. Furthermore, we ask if the presence of these structures in the embeddings is necessary for a better performance on the downstream tasks, such as clustering and classification. To address these questions, we conduct an extensive empirical study over three classes of unsupervised graph embedding models and seven different variants of Graph Autoencoders. Our results show that five topological features: the Degree, the Local Clustering Score, the Betweenness Centrality, the Eigenvector Centrality, and Triangle Count are concretely preserved in the first layer of the graph autoencoder that employs the SUM aggregation rule, under the condition that the model preserves the second-order proximity. We supplement further evidence for the presence of these features by revealing a hierarchy in the distribution of the topological features in the embeddings of the aforementioned model. We also show that a model with such properties can outperform other models on certain downstream tasks, especially when the preserved features are relevant to the task at hand. Finally, we evaluate the suitability of our findings through a test case study related to social influence prediction.
    The Rate of Convergence of Variation-Constrained Deep Neural Networks. (arXiv:2106.12068v1 [cs.LG])
    (2 min) Multi-layer feedforward networks have been used to approximate a wide range of nonlinear functions. An important and fundamental problem is to understand the learnability of a network model through its statistical risk, or the expected prediction error on future data. To the best of our knowledge, the rate of convergence of neural networks shown by existing works is bounded by at most the order of $n^{-1/4}$ for a sample size of $n$. In this paper, we show that a class of variation-constrained neural networks, with arbitrary width, can achieve near-parametric rate $n^{-1/2+\delta}$ for an arbitrarily small positive constant $\delta$. It is equivalent to $n^{-1 +2\delta}$ under the mean squared error. This rate is also observed by numerical experiments. The result indicates that the neural function space needed for approximating smooth functions may not be as large as what is often perceived. Our result also provides insight to the phenomena that deep neural networks do not easily suffer from overfitting when the number of neurons and learning parameters rapidly grow with $n$ or even surpass $n$. We also discuss the rate of convergence regarding other network parameters, including the input dimension, network layer, and coefficient norm.
    Learning Under Delayed Feedback: Implicitly Adapting to Gradient Delays. (arXiv:2106.12261v1 [cs.LG])
    (2 min) We consider stochastic convex optimization problems, where several machines act asynchronously in parallel while sharing a common memory. We propose a robust training method for the constrained setting and derive non asymptotic convergence guarantees that do not depend on prior knowledge of update delays, objective smoothness, and gradient variance. Conversely, existing methods for this setting crucially rely on this prior knowledge, which render them unsuitable for essentially all shared-resources computational environments, such as clouds and data centers. Concretely, existing approaches are unable to accommodate changes in the delays which result from dynamic allocation of the machines, while our method implicitly adapts to such changes.
    Forecasting Health and Wellbeing for Shift Workers Using Job-role Based Deep Neural Network. (arXiv:2106.12081v1 [cs.LG])
    (2 min) Shift workers who are essential contributors to our society, face high risks of poor health and wellbeing. To help with their problems, we collected and analyzed physiological and behavioral wearable sensor data from shift working nurses and doctors, as well as their behavioral questionnaire data and their self-reported daily health and wellbeing labels, including alertness, happiness, energy, health, and stress. We found the similarities and differences between the responses of nurses and doctors. According to the differences in self-reported health and wellbeing labels between nurses and doctors, and the correlations among their labels, we proposed a job-role based multitask and multilabel deep learning model, where we modeled physiological and behavioral data for nurses and doctors simultaneously to predict participants' next day's multidimensional self-reported health and wellbeing status. Our model showed significantly better performances than baseline models and previous state-of-the-art models in the evaluations of binary/3-class classification and regression prediction tasks. We also found features related to heart rate, sleep, and work shift contributed to shift workers' health and wellbeing.
    Structured in Space, Randomized in Time: Leveraging Dropout in RNNs for Efficient Training. (arXiv:2106.12089v1 [cs.LG])
    (2 min) Recurrent Neural Networks (RNNs), more specifically their Long Short-Term Memory (LSTM) variants, have been widely used as a deep learning tool for tackling sequence-based learning tasks in text and speech. Training of such LSTM applications is computationally intensive due to the recurrent nature of hidden state computation that repeats for each time step. While sparsity in Deep Neural Nets has been widely seen as an opportunity for reducing computation time in both training and inference phases, the usage of non-ReLU activation in LSTM RNNs renders the opportunities for such dynamic sparsity associated with neuron activation and gradient values to be limited or non-existent. In this work, we identify dropout induced sparsity for LSTMs as a suitable mode of computation reduction. Dropout is a widely used regularization mechanism, which randomly drops computed neuron values during each iteration of training. We propose to structure dropout patterns, by dropping out the same set of physical neurons within a batch, resulting in column (row) level hidden state sparsity, which are well amenable to computation reduction at run-time in general-purpose SIMD hardware as well as systolic arrays. We conduct our experiments for three representative NLP tasks: language modelling on the PTB dataset, OpenNMT based machine translation using the IWSLT De-En and En-Vi datasets, and named entity recognition sequence labelling using the CoNLL-2003 shared task. We demonstrate that our proposed approach can be used to translate dropout-based computation reduction into reduced training time, with improvement ranging from 1.23x to 1.64x, without sacrificing the target metric.
    Behavior Mimics Distribution: Combining Individual and Group Behaviors for Federated Learning. (arXiv:2106.12300v1 [cs.LG])
    (2 min) Federated Learning (FL) has become an active and promising distributed machine learning paradigm. As a result of statistical heterogeneity, recent studies clearly show that the performance of popular FL methods (e.g., FedAvg) deteriorates dramatically due to the client drift caused by local updates. This paper proposes a novel Federated Learning algorithm (called IGFL), which leverages both Individual and Group behaviors to mimic distribution, thereby improving the ability to deal with heterogeneity. Unlike existing FL methods, our IGFL can be applied to both client and server optimization. As a by-product, we propose a new attention-based federated learning in the server optimization of IGFL. To the best of our knowledge, this is the first time to incorporate attention mechanisms into federated optimization. We conduct extensive experiments and show that IGFL can significantly improve the performance of existing federated learning methods. Especially when the distributions of data among individuals are diverse, IGFL can improve the classification accuracy by about 13% compared with prior baselines.
    Test-time Collective Prediction. (arXiv:2106.12012v1 [cs.LG])
    (2 min) An increasingly common setting in machine learning involves multiple parties, each with their own data, who want to jointly make predictions on future test points. Agents wish to benefit from the collective expertise of the full set of agents to make better predictions than they would individually, but may not be willing to release their data or model parameters. In this work, we explore a decentralized mechanism to make collective predictions at test time, leveraging each agent's pre-trained model without relying on external validation, model retraining, or data pooling. Our approach takes inspiration from the literature in social science on human consensus-making. We analyze our mechanism theoretically, showing that it converges to inverse meansquared-error (MSE) weighting in the large-sample limit. To compute error bars on the collective predictions we propose a decentralized Jackknife procedure that evaluates the sensitivity of our mechanism to a single agent's prediction. Empirically, we demonstrate that our scheme effectively combines models with differing quality across the input space. The proposed consensus prediction achieves significant gains over classical model averaging, and even outperforms weighted averaging schemes that have access to additional validation data.
    Finding simplicity: unsupervised discovery of features, patterns, and order parameters via shift-invariant variational autoencoders. (arXiv:2106.12472v1 [cond-mat.dis-nn])
    (2 min) Recent advances in scanning tunneling and transmission electron microscopies (STM and STEM) have allowed routine generation of large volumes of imaging data containing information on the structure and functionality of materials. The experimental data sets contain signatures of long-range phenomena such as physical order parameter fields, polarization and strain gradients in STEM, or standing electronic waves and carrier-mediated exchange interactions in STM, all superimposed onto scanning system distortions and gradual changes of contrast due to drift and/or mis-tilt effects. Correspondingly, while the human eye can readily identify certain patterns in the images such as lattice periodicities, repeating structural elements, or microstructures, their automatic extraction and classification are highly non-trivial and universal pathways to accomplish such analyses are absent. We pose that the most distinctive elements of the patterns observed in STM and (S)TEM images are similarity and (almost-) periodicity, behaviors stemming directly from the parsimony of elementary atomic structures, superimposed on the gradual changes reflective of order parameter distributions. However, the discovery of these elements via global Fourier methods is non-trivial due to variability and lack of ideal discrete translation symmetry. To address this problem, we develop shift-invariant variational autoencoders (shift-VAE) that allow disentangling characteristic repeating features in the images, their variations, and shifts inevitable for random sampling of image space. Shift-VAEs balance the uncertainty in the position of the object of interest with the uncertainty in shape reconstruction. This approach is illustrated for model 1D data, and further extended to synthetic and experimental STM and STEM 2D data.
    AC/DC: Alternating Compressed/DeCompressed Training of Deep Neural Networks. (arXiv:2106.12379v1 [cs.LG])
    (2 min) The increasing computational requirements of deep neural networks (DNNs) have led to significant interest in obtaining DNN models that are sparse, yet accurate. Recent work has investigated the even harder case of sparse training, where the DNN weights are, for as much as possible, already sparse to reduce computational costs during training. Existing sparse training methods are mainly empirical and often have lower accuracy relative to the dense baseline. In this paper, we present a general approach called Alternating Compressed/DeCompressed (AC/DC) training of DNNs, demonstrate convergence for a variant of the algorithm, and show that AC/DC outperforms existing sparse training methods in accuracy at similar computational budgets; at high sparsity levels, AC/DC even outperforms existing methods that rely on accurate pre-trained dense models. An important property of AC/DC is that it allows co-training of dense and sparse models, yielding accurate sparse-dense model pairs at the end of the training process. This is useful in practice, where compressed variants may be desirable for deployment in resource-constrained settings without re-doing the entire training flow, and also provides us with insights into the accuracy gap between dense and compressed models.
    groupShapley: Efficient prediction explanation with Shapley values for feature groups. (arXiv:2106.12228v1 [stat.ML])
    (2 min) Shapley values has established itself as one of the most appropriate and theoretically sound frameworks for explaining predictions from complex machine learning models. The popularity of Shapley values in the explanation setting is probably due to its unique theoretical properties. The main drawback with Shapley values, however, is that its computational complexity grows exponentially in the number of input features, making it unfeasible in many real world situations where there could be hundreds or thousands of features. Furthermore, with many (dependent) features, presenting/visualizing and interpreting the computed Shapley values also becomes challenging. The present paper introduces groupShapley: a conceptually simple approach for dealing with the aforementioned bottlenecks. The idea is to group the features, for example by type or dependence, and then compute and present Shapley values for these groups instead of for all individual features. Reducing hundreds or thousands of features to half a dozen or so, makes precise computations practically feasible and the presentation and knowledge extraction greatly simplified. We prove that under certain conditions, groupShapley is equivalent to summing the feature-wise Shapley values within each feature group. Moreover, we provide a simulation study exemplifying the differences when these conditions are not met. We illustrate the usability of the approach in a real world car insurance example, where groupShapley is used to provide simple and intuitive explanations.
    ADAVI: Automatic Dual Amortized Variational Inference Applied To Pyramidal Bayesian Models. (arXiv:2106.12248v1 [cs.LG])
    (2 min) Frequently, population studies feature pyramidally-organized data represented using Hierarchical Bayesian Models (HBM) enriched with plates. These models can become prohibitively large in settings such as neuroimaging, where a sample is composed of a functional MRI signal measured on 64 thousand brain locations, across 4 measurement sessions, and at least tens of subjects. Even a reduced example on a specific cortical region of 300 brain locations features around 1 million parameters, hampering the usage of modern density estimation techniques such as Simulation-Based Inference (SBI). To infer parameter posterior distributions in this challenging class of problems, we designed a novel methodology that automatically produces a variational family dual to a target HBM. This variatonal family, represented as a neural network, consists in the combination of an attention-based hierarchical encoder feeding summary statistics to a set of normalizing flows. Our automatically-derived neural network exploits exchangeability in the plate-enriched HBM and factorizes its parameter space. The resulting architecture reduces by orders of magnitude its parameterization with respect to that of a typical SBI representation, while maintaining expressivity. Our method performs inference on the specified HBM in an amortized setup: once trained, it can readily be applied to a new data sample to compute the parameters' full posterior. We demonstrate the capability of our method on simulated data, as well as a challenging high-dimensional brain parcellation experiment. We also open up several questions that lie at the intersection between SBI techniques and structured Variational Inference.
    Better Algorithms for Individually Fair $k$-Clustering. (arXiv:2106.12150v1 [cs.DS])
    (2 min) We study data clustering problems with $\ell_p$-norm objectives (e.g. $k$-Median and $k$-Means) in the context of individual fairness. The dataset consists of $n$ points, and we want to find $k$ centers such that (a) the objective is minimized, while (b) respecting the individual fairness constraint that every point $v$ has a center within a distance at most $r(v)$, where $r(v)$ is $v$'s distance to its $(n/k)$th nearest point. Jung, Kannan, and Lutz [FORC 2020] introduced this concept and designed a clustering algorithm with provable (approximate) fairness and objective guarantees for the $\ell_\infty$ or $k$-Center objective. Mahabadi and Vakilian [ICML 2020] revisited this problem to give a local-search algorithm for all $\ell_p$-norms. Empirically, their algorithms outperform Jung et. al.'s by a large margin in terms of cost (for $k$-Median and $k$-Means), but they incur a reasonable loss in fairness. In this paper, our main contribution is to use Linear Programming (LP) techniques to obtain better algorithms for this problem, both in theory and in practice. We prove that by modifying known LP rounding techniques, one gets a worst-case guarantee on the objective which is much better than in MV20, and empirically, this objective is extremely close to the optimal. Furthermore, our theoretical fairness guarantees are comparable with MV20 in theory, and empirically, we obtain noticeably fairer solutions. Although solving the LP {\em exactly} might be prohibitive, we demonstrate that in practice, a simple sparsification technique drastically improves the run-time of our algorithm.
    Lagrangian dual framework for conservative neural network solutions of kinetic equations. (arXiv:2106.12147v1 [math.NA])
    (2 min) In this paper, we propose a novel conservative formulation for solving kinetic equations via neural networks. More precisely, we formulate the learning problem as a constrained optimization problem with constraints that represent the physical conservation laws. The constraints are relaxed toward the residual loss function by the Lagrangian duality. By imposing physical conservation properties of the solution as constraints of the learning problem, we demonstrate far more accurate approximations of the solutions in terms of errors and the conservation laws, for the kinetic Fokker-Planck equation and the homogeneous Boltzmann equation.
    BiblioDAP: The 1st Workshop on Bibliographic Data Analysis and Processing. (arXiv:2106.12320v1 [cs.DL])
    (2 min) Automatic processing of bibliographic data becomes very important in digital libraries, data science and machine learning due to its importance in keeping pace with the significant increase of published papers every year from one side and to the inherent challenges from the other side. This processing has several aspects including but not limited to I) Automatic extraction of references from PDF documents, II) Building an accurate citation graph, III) Author name disambiguation, etc. Bibliographic data is heterogeneous by nature and occurs in both structured (e.g. citation graph) and unstructured (e.g. publications) formats. Therefore, it requires data science and machine learning techniques to be processed and analysed. Here we introduce BiblioDAP'21: The 1st Workshop on Bibliographic Data Analysis and Processing.
    Euro-PVI: Pedestrian Vehicle Interactions in Dense Urban Centers. (arXiv:2106.12442v1 [cs.CV])
    (2 min) Accurate prediction of pedestrian and bicyclist paths is integral to the development of reliable autonomous vehicles in dense urban environments. The interactions between vehicle and pedestrian or bicyclist have a significant impact on the trajectories of traffic participants e.g. stopping or turning to avoid collisions. Although recent datasets and trajectory prediction approaches have fostered the development of autonomous vehicles yet the amount of vehicle-pedestrian (bicyclist) interactions modeled are sparse. In this work, we propose Euro-PVI, a dataset of pedestrian and bicyclist trajectories. In particular, our dataset caters more diverse and complex interactions in dense urban scenarios compared to the existing datasets. To address the challenges in predicting future trajectories with dense interactions, we develop a joint inference model that learns an expressive multi-modal shared latent space across agents in the urban scene. This enables our Joint-$\beta$-cVAE approach to better model the distribution of future trajectories. We achieve state of the art results on the nuScenes and Euro-PVI datasets demonstrating the importance of capturing interactions between ego-vehicle and pedestrians (bicyclists) for accurate predictions.
    Finding Phish in a Haystack: A Pipeline for Phishing Classification on Certificate Transparency Logs. (arXiv:2106.12343v1 [cs.CR])
    (2 min) Current popular phishing prevention techniques mainly utilize reactive blocklists, which leave a ``window of opportunity'' for attackers during which victims are unprotected. One possible approach to shorten this window aims to detect phishing attacks earlier, during website preparation, by monitoring Certificate Transparency (CT) logs. Previous attempts to work with CT log data for phishing classification exist, however they lack evaluations on actual CT log data. In this paper, we present a pipeline that facilitates such evaluations by addressing a number of problems when working with CT log data. The pipeline includes dataset creation, training, and past or live classification of CT logs. Its modular structure makes it possible to easily exchange classifiers or verification sources to support ground truth labeling efforts and classifier comparisons. We test the pipeline on a number of new and existing classifiers, and find a general potential to improve classifiers for this scenario in the future. We publish the source code of the pipeline and the used datasets along with this paper (https://gitlab.com/rwth-itsec/ctl-pipeline), thus making future research in this direction more accessible.
    Real-time Neural Radiance Caching for Path Tracing. (arXiv:2106.12372v1 [cs.GR])
    (2 min) We present a real-time neural radiance caching method for path-traced global illumination. Our system is designed to handle fully dynamic scenes, and makes no assumptions about the lighting, geometry, and materials. The data-driven nature of our approach sidesteps many difficulties of caching algorithms, such as locating, interpolating, and updating cache points. Since pretraining neural networks to handle novel, dynamic scenes is a formidable generalization challenge, we do away with pretraining and instead achieve generalization via adaptation, i.e. we opt for training the radiance cache while rendering. We employ self-training to provide low-noise training targets and simulate infinite-bounce transport by merely iterating few-bounce training updates. The updates and cache queries incur a mild overhead -- about 2.6ms on full HD resolution -- thanks to a streaming implementation of the neural network that fully exploits modern hardware. We demonstrate significant noise reduction at the cost of little induced bias, and report state-of-the-art, real-time performance on a number of challenging scenarios.
    Alias-Free Generative Adversarial Networks. (arXiv:2106.12423v1 [cs.CV])
    (2 min) We observe that despite their hierarchical convolutional nature, the synthesis process of typical generative adversarial networks depends on absolute pixel coordinates in an unhealthy manner. This manifests itself as, e.g., detail appearing to be glued to image coordinates instead of the surfaces of depicted objects. We trace the root cause to careless signal processing that causes aliasing in the generator network. Interpreting all signals in the network as continuous, we derive generally applicable, small architectural changes that guarantee that unwanted information cannot leak into the hierarchical synthesis process. The resulting networks match the FID of StyleGAN2 but differ dramatically in their internal representations, and they are fully equivariant to translation and rotation even at subpixel scales. Our results pave the way for generative models better suited for video and animation.
    Some Hoeffding- and Bernstein-type Concentration Inequalities. (arXiv:2102.06304v4 [math.PR] UPDATED)
    (2 min) We prove concentration inequalities for functions of independent random variables {under} sub-gaussian and sub-exponential conditions. The utility of the inequalities is demonstrated by an extension of the now classical method of Rademacher complexities to Lipschitz function classes and unbounded sub-exponential distribution.
    How Well do Feature Visualizations Support Causal Understanding of CNN Activations?. (arXiv:2106.12447v1 [cs.CV])
    (2 min) One widely used approach towards understanding the inner workings of deep convolutional neural networks is to visualize unit responses via activation maximization. Feature visualizations via activation maximization are thought to provide humans with precise information about the image features that cause a unit to be activated. If this is indeed true, these synthetic images should enable humans to predict the effect of an intervention, such as whether occluding a certain patch of the image (say, a dog's head) changes a unit's activation. Here, we test this hypothesis by asking humans to predict which of two square occlusions causes a larger change to a unit's activation. Both a large-scale crowdsourced experiment and measurements with experts show that on average, the extremely activating feature visualizations by Olah et al. (2017) indeed help humans on this task ($67 \pm 4\%$ accuracy; baseline performance without any visualizations is $60 \pm 3\%$). However, they do not provide any significant advantage over other visualizations (such as e.g. dataset samples), which yield similar performance ($66 \pm 3\%$ to $67 \pm 3\%$ accuracy). Taken together, we propose an objective psychophysical task to quantify the benefit of unit-level interpretability methods for humans, and find no evidence that feature visualizations provide humans with better "causal understanding" than simple alternative visualizations.
    Tilting the playing field: Dynamical loss functions for machine learning. (arXiv:2102.03793v3 [cs.LG] UPDATED)
    (2 min) We show that learning can be improved by using loss functions that evolve cyclically during training to emphasize one class at a time. In underparameterized networks, such dynamical loss functions can lead to successful training for networks that fail to find a deep minima of the standard cross-entropy loss. In overparameterized networks, dynamical loss functions can lead to better generalization. Improvement arises from the interplay of the changing loss landscape with the dynamics of the system as it evolves to minimize the loss. In particular, as the loss function oscillates, instabilities develop in the form of bifurcation cascades, which we study using the Hessian and Neural Tangent Kernel. Valleys in the landscape widen and deepen, and then narrow and rise as the loss landscape changes during a cycle. As the landscape narrows, the learning rate becomes too large and the network becomes unstable and bounces around the valley. This process ultimately pushes the system into deeper and wider regions of the loss landscape and is characterized by decreasing eigenvalues of the Hessian. This results in better regularized models with improved generalization performance.
    Closed-Form, Provable, and Robust PCA via Leverage Statistics and Innovation Search. (arXiv:2106.12190v1 [stat.ML])
    (2 min) The idea of Innovation Search, which was initially proposed for data clustering, was recently used for outlier detection. In the application of Innovation Search for outlier detection, the directions of innovation were utilized to measure the innovation of the data points. We study the Innovation Values computed by the Innovation Search algorithm under a quadratic cost function and it is proved that Innovation Values with the new cost function are equivalent to Leverage Scores. This interesting connection is utilized to establish several theoretical guarantees for a Leverage Score based robust PCA method and to design a new robust PCA method. The theoretical results include performance guarantees with different models for the distribution of outliers and the distribution of inliers. In addition, we demonstrate the robustness of the algorithms against the presence of noise. The numerical and theoretical studies indicate that while the presented approach is fast and closed-form, it can outperform most of the existing algorithms.
    First Step Towards EXPLAINable DGA Multiclass Classification. (arXiv:2106.12336v1 [cs.CR])
    (2 min) Numerous malware families rely on domain generation algorithms (DGAs) to establish a connection to their command and control (C2) server. Counteracting DGAs, several machine learning classifiers have been proposed enabling the identification of the DGA that generated a specific domain name and thus triggering targeted remediation measures. However, the proposed state-of-the-art classifiers are based on deep learning models. The black box nature of these makes it difficult to evaluate their reasoning. The resulting lack of confidence makes the utilization of such models impracticable. In this paper, we propose EXPLAIN, a feature-based and contextless DGA multiclass classifier. We comparatively evaluate several combinations of feature sets and hyperparameters for our approach against several state-of-the-art classifiers in a unified setting on the same real-world data. Our classifier achieves competitive results, is real-time capable, and its predictions are easier to trace back to features than the predictions made by the DGA multiclass classifiers proposed in related work.
    Secure Domain Adaptation with Multiple Sources. (arXiv:2106.12124v1 [cs.LG])
    (2 min) Multi-source unsupervised domain adaptation (MUDA) is a recently explored learning framework, where the goal is to address the challenge of labeled data scarcity in a target domain via transferring knowledge from multiple source domains with annotated data. Since the source data is distributed, the privacy of source domains' data can be a natural concern. We benefit from the idea of domain alignment in an embedding space to address the privacy concern for MUDA. Our method is based on aligning the sources and target distributions indirectly via internally learned distributions, without communicating data samples between domains. We justify our approach theoretically and perform extensive experiments to demonstrate that our method is effective and compares favorably against existing methods.
    Multiband VAE: Latent Space Partitioning for Knowledge Consolidation in Continual Learning. (arXiv:2106.12196v1 [cs.LG])
    (2 min) We propose a new method for unsupervised continual knowledge consolidation in generative models that relies on the partitioning of Variational Autoencoder's latent space. Acquiring knowledge about new data samples without forgetting previous ones is a critical problem of continual learning. Currently proposed methods achieve this goal by extending the existing model while constraining its behavior not to degrade on the past data, which does not exploit the full potential of relations within the entire training dataset. In this work, we identify this limitation and posit the goal of continual learning as a knowledge accumulation task. We solve it by continuously re-aligning latent space partitions that we call bands which are representations of samples seen in different tasks, driven by the similarity of the information they contain. In addition, we introduce a simple yet effective method for controlled forgetting of past data that improves the quality of reconstructions encoded in latent bands and a latent space disentanglement technique that improves knowledge consolidation. On top of the standard continual learning evaluation benchmarks, we evaluate our method on a new knowledge consolidation scenario and show that the proposed approach outperforms state-of-the-art by up to twofold across all testing scenarios.
    Co-advise: Cross Inductive Bias Distillation. (arXiv:2106.12378v1 [cs.CV])
    (2 min) Transformers recently are adapted from the community of natural language processing as a promising substitute of convolution-based neural networks for visual learning tasks. However, its supremacy degenerates given an insufficient amount of training data (e.g., ImageNet). To make it into practical utility, we propose a novel distillation-based method to train vision transformers. Unlike previous works, where merely heavy convolution-based teachers are provided, we introduce lightweight teachers with different architectural inductive biases (e.g., convolution and involution) to co-advise the student transformer. The key is that teachers with different inductive biases attain different knowledge despite that they are trained on the same dataset, and such different knowledge compounds and boosts the student's performance during distillation. Equipped with this cross inductive bias distillation method, our vision transformers (termed as CivT) outperform all previous transformers of the same architecture on ImageNet.
    MG-DVD: A Real-time Framework for Malware Variant Detection Based on Dynamic Heterogeneous Graph Learning. (arXiv:2106.12288v1 [cs.CR])
    (2 min) Detecting the newly emerging malware variants in real time is crucial for mitigating cyber risks and proactively blocking intrusions. In this paper, we propose MG-DVD, a novel detection framework based on dynamic heterogeneous graph learning, to detect malware variants in real time. Particularly, MG-DVD first models the fine-grained execution event streams of malware variants into dynamic heterogeneous graphs and investigates real-world meta-graphs between malware objects, which can effectively characterize more discriminative malicious evolutionary patterns between malware and their variants. Then, MG-DVD presents two dynamic walk-based heterogeneous graph learning methods to learn more comprehensive representations of malware variants, which significantly reduces the cost of the entire graph retraining. As a result, MG-DVD is equipped with the ability to detect malware variants in real time, and it presents better interpretability by introducing meaningful meta-graphs. Comprehensive experiments on large-scale samples prove that our proposed MG-DVD outperforms state-of-the-art methods in detecting malware variants in terms of effectiveness and efficiency.
    Random Effect Bandits. (arXiv:2106.12200v1 [cs.LG])
    (2 min) This paper studies regret minimization in multi-armed bandits, a classical online learning problem. To develop more statistically-efficient algorithms, we propose to use the assumption of a random-effect model. In this model, the mean rewards of arms are drawn independently from an unknown distribution, whose parameters we estimate. We provide an estimator of the arm means in this model and also analyze its uncertainty. Based on these results, we design a UCB algorithm, which we call ReUCB. We analyze ReUCB and prove a Bayes regret bound on its $n$-round regret, which matches an existing lower bound. Our experiments show that ReUCB can outperform Thompson sampling in various scenarios, without assuming that the prior distribution of arm means is known.
    Bregman Gradient Policy Optimization. (arXiv:2106.12112v1 [cs.LG])
    (2 min) In this paper, we design a novel Bregman gradient policy optimization framework for reinforcement learning based on Bregman divergences and momentum techniques. Specifically, we propose a Bregman gradient policy optimization (BGPO) algorithm based on the basic momentum technique and mirror descent iteration. At the same time, we present an accelerated Bregman gradient policy optimization (VR-BGPO) algorithm based on a momentum variance-reduced technique. Moreover, we introduce a convergence analysis framework for our Bregman gradient policy optimization under the nonconvex setting. Specifically, we prove that BGPO achieves the sample complexity of $\tilde{O}(\epsilon^{-4})$ for finding $\epsilon$-stationary point only requiring one trajectory at each iteration, and VR-BGPO reaches the best known sample complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point, which also only requires one trajectory at each iteration. In particular, by using different Bregman divergences, our methods unify many existing policy optimization algorithms and their new variants such as the existing (variance-reduced) policy gradient algorithms and (variance-reduced) natural policy gradient algorithms. Extensive experimental results on multiple reinforcement learning tasks demonstrate the efficiency of our new algorithms.
    A Unified Approach to Fair Online Learning via Blackwell Approachability. (arXiv:2106.12242v1 [cs.LG])
    (2 min) We provide a setting and a general approach to fair online learning with stochastic sensitive and non-sensitive contexts. The setting is a repeated game between the Player and Nature, where at each stage both pick actions based on the contexts. Inspired by the notion of unawareness, we assume that the Player can only access the non-sensitive context before making a decision, while we discuss both cases of Nature accessing the sensitive contexts and Nature unaware of the sensitive contexts. Adapting Blackwell's approachability theory to handle the case of an unknown contexts' distribution, we provide a general necessary and sufficient condition for learning objectives to be compatible with some fairness constraints. This condition is instantiated on (group-wise) no-regret and (group-wise) calibration objectives, and on demographic parity as an additional constraint. When the objective is not compatible with the constraint, the provided framework permits to characterise the optimal trade-off between the two.
    Regret-optimal Estimation and Control. (arXiv:2106.12097v1 [cs.LG])
    (2 min) We consider estimation and control in linear time-varying dynamical systems from the perspective of regret minimization. Unlike most prior work in this area, we focus on the problem of designing causal estimators and controllers which compete against a clairvoyant noncausal policy, instead of the best policy selected in hindsight from some fixed parametric class. We show that the regret-optimal estimator and regret-optimal controller can be derived in state-space form using operator-theoretic techniques from robust control and present tight,data-dependent bounds on the regret incurred by our algorithms in terms of the energy of the disturbances. Our results can be viewed as extending traditional robust estimation and control, which focuses on minimizing worst-case cost, to minimizing worst-case regret. We propose regret-optimal analogs of Model-Predictive Control (MPC) and the Extended KalmanFilter (EKF) for systems with nonlinear dynamics and present numerical experiments which show that our regret-optimal algorithms can significantly outperform standard approaches to estimation and control.
    Improved Acyclicity Reasoning for Bayesian Network Structure Learning with Constraint Programming. (arXiv:2106.12269v1 [cs.AI])
    (2 min) Bayesian networks are probabilistic graphical models with a wide range of application areas including gene regulatory networks inference, risk analysis and image processing. Learning the structure of a Bayesian network (BNSL) from discrete data is known to be an NP-hard task with a superexponential search space of directed acyclic graphs. In this work, we propose a new polynomial time algorithm for discovering a subset of all possible cluster cuts, a greedy algorithm for approximately solving the resulting linear program, and a generalised arc consistency algorithm for the acyclicity constraint. We embed these in the constraint programmingbased branch-and-bound solver CPBayes and show that, despite being suboptimal, they improve performance by orders of magnitude. The resulting solver also compares favourably with GOBNILP, a state-of-the-art solver for the BNSL problem which solves an NP-hard problem to discover each cut and solves the linear program exactly.
    Combination of Convolutional Neural Network and Gated Recurrent Unit for Energy Aware Resource Allocation. (arXiv:2106.12178v1 [cs.DC])
    (2 min) Cloud computing service models have experienced rapid growth and inefficient resource usage is known as one of the greatest causes of high energy consumption in cloud data centers. Resource allocation in cloud data centers aiming to reduce energy consumption has been conducted using live migration of Virtual Machines (VMs) and their consolidation into the small number of Physical Machines (PMs). However, the selection of the appropriate VM for migration is an important challenge. To solve this issue, VMs can be classified according to the pattern of user requests into sensitive or insensitive classes to latency, and thereafter suitable VMs can be selected for migration. In this paper, the combination of Convolution Neural Network (CNN) and Gated Recurrent Unit (GRU) is utilized for the classification of VMs in the Microsoft Azure dataset. Due to the fact the majority of VMs in this dataset are labeled as insensitive to latency, migration of more VMs in this group not only reduces energy consumption but also decreases the violation of Service Level Agreements (SLA). Based on the empirical results, the proposed model obtained an accuracy of 95.18which clearly demonstrates the superiority of our proposed model compared to other existing models.
    BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes. (arXiv:2106.12091v1 [cs.DC])
    (2 min) Supercomputer FCFS-based scheduling policies result in many transient idle nodes, a phenomenon that is only partially alleviated by backfill scheduling methods that promote small jobs to run before large jobs. Here we describe how to realize a novel use for these otherwise wasted resources, namely, deep neural network (DNN) training. This important workload is easily organized as many small fragments that can be configured dynamically to fit essentially any node*time hole in a supercomputer's schedule. We describe how the task of rescaling suitable DNN training tasks to fit dynamically changing holes can be formulated as a deterministic mixed integer linear programming (MILP)-based resource allocation algorithm, and show that this MILP problem can be solved efficiently at run time. We show further how this MILP problem can be adapted to optimize for administrator- or user-defined metrics. We validate our method with supercomputer scheduler logs and different DNN training scenarios, and demonstrate efficiencies of up to 93% compared with running the same training tasks on dedicated nodes. Our method thus enables substantial supercomputer resources to be allocated to DNN training with no impact on other applications.
    Near-Optimal Linear Regression under Distribution Shift. (arXiv:2106.12108v1 [cs.LG])
    (2 min) Transfer learning is essential when sufficient data comes from the source domain, with scarce labeled data from the target domain. We develop estimators that achieve minimax linear risk for linear regression problems under distribution shift. Our algorithms cover different transfer learning settings including covariate shift and model shift. We also consider when data are generated from either linear or general nonlinear models. We show that linear minimax estimators are within an absolute constant of the minimax risk even among nonlinear estimators for various source/target distributions.
    Clustering of check-in sequences using the mixture Markov chain process. (arXiv:2106.12039v1 [cs.SI])
    (2 min) This work is devoted to the clustering of check-in sequences from a geosocial network. We used the mixture Markov chain process as a mathematical model for time-dependent types of data. For clustering, we adjusted the Expectation-Maximization (EM) algorithm. As a result, we obtained highly detailed communities (clusters) of users of the now defunct geosocial network, Weeplaces.
    Towards Consistent Predictive Confidence through Fitted Ensembles. (arXiv:2106.12070v1 [cs.LG])
    (2 min) Deep neural networks are behind many of the recent successes in machine learning applications. However, these models can produce overconfident decisions while encountering out-of-distribution (OOD) examples or making a wrong prediction. This inconsistent predictive confidence limits the integration of independently-trained learning models into a larger system. This paper introduces separable concept learning framework to realistically measure the performance of classifiers in presence of OOD examples. In this setup, several instances of a classifier are trained on different parts of a partition of the set of classes. Later, the performance of the combination of these models is evaluated on a separate test set. Unlike current OOD detection techniques, this framework does not require auxiliary OOD datasets and does not separate classification from detection performance. Furthermore, we present a new strong baseline for more consistent predictive confidence in deep models, called fitted ensembles, where overconfident predictions are rectified by transformed versions of the original classification task. Fitted ensembles can naturally detect OOD examples without requiring auxiliary data by observing contradicting predictions among its components. Experiments on MNIST, SVHN, CIFAR-10/100, and ImageNet show fitted ensemble significantly outperform conventional ensembles on OOD examples and are possible to scale.
    ParK: Sound and Efficient Kernel Ridge Regression by Feature Space Partitions. (arXiv:2106.12231v1 [stat.ML])
    (2 min) We introduce ParK, a new large-scale solver for kernel ridge regression. Our approach combines partitioning with random projections and iterative optimization to reduce space and time complexity while provably maintaining the same statistical accuracy. In particular, constructing suitable partitions directly in the feature space rather than in the input space, we promote orthogonality between the local estimators, thus ensuring that key quantities such as local effective dimension and bias remain under control. We characterize the statistical-computational tradeoff of our model, and demonstrate the effectiveness of our method by numerical experiments on large-scale datasets.
    NodePiece: Compositional and Parameter-Efficient Representations of Large Knowledge Graphs. (arXiv:2106.12144v1 [cs.CL])
    (2 min) Conventional representation learning algorithms for knowledge graphs (KG) map each entity to a unique embedding vector. Such a shallow lookup results in a linear growth of memory consumption for storing the embedding matrix and incurs high computational costs when working with real-world KGs. Drawing parallels with subword tokenization commonly used in NLP, we explore the landscape of more parameter-efficient node embedding strategies with possibly sublinear memory requirements. To this end, we propose NodePiece, an anchor-based approach to learn a fixed-size entity vocabulary. In NodePiece, a vocabulary of subword/sub-entity units is constructed from anchor nodes in a graph with known relation types. Given such a fixed-size vocabulary, it is possible to bootstrap an encoding and embedding for any entity, including those unseen during training. Experiments show that NodePiece performs competitively in node classification, link prediction, and relation prediction tasks while retaining less than 10% of explicit nodes in a graph as anchors and often having 10x fewer parameters.
    Fairness for Image Generation with Uncertain Sensitive Attributes. (arXiv:2106.12182v1 [cs.LG])
    (2 min) This work tackles the issue of fairness in the context of generative procedures, such as image super-resolution, which entail different definitions from the standard classification setting. Moreover, while traditional group fairness definitions are typically defined with respect to specified protected groups -- camouflaging the fact that these groupings are artificial and carry historical and political motivations -- we emphasize that there are no ground truth identities. For instance, should South and East Asians be viewed as a single group or separate groups? Should we consider one race as a whole or further split by gender? Choosing which groups are valid and who belongs in them is an impossible dilemma and being ``fair'' with respect to Asians may require being ``unfair'' with respect to South Asians. This motivates the introduction of definitions that allow algorithms to be \emph{oblivious} to the relevant groupings. We define several intuitive notions of group fairness and study their incompatibilities and trade-offs. We show that the natural extension of demographic parity is strongly dependent on the grouping, and \emph{impossible} to achieve obliviously. On the other hand, the conceptually new definition we introduce, Conditional Proportional Representation, can be achieved obliviously through Posterior Sampling. Our experiments validate our theoretical results and achieve fair image reconstruction using state-of-the-art generative models.
    ABCD: A Graph Framework to Convert Complex Sentences to a Covering Set of Simple Sentences. (arXiv:2106.12027v1 [cs.CL])
    (2 min) Atomic clauses are fundamental text units for understanding complex sentences. Identifying the atomic sentences within complex sentences is important for applications such as summarization, argument mining, discourse analysis, discourse parsing, and question answering. Previous work mainly relies on rule-based methods dependent on parsing. We propose a new task to decompose each complex sentence into simple sentences derived from the tensed clauses in the source, and a novel problem formulation as a graph edit task. Our neural model learns to Accept, Break, Copy or Drop elements of a graph that combines word adjacency and grammatical dependencies. The full processing pipeline includes modules for graph construction, graph editing, and sentence generation from the output graph. We introduce DeSSE, a new dataset designed to train and evaluate complex sentence decomposition, and MinWiki, a subset of MinWikiSplit. ABCD achieves comparable performance as two parsing baselines on MinWiki. On DeSSE, which has a more even balance of complex sentence types, our model achieves higher accuracy on the number of atomic sentences than an encoder-decoder baseline. Results include a detailed error analysis.
    Deep Gaussian Processes: A Survey. (arXiv:2106.12135v1 [cs.LG])
    (2 min) Gaussian processes are one of the dominant approaches in Bayesian learning. Although the approach has been applied to numerous problems with great success, it has a few fundamental limitations. Multiple methods in literature have addressed these limitations. However, there has not been a comprehensive survey of the topics as of yet. Most existing surveys focus on only one particular variant of Gaussian processes and their derivatives. This survey details the core motivations for using Gaussian processes, their mathematical formulations, limitations, and research themes that have flourished over the years to address said limitations. Furthermore, one particular research area is Deep Gaussian Processes (DGPs), it has improved substantially in the past decade. The significant publications that advanced the forefront of this research area are outlined in their survey. Finally, a brief discussion on open problems and research directions for future work is presented at the end.
    Uncertainty-Aware Model-Based Reinforcement Learning with Application to Autonomous Driving. (arXiv:2106.12194v1 [cs.RO])
    (2 min) To further improve the learning efficiency and performance of reinforcement learning (RL), in this paper we propose a novel uncertainty-aware model-based RL (UA-MBRL) framework, and then implement and validate it in autonomous driving under various task scenarios. First, an action-conditioned ensemble model with the ability of uncertainty assessment is established as the virtual environment model. Then, a novel uncertainty-aware model-based RL framework is developed based on the adaptive truncation approach, providing virtual interactions between the agent and environment model, and improving RL's training efficiency and performance. The developed algorithms are then implemented in end-to-end autonomous vehicle control tasks, validated and compared with state-of-the-art methods under various driving scenarios. The validation results suggest that the proposed UA-MBRL method surpasses the existing model-based and model-free RL approaches, in terms of learning efficiency and achieved performance. The results also demonstrate the good ability of the proposed method with respect to the adaptiveness and robustness, under various autonomous driving scenarios.
    Q-Learning Lagrange Policies for Multi-Action Restless Bandits. (arXiv:2106.12024v1 [cs.LG])
    (2 min) Multi-action restless multi-armed bandits (RMABs) are a powerful framework for constrained resource allocation in which $N$ independent processes are managed. However, previous work only study the offline setting where problem dynamics are known. We address this restrictive assumption, designing the first algorithms for learning good policies for Multi-action RMABs online using combinations of Lagrangian relaxation and Q-learning. Our first approach, MAIQL, extends a method for Q-learning the Whittle index in binary-action RMABs to the multi-action setting. We derive a generalized update rule and convergence proof and establish that, under standard assumptions, MAIQL converges to the asymptotically optimal multi-action RMAB policy as $t\rightarrow{}\infty$. However, MAIQL relies on learning Q-functions and indexes on two timescales which leads to slow convergence and requires problem structure to perform well. Thus, we design a second algorithm, LPQL, which learns the well-performing and more general Lagrange policy for multi-action RMABs by learning to minimize the Lagrange bound through a variant of Q-learning. To ensure fast convergence, we take an approximation strategy that enables learning on a single timescale, then give a guarantee relating the approximation's precision to an upper bound of LPQL's return as $t\rightarrow{}\infty$. Finally, we show that our approaches always outperform baselines across multiple settings, including one derived from real-world medication adherence data.
    Learning Identity-Preserving Transformations on Data Manifolds. (arXiv:2106.12096v1 [cs.LG])
    (2 min) Many machine learning techniques incorporate identity-preserving transformations into their models to generalize their performance to previously unseen data. These transformations are typically selected from a set of functions that are known to maintain the identity of an input when applied (e.g., rotation, translation, flipping, and scaling). However, there are many natural variations that cannot be labeled for supervision or defined through examination of the data. As suggested by the manifold hypothesis, many of these natural variations live on or near a low-dimensional, nonlinear manifold. Several techniques represent manifold variations through a set of learned Lie group operators that define directions of motion on the manifold. However theses approaches are limited because they require transformation labels when training their models and they lack a method for determining which regions of the manifold are appropriate for applying each specific operator. We address these limitations by introducing a learning strategy that does not require transformation labels and developing a method that learns the local regions where each operator is likely to be used while preserving the identity of inputs. Experiments on MNIST and Fashion MNIST highlight our model's ability to learn identity-preserving transformations on multi-class datasets. Additionally, we train on CelebA to showcase our model's ability to learn semantically meaningful transformations on complex datasets in an unsupervised manner.
    A Practical & Unified Notation for Information-Theoretic Quantities in ML. (arXiv:2106.12062v1 [cs.LG])
    (2 min) Information theory is of importance to machine learning, but the notation for information-theoretic quantities is sometimes opaque. The right notation can convey valuable intuitions and concisely express new ideas. We propose such a notation for machine learning users and expand it to include information-theoretic quantities between events (outcomes) and random variables. We apply this notation to a popular information-theoretic acquisition function in Bayesian active learning which selects the most informative (unlabelled) samples to be labelled by an expert. We demonstrate the value of our notation when extending the acquisition function to the core-set problem, which consists of selecting the most informative samples \emph{given} the labels.
    It's All in the Heads: Using Attention Heads as a Baseline for Cross-Lingual Transfer in Commonsense Reasoning. (arXiv:2106.12066v1 [cs.CL])
    (2 min) Commonsense reasoning is one of the key problems in natural language processing, but the relative scarcity of labeled data holds back the progress for languages other than English. Pretrained cross-lingual models are a source of powerful language-agnostic representations, yet their inherent reasoning capabilities are still actively studied. In this work, we design a simple approach to commonsense reasoning which trains a linear classifier with weights of multi-head attention as features. To evaluate this approach, we create a multilingual Winograd Schema corpus by processing several datasets from prior work within a standardized pipeline and measure cross-lingual generalization ability in terms of out-of-sample performance. The method performs competitively with recent supervised and unsupervised approaches for commonsense reasoning, even when applied to other languages in a zero-shot manner. Also, we demonstrate that most of the performance is given by the same small subset of attention heads for all studied languages, which provides evidence of universal reasoning capabilities in multilingual encoders.
    Deep learning for improved global precipitation in numerical weather prediction systems. (arXiv:2106.12045v1 [physics.ao-ph])
    (2 min) The formation of precipitation in state-of-the-art weather and climate models is an important process. The understanding of its relationship with other variables can lead to endless benefits, particularly for the world's monsoon regions dependent on rainfall as a support for livelihood. Various factors play a crucial role in the formation of rainfall, and those physical processes are leading to significant biases in the operational weather forecasts. We use the UNET architecture of a deep convolutional neural network with residual learning as a proof of concept to learn global data-driven models of precipitation. The models are trained on reanalysis datasets projected on the cubed-sphere projection to minimize errors due to spherical distortion. The results are compared with the operational dynamical model used by the India Meteorological Department. The theoretical deep learning-based model shows doubling of the grid point, as well as area averaged skill measured in Pearson correlation coefficients relative to operational system. This study is a proof-of-concept showing that residual learning-based UNET can unravel physical relationships to target precipitation, and those physical constraints can be used in the dynamical operational models towards improved precipitation forecasts. Our results pave the way for the development of online, hybrid models in the future.
    Pure Exploration in Kernel and Neural Bandits. (arXiv:2106.12034v1 [stat.ML])
    (2 min) We study pure exploration in bandits, where the dimension of the feature representation can be much larger than the number of arms. To overcome the curse of dimensionality, we propose to adaptively embed the feature representation of each arm into a lower-dimensional space and carefully deal with the induced model misspecifications. Our approach is conceptually very different from existing works that can either only handle low-dimensional linear bandits or passively deal with model misspecifications. We showcase the application of our approach to two pure exploration settings that were previously under-studied: (1) the reward function belongs to a possibly infinite-dimensional Reproducing Kernel Hilbert Space, and (2) the reward function is nonlinear and can be approximated by neural networks. Our main results provide sample complexity guarantees that only depend on the effective dimension of the feature spaces in the kernel or neural representations. Extensive experiments conducted on both synthetic and real-world datasets demonstrate the efficacy of our methods.
    Analysis of the Evolution of Parametric Drivers of High-End Sea-Level Hazards. (arXiv:2106.12041v1 [physics.ao-ph])
    (2 min) Climate models are critical tools for developing strategies to manage the risks posed by sea-level rise to coastal communities. While these models are necessary for understanding climate risks, there is a level of uncertainty inherent in each parameter in the models. This model parametric uncertainty leads to uncertainty in future climate risks. Consequently, there is a need to understand how those parameter uncertainties impact our assessment of future climate risks and the efficacy of strategies to manage them. Here, we use random forests to examine the parametric drivers of future climate risk and how the relative importances of those drivers change over time. We find that the equilibrium climate sensitivity and a factor that scales the effect of aerosols on radiative forcing are consistently the most important climate model parametric uncertainties throughout the 2020 to 2150 interval for both low and high radiative forcing scenarios. The near-term hazards of high-end sea-level rise are driven primarily by thermal expansion, while the longer-term hazards are associated with mass loss from the Antarctic and Greenland ice sheets. Our results highlight the practical importance of considering time-evolving parametric uncertainties when developing strategies to manage future climate risks.
    A Simple Baseline for Batch Active Learning with Stochastic Acquisition Functions. (arXiv:2106.12059v1 [cs.LG])
    (2 min) In active learning, new labels are commonly acquired in batches. However, common acquisition functions are only meant for one-sample acquisition rounds at a time, and when their scores are used naively for batch acquisition, they result in batches lacking diversity, which deteriorates performance. On the other hand, state-of-the-art batch acquisition functions are costly to compute. In this paper, we present a novel class of stochastic acquisition functions that extend one-sample acquisition functions to the batch setting by observing how one-sample acquisition scores change as additional samples are acquired and modelling this difference for additional batch samples. We simply acquire new samples by sampling from the pool set using a Gibbs distribution based on the acquisition scores. Our acquisition functions are both vastly cheaper to compute and out-perform other batch acquisition functions.
    The Neurally-Guided Shape Parser: A Monte Carlo Method for Hierarchical Labeling of Over-segmented 3D Shapes. (arXiv:2106.12026v1 [cs.CV])
    (2 min) Many learning-based 3D shape semantic segmentation methods assign labels to shape atoms (e.g. points in a point cloud or faces in a mesh) with a single-pass approach trained in an end-to-end fashion. Such methods achieve impressive performance but require large amounts of labeled training data. This paradigm entangles two separable subproblems: (1) decomposing a shape into regions and (2) assigning semantic labels to these regions. We claim that disentangling these subproblems reduces the labeled data burden: (1) region decomposition requires no semantic labels and could be performed in an unsupervised fashion, and (2) labeling shape regions instead of atoms results in a smaller search space and should be learnable with less labeled training data. In this paper, we investigate this second claim by presenting the Neurally-Guided Shape Parser (NGSP), a method that learns how to assign semantic labels to regions of an over-segmented 3D shape. We solve this problem via MAP inference, modeling the posterior probability of a labeling assignment conditioned on an input shape. We employ a Monte Carlo importance sampling approach guided by a neural proposal network, a search-based approach made feasible by assuming the input shape is decomposed into discrete regions. We evaluate NGSP on the task of hierarchical semantic segmentation on manufactured 3D shapes from PartNet. We find that NGSP delivers significant performance improvements over baselines that learn to label shape atoms and then aggregate predictions for each shape region, especially in low-data regimes. Finally, we demonstrate that NGSP is robust to region granularity, as it maintains strong segmentation performance even as the regions undergo significant corruption.

2021-06-23

  • cs.CL updates on arXiv.org

    A Systematic Evaluation of Transfer Learning and Pseudo-labeling with BERT-based Ranking Models. (arXiv:2103.03335v3 [cs.IR] UPDATED)
    (2 min) Due to high annotation costs making the best use of existing human-created training data is an important research direction. We, therefore, carry out a systematic evaluation of transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from a large dataset to a dataset with a small number of queries. In contrast, each of our collections has a substantial number of queries, which enables a full-shot evaluation mode and improves reliability of our results. Furthermore, since source datasets licences often prohibit commercial use, we compare transfer learning to training on pseudo-labels generated by a BM25 scorer. We find that training on pseudo-labels -- possibly with subsequent fine-tuning using a modest number of annotated queries -- can produce a competitive or better model compared to transfer learning. Yet, it is necessary to improve the stability and/or effectiveness of the few-shot training, which, sometimes, can degrade performance of a pretrained model.
    SENT: Sentence-level Distant Relation Extraction via Negative Training. (arXiv:2106.11566v1 [cs.CL])
    (2 min) Distant supervision for relation extraction provides uniform bag labels for each sentence inside the bag, while accurate sentence labels are important for downstream applications that need the exact relation type. Directly using bag labels for sentence-level training will introduce much noise, thus severely degrading performance. In this work, we propose the use of negative training (NT), in which a model is trained using complementary labels regarding that ``the instance does not belong to these complementary labels". Since the probability of selecting a true label as a complementary label is low, NT provides less noisy information. Furthermore, the model trained with NT is able to separate the noisy data from the training data. Based on NT, we propose a sentence-level framework, SENT, for distant relation extraction. SENT not only filters the noisy data to construct a cleaner dataset, but also performs a re-labeling process to transform the noisy data into useful training data, thus further benefiting the model's performance. Experimental results show the significant improvement of the proposed method over previous methods on sentence-level evaluation and de-noise effect.
    ETC-NLG: End-to-end Topic-Conditioned Natural Language Generation. (arXiv:2008.10875v3 [cs.CL] UPDATED)
    (2 min) Plug-and-play language models (PPLMs) enable topic-conditioned natural language generation by pairing large pre-trained generators with attribute models used to steer the predicted token distribution towards the selected topic. Despite their computational efficiency, PPLMs require large amounts of labeled texts to effectively balance generation fluency and proper conditioning, making them unsuitable for low-resource settings. We present ETC-NLG, an approach leveraging topic modeling annotations to enable fully-unsupervised End-to-end Topic-Conditioned Natural Language Generation over emergent topics in unlabeled document collections. We first test the effectiveness of our approach in a low-resource setting for Italian, evaluating the conditioning for both topic models and gold annotations. We then perform a comparative evaluation of ETC-NLG for Italian and English using a parallel corpus. Finally, we propose an automatic approach to estimate the effectiveness of conditioning on the generated utterances.
    IITP@COLIEE 2019: Legal Information Retrieval using BM25 and BERT. (arXiv:2104.08653v3 [cs.CL] UPDATED)
    (2 min) Natural Language Processing (NLP) and Information Retrieval (IR) in the judicial domain is an essential task. With the advent of availability domain-specific data in electronic form and aid of different Artificial intelligence (AI) technologies, automated language processing becomes more comfortable, and hence it becomes feasible for researchers and developers to provide various automated tools to the legal community to reduce human burden. The Competition on Legal Information Extraction/Entailment (COLIEE-2019) run in association with the International Conference on Artificial Intelligence and Law (ICAIL)-2019 has come up with few challenging tasks. The shared defined four sub-tasks (i.e. Task1, Task2, Task3 and Task4), which will be able to provide few automated systems to the judicial system. The paper presents our working note on the experiments carried out as a part of our participation in all the sub-tasks defined in this shared task. We make use of different Information Retrieval(IR) and deep learning based approaches to tackle these problems. We obtain encouraging results in all these four sub-tasks.
    ProphetNet-X: Large-Scale Pre-training Models for English, Chinese, Multi-lingual, Dialog, and Code Generation. (arXiv:2104.08006v2 [cs.CL] UPDATED)
    (2 min) Now, the pre-training technique is ubiquitous in natural language processing field. ProphetNet is a pre-training based natural language generation method which shows powerful performance on English text summarization and question generation tasks. In this paper, we extend ProphetNet into other domains and languages, and present the ProphetNet family pre-training models, named ProphetNet-X, where X can be English, Chinese, Multi-lingual, and so on. We pre-train a cross-lingual generation model ProphetNet-Multi, a Chinese generation model ProphetNet-Zh, two open-domain dialog generation models ProphetNet-Dialog-En and ProphetNet-Dialog-Zh. And also, we provide a PLG (Programming Language Generation) model ProphetNet-Code to show the generation performance besides NLG (Natural Language Generation) tasks. In our experiments, ProphetNet-X models achieve new state-of-the-art performance on 10 benchmarks. All the models of ProphetNet-X share the same model structure, which allows users to easily switch between different models. We make the code and models publicly available, and we will keep updating more pre-training models and finetuning scripts.
    LV-BERT: Exploiting Layer Variety for BERT. (arXiv:2106.11740v1 [cs.CL])
    (2 min) Modern pre-trained language models are mostly built upon backbones stacking self-attention and feed-forward layers in an interleaved order. In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order. Specifically, besides the original self-attention and feed-forward layers, we introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models. Furthermore, beyond the original interleaved order, we explore more layer orders to discover more powerful architectures. However, the introduced layer variety leads to a large architecture space of more than billions of candidates, while training a single candidate model from scratch already requires huge computation cost, making it not affordable to search such a space by directly training large amounts of candidate models. To solve this problem, we first pre-train a supernet from which the weights of all candidate models can be inherited, and then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture. Extensive experiments show that LV-BERT model obtained by our method outperforms BERT and its variants on various downstream tasks. For example, LV-BERT-small achieves 78.8 on the GLUE testing set, 1.8 higher than the strong baseline ELECTRA-small.
    Towards Knowledge-Grounded Counter Narrative Generation for Hate Speech. (arXiv:2106.11783v1 [cs.CL])
    (2 min) Tackling online hatred using informed textual responses - called counter narratives - has been brought under the spotlight recently. Accordingly, a research line has emerged to automatically generate counter narratives in order to facilitate the direct intervention in the hate discussion and to prevent hate content from further spreading. Still, current neural approaches tend to produce generic/repetitive responses and lack grounded and up-to-date evidence such as facts, statistics, or examples. Moreover, these models can create plausible but not necessarily true arguments. In this paper we present the first complete knowledge-bound counter narrative generation pipeline, grounded in an external knowledge repository that can provide more informative content to fight online hatred. Together with our approach, we present a series of experiments that show its feasibility to produce suitable and informative counter narratives in in-domain and cross-domain settings.
    Generating abstractive summaries of Lithuanian news articles using a transformer model. (arXiv:2105.03279v2 [cs.CL] UPDATED)
    (2 min) In this work, we train the first monolingual Lithuanian transformer model on a relatively large corpus of Lithuanian news articles and compare various output decoding algorithms for abstractive news summarization. We achieve an average ROUGE-2 score 0.163, generated summaries are coherent and look impressive at first glance. However, some of them contain misleading information that is not so easy to spot. We describe all the technical details and share our trained model and accompanying code in an online open-source repository, as well as some characteristic samples of the generated summaries.
    Exemplars-guided Empathetic Response Generation Controlled by the Elements of Human Communication. (arXiv:2106.11791v1 [cs.CL])
    (2 min) The majority of existing methods for empathetic response generation rely on the emotion of the context to generate empathetic responses. However, empathy is much more than generating responses with an appropriate emotion. It also often entails subtle expressions of understanding and personal resonance with the situation of the other interlocutor. Unfortunately, such qualities are difficult to quantify and the datasets lack the relevant annotations. To address this issue, in this paper we propose an approach that relies on exemplars to cue the generative model on fine stylistic properties that signal empathy to the interlocutor. To this end, we employ dense passage retrieval to extract relevant exemplary responses from the training set. Three elements of human communication -- emotional presence, interpretation, and exploration, and sentiment are additionally introduced using synthetic labels to guide the generation towards empathy. The human evaluation is also extended by these elements of human communication. We empirically show that these approaches yield significant improvements in empathetic response quality in terms of both automated and human-evaluated metrics. The implementation is available at https://github.com/declare-lab/exemplary-empathy.
    Unsupervised Cross-lingual Adaptation for Sequence Tagging and Beyond. (arXiv:2010.12405v3 [cs.CL] UPDATED)
    (2 min) Cross-lingual adaptation with multilingual pre-trained language models (mPTLMs) mainly consists of two lines of works: zero-shot approach and translation-based approach, which have been studied extensively on the sequence-level tasks. We further verify the efficacy of these cross-lingual adaptation approaches by evaluating their performances on more fine-grained sequence tagging tasks. After re-examining their strengths and drawbacks, we propose a novel framework to consolidate the zero-shot approach and the translation-based approach for better adaptation performance. Instead of simply augmenting the source data with the machine-translated data, we tailor-make a warm-up mechanism to quickly update the mPTLMs with the gradients estimated on a few translated data. Then, the adaptation approach is applied to the refined parameters and the cross-lingual transfer is performed in a warm-start way. The experimental results on nine target languages demonstrate that our method is beneficial to the cross-lingual adaptation of various sequence tagging tasks.
    Error-Aware Interactive Semantic Parsing of OpenStreetMap. (arXiv:2106.11739v1 [cs.CL])
    (2 min) In semantic parsing of geographical queries against real-world databases such as OpenStreetMap (OSM), unique correct answers do not necessarily exist. Instead, the truth might be lying in the eye of the user, who needs to enter an interactive setup where ambiguities can be resolved and parsing mistakes can be corrected. Our work presents an approach to interactive semantic parsing where an explicit error detection is performed, and a clarification question is generated that pinpoints the suspected source of ambiguity or error and communicates it to the human user. Our experimental results show that a combination of entropy-based uncertainty detection and beam search, together with multi-source training on clarification question, initial parse, and user answer, results in improvements of 1.2% F1 score on a parser that already performs at 90.26% on the NLMaps dataset for OSM semantic parsing.
    Provable Limitations of Acquiring Meaning from Ungrounded Form: What Will Future Language Models Understand?. (arXiv:2104.10809v2 [cs.CL] UPDATED)
    (2 min) Language models trained on billions of tokens have recently led to unprecedented results on many NLP tasks. This success raises the question of whether, in principle, a system can ever ``understand'' raw text without access to some form of grounding. We formally investigate the abilities of ungrounded systems to acquire meaning. Our analysis focuses on the role of ``assertions'': textual contexts that provide indirect clues about the underlying semantics. We study whether assertions enable a system to emulate representations preserving semantic relations like equivalence. We find that assertions enable semantic emulation of languages that satisfy a strong notion of semantic transparency. However, for classes of languages where the same expression can take different values in different contexts, we show that emulation can become uncomputable. Finally, we discuss differences between our formal model and natural language, exploring how our results generalize to a modal setting and other semantic relations. Together, our results suggest that assertions in code or language do not provide sufficient signal to fully emulate semantic representations. We formalize ways in which ungrounded language models appear to be fundamentally limited in their ability to ``understand''.
    Do Language Models Perform Generalizable Commonsense Inference?. (arXiv:2106.11533v1 [cs.CL])
    (2 min) Inspired by evidence that pretrained language models (LMs) encode commonsense knowledge, recent work has applied LMs to automatically populate commonsense knowledge graphs (CKGs). However, there is a lack of understanding on their generalization to multiple CKGs, unseen relations, and novel entities. This paper analyzes the ability of LMs to perform generalizable commonsense inference, in terms of knowledge capacity, transferability, and induction. Our experiments with these three aspects show that: (1) LMs can adapt to different schemas defined by multiple CKGs but fail to reuse the knowledge to generalize to new relations. (2) Adapted LMs generalize well to unseen subjects, but less so on novel objects. Future work should investigate how to improve the transferability and induction of commonsense mining from LMs.
    Enhancing Dialogue Generation via Multi-Level Contrastive Learning. (arXiv:2009.09147v2 [cs.CL] UPDATED)
    (2 min) Most of the existing works for dialogue generation are data-driven models trained directly on corpora crawled from websites. They mainly focus on improving the model architecture to produce better responses but pay little attention to considering the quality of the training data contrastively. In this paper, we propose a multi-level contrastive learning paradigm to model the fine-grained quality of the responses with respect to the query. A Rank-aware Calibration (RC) network is designed to construct the multi-level contrastive optimization objectives. Since these objectives are calculated based on the sentence level, which may erroneously encourage/suppress the generation of uninformative/informative words. To tackle this incidental issue, on one hand, we design an exquisite token-level strategy for estimating the instance loss more accurately. On the other hand, we build a Knowledge Inference (KI) component to capture the keyword knowledge from the reference during training and exploit such information to encourage the generation of informative words. We evaluate the proposed model on a carefully annotated dialogue dataset and the results suggest that our model can generate more relevant and diverse responses compared to the baseline models.
    Graph Routing between Capsules. (arXiv:2106.11531v1 [cs.LG])
    (2 min) Routing methods in capsule networks often learn a hierarchical relationship for capsules in successive layers, but the intra-relation between capsules in the same layer is less studied, while this intra-relation is a key factor for the semantic understanding in text data. Therefore, in this paper, we introduce a new capsule network with graph routing to learn both relationships, where capsules in each layer are treated as the nodes of a graph. We investigate strategies to yield adjacency and degree matrix with three different distances from a layer of capsules, and propose the graph routing mechanism between those capsules. We validate our approach on five text classification datasets, and our findings suggest that the approach combining bottom-up routing and top-down attention performs the best. Such an approach demonstrates generalization capability across datasets. Compared to the state-of-the-art routing methods, the improvements in accuracy in the five datasets we used were 0.82, 0.39, 0.07, 1.01, and 0.02, respectively.
    Learn to Resolve Conversational Dependency: A Consistency Training Framework for Conversational Question Answering. (arXiv:2106.11575v1 [cs.CL])
    (2 min) One of the main challenges in conversational question answering (CQA) is to resolve the conversational dependency, such as anaphora and ellipsis. However, existing approaches do not explicitly train QA models on how to resolve the dependency, and thus these models are limited in understanding human dialogues. In this paper, we propose a novel framework, ExCorD (Explicit guidance on how to resolve Conversational Dependency) to enhance the abilities of QA models in comprehending conversational context. ExCorD first generates self-contained questions that can be understood without the conversation history, then trains a QA model with the pairs of original and self-contained questions using a consistency-based regularizer. In our experiments, we demonstrate that ExCorD significantly improves the QA models' performance by up to 1.2 F1 on QuAC, and 5.2 F1 on CANARD, while addressing the limitations of the existing approaches.
    End-to-End Task-Oriented Dialog Modeling with Semi-Structured Knowledge Management. (arXiv:2106.11796v1 [cs.CL])
    (2 min) Current task-oriented dialog (TOD) systems mostly manage structured knowledge (e.g. databases and tables) to guide the goal-oriented conversations. However, they fall short of handling dialogs which also involve unstructured knowledge (e.g. reviews and documents). In this paper, we formulate a task of modeling TOD grounded on a fusion of structured and unstructured knowledge. To address this task, we propose a TOD system with semi-structured knowledge management, SeKnow, which extends the belief state to manage knowledge with both structured and unstructured contents. Furthermore, we introduce two implementations of SeKnow based on a non-pretrained sequence-to-sequence model and a pretrained language model, respectively. Both implementations use the end-to-end manner to jointly optimize dialog modeling grounded on structured and unstructured knowledge. We conduct experiments on the modified version of MultiWOZ 2.1 dataset, where dialogs are processed to involve semi-structured knowledge. Experimental results show that SeKnow has strong performances in both end-to-end dialog and intermediate knowledge management, compared to existing TOD systems and their extensions with pipeline knowledge management schemes.
    Analysis and Tuning of a Voice Assistant System for Dysfluent Speech. (arXiv:2106.11759v1 [eess.AS])
    (2 min) Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice operated systems do not work. Current speech recognition systems are trained primarily with data from fluent speakers and as a consequence do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks. The focus of this work is on quantitative analysis of a consumer speech recognition system on individuals who stutter and production-oriented approaches for improving performance for common voice assistant tasks (i.e., "what is the weather?"). At baseline, this system introduces a significant number of insertion and substitution errors resulting in intended speech Word Error Rates (isWER) that are 13.64\% worse (absolute) for individuals with fluency disorders. We show that by simply tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24\% (relative) for individuals with fluency disorders. Tuning these parameters translates to 3.6\% better domain recognition and 1.7\% better intent recognition relative to the default setup for the 18 study participants across all stuttering severities.
    On the Evaluation of Machine Translation for Terminology Consistency. (arXiv:2106.11891v1 [cs.CL])
    (2 min) As neural machine translation (NMT) systems become an important part of professional translator pipelines, a growing body of work focuses on combining NMT with terminologies. In many scenarios and particularly in cases of domain adaptation, one expects the MT output to adhere to the constraints provided by a terminology. In this work, we propose metrics to measure the consistency of MT output with regards to a domain terminology. We perform studies on the COVID-19 domain over 5 languages, also performing terminology-targeted human evaluation. We open-source the code for computing all proposed metrics: https://github.com/mahfuzibnalam/terminology_evaluation
    BARTScore: Evaluating Generated Text as Text Generation. (arXiv:2106.11520v1 [cs.CL])
    (2 min) A wide variety of NLP applications, such as machine translation, summarization, and dialog, involve text generation. One major challenge for these applications is how to evaluate whether such generated texts are actually fluent, accurate, or effective. In this work, we conceptualize the evaluation of generated text as a text generation problem, modeled using pre-trained sequence-to-sequence models. The general idea is that models trained to convert the generated text to/from a reference output or the source text will achieve higher scores when the generated text is better. We operationalize this idea using BART, an encoder-decoder based pre-trained model, and propose a metric BARTScore with a number of variants that can be flexibly applied in an unsupervised fashion to evaluation of text from different perspectives (e.g. informativeness, fluency, or factuality). BARTScore is conceptually simple and empirically effective. It can outperform existing top-scoring metrics in 16 of 22 test settings, covering evaluation of 16 datasets (e.g., machine translation, text summarization) and 7 different perspectives (e.g., informativeness, factuality). Code to calculate BARTScore is available at https://github.com/neulab/BARTScore, and we have released an interactive leaderboard for meta-evaluation at this http URL on the ExplainaBoard platform, which allows us to interactively understand the strengths, weaknesses, and complementarity of each metric.
    How well do you know your summarization datasets?. (arXiv:2106.11388v1 [cs.CL])
    (2 min) State-of-the-art summarization systems are trained and evaluated on massive datasets scraped from the web. Despite their prevalence, we know very little about the underlying characteristics (data noise, summarization complexity, etc.) of these datasets, and how these affect system performance and the reliability of automatic metrics like ROUGE. In this study, we manually analyze 600 samples from three popular summarization datasets. Our study is driven by a six-class typology which captures different noise types (missing facts, entities) and degrees of summarization difficulty (extractive, abstractive). We follow with a thorough analysis of 27 state-of-the-art summarization models and 5 popular metrics, and report our key insights: (1) Datasets have distinct data quality and complexity distributions, which can be traced back to their collection process. (2) The performance of models and reliability of metrics is dependent on sample complexity. (3) Faithful summaries often receive low scores because of the poor diversity of references. We release the code, annotated data and model outputs.
    KaggleDBQA: Realistic Evaluation of Text-to-SQL Parsers. (arXiv:2106.11455v1 [cs.CL])
    (2 min) The goal of database question answering is to enable natural language querying of real-life relational databases in diverse application domains. Recently, large-scale datasets such as Spider and WikiSQL facilitated novel modeling techniques for text-to-SQL parsing, improving zero-shot generalization to unseen databases. In this work, we examine the challenges that still prevent these techniques from practical deployment. First, we present KaggleDBQA, a new cross-domain evaluation dataset of real Web databases, with domain-specific data types, original formatting, and unrestricted questions. Second, we re-examine the choice of evaluation tasks for text-to-SQL parsers as applied in real-life settings. Finally, we augment our in-domain evaluation task with database documentation, a naturally occurring source of implicit domain knowledge. We show that KaggleDBQA presents a challenge to state-of-the-art zero-shot parsers but a more realistic evaluation setting and creative use of associated database documentation boosts their accuracy by over 13.2%, doubling their performance.
    Incremental Deep Neural Network Learning using Classification Confidence Thresholding. (arXiv:2106.11437v1 [cs.LG])
    (2 min) Most modern neural networks for classification fail to take into account the concept of the unknown. Trained neural networks are usually tested in an unrealistic scenario with only examples from a closed set of known classes. In an attempt to develop a more realistic model, the concept of working in an open set environment has been introduced. This in turn leads to the concept of incremental learning where a model with its own architecture and initial trained set of data can identify unknown classes during the testing phase and autonomously update itself if evidence of a new class is detected. Some problems that arise in incremental learning are inefficient use of resources to retrain the classifier repeatedly and the decrease of classification accuracy as multiple classes are added over time. This process of instantiating new classes is repeated as many times as necessary, accruing errors. To address these problems, this paper proposes the Classification Confidence Threshold approach to prime neural networks for incremental learning to keep accuracies high by limiting forgetting. A lean method is also used to reduce resources used in the retraining of the neural network. The proposed method is based on the idea that a network is able to incrementally learn a new class even when exposed to a limited number samples associated with the new class. This method can be applied to most existing neural networks with minimal changes to network architecture.
    A Survey of Race, Racism, and Anti-Racism in NLP. (arXiv:2106.11410v1 [cs.CL])
    (2 min) Despite inextricable ties between race and language, little work has considered race in NLP research and development. In this work, we survey 79 papers from the ACL anthology that mention race. These papers reveal various types of race-related bias in all stages of NLP model development, highlighting the need for proactive consideration of how NLP systems can uphold racial hierarchies. However, persistent gaps in research on race and NLP remain: race has been siloed as a niche topic and remains ignored in many NLP tasks; most work operationalizes race as a fixed single-dimensional variable with a ground-truth label, which risks reinforcing differences produced by historical racism; and the voices of historically marginalized people are nearly absent in NLP literature. By identifying where and how NLP literature has and has not considered race, especially in comparison to related fields, our work calls for inclusion and racial justice in NLP research practices.
    Phrase-level Active Learning for Neural Machine Translation. (arXiv:2106.11375v1 [cs.CL])
    (2 min) Neural machine translation (NMT) is sensitive to domain shift. In this paper, we address this problem in an active learning setting where we can spend a given budget on translating in-domain data, and gradually fine-tune a pre-trained out-of-domain NMT model on the newly translated data. Existing active learning methods for NMT usually select sentences based on uncertainty scores, but these methods require costly translation of full sentences even when only one or two key phrases within the sentence are informative. To address this limitation, we re-examine previous work from the phrase-based machine translation (PBMT) era that selected not full sentences, but rather individual phrases. However, while incorporating these phrases into PBMT systems was relatively simple, it is less trivial for NMT systems, which need to be trained on full sequences to capture larger structural properties of sentences unique to the new domain. To overcome these hurdles, we propose to select both full sentences and individual phrases from unlabelled data in the new domain for routing to human translators. In a German-English translation task, our active learning approach achieves consistent improvements over uncertainty-based sentence selection methods, improving up to 1.2 BLEU score over strong active learning baselines.
    A Comprehensive Exploration of Pre-training Language Models. (arXiv:2106.11483v1 [cs.CL])
    (2 min) Recently, the development of pre-trained language models has brought natural language processing (NLP) tasks to the new state-of-the-art. In this paper we explore the efficiency of various pre-trained language models. We pre-train a list of transformer-based models with the same amount of text and the same training steps. The experimental results shows that the most improvement upon the origin BERT is adding the RNN-layer to capture more contextual information for the transformer-encoder layers.
    Deep Learning Models in Detection of Dietary Supplement Adverse Event Signals from Twitter. (arXiv:2106.11403v1 [cs.CL])
    (2 min) Objective: The objective of this study is to develop a deep learning pipeline to detect signals on dietary supplement-related adverse events (DS AEs) from Twitter. Material and Methods: We obtained 247,807 tweets ranging from 2012 to 2018 that mentioned both DS and AE. We annotated biomedical entities and relations on 2,000 randomly selected tweets. For the concept extraction task, we compared the performance of traditional word embeddings with SVM, CRF and LSTM-CRF classifiers to BERT models. For the relation extraction task, we compared GloVe vectors with CNN classifiers to BERT models. We chose the best performing models in each task to assemble an end-to-end deep learning pipeline to detect DS AE signals and compared the results to the known DS AEs from a DS knowledge base (i.e., iDISK). Results: In both tasks, the BERT-based models outperformed traditional word embeddings. The best performing concept extraction model is the BioBERT model that can identify supplement, symptom, and body organ entities with F1-scores of 0.8646, 0.8497, and 0.7104, respectively. The best performing relation extraction model is the BERT model that can identify purpose and AE relations with F1-scores of 0.8335 and 0.7538, respectively. The end-to-end pipeline was able to extract DS indication and DS AEs with an F1-score of 0.7459 and 0,7414, respectively. Comparing to the iDISK, we could find both known and novel DS-AEs. Conclusion: We have demonstrated the feasibility of detecting DS AE signals from Twitter with a BioBERT-based deep learning pipeline.
    Fine-tune the Entire RAG Architecture (including DPR retriever) for Question-Answering. (arXiv:2106.11517v1 [cs.IR])
    (2 min) In this paper, we illustrate how to fine-tune the entire Retrieval Augment Generation (RAG) architecture in an end-to-end manner. We highlighted the main engineering challenges that needed to be addressed to achieve this objective. We also compare how end-to-end RAG architecture outperforms the original RAG architecture for the task of question answering. We have open-sourced our implementation in the HuggingFace Transformers library.
    Dive into Deep Learning. (arXiv:2106.11342v1 [cs.LG])
    (2 min) This open-source book represents our attempt to make deep learning approachable, teaching readers the concepts, the context, and the code. The entire book is drafted in Jupyter notebooks, seamlessly integrating exposition figures, math, and interactive examples with self-contained code. Our goal is to offer a resource that could (i) be freely available for everyone; (ii) offer sufficient technical depth to provide a starting point on the path to actually becoming an applied machine learning scientist; (iii) include runnable code, showing readers how to solve problems in practice; (iv) allow for rapid updates, both by us and also by the community at large; (v) be complemented by a forum for interactive discussion of technical details and to answer questions.
    Membership Inference on Word Embedding and Beyond. (arXiv:2106.11384v1 [cs.CL])
    (2 min) In the text processing context, most ML models are built on word embeddings. These embeddings are themselves trained on some datasets, potentially containing sensitive data. In some cases this training is done independently, in other cases, it occurs as part of training a larger, task-specific model. In either case, it is of interest to consider membership inference attacks based on the embedding layer as a way of understanding sensitive information leakage. But, somewhat surprisingly, membership inference attacks on word embeddings and their effect in other natural language processing (NLP) tasks that use these embeddings, have remained relatively unexplored. In this work, we show that word embeddings are vulnerable to black-box membership inference attacks under realistic assumptions. Furthermore, we show that this leakage persists through two other major NLP applications: classification and text-generation, even when the embedding layer is not exposed to the attacker. We show that our MI attack achieves high attack accuracy against a classifier model and an LSTM-based language model. Indeed, our attack is a cheaper membership inference attack on text-generative models, which does not require the knowledge of the target model or any expensive training of text-generative models as shadow models.
  • cs.CV updates on arXiv.org

    Pruning of Deep Spiking Neural Networks through Gradient Rewiring. (arXiv:2105.04916v3 [cs.NE] UPDATED)
    (2 min) Spiking Neural Networks (SNNs) have been attached great importance due to their biological plausibility and high energy-efficiency on neuromorphic chips. As these chips are usually resource-constrained, the compression of SNNs is thus crucial along the road of practical use of SNNs. Most existing methods directly apply pruning approaches in artificial neural networks (ANNs) to SNNs, which ignore the difference between ANNs and SNNs, thus limiting the performance of the pruned SNNs. Besides, these methods are only suitable for shallow SNNs. In this paper, inspired by synaptogenesis and synapse elimination in the neural system, we propose gradient rewiring (Grad R), a joint learning algorithm of connectivity and weight for SNNs, that enables us to seamlessly optimize network structure without retraining. Our key innovation is to redefine the gradient to a new synaptic parameter, allowing better exploration of network structures by taking full advantage of the competition between pruning and regrowth of connections. The experimental results show that the proposed method achieves minimal loss of SNNs' performance on MNIST and CIFAR-10 dataset so far. Moreover, it reaches a $\sim$3.5% accuracy loss under unprecedented 0.73% connectivity, which reveals remarkable structure refining capability in SNNs. Our work suggests that there exists extremely high redundancy in deep SNNs. Our codes are available at https://github.com/Yanqi-Chen/Gradient-Rewiring.
    EC-GAN: Low-Sample Classification using Semi-Supervised Algorithms and GANs. (arXiv:2012.15864v3 [cs.LG] UPDATED)
    (2 min) Semi-supervised learning has been gaining attention as it allows for performing image analysis tasks such as classification with limited labeled data. Some popular algorithms using Generative Adversarial Networks (GANs) for semi-supervised classification share a single architecture for classification and discrimination. However, this may require a model to converge to a separate data distribution for each task, which may reduce overall performance. While progress in semi-supervised learning has been made, less addressed are small-scale, fully-supervised tasks where even unlabeled data is unavailable and unattainable. We therefore, propose a novel GAN model namely External Classifier GAN (EC-GAN), that utilizes GANs and semi-supervised algorithms to improve classification in fully-supervised regimes. Our method leverages a GAN to generate artificial data used to supplement supervised classification. More specifically, we attach an external classifier, hence the name EC-GAN, to the GAN's generator, as opposed to sharing an architecture with the discriminator. Our experiments demonstrate that EC-GAN's performance is comparable to the shared architecture method, far superior to the standard data augmentation and regularization-based approach, and effective on a small, realistic dataset.
    Fast and Reliable Probabilistic Face Embeddings in the Wild. (arXiv:2102.04075v3 [cs.CV] UPDATED)
    (2 min) Probabilistic Face Embeddings (PFE) can improve face recognition performance in unconstrained scenarios by integrating data uncertainty into the feature representation. However, existing PFE methods tend to be over-confident in estimating uncertainty and is too slow to apply to large-scale face matching. This paper proposes a regularized probabilistic face embedding method to improve the robustness and speed of PFE. Specifically, the mutual likelihood score (MLS) metric used in PFE is simplified to speedup the matching of face feature pairs. Then, an output-constraint loss is proposed to penalize the variance of the uncertainty output, which can regularize the output of the neural network. In addition, an identification preserving loss is proposed to improve the discriminative of the MLS metric, and a multi-layer feature fusion module is proposed to improve the neural network's uncertainty estimation ability. Comprehensive experiments show that the proposed method can achieve comparable or better results in 9 benchmarks than the state-of-the-art methods, and can improve the performance of risk-controlled face recognition. The code of our work is publicly available in GitHub (https://github.com/KaenChan/ProbFace).
    SODA10M: Towards Large-Scale Object Detection Benchmark for Autonomous Driving. (arXiv:2106.11118v2 [cs.CV] UPDATED)
    (2 min) Aiming at facilitating a real-world, ever-evolving and scalable autonomous driving system, we present a large-scale benchmark for standardizing the evaluation of different self-supervised and semi-supervised approaches by learning from raw data, which is the first and largest benchmark to date. Existing autonomous driving systems heavily rely on `perfect' visual perception models (e.g., detection) trained using extensive annotated data to ensure the safety. However, it is unrealistic to elaborately label instances of all scenarios and circumstances (e.g., night, extreme weather, cities) when deploying a robust autonomous driving system. Motivated by recent powerful advances of self-supervised and semi-supervised learning, a promising direction is to learn a robust detection model by collaboratively exploiting large-scale unlabeled data and few labeled data. Existing dataset (e.g., KITTI, Waymo) either provides only a small amount of data or covers limited domains with full annotation, hindering the exploration of large-scale pre-trained models. Here, we release a Large-Scale Object Detection benchmark for Autonomous driving, named as SODA10M, containing 10 million unlabeled images and 20K images labeled with 6 representative object categories. To improve diversity, the images are collected every ten seconds per frame within 32 different cities under different weather conditions, periods and location scenes. We provide extensive experiments and deep analyses of existing supervised state-of-the-art detection models, popular self-supervised and semi-supervised approaches, and some insights about how to develop future models. The data and more up-to-date information have been released at https://soda-2d.github.io.
    Hessian-Aware Pruning and Optimal Neural Implant. (arXiv:2101.08940v3 [cs.CV] UPDATED)
    (2 min) Pruning is an effective method to reduce the memory footprint and FLOPs associated with neural network models. However, existing structured-pruning methods often result in significant accuracy degradation for moderate pruning levels. To address this problem, we introduce a new Hessian Aware Pruning (HAP) method coupled with a Neural Implant approach that uses second-order sensitivity as a metric for structured pruning. The basic idea is to prune insensitive components and to use a Neural Implant for moderately sensitive components, instead of completely pruning them. For the latter approach, the moderately sensitive components are replaced with with a low rank implant that is smaller and less computationally expensive than the original component. We use the relative Hessian trace to measure sensitivity, as opposed to the magnitude based sensitivity metric commonly used in the literature. We test HAP for both computer vision tasks and natural language tasks, and we achieve new state-of-the-art results. Specifically, HAP achieves less than $0.1\%$/$0.5\%$ degradation on PreResNet29/ResNet50 (CIFAR-10/ImageNet) with more than 70\%/50\% of parameters pruned. Meanwhile, HAP also achieves significantly better performance (up to 0.8\% with 60\% of parameters pruned) as compared to gradient based method for head pruning on transformer-based models. The framework has been open sourced and available online.
    Confidence-Guided Radiology Report Generation. (arXiv:2106.10887v2 [cs.CV] UPDATED)
    (2 min) Medical imaging plays a pivotal role in diagnosis and treatment in clinical practice. Inspired by the significant progress in automatic image captioning, various deep learning (DL)-based architectures have been proposed for generating radiology reports for medical images. However, model uncertainty (i.e., model reliability/confidence on report generation) is still an under-explored problem. In this paper, we propose a novel method to explicitly quantify both the visual uncertainty and the textual uncertainty for the task of radiology report generation. Such multi-modal uncertainties can sufficiently capture the model confidence scores at both the report-level and the sentence-level, and thus they are further leveraged to weight the losses for achieving more comprehensive model optimization. Our experimental results have demonstrated that our proposed method for model uncertainty characterization and estimation can provide more reliable confidence scores for radiology report generation, and our proposed uncertainty-weighted losses can achieve more comprehensive model optimization and result in state-of-the-art performance on a public radiology report dataset.
    Multiple Object Tracking with Mixture Density Networks for Trajectory Estimation. (arXiv:2106.10950v2 [cs.CV] UPDATED)
    (2 min) Multiple object tracking faces several challenges that may be alleviated with trajectory information. Knowing the posterior locations of an object helps disambiguating and solving situations such as occlusions, re-identification, and identity switching. In this work, we show that trajectory estimation can become a key factor for tracking, and present TrajE, a trajectory estimator based on recurrent mixture density networks, as a generic module that can be added to existing object trackers. To provide several trajectory hypotheses, our method uses beam search. Also, relying on the same estimated trajectory, we propose to reconstruct a track after an occlusion occurs. We integrate TrajE into two state of the art tracking algorithms, CenterTrack [63] and Tracktor [3]. Their respective performances in the MOTChallenge 2017 test set are boosted 6.3 and 0.3 points in MOTA score, and 1.8 and 3.1 in IDF1, setting a new state of the art for the CenterTrack+TrajE configuration
    Cross-Dataset Collaborative Learning for Semantic Segmentation. (arXiv:2103.11351v2 [cs.CV] UPDATED)
    (2 min) Recent work attempts to improve semantic segmentation performance by exploring well-designed architectures on a target dataset. However, it remains challenging to build a unified system that simultaneously learns from various datasets due to the inherent distribution shift across different datasets. In this paper, we present a simple, flexible, and general method for semantic segmentation, termed Cross-Dataset Collaborative Learning (CDCL). Given multiple labeled datasets, we aim to improve the generalization and discrimination of feature representations on each dataset. Specifically, we first introduce a family of Dataset-Aware Blocks (DAB) as the fundamental computing units of the network, which help capture homogeneous representations and heterogeneous statistics across different datasets. Second, we propose a Dataset Alternation Training (DAT) mechanism to efficiently facilitate the optimization procedure. We conduct extensive evaluations on four diverse datasets, i.e., Cityscapes, BDD100K, CamVid, and COCO Stuff, with single-dataset and cross-dataset settings. Experimental results demonstrate our method consistently achieves notable improvements over prior single-dataset and cross-dataset training methods without introducing extra FLOPs. Particularly, with the same architecture of PSPNet (ResNet-18), our method outperforms the single-dataset baseline by 5.65\%, 6.57\%, and 5.79\% of mIoU on the validation sets of Cityscapes, BDD100K, CamVid, respectively. Code and models will be released.
    Unified Shape and SVBRDF Recovery using Differentiable Monte Carlo Rendering. (arXiv:2103.15208v2 [cs.CV] UPDATED)
    (2 min) Reconstructing the shape and appearance of real-world objects using measured 2D images has been a long-standing problem in computer vision. In this paper, we introduce a new analysis-by-synthesis technique capable of producing high-quality reconstructions through robust coarse-to-fine optimization and physics-based differentiable rendering. Unlike most previous methods that handle geometry and reflectance largely separately, our method unifies the optimization of both by leveraging image gradients with respect to both object reflectance and geometry. To obtain physically accurate gradient estimates, we develop a new GPU-based Monte Carlo differentiable renderer leveraging recent advances in differentiable rendering theory to offer unbiased gradients while enjoying better performance than existing tools like PyTorch3D and redner. To further improve robustness, we utilize several shape and material priors as well as a coarse-to-fine optimization strategy to reconstruct geometry. We demonstrate that our technique can produce reconstructions with higher quality than previous methods such as COLMAP and Kinect Fusion.
    Generation and frame characteristics of predefined evenly-distributed class centroids for pattern classification. (arXiv:2105.00401v2 [cs.CV] UPDATED)
    (2 min) Predefined evenly-distributed class centroids (PEDCC) can be widely used in models and algorithms of pattern classification, such as CNN classifiers, classification autoencoders, clustering, and semi-supervised learning, etc. Its basic idea is to predefine the class centers, which are evenly-distributed on the unit hypersphere in feature space, to maximize the inter-class distance. The previous method of generating PEDCC uses an iterative algorithm based on a charge model, that is, the initial values of various centers (charge positions) are randomly set from the normal distribution, and the charge positions are updated iteratively with the help of the repulsive force between charges of the same polarity. The class centers generated by the algorithm will produce some errors with the theoretically evenly-distributed points, and the generation time will be longer. This paper takes advantage of regular polyhedron in high-dimensional space and the evenly distribution of points on the n dimensional hypersphere to generate PEDCC mathematically. Then, we discussed the basic and extensive characteristics of the frames formed by PEDCC. Finally, experiments show that new algorithm is not only faster than the iterative method, but also more accurate in position. The mathematical analysis and experimental results of this paper can provide a theoretical tool for using PEDCC to solve the key problems in the field of pattern recognition, such as interpretable supervised/unsupervised learning, incremental learning, uncertainty analysis and so on.
    GEM: Glare or Gloom, I Can Still See You -- End-to-End Multimodal Object Detection. (arXiv:2102.12319v3 [cs.CV] UPDATED)
    (2 min) Deep neural networks designed for vision tasks are often prone to failure when they encounter environmental conditions not covered by the training data. Single-modal strategies are insufficient when the sensor fails to acquire information due to malfunction or its design limitations. Multi-sensor configurations are known to provide redundancy, increase reliability, and are crucial in achieving robustness against asymmetric sensor failures. To address the issue of changing lighting conditions and asymmetric sensor degradation in object detection, we develop a multi-modal 2D object detector, and propose deterministic and stochastic sensor-aware feature fusion strategies. The proposed fusion mechanisms are driven by the estimated sensor measurement reliability values/weights. Reliable object detection in harsh lighting conditions is essential for applications such as self-driving vehicles and human-robot interaction. We also propose a new "r-blended" hybrid depth modality for RGB-D sensors. Through extensive experimentation, we show that the proposed strategies outperform the existing state-of-the-art methods on the FLIR-Thermal dataset, and obtain promising results on the SUNRGB-D dataset. We additionally record a new RGB-Infra indoor dataset, namely L515-Indoors, and demonstrate that the proposed object detection methodologies are highly effective for a variety of lighting conditions.
    Focus U-Net: A novel dual attention-gated CNN for polyp segmentation during colonoscopy. (arXiv:2105.07467v2 [eess.IV] UPDATED)
    (2 min) Background: Colonoscopy remains the gold-standard screening for colorectal cancer. However, significant miss rates for polyps have been reported, particularly when there are multiple small adenomas. This presents an opportunity to leverage computer-aided systems to support clinicians and reduce the number of polyps missed. Method: In this work we introduce the Focus U-Net, a novel dual attention-gated deep neural network, which combines efficient spatial and channel-based attention into a single Focus Gate module to encourage selective learning of polyp features. The Focus U-Net further incorporates short-range skip connections and deep supervision. Furthermore, we introduce the Hybrid Focal loss, a new compound loss function based on the Focal loss and Focal Tversky loss, to handle class-imbalanced image segmentation. For our experiments, we selected five public datasets containing images of polyps obtained during optical colonoscopy: CVC-ClinicDB, Kvasir-SEG, CVC-ColonDB, ETIS-Larib PolypDB and EndoScene test set. To evaluate model performance, we use the Dice similarity coefficient (DSC) and Intersection over Union (IoU) metrics. Results: Our model achieves state-of-the-art results for both CVC-ClinicDB and Kvasir-SEG, with a mean DSC of 0.941 and 0.910, respectively. When evaluated on a combination of five public polyp datasets, our model similarly achieves state-of-the-art results with a mean DSC of 0.878 and mean IoU of 0.809, a 14% and 15% improvement over the previous state-of-the-art results of 0.768 and 0.702, respectively. Conclusions: This study shows the potential for deep learning to provide fast and accurate polyp segmentation results for use during colonoscopy. The Focus U-Net may be adapted for future use in newer non-invasive screening and more broadly to other biomedical image segmentation tasks involving class imbalance and requiring efficiency.
    Meta Adversarial Training against Universal Patches. (arXiv:2101.11453v2 [cs.LG] UPDATED)
    (2 min) Recently demonstrated physical-world adversarial attacks have exposed vulnerabilities in perception systems that pose severe risks for safety-critical applications such as autonomous driving. These attacks place adversarial artifacts in the physical world that indirectly cause the addition of a universal patch to inputs of a model that can fool it in a variety of contexts. Adversarial training is the most effective defense against image-dependent adversarial attacks. However, tailoring adversarial training to universal patches is computationally expensive since the optimal universal patch depends on the model weights which change during training. We propose meta adversarial training (MAT), a novel combination of adversarial training with meta-learning, which overcomes this challenge by meta-learning universal patches along with model training. MAT requires little extra computation while continuously adapting a large set of patches to the current model. MAT considerably increases robustness against universal patch attacks on image classification and traffic-light detection.
    Towards Solving Inefficiency of Self-supervised Representation Learning. (arXiv:2104.08760v2 [cs.CV] UPDATED)
    (2 min) Self-supervised learning (especially contrastive learning) has attracted great interest due to its tremendous potentials in learning discriminative representations in an unsupervised manner. Despite the acknowledged successes, existing contrastive learning methods suffer from very low learning efficiency, e.g., taking about ten times more training epochs than supervised learning for comparable recognition accuracy. In this paper, we discover two contradictory phenomena in contrastive learning that we call under-clustering and over-clustering problems, which are major obstacles to learning efficiency. Under-clustering means that the model cannot efficiently learn to discover the dissimilarity between inter-class samples when the negative sample pairs for contrastive learning are insufficient to differentiate all the actual object categories. Over-clustering implies that the model cannot efficiently learn the feature representation from excessive negative sample pairs, which enforces the model to over-cluster samples of the same actual categories into different clusters. To simultaneously overcome these two problems, we propose a novel self-supervised learning framework using a median triplet loss. Precisely, we employ a triplet loss tending to maximize the relative distance between the positive pair and negative pairs to address the under-clustering problem; and we construct the negative pair by selecting the negative sample of a median similarity score from all negative samples to avoid the over-clustering problem, guaranteed by the Bernoulli Distribution model. We extensively evaluate our proposed framework in several large-scale benchmarks (e.g., ImageNet, SYSU-30k, and COCO). The results demonstrate the superior performance (e.g., the learning efficiency) of our model over the latest state-of-the-art methods by a clear margin. Codes available at: https://github.com/wanggrun/triplet.
    Untrained networks for compressive lensless photography. (arXiv:2103.07609v2 [eess.IV] UPDATED)
    (2 min) Compressive lensless imagers enable novel applications in an extremely compact device, requiring only a phase or amplitude mask placed close to the sensor. They have been demonstrated for 2D and 3D microscopy, single-shot video, and single-shot hyperspectral imaging; in each of these cases, a compressive-sensing-based inverse problem is solved in order to recover a 3D data-cube from a 2D measurement. Typically, this is accomplished using convex optimization and hand-picked priors. Alternatively, deep learning-based reconstruction methods offer the promise of better priors, but require many thousands of ground truth training pairs, which can be difficult or impossible to acquire. In this work, we propose the use of untrained networks for compressive image recovery. Our approach does not require any labeled training data, but instead uses the measurement itself to update the network weights. We demonstrate our untrained approach on lensless compressive 2D imaging as well as single-shot high-speed video recovery using the camera's rolling shutter, and single-shot hyperspectral imaging. We provide simulation and experimental verification, showing that our method results in improved image quality over existing methods.
    Generate High Resolution Images With Generative Variational Autoencoder. (arXiv:2008.10399v3 [eess.IV] UPDATED)
    (2 min) In this work, we present a novel neural network to generate high resolution images. We replace the decoder of VAE with a discriminator while using the encoder as it is. The encoder is fed data from a normal distribution while the generator is fed from a gaussian distribution. The combination from both is given to a discriminator which tells whether the generated image is correct or not. We evaluate our network on 3 different datasets: MNIST, LSUN and CelebA dataset. Our network beats the previous state of the art using MMD, SSIM, log likelihood, reconstruction error, ELBO and KL divergence as the evaluation metrics while generating much sharper images. This work is potentially very exciting as we are able to combine the advantages of generative models and inference models in a principled bayesian manner.
    Unsupervised Object-Level Representation Learning from Scene Images. (arXiv:2106.11952v1 [cs.CV])
    (2 min) Contrastive self-supervised learning has largely narrowed the gap to supervised pre-training on ImageNet. However, its success highly relies on the object-centric priors of ImageNet, i.e., different augmented views of the same image correspond to the same object. Such a heavily curated constraint becomes immediately infeasible when pre-trained on more complex scene images with many objects. To overcome this limitation, we introduce Object-level Representation Learning (ORL), a new self-supervised learning framework towards scene images. Our key insight is to leverage image-level self-supervised pre-training as the prior to discover object-level semantic correspondence, thus realizing object-level representation learning from scene images. Extensive experiments on COCO show that ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks. Furthermore, ORL improves the downstream performance when more unlabeled scene images are available, demonstrating its great potential of harnessing unlabeled data in the wild. We hope our approach can motivate future research on more general-purpose unsupervised representation learning from scene data. Project page: https://www.mmlab-ntu.com/project/orl/.
    Lasry-Lions Envelopes and Nonconvex Optimization: A Homotopy Approach. (arXiv:2103.08533v2 [math.OC] UPDATED)
    (2 min) In large-scale optimization, the presence of nonsmooth and nonconvex terms in a given problem typically makes it hard to solve. A popular approach to address nonsmooth terms in convex optimization is to approximate them with their respective Moreau envelopes. In this work, we study the use of Lasry-Lions double envelopes to approximate nonsmooth terms that are also not convex. These envelopes are an extension of the Moreau ones but exhibit an additional smoothness property that makes them amenable to fast optimization algorithms. Lasry-Lions envelopes can also be seen as an "intermediate" between a given function and its convex envelope, and we make use of this property to develop a method that builds a sequence of approximate subproblems that are easier to solve than the original problem. We discuss convergence properties of this method when used to address composite minimization problems; additionally, based on a number of experiments, we discuss settings where it may be more useful than classical alternatives in two domains: signal decoding and spectral unmixing.
    Semantic Hierarchy Preserving Deep Hashing for Large-scale Image Retrieval. (arXiv:1901.11259v3 [cs.CV] UPDATED)
    (2 min) Deep hashing models have been proposed as an efficient method for large-scale similarity search. However, most existing deep hashing methods only utilize fine-level labels for training while ignoring the natural semantic hierarchy structure. This paper presents an effective method that preserves the classwise similarity of full-level semantic hierarchy for large-scale image retrieval. Experiments on two benchmark datasets show that our method helps improve the fine-level retrieval performance. Moreover, with the help of the semantic hierarchy, it can produce significantly better binary codes for hierarchical retrieval, which indicates its potential of providing more user-desired retrieval results.
    Lightweight Image Super-Resolution with Multi-scale Feature Interaction Network. (arXiv:2103.13028v2 [eess.IV] UPDATED)
    (2 min) Recently, the single image super-resolution (SISR) approaches with deep and complex convolutional neural network structures have achieved promising performance. However, those methods improve the performance at the cost of higher memory consumption, which is difficult to be applied for some mobile devices with limited storage and computing resources. To solve this problem, we present a lightweight multi-scale feature interaction network (MSFIN). For lightweight SISR, MSFIN expands the receptive field and adequately exploits the informative features of the low-resolution observed images from various scales and interactive connections. In addition, we design a lightweight recurrent residual channel attention block (RRCAB) so that the network can benefit from the channel attention mechanism while being sufficiently lightweight. Extensive experiments on some benchmarks have confirmed that our proposed MSFIN can achieve comparable performance against the state-of-the-arts with a more lightweight model.
    A Survey of Quantization Methods for Efficient Neural Network Inference. (arXiv:2103.13630v3 [cs.CV] UPDATED)
    (3 min) As soon as abstract mathematical computations were adapted to computation on digital computers, the problem of efficient representation, manipulation, and communication of the numerical values in those computations arose. Strongly related to the problem of numerical representation is the problem of quantization: in what manner should a set of continuous real-valued numbers be distributed over a fixed discrete set of numbers to minimize the number of bits required and also to maximize the accuracy of the attendant computations? This perennial problem of quantization is particularly relevant whenever memory and/or computational resources are severely restricted, and it has come to the forefront in recent years due to the remarkable performance of Neural Network models in computer vision, natural language processing, and related areas. Moving from floating-point representations to low-precision fixed integer values represented in four bits or less holds the potential to reduce the memory footprint and latency by a factor of 16x; and, in fact, reductions of 4x to 8x are often realized in practice in these applications. Thus, it is not surprising that quantization has emerged recently as an important and very active sub-area of research in the efficient implementation of computations associated with Neural Networks. In this article, we survey approaches to the problem of quantizing the numerical values in deep Neural Network computations, covering the advantages/disadvantages of current methods. With this survey and its organization, we hope to have presented a useful snapshot of the current research in quantization for Neural Networks and to have given an intelligent organization to ease the evaluation of future research in this area.
    Data Augmentation for Meta-Learning. (arXiv:2010.07092v2 [cs.LG] UPDATED)
    (2 min) Conventional image classifiers are trained by randomly sampling mini-batches of images. To achieve state-of-the-art performance, practitioners use sophisticated data augmentation schemes to expand the amount of training data available for sampling. In contrast, meta-learning algorithms sample support data, query data, and tasks on each training step. In this complex sampling scenario, data augmentation can be used not only to expand the number of images available per class, but also to generate entirely new classes/tasks. We systematically dissect the meta-learning pipeline and investigate the distinct ways in which data augmentation can be integrated at both the image and class levels. Our proposed meta-specific data augmentation significantly improves the performance of meta-learners on few-shot classification benchmarks.
    Obstacle Detection for BVLOS Drones. (arXiv:2106.11098v2 [cs.CV] UPDATED)
    (2 min) With the introduction of new regulations in the European Union, the future of Beyond Visual Line Of Sight (BVLOS) drones is set to bloom. This led to the creation of the theBEAST project, which aims to create an autonomous security drone, with focus on those regulations and on safety. This technical paper describes the first steps of a module within this project, which revolves around detecting obstacles so they can be avoided in a fail-safe landing. A deep learning powered object detection method is the subject of our research, and various experiments are held to maximize its performance, such as comparing various data augmentation techniques or YOLOv3 and YOLOv5. According to the results of the experiments, we conclude that although object detection is a promising approach to resolve this problem, more volume of data is required for potential usage in a real-life application.
    Supervised Momentum Contrastive Learning for Few-Shot Classification. (arXiv:2101.11058v2 [cs.CV] UPDATED)
    (2 min) Few-shot learning aims to transfer information from one task to enable generalization on novel tasks given a few examples. This information is present both in the domain and the class labels. In this work we investigate the complementary roles of these two sources of information by combining instance-discriminative contrastive learning and supervised learning in a single framework called Supervised Momentum Contrastive learning (SUPMOCO). Our approach avoids a problem observed in supervised learning where information in images not relevant to the task is discarded, which hampers their generalization to novel tasks. We show that (self-supervised) contrastive learning and supervised learning are mutually beneficial, leading to a new state-of-the-art on the META-DATASET - a recently introduced benchmark for few-shot learning. Our method is based on a simple modification of MOCO and scales better than prior work on combining supervised and self-supervised learning. This allows us to easily combine data from multiple domains leading to further improvements.
    RUHSNet: 3D Object Detection Using Lidar Data in Real Time. (arXiv:2006.01250v6 [cs.CV] UPDATED)
    (2 min) In this work, we address the problem of 3D object detection from point cloud data in real time. For autonomous vehicles to work, it is very important for the perception component to detect the real world objects with both high accuracy and fast inference. We propose a novel neural network architecture along with the training and optimization details for detecting 3D objects in point cloud data. We compare the results with different backbone architectures including the standard ones like VGG, ResNet, Inception with our backbone. Also we present the optimization and ablation studies including designing an efficient anchor. We use the Kitti 3D Birds Eye View dataset for benchmarking and validating our results. Our work surpasses the state of the art in this domain both in terms of average precision and speed running at > 30 FPS. This makes it a feasible option to be deployed in real time applications including self driving cars.
    On the importance of cross-task features for class-incremental learning. (arXiv:2106.11930v1 [cs.LG])
    (2 min) In class-incremental learning, an agent with limited resources needs to learn a sequence of classification tasks, forming an ever growing classification problem, with the constraint of not being able to access data from previous tasks. The main difference with task-incremental learning, where a task-ID is available at inference time, is that the learner also needs to perform cross-task discrimination, i.e. distinguish between classes that have not been seen together. Approaches to tackle this problem are numerous and mostly make use of an external memory (buffer) of non-negligible size. In this paper, we ablate the learning of cross-task features and study its influence on the performance of basic replay strategies used for class-IL. We also define a new forgetting measure for class-incremental learning, and see that forgetting is not the principal cause of low performance. Our experimental results show that future algorithms for class-incremental learning should not only prevent forgetting, but also aim to improve the quality of the cross-task features. This is especially important when the number of classes per task is small.
    MetaAvatar: Learning Animatable Clothed Human Models from Few Depth Images. (arXiv:2106.11944v1 [cs.CV])
    (2 min) In this paper, we aim to create generalizable and controllable neural signed distance fields (SDFs) that represent clothed humans from monocular depth observations. Recent advances in deep learning, especially neural implicit representations, have enabled human shape reconstruction and controllable avatar generation from different sensor inputs. However, to generate realistic cloth deformations from novel input poses, watertight meshes or dense full-body scans are usually needed as inputs. Furthermore, due to the difficulty of effectively modeling pose-dependent cloth deformations for diverse body shapes and cloth types, existing approaches resort to per-subject/cloth-type optimization from scratch, which is computationally expensive. In contrast, we propose an approach that can quickly generate realistic clothed human avatars, represented as controllable neural SDFs, given only monocular depth images. We achieve this by using meta-learning to learn an initialization of a hypernetwork that predicts the parameters of neural SDFs. The hypernetwork is conditioned on human poses and represents a clothed neural avatar that deforms non-rigidly according to the input poses. Meanwhile, it is meta-learned to effectively incorporate priors of diverse body shapes and cloth types and thus can be much faster to fine-tune, compared to models trained from scratch. We qualitatively and quantitatively show that our approach outperforms state-of-the-art approaches that require complete meshes as inputs while our approach requires only depth frames as inputs and runs orders of magnitudes faster. Furthermore, we demonstrate that our meta-learned hypernetwork is very robust, being the first to generate avatars with realistic dynamic cloth deformations given as few as 8 monocular depth frames.
    Skeleton-based Action Recognition via Spatial and Temporal Transformer Networks. (arXiv:2008.07404v4 [cs.CV] UPDATED)
    (2 min) Skeleton-based Human Activity Recognition has achieved great interest in recent years as skeleton data has demonstrated being robust to illumination changes, body scales, dynamic camera views, and complex background. In particular, Spatial-Temporal Graph Convolutional Networks (ST-GCN) demonstrated to be effective in learning both spatial and temporal dependencies on non-Euclidean data such as skeleton graphs. Nevertheless, an effective encoding of the latent information underlying the 3D skeleton is still an open problem, especially when it comes to extracting effective information from joint motion patterns and their correlations. In this work, we propose a novel Spatial-Temporal Transformer network (ST-TR) which models dependencies between joints using the Transformer self-attention operator. In our ST-TR model, a Spatial Self-Attention module (SSA) is used to understand intra-frame interactions between different body parts, and a Temporal Self-Attention module (TSA) to model inter-frame correlations. The two are combined in a two-stream network, whose performance is evaluated on three large-scale datasets, NTU-RGB+D 60, NTU-RGB+D 120, and Kinetics Skeleton 400, consistently improving backbone results. Compared with methods that use the same input data, the proposed ST-TR achieves state-of-the-art performance on all datasets when using joints' coordinates as input, and results on-par with state-of-the-art when adding bones information.
    Towards Reducing Labeling Cost in Deep Object Detection. (arXiv:2106.11921v1 [cs.CV])
    (2 min) Deep neural networks have reached very high accuracy on object detection but their success hinges on large amounts of labeled data. To reduce the dependency on labels, various active-learning strategies have been proposed, typically based on the confidence of the detector. However, these methods are biased towards best-performing classes and can lead to acquired datasets that are not good representatives of the data in the testing set. In this work, we propose a unified framework for active learning, that considers both the uncertainty and the robustness of the detector, ensuring that the network performs accurately in all classes. Furthermore, our method is able to pseudo-label the very confident predictions, suppressing a potential distribution drift while further boosting the performance of the model. Experiments show that our method comprehensively outperforms a wide range of active-learning methods on PASCAL VOC07+12 and MS-COCO, having up to a 7.7% relative improvement, or up to 82% reduction in labeling cost.
    Nutrition5k: Towards Automatic Nutritional Understanding of Generic Food. (arXiv:2103.03375v2 [cs.CV] UPDATED)
    (2 min) Understanding the nutritional content of food from visual data is a challenging computer vision problem, with the potential to have a positive and widespread impact on public health. Studies in this area are limited to existing datasets in the field that lack sufficient diversity or labels required for training models with nutritional understanding capability. We introduce Nutrition5k, a novel dataset of 5k diverse, real world food dishes with corresponding video streams, depth images, component weights, and high accuracy nutritional content annotation. We demonstrate the potential of this dataset by training a computer vision algorithm capable of predicting the caloric and macronutrient values of a complex, real world dish at an accuracy that outperforms professional nutritionists. Further we present a baseline for incorporating depth sensor data to improve nutrition predictions. We will publicly release Nutrition5k in the hope that it will accelerate innovation in the space of nutritional understanding.
    Robust Consistent Video Depth Estimation. (arXiv:2012.05901v2 [cs.CV] UPDATED)
    (2 min) We present an algorithm for estimating consistent dense depth maps and camera poses from a monocular video. We integrate a learning-based depth prior, in the form of a convolutional neural network trained for single-image depth estimation, with geometric optimization, to estimate a smooth camera trajectory as well as detailed and stable depth reconstruction. Our algorithm combines two complementary techniques: (1) flexible deformation-splines for low-frequency large-scale alignment and (2) geometry-aware depth filtering for high-frequency alignment of fine depth details. In contrast to prior approaches, our method does not require camera poses as input and achieves robust reconstruction for challenging hand-held cell phone captures containing a significant amount of noise, shake, motion blur, and rolling shutter deformations. Our method quantitatively outperforms state-of-the-arts on the Sintel benchmark for both depth and pose estimations and attains favorable qualitative results across diverse wild datasets.
    Data Quality as Predictor of Voice Anti-Spoofing Generalization. (arXiv:2103.14602v2 [eess.AS] UPDATED)
    (2 min) Voice anti-spoofing aims at classifying a given utterance either as a bonafide human sample, or a spoofing attack (e.g. synthetic or replayed sample). Many anti-spoofing methods have been proposed but most of them fail to generalize across domains (corpora) -- and we do not know \emph{why}. We outline a novel interpretative framework for gauging the impact of data quality upon anti-spoofing performance. Our within- and between-domain experiments pool data from seven public corpora and three anti-spoofing methods based on Gaussian mixture and convolutive neural network models. We assess the impacts of long-term spectral information, speaker population (through x-vector speaker embeddings), signal-to-noise ratio, and selected voice quality features.
    MEAL: Manifold Embedding-based Active Learning. (arXiv:2106.11858v1 [cs.CV])
    (2 min) Image segmentation is a common and challenging task in autonomous driving. Availability of sufficient pixel-level annotations for the training data is a hurdle. Active learning helps learning from small amounts of data by suggesting the most promising samples for labeling. In this work, we propose a new pool-based method for active learning, which proposes promising image regions, in each acquisition step. The problem is framed in an exploration-exploitation framework by combining an embedding based on Uniform Manifold Approximation to model representativeness with entropy as uncertainty measure to model informativeness. We applied our proposed method to the challenging autonomous driving data sets CamVid and Cityscapes and performed a quantitative comparison with state-of-the-art methods. We find that our active learning method achieves better performance on CamVid compared to other methods, while on Cityscapes, the performance lift was negligible.
    G-VAE, a Geometric Convolutional VAE for ProteinStructure Generation. (arXiv:2106.11920v1 [cs.CV])
    (2 min) Analyzing the structure of proteins is a key part of understanding their functions and thus their role in biology at the molecular level. In addition, design new proteins in a methodical way is a major engineering challenge. In this work, we introduce a joint geometric-neural networks approach for comparing, deforming and generating 3D protein structures. Viewing protein structures as 3D open curves, we adopt the Square Root Velocity Function (SRVF) representation and leverage its suitable geometric properties along with Deep Residual Networks (ResNets) for a joint registration and comparison. Our ResNets handle better large protein deformations while being more computationally efficient. On top of the mathematical framework, we further design a Geometric Variational Auto-Encoder (G-VAE), that once trained, maps original, previously unseen structures, into a low-dimensional (latent) hyper-sphere. Motivated by the spherical structure of the pre-shape space, we naturally adopt the von Mises-Fisher (vMF) distribution to model our hidden variables. We test the effectiveness of our models by generating novel protein structures and predicting completions of corrupted protein structures. Experimental results show that our method is able to generate plausible structures, different from the structures in the training data.
    nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. (arXiv:2106.11810v1 [cs.CV])
    (2 min) In this work, we propose the world's first closed-loop ML-based planning benchmark for autonomous driving. While there is a growing body of ML-based motion planners, the lack of established datasets and metrics has limited the progress in this area. Existing benchmarks for autonomous vehicle motion prediction have focused on short-term motion forecasting, rather than long-term planning. This has led previous works to use open-loop evaluation with L2-based metrics, which are not suitable for fairly evaluating long-term planning. Our benchmark overcomes these limitations by introducing a large-scale driving dataset, lightweight closed-loop simulator, and motion-planning-specific metrics. We provide a high-quality dataset with 1500h of human driving data from 4 cities across the US and Asia with widely varying traffic patterns (Boston, Pittsburgh, Las Vegas and Singapore). We will provide a closed-loop simulation framework with reactive agents and provide a large set of both general and scenario-specific planning metrics. We plan to release the dataset at NeurIPS 2021 and organize benchmark challenges starting in early 2022.
    Adversarial Robustness vs Model Compression, or Both?. (arXiv:1903.12561v5 [cs.CV] UPDATED)
    (2 min) It is well known that deep neural networks (DNNs) are vulnerable to adversarial attacks, which are implemented by adding crafted perturbations onto benign examples. Min-max robust optimization based adversarial training can provide a notion of security against adversarial attacks. However, adversarial robustness requires a significantly larger capacity of the network than that for the natural training with only benign examples. This paper proposes a framework of concurrent adversarial training and weight pruning that enables model compression while still preserving the adversarial robustness and essentially tackles the dilemma of adversarial training. Furthermore, this work studies two hypotheses about weight pruning in the conventional setting and finds that weight pruning is essential for reducing the network model size in the adversarial setting, training a small model from scratch even with inherited initialization from the large model cannot achieve both adversarial robustness and high standard accuracy. Code is available at https://github.com/yeshaokai/Robustness-Aware-Pruning-ADMM.
    Improving Ultrasound Tongue Image Reconstruction from Lip Images Using Self-supervised Learning and Attention Mechanism. (arXiv:2106.11769v1 [eess.AS])
    (2 min) Speech production is a dynamic procedure, which involved multi human organs including the tongue, jaw and lips. Modeling the dynamics of the vocal tract deformation is a fundamental problem to understand the speech, which is the most common way for human daily communication. Researchers employ several sensory streams to describe the process simultaneously, which are incontrovertibly statistically related to other streams. In this paper, we address the following question: given an observable image sequences of lips, can we picture the corresponding tongue motion. We formulated this problem as the self-supervised learning problem, and employ the two-stream convolutional network and long-short memory network for the learning task, with the attention mechanism. We evaluate the performance of the proposed method by leveraging the unlabeled lip videos to predict an upcoming ultrasound tongue image sequence. The results show that our model is able to generate images that close to the real ultrasound tongue images, and results in the matching between two imaging modalities.
    Prototypical Cross-Attention Networks for Multiple Object Tracking and Segmentation. (arXiv:2106.11958v1 [cs.CV])
    (2 min) Multiple object tracking and segmentation requires detecting, tracking, and segmenting objects belonging to a set of given classes. Most approaches only exploit the temporal dimension to address the association problem, while relying on single frame predictions for the segmentation mask itself. We propose Prototypical Cross-Attention Network (PCAN), capable of leveraging rich spatio-temporal information for online multiple object tracking and segmentation. PCAN first distills a space-time memory into a set of prototypes and then employs cross-attention to retrieve rich information from the past frames. To segment each object, PCAN adopts a prototypical appearance module to learn a set of contrastive foreground and background prototypes, which are then propagated over time. Extensive experiments demonstrate that PCAN outperforms current video instance tracking and segmentation competition winners on both Youtube-VIS and BDD100K datasets, and shows efficacy to both one-stage and two-stage segmentation frameworks. Code will be available at this http URL
    RootPainter3D: Interactive-machine-learning enables rapid and accurate contouring for radiotherapy. (arXiv:2106.11942v1 [cs.CV])
    (2 min) Organ-at-risk contouring is still a bottleneck in radiotherapy, with many deep learning methods falling short of promised results when evaluated on clinical data. We investigate the accuracy and time-savings resulting from the use of an interactive-machine-learning method for an organ-at-risk contouring task. We compare the method to the Eclipse contouring software and find strong agreement with manual delineations, with a dice score of 0.95. The annotations created using corrective-annotation also take less time to create as more images are annotated, resulting in substantial time savings compared to manual methods, with hearts that take 2 minutes and 2 seconds to delineate on average, after 923 images have been delineated, compared to 7 minutes and 1 seconds when delineating manually. Our experiment demonstrates that interactive-machine-learning with corrective-annotation provides a fast and accessible way for non computer-scientists to train deep-learning models to segment their own structures of interest as part of routine clinical workflows. Source code is available at \href{https://github.com/Abe404/RootPainter3D}{this HTTPS URL}.
    Data Augmentation for Opcode Sequence Based Malware Detection. (arXiv:2106.11821v1 [cs.CR])
    (2 min) Data augmentation has been successfully used in many areas of deep-learning to significantly improve model performance. Typically data augmentation simulates realistic variations in data in order to increase the apparent diversity of the training-set. However, for opcode-based malware analysis, where deep learning methods are already achieving state of the art performance, it is not immediately clear how to apply data augmentation. In this paper we study different methods of data augmentation starting with basic methods using fixed transformations and moving to methods that adapt to the data. We propose a novel data augmentation method based on using an opcode embedding layer within the network and its corresponding opcode embedding matrix to perform adaptive data augmentation during training. To the best of our knowledge this is the first paper to carry out a systematic study of different augmentation methods applied to opcode sequence based malware classification.
    From Points to Multi-Object 3D Reconstruction. (arXiv:2012.11575v3 [cs.CV] UPDATED)
    (2 min) We propose a method to detect and reconstruct multiple 3D objects from a single RGB image. The key idea is to optimize for detection, alignment and shape jointly over all objects in the RGB image, while focusing on realistic and physically plausible reconstructions. To this end, we propose a keypoint detector that localizes objects as center points and directly predicts all object properties, including 9-DoF bounding boxes and 3D shapes -- all in a single forward pass. The proposed method formulates 3D shape reconstruction as a shape selection problem, i.e. it selects among exemplar shapes from a given database. This makes it agnostic to shape representations, which enables a lightweight reconstruction of realistic and visually-pleasing shapes based on CAD-models, while the training objective is formulated around point clouds and voxel representations. A collision-loss promotes non-intersecting objects, further increasing the reconstruction realism. Given the RGB image, the presented approach performs lightweight reconstruction in a single-stage, it is real-time capable, fully differentiable and end-to-end trainable. Our experiments compare multiple approaches for 9-DoF bounding box estimation, evaluate the novel shape-selection mechanism and compare to recent methods in terms of 3D bounding box estimation and 3D shape reconstruction quality.
    Tracking Instances as Queries. (arXiv:2106.11963v1 [cs.CV])
    (2 min) Recently, query based deep networks catch lots of attention owing to their end-to-end pipeline and competitive results on several fundamental computer vision tasks, such as object detection, semantic segmentation, and instance segmentation. However, how to establish a query based video instance segmentation (VIS) framework with elegant architecture and strong performance remains to be settled. In this paper, we present \textbf{QueryTrack} (i.e., tracking instances as queries), a unified query based VIS framework fully leveraging the intrinsic one-to-one correspondence between instances and queries in QueryInst. The proposed method obtains 52.7 / 52.3 AP on YouTube-VIS-2019 / 2021 datasets, which wins the 2-nd place in the YouTube-VIS Challenge at CVPR 2021 \textbf{with a single online end-to-end model, single scale testing \& modest amount of training data}. We also provide QueryTrack-ResNet-50 baseline results on YouTube-VIS-2021 dataset as references for the VIS community.
    PALMAR: Towards Adaptive Multi-inhabitant Activity Recognition in Point-Cloud Technology. (arXiv:2106.11902v1 [cs.CV])
    (2 min) With the advancement of deep neural networks and computer vision-based Human Activity Recognition, employment of Point-Cloud Data technologies (LiDAR, mmWave) has seen a lot interests due to its privacy preserving nature. Given the high promise of accurate PCD technologies, we develop, PALMAR, a multiple-inhabitant activity recognition system by employing efficient signal processing and novel machine learning techniques to track individual person towards developing an adaptive multi-inhabitant tracking and HAR system. More specifically, we propose (i) a voxelized feature representation-based real-time PCD fine-tuning method, (ii) efficient clustering (DBSCAN and BIRCH), Adaptive Order Hidden Markov Model based multi-person tracking and crossover ambiguity reduction techniques and (iii) novel adaptive deep learning-based domain adaptation technique to improve the accuracy of HAR in presence of data scarcity and diversity (device, location and population diversity). We experimentally evaluate our framework and systems using (i) a real-time PCD collected by three devices (3D LiDAR and 79 GHz mmWave) from 6 participants, (ii) one publicly available 3D LiDAR activity data (28 participants) and (iii) an embedded hardware prototype system which provided promising HAR performances in multi-inhabitants (96%) scenario with a 63% improvement of multi-person tracking than state-of-art framework without losing significant system performances in the edge computing device.
    A Review of the Vision-based Approaches for Dietary Assessment. (arXiv:2106.11776v1 [cs.CV])
    (2 min) Dietary-related problems such as obesity are a growing concern in todays modern world. If the current trend continues, it is most likely that the quality of life, in general, is significantly affected since obesity is associated with other chronic diseases such as hypertension, irregular blood sugar levels, and increased risk of heart attacks. The primary cause of these problems is poor lifestyle choices and unhealthy dietary habits, with emphasis on a select few food groups such as sugars, fats, and carbohydrates. In this regard, computer-based food recognition offers automatic visual-based methods to assess dietary intake and help people make healthier choices. Thus, the following paper presents a brief review of visual-based methods for food recognition, including their accuracy, performance, and the use of popular food databases to evaluate existing models. The work further aims to highlight future challenges in this area. New high-quality studies for developing standard benchmarks and using continual learning methods for food recognition are recommended.
    A Comparison for Patch-level Classification of Deep Learning Methods on Transparent Images: from Convolutional Neural Networks to Visual Transformers. (arXiv:2106.11582v1 [cs.CV])
    (2 min) Nowadays, analysis of transparent images in the field of computer vision has gradually become a hot spot. In this paper, we compare the classification performance of different deep learning for the problem that transparent images are difficult to analyze. We crop the transparent images into 8 * 8 and 224 * 224 pixels patches in the same proportion, and then divide the two different pixels patches into foreground and background according to groundtruch. We also use 4 types of convolutional neural networks and a novel ViT network model to compare the foreground and background classification experiments. We conclude that ViT performs the worst in classifying 8 * 8 pixels patches, but it outperforms most convolutional neural networks in classifying 224 * 224.
    Enhanced Separable Disentanglement for Unsupervised Domain Adaptation. (arXiv:2106.11915v1 [cs.CV])
    (2 min) Domain adaptation aims to mitigate the domain gap when transferring knowledge from an existing labeled domain to a new domain. However, existing disentanglement-based methods do not fully consider separation between domain-invariant and domain-specific features, which means the domain-invariant features are not discriminative. The reconstructed features are also not sufficiently used during training. In this paper, we propose a novel enhanced separable disentanglement (ESD) model. We first employ a disentangler to distill domain-invariant and domain-specific features. Then, we apply feature separation enhancement processes to minimize contamination between domain-invariant and domain-specific features. Finally, our model reconstructs complete feature vectors, which are used for further disentanglement during the training phase. Extensive experiments from three benchmark datasets outperform state-of-the-art methods, especially on challenging cross-domain tasks.
    Part-Aware Measurement for Robust Multi-View Multi-Human 3D Pose Estimation and Tracking. (arXiv:2106.11589v1 [cs.CV])
    (2 min) This paper introduces an approach for multi-human 3D pose estimation and tracking based on calibrated multi-view. The main challenge lies in finding the cross-view and temporal correspondences correctly even when several human pose estimations are noisy. Compare to previous solutions that construct 3D poses from multiple views, our approach takes advantage of temporal consistency to match the 2D poses estimated with previously constructed 3D skeletons in every view. Therefore cross-view and temporal associations are accomplished simultaneously. Since the performance suffers from mistaken association and noisy predictions, we design two strategies for aiming better correspondences and 3D reconstruction. Specifically, we propose a part-aware measurement for 2D-3D association and a filter that can cope with 2D outliers during reconstruction. Our approach is efficient and effective comparing to state-of-the-art methods; it achieves competitive results on two benchmarks: 96.8% on Campus and 97.4% on Shelf. Moreover, we extends the length of Campus evaluation frames to be more challenging and our proposal also reach well-performed result.
    HybVIO: Pushing the Limits of Real-time Visual-inertial Odometry. (arXiv:2106.11857v1 [cs.CV])
    (2 min) We present HybVIO, a novel hybrid approach for combining filtering-based visual-inertial odometry (VIO) with optimization-based SLAM. The core of our method is highly robust, independent VIO with improved IMU bias modeling, outlier rejection, stationarity detection, and feature track selection, which is adjustable to run on embedded hardware. Long-term consistency is achieved with a loosely-coupled SLAM module. In academic benchmarks, our solution yields excellent performance in all categories, especially in the real-time use case, where we outperform the current state-of-the-art. We also demonstrate the feasibility of VIO for vehicular tracking on consumer-grade hardware using a custom dataset, and show good performance in comparison to current commercial VISLAM alternatives.
    Weakly-Supervised Temporal Action Localization Through Local-Global Background Modeling. (arXiv:2106.11811v1 [cs.CV])
    (2 min) Weakly-Supervised Temporal Action Localization (WS-TAL) task aims to recognize and localize temporal starts and ends of action instances in an untrimmed video with only video-level label supervision. Due to lack of negative samples of background category, it is difficult for the network to separate foreground and background, resulting in poor detection performance. In this report, we present our 2021 HACS Challenge - Weakly-supervised Learning Track solution that based on BaSNet to address above problem. Specifically, we first adopt pre-trained CSN, Slowfast, TDN, and ViViT as feature extractors to get feature sequences. Then our proposed Local-Global Background Modeling Network (LGBM-Net) is trained to localize instances by using only video-level labels based on Multi-Instance Learning (MIL). Finally, we ensemble multiple models to get the final detection results and reach 22.45% mAP on the test set
    Zero-Shot Chinese Character Recognition with Stroke-Level Decomposition. (arXiv:2106.11613v1 [cs.CV])
    (2 min) Chinese character recognition has attracted much research interest due to its wide applications. Although it has been studied for many years, some issues in this field have not been completely resolved yet, e.g. the zero-shot problem. Previous character-based and radical-based methods have not fundamentally addressed the zero-shot problem since some characters or radicals in test sets may not appear in training sets under a data-hungry condition. Inspired by the fact that humans can generalize to know how to write characters unseen before if they have learned stroke orders of some characters, we propose a stroke-based method by decomposing each character into a sequence of strokes, which are the most basic units of Chinese characters. However, we observe that there is a one-to-many relationship between stroke sequences and Chinese characters. To tackle this challenge, we employ a matching-based strategy to transform the predicted stroke sequence to a specific character. We evaluate the proposed method on handwritten characters, printed artistic characters, and scene characters. The experimental results validate that the proposed method outperforms existing methods on both character zero-shot and radical zero-shot tasks. Moreover, the proposed method can be easily generalized to other languages whose characters can be decomposed into strokes.
    Domain-Smoothing Network for Zero-Shot Sketch-Based Image Retrieval. (arXiv:2106.11841v1 [cs.CV])
    (2 min) Zero-Shot Sketch-Based Image Retrieval (ZS-SBIR) is a novel cross-modal retrieval task, where abstract sketches are used as queries to retrieve natural images under zero-shot scenario. Most existing methods regard ZS-SBIR as a traditional classification problem and employ a cross-entropy or triplet-based loss to achieve retrieval, which neglect the problems of the domain gap between sketches and natural images and the large intra-class diversity in sketches. Toward this end, we propose a novel Domain-Smoothing Network (DSN) for ZS-SBIR. Specifically, a cross-modal contrastive method is proposed to learn generalized representations to smooth the domain gap by mining relations with additional augmented samples. Furthermore, a category-specific memory bank with sketch features is explored to reduce intra-class diversity in the sketch domain. Extensive experiments demonstrate that our approach notably outperforms the state-of-the-art methods in both Sketchy and TU-Berlin datasets. Our source code is publicly available at https://github.com/haowang1992/DSN.
    Residual Networks as Flows of Velocity Fields for Diffeomorphic Time Series Alignment. (arXiv:2106.11911v1 [cs.CV])
    (2 min) Non-linear (large) time warping is a challenging source of nuisance in time-series analysis. In this paper, we propose a novel diffeomorphic temporal transformer network for both pairwise and joint time-series alignment. Our ResNet-TW (Deep Residual Network for Time Warping) tackles the alignment problem by compositing a flow of incremental diffeomorphic mappings. Governed by the flow equation, our Residual Network (ResNet) builds smooth, fluid and regular flows of velocity fields and consequently generates smooth and invertible transformations (i.e. diffeomorphic warping functions). Inspired by the elegant Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework, the final transformation is built by the flow of time-dependent vector fields which are none other than the building blocks of our Residual Network. The latter is naturally viewed as an Eulerian discretization schema of the flow equation (an ODE). Once trained, our ResNet-TW aligns unseen data by a single inexpensive forward pass. As we show in experiments on both univariate (84 datasets from UCR archive) and multivariate time-series (MSR Action-3D, Florence-3D and MSR Daily Activity), ResNet-TW achieves competitive performance in joint alignment and classification.
    Image Resizing by Reconstruction from Deep Features. (arXiv:1904.08475v2 [cs.CV] UPDATED)
    (2 min) Traditional image resizing methods usually work in pixel space and use various saliency measures. The challenge is to adjust the image shape while trying to preserve important content. In this paper we perform image resizing in feature space where the deep layers of a neural network contain rich important semantic information. We directly adjust the image feature maps, extracted from a pre-trained classification network, and reconstruct the resized image using a neural-network based optimization. This novel approach leverages the hierarchical encoding of the network, and in particular, the high-level discriminative power of its deeper layers, that recognizes semantic objects and regions and allows maintaining their aspect ratio. Our use of reconstruction from deep features diminishes the artifacts introduced by image-space resizing operators. We evaluate our method on benchmarks, compare to alternative approaches, and demonstrate its strength on challenging images.
    A Latent Transformer for Disentangled and Identity-Preserving Face Editing. (arXiv:2106.11895v1 [cs.CV])
    (2 min) High quality facial image editing is a challenging problem in the movie post-production industry, requiring a high degree of control and identity preservation. Previous works that attempt to tackle this problem may suffer from the entanglement of facial attributes and the loss of the person's identity. Furthermore, many algorithms are limited to a certain task. To tackle these limitations, we propose to edit facial attributes via the latent space of a StyleGAN generator, by training a dedicated latent transformation network and incorporating explicit disentanglement and identity preservation terms in the loss function. We further introduce a pipeline to generalize our face editing to videos. Our model achieves a disentangled, controllable, and identity-preserving facial attribute editing, even in the challenging case of real (i.e., non-synthetic) images and videos. We conduct extensive experiments on image and video datasets and show that our model outperforms other state-of-the-art methods in visual quality and quantitative evaluation.
    Confidence-Aware Learning for Camouflaged Object Detection. (arXiv:2106.11641v1 [cs.CV])
    (2 min) Confidence-aware learning is proven as an effective solution to prevent networks becoming overconfident. We present a confidence-aware camouflaged object detection framework using dynamic supervision to produce both accurate camouflage map and meaningful "confidence" representing model awareness about the current prediction. A camouflaged object detection network is designed to produce our camouflage prediction. Then, we concatenate it with the input image and feed it to the confidence estimation network to produce an one channel confidence map.We generate dynamic supervision for the confidence estimation network, representing the agreement of camouflage prediction with the ground truth camouflage map. With the produced confidence map, we introduce confidence-aware learning with the confidence map as guidance to pay more attention to the hard/low-confidence pixels in the loss function. We claim that, once trained, our confidence estimation network can evaluate pixel-wise accuracy of the prediction without relying on the ground truth camouflage map. Extensive results on four camouflaged object detection testing datasets illustrate the superior performance of the proposed model in explaining the camouflage prediction.
    DeepMesh: Differentiable Iso-Surface Extraction. (arXiv:2106.11795v1 [cs.CV])
    (2 min) Geometric Deep Learning has recently made striking progress with the advent of continuous Deep Implicit Fields. They allow for detailed modeling of watertight surfaces of arbitrary topology while not relying on a 3D Euclidean grid, resulting in a learnable parameterization that is unlimited in resolution. Unfortunately, these methods are often unsuitable for applications that require an explicit mesh-based surface representation because converting an implicit field to such a representation relies on the Marching Cubes algorithm, which cannot be differentiated with respect to the underlying implicit field. In this work, we remove this limitation and introduce a differentiable way to produce explicit surface mesh representations from Deep Implicit Fields. Our key insight is that by reasoning on how implicit field perturbations impact local surface geometry, one can ultimately differentiate the 3D location of surface samples with respect to the underlying deep implicit field. We exploit this to define DeepMesh -- end-to-end differentiable mesh representation that can vary its topology. We use two different applications to validate our theoretical insight: Single view 3D Reconstruction via Differentiable Rendering and Physically-Driven Shape Optimization. In both cases our end-to-end differentiable parameterization gives us an edge over state-of-the-art algorithms.
    Evaluation of a Region Proposal Architecture for Multi-task Document Layout Analysis. (arXiv:2106.11797v1 [cs.CV])
    (2 min) Automatically recognizing the layout of handwritten documents is an important step towards useful extraction of information from those documents. The most common application is to feed downstream applications such as automatic text recognition and keyword spotting; however, the recognition of the layout also helps to establish relationships between elements in the document which allows to enrich the information that can be extracted. Most of the modern document layout analysis systems are designed to address only one part of the document layout problem, namely: baseline detection or region segmentation. In contrast, we evaluate the effectiveness of the Mask-RCNN architecture to address the problem of baseline detection and region segmentation in an integrated manner. We present experimental results on two handwritten text datasets and one handwritten music dataset. The analyzed architecture yields promising results, outperforming state-of-the-art techniques in all three datasets.
    Give Me Your Trained Model: Domain Adaptive Semantic Segmentation without Source Data. (arXiv:2106.11653v1 [cs.CV])
    (2 min) Benefited from considerable pixel-level annotations collected from a specific situation (source), the trained semantic segmentation model performs quite well, but fails in a new situation (target) due to the large domain shift. To mitigate the domain gap, previous cross-domain semantic segmentation methods always assume the co-existence of source data and target data during distribution alignment. However, the access to source data in the real scenario may raise privacy concerns and violate intellectual property. To tackle this problem, we focus on an interesting and challenging cross-domain semantic segmentation task where only the trained source model is provided to the target domain, and further propose a unified framework called Domain Adaptive Semantic Segmentation without Source data (DAS$^3$ for short). Specifically, DAS$^3$ consists of three schemes, i.e., feature alignment, self-training, and information propagation. First, we mainly develop a focal entropic loss on the network outputs to implicitly align the target features with unseen source features via the provided source model. Second, besides positive pseudo labels in vanilla self-training, we first introduce negative pseudo labels to the field and develop a bi-directional self-training strategy to enhance the representation learning in the target domain. Finally, the information propagation scheme further reduces the intra-domain discrepancy within the target domain via pseudo semi-supervised learning. Extensive results on synthesis-to-real and cross-city driving datasets validate DAS$^3$ yields state-of-the-art performance, even on par with methods that need access to source data.
    A Stealthy and Robust Fingerprinting Scheme for Generative Models. (arXiv:2106.11760v1 [cs.CR])
    (2 min) This paper presents a novel fingerprinting methodology for the Intellectual Property protection of generative models. Prior solutions for discriminative models usually adopt adversarial examples as the fingerprints, which give anomalous inference behaviors and prediction results. Hence, these methods are not stealthy and can be easily recognized by the adversary. Our approach leverages the invisible backdoor technique to overcome the above limitation. Specifically, we design verification samples, whose model outputs look normal but can trigger a backdoor classifier to make abnormal predictions. We propose a new backdoor embedding approach with Unique-Triplet Loss and fine-grained categorization to enhance the effectiveness of our fingerprints. Extensive evaluations show that this solution can outperform other strategies with higher robustness, uniqueness and stealthiness for various GAN models.
    RGB2Hands: Real-Time Tracking of 3D Hand Interactions from Monocular RGB Video. (arXiv:2106.11725v1 [cs.CV])
    (2 min) Tracking and reconstructing the 3D pose and geometry of two hands in interaction is a challenging problem that has a high relevance for several human-computer interaction applications, including AR/VR, robotics, or sign language recognition. Existing works are either limited to simpler tracking settings (e.g., considering only a single hand or two spatially separated hands), or rely on less ubiquitous sensors, such as depth cameras. In contrast, in this work we present the first real-time method for motion capture of skeletal pose and 3D surface geometry of hands from a single RGB camera that explicitly considers close interactions. In order to address the inherent depth ambiguities in RGB data, we propose a novel multi-task CNN that regresses multiple complementary pieces of information, including segmentation, dense matchings to a 3D hand model, and 2D keypoint positions, together with newly proposed intra-hand relative depth and inter-hand distance maps. These predictions are subsequently used in a generative model fitting framework in order to estimate pose and shape parameters of a 3D hand model for both hands. We experimentally verify the individual components of our RGB two-hand tracking and 3D reconstruction pipeline through an extensive ablation study. Moreover, we demonstrate that our approach offers previously unseen two-hand tracking performance from RGB, and quantitatively and qualitatively outperforms existing RGB-based methods that were not explicitly designed for two-hand interactions. Moreover, our method even performs on-par with depth-based real-time methods.
    Multi-layered Semantic Representation Network for Multi-label Image Classification. (arXiv:2106.11596v1 [cs.CV])
    (2 min) Multi-label image classification (MLIC) is a fundamental and practical task, which aims to assign multiple possible labels to an image. In recent years, many deep convolutional neural network (CNN) based approaches have been proposed which model label correlations to discover semantics of labels and learn semantic representations of images. This paper advances this research direction by improving both the modeling of label correlations and the learning of semantic representations. On the one hand, besides the local semantics of each label, we propose to further explore global semantics shared by multiple labels. On the other hand, existing approaches mainly learn the semantic representations at the last convolutional layer of a CNN. But it has been noted that the image representations of different layers of CNN capture different levels or scales of features and have different discriminative abilities. We thus propose to learn semantic representations at multiple convolutional layers. To this end, this paper designs a Multi-layered Semantic Representation Network (MSRN) which discovers both local and global semantics of labels through modeling label correlations and utilizes the label semantics to guide the semantic representations learning at multiple layers through an attention mechanism. Extensive experiments on four benchmark datasets including VOC 2007, COCO, NUS-WIDE, and Apparel show a competitive performance of the proposed MSRN against state-of-the-art models.
    Proposal Relation Network for Temporal Action Detection. (arXiv:2106.11812v1 [cs.CV])
    (2 min) This technical report presents our solution for temporal action detection task in AcitivityNet Challenge 2021. The purpose of this task is to locate and identify actions of interest in long untrimmed videos. The crucial challenge of the task comes from that the temporal duration of action varies dramatically, and the target actions are typically embedded in a background of irrelevant activities. Our solution builds on BMN, and mainly contains three steps: 1) action classification and feature encoding by Slowfast, CSN and ViViT; 2) proposal generation. We improve BMN by embedding the proposed Proposal Relation Network (PRN), by which we can generate proposals of high quality; 3) action detection. We calculate the detection results by assigning the proposals with corresponding classification results. Finally, we ensemble the results under different settings and achieve 44.7% on the test set, which improves the champion result in ActivityNet 2020 by 1.9% in terms of average mAP.
    MIMIR: Deep Regression for Automated Analysis of UK Biobank Body MRI. (arXiv:2106.11731v1 [eess.IV])
    (2 min) UK Biobank (UKB) is conducting a large-scale study of more than half a million volunteers, collecting health-related information on genetics, lifestyle, blood biochemistry, and more. Medical imaging furthermore targets 100,000 subjects, with 70,000 follow-up sessions, enabling measurements of organs, muscle, and body composition. With up to 170,000 mounting MR images, various methodologies are accordingly engaged in large-scale image analysis. This work presents an experimental inference engine that can automatically predict a comprehensive profile of subject metadata from UKB neck-to-knee body MRI. In cross-validation, it accurately inferred baseline characteristics such as age, height, weight, and sex, but also emulated measurements of body composition by DXA, organ volumes, and abstract properties like grip strength, pulse rate, and type 2 diabetic status (AUC: 0.866). The proposed system can automatically analyze thousands of subjects within hours and provide individual confidence intervals. The underlying methodology is based on convolutional neural networks for image-based mean-variance regression on two-dimensional representations of the MRI data. This work aims to make the proposed system available for free to researchers, who can use it to obtain fast and fully-automated estimates of 72 different measurements immediately upon release of new UK Biobank image data.
    Self-Supervised Iterative Contextual Smoothing for Efficient Adversarial Defense against Gray- and Black-Box Attack. (arXiv:2106.11644v1 [cs.CV])
    (2 min) We propose a novel and effective input transformation based adversarial defense method against gray- and black-box attack, which is computationally efficient and does not require any adversarial training or retraining of a classification model. We first show that a very simple iterative Gaussian smoothing can effectively wash out adversarial noise and achieve substantially high robust accuracy. Based on the observation, we propose Self-Supervised Iterative Contextual Smoothing (SSICS), which aims to reconstruct the original discriminative features from the Gaussian-smoothed image in context-adaptive manner, while still smoothing out the adversarial noise. From the experiments on ImageNet, we show that our SSICS achieves both high standard accuracy and very competitive robust accuracy for the gray- and black-box attacks; e.g., transfer-based PGD-attack and score-based attack. A note-worthy point to stress is that our defense is free of computationally expensive adversarial training, yet, can approach its robust accuracy via input transformation.
    Creating A New Color Space utilizing PSO and FCM to Perform Skin Detection by using Neural Network and ANFIS. (arXiv:2106.11563v1 [cs.CV])
    (2 min) Skin color detection is an essential required step in various applications related to computer vision. These applications will include face detection, finding pornographic images in movies and photos, finding ethnicity, age, diagnosis, and so on. Therefore, proposing a proper skin detection method can provide solution to several problems. In this study, first a new color space is created using FCM and PSO algorithms. Then, skin classification has been performed in the new color space utilizing linear and nonlinear modes. Additionally, it has been done in RGB and LAB color spaces by using ANFIS and neural network. Skin detection in RBG color space has been performed using Mahalanobis distance and Euclidean distance algorithms. In comparison, this method has 18.38% higher accuracy than the most accurate method on the same database. Additionally, this method has achieved 90.05% in equal error rate (1-EER) in testing COMPAQ dataset and 92.93% accuracy in testing Pratheepan dataset, which compared to the previous method on COMPAQ database, 1-EER has increased by %0.87.
    Trinity: A No-Code AI platform for complex spatial datasets. (arXiv:2106.11756v1 [cs.SE])
    (2 min) We present a no-code Artificial Intelligence (AI) platform called Trinity with the main design goal of enabling both machine learning researchers and non-technical geospatial domain experts to experiment with domain-specific signals and datasets for solving a variety of complex problems on their own. This versatility to solve diverse problems is achieved by transforming complex Spatio-temporal datasets to make them consumable by standard deep learning models, in this case, Convolutional Neural Networks (CNNs), and giving the ability to formulate disparate problems in a standard way, eg. semantic segmentation. With an intuitive user interface, a feature store that hosts derivatives of complex feature engineering, a deep learning kernel, and a scalable data processing mechanism, Trinity provides a powerful platform for domain experts to share the stage with scientists and engineers in solving business-critical problems. It enables quick prototyping, rapid experimentation and reduces the time to production by standardizing model building and deployment. In this paper, we present our motivation behind Trinity and its design along with showcasing sample applications to motivate the idea of lowering the bar to using AI.
    SSUL: Semantic Segmentation with Unknown Label for Exemplar-based Class-Incremental Learning. (arXiv:2106.11562v1 [cs.CV])
    (2 min) We consider a class-incremental semantic segmentation (CISS) problem. While some recently proposed algorithms utilized variants of knowledge distillation (KD) technique to tackle the problem, they only partially addressed the key additional challenges in CISS that causes the catastrophic forgetting; i.e., the semantic drift of the background class and multi-label prediction issue. To better address these challenges, we propose a new method, dubbed as SSUL-M (Semantic Segmentation with Unknown Label with Memory), by carefully combining several techniques tailored for semantic segmentation. More specifically, we make three main contributions; (1) modeling unknown class within the background class to help learning future classes (help plasticity), (2) freezing backbone network and past classifiers with binary cross-entropy loss and pseudo-labeling to overcome catastrophic forgetting (help stability), and (3) utilizing tiny exemplar memory for the first time in CISS to improve both plasticity and stability. As a result, we show our method achieves significantly better performance than the recent state-of-the-art baselines on the standard benchmark datasets. Furthermore, we justify our contributions with thorough and extensive ablation analyses and discuss different natures of the CISS problem compared to the standard class-incremental learning for classification.
    Analysis and Tuning of a Voice Assistant System for Dysfluent Speech. (arXiv:2106.11759v1 [eess.AS])
    (2 min) Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice operated systems do not work. Current speech recognition systems are trained primarily with data from fluent speakers and as a consequence do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks. The focus of this work is on quantitative analysis of a consumer speech recognition system on individuals who stutter and production-oriented approaches for improving performance for common voice assistant tasks (i.e., "what is the weather?"). At baseline, this system introduces a significant number of insertion and substitution errors resulting in intended speech Word Error Rates (isWER) that are 13.64\% worse (absolute) for individuals with fluency disorders. We show that by simply tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24\% (relative) for individuals with fluency disorders. Tuning these parameters translates to 3.6\% better domain recognition and 1.7\% better intent recognition relative to the default setup for the 18 study participants across all stuttering severities.
    Hand-Drawn Electrical Circuit Recognition using Object Detection and Node Recognition. (arXiv:2106.11559v1 [cs.CV])
    (2 min) With the recent developments in neural networks, there has been a resurgence in algorithms for the automatic generation of simulation ready electronic circuits from hand-drawn circuits. However, most of the approaches in literature were confined to classify different types of electrical components and only a few of those methods have shown a way to rebuild the circuit schematic from the scanned image, which is extremely important for further automation of netlist generation. This paper proposes a real-time algorithm for the automatic recognition of hand-drawn electrical circuits based on object detection and circuit node recognition. The proposed approach employs You Only Look Once version 5 (YOLOv5) for detection of circuit components and a novel Hough transform based approach for node recognition. Using YOLOv5 object detection algorithm, a mean average precision (mAP0.5) of 98.2% is achieved in detecting the components. The proposed method is also able to rebuild the circuit schematic with 80% accuracy.
    The Hitchhiker's Guide to Prior-Shift Adaptation. (arXiv:2106.11695v1 [cs.CV])
    (2 min) In many computer vision classification tasks, class priors at test time often differ from priors on the training set. In the case of such prior shift, classifiers must be adapted correspondingly to maintain close to optimal performance. This paper analyzes methods for adaptation of probabilistic classifiers to new priors and for estimating new priors on an unlabeled test set. We propose a novel method to address a known issue of prior estimation methods based on confusion matrices, where inconsistent estimates of decision probabilities and confusion matrices lead to negative values in the estimated priors. Experiments on fine-grained image classification datasets provide insight into the best practice of prior shift estimation and classifier adaptation and show that the proposed method achieves state-of-the-art results in prior adaptation. Applying the best practice to two tasks with naturally imbalanced priors, learning from web-crawled images and plant species classification, increased the recognition accuracy by 1.1% and 3.4% respectively.
    A Survey on Human-aware Robot Navigation. (arXiv:2106.11650v1 [cs.RO])
    (2 min) Intelligent systems are increasingly part of our everyday lives and have been integrated seamlessly to the point where it is difficult to imagine a world without them. Physical manifestations of those systems on the other hand, in the form of embodied agents or robots, have so far been used only for specific applications and are often limited to functional roles (e.g. in the industry, entertainment and military fields). Given the current growth and innovation in the research communities concerned with the topics of robot navigation, human-robot-interaction and human activity recognition, it seems like this might soon change. Robots are increasingly easy to obtain and use and the acceptance of them in general is growing. However, the design of a socially compliant robot that can function as a companion needs to take various areas of research into account. This paper is concerned with the navigation aspect of a socially-compliant robot and provides a survey of existing solutions for the relevant areas of research as well as an outlook on possible future directions.
    Learning-Based Practical Light Field Image Compression Using A Disparity-Aware Model. (arXiv:2106.11558v1 [eess.IV])
    (2 min) Light field technology has increasingly attracted the attention of the research community with its many possible applications. The lenslet array in commercial plenoptic cameras helps capture both the spatial and angular information of light rays in a single exposure. While the resulting high dimensionality of light field data enables its superior capabilities, it also impedes its extensive adoption. Hence, there is a compelling need for efficient compression of light field images. Existing solutions are commonly composed of several separate modules, some of which may not have been designed for the specific structure and quality of light field data. This increases the complexity of the codec and results in impractical decoding runtimes. We propose a new learning-based, disparity-aided model for compression of 4D light field images capable of parallel decoding. The model is end-to-end trainable, eliminating the need for hand-tuning separate modules and allowing joint learning of rate and distortion. The disparity-aided approach ensures the structural integrity of the reconstructed light fields. Comparisons with the state of the art show encouraging performance in terms of PSNR and MS-SSIM metrics. Also, there is a notable gain in the encoding and decoding runtimes. Source code is available at https://moha23.github.io/LFDAAE.
    Universal Domain Adaptation in Ordinal Regression. (arXiv:2106.11576v1 [cs.CV])
    (2 min) We address the problem of universal domain adaptation (UDA) in ordinal regression (OR), which attempts to solve classification problems in which labels are not independent, but follow a natural order. We show that the UDA techniques developed for classification and based on the clustering assumption, under-perform in OR settings. We propose a method that complements the OR classifier with an auxiliary task of order learning, which plays the double role of discriminating between common and private instances, and expanding class labels to the private target images via ranking. Combined with adversarial domain discrimination, our model is able to address the closed set, partial and open set configurations. We evaluate our method on three face age estimation datasets, and show that it outperforms the baseline methods.
    Winning the CVPR'2021 Kinetics-GEBD Challenge: Contrastive Learning Approach. (arXiv:2106.11549v1 [cs.CV])
    (2 min) Generic Event Boundary Detection (GEBD) is a newly introduced task that aims to detect "general" event boundaries that correspond to natural human perception. In this paper, we introduce a novel contrastive learning based approach to deal with the GEBD. Our intuition is that the feature similarity of the video snippet would significantly vary near the event boundaries, while remaining relatively the same in the remaining part of the video. In our model, Temporal Self-similarity Matrix (TSM) is utilized as an intermediate representation which takes on a role as an information bottleneck. With our model, we achieved significant performance boost compared to the given baselines. Our code is available at https://github.com/hello-jinwoo/LOVEU-CVPR2021.
    Differentiable Architecture Search Without Training Nor Labels: A Pruning Perspective. (arXiv:2106.11542v1 [cs.LG])
    (2 min) With leveraging the weight-sharing and continuous relaxation to enable gradient-descent to alternately optimize the supernet weights and the architecture parameters through a bi-level optimization paradigm, \textit{Differentiable ARchiTecture Search} (DARTS) has become the mainstream method in Neural Architecture Search (NAS) due to its simplicity and efficiency. However, more recent works found that the performance of the searched architecture barely increases with the optimization proceeding in DARTS. In addition, several concurrent works show that the NAS could find more competitive architectures without labels. The above observations reveal that the supervision signal in DARTS may be a poor indicator for architecture optimization, inspiring a foundational question: instead of using the supervision signal to perform bi-level optimization, \textit{can we find high-quality architectures \textbf{without any training nor labels}}? We provide an affirmative answer by customizing the NAS as a network pruning at initialization problem. By leveraging recent techniques on the network pruning at initialization, we designed a FreeFlow proxy to score the importance of candidate operations in NAS without any training nor labels, and proposed a novel framework called \textit{training and label free neural architecture search} (\textbf{FreeNAS}) accordingly. We show that, without any training nor labels, FreeNAS with the proposed FreeFlow proxy can outperform most NAS baselines. More importantly, our framework is extremely efficient, which completes the architecture search within only \textbf{3.6s} and \textbf{79s} on a single GPU for the NAS-Bench-201 and DARTS search space, respectively. We hope our work inspires more attempts in solving NAS from the perspective of pruning at initialization.
    SA-LOAM: Semantic-aided LiDAR SLAM with Loop Closure. (arXiv:2106.11516v1 [cs.RO])
    (2 min) LiDAR-based SLAM system is admittedly more accurate and stable than others, while its loop closure detection is still an open issue. With the development of 3D semantic segmentation for point cloud, semantic information can be obtained conveniently and steadily, essential for high-level intelligence and conductive to SLAM. In this paper, we present a novel semantic-aided LiDAR SLAM with loop closure based on LOAM, named SA-LOAM, which leverages semantics in odometry as well as loop closure detection. Specifically, we propose a semantic-assisted ICP, including semantically matching, downsampling and plane constraint, and integrates a semantic graph-based place recognition method in our loop closure detection module. Benefitting from semantics, we can improve the localization accuracy, detect loop closures effectively, and construct a global consistent semantic map even in large-scale scenes. Extensive experiments on KITTI and Ford Campus dataset show that our system significantly improves baseline performance, has generalization ability to unseen data and achieves competitive results compared with state-of-the-art methods.
    DocFormer: End-to-End Transformer for Document Understanding. (arXiv:2106.11539v1 [cs.CV])
    (2 min) We present DocFormer -- a multi-modal transformer based architecture for the task of Visual Document Understanding (VDU). VDU is a challenging problem which aims to understand documents in their varied formats (forms, receipts etc.) and layouts. In addition, DocFormer is pre-trained in an unsupervised fashion using carefully designed tasks which encourage multi-modal interaction. DocFormer uses text, vision and spatial features and combines them using a novel multi-modal self-attention layer. DocFormer also shares learned spatial embeddings across modalities which makes it easy for the model to correlate text to visual tokens and vice versa. DocFormer is evaluated on 4 different datasets each with strong baselines. DocFormer achieves state-of-the-art results on all of them, sometimes beating models 4x its size (in no. of parameters).
    Deep3DPose: Realtime Reconstruction of Arbitrarily Posed Human Bodies from Single RGB Images. (arXiv:2106.11536v1 [cs.CV])
    (2 min) We introduce an approach that accurately reconstructs 3D human poses and detailed 3D full-body geometric models from single images in realtime. The key idea of our approach is a novel end-to-end multi-task deep learning framework that uses single images to predict five outputs simultaneously: foreground segmentation mask, 2D joints positions, semantic body partitions, 3D part orientations and uv coordinates (uv map). The multi-task network architecture not only generates more visual cues for reconstruction, but also makes each individual prediction more accurate. The CNN regressor is further combined with an optimization based algorithm for accurate kinematic pose reconstruction and full-body shape modeling. We show that the realtime reconstruction reaches accurate fitting that has not been seen before, especially for wild images. We demonstrate the results of our realtime 3D pose and human body reconstruction system on various challenging in-the-wild videos. We show the system advances the frontier of 3D human body and pose reconstruction from single images by quantitative evaluations and comparisons with state-of-the-art methods.
    VoxelEmbed: 3D Instance Segmentation and Tracking with Voxel Embedding based Deep Learning. (arXiv:2106.11480v1 [cs.CV])
    (2 min) Recent advances in bioimaging have provided scientists a superior high spatial-temporal resolution to observe dynamics of living cells as 3D volumetric videos. Unfortunately, the 3D biomedical video analysis is lagging, impeded by resource insensitive human curation using off-the-shelf 3D analytic tools. Herein, biologists often need to discard a considerable amount of rich 3D spatial information by compromising on 2D analysis via maximum intensity projection. Recently, pixel embedding-based cell instance segmentation and tracking provided a neat and generalizable computing paradigm for understanding cellular dynamics. In this work, we propose a novel spatial-temporal voxel-embedding (VoxelEmbed) based learning method to perform simultaneous cell instance segmenting and tracking on 3D volumetric video sequences. Our contribution is in four-fold: (1) The proposed voxel embedding generalizes the pixel embedding with 3D context information; (2) Present a simple multi-stream learning approach that allows effective spatial-temporal embedding; (3) Accomplished an end-to-end framework for one-stage 3D cell instance segmentation and tracking without heavy parameter tuning; (4) The proposed 3D quantification is memory efficient via a single GPU with 12 GB memory. We evaluate our VoxelEmbed method on four 3D datasets (with different cell types) from the ISBI Cell Tracking Challenge. The proposed VoxelEmbed method achieved consistent superior overall performance (OP) on two densely annotated datasets. The performance is also competitive on two sparsely annotated cohorts with 20.6% and 2% of data-set having segmentation annotations. The results demonstrate that the VoxelEmbed method is a generalizable and memory-efficient solution.
    Multimodal trajectory forecasting based on discrete heat map. (arXiv:2106.11467v1 [cs.CV])
    (2 min) In Argoverse motion forecasting competition, the task is to predict the probabilistic future trajectory distribution for the interested targets in the traffic scene. We use vectorized lane map and 2 s targets' history trajectories as input. Then the model outputs 6 forecasted trajectories with probability for each target.
    Wallpaper Texture Generation and Style Transfer Based on Multi-label Semantics. (arXiv:2106.11482v1 [cs.CV])
    (2 min) Textures contain a wealth of image information and are widely used in various fields such as computer graphics and computer vision. With the development of machine learning, the texture synthesis and generation have been greatly improved. As a very common element in everyday life, wallpapers contain a wealth of texture information, making it difficult to annotate with a simple single label. Moreover, wallpaper designers spend significant time to create different styles of wallpaper. For this purpose, this paper proposes to describe wallpaper texture images by using multi-label semantics. Based on these labels and generative adversarial networks, we present a framework for perception driven wallpaper texture generation and style transfer. In this framework, a perceptual model is trained to recognize whether the wallpapers produced by the generator network are sufficiently realistic and have the attribute designated by given perceptual description; these multi-label semantic attributes are treated as condition variables to generate wallpaper images. The generated wallpaper images can be converted to those with well-known artist styles using CycleGAN. Finally, using the aesthetic evaluation method, the generated wallpaper images are quantitatively measured. The experimental results demonstrate that the proposed method can generate wallpaper textures conforming to human aesthetics and have artistic characteristics.
    Kernel Clustering with Sigmoid-based Regularization for Efficient Segmentation of Sequential Data. (arXiv:2106.11541v1 [cs.LG])
    (2 min) Kernel segmentation aims at partitioning a data sequence into several non-overlapping segments that may have nonlinear and complex structures. In general, it is formulated as a discrete optimization problem with combinatorial constraints. A popular algorithm for optimally solving this problem is dynamic programming (DP), which has quadratic computation and memory requirements. Given that sequences in practice are too long, this algorithm is not a practical approach. Although many heuristic algorithms have been proposed to approximate the optimal segmentation, they have no guarantee on the quality of their solutions. In this paper, we take a differentiable approach to alleviate the aforementioned issues. First, we introduce a novel sigmoid-based regularization to smoothly approximate the combinatorial constraints. Combining it with objective of the balanced kernel clustering, we formulate a differentiable model termed Kernel clustering with sigmoid-based regularization (KCSR), where the gradient-based algorithm can be exploited to obtain the optimal segmentation. Second, we develop a stochastic variant of the proposed model. By using the stochastic gradient descent algorithm, which has much lower time and space complexities, for optimization, the second model can perform segmentation on overlong data sequences. Finally, for simultaneously segmenting multiple data sequences, we slightly modify the sigmoid-based regularization to further introduce an extended variant of the proposed model. Through extensive experiments on various types of data sequences performances of our models are evaluated and compared with those of the existing methods. The experimental results validate advantages of the proposed models. Our Matlab source code is available on github.
    SeqNetVLAD vs PointNetVLAD: Image Sequence vs 3D Point Clouds for Day-Night Place Recognition. (arXiv:2106.11481v1 [cs.CV])
    (2 min) Place Recognition is a crucial capability for mobile robot localization and navigation. Image-based or Visual Place Recognition (VPR) is a challenging problem as scene appearance and camera viewpoint can change significantly when places are revisited. Recent VPR methods based on ``sequential representations'' have shown promising results as compared to traditional sequence score aggregation or single image based techniques. In parallel to these endeavors, 3D point clouds based place recognition is also being explored following the advances in deep learning based point cloud processing. However, a key question remains: is an explicit 3D structure based place representation always superior to an implicit ``spatial'' representation based on sequence of RGB images which can inherently learn scene structure. In this extended abstract, we attempt to compare these two types of methods by considering a similar ``metric span'' to represent places. We compare a 3D point cloud based method (PointNetVLAD) with image sequence based methods (SeqNet and others) and showcase that image sequence based techniques approach, and can even surpass, the performance achieved by point cloud based methods for a given metric span. These performance variations can be attributed to differences in data richness of input sensors as well as data accumulation strategies for a mobile robot. While a perfect apple-to-apple comparison may not be feasible for these two different modalities, the presented comparison takes a step in the direction of answering deeper questions regarding spatial representations, relevant to several applications like Autonomous Driving and Augmented/Virtual Reality. Source code available publicly https://github.com/oravus/seqNet.
    An Alternative Auxiliary Task for Enhancing Image Classification. (arXiv:2106.11478v1 [cs.CV])
    (2 min) Image reconstruction is likely the most predominant auxiliary task for image classification. In this paper, we investigate ``estimating the Fourier Transform of the input image" as a potential alternative auxiliary task, in the hope that it may further boost the performances on the primary task or introduce novel constraints not well covered by image reconstruction. We experimented with five popular classification architectures on the CIFAR-10 dataset, and the empirical results indicated that our proposed auxiliary task generally improves the classification accuracy. More notably, the results showed that in certain cases our proposed auxiliary task may enhance the classifiers' resistance to adversarial attacks generated using the fast gradient sign method.
    Spatial-Temporal Super-Resolution of Satellite Imagery via Conditional Pixel Synthesis. (arXiv:2106.11485v1 [cs.CV])
    (2 min) High-resolution satellite imagery has proven useful for a broad range of tasks, including measurement of global human population, local economic livelihoods, and biodiversity, among many others. Unfortunately, high-resolution imagery is both infrequently collected and expensive to purchase, making it hard to efficiently and effectively scale these downstream tasks over both time and space. We propose a new conditional pixel synthesis model that uses abundant, low-cost, low-resolution imagery to generate accurate high-resolution imagery at locations and times in which it is unavailable. We show that our model attains photo-realistic sample quality and outperforms competing baselines on a key downstream task -- object counting -- particularly in geographic locations where conditions on the ground are changing rapidly.
    Incremental Deep Neural Network Learning using Classification Confidence Thresholding. (arXiv:2106.11437v1 [cs.LG])
    (2 min) Most modern neural networks for classification fail to take into account the concept of the unknown. Trained neural networks are usually tested in an unrealistic scenario with only examples from a closed set of known classes. In an attempt to develop a more realistic model, the concept of working in an open set environment has been introduced. This in turn leads to the concept of incremental learning where a model with its own architecture and initial trained set of data can identify unknown classes during the testing phase and autonomously update itself if evidence of a new class is detected. Some problems that arise in incremental learning are inefficient use of resources to retrain the classifier repeatedly and the decrease of classification accuracy as multiple classes are added over time. This process of instantiating new classes is repeated as many times as necessary, accruing errors. To address these problems, this paper proposes the Classification Confidence Threshold approach to prime neural networks for incremental learning to keep accuracies high by limiting forgetting. A lean method is also used to reduce resources used in the retraining of the neural network. The proposed method is based on the idea that a network is able to incrementally learn a new class even when exposed to a limited number samples associated with the new class. This method can be applied to most existing neural networks with minimal changes to network architecture.
    Image simulation for space applications with the SurRender software. (arXiv:2106.11322v1 [astro-ph.EP])
    (2 min) Image Processing algorithms for vision-based navigation require reliable image simulation capacities. In this paper we explain why traditional rendering engines may present limitations that are potentially critical for space applications. We introduce Airbus SurRender software v7 and provide details on features that make it a very powerful space image simulator. We show how SurRender is at the heart of the development processes of our computer vision solutions and we provide a series of illustrations of rendered images for various use cases ranging from Moon and Solar System exploration, to in orbit rendezvous and planetary robotics.
    BiAdam: Fast Adaptive Bilevel Optimization Methods. (arXiv:2106.11396v1 [math.OC])
    (2 min) Bilevel optimization recently has attracted increased interest in machine learning due to its many applications such as hyper-parameter optimization and policy optimization. Although some methods recently have been proposed to solve the bilevel problems, these methods do not consider using adaptive learning rates. To fill this gap, in the paper, we propose a class of fast and effective adaptive methods for solving bilevel optimization problems that the outer problem is possibly nonconvex and the inner problem is strongly-convex. Specifically, we propose a fast single-loop BiAdam algorithm based on the basic momentum technique, which achieves a sample complexity of $\tilde{O}(\epsilon^{-4})$ for finding an $\epsilon$-stationary point. At the same time, we propose an accelerated version of BiAdam algorithm (VR-BiAdam) by using variance reduced technique, which reaches the best known sample complexity of $\tilde{O}(\epsilon^{-3})$. To further reduce computation in estimating derivatives, we propose a fast single-loop stochastic approximated BiAdam algorithm (saBiAdam) by avoiding the Hessian inverse, which still achieves a sample complexity of $\tilde{O}(\epsilon^{-4})$ without large batches. We further present an accelerated version of saBiAdam algorithm (VR-saBiAdam), which also reaches the best known sample complexity of $\tilde{O}(\epsilon^{-3})$. We apply the unified adaptive matrices to our methods as the SUPER-ADAM \citep{huang2021super}, which including many types of adaptive learning rates. Moreover, our framework can flexibly use the momentum and variance reduced techniques. In particular, we provide a useful convergence analysis framework for both the constrained and unconstrained bilevel optimization. To the best of our knowledge, we first study the adaptive bilevel optimization methods with adaptive learning rates.
    Recent Deep Semi-supervised Learning Approaches and Related Works. (arXiv:2106.11528v1 [cs.LG])
    (2 min) The author of this work proposes an overview of the recent semi-supervised learning approaches and related works. Despite the remarkable success of neural networks in various applications, there exist few formidable constraints including the need for a large amount of labeled data. Therefore, semi-supervised learning, which is a learning scheme in which the scarce labels and a larger amount of unlabeled data are utilized to train models (e.g., deep neural networks) is getting more important. Based on the key assumptions of semi-supervised learning, which are the manifold assumption, cluster assumption, and continuity assumption, the work reviews the recent semi-supervised learning approaches. In particular, the methods in regard to using deep neural networks in a semi-supervised learning setting are primarily discussed. In addition, the existing works are first classified based on the underlying idea and explained, and then the holistic approaches that unify the aforementioned ideas are detailed.
    Unsupervised Embedding Adaptation via Early-Stage Feature Reconstruction for Few-Shot Classification. (arXiv:2106.11486v1 [cs.CV])
    (2 min) We propose unsupervised embedding adaptation for the downstream few-shot classification task. Based on findings that deep neural networks learn to generalize before memorizing, we develop Early-Stage Feature Reconstruction (ESFR) -- a novel adaptation scheme with feature reconstruction and dimensionality-driven early stopping that finds generalizable features. Incorporating ESFR consistently improves the performance of baseline methods on all standard settings, including the recently proposed transductive method. ESFR used in conjunction with the transductive method further achieves state-of-the-art performance on mini-ImageNet, tiered-ImageNet, and CUB; especially with 1.2%~2.0% improvements in accuracy over the previous best performing method on 1-shot setting.
    f-Domain-Adversarial Learning: Theory and Algorithms. (arXiv:2106.11344v1 [cs.LG])
    (2 min) Unsupervised domain adaptation is used in many machine learning applications where, during training, a model has access to unlabeled data in the target domain, and a related labeled dataset. In this paper, we introduce a novel and general domain-adversarial framework. Specifically, we derive a novel generalization bound for domain adaptation that exploits a new measure of discrepancy between distributions based on a variational characterization of f-divergences. It recovers the theoretical results from Ben-David et al. (2010a) as a special case and supports divergences used in practice. Based on this bound, we derive a new algorithmic framework that introduces a key correction in the original adversarial training method of Ganin et al. (2016). We show that many regularizers and ad-hoc objectives introduced over the last years in this framework are then not required to achieve performance comparable to (if not better than) state-of-the-art domain-adversarial methods. Experimental analysis conducted on real-world natural language and computer vision datasets show that our framework outperforms existing baselines, and obtains the best results for f-divergences that were not considered previously in domain-adversarial learning.
    Photozilla: A Large-Scale Photography Dataset and Visual Embedding for 20 Photography Styles. (arXiv:2106.11359v1 [cs.CV])
    (2 min) The advent of social media platforms has been a catalyst for the development of digital photography that engendered a boom in vision applications. With this motivation, we introduce a large-scale dataset termed 'Photozilla', which includes over 990k images belonging to 10 different photographic styles. The dataset is then used to train 3 classification models to automatically classify the images into the relevant style which resulted in an accuracy of ~96%. With the rapid evolution of digital photography, we have seen new types of photography styles emerging at an exponential rate. On that account, we present a novel Siamese-based network that uses the trained classification models as the base architecture to adapt and classify unseen styles with only 25 training samples. We report an accuracy of over 68% for identifying 10 other distinct types of photography styles. This dataset can be found at https://trisha025.github.io/Photozilla/
    FDeblur-GAN: Fingerprint Deblurring using Generative Adversarial Network. (arXiv:2106.11354v1 [cs.CV])
    (2 min) While working with fingerprint images acquired from crime scenes, mobile cameras, or low-quality sensors, it becomes difficult for automated identification systems to verify the identity due to image blur and distortion. We propose a fingerprint deblurring model FDeblur-GAN, based on the conditional Generative Adversarial Networks (cGANs) and multi-stage framework of the stack GAN. Additionally, we integrate two auxiliary sub-networks into the model for the deblurring task. The first sub-network is a ridge extractor model. It is added to generate ridge maps to ensure that fingerprint information and minutiae are preserved in the deblurring process and prevent the model from generating erroneous minutiae. The second sub-network is a verifier that helps the generator to preserve the ID information during the generation process. Using a database of blurred fingerprints and corresponding ridge maps, the deep network learns to deblur from the input blurry samples. We evaluate the proposed method in combination with two different fingerprint matching algorithms. We achieved an accuracy of 95.18% on our fingerprint database for the task of matching deblurred and ground truth fingerprints.
    Context-aware PolyUNet for Liver and Lesion Segmentation from Abdominal CT Images. (arXiv:2106.11330v1 [eess.IV])
    (2 min) Accurate liver and lesion segmentation from computed tomography (CT) images are highly demanded in clinical practice for assisting the diagnosis and assessment of hepatic tumor disease. However, automatic liver and lesion segmentation from contrast-enhanced CT volumes is extremely challenging due to the diversity in contrast, resolution, and quality of images. Previous methods based on UNet for 2D slice-by-slice or 3D volume-by-volume segmentation either lack sufficient spatial contexts or suffer from high GPU computational cost, which limits the performance. To tackle these issues, we propose a novel context-aware PolyUNet for accurate liver and lesion segmentation. It jointly explores structural diversity and consecutive t-adjacent slices to enrich feature expressive power and spatial contextual information while avoiding the overload of GPU memory consumption. In addition, we utilize zoom out/in and two-stage refinement strategy to exclude the irrelevant contexts and focus on the specific region for the fine-grained segmentation. Our method achieved very competitive performance at the MICCAI 2017 Liver Tumor Segmentation (LiTS) Challenge among all tasks with a single model and ranked the $3^{rd}$, $12^{th}$, $2^{nd}$, and $5^{th}$ places in the liver segmentation, lesion segmentation, lesion detection, and tumor burden estimation, respectively.
    Mapping Slums with Medium Resolution Satellite Imagery: a Comparative Analysis of Multi-Spectral Data and Grey-level Co-occurrence Matrix Techniques. (arXiv:2106.11395v1 [cs.CV])
    (2 min) The UN-Habitat estimates that over one billion people live in slums around the world. However, state-of-the-art techniques to detect the location of slum areas employ high-resolution satellite imagery, which is costly to obtain and process. As a result, researchers have started to look at utilising free and open-access medium resolution satellite imagery. Yet, there is no clear consensus on which data preparation and machine learning approaches are the most appropriate to use with such imagery data. In this paper, we evaluate two techniques (multi-spectral data and grey-level co-occurrence matrix feature extraction) on an open-access dataset consisting of labelled Sentinel-2 images with a spatial resolution of 10 meters. Both techniques were paired with a canonical correlation forests classifier. The results show that the grey-level co-occurrence matrix performed better than multi-spectral data for all four cities. It had an average accuracy for the slum class of 97% and a mean intersection over union of 94%, while multi-spectral data had 75% and 64% for the respective metrics. These results indicate that open-access satellite imagery with a resolution of at least 10 meters may be suitable for keeping track of development goals such as the detection of slums in cities.
    Normalized Avatar Synthesis Using StyleGAN and Perceptual Refinement. (arXiv:2106.11423v1 [cs.CV])
    (2 min) We introduce a highly robust GAN-based framework for digitizing a normalized 3D avatar of a person from a single unconstrained photo. While the input image can be of a smiling person or taken in extreme lighting conditions, our method can reliably produce a high-quality textured model of a person's face in neutral expression and skin textures under diffuse lighting condition. Cutting-edge 3D face reconstruction methods use non-linear morphable face models combined with GAN-based decoders to capture the likeness and details of a person but fail to produce neutral head models with unshaded albedo textures which is critical for creating relightable and animation-friendly avatars for integration in virtual environments. The key challenges for existing methods to work is the lack of training and ground truth data containing normalized 3D faces. We propose a two-stage approach to address this problem. First, we adopt a highly robust normalized 3D face generator by embedding a non-linear morphable face model into a StyleGAN2 network. This allows us to generate detailed but normalized facial assets. This inference is then followed by a perceptual refinement step that uses the generated assets as regularization to cope with the limited available training samples of normalized faces. We further introduce a Normalized Face Dataset, which consists of a combination photogrammetry scans, carefully selected photographs, and generated fake people with neutral expressions in diffuse lighting conditions. While our prepared dataset contains two orders of magnitude less subjects than cutting edge GAN-based 3D facial reconstruction methods, we show that it is possible to produce high-quality normalized face models for very challenging unconstrained input images, and demonstrate superior performance to the current state-of-the-art.
    MODETR: Moving Object Detection with Transformers. (arXiv:2106.11422v1 [cs.CV])
    (2 min) Moving Object Detection (MOD) is a crucial task for the Autonomous Driving pipeline. MOD is usually handled via 2-stream convolutional architectures that incorporates both appearance and motion cues, without considering the inter-relations between the spatial or motion features. In this paper, we tackle this problem through multi-head attention mechanisms, both across the spatial and motion streams. We propose MODETR; a Moving Object DEtection TRansformer network, comprised of multi-stream transformer encoders for both spatial and motion modalities, and an object transformer decoder that produces the moving objects bounding boxes using set predictions. The whole architecture is trained end-to-end using bi-partite loss. Several methods of incorporating motion cues with the Transformer model are explored, including two-stream RGB and Optical Flow (OF) methods, and multi-stream architectures that take advantage of sequence information. To incorporate the temporal information, we propose a new Temporal Positional Encoding (TPE) approach to extend the Spatial Positional Encoding(SPE) in DETR. We explore two architectural choices for that, balancing between speed and time. To evaluate the our network, we perform the MOD task on the KITTI MOD [6] data set. Results show significant 5% mAP of the Transformer network for MOD over the state-of-the art methods. Moreover, the proposed TPE encoding provides 10% mAP improvement over the SPE baseline.
    GAIA: A Transfer Learning System of Object Detection that Fits Your Needs. (arXiv:2106.11346v1 [cs.CV])
    (2 min) Transfer learning with pre-training on large-scale datasets has played an increasingly significant role in computer vision and natural language processing recently. However, as there exist numerous application scenarios that have distinctive demands such as certain latency constraints and specialized data distributions, it is prohibitively expensive to take advantage of large-scale pre-training for per-task requirements. In this paper, we focus on the area of object detection and present a transfer learning system named GAIA, which could automatically and efficiently give birth to customized solutions according to heterogeneous downstream needs. GAIA is capable of providing powerful pre-trained weights, selecting models that conform to downstream demands such as latency constraints and specified data domains, and collecting relevant data for practitioners who have very few datapoints for their tasks. With GAIA, we achieve promising results on COCO, Objects365, Open Images, Caltech, CityPersons, and UODB which is a collection of datasets including KITTI, VOC, WiderFace, DOTA, Clipart, Comic, and more. Taking COCO as an example, GAIA is able to efficiently produce models covering a wide range of latency from 16ms to 53ms, and yields AP from 38.2 to 46.5 without whistles and bells. To benefit every practitioner in the community of object detection, GAIA is released at https://github.com/GAIA-vision.
    Dive into Deep Learning. (arXiv:2106.11342v1 [cs.LG])
    (2 min) This open-source book represents our attempt to make deep learning approachable, teaching readers the concepts, the context, and the code. The entire book is drafted in Jupyter notebooks, seamlessly integrating exposition figures, math, and interactive examples with self-contained code. Our goal is to offer a resource that could (i) be freely available for everyone; (ii) offer sufficient technical depth to provide a starting point on the path to actually becoming an applied machine learning scientist; (iii) include runnable code, showing readers how to solve problems in practice; (iv) allow for rapid updates, both by us and also by the community at large; (v) be complemented by a forum for interactive discussion of technical details and to answer questions.
    Understanding top-down attention using task-oriented ablation design. (arXiv:2106.11339v1 [cs.CV])
    (2 min) Top-down attention allows neural networks, both artificial and biological, to focus on the information most relevant for a given task. This is known to enhance performance in visual perception. But it remains unclear how attention brings about its perceptual boost, especially when it comes to naturalistic settings like recognising an object in an everyday scene. What aspects of a visual task does attention help to deal with? We aim to answer this with a computational experiment based on a general framework called task-oriented ablation design. First we define a broad range of visual tasks and identify six factors that underlie task variability. Then on each task we compare the performance of two neural networks, one with top-down attention and one without. These comparisons reveal the task-dependence of attention's perceptual boost, giving a clearer idea of the role attention plays. Whereas many existing cognitive accounts link attention to stimulus-level variables, such as visual clutter and object scale, we find greater explanatory power in system-level variables that capture the interaction between the model, the distribution of training data and the task format. This finding suggests a shift in how attention is studied could be fruitful. We make publicly available our code and results, along with statistics relevant to ImageNet-based experiments beyond this one. Our contribution serves to support the development of more human-like vision models and the design of more informative machine-learning experiments.
    BEyond observation: an approach for ObjectNav. (arXiv:2106.11379v1 [cs.CV])
    (2 min) With the rise of automation, unmanned vehicles became a hot topic both as commercial products and as a scientific research topic. It composes a multi-disciplinary field of robotics that encompasses embedded systems, control theory, path planning, Simultaneous Localization and Mapping (SLAM), scene reconstruction, and pattern recognition. In this work, we present our exploratory research of how sensor data fusion and state-of-the-art machine learning algorithms can perform the Embodied Artificial Intelligence (E-AI) task called Visual Semantic Navigation. This task, a.k.a Object-Goal Navigation (ObjectNav) consists of autonomous navigation using egocentric visual observations to reach an object belonging to the target semantic class without prior knowledge of the environment. Our method reached fourth place on the Habitat Challenge 2021 ObjectNav on the Minival phase and the Test-Standard Phase.
    Encoder-Decoder Architectures for Clinically Relevant Coronary Artery Segmentation. (arXiv:2106.11447v1 [eess.IV])
    (2 min) Coronary X-ray angiography is a crucial clinical procedure for the diagnosis and treatment of coronary artery disease, which accounts for roughly 16% of global deaths every year. However, the images acquired in these procedures have low resolution and poor contrast, making lesion detection and assessment challenging. Accurate coronary artery segmentation not only helps mitigate these problems, but also allows the extraction of relevant anatomical features for further analysis by quantitative methods. Although automated segmentation of coronary arteries has been proposed before, previous approaches have used non-optimal segmentation criteria, leading to less useful results. Most methods either segment only the major vessel, discarding important information from the remaining ones, or segment the whole coronary tree based mostly on contrast information, producing a noisy output that includes vessels that are not relevant for diagnosis. We adopt a better-suited clinical criterion and segment vessels according to their clinical relevance. Additionally, we simultaneously perform catheter segmentation, which may be useful for diagnosis due to the scale factor provided by the catheter's known diameter, and is a task that has not yet been performed with good results. To derive the optimal approach, we conducted an extensive comparative study of encoder-decoder architectures trained on a combination of focal loss and a variant of generalized dice loss. Based on the EfficientNet and the UNet++ architectures, we propose a line of efficient and high-performance segmentation models using a new decoder architecture, the EfficientUNet++, whose best-performing version achieved average dice scores of 0.8904 and 0.7526 for the artery and catheter classes, respectively, and an average generalized dice score of 0.9234.
    Gait analysis with curvature maps: A simulation study. (arXiv:2106.11466v1 [cs.CV])
    (2 min) Gait analysis is an important aspect of clinical investigation for detecting neurological and musculoskeletal disorders and assessing the global health of a patient. In this paper we propose to focus our attention on extracting relevant curvature information from the body surface provided by a depth camera. We assumed that the 3D mesh was made available in a previous step and demonstrated how curvature maps could be useful to assess asymmetric anomalies with two simple simulated abnormal gaits compared with a normal one. This research set the grounds for the future development of a curvature-based gait analysis system for healthcare professionals.
    Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation. (arXiv:2106.11401v1 [cs.CV])
    (2 min) Moving objects have special importance for Autonomous Driving tasks. Detecting moving objects can be posed as Moving Object Segmentation, by segmenting the object pixels, or Moving Object Detection, by generating a bounding box for the moving targets. In this paper, we present a Multi-Task Learning architecture, based on Transformers, to jointly perform both tasks through one network. Due to the importance of the motion features to the task, the whole setup is based on a Spatio-Temporal aggregation. We evaluate the performance of the individual tasks architecture versus the MTL setup, both with early shared encoders, and late shared encoder-decoder transformers. For the latter, we present a novel joint tasks query decoder transformer, that enables us to have tasks dedicated heads out of the shared model. To evaluate our approach, we use the KITTI MOD [29] data set. Results show1.5% mAP improvement for Moving Object Detection, and 2%IoU improvement for Moving Object Segmentation, over the individual tasks networks.
  • cs.IR updates on arXiv.org

    A Systematic Evaluation of Transfer Learning and Pseudo-labeling with BERT-based Ranking Models. (arXiv:2103.03335v3 [cs.IR] UPDATED)
    (2 min) Due to high annotation costs making the best use of existing human-created training data is an important research direction. We, therefore, carry out a systematic evaluation of transferability of BERT-based neural ranking models across five English datasets. Previous studies focused primarily on zero-shot and few-shot transfer from a large dataset to a dataset with a small number of queries. In contrast, each of our collections has a substantial number of queries, which enables a full-shot evaluation mode and improves reliability of our results. Furthermore, since source datasets licences often prohibit commercial use, we compare transfer learning to training on pseudo-labels generated by a BM25 scorer. We find that training on pseudo-labels -- possibly with subsequent fine-tuning using a modest number of annotated queries -- can produce a competitive or better model compared to transfer learning. Yet, it is necessary to improve the stability and/or effectiveness of the few-shot training, which, sometimes, can degrade performance of a pretrained model.
    Turing Award elites revisited: patterns of productivity, collaboration, authorship and impact. (arXiv:2106.11534v1 [cs.DL])
    (2 min) The Turing Award is recognized as the most influential and prestigious award in the field of computer science(CS). With the rise of the science of science (SciSci), a large amount of bibliographic data has been analyzed in an attempt to understand the hidden mechanism of scientific evolution. These include the analysis of the Nobel Prize, including physics, chemistry, medicine, etc. In this article, we extract and analyze the data of 72 Turing Award laureates from the complete bibliographic data, fill the gap in the lack of Turing Award analysis, and discover the development characteristics of computer science as an independent discipline. First, we show most Turing Award laureates have long-term and high-quality educational backgrounds, and more than 61% of them have a degree in mathematics, which indicates that mathematics has played a significant role in the development of computer science. Secondly, the data shows that not all scholars have high productivity and high h-index; that is, the number of publications and h-index is not the leading indicator for evaluating the Turing Award. Third, the average age of awardees has increased from 40 to around 70 in recent years. This may be because new breakthroughs take longer, and some new technologies need time to prove their influence. Besides, we have also found that in the past ten years, international collaboration has experienced explosive growth, showing a new paradigm in the form of collaboration. It is also worth noting that in recent years, the emergence of female winners has also been eye-catching. Finally, by analyzing the personal publication records, we find that many people are more likely to publish high-impact articles during their high-yield periods.
    Generating abstractive summaries of Lithuanian news articles using a transformer model. (arXiv:2105.03279v2 [cs.CL] UPDATED)
    (2 min) In this work, we train the first monolingual Lithuanian transformer model on a relatively large corpus of Lithuanian news articles and compare various output decoding algorithms for abstractive news summarization. We achieve an average ROUGE-2 score 0.163, generated summaries are coherent and look impressive at first glance. However, some of them contain misleading information that is not so easy to spot. We describe all the technical details and share our trained model and accompanying code in an online open-source repository, as well as some characteristic samples of the generated summaries.
    A Query-Driven Topic Model. (arXiv:2106.07346v2 [cs.IR] UPDATED)
    (2 min) Topic modeling is an unsupervised method for revealing the hidden semantic structure of a corpus. It has been increasingly widely adopted as a tool in the social sciences, including political science, digital humanities and sociological research in general. One desirable property of topic models is to allow users to find topics describing a specific aspect of the corpus. A possible solution is to incorporate domain-specific knowledge into topic modeling, but this requires a specification from domain experts. We propose a novel query-driven topic model that allows users to specify a simple query in words or phrases and return query-related topics, thus avoiding tedious work from domain experts. Our proposed approach is particularly attractive when the user-specified query has a low occurrence in a text corpus, making it difficult for traditional topic models built on word cooccurrence patterns to identify relevant topics. Experimental results demonstrate the effectiveness of our model in comparison with both classical topic models and neural topic models.
    SeqNetVLAD vs PointNetVLAD: Image Sequence vs 3D Point Clouds for Day-Night Place Recognition. (arXiv:2106.11481v1 [cs.CV])
    (2 min) Place Recognition is a crucial capability for mobile robot localization and navigation. Image-based or Visual Place Recognition (VPR) is a challenging problem as scene appearance and camera viewpoint can change significantly when places are revisited. Recent VPR methods based on ``sequential representations'' have shown promising results as compared to traditional sequence score aggregation or single image based techniques. In parallel to these endeavors, 3D point clouds based place recognition is also being explored following the advances in deep learning based point cloud processing. However, a key question remains: is an explicit 3D structure based place representation always superior to an implicit ``spatial'' representation based on sequence of RGB images which can inherently learn scene structure. In this extended abstract, we attempt to compare these two types of methods by considering a similar ``metric span'' to represent places. We compare a 3D point cloud based method (PointNetVLAD) with image sequence based methods (SeqNet and others) and showcase that image sequence based techniques approach, and can even surpass, the performance achieved by point cloud based methods for a given metric span. These performance variations can be attributed to differences in data richness of input sensors as well as data accumulation strategies for a mobile robot. While a perfect apple-to-apple comparison may not be feasible for these two different modalities, the presented comparison takes a step in the direction of answering deeper questions regarding spatial representations, relevant to several applications like Autonomous Driving and Augmented/Virtual Reality. Source code available publicly https://github.com/oravus/seqNet.
    Discovering Mathematical Objects of Interest -- A Study of Mathematical Notations. (arXiv:2002.02712v3 [cs.DL] UPDATED)
    (2 min) Mathematical notation, i.e., the writing system used to communicate concepts in mathematics, encodes valuable information for a variety of information search and retrieval systems. Yet, mathematical notations remain mostly unutilized by today's systems. In this paper, we present the first in-depth study on the distributions of mathematical notation in two large scientific corpora: the open access arXiv (2.5B mathematical objects) and the mathematical reviewing service for pure and applied mathematics zbMATH (61M mathematical objects). Our study lays a foundation for future research projects on mathematical information retrieval for large scientific corpora. Further, we demonstrate the relevance of our results to a variety of use-cases. For example, to assist semantic extraction systems, to improve scientific search engines, and to facilitate specialized math recommendation systems. The contributions of our presented research are as follows: (1) we present the first distributional analysis of mathematical formulae on arXiv and zbMATH; (2) we retrieve relevant mathematical objects for given textual search queries (e.g., linking $P_{n}^{(\alpha, \beta)}\!\left(x\right)$ with `Jacobi polynomial'); (3) we extend zbMATH's search engine by providing relevant mathematical formulae; and (4) we exemplify the applicability of the results by presenting auto-completion for math inputs as the first contribution to math recommendation systems. To expedite future research projects, we have made available our source code and data.
    IITP@COLIEE 2019: Legal Information Retrieval using BM25 and BERT. (arXiv:2104.08653v3 [cs.CL] UPDATED)
    (2 min) Natural Language Processing (NLP) and Information Retrieval (IR) in the judicial domain is an essential task. With the advent of availability domain-specific data in electronic form and aid of different Artificial intelligence (AI) technologies, automated language processing becomes more comfortable, and hence it becomes feasible for researchers and developers to provide various automated tools to the legal community to reduce human burden. The Competition on Legal Information Extraction/Entailment (COLIEE-2019) run in association with the International Conference on Artificial Intelligence and Law (ICAIL)-2019 has come up with few challenging tasks. The shared defined four sub-tasks (i.e. Task1, Task2, Task3 and Task4), which will be able to provide few automated systems to the judicial system. The paper presents our working note on the experiments carried out as a part of our participation in all the sub-tasks defined in this shared task. We make use of different Information Retrieval(IR) and deep learning based approaches to tackle these problems. We obtain encouraging results in all these four sub-tasks.
    What is all this new MeSH about? Exploring the semantic provenance of new descriptors in the MeSH thesaurus. (arXiv:2101.08293v2 [cs.DL] UPDATED)
    (2 min) The Medical Subject Headings (MeSH) thesaurus is a controlled vocabulary widely used in biomedical knowledge systems, particularly for semantic indexing of scientific literature. As the MeSH hierarchy evolves through annual version updates, some new descriptors are introduced that were not previously available. This paper explores the conceptual provenance of these new descriptors. In particular, we investigate whether such new descriptors have been previously covered by older descriptors and what is their current relation to them. To this end, we propose a framework to categorize new descriptors based on their current relation to older descriptors. Based on the proposed classification scheme, we quantify, analyse and present the different types of new descriptors introduced in MeSH during the last fifteen years. The results show that only about 25% of new MeSH descriptors correspond to new emerging concepts, whereas the rest were previously covered by one or more existing descriptors, either implicitly or explicitly. Most of them were covered by a single existing descriptor and they usually end up as descendants of it in the current hierarchy, gradually leading towards a more fine-grained MeSH vocabulary. These insights about the dynamics of the thesaurus are useful for the retrospective study of scientific articles annotated with MeSH, but could also be used to inform the policy of updating the thesaurus in the future.
    Fine-tune the Entire RAG Architecture (including DPR retriever) for Question-Answering. (arXiv:2106.11517v1 [cs.IR])
    (2 min) In this paper, we illustrate how to fine-tune the entire Retrieval Augment Generation (RAG) architecture in an end-to-end manner. We highlighted the main engineering challenges that needed to be addressed to achieve this objective. We also compare how end-to-end RAG architecture outperforms the original RAG architecture for the task of question answering. We have open-sourced our implementation in the HuggingFace Transformers library.
    Quantifying the Impact of Human Capital, Job History, and Language Factors on Job Seniority with a Large-scale Analysis of Resumes. (arXiv:2106.11846v1 [econ.GN])
    (2 min) As job markets worldwide have become more competitive and applicant selection criteria have become more opaque, and different (and sometimes contradictory) information and advice is available for job seekers wishing to progress in their careers, it has never been more difficult to determine which factors in a r\'esum\'e most effectively help career progression. In this work we present a novel, large scale dataset of over half a million r\'esum\'es with preliminary analysis to begin to answer empirically which factors help or hurt people wishing to transition to more senior roles as they progress in their career. We find that previous experience forms the most important factor, outweighing other aspects of human capital, and find which language factors in a r\'esum\'e have significant effects. This lays the groundwork for future inquiry in career trajectories using large scale data analysis and natural language processing techniques.
    Deep Learning Models in Detection of Dietary Supplement Adverse Event Signals from Twitter. (arXiv:2106.11403v1 [cs.CL])
    (2 min) Objective: The objective of this study is to develop a deep learning pipeline to detect signals on dietary supplement-related adverse events (DS AEs) from Twitter. Material and Methods: We obtained 247,807 tweets ranging from 2012 to 2018 that mentioned both DS and AE. We annotated biomedical entities and relations on 2,000 randomly selected tweets. For the concept extraction task, we compared the performance of traditional word embeddings with SVM, CRF and LSTM-CRF classifiers to BERT models. For the relation extraction task, we compared GloVe vectors with CNN classifiers to BERT models. We chose the best performing models in each task to assemble an end-to-end deep learning pipeline to detect DS AE signals and compared the results to the known DS AEs from a DS knowledge base (i.e., iDISK). Results: In both tasks, the BERT-based models outperformed traditional word embeddings. The best performing concept extraction model is the BioBERT model that can identify supplement, symptom, and body organ entities with F1-scores of 0.8646, 0.8497, and 0.7104, respectively. The best performing relation extraction model is the BERT model that can identify purpose and AE relations with F1-scores of 0.8335 and 0.7538, respectively. The end-to-end pipeline was able to extract DS indication and DS AEs with an F1-score of 0.7459 and 0,7414, respectively. Comparing to the iDISK, we could find both known and novel DS-AEs. Conclusion: We have demonstrated the feasibility of detecting DS AE signals from Twitter with a BioBERT-based deep learning pipeline.
  • cs.LG updates on arXiv.org

    Interpretable Deep Learning for the Remote Characterisation of Ambulation in Multiple Sclerosis using Smartphones. (arXiv:2103.09171v2 [cs.LG] UPDATED)
    (2 min) The emergence of digital technologies such as smartphones in healthcare applications have demonstrated the possibility of developing rich, continuous, and objective measures of multiple sclerosis (MS) disability that can be administered remotely and out-of-clinic. In this work, deep convolutional neural networks (DCNN) applied to smartphone inertial sensor data were shown to better distinguish healthy from MS participant ambulation, compared to standard Support Vector Machine (SVM) feature-based methodologies. To overcome the typical limitations associated with remotely generated health data, such as low subject numbers, sparsity, and heterogeneous data, a transfer learning (TL) model from similar large open-source datasets was proposed. Our TL framework utilised the ambulatory information learned on Human Activity Recognition (HAR) tasks collected from similar smartphone-based sensor data. A lack of transparency of "black-box" deep networks remains one of the largest stumbling blocks to the wider acceptance of deep learning for clinical applications. Ensuing work therefore aimed to visualise DCNN decisions attributed by relevance heatmaps using Layer-Wise Relevance Propagation (LRP). Through the LRP framework, the patterns captured from smartphone-based inertial sensor data that were reflective of those who are healthy versus persons with MS (PwMS) could begin to be established and understood. Interpretations suggested that cadence-based measures, gait speed, and ambulation-related signal perturbations were distinct characteristics that distinguished MS disability from healthy participants. Robust and interpretable outcomes, generated from high-frequency out-of-clinic assessments, could greatly augment the current in-clinic assessment picture for PwMS, to inform better disease management techniques, and enable the development of better therapeutic interventions.
    Nutrition5k: Towards Automatic Nutritional Understanding of Generic Food. (arXiv:2103.03375v2 [cs.CV] UPDATED)
    (2 min) Understanding the nutritional content of food from visual data is a challenging computer vision problem, with the potential to have a positive and widespread impact on public health. Studies in this area are limited to existing datasets in the field that lack sufficient diversity or labels required for training models with nutritional understanding capability. We introduce Nutrition5k, a novel dataset of 5k diverse, real world food dishes with corresponding video streams, depth images, component weights, and high accuracy nutritional content annotation. We demonstrate the potential of this dataset by training a computer vision algorithm capable of predicting the caloric and macronutrient values of a complex, real world dish at an accuracy that outperforms professional nutritionists. Further we present a baseline for incorporating depth sensor data to improve nutrition predictions. We will publicly release Nutrition5k in the hope that it will accelerate innovation in the space of nutritional understanding.
    SDCOR: Scalable Density-based Clustering for Local Outlier Detection in Massive-Scale Datasets. (arXiv:2006.07616v10 [cs.LG] UPDATED)
    (2 min) This paper presents a batch-wise density-based clustering approach for local outlier detection in massive-scale datasets. Unlike the well-known traditional algorithms, which assume that all the data is memory-resident, our proposed method is scalable and processes the input data chunk-by-chunk within the confines of a limited memory buffer. A temporary clustering model is built at the first phase; then, it is gradually updated by analyzing consecutive memory loads of points. Subsequently, at the end of scalable clustering, the approximate structure of the original clusters is obtained. Finally, by another scan of the entire dataset and using a suitable criterion, an outlying score is assigned to each object called SDCOR (Scalable Density-based Clustering Outlierness Ratio). Evaluations on real-life and synthetic datasets demonstrate that the proposed method has a low linear time complexity and is more effective and efficient compared to best-known conventional density-based methods, which need to load all data into the memory; and also, to some fast distance-based methods, which can perform on data resident in the disk.
    Any equation is a forest: Symbolic genetic algorithm for discovering open-form partial differential equations (SGA-PDE). (arXiv:2106.11927v1 [cs.NE])
    (2 min) Partial differential equations (PDEs) are concise and understandable representations of domain knowledge, which are essential for deepening our understanding of physical processes and predicting future responses. However, the PDEs of many real-world problems are uncertain, which calls for PDE discovery. We propose the symbolic genetic algorithm (SGA-PDE) to discover open-form PDEs directly from data without prior knowledge about the equation structure. SGA-PDE focuses on the representation and optimization of PDE. Firstly, SGA-PDE uses symbolic mathematics to realize the flexible representation of any given PDE, transforms a PDE into a forest, and converts each function term into a binary tree. Secondly, SGA-PDE adopts a specially designed genetic algorithm to efficiently optimize the binary trees by iteratively updating the tree topology and node attributes. The SGA-PDE is gradient-free, which is a desirable characteristic in PDE discovery since it is difficult to obtain the gradient between the PDE loss and the PDE structure. In the experiment, SGA-PDE not only successfully discovered nonlinear Burgers' equation, Korteweg-de Vries (KdV) equation, and Chafee-Infante equation, but also handled PDEs with fractional structure and compound functions that cannot be solved by conventional PDE discovery methods.
    Sample Complexity of Offline Reinforcement Learning with Deep ReLU Networks. (arXiv:2103.06671v2 [stat.ML] UPDATED)
    (2 min) We study the statistical theory of offline reinforcement learning (RL) with deep ReLU network function approximation. We analyze a variant of fitted-Q iteration (FQI) algorithm under a new dynamic condition that we call Besov dynamic closure, which encompasses the conditions from prior analyses for deep neural network function approximation. Under Besov dynamic closure, we prove that the FQI-type algorithm enjoys the sample complexity of $\tilde{\mathcal{O}}\left( \kappa^{1 + d/\alpha} \cdot \epsilon^{-2 - 2d/\alpha} \right)$ where $\kappa$ is a distribution shift measure, $d$ is the dimensionality of the state-action space, $\alpha$ is the (possibly fractional) smoothness parameter of the underlying MDP, and $\epsilon$ is a user-specified precision. This is an improvement over the sample complexity of $\tilde{\mathcal{O}}\left( K \cdot \kappa^{2 + d/\alpha} \cdot \epsilon^{-2 - d/\alpha} \right)$ in the prior result [Yang et al., 2019] where $K$ is an algorithmic iteration number which is arbitrarily large in practice. Importantly, our sample complexity is obtained under the new general dynamic condition and a data-dependent structure where the latter is either ignored in prior algorithms or improperly handled by prior analyses. This is the first comprehensive analysis for offline RL with deep ReLU network function approximation under a general setting.
    Evo* 2021 -- Late-Breaking Abstracts Volume. (arXiv:2106.11804v1 [cs.NE])
    (2 min) Volumen with the Late-Breaking Abstracts submitted to the Evo* 2021 Conference, held online from 7 to 9 of April 2021. These papers present ongoing research and preliminary results investigating on the application of different approaches of Bioinspired Methods (mainly Evolutionary Computation) to different problems, most of them real world ones.
    Graph coarsening: From scientific computing to machine learning. (arXiv:2106.11863v1 [cs.LG])
    (2 min) The general method of graph coarsening or graph reduction has been a remarkably useful and ubiquitous tool in scientific computing and it is now just starting to have a similar impact in machine learning. The goal of this paper is to take a broad look into coarsening techniques that have been successfully deployed in scientific computing and see how similar principles are finding their way in more recent applications related to machine learning. In scientific computing, coarsening plays a central role in algebraic multigrid methods as well as the related class of multilevel incomplete LU factorizations. In machine learning, graph coarsening goes under various names, e.g., graph downsampling or graph reduction. Its goal in most cases is to replace some original graph by one which has fewer nodes, but whose structure and characteristics are similar to those of the original graph. As will be seen, a common strategy in these methods is to rely on spectral properties to define the coarse graph.
    RUHSNet: 3D Object Detection Using Lidar Data in Real Time. (arXiv:2006.01250v6 [cs.CV] UPDATED)
    (2 min) In this work, we address the problem of 3D object detection from point cloud data in real time. For autonomous vehicles to work, it is very important for the perception component to detect the real world objects with both high accuracy and fast inference. We propose a novel neural network architecture along with the training and optimization details for detecting 3D objects in point cloud data. We compare the results with different backbone architectures including the standard ones like VGG, ResNet, Inception with our backbone. Also we present the optimization and ablation studies including designing an efficient anchor. We use the Kitti 3D Birds Eye View dataset for benchmarking and validating our results. Our work surpasses the state of the art in this domain both in terms of average precision and speed running at > 30 FPS. This makes it a feasible option to be deployed in real time applications including self driving cars.
    Adversarially-Trained Nonnegative Matrix Factorization. (arXiv:2104.04757v2 [cs.LG] UPDATED)
    (2 min) We consider an adversarially-trained version of the nonnegative matrix factorization, a popular latent dimensionality reduction technique. In our formulation, an attacker adds an arbitrary matrix of bounded norm to the given data matrix. We design efficient algorithms inspired by adversarial training to optimize for dictionary and coefficient matrices with enhanced generalization abilities. Extensive simulations on synthetic and benchmark datasets demonstrate the superior predictive performance on matrix completion tasks of our proposed method compared to state-of-the-art competitors, including other variants of adversarial nonnegative matrix factorization.
    Adversarial Robustness vs Model Compression, or Both?. (arXiv:1903.12561v5 [cs.CV] UPDATED)
    (2 min) It is well known that deep neural networks (DNNs) are vulnerable to adversarial attacks, which are implemented by adding crafted perturbations onto benign examples. Min-max robust optimization based adversarial training can provide a notion of security against adversarial attacks. However, adversarial robustness requires a significantly larger capacity of the network than that for the natural training with only benign examples. This paper proposes a framework of concurrent adversarial training and weight pruning that enables model compression while still preserving the adversarial robustness and essentially tackles the dilemma of adversarial training. Furthermore, this work studies two hypotheses about weight pruning in the conventional setting and finds that weight pruning is essential for reducing the network model size in the adversarial setting, training a small model from scratch even with inherited initialization from the large model cannot achieve both adversarial robustness and high standard accuracy. Code is available at https://github.com/yeshaokai/Robustness-Aware-Pruning-ADMM.
    Making Invisible Visible: Data-Driven Seismic Inversion with Physics-Informed Data Augmentation. (arXiv:2106.11892v1 [cs.LG])
    (2 min) Deep learning and data-driven approaches have shown great potential in scientific domains. The promise of data-driven techniques relies on the availability of a large volume of high-quality training datasets. Due to the high cost of obtaining data through expensive physical experiments, instruments, and simulations, data augmentation techniques for scientific applications have emerged as a new direction for obtaining scientific data recently. However, existing data augmentation techniques originating from computer vision, yield physically unacceptable data samples that are not helpful for the domain problems that we are interested in. In this paper, we develop new physics-informed data augmentation techniques based on convolutional neural networks. Specifically, our generative models leverage different physics knowledge (such as governing equations, observable perception, and physics phenomena) to improve the quality of the synthetic data. To validate the effectiveness of our data augmentation techniques, we apply them to solve a subsurface seismic full-waveform inversion using simulated CO$_2$ leakage data. Our interest is to invert for subsurface velocity models associated with very small CO$_2$ leakage. We validate the performance of our methods using comprehensive numerical tests. Via comparison and analysis, we show that data-driven seismic imaging can be significantly enhanced by using our physics-informed data augmentation techniques. Particularly, the imaging quality has been improved by 15% in test scenarios of general-sized leakage and 17% in small-sized leakage when using an augmented training set obtained with our techniques.
    Gradient-based Label Binning in Multi-label Classification. (arXiv:2106.11690v1 [cs.LG])
    (2 min) In multi-label classification, where a single example may be associated with several class labels at the same time, the ability to model dependencies between labels is considered crucial to effectively optimize non-decomposable evaluation measures, such as the Subset 0/1 loss. The gradient boosting framework provides a well-studied foundation for learning models that are specifically tailored to such a loss function and recent research attests the ability to achieve high predictive accuracy in the multi-label setting. The utilization of second-order derivatives, as used by many recent boosting approaches, helps to guide the minimization of non-decomposable losses, due to the information about pairs of labels it incorporates into the optimization process. On the downside, this comes with high computational costs, even if the number of labels is small. In this work, we address the computational bottleneck of such approach -- the need to solve a system of linear equations -- by integrating a novel approximation technique into the boosting procedure. Based on the derivatives computed during training, we dynamically group the labels into a predefined number of bins to impose an upper bound on the dimensionality of the linear system. Our experiments, using an existing rule-based algorithm, suggest that this may boost the speed of training, without any significant loss in predictive performance.
    A Clustering-based Framework for Classifying Data Streams. (arXiv:2106.11823v1 [cs.LG])
    (2 min) The non-stationary nature of data streams strongly challenges traditional machine learning techniques. Although some solutions have been proposed to extend traditional machine learning techniques for handling data streams, these approaches either require an initial label set or rely on specialized design parameters. The overlap among classes and the labeling of data streams constitute other major challenges for classifying data streams. In this paper, we proposed a clustering-based data stream classification framework to handle non-stationary data streams without utilizing an initial label set. A density-based stream clustering procedure is used to capture novel concepts with a dynamic threshold and an effective active label querying strategy is introduced to continuously learn the new concepts from the data streams. The sub-cluster structure of each cluster is explored to handle the overlap among classes. Experimental results and quantitative comparison studies reveal that the proposed method provides statistically better or comparable performance than the existing methods.
    Lower and Upper Bounds on the VC-Dimension of Tensor Network Models. (arXiv:2106.11827v1 [cs.LG])
    (2 min) Tensor network methods have been a key ingredient of advances in condensed matter physics and have recently sparked interest in the machine learning community for their ability to compactly represent very high-dimensional objects. Tensor network methods can for example be used to efficiently learn linear models in exponentially large feature spaces [Stoudenmire and Schwab, 2016]. In this work, we derive upper and lower bounds on the VC dimension and pseudo-dimension of a large class of tensor network models for classification, regression and completion. Our upper bounds hold for linear models parameterized by arbitrary tensor network structures, and we derive lower bounds for common tensor decomposition models~(CP, Tensor Train, Tensor Ring and Tucker) showing the tightness of our general upper bound. These results are used to derive a generalization bound which can be applied to classification with low rank matrices as well as linear classifiers based on any of the commonly used tensor decomposition models. As a corollary of our results, we obtain a bound on the VC dimension of the matrix product state classifier introduced in [Stoudenmire and Schwab, 2016] as a function of the so-called bond dimension~(i.e. tensor train rank), which answers an open problem listed by Cirac, Garre-Rubio and P\'erez-Garc\'ia in [Cirac et al., 2019].
    Focus U-Net: A novel dual attention-gated CNN for polyp segmentation during colonoscopy. (arXiv:2105.07467v2 [eess.IV] UPDATED)
    (2 min) Background: Colonoscopy remains the gold-standard screening for colorectal cancer. However, significant miss rates for polyps have been reported, particularly when there are multiple small adenomas. This presents an opportunity to leverage computer-aided systems to support clinicians and reduce the number of polyps missed. Method: In this work we introduce the Focus U-Net, a novel dual attention-gated deep neural network, which combines efficient spatial and channel-based attention into a single Focus Gate module to encourage selective learning of polyp features. The Focus U-Net further incorporates short-range skip connections and deep supervision. Furthermore, we introduce the Hybrid Focal loss, a new compound loss function based on the Focal loss and Focal Tversky loss, to handle class-imbalanced image segmentation. For our experiments, we selected five public datasets containing images of polyps obtained during optical colonoscopy: CVC-ClinicDB, Kvasir-SEG, CVC-ColonDB, ETIS-Larib PolypDB and EndoScene test set. To evaluate model performance, we use the Dice similarity coefficient (DSC) and Intersection over Union (IoU) metrics. Results: Our model achieves state-of-the-art results for both CVC-ClinicDB and Kvasir-SEG, with a mean DSC of 0.941 and 0.910, respectively. When evaluated on a combination of five public polyp datasets, our model similarly achieves state-of-the-art results with a mean DSC of 0.878 and mean IoU of 0.809, a 14% and 15% improvement over the previous state-of-the-art results of 0.768 and 0.702, respectively. Conclusions: This study shows the potential for deep learning to provide fast and accurate polyp segmentation results for use during colonoscopy. The Focus U-Net may be adapted for future use in newer non-invasive screening and more broadly to other biomedical image segmentation tasks involving class imbalance and requiring efficiency.
    Local policy search with Bayesian optimization. (arXiv:2106.11899v1 [cs.LG])
    (2 min) Reinforcement learning (RL) aims to find an optimal policy by interaction with an environment. Consequently, learning complex behavior requires a vast number of samples, which can be prohibitive in practice. Nevertheless, instead of systematically reasoning and actively choosing informative samples, policy gradients for local search are often obtained from random perturbations. These random samples yield high variance estimates and hence are sub-optimal in terms of sample complexity. Actively selecting informative samples is at the core of Bayesian optimization, which constructs a probabilistic surrogate of the objective from past samples to reason about informative subsequent ones. In this paper, we propose to join both worlds. We develop an algorithm utilizing a probabilistic model of the objective function and its gradient. Based on the model, the algorithm decides where to query a noisy zeroth-order oracle to improve the gradient estimates. The resulting algorithm is a novel type of policy search method, which we compare to existing black-box algorithms. The comparison reveals improved sample complexity and reduced variance in extensive empirical evaluations on synthetic objectives. Further, we highlight the benefits of active sampling on popular RL benchmarks.
    Provably Efficient Representation Learning in Low-rank Markov Decision Processes. (arXiv:2106.11935v1 [cs.LG])
    (2 min) The success of deep reinforcement learning (DRL) is due to the power of learning a representation that is suitable for the underlying exploration and exploitation task. However, existing provable reinforcement learning algorithms with linear function approximation often assume the feature representation is known and fixed. In order to understand how representation learning can improve the efficiency of RL, we study representation learning for a class of low-rank Markov Decision Processes (MDPs) where the transition kernel can be represented in a bilinear form. We propose a provably efficient algorithm called ReLEX that can simultaneously learn the representation and perform exploration. We show that ReLEX always performs no worse than a state-of-the-art algorithm without representation learning, and will be strictly better in terms of sample efficiency if the function class of representations enjoys a certain mild "coverage'' property over the whole state-action space.
    MONCAE: Multi-Objective Neuroevolution of Convolutional Autoencoders. (arXiv:2106.11914v1 [cs.NE])
    (2 min) In this paper, we present a novel neuroevolutionary method to identify the architecture and hyperparameters of convolutional autoencoders. Remarkably, we used a hypervolume indicator in the context of neural architecture search for autoencoders, for the first time to our current knowledge. Results show that images were compressed by a factor of more than 10, while still retaining enough information to achieve image classification for the majority of the tasks. Thus, this new approach can be used to speed up the AutoML pipeline for image compression.
    Data Augmentation for Opcode Sequence Based Malware Detection. (arXiv:2106.11821v1 [cs.CR])
    (2 min) Data augmentation has been successfully used in many areas of deep-learning to significantly improve model performance. Typically data augmentation simulates realistic variations in data in order to increase the apparent diversity of the training-set. However, for opcode-based malware analysis, where deep learning methods are already achieving state of the art performance, it is not immediately clear how to apply data augmentation. In this paper we study different methods of data augmentation starting with basic methods using fixed transformations and moving to methods that adapt to the data. We propose a novel data augmentation method based on using an opcode embedding layer within the network and its corresponding opcode embedding matrix to perform adaptive data augmentation during training. To the best of our knowledge this is the first paper to carry out a systematic study of different augmentation methods applied to opcode sequence based malware classification.
    Generate High Resolution Images With Generative Variational Autoencoder. (arXiv:2008.10399v3 [eess.IV] UPDATED)
    (2 min) In this work, we present a novel neural network to generate high resolution images. We replace the decoder of VAE with a discriminator while using the encoder as it is. The encoder is fed data from a normal distribution while the generator is fed from a gaussian distribution. The combination from both is given to a discriminator which tells whether the generated image is correct or not. We evaluate our network on 3 different datasets: MNIST, LSUN and CelebA dataset. Our network beats the previous state of the art using MMD, SSIM, log likelihood, reconstruction error, ELBO and KL divergence as the evaluation metrics while generating much sharper images. This work is potentially very exciting as we are able to combine the advantages of generative models and inference models in a principled bayesian manner.
    Routine Clustering of Mobile Sensor Data Facilitates Psychotic Relapse Prediction in Schizophrenia Patients. (arXiv:2106.11487v1 [cs.LG])
    (2 min) We aim to develop clustering models to obtain behavioral representations from continuous multimodal mobile sensing data towards relapse prediction tasks. The identified clusters could represent different routine behavioral trends related to daily living of patients as well as atypical behavioral trends associated with impending relapse. We used the mobile sensing data obtained in the CrossCheck project for our analysis. Continuous data from six different mobile sensing-based modalities (e.g. ambient light, sound/conversation, acceleration etc.) obtained from a total of 63 schizophrenia patients, each monitored for up to a year, were used for the clustering models and relapse prediction evaluation. Two clustering models, Gaussian Mixture Model (GMM) and Partition Around Medoids (PAM), were used to obtain behavioral representations from the mobile sensing data. The features obtained from the clustering models were used to train and evaluate a personalized relapse prediction model using Balanced Random Forest. The personalization was done by identifying optimal features for a given patient based on a personalization subset consisting of other patients who are of similar age. The clusters identified using the GMM and PAM models were found to represent different behavioral patterns (such as clusters representing sedentary days, active but with low communications days, etc.). Significant changes near the relapse periods were seen in the obtained behavioral representation features from the clustering models. The clustering model based features, together with other features characterizing the mobile sensing data, resulted in an F2 score of 0.24 for the relapse prediction task in a leave-one-patient-out evaluation setting. This obtained F2 score is significantly higher than a random classification baseline with an average F2 score of 0.042.
    Algorithmic Recourse in Partially and Fully Confounded Settings Through Bounding Counterfactual Effects. (arXiv:2106.11849v1 [stat.ML])
    (2 min) Algorithmic recourse aims to provide actionable recommendations to individuals to obtain a more favourable outcome from an automated decision-making system. As it involves reasoning about interventions performed in the physical world, recourse is fundamentally a causal problem. Existing methods compute the effect of recourse actions using a causal model learnt from data under the assumption of no hidden confounding and modelling assumptions such as additive noise. Building on the seminal work of Balke and Pearl (1994), we propose an alternative approach for discrete random variables which relaxes these assumptions and allows for unobserved confounding and arbitrary structural equations. The proposed approach only requires specification of the causal graph and confounding structure and bounds the expected counterfactual effect of recourse actions. If the lower bound is above a certain threshold, i.e., on the other side of the decision boundary, recourse is guaranteed in expectation.
    Robust Regression Revisited: Acceleration and Improved Estimation Rates. (arXiv:2106.11938v1 [cs.DS])
    (2 min) We study fast algorithms for statistical regression problems under the strong contamination model, where the goal is to approximately optimize a generalized linear model (GLM) given adversarially corrupted samples. Prior works in this line of research were based on the robust gradient descent framework of Prasad et. al., a first-order method using biased gradient queries, or the Sever framework of Diakonikolas et. al., an iterative outlier-removal method calling a stationary point finder. We present nearly-linear time algorithms for robust regression problems with improved runtime or estimation guarantees compared to the state-of-the-art. For the general case of smooth GLMs (e.g. logistic regression), we show that the robust gradient descent framework of Prasad et. al. can be accelerated, and show our algorithm extends to optimizing the Moreau envelopes of Lipschitz GLMs (e.g. support vector machines), answering several open questions in the literature. For the well-studied case of robust linear regression, we present an alternative approach obtaining improved estimation rates over prior nearly-linear time algorithms. Interestingly, our method starts with an identifiability proof introduced in the context of the sum-of-squares algorithm of Bakshi and Prasad, which achieved optimal error rates while requiring large polynomial runtime and sample complexity. We reinterpret their proof within the Sever framework and obtain a dramatically faster and more sample-efficient algorithm under fewer distributional assumptions.
    Notes on the H-measure of classifier performance. (arXiv:2106.11888v1 [cs.LG])
    (2 min) The H-measure is a classifier performance measure which takes into account the context of application without requiring a rigid value of relative misclassification costs to be set. Since its introduction in 2009 it has become widely adopted. This paper answers various queries which users have raised since its introduction, including questions about its interpretation, the choice of a weighting function, whether it is strictly proper, and its coherence, and relates the measure to other work.
    Optimal Best-Arm Identification Methods for Tail-Risk Measures. (arXiv:2008.07606v3 [cs.LG] UPDATED)
    (2 min) Conditional value-at-risk (CVaR) and value-at-risk (VaR) are popular tail-risk measures in finance and insurance industries as well as in highly reliable, safety-critical uncertain environments where often the underlying probability distributions are heavy-tailed. We use the multi-armed bandit best-arm identification framework and consider the problem of identifying the arm from amongst finitely many that has the smallest CVaR, VaR, or weighted sum of CVaR and mean. The latter captures the risk-return trade-off common in finance. Our main contribution is an optimal $\delta$-correct algorithm that acts on general arms, including heavy-tailed distributions, and matches the lower bound on the expected number of samples needed, asymptotically (as $\delta$ approaches $0$). The algorithm requires solving a non-convex optimization problem in the space of probability measures, that requires delicate analysis. En-route, we develop new non-asymptotic empirical likelihood-based concentration inequalities for tail-risk measures which are tighter than those for popular truncation-based empirical estimators.
    Understanding Long Range Memory Effects in Deep Neural Networks. (arXiv:2105.02062v3 [cs.LG] UPDATED)
    (2 min) \textit{Stochastic gradient descent} (SGD) is of fundamental importance in deep learning. Despite its simplicity, elucidating its efficacy remains challenging. Conventionally, the success of SGD is attributed to the \textit{stochastic gradient noise} (SGN) incurred in the training process. Based on this general consensus, SGD is frequently treated and analyzed as the Euler-Maruyama discretization of a \textit{stochastic differential equation} (SDE) driven by either Brownian or L\'evy stable motion. In this study, we argue that SGN is neither Gaussian nor stable. Instead, inspired by the long-time correlation emerging in SGN series, we propose that SGD can be viewed as a discretization of an SDE driven by \textit{fractional Brownian motion} (FBM). Accordingly, the different convergence behavior of SGD dynamics is well grounded. Moreover, the first passage time of an SDE driven by FBM is approximately derived. This indicates a lower escaping rate for a larger Hurst parameter, and thus SGD stays longer in flat minima. This happens to coincide with the well-known phenomenon that SGD favors flat minima that generalize well. Four groups of experiments are conducted to validate our conjecture, and it is demonstrated that long-range memory effects persist across various model architectures, datasets, and training strategies. Our study opens up a new perspective and may contribute to a better understanding of SGD.
    Improving Ultrasound Tongue Image Reconstruction from Lip Images Using Self-supervised Learning and Attention Mechanism. (arXiv:2106.11769v1 [eess.AS])
    (2 min) Speech production is a dynamic procedure, which involved multi human organs including the tongue, jaw and lips. Modeling the dynamics of the vocal tract deformation is a fundamental problem to understand the speech, which is the most common way for human daily communication. Researchers employ several sensory streams to describe the process simultaneously, which are incontrovertibly statistically related to other streams. In this paper, we address the following question: given an observable image sequences of lips, can we picture the corresponding tongue motion. We formulated this problem as the self-supervised learning problem, and employ the two-stream convolutional network and long-short memory network for the learning task, with the attention mechanism. We evaluate the performance of the proposed method by leveraging the unlabeled lip videos to predict an upcoming ultrasound tongue image sequence. The results show that our model is able to generate images that close to the real ultrasound tongue images, and results in the matching between two imaging modalities.
    Aligned Contrastive Predictive Coding. (arXiv:2104.11946v3 [cs.LG] UPDATED)
    (2 min) We investigate the possibility of forcing a self-supervised model trained using a contrastive predictive loss to extract slowly varying latent representations. Rather than producing individual predictions for each of the future representations, the model emits a sequence of predictions shorter than that of the upcoming representations to which they will be aligned. In this way, the prediction network solves a simpler task of predicting the next symbols, but not their exact timing, while the encoding network is trained to produce piece-wise constant latent codes. We evaluate the model on a speech coding task and demonstrate that the proposed Aligned Contrastive Predictive Coding (ACPC) leads to higher linear phone prediction accuracy and lower ABX error rates, while being slightly faster to train due to the reduced number of prediction heads.
    DeepReDuce: ReLU Reduction for Fast Private Inference. (arXiv:2103.01396v2 [cs.LG] UPDATED)
    (2 min) The recent rise of privacy concerns has led researchers to devise methods for private neural inference -- where inferences are made directly on encrypted data, never seeing inputs. The primary challenge facing private inference is that computing on encrypted data levies an impractically-high latency penalty, stemming mostly from non-linear operators like ReLU. Enabling practical and private inference requires new optimization methods that minimize network ReLU counts while preserving accuracy. This paper proposes DeepReDuce: a set of optimizations for the judicious removal of ReLUs to reduce private inference latency. The key insight is that not all ReLUs contribute equally to accuracy. We leverage this insight to drop, or remove, ReLUs from classic networks to significantly reduce inference latency and maintain high accuracy. Given a target network, DeepReDuce outputs a Pareto frontier of networks that tradeoff the number of ReLUs and accuracy. Compared to the state-of-the-art for private inference DeepReDuce improves accuracy and reduces ReLU count by up to 3.5% (iso-ReLU count) and 3.5$\times$ (iso-accuracy), respectively.
    Federated Over-Air Subspace Tracking from Incomplete and Corrupted Data. (arXiv:2002.12873v3 [cs.LG] UPDATED)
    (2 min) Subspace tracking (ST) with missing data (ST-miss) or outliers (Robust ST) or both (Robust ST-miss) has been extensively studied in the last many years. This work provides a new simple algorithm and guarantee for both ST with missing data (ST-miss) and RST-miss. Unlike past work on this topic, the algorithm is much simpler (uses fewer parameters) and the guarantee does not make the artificial assumption of piecewise constant subspace change, although it still handles that setting. Secondly, we extend our approach and its analysis to provably solving these problems when the raw data is federated and when the over-air data communication modality is used for information exchange between the $K$ peer nodes and the center.
    Analysis and Tuning of a Voice Assistant System for Dysfluent Speech. (arXiv:2106.11759v1 [eess.AS])
    (2 min) Dysfluencies and variations in speech pronunciation can severely degrade speech recognition performance, and for many individuals with moderate-to-severe speech disorders, voice operated systems do not work. Current speech recognition systems are trained primarily with data from fluent speakers and as a consequence do not generalize well to speech with dysfluencies such as sound or word repetitions, sound prolongations, or audible blocks. The focus of this work is on quantitative analysis of a consumer speech recognition system on individuals who stutter and production-oriented approaches for improving performance for common voice assistant tasks (i.e., "what is the weather?"). At baseline, this system introduces a significant number of insertion and substitution errors resulting in intended speech Word Error Rates (isWER) that are 13.64\% worse (absolute) for individuals with fluency disorders. We show that by simply tuning the decoding parameters in an existing hybrid speech recognition system one can improve isWER by 24\% (relative) for individuals with fluency disorders. Tuning these parameters translates to 3.6\% better domain recognition and 1.7\% better intent recognition relative to the default setup for the 18 study participants across all stuttering severities.
    Minimax Regret for Stochastic Shortest Path with Adversarial Costs and Known Transition. (arXiv:2012.04053v3 [cs.LG] UPDATED)
    (2 min) We study the stochastic shortest path problem with adversarial costs and known transition, and show that the minimax regret is $\widetilde{O}(\sqrt{DT^\star K})$ and $\widetilde{O}(\sqrt{DT^\star SA K})$ for the full-information setting and the bandit feedback setting respectively, where $D$ is the diameter, $T^\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Our results significantly improve upon the existing work of (Rosenberg and Mansour, 2020) which only considers the full-information setting and achieves suboptimal regret. Our work is also the first to consider bandit feedback with adversarial costs. Our algorithms are built on top of the Online Mirror Descent framework with a variety of new techniques that might be of independent interest, including an improved multi-scale expert algorithm, a reduction from general stochastic shortest path to a special loop-free case, a skewed occupancy measure space, and a novel correction term added to the cost estimators. Interestingly, the last two elements reduce the variance of the learner via positive bias and the variance of the optimal policy via negative bias respectively, and having them simultaneously is critical for obtaining the optimal high-probability bound in the bandit feedback setting.
    Empirically explaining SGD from a line search perspective. (arXiv:2103.17132v2 [cs.LG] UPDATED)
    (2 min) Optimization in Deep Learning is mainly guided by vague intuitions and strong assumptions, with a limited understanding how and why these work in practice. To shed more light on this, our work provides some deeper understandings of how SGD behaves by empirically analyzing the trajectory taken by SGD from a line search perspective. Specifically, a costly quantitative analysis of the full-batch loss along SGD trajectories from common used models trained on a subset of CIFAR-10 is performed. Our core results include that the full-batch loss along lines in update step direction is highly parabolically. Further on, we show that there exists a learning rate with which SGD always performs almost exact line searches on the full-batch loss. Finally, we provide a different perspective why increasing the batch size has almost the same effect as decreasing the learning rate by the same factor.
    User Identification across Social Networking Sites using User Profiles and Posting Patterns. (arXiv:2106.11815v1 [cs.LG])
    (2 min) With the prevalence of online social networking sites (OSNs) and mobile devices, people are increasingly reliant on a variety of OSNs for keeping in touch with family and friends, and using it as a source of information. For example, a user might utilise multiple OSNs for different purposes, such as using Flickr to share holiday pictures with family and friends, and Twitter to post short messages about their thoughts. Identifying the same user across multiple OSNs is an important task as this allows us to understand the usage patterns of users among different OSNs, make recommendations when a user registers for a new OSN, and various other useful applications. To address this problem, we proposed an algorithm based on the multilayer perceptron using various types of features, namely: (i) user profile, such as name, location, description; (ii) temporal distribution of user generated content; and (iii) embedding based on user name, real name and description. Using a Twitter and Flickr dataset of users and their posting activities, we perform an empirical study on how these features affect the performance of user identification across the two OSNs and discuss our main findings based on the different features.
    A Query-Driven Topic Model. (arXiv:2106.07346v2 [cs.IR] UPDATED)
    (2 min) Topic modeling is an unsupervised method for revealing the hidden semantic structure of a corpus. It has been increasingly widely adopted as a tool in the social sciences, including political science, digital humanities and sociological research in general. One desirable property of topic models is to allow users to find topics describing a specific aspect of the corpus. A possible solution is to incorporate domain-specific knowledge into topic modeling, but this requires a specification from domain experts. We propose a novel query-driven topic model that allows users to specify a simple query in words or phrases and return query-related topics, thus avoiding tedious work from domain experts. Our proposed approach is particularly attractive when the user-specified query has a low occurrence in a text corpus, making it difficult for traditional topic models built on word cooccurrence patterns to identify relevant topics. Experimental results demonstrate the effectiveness of our model in comparison with both classical topic models and neural topic models.
    Benchmarking Invertible Architectures on Inverse Problems. (arXiv:2101.10763v3 [cs.LG] UPDATED)
    (2 min) Recent work demonstrated that flow-based invertible neural networks are promising tools for solving ambiguous inverse problems. Following up on this, we investigate how ten invertible architectures and related models fare on two intuitive, low-dimensional benchmark problems, obtaining the best results with coupling layers and simple autoencoders. We hope that our initial efforts inspire other researchers to evaluate their invertible architectures in the same setting and put forth additional benchmarks, so our evaluation may eventually grow into an official community challenge.
    Data Quality as Predictor of Voice Anti-Spoofing Generalization. (arXiv:2103.14602v2 [eess.AS] UPDATED)
    (2 min) Voice anti-spoofing aims at classifying a given utterance either as a bonafide human sample, or a spoofing attack (e.g. synthetic or replayed sample). Many anti-spoofing methods have been proposed but most of them fail to generalize across domains (corpora) -- and we do not know \emph{why}. We outline a novel interpretative framework for gauging the impact of data quality upon anti-spoofing performance. Our within- and between-domain experiments pool data from seven public corpora and three anti-spoofing methods based on Gaussian mixture and convolutive neural network models. We assess the impacts of long-term spectral information, speaker population (through x-vector speaker embeddings), signal-to-noise ratio, and selected voice quality features.
    Enabling Long-Term Cooperation in Cross-Silo Federated Learning: A Repeated Game Perspective. (arXiv:2106.11814v1 [cs.LG])
    (2 min) Cross-silo federated learning (FL) is a distributed learning approach where clients train a global model cooperatively while keeping their local data private. Different from cross-device FL, clients in cross-silo FL are usually organizations or companies which may execute multiple cross-silo FL processes repeatedly due to their time-varying local data sets, and aim to optimize their long-term benefits by selfishly choosing their participation levels. While there has been some work on incentivizing clients to join FL, the analysis of the long-term selfish participation behaviors of clients in cross-silo FL remains largely unexplored. In this paper, we analyze the selfish participation behaviors of heterogeneous clients in cross-silo FL. Specifically, we model the long-term selfish participation behaviors of clients as an infinitely repeated game, with the stage game being a selfish participation game in one cross-silo FL process (SPFL). For the stage game SPFL, we derive the unique Nash equilibrium (NE), and propose a distributed algorithm for each client to calculate its equilibrium participation strategy. For the long-term interactions among clients, we derive a cooperative strategy for clients which minimizes the number of free riders while increasing the amount of local data for model training. We show that enforced by a punishment strategy, such a cooperative strategy is a SPNE of the infinitely repeated game, under which some clients who are free riders at the NE of the stage game choose to be (partial) contributors. We further propose an algorithm to calculate the optimal SPNE which minimizes the number of free riders while maximizing the amount of local data for model training. Simulation results show that our proposed cooperative strategy at the optimal SPNE can effectively reduce the number of free riders and increase the amount of local data for model training.
    Surrogate-based variational data assimilation for tidal modelling. (arXiv:2106.11926v1 [stat.ML])
    (2 min) Data assimilation (DA) is widely used to combine physical knowledge and observations. It is nowadays commonly used in geosciences to perform parametric calibration. In a context of climate change, old calibrations can not necessarily be used for new scenarios. This raises the question of DA computational cost, as costly physics-based numerical models need to be reanalyzed. Reduction and metamodelling represent therefore interesting perspectives, for example proposed in recent contributions as hybridization between ensemble and variational methods, to combine their advantages (efficiency, non-linear framework). They are however often based on Monte Carlo (MC) type sampling, which often requires considerable increase of the ensemble size for better efficiency, therefore representing a computational burden in ensemble-based methods as well. To address these issues, two methods to replace the complex model by a surrogate are proposed and confronted : (i) PODEn3DVAR directly inspired from PODEn4DVAR, relies on an ensemble-based joint parameter-state Proper Orthogonal Decomposition (POD), which provides a linear metamodel ; (ii) POD-PCE-3DVAR, where the model states are POD reduced then learned using Polynomial Chaos Expansion (PCE), resulting in a non-linear metamodel. Both metamodels allow to write an approximate cost function whose minimum can be analytically computed, or deduced by a gradient descent at negligible cost. Furthermore, adapted metamodelling error covariance matrix is given for POD-PCE-3DVAR, allowing to substantially improve the metamodel-based DA analysis. Proposed methods are confronted on a twin experiment, and compared to classical 3DVAR on a measurement-based problem. Results are promising, in particular superior with POD-PCE-3DVAR, showing good convergence to classical 3DVAR and robustness to noise.
    Pruning of Deep Spiking Neural Networks through Gradient Rewiring. (arXiv:2105.04916v3 [cs.NE] UPDATED)
    (2 min) Spiking Neural Networks (SNNs) have been attached great importance due to their biological plausibility and high energy-efficiency on neuromorphic chips. As these chips are usually resource-constrained, the compression of SNNs is thus crucial along the road of practical use of SNNs. Most existing methods directly apply pruning approaches in artificial neural networks (ANNs) to SNNs, which ignore the difference between ANNs and SNNs, thus limiting the performance of the pruned SNNs. Besides, these methods are only suitable for shallow SNNs. In this paper, inspired by synaptogenesis and synapse elimination in the neural system, we propose gradient rewiring (Grad R), a joint learning algorithm of connectivity and weight for SNNs, that enables us to seamlessly optimize network structure without retraining. Our key innovation is to redefine the gradient to a new synaptic parameter, allowing better exploration of network structures by taking full advantage of the competition between pruning and regrowth of connections. The experimental results show that the proposed method achieves minimal loss of SNNs' performance on MNIST and CIFAR-10 dataset so far. Moreover, it reaches a $\sim$3.5% accuracy loss under unprecedented 0.73% connectivity, which reveals remarkable structure refining capability in SNNs. Our work suggests that there exists extremely high redundancy in deep SNNs. Our codes are available at https://github.com/Yanqi-Chen/Gradient-Rewiring.
    Towards Automated Evaluation of Explanations in Graph Neural Networks. (arXiv:2106.11864v1 [cs.AI])
    (2 min) Explaining Graph Neural Networks predictions to end users of AI applications in easily understandable terms remains an unsolved problem. In particular, we do not have well developed methods for automatically evaluating explanations, in ways that are closer to how users consume those explanations. Based on recent application trends and our own experiences in real world problems, we propose automatic evaluation approaches for GNN Explanations.
    Copyright in Generative Deep Learning. (arXiv:2105.09266v2 [cs.CY] UPDATED)
    (2 min) Machine-generated artworks are now part of the contemporary art scene: they are attracting significant investments and they are presented in exhibitions together with those created by human artists. These artworks are mainly based on generative deep learning techniques. Also given their success, several legal problems arise when working with these techniques. In this article we consider a set of key questions in the area of generative deep learning for the arts. Is it possible to use copyrighted works as training set for generative models? How do we legally store their copies in order to perform the training process? And then, who (if someone) will own the copyright on the generated data? We try to answer these questions considering the law in force in both US and EU and the future alternatives, trying to define a set of guidelines for artists and developers working on deep learning generated art.
    Active Learning under Pool Set Distribution Shift and Noisy Data. (arXiv:2106.11719v1 [cs.LG])
    (2 min) Active Learning is essential for more label-efficient deep learning. Bayesian Active Learning has focused on BALD, which reduces model parameter uncertainty. However, we show that BALD gets stuck on out-of-distribution or junk data that is not relevant for the task. We examine a novel *Expected Predictive Information Gain (EPIG)* to deal with distribution shifts of the pool set. EPIG reduces the uncertainty of *predictions* on an unlabelled *evaluation set* sampled from the test data distribution whose distribution might be different to the pool set distribution. Based on this, our new EPIG-BALD acquisition function for Bayesian Neural Networks selects samples to improve the performance on the test data distribution instead of selecting samples that reduce model uncertainty everywhere, including for out-of-distribution regions with low density in the test data distribution. Our method outperforms state-of-the-art Bayesian active learning methods on high-dimensional datasets and avoids out-of-distribution junk data in cases where current state-of-the-art methods fail.
    Continuous-Depth Neural Models for Dynamic Graph Prediction. (arXiv:2106.11581v1 [cs.LG])
    (2 min) We introduce the framework of continuous-depth graph neural networks (GNNs). Neural graph differential equations (Neural GDEs) are formalized as the counterpart to GNNs where the input-output relationship is determined by a continuum of GNN layers, blending discrete topological structures and differential equations. The proposed framework is shown to be compatible with static GNN models and is extended to dynamic and stochastic settings through hybrid dynamical system theory. Here, Neural GDEs improve performance by exploiting the underlying dynamics geometry, further introducing the ability to accommodate irregularly sampled data. Results prove the effectiveness of the proposed models across applications, such as traffic forecasting or prediction in genetic regulatory networks.
    Symplectic Learning for Hamiltonian Neural Networks. (arXiv:2106.11753v1 [cs.LG])
    (2 min) Machine learning methods are widely used in the natural sciences to model and predict physical systems from observation data. Yet, they are often used as poorly understood "black boxes," disregarding existing mathematical structure and invariants of the problem. Recently, the proposal of Hamiltonian Neural Networks (HNNs) took a first step towards a unified "gray box" approach, using physical insight to improve performance for Hamiltonian systems. In this paper, we explore a significantly improved training method for HNNs, exploiting the symplectic structure of Hamiltonian systems with a different loss function. This frees the loss from an artificial lower bound. We mathematically guarantee the existence of an exact Hamiltonian function which the HNN can learn. This allows us to prove and numerically analyze the errors made by HNNs which, in turn, renders them fully explainable. Finally, we present a novel post-training correction to obtain the true Hamiltonian only from discretized observation data, up to an arbitrary order.
    Randomness In Neural Network Training: Characterizing The Impact of Tooling. (arXiv:2106.11872v1 [cs.LG])
    (2 min) The quest for determinism in machine learning has disproportionately focused on characterizing the impact of noise introduced by algorithmic design choices. In this work, we address a less well understood and studied question: how does our choice of tooling introduce randomness to deep neural network training. We conduct large scale experiments across different types of hardware, accelerators, state of art networks, and open-source datasets, to characterize how tooling choices contribute to the level of non-determinism in a system, the impact of said non-determinism, and the cost of eliminating different sources of noise. Our findings are surprising, and suggest that the impact of non-determinism in nuanced. While top-line metrics such as top-1 accuracy are not noticeably impacted, model performance on certain parts of the data distribution is far more sensitive to the introduction of randomness. Our results suggest that deterministic tooling is critical for AI safety. However, we also find that the cost of ensuring determinism varies dramatically between neural network architectures and hardware types, e.g., with overhead up to $746\%$, $241\%$, and $196\%$ on a spectrum of widely used GPU accelerator architectures, relative to non-deterministic training. The source code used in this paper is available at https://github.com/usyd-fsalab/NeuralNetworkRandomness.
    Deep Phasor Networks: Connecting Conventional and Spiking Neural Networks. (arXiv:2106.11908v1 [cs.NE])
    (2 min) In this work, we extend standard neural networks by building upon an assumption that neuronal activations correspond to the angle of a complex number lying on the unit circle, or 'phasor.' Each layer in such a network produces new activations by taking a weighted superposition of the previous layer's phases and calculating the new phase value. This generalized architecture allows models to reach high accuracy and carries the singular advantage that mathematically equivalent versions of the network can be executed with or without regard to a temporal variable. Importantly, the value of a phase angle in the temporal domain can be sparsely represented by a periodically repeating series of delta functions or 'spikes'. We demonstrate the atemporal training of a phasor network on standard deep learning tasks and show that these networks can then be executed in either the traditional atemporal domain or spiking temporal domain with no conversion step needed. This provides a novel basis for constructing deep networkswhich operate via temporal, spike-based calculations suitable for neuromorphic computing hardware.
    Fair Algorithms for Hierarchical Agglomerative Clustering. (arXiv:2005.03197v3 [cs.LG] UPDATED)
    (2 min) Hierarchical Agglomerative Clustering (HAC) algorithms are extensively utilized in modern data science, and seek to partition the dataset into clusters while generating a hierarchical relationship between the data samples. HAC algorithms are employed in many applications, such as biology, natural language processing, and recommender systems. Thus, it is imperative to ensure that these algorithms are fair -- even if the dataset contains biases against certain protected groups, the cluster outputs generated should not discriminate against samples from any of these groups. However, recent work in clustering fairness has mostly focused on center-based clustering algorithms, such as k-median and k-means clustering. In this paper, we propose fair algorithms for performing HAC that enforce fairness constraints 1) irrespective of the distance linkage criteria used, 2) generalize to any natural measures of clustering fairness for HAC, 3) work for multiple protected groups, and 4) have competitive running times to vanilla HAC. Through extensive experiments on multiple real-world UCI datasets, we show that our proposed algorithm finds fairer clusterings compared to vanilla HAC as well as other state-of-the-art fair clustering approaches.
    EC-GAN: Low-Sample Classification using Semi-Supervised Algorithms and GANs. (arXiv:2012.15864v3 [cs.LG] UPDATED)
    (2 min) Semi-supervised learning has been gaining attention as it allows for performing image analysis tasks such as classification with limited labeled data. Some popular algorithms using Generative Adversarial Networks (GANs) for semi-supervised classification share a single architecture for classification and discrimination. However, this may require a model to converge to a separate data distribution for each task, which may reduce overall performance. While progress in semi-supervised learning has been made, less addressed are small-scale, fully-supervised tasks where even unlabeled data is unavailable and unattainable. We therefore, propose a novel GAN model namely External Classifier GAN (EC-GAN), that utilizes GANs and semi-supervised algorithms to improve classification in fully-supervised regimes. Our method leverages a GAN to generate artificial data used to supplement supervised classification. More specifically, we attach an external classifier, hence the name EC-GAN, to the GAN's generator, as opposed to sharing an architecture with the discriminator. Our experiments demonstrate that EC-GAN's performance is comparable to the shared architecture method, far superior to the standard data augmentation and regularization-based approach, and effective on a small, realistic dataset.
    Analysis of Optimization Algorithms via Sum-of-Squares. (arXiv:1906.04648v4 [math.OC] UPDATED)
    (2 min) We introduce a new framework for unifying and systematizing the performance analysis of first-order black-box optimization algorithms for unconstrained convex minimization. The low-cost iteration complexity enjoyed by first-order algorithms renders them particularly relevant for applications in machine learning and large-scale data analysis. Relying on sum-of-squares (SOS) optimization, we introduce a hierarchy of semidefinite programs that give increasingly better convergence bounds for higher levels of the hierarchy. Alluding to the power of the SOS hierarchy, we show that the (dual of the) first level corresponds to the Performance Estimation Problem (PEP) introduced by Drori and Teboulle [Math. Program., 145(1):451--482, 2014], a powerful framework for determining convergence rates of first-order optimization algorithms. Consequently, many results obtained within the PEP framework can be reinterpreted as degree-1 SOS proofs, and thus, the SOS framework provides a promising new approach for certifying improved rates of convergence by means of higher-order SOS certificates. To determine analytical rate bounds, in this work we use the first level of the SOS hierarchy and derive new result{s} for noisy gradient descent with inexact line search methods (Armijo, Wolfe, and Goldstein).
    Detecting Anomalous User Behavior in Remote Patient Monitoring. (arXiv:2106.11844v1 [cs.LG])
    (2 min) The growth in Remote Patient Monitoring (RPM) services using wearable and non-wearable Internet of Medical Things (IoMT) promises to improve the quality of diagnosis and facilitate timely treatment for a gamut of medical conditions. At the same time, the proliferation of IoMT devices increases the potential for malicious activities that can lead to catastrophic results including theft of personal information, data breach, and compromised medical devices, putting human lives at risk. IoMT devices generate tremendous amount of data that reflect user behavior patterns including both personal and day-to-day social activities along with daily routine health monitoring. In this context, there are possibilities of anomalies generated due to various reasons including unexpected user behavior, faulty sensor, or abnormal values from malicious/compromised devices. To address this problem, there is an imminent need to develop a framework for securing the smart health care infrastructure to identify and mitigate anomalies. In this paper, we present an anomaly detection model for RPM utilizing IoMT and smart home devices. We propose Hidden Markov Model (HMM) based anomaly detection that analyzes normal user behavior in the context of RPM comprising both smart home and smart health devices, and identifies anomalous user behavior. We design a testbed with multiple IoMT devices and home sensors to collect data and use the HMM model to train using network and user behavioral data. Proposed HMM based anomaly detection model achieved over 98% accuracy in identifying the anomalies in the context of RPM.
    Impossible Tuning Made Possible: A New Expert Algorithm and Its Applications. (arXiv:2102.01046v2 [cs.LG] UPDATED)
    (2 min) We resolve the long-standing "impossible tuning" issue for the classic expert problem and show that, it is in fact possible to achieve regret $O\left(\sqrt{(\ln d)\sum_t \ell_{t,i}^2}\right)$ simultaneously for all expert $i$ in a $T$-round $d$-expert problem where $\ell_{t,i}$ is the loss for expert $i$ in round $t$. Our algorithm is based on the Mirror Descent framework with a correction term and a weighted entropy regularizer. While natural, the algorithm has not been studied before and requires a careful analysis. We also generalize the bound to $O\left(\sqrt{(\ln d)\sum_t (\ell_{t,i}-m_{t,i})^2}\right)$ for any prediction vector $m_t$ that the learner receives, and recover or improve many existing results by choosing different $m_t$. Furthermore, we use the same framework to create a master algorithm that combines a set of base algorithms and learns the best one with little overhead. The new guarantee of our master allows us to derive many new results for both the expert problem and more generally Online Linear Optimization.
    OGB-LSC: A Large-Scale Challenge for Machine Learning on Graphs. (arXiv:2103.09430v2 [cs.LG] UPDATED)
    (2 min) Enabling effective and efficient machine learning (ML) over large-scale graph data (e.g., graphs with billions of edges) can have a huge impact on both industrial and scientific applications. However, community efforts to advance large-scale graph ML have been severely limited by the lack of a suitable public benchmark. For KDD Cup 2021, we present OGB Large-Scale Challenge (OGB-LSC), a collection of three real-world datasets for advancing the state-of-the-art in large-scale graph ML. OGB-LSC provides graph datasets that are orders of magnitude larger than existing ones and covers three core graph learning tasks -- link prediction, graph regression, and node classification. Furthermore, OGB-LSC provides dedicated baseline experiments, scaling up expressive graph ML models to the massive datasets. We show that the expressive models significantly outperform simple scalable baselines, indicating an opportunity for dedicated efforts to further improve graph ML at scale. Our datasets and baseline code are released and maintained as part of our OGB initiative (Hu et al., 2020). We hope OGB-LSC at KDD Cup 2021 can empower the community to discover innovative solutions for large-scale graph ML.
    A Unified Framework for Conservative Exploration. (arXiv:2106.11692v1 [cs.LG])
    (2 min) We study bandits and reinforcement learning (RL) subject to a conservative constraint where the agent is asked to perform at least as well as a given baseline policy. This setting is particular relevant in real-world domains including digital marketing, healthcare, production, finance, etc. For multi-armed bandits, linear bandits and tabular RL, specialized algorithms and theoretical analyses were proposed in previous work. In this paper, we present a unified framework for conservative bandits and RL, in which our core technique is to calculate the necessary and sufficient budget obtained from running the baseline policy. For lower bounds, our framework gives a black-box reduction that turns a certain lower bound in the nonconservative setting into a new lower bound in the conservative setting. We strengthen the existing lower bound for conservative multi-armed bandits and obtain new lower bounds for conservative linear bandits, tabular RL and low-rank MDP. For upper bounds, our framework turns a certain nonconservative upper-confidence-bound (UCB) algorithm into a conservative algorithm with a simple analysis. For multi-armed bandits, linear bandits and tabular RL, our new upper bounds tighten or match existing ones with significantly simpler analyses. We also obtain a new upper bound for conservative low-rank MDP.
    Speeding Up OPFython with Numba. (arXiv:2106.11828v1 [cs.LG])
    (2 min) A graph-inspired classifier, known as Optimum-Path Forest (OPF), has proven to be a state-of-the-art algorithm comparable to Logistic Regressors, Support Vector Machines in a wide variety of tasks. Recently, its Python-based version, denoted as OPFython, has been proposed to provide a more friendly framework and a faster prototyping environment. Nevertheless, Python-based algorithms are slower than their counterpart C-based algorithms, impacting their performance when confronted with large amounts of data. Therefore, this paper proposed a simple yet highly efficient speed up using the Numba package, which accelerates Numpy-based calculations and attempts to increase the algorithm's overall performance. Experimental results showed that the proposed approach achieved better results than the na\"ive Python-based OPF and speeded up its distance measurement calculation.
    Differentiable Learning Under Triage. (arXiv:2103.08902v2 [stat.ML] UPDATED)
    (2 min) Multiple lines of evidence suggest that predictive models may benefit from algorithmic triage. Under algorithmic triage, a predictive model does not predict all instances but instead defers some of them to human experts. However, the interplay between the prediction accuracy of the model and the human experts under algorithmic triage is not well understood. In this work, we start by formally characterizing under which circumstances a predictive model may benefit from algorithmic triage. In doing so, we also demonstrate that models trained for full automation may be suboptimal under triage. Then, given any model and desired level of triage, we show that the optimal triage policy is a deterministic threshold rule in which triage decisions are derived deterministically by thresholding the difference between the model and human errors on a per-instance level. Building upon these results, we introduce a practical gradient-based algorithm that is guaranteed to find a sequence of triage policies and predictive models of increasing performance. Experiments on a wide variety of supervised learning tasks using synthetic and real data from two important applications -- content moderation and scientific discovery -- illustrate our theoretical results and show that the models and triage policies provided by our gradient-based algorithm outperform those provided by several competitive baselines.
    NetFense: Adversarial Defenses against Privacy Attacks on Neural Networks for Graph Data. (arXiv:2106.11865v1 [cs.LG])
    (2 min) Recent advances in protecting node privacy on graph data and attacking graph neural networks (GNNs) gain much attention. The eye does not bring these two essential tasks together yet. Imagine an adversary can utilize the powerful GNNs to infer users' private labels in a social network. How can we adversarially defend against such privacy attacks while maintaining the utility of perturbed graphs? In this work, we propose a novel research task, adversarial defenses against GNN-based privacy attacks, and present a graph perturbation-based approach, NetFense, to achieve the goal. NetFense can simultaneously keep graph data unnoticeability (i.e., having limited changes on the graph structure), maintain the prediction confidence of targeted label classification (i.e., preserving data utility), and reduce the prediction confidence of private label classification (i.e., protecting the privacy of nodes). Experiments conducted on single- and multiple-target perturbations using three real graph data exhibit that the perturbed graphs by NetFense can effectively maintain data utility (i.e., model unnoticeability) on targeted label classification and significantly decrease the prediction confidence of private label classification (i.e., privacy protection). Extensive studies also bring several insights, such as the flexibility of NetFense, preserving local neighborhoods in data unnoticeability, and better privacy protection for high-degree nodes.
    Towards Reducing Labeling Cost in Deep Object Detection. (arXiv:2106.11921v1 [cs.CV])
    (2 min) Deep neural networks have reached very high accuracy on object detection but their success hinges on large amounts of labeled data. To reduce the dependency on labels, various active-learning strategies have been proposed, typically based on the confidence of the detector. However, these methods are biased towards best-performing classes and can lead to acquired datasets that are not good representatives of the data in the testing set. In this work, we propose a unified framework for active learning, that considers both the uncertainty and the robustness of the detector, ensuring that the network performs accurately in all classes. Furthermore, our method is able to pseudo-label the very confident predictions, suppressing a potential distribution drift while further boosting the performance of the model. Experiments show that our method comprehensively outperforms a wide range of active-learning methods on PASCAL VOC07+12 and MS-COCO, having up to a 7.7% relative improvement, or up to 82% reduction in labeling cost.
    Towards Solving Inefficiency of Self-supervised Representation Learning. (arXiv:2104.08760v2 [cs.CV] UPDATED)
    (2 min) Self-supervised learning (especially contrastive learning) has attracted great interest due to its tremendous potentials in learning discriminative representations in an unsupervised manner. Despite the acknowledged successes, existing contrastive learning methods suffer from very low learning efficiency, e.g., taking about ten times more training epochs than supervised learning for comparable recognition accuracy. In this paper, we discover two contradictory phenomena in contrastive learning that we call under-clustering and over-clustering problems, which are major obstacles to learning efficiency. Under-clustering means that the model cannot efficiently learn to discover the dissimilarity between inter-class samples when the negative sample pairs for contrastive learning are insufficient to differentiate all the actual object categories. Over-clustering implies that the model cannot efficiently learn the feature representation from excessive negative sample pairs, which enforces the model to over-cluster samples of the same actual categories into different clusters. To simultaneously overcome these two problems, we propose a novel self-supervised learning framework using a median triplet loss. Precisely, we employ a triplet loss tending to maximize the relative distance between the positive pair and negative pairs to address the under-clustering problem; and we construct the negative pair by selecting the negative sample of a median similarity score from all negative samples to avoid the over-clustering problem, guaranteed by the Bernoulli Distribution model. We extensively evaluate our proposed framework in several large-scale benchmarks (e.g., ImageNet, SYSU-30k, and COCO). The results demonstrate the superior performance (e.g., the learning efficiency) of our model over the latest state-of-the-art methods by a clear margin. Codes available at: https://github.com/wanggrun/triplet.
    HDMI: High-order Deep Multiplex Infomax. (arXiv:2102.07810v4 [cs.LG] UPDATED)
    (2 min) Networks have been widely used to represent the relations between objects such as academic networks and social networks, and learning embedding for networks has thus garnered plenty of research attention. Self-supervised network representation learning aims at extracting node embedding without external supervision. Recently, maximizing the mutual information between the local node embedding and the global summary (e.g. Deep Graph Infomax, or DGI for short) has shown promising results on many downstream tasks such as node classification. However, there are two major limitations of DGI. Firstly, DGI merely considers the extrinsic supervision signal (i.e., the mutual information between node embedding and global summary) while ignores the intrinsic signal (i.e., the mutual dependence between node embedding and node attributes). Secondly, nodes in a real-world network are usually connected by multiple edges with different relations, while DGI does not fully explore the various relations among nodes. To address the above-mentioned problems, we propose a novel framework, called High-order Deep Multiplex Infomax (HDMI), for learning node embedding on multiplex networks in a self-supervised way. To be more specific, we first design a joint supervision signal containing both extrinsic and intrinsic mutual information by high-order mutual information, and we propose a High-order Deep Infomax (HDI) to optimize the proposed supervision signal. Then we propose an attention based fusion module to combine node embedding from different layers of the multiplex network. Finally, we evaluate the proposed HDMI on various downstream tasks such as unsupervised clustering and supervised classification. The experimental results show that HDMI achieves state-of-the-art performance on these tasks.
    Stochastic Polyak Stepsize with a Moving Target. (arXiv:2106.11851v1 [cs.LG])
    (2 min) We propose a new stochastic gradient method that uses recorded past loss values to reduce the variance. Our method can be interpreted as a new stochastic variant of the Polyak Stepsize that converges globally without assuming interpolation. Our method introduces auxiliary variables, one for each data point, that track the loss value for each data point. We provide a global convergence theory for our method by showing that it can be interpreted as a special variant of online SGD. The new method only stores a single scalar per data point, opening up new applications for variance reduction where memory is the bottleneck.
    On Constrained Optimization in Differentiable Neural Architecture Search. (arXiv:2106.11655v1 [cs.LG])
    (2 min) Differentiable Architecture Search (DARTS) is a recently proposed neural architecture search (NAS) method based on a differentiable relaxation. Due to its success, numerous variants analyzing and improving parts of the DARTS framework have recently been proposed. By considering the problem as a constrained bilevel optimization, we propose and analyze three improvements to architectural weight competition, update scheduling, and regularization towards discretization. First, we introduce a new approach to the activation of architecture weights, which prevents confounding competition within an edge and allows for fair comparison across edges to aid in discretization. Next, we propose a dynamic schedule based on per-minibatch network information to make architecture updates more informed. Finally, we consider two regularizations, based on proximity to discretization and the Alternating Directions Method of Multipliers (ADMM) algorithm, to promote early discretization. Our results show that this new activation scheme reduces final architecture size and the regularizations improve reliability in search results while maintaining comparable performance to state-of-the-art in NAS, especially when used with our new dynamic informed schedule.
    Latent-CF: A Simple Baseline for Reverse Counterfactual Explanations. (arXiv:2012.09301v2 [cs.LG] UPDATED)
    (2 min) In the environment of fair lending laws and the General Data Protection Regulation (GDPR), the ability to explain a model's prediction is of paramount importance. High quality explanations are the first step in assessing fairness. Counterfactuals are valuable tools for explainability. They provide actionable, comprehensible explanations for the individual who is subject to decisions made from the prediction. It is important to find a baseline for producing them. We propose a simple method for generating counterfactuals by using gradient descent to search in the latent space of an autoencoder and benchmark our method against approaches that search for counterfactuals in feature space. Additionally, we implement metrics to concretely evaluate the quality of the counterfactuals. We show that latent space counterfactual generation strikes a balance between the speed of basic feature gradient descent methods and the sparseness and authenticity of counterfactuals generated by more complex feature space oriented techniques.
    Bayesian Neural Network via Stochastic Gradient Descent. (arXiv:2006.08453v4 [cs.LG] UPDATED)
    (2 min) The goal of bayesian approach used in variational inference is to minimize the KL divergence between variational distribution and unknown posterior distribution. This is done by maximizing the Evidence Lower Bound (ELBO). A neural network is used to parametrize these distributions using Stochastic Gradient Descent. This work extends the work done by others by deriving the variational inference models. We show how SGD can be applied on bayesian neural networks by gradient estimation techniques. For validation, we have tested our model on 5 UCI datasets and the metrics chosen for evaluation are Root Mean Square Error (RMSE) error and negative log likelihood. Our work considerably beats the previous state of the art approaches for regression using bayesian neural networks.
    MIMIR: Deep Regression for Automated Analysis of UK Biobank Body MRI. (arXiv:2106.11731v1 [eess.IV])
    (2 min) UK Biobank (UKB) is conducting a large-scale study of more than half a million volunteers, collecting health-related information on genetics, lifestyle, blood biochemistry, and more. Medical imaging furthermore targets 100,000 subjects, with 70,000 follow-up sessions, enabling measurements of organs, muscle, and body composition. With up to 170,000 mounting MR images, various methodologies are accordingly engaged in large-scale image analysis. This work presents an experimental inference engine that can automatically predict a comprehensive profile of subject metadata from UKB neck-to-knee body MRI. In cross-validation, it accurately inferred baseline characteristics such as age, height, weight, and sex, but also emulated measurements of body composition by DXA, organ volumes, and abstract properties like grip strength, pulse rate, and type 2 diabetic status (AUC: 0.866). The proposed system can automatically analyze thousands of subjects within hours and provide individual confidence intervals. The underlying methodology is based on convolutional neural networks for image-based mean-variance regression on two-dimensional representations of the MRI data. This work aims to make the proposed system available for free to researchers, who can use it to obtain fast and fully-automated estimates of 72 different measurements immediately upon release of new UK Biobank image data.
    LV-BERT: Exploiting Layer Variety for BERT. (arXiv:2106.11740v1 [cs.CL])
    (2 min) Modern pre-trained language models are mostly built upon backbones stacking self-attention and feed-forward layers in an interleaved order. In this paper, beyond this stereotyped layer pattern, we aim to improve pre-trained models by exploiting layer variety from two aspects: the layer type set and the layer order. Specifically, besides the original self-attention and feed-forward layers, we introduce convolution into the layer type set, which is experimentally found beneficial to pre-trained models. Furthermore, beyond the original interleaved order, we explore more layer orders to discover more powerful architectures. However, the introduced layer variety leads to a large architecture space of more than billions of candidates, while training a single candidate model from scratch already requires huge computation cost, making it not affordable to search such a space by directly training large amounts of candidate models. To solve this problem, we first pre-train a supernet from which the weights of all candidate models can be inherited, and then adopt an evolutionary algorithm guided by pre-training accuracy to find the optimal architecture. Extensive experiments show that LV-BERT model obtained by our method outperforms BERT and its variants on various downstream tasks. For example, LV-BERT-small achieves 78.8 on the GLUE testing set, 1.8 higher than the strong baseline ELECTRA-small.
    Solving stochastic optimal control problem via stochastic maximum principle with deep learning method. (arXiv:2007.02227v5 [math.OC] UPDATED)
    (2 min) In this paper, we aim to solve the high dimensional stochastic optimal control problem from the view of the stochastic maximum principle via deep learning. By introducing the extended Hamiltonian system which is essentially an FBSDE with a maximum condition, we reformulate the original control problem as a new one. Three algorithms are proposed to solve the new control problem. Numerical results for different examples demonstrate the effectiveness of our proposed algorithms, especially in high dimensional cases. And an important application of this method is to calculate the sub-linear expectations, which correspond to a kind of fully nonlinear PDEs.
    Credal Self-Supervised Learning. (arXiv:2106.11853v1 [stat.ML])
    (2 min) Self-training is an effective approach to semi-supervised learning. The key idea is to let the learner itself iteratively generate "pseudo-supervision" for unlabeled instances based on its current hypothesis. In combination with consistency regularization, pseudo-labeling has shown promising performance in various domains, for example in computer vision. To account for the hypothetical nature of the pseudo-labels, these are commonly provided in the form of probability distributions. Still, one may argue that even a probability distribution represents an excessive level of informedness, as it suggests that the learner precisely knows the ground-truth conditional probabilities. In our approach, we therefore allow the learner to label instances in the form of credal sets, that is, sets of (candidate) probability distributions. Thanks to this increased expressiveness, the learner is able to represent uncertainty and a lack of knowledge in a more flexible and more faithful manner. To learn from weakly labeled data of that kind, we leverage methods that have recently been proposed in the realm of so-called superset learning. In an exhaustive empirical evaluation, we compare our methodology to state-of-the-art self-supervision approaches, showing competitive to superior performance especially in low-label scenarios incorporating a high degree of uncertainty.
    Emphatic Algorithms for Deep Reinforcement Learning. (arXiv:2106.11779v1 [cs.LG])
    (2 min) Off-policy learning allows us to learn about possible policies of behavior from experience generated by a different behavior policy. Temporal difference (TD) learning algorithms can become unstable when combined with function approximation and off-policy sampling - this is known as the ''deadly triad''. Emphatic temporal difference (ETD($\lambda$)) algorithm ensures convergence in the linear case by appropriately weighting the TD($\lambda$) updates. In this paper, we extend the use of emphatic methods to deep reinforcement learning agents. We show that naively adapting ETD($\lambda$) to popular deep reinforcement learning algorithms, which use forward view multi-step returns, results in poor performance. We then derive new emphatic algorithms for use in the context of such algorithms, and we demonstrate that they provide noticeable benefits in small problems designed to highlight the instability of TD methods. Finally, we observed improved performance when applying these algorithms at scale on classic Atari games from the Arcade Learning Environment.
    Multiple Organ Failure Prediction with Classifier-Guided Generative Adversarial Imputation Networks. (arXiv:2106.11878v1 [cs.LG])
    (2 min) Multiple organ failure (MOF) is a severe syndrome with a high mortality rate among Intensive Care Unit (ICU) patients. Early and precise detection is critical for clinicians to make timely decisions. An essential challenge in applying machine learning models to electronic health records (EHRs) is the pervasiveness of missing values. Most existing imputation methods are involved in the data preprocessing phase, failing to capture the relationship between data and outcome for downstream predictions. In this paper, we propose classifier-guided generative adversarial imputation networks Classifier-GAIN) for MOF prediction to bridge this gap, by incorporating both observed data and label information. Specifically, the classifier takes imputed values from the generator(imputer) to predict task outcomes and provides additional supervision signals to the generator by joint training. The classifier-guide generator imputes missing values with label-awareness during training, improving the classifier's performance during inference. We conduct extensive experiments showing that our approach consistently outperforms classical and state-of-art neural baselines across a range of missing data scenarios and evaluation metrics.
    Asynchronous Stochastic Optimization Robust to Arbitrary Delays. (arXiv:2106.11879v1 [math.OC])
    (2 min) We consider stochastic optimization with delayed gradients where, at each time step $t$, the algorithm makes an update using a stale stochastic gradient from step $t - d_t$ for some arbitrary delay $d_t$. This setting abstracts asynchronous distributed optimization where a central server receives gradient updates computed by worker machines. These machines can experience computation and communication loads that might vary significantly over time. In the general non-convex smooth optimization setting, we give a simple and efficient algorithm that requires $O( \sigma^2/\epsilon^4 + \tau/\epsilon^2 )$ steps for finding an $\epsilon$-stationary point $x$, where $\tau$ is the \emph{average} delay $\smash{\frac{1}{T}\sum_{t=1}^T d_t}$ and $\sigma^2$ is the variance of the stochastic gradients. This improves over previous work, which showed that stochastic gradient decent achieves the same rate but with respect to the \emph{maximal} delay $\max_{t} d_t$, that can be significantly larger than the average delay especially in heterogeneous distributed systems. Our experiments demonstrate the efficacy and robustness of our algorithm in cases where the delay distribution is skewed or heavy-tailed.
    Machine learning for risk assessment in gender-based crime. (arXiv:2106.11847v1 [cs.CY])
    (2 min) Gender-based crime is one of the most concerning scourges of contemporary society. Governments worldwide have invested lots of economic and human resources to radically eliminate this threat. Despite these efforts, providing accurate predictions of the risk that a victim of gender violence has of being attacked again is still a very hard open problem. The development of new methods for issuing accurate, fair and quick predictions would allow police forces to select the most appropriate measures to prevent recidivism. In this work, we propose to apply Machine Learning (ML) techniques to create models that accurately predict the recidivism risk of a gender-violence offender. The relevance of the contribution of this work is threefold: (i) the proposed ML method outperforms the preexisting risk assessment algorithm based on classical statistical techniques, (ii) the study has been conducted through an official specific-purpose database with more than 40,000 reports of gender violence, and (iii) two new quality measures are proposed for assessing the effective police protection that a model supplies and the overload in the invested resources that it generates. Additionally, we propose a hybrid model that combines the statistical prediction methods with the ML method, permitting authorities to implement a smooth transition from the preexisting model to the ML-based model. This hybrid nature enables a decision-making process to optimally balance between the efficiency of the police system and aggressiveness of the protection measures taken.
    A Stealthy and Robust Fingerprinting Scheme for Generative Models. (arXiv:2106.11760v1 [cs.CR])
    (2 min) This paper presents a novel fingerprinting methodology for the Intellectual Property protection of generative models. Prior solutions for discriminative models usually adopt adversarial examples as the fingerprints, which give anomalous inference behaviors and prediction results. Hence, these methods are not stealthy and can be easily recognized by the adversary. Our approach leverages the invisible backdoor technique to overcome the above limitation. Specifically, we design verification samples, whose model outputs look normal but can trigger a backdoor classifier to make abnormal predictions. We propose a new backdoor embedding approach with Unique-Triplet Loss and fine-grained categorization to enhance the effectiveness of our fingerprints. Extensive evaluations show that this solution can outperform other strategies with higher robustness, uniqueness and stealthiness for various GAN models.
    Uniform-PAC Bounds for Reinforcement Learning with Linear Function Approximation. (arXiv:2106.11612v1 [cs.LG])
    (2 min) We study reinforcement learning (RL) with linear function approximation. Existing algorithms for this problem only have high-probability regret and/or Probably Approximately Correct (PAC) sample complexity guarantees, which cannot guarantee the convergence to the optimal policy. In this paper, in order to overcome the limitation of existing algorithms, we propose a new algorithm called FLUTE, which enjoys uniform-PAC convergence to the optimal policy with high probability. The uniform-PAC guarantee is the strongest possible guarantee for reinforcement learning in the literature, which can directly imply both PAC and high probability regret bounds, making our algorithm superior to all existing algorithms with linear function approximation. At the core of our algorithm is a novel minimax value function estimator and a multi-level partition scheme to select the training samples from historical observations. Both of these techniques are new and of independent interest.
    Latency-Aware Neural Architecture Search with Multi-Objective Bayesian Optimization. (arXiv:2106.11890v1 [cs.LG])
    (2 min) When tuning the architecture and hyperparameters of large machine learning models for on-device deployment, it is desirable to understand the optimal trade-offs between on-device latency and model accuracy. In this work, we leverage recent methodological advances in Bayesian optimization over high-dimensional search spaces and multi-objective Bayesian optimization to efficiently explore these trade-offs for a production-scale on-device natural language understanding model at Facebook.
    Failing with Grace: Learning Neural Network Controllers that are Boundedly Unsafe. (arXiv:2106.11881v1 [eess.SY])
    (2 min) In this work, we consider the problem of learning a feed-forward neural network (NN) controller to safely steer an arbitrarily shaped planar robot in a compact and obstacle-occluded workspace. Unlike existing methods that depend strongly on the density of data points close to the boundary of the safe state space to train NN controllers with closed-loop safety guarantees, we propose an approach that lifts such assumptions on the data that are hard to satisfy in practice and instead allows for graceful safety violations, i.e., of a bounded magnitude that can be spatially controlled. To do so, we employ reachability analysis methods to encapsulate safety constraints in the training process. Specifically, to obtain a computationally efficient over-approximation of the forward reachable set of the closed-loop system, we partition the robot's state space into cells and adaptively subdivide the cells that contain states which may escape the safe set under the trained control law. To do so, we first design appropriate under- and over-approximations of the robot's footprint to adaptively subdivide the configuration space into cells. Then, using the overlap between each cell's forward reachable set and the set of infeasible robot configurations as a measure for safety violations, we introduce penalty terms into the loss function that penalize this overlap in the training process. As a result, our method can learn a safe vector field for the closed-loop system and, at the same time, provide numerical worst-case bounds on safety violation over the whole configuration space, defined by the overlap between the over-approximation of the forward reachable set of the closed-loop system and the set of unsafe states. Moreover, it can control the tradeoff between computational complexity and tightness of these bounds. Finally, we provide a simulation study that verifies the efficacy of the proposed scheme.
    Data Augmentation for Meta-Learning. (arXiv:2010.07092v2 [cs.LG] UPDATED)
    (2 min) Conventional image classifiers are trained by randomly sampling mini-batches of images. To achieve state-of-the-art performance, practitioners use sophisticated data augmentation schemes to expand the amount of training data available for sampling. In contrast, meta-learning algorithms sample support data, query data, and tasks on each training step. In this complex sampling scenario, data augmentation can be used not only to expand the number of images available per class, but also to generate entirely new classes/tasks. We systematically dissect the meta-learning pipeline and investigate the distinct ways in which data augmentation can be integrated at both the image and class levels. Our proposed meta-specific data augmentation significantly improves the performance of meta-learners on few-shot classification benchmarks.
    Deep Learning for Suicide and Depression Identification with Unsupervised Label Correction. (arXiv:2102.09427v2 [cs.LG] UPDATED)
    (2 min) Early detection of suicidal ideation in depressed individuals can allow for adequate medical attention and support, which in many cases is life-saving. Recent NLP research focuses on classifying, from a given piece of text, if an individual is suicidal or clinically healthy. However, there have been no major attempts to differentiate between depression and suicidal ideation, which is an important clinical challenge. Due to the scarce availability of EHR data, suicide notes, or other similar verified sources, web query data has emerged as a promising alternative. Online sources, such as Reddit, allow for anonymity that prompts honest disclosure of symptoms, making it a plausible source even in a clinical setting. However, these online datasets also result in lower performance, which can be attributed to the inherent noise in web-scraped labels, which necessitates a noise-removal process. Thus, we propose SDCNL, a suicide versus depression classification method through a deep learning approach. We utilize online content from Reddit to train our algorithm, and to verify and correct noisy labels, we propose a novel unsupervised label correction method which, unlike previous work, does not require prior noise distribution information. Our extensive experimentation with multiple deep word embedding models and classifiers display the strong performance of the method in anew, challenging classification application. We make our code and dataset available at https://github.com/ayaanzhaque/SDCNL
    SSUL: Semantic Segmentation with Unknown Label for Exemplar-based Class-Incremental Learning. (arXiv:2106.11562v1 [cs.CV])
    (2 min) We consider a class-incremental semantic segmentation (CISS) problem. While some recently proposed algorithms utilized variants of knowledge distillation (KD) technique to tackle the problem, they only partially addressed the key additional challenges in CISS that causes the catastrophic forgetting; i.e., the semantic drift of the background class and multi-label prediction issue. To better address these challenges, we propose a new method, dubbed as SSUL-M (Semantic Segmentation with Unknown Label with Memory), by carefully combining several techniques tailored for semantic segmentation. More specifically, we make three main contributions; (1) modeling unknown class within the background class to help learning future classes (help plasticity), (2) freezing backbone network and past classifiers with binary cross-entropy loss and pseudo-labeling to overcome catastrophic forgetting (help stability), and (3) utilizing tiny exemplar memory for the first time in CISS to improve both plasticity and stability. As a result, we show our method achieves significantly better performance than the recent state-of-the-art baselines on the standard benchmark datasets. Furthermore, we justify our contributions with thorough and extensive ablation analyses and discuss different natures of the CISS problem compared to the standard class-incremental learning for classification.
    A Deep Latent Space Model for Graph Representation Learning. (arXiv:2106.11721v1 [cs.LG])
    (2 min) Graph representation learning is a fundamental problem for modeling relational data and benefits a number of downstream applications. Traditional Bayesian-based graph models and recent deep learning based GNN either suffer from impracticability or lack interpretability, thus combined models for undirected graphs have been proposed to overcome the weaknesses. As a large portion of real-world graphs are directed graphs (of which undirected graphs are special cases), in this paper, we propose a Deep Latent Space Model (DLSM) for directed graphs to incorporate the traditional latent variable based generative model into deep learning frameworks. Our proposed model consists of a graph convolutional network (GCN) encoder and a stochastic decoder, which are layer-wise connected by a hierarchical variational auto-encoder architecture. By specifically modeling the degree heterogeneity using node random factors, our model possesses better interpretability in both community structure and degree heterogeneity. For fast inference, the stochastic gradient variational Bayes (SGVB) is adopted using a non-iterative recognition model, which is much more scalable than traditional MCMC-based methods. The experiments on real-world datasets show that the proposed model achieves the state-of-the-art performances on both link prediction and community detection tasks while learning interpretable node embeddings. The source code is available at https://github.com/upperr/DLSM.
    Deep Stereo Image Compression with Decoder Side Information using Wyner Common Information. (arXiv:2106.11723v1 [eess.IV])
    (2 min) We present a novel deep neural network (DNN) architecture for compressing an image when a correlated image is available as side information only at the decoder. This problem is known as distributed source coding (DSC) in information theory. In particular, we consider a pair of stereo images, which generally have high correlation with each other due to overlapping fields of view, and assume that one image of the pair is to be compressed and transmitted, while the other image is available only at the decoder. In the proposed architecture, the encoder maps the input image to a latent space, quantizes the latent representation, and compresses it using entropy coding. The decoder is trained to extract the Wyner's common information between the input image and the correlated image from the latter. The received latent representation and the locally generated common information are passed through a decoder network to obtain an enhanced reconstruction of the input image. The common information provides a succinct representation of the relevant information at the receiver. We train and demonstrate the effectiveness of the proposed approach on the KITTI dataset of stereo image pairs. Our results show that the proposed architecture is capable of exploiting the decoder-only side information, and outperforms previous work on stereo image compression with decoder side information.
    Sphynx: ReLU-Efficient Network Design for Private Inference. (arXiv:2106.11755v1 [cs.CR])
    (2 min) The emergence of deep learning has been accompanied by privacy concerns surrounding users' data and service providers' models. We focus on private inference (PI), where the goal is to perform inference on a user's data sample using a service provider's model. Existing PI methods for deep networks enable cryptographically secure inference with little drop in functionality; however, they incur severe latency costs, primarily caused by non-linear network operations (such as ReLUs). This paper presents Sphynx, a ReLU-efficient network design method based on micro-search strategies for convolutional cell design. Sphynx achieves Pareto dominance over all existing private inference methods on CIFAR-100. We also design large-scale networks that support cryptographically private inference on Tiny-ImageNet and ImageNet.
    MMD-MIX: Value Function Factorisation with Maximum Mean Discrepancy for Cooperative Multi-Agent Reinforcement Learning. (arXiv:2106.11652v1 [cs.MA])
    (2 min) In the real world, many tasks require multiple agents to cooperate with each other under the condition of local observations. To solve such problems, many multi-agent reinforcement learning methods based on Centralized Training with Decentralized Execution have been proposed. One representative class of work is value decomposition, which decomposes the global joint Q-value $Q_\text{jt}$ into individual Q-values $Q_a$ to guide individuals' behaviors, e.g. VDN (Value-Decomposition Networks) and QMIX. However, these baselines often ignore the randomness in the situation. We propose MMD-MIX, a method that combines distributional reinforcement learning and value decomposition to alleviate the above weaknesses. Besides, to improve data sampling efficiency, we were inspired by REM (Random Ensemble Mixture) which is a robust RL algorithm to explicitly introduce randomness into the MMD-MIX. The experiments demonstrate that MMD-MIX outperforms prior baselines in the StarCraft Multi-Agent Challenge (SMAC) environment.
    Self-Supervised Iterative Contextual Smoothing for Efficient Adversarial Defense against Gray- and Black-Box Attack. (arXiv:2106.11644v1 [cs.CV])
    (2 min) We propose a novel and effective input transformation based adversarial defense method against gray- and black-box attack, which is computationally efficient and does not require any adversarial training or retraining of a classification model. We first show that a very simple iterative Gaussian smoothing can effectively wash out adversarial noise and achieve substantially high robust accuracy. Based on the observation, we propose Self-Supervised Iterative Contextual Smoothing (SSICS), which aims to reconstruct the original discriminative features from the Gaussian-smoothed image in context-adaptive manner, while still smoothing out the adversarial noise. From the experiments on ImageNet, we show that our SSICS achieves both high standard accuracy and very competitive robust accuracy for the gray- and black-box attacks; e.g., transfer-based PGD-attack and score-based attack. A note-worthy point to stress is that our defense is free of computationally expensive adversarial training, yet, can approach its robust accuracy via input transformation.
    Variance-Aware Off-Policy Evaluation with Linear Function Approximation. (arXiv:2106.11960v1 [cs.LG])
    (2 min) We study the off-policy evaluation (OPE) problem in reinforcement learning with linear function approximation, which aims to estimate the value function of a target policy based on the offline data collected by a behavior policy. We propose to incorporate the variance information of the value function to improve the sample efficiency of OPE. More specifically, for time-inhomogeneous episodic linear Markov decision processes (MDPs), we propose an algorithm, VA-OPE, which uses the estimated variance of the value function to reweight the Bellman residual in Fitted Q-Iteration. We show that our algorithm achieves a tighter error bound than the best-known result. We also provide a fine-grained characterization of the distribution shift between the behavior policy and the target policy. Extensive numerical experiments corroborate our theory.
    From SIR to SEAIRD: a novel data-driven modeling approach based on the Grey-box System Theory to predict the dynamics of COVID-19. (arXiv:2106.11918v1 [stat.AP])
    (3 min) Common compartmental modeling for COVID-19 is based on a priori knowledge and numerous assumptions. Additionally, they do not systematically incorporate asymptomatic cases. Our study aimed at providing a framework for data-driven approaches, by leveraging the strengths of the grey-box system theory or grey-box identification, known for its robustness in problem solving under partial, incomplete, or uncertain data. Empirical data on confirmed cases and deaths, extracted from an open source repository were used to develop the SEAIRD compartment model. Adjustments were made to fit current knowledge on the COVID-19 behavior. The model was implemented and solved using an Ordinary Differential Equation solver and an optimization tool. A cross-validation technique was applied, and the coefficient of determination $R^2$ was computed in order to evaluate the goodness-of-fit of the model. %to the data. Key epidemiological parameters were finally estimated and we provided the rationale for the construction of SEAIRD model. When applied to Brazil's cases, SEAIRD produced an excellent agreement to the data, with an %coefficient of determination $R^2$ $\geq 90\%$. The probability of COVID-19 transmission was generally high ($\geq 95\%$). On the basis of a 20-day modeling data, the incidence rate of COVID-19 was as low as 3 infected cases per 100,000 exposed persons in Brazil and France. Within the same time frame, the fatality rate of COVID-19 was the highest in France (16.4\%) followed by Brazil (6.9\%), and the lowest in Russia ($\leq 1\%$). SEAIRD represents an asset for modeling infectious diseases in their dynamical stable phase, especially for new viruses when pathophysiology knowledge is very limited.
    Including Sparse Production Knowledge into Variational Autoencoders to Increase Anomaly Detection Reliability. (arXiv:2103.12998v2 [cs.LG] UPDATED)
    (2 min) Digitalization leads to data transparency for production systems that we can benefit from with data-driven analysis methods like neural networks. For example, automated anomaly detection enables saving resources and optimizing the production. We study using rarely occurring information about labeled anomalies into Variational Autoencoder neural network structures to overcome information deficits of supervised and unsupervised approaches. This method outperforms all other models in terms of accuracy, precision, and recall. We evaluate the following methods: Principal Component Analysis, Isolation Forest, Classifying Neural Networks, and Variational Autoencoders on seven time series datasets to find the best performing detection methods. We extend this idea to include more infrequently occurring meta information about production processes. This use of sparse labels, both of anomalies or production data, allows to harness any additional information available for increasing anomaly detection performance.
    Constrained Ensemble Langevin Monte Carlo. (arXiv:2102.04279v2 [stat.ML] UPDATED)
    (2 min) The classical Langevin Monte Carlo method looks for i.i.d. samples from a target distribution by descending along the gradient of the target distribution. It is popular partially due to its fast convergence rate. However, the numerical cost is sometimes high because the gradient can be hard to obtain. One approach to eliminate the gradient computation is to employ the concept of "ensemble", where a large number of particles are evolved together so that the neighboring particles provide gradient information to each other. In this article, we discuss two algorithms that integrate the ensemble feature into LMC, and the associated properties. There are two sides of our discovery: 1. By directly surrogating the gradient using the ensemble approximation, we develop Ensemble Langevin Monte Carlo. We show that this method is unstable due to a potentially small denominator that induces high variance. We provide a counterexample to explicitly show this instability. 2. We then change the strategy and enact the ensemble approximation to the gradient only in a constrained manner, to eliminate the unstable points. The algorithm is termed Constrained Ensemble Langevin Monte Carlo. We show that, with a proper tuning, the surrogation takes place often enough to bring the reasonable numerical saving, while the induced error is still low enough for us to maintain the fast convergence rate, up to a controllable discretization and ensemble error. Such combination of ensemble method and LMC shed light on inventing gradient-free algorithms that produce i.i.d. samples almost exponentially fast.
    The Hitchhiker's Guide to Prior-Shift Adaptation. (arXiv:2106.11695v1 [cs.CV])
    (2 min) In many computer vision classification tasks, class priors at test time often differ from priors on the training set. In the case of such prior shift, classifiers must be adapted correspondingly to maintain close to optimal performance. This paper analyzes methods for adaptation of probabilistic classifiers to new priors and for estimating new priors on an unlabeled test set. We propose a novel method to address a known issue of prior estimation methods based on confusion matrices, where inconsistent estimates of decision probabilities and confusion matrices lead to negative values in the estimated priors. Experiments on fine-grained image classification datasets provide insight into the best practice of prior shift estimation and classifier adaptation and show that the proposed method achieves state-of-the-art results in prior adaptation. Applying the best practice to two tasks with naturally imbalanced priors, learning from web-crawled images and plant species classification, increased the recognition accuracy by 1.1% and 3.4% respectively.
    Bandit Learning in Decentralized Matching Markets. (arXiv:2012.07348v4 [cs.LG] UPDATED)
    (2 min) We study two-sided matching markets in which one side of the market (the players) does not have a priori knowledge about its preferences for the other side (the arms) and is required to learn its preferences from experience. Also, we assume the players have no direct means of communication. This model extends the standard stochastic multi-armed bandit framework to a decentralized multiple player setting with competition. We introduce a new algorithm for this setting that, over a time horizon $T$, attains $\mathcal{O}(\log(T))$ stable regret when preferences of the arms over players are shared, and $\mathcal{O}(\log(T)^2)$ regret when there are no assumptions on the preferences on either side. Moreover, in the setting where a single player may deviate, we show that the algorithm is incentive compatible whenever the arms' preferences are shared, but not necessarily so when preferences are fully general.
    Distributional Gradient Matching for Learning Uncertain Neural Dynamics Models. (arXiv:2106.11609v1 [cs.LG])
    (2 min) Differential equations in general and neural ODEs in particular are an essential technique in continuous-time system identification. While many deterministic learning algorithms have been designed based on numerical integration via the adjoint method, many downstream tasks such as active learning, exploration in reinforcement learning, robust control, or filtering require accurate estimates of predictive uncertainties. In this work, we propose a novel approach towards estimating epistemically uncertain neural ODEs, avoiding the numerical integration bottleneck. Instead of modeling uncertainty in the ODE parameters, we directly model uncertainties in the state space. Our algorithm - distributional gradient matching (DGM) - jointly trains a smoother and a dynamics model and matches their gradients via minimizing a Wasserstein loss. Our experiments show that, compared to traditional approximate inference methods based on numerical integration, our approach is faster to train, faster at predicting previously unseen trajectories, and in the context of neural ODEs, significantly more accurate.
    Distilled Replay: Overcoming Forgetting through Synthetic Samples. (arXiv:2103.15851v2 [cs.LG] UPDATED)
    (2 min) Replay strategies are Continual Learning techniques which mitigate catastrophic forgetting by keeping a buffer of patterns from previous experiences, which are interleaved with new data during training. The amount of patterns stored in the buffer is a critical parameter which largely influences the final performance and the memory footprint of the approach. This work introduces Distilled Replay, a novel replay strategy for Continual Learning which is able to mitigate forgetting by keeping a very small buffer (1 pattern per class) of highly informative samples. Distilled Replay builds the buffer through a distillation process which compresses a large dataset into a tiny set of informative examples. We show the effectiveness of our Distilled Replay against popular replay-based strategies on four Continual Learning benchmarks.
    Online Covariance Matrix Estimation in Stochastic Gradient Descent. (arXiv:2002.03979v3 [stat.ML] UPDATED)
    (2 min) The stochastic gradient descent (SGD) algorithm is widely used for parameter estimation, especially for huge data sets and online learning. While this recursive algorithm is popular for computation and memory efficiency, quantifying variability and randomness of the solutions has been rarely studied. This paper aims at conducting statistical inference of SGD-based estimates in an online setting. In particular, we propose a fully online estimator for the covariance matrix of averaged SGD iterates (ASGD) only using the iterates from SGD. We formally establish our online estimator's consistency and show that the convergence rate is comparable to offline counterparts. Based on the classic asymptotic normality results of ASGD, we construct asymptotically valid confidence intervals for model parameters. Upon receiving new observations, we can quickly update the covariance matrix estimate and the confidence intervals. This approach fits in an online setting and takes full advantage of SGD: efficiency in computation and memory.
    Revisiting Deep Learning Models for Tabular Data. (arXiv:2106.11959v1 [cs.LG])
    (2 min) The necessity of deep learning for tabular data is still an unanswered question addressed by a large number of research efforts. The recent literature on tabular DL proposes several deep architectures reported to be superior to traditional "shallow" models like Gradient Boosted Decision Trees. However, since existing works often use different benchmarks and tuning protocols, it is unclear if the proposed models universally outperform GBDT. Moreover, the models are often not compared to each other, therefore, it is challenging to identify the best deep model for practitioners. In this work, we start from a thorough review of the main families of DL models recently developed for tabular data. We carefully tune and evaluate them on a wide range of datasets and reveal two significant findings. First, we show that the choice between GBDT and DL models highly depends on data and there is still no universally superior solution. Second, we demonstrate that a simple ResNet-like architecture is a surprisingly effective baseline, which outperforms most of the sophisticated models from the DL literature. Finally, we design a simple adaptation of the Transformer architecture for tabular data that becomes a new strong DL baseline and reduces the gap between GBDT and DL models on datasets where GBDT dominates.
    Estimating Smooth GLM in Non-interactive Local Differential Privacy Model with Public Unlabeled Data. (arXiv:1910.00482v3 [cs.LG] UPDATED)
    (2 min) In this paper, we study the problem of estimating smooth Generalized Linear Models (GLM) in the Non-interactive Local Differential Privacy (NLDP) model. Different from its classical setting, our model allows the server to access some additional public but unlabeled data. By using Stein's lemma and its variants, we first show that there is an $(\epsilon, \delta)$-NLDP algorithm for GLM (under some mild assumptions), if each data record is i.i.d sampled from some sub-Gaussian distribution with bounded $\ell_1$-norm. Then with high probability, the sample complexity of the public and private data, for the algorithm to achieve an $\alpha$ estimation error (in $\ell_\infty$-norm), is $O(p^2\alpha^{-2})$ and ${O}(p^2\alpha^{-2}\epsilon^{-2})$, respectively, if $\alpha$ is not too small ({\em i.e.,} $\alpha\geq \Omega(\frac{1}{\sqrt{p}})$), where $p$ is the dimensionality of the data. This is a significant improvement over the previously known quasi-polynomial (in $\alpha$) or exponential (in $p$) complexity of GLM with no public data. Also, our algorithm can answer multiple (at most $\exp(O(p))$) GLM queries with the same sample complexities as in the one GLM query case with at least constant probability. We then extend our idea to the non-linear regression problem and show a similar phenomenon for it. Finally, we demonstrate the effectiveness of our algorithms through experiments on both synthetic and real world datasets. To our best knowledge, this is the first paper showing the existence of efficient and effective algorithms for GLM and non-linear regression in the NLDP model with public unlabeled data.
    Agnostic Reinforcement Learning with Low-Rank MDPs and Rich Observations. (arXiv:2106.11519v1 [cs.LG])
    (2 min) There have been many recent advances on provably efficient Reinforcement Learning (RL) in problems with rich observation spaces. However, all these works share a strong realizability assumption about the optimal value function of the true MDP. Such realizability assumptions are often too strong to hold in practice. In this work, we consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $\Pi$ that may not contain any near-optimal policy. We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP. Specifically, our algorithm enjoys a sample complexity bound of $\widetilde{O}\left((H^{4d} K^{3d} \log |\Pi|)/\epsilon^2\right)$ where $H$ is the length of episodes, $K$ is the number of actions and $\epsilon>0$ is the desired sub-optimality. We also provide a nearly matching lower bound for this agnostic setting that shows that the exponential dependence on rank is unavoidable, without further assumptions.
    FLEA: Provably Fair Multisource Learning from Unreliable Training Data. (arXiv:2106.11732v1 [cs.LG])
    (2 min) Fairness-aware learning aims at constructing classifiers that not only make accurate predictions, but do not discriminate against specific groups. It is a fast-growing area of machine learning with far-reaching societal impact. However, existing fair learning methods are vulnerable to accidental or malicious artifacts in the training data, which can cause them to unknowingly produce unfair classifiers. In this work we address the problem of fair learning from unreliable training data in the robust multisource setting, where the available training data comes from multiple sources, a fraction of which might be not representative of the true data distribution. We introduce FLEA, a filtering-based algorithm that allows the learning system to identify and suppress those data sources that would have a negative impact on fairness or accuracy if they were used for training. We show the effectiveness of our approach by a diverse range of experiments on multiple datasets. Additionally we prove formally that, given enough data, FLEA protects the learner against unreliable data as long as the fraction of affected data sources is less than half.
    Adapting Stepsizes by Momentumized Gradients Improves Optimization and Generalization. (arXiv:2106.11514v1 [cs.LG])
    (2 min) Adaptive gradient methods, such as \textsc{Adam}, have achieved tremendous success in machine learning. Scaling gradients by square roots of the running averages of squared past gradients, such methods are able to attain rapid training of modern deep neural networks. Nevertheless, they are observed to generalize worse than stochastic gradient descent (\textsc{SGD}) and tend to be trapped in local minima at an early stage during training. Intriguingly, we discover that substituting the gradient in the preconditioner term with the momentumized version in \textsc{Adam} can well solve the issues. The intuition is that gradient with momentum contains more accurate directional information and therefore its second moment estimation is a better choice for scaling than raw gradient's. Thereby we propose \textsc{AdaMomentum} as a new optimizer reaching the goal of training faster while generalizing better. We further develop a theory to back up the improvement in optimization and generalization and provide convergence guarantee under both convex and nonconvex settings. Extensive experiments on various models and tasks demonstrate that \textsc{AdaMomentum} exhibits comparable performance to \textsc{SGD} on vision tasks, and achieves state-of-the-art results consistently on other tasks including language processing.
    Supervised Momentum Contrastive Learning for Few-Shot Classification. (arXiv:2101.11058v2 [cs.CV] UPDATED)
    (2 min) Few-shot learning aims to transfer information from one task to enable generalization on novel tasks given a few examples. This information is present both in the domain and the class labels. In this work we investigate the complementary roles of these two sources of information by combining instance-discriminative contrastive learning and supervised learning in a single framework called Supervised Momentum Contrastive learning (SUPMOCO). Our approach avoids a problem observed in supervised learning where information in images not relevant to the task is discarded, which hampers their generalization to novel tasks. We show that (self-supervised) contrastive learning and supervised learning are mutually beneficial, leading to a new state-of-the-art on the META-DATASET - a recently introduced benchmark for few-shot learning. Our method is based on a simple modification of MOCO and scales better than prior work on combining supervised and self-supervised learning. This allows us to easily combine data from multiple domains leading to further improvements.
    Graph Neural Ordinary Differential Equations. (arXiv:1911.07532v4 [cs.LG] UPDATED)
    (2 min) We introduce the framework of continuous--depth graph neural networks (GNNs). Graph neural ordinary differential equations (GDEs) are formalized as the counterpart to GNNs where the input-output relationship is determined by a continuum of GNN layers, blending discrete topological structures and differential equations. The proposed framework is shown to be compatible with various static and autoregressive GNN models. Results prove general effectiveness of GDEs: in static settings they offer computational advantages by incorporating numerical methods in their forward pass; in dynamic settings, on the other hand, they are shown to improve performance by exploiting the geometry of the underlying dynamics.
    Stochastic Bayesian Neural Networks. (arXiv:2008.07587v3 [cs.LG] UPDATED)
    (2 min) Bayesian neural networks perform variational inference over the weights however calculation of the posterior distribution remains a challenge. Our work builds on variational inference techniques for bayesian neural networks using the original Evidence Lower Bound. In this paper, we present a stochastic bayesian neural network in which we maximize Evidence Lower Bound using a new objective function which we name as Stochastic Evidence Lower Bound. We evaluate our network on 5 publicly available UCI datasets using test RMSE and log likelihood as the evaluation metrics. We demonstrate that our work not only beats the previous state of the art algorithms but is also scalable to larger datasets.
    Generating abstractive summaries of Lithuanian news articles using a transformer model. (arXiv:2105.03279v2 [cs.CL] UPDATED)
    (2 min) In this work, we train the first monolingual Lithuanian transformer model on a relatively large corpus of Lithuanian news articles and compare various output decoding algorithms for abstractive news summarization. We achieve an average ROUGE-2 score 0.163, generated summaries are coherent and look impressive at first glance. However, some of them contain misleading information that is not so easy to spot. We describe all the technical details and share our trained model and accompanying code in an online open-source repository, as well as some characteristic samples of the generated summaries.
    Speed Benchmarking of Genetic Programming Frameworks. (arXiv:2106.11919v1 [cs.NE])
    (2 min) Genetic Programming (GP) is known to suffer from the burden of being computationally expensive by design. While, over the years, many techniques have been developed to mitigate this issue, data vectorization, in particular, is arguably still the most attractive strategy due to the parallel nature of GP. In this work, we employ a series of benchmarks meant to compare both the performance and evolution capabilities of different vectorized and iterative implementation approaches across several existing frameworks. Namely, TensorGP, a novel open-source engine written in Python, is shown to greatly benefit from the TensorFlow library to accelerate the domain evaluation phase in GP. The presented performance benchmarks demonstrate that the TensorGP engine manages to pull ahead, with relative speedups above two orders of magnitude for problems with a higher number of fitness cases. Additionally, as a consequence of being able to compute larger domains, we argue that TensorGP performance gains aid the discovery of more accurate candidate solutions.
    Physics-Informed Deep Reversible Regression Model for Temperature Field Reconstruction of Heat-Source Systems. (arXiv:2106.11929v1 [cs.LG])
    (2 min) Temperature monitoring during the life time of heat-source components in engineering systems becomes essential to ensure the normal work and even the long working life of the heat sources. However, prior methods, which mainly use the interpolate estimation, require large amounts of temperature tensors for an accurate estimation. To solve this problem, this work develops a novel physics-informed deep surrogate models for temperature field reconstruction. First, we defines the temperature field reconstruction task of heat-source systems. Then, this work develops the deep surrogate model mapping for the proposed task. Finally, considering the physical properties of heat transfer, this work proposes four different losses and joint learns the deep surrogate model with these losses. Experimental studies have conducted over typical two-dimensional heat-source systems to demonstrate the effectiveness and efficiency of the proposed physics-informed deep surrogate models for temperature field reconstruction.
    Smooth Sequential Optimisation with Delayed Feedback. (arXiv:2106.11294v2 [cs.LG] UPDATED)
    (2 min) Stochastic delays in feedback lead to unstable sequential learning using multi-armed bandits. Recently, empirical Bayesian shrinkage has been shown to improve reward estimation in bandit learning. Here, we propose a novel adaptation to shrinkage that estimates smoothed reward estimates from windowed cumulative inputs, to deal with incomplete knowledge from delayed feedback and non-stationary rewards. Using numerical simulations, we show that this adaptation retains the benefits of shrinkage, and improves the stability of reward estimation by more than 50%. Our proposal reduces variability in treatment allocations to the best arm by up to 3.8x, and improves statistical accuracy - with up to 8% improvement in true positive rates and 37% reduction in false positive rates. Together, these advantages enable control of the trade-off between speed and stability of adaptation, and facilitate human-in-the-loop sequential optimisation.
    Meta Adversarial Training against Universal Patches. (arXiv:2101.11453v2 [cs.LG] UPDATED)
    (2 min) Recently demonstrated physical-world adversarial attacks have exposed vulnerabilities in perception systems that pose severe risks for safety-critical applications such as autonomous driving. These attacks place adversarial artifacts in the physical world that indirectly cause the addition of a universal patch to inputs of a model that can fool it in a variety of contexts. Adversarial training is the most effective defense against image-dependent adversarial attacks. However, tailoring adversarial training to universal patches is computationally expensive since the optimal universal patch depends on the model weights which change during training. We propose meta adversarial training (MAT), a novel combination of adversarial training with meta-learning, which overcomes this challenge by meta-learning universal patches along with model training. MAT requires little extra computation while continuously adapting a large set of patches to the current model. MAT considerably increases robustness against universal patch attacks on image classification and traffic-light detection.
    An Update of a Progressively Expanded Database for Automated Lung Sound Analysis. (arXiv:2102.04062v2 [cs.SD] UPDATED)
    (2 min) A continuous real-time respiratory sound automated analysis system is needed in clinical practice. Previously, we established an open access lung sound database, HF_Lung_V1, and automated lung sound analysis algorithms capable of detecting inhalation, exhalation, continuous adventitious sounds (CASs) and discontinuous adventitious sounds (DASs). In this study, HF-Lung-V1 has been further expanded to HF-Lung-V2 with 1.45 times of increase in audio files. The convolutional neural network (CNN)-bidirectional gated recurrent unit (BiGRU) model was separately trained with training datasets of HF_Lung_V1 (V1_Train) and HF_Lung_V2 (V2_Train), and then were used for the performance comparisons of segment detection and event detection on both test datasets of HF_Lung_V1 (V1_Test) and HF_Lung_V2 (V2_Test). The performance of segment detection was measured by accuracy, predictive positive value (PPV), sensitivity, specificity, F1 score, receiver operating characteristic (ROC) curve and area under the curve (AUC), whereas that of event detection was evaluated with PPV, sensitivity, and F1 score. Results indicate that the model performance trained by V2_Train showed improvement on both V1_Test and V2_Test in inhalation, CASs and DASs, particularly in CASs, as well as on V1_Test in exhalation.
    Transformer-based Spatial-Temporal Feature Learning for EEG Decoding. (arXiv:2106.11170v1 [eess.SP] CROSS LISTED)
    (2 min) At present, people usually use some methods based on convolutional neural networks (CNNs) for Electroencephalograph (EEG) decoding. However, CNNs have limitations in perceiving global dependencies, which is not adequate for common EEG paradigms with a strong overall relationship. Regarding this issue, we propose a novel EEG decoding method that mainly relies on the attention mechanism. The EEG data is firstly preprocessed and spatially filtered. And then, we apply attention transforming on the feature-channel dimension so that the model can enhance more relevant spatial features. The most crucial step is to slice the data in the time dimension for attention transforming, and finally obtain a highly distinguishable representation. At this time, global averaging pooling and a simple fully-connected layer are used to classify different categories of EEG data. Experiments on two public datasets indicate that the strategy of attention transforming effectively utilizes spatial and temporal features. And we have reached the level of the state-of-the-art in multi-classification of EEG, with fewer parameters. As far as we know, it is the first time that a detailed and complete method based on the transformer idea has been proposed in this field. It has good potential to promote the practicality of brain-computer interface (BCI). The source code can be found at: \textit{https://github.com/anranknight/EEG-Transformer}.
    Preconditioned Riemannian Optimization on the Generalized Stiefel Manifold. (arXiv:1902.01635v3 [math.NA] UPDATED)
    (2 min) Optimization problems on the generalized Stiefel manifold (and products of it) are prevalent across science and engineering. For example, in computational science they arise in the symmetric (generalized) eigenvalue problem, in nonlinear eigenvalue problems, and in electronic structures computations, to name a few problems. In statistics and machine learning, they arise, for example, in various dimensionality reduction techniques such as canonical correlation analysis. In deep learning, regularization and improved stability can be obtained by constraining some layers to have parameter matrices that belong to the Stiefel manifold. Solving problems on the generalized Stiefel manifold can be approached via the tools of Riemannian optimization. However, using the standard geometric components for the generalized Stiefel manifold has two possible shortcoming: computing some of the geometric components can be too expensive and converge can be rather slow in certain cases. Both shortcomings can be addressed using a technique called Riemannian preconditioning, which amounts to using geometric components derived using a precoditioner that defines a Riemannian metric on the constraint manifold. In this paper we develop the geometric components required to perform Riemannian optimization on the generalized Stiefel manifold equipped with a non-standard metric, and illustrate theoretically and numerically the use of those components and the effect of Riemannian preconditioning for solving optimization problems on the generalized Stiefel manifold.
    Compressive Statistical Learning with Random Feature Moments. (arXiv:1706.07180v4 [stat.ML] UPDATED)
    (2 min) We describe a general framework -- compressive statistical learning -- for resource-efficient large-scale learning: the training collection is compressed in one pass into a low-dimensional sketch (a vector of random empirical generalized moments) that captures the information relevant to the considered learning task. A near-minimizer of the risk is computed from the sketch through the solution of a nonlinear least squares problem. We investigate sufficient sketch sizes to control the generalization error of this procedure. The framework is illustrated on compressive PCA, compressive clustering, and compressive Gaussian mixture Modeling with fixed known variance. The latter two are further developed in a companion paper.
    Categorising Fine-to-Coarse Grained Misinformation: An Empirical Study of COVID-19 Infodemic. (arXiv:2106.11702v1 [cs.SI])
    (2 min) The spreading COVID-19 misinformation over social media already draws the attention of many researchers. According to Google Scholar, about 26000 COVID-19 related misinformation studies have been published to date. Most of these studies focusing on 1) detect and/or 2) analysing the characteristics of COVID-19 related misinformation. However, the study of the social behaviours related to misinformation is often neglected. In this paper, we introduce a fine-grained annotated misinformation tweets dataset including social behaviours annotation (e.g. comment or question to the misinformation). The dataset not only allows social behaviours analysis but also suitable for both evidence-based or non-evidence-based misinformation classification task. In addition, we introduce leave claim out validation in our experiments and demonstrate the misinformation classification performance could be significantly different when applying to real-world unseen misinformation.
    Reusing Combinatorial Structure: Faster Iterative Projections over Submodular Base Polytopes. (arXiv:2106.11943v1 [cs.LG])
    (2 min) Optimization algorithms such as projected Newton's method, FISTA, mirror descent and its variants enjoy near-optimal regret bounds and convergence rates, but suffer from a computational bottleneck of computing "projections'' in potentially each iteration (e.g., $O(T^{1/2})$ regret of online mirror descent). On the other hand, conditional gradient variants solve a linear optimization in each iteration, but result in suboptimal rates (e.g., $O(T^{3/4})$ regret of online Frank-Wolfe). Motivated by this trade-off in runtime v/s convergence rates, we consider iterative projections of close-by points over widely-prevalent submodular base polytopes $B(f)$. We develop a toolkit to speed up the computation of projections using both discrete and continuous perspectives. We subsequently adapt the away-step Frank-Wolfe algorithm to use this information and enable early termination. For the special case of cardinality based submodular polytopes, we improve the runtime of computing certain Bregman projections by a factor of $\Omega(n/\log(n))$. Our theoretical results show orders of magnitude reduction in runtime in preliminary computational experiments.
    MEAL: Manifold Embedding-based Active Learning. (arXiv:2106.11858v1 [cs.CV])
    (2 min) Image segmentation is a common and challenging task in autonomous driving. Availability of sufficient pixel-level annotations for the training data is a hurdle. Active learning helps learning from small amounts of data by suggesting the most promising samples for labeling. In this work, we propose a new pool-based method for active learning, which proposes promising image regions, in each acquisition step. The problem is framed in an exploration-exploitation framework by combining an embedding based on Uniform Manifold Approximation to model representativeness with entropy as uncertainty measure to model informativeness. We applied our proposed method to the challenging autonomous driving data sets CamVid and Cityscapes and performed a quantitative comparison with state-of-the-art methods. We find that our active learning method achieves better performance on CamVid compared to other methods, while on Cityscapes, the performance lift was negligible.
    SISA: Securing Images by Selective Alteration. (arXiv:2106.11770v1 [cs.CR])
    (2 min) With an increase in mobile and camera devices' popularity, digital content in the form of images has increased drastically. As personal life is being continuously documented in pictures, the risk of losing it to eavesdroppers is a matter of grave concern. Secondary storage is the most preferred medium for the storage of personal and other images. Our work is concerned with the security of such images. While encryption is the best way to ensure image security, full encryption and decryption is a computationally-intensive process. Moreover, as cameras are getting better every day, image quality, and thus, the pixel density has increased considerably. The increased pixel density makes encryption and decryption more expensive. We, therefore, delve into selective encryption and selective blurring based on the region of interest. Instead of encrypting or blurring the entire photograph, we only encode selected regions of the image. We present a comparative analysis of the partial and full encryption of the photos. This kind of encoding will help us lower the encryption overhead without compromising security. The applications utilizing this technique will become more usable due to the reduction in the decryption time. Additionally, blurred images being more readable than encrypted ones, allowed us to define the level of security. We leverage the machine learning algorithms like Mask-RCNN (Region-based convolutional neural network) and YOLO (You Only Look Once) to select the region of interest. These algorithms have set new benchmarks for object recognition. We develop an end to end system to demonstrate our idea of selective encryption.
    Global inducing point variational posteriors for Bayesian neural networks and deep Gaussian processes. (arXiv:2005.08140v5 [stat.ML] UPDATED)
    (2 min) We consider the optimal approximate posterior over the top-layer weights in a Bayesian neural network for regression, and show that it exhibits strong dependencies on the lower-layer weights. We adapt this result to develop a correlated approximate posterior over the weights at all layers in a Bayesian neural network. We extend this approach to deep Gaussian processes, unifying inference in the two model classes. Our approximate posterior uses learned "global" inducing points, which are defined only at the input layer and propagated through the network to obtain inducing inputs at subsequent layers. By contrast, standard, "local", inducing point methods from the deep Gaussian process literature optimise a separate set of inducing inputs at every layer, and thus do not model correlations across layers. Our method gives state-of-the-art performance for a variational Bayesian method, without data augmentation or tempering, on CIFAR-10 of 86.7%, which is comparable to SGMCMC without tempering but with data augmentation (88% in Wenzel et al. 2020).
    An Equivalence Between Private Classification and Online Prediction. (arXiv:2003.00563v3 [cs.LG] UPDATED)
    (2 min) We prove that every concept class with finite Littlestone dimension can be learned by an (approximate) differentially-private algorithm. This answers an open question of Alon et al. (STOC 2019) who proved the converse statement (this question was also asked by Neel et al.~(FOCS 2019)). Together these two results yield an equivalence between online learnability and private PAC learnability. We introduce a new notion of algorithmic stability called "global stability" which is essential to our proof and may be of independent interest. We also discuss an application of our results to boosting the privacy and accuracy parameters of differentially-private learners.
    Dynamic Customer Embeddings for Financial Service Applications. (arXiv:2106.11880v1 [cs.LG])
    (2 min) As financial services (FS) companies have experienced drastic technology driven changes, the availability of new data streams provides the opportunity for more comprehensive customer understanding. We propose Dynamic Customer Embeddings (DCE), a framework that leverages customers' digital activity and a wide range of financial context to learn dense representations of customers in the FS industry. Our method examines customer actions and pageviews within a mobile or web digital session, the sequencing of the sessions themselves, and snapshots of common financial features across our organization at the time of login. We test our customer embeddings using real world data in three prediction problems: 1) the intent of a customer in their next digital session, 2) the probability of a customer calling the call centers after a session, and 3) the probability of a digital session to be fraudulent. DCE showed performance lift in all three downstream problems.
    Sparsistent Model Discovery. (arXiv:2106.11936v1 [stat.ML])
    (2 min) Discovering the partial differential equations underlying a spatio-temporal datasets from very limited observations is of paramount interest in many scientific fields. However, it remains an open question to know when model discovery algorithms based on sparse regression can actually recover the underlying physical processes. We trace back the poor of performance of Lasso based model discovery algorithms to its potential variable selection inconsistency: meaning that even if the true model is present in the library, it might not be selected. By first revisiting the irrepresentability condition (IRC) of the Lasso, we gain some insights of when this might occur. We then show that the adaptive Lasso will have more chances of verifying the IRC than the Lasso and propose to integrate it within a deep learning model discovery framework with stability selection and error control. Experimental results show we can recover several nonlinear and chaotic canonical PDEs with a single set of hyperparameters from a very limited number of samples at high noise levels.
    Effective Semi-Supervised Node Classification on Few-Labeled Graph Data. (arXiv:1910.02684v2 [cs.LG] UPDATED)
    (2 min) Graph neural networks (GNNs) are designed for semi-supervised node classification on graphs where only a small subset of nodes have class labels. However, under extreme cases when very few labels are available (e.g., 1 labeled node per class), GNNs suffer from severe result quality degradation. Several existing studies make an initial effort to ease this situation, but are still far from satisfactory. In this paper, on few-labeled graph data, we propose an effective framework ABN that is readily applicable to both shallow and deep GNN architectures and significantly boosts classification accuracy. In particular, on a benchmark dataset Cora with only 1 labeled node per class, while the classic graph convolutional network (GCN) only has 44.6% accuracy, an immediate instantiation of ABN over GCN achieves 62.5% accuracy; when applied to a deep architecture DAGNN, ABN improves accuracy from 59.8% to 66.4%, which is state of the art. ABN obtains superior performance through three main algorithmic designs. First, it selects high-quality unlabeled nodes via an adaptive pseudo labeling technique, so as to adaptively enhance the training process of GNNs. Second, ABN balances the labels of the selected nodes on real-world skewed graph data by pseudo label balancing. Finally, a negative sampling regularizer is designed for ABN to further utilize the unlabeled nodes. The effectiveness of the three techniques in ABN is well-validated by both theoretical and empirical analysis. Extensive experiments, comparing 12 existing approaches on 4 benchmark datasets, demonstrate that ABN achieves state-of-the-art performance.
    On the importance of cross-task features for class-incremental learning. (arXiv:2106.11930v1 [cs.LG])
    (2 min) In class-incremental learning, an agent with limited resources needs to learn a sequence of classification tasks, forming an ever growing classification problem, with the constraint of not being able to access data from previous tasks. The main difference with task-incremental learning, where a task-ID is available at inference time, is that the learner also needs to perform cross-task discrimination, i.e. distinguish between classes that have not been seen together. Approaches to tackle this problem are numerous and mostly make use of an external memory (buffer) of non-negligible size. In this paper, we ablate the learning of cross-task features and study its influence on the performance of basic replay strategies used for class-IL. We also define a new forgetting measure for class-incremental learning, and see that forgetting is not the principal cause of low performance. Our experimental results show that future algorithms for class-incremental learning should not only prevent forgetting, but also aim to improve the quality of the cross-task features. This is especially important when the number of classes per task is small.
    Learning Dynamical Systems from Noisy Sensor Measurements using Multiple Shooting. (arXiv:2106.11712v1 [cs.LG])
    (2 min) Modeling dynamical systems plays a crucial role in capturing and understanding complex physical phenomena. When physical models are not sufficiently accurate or hardly describable by analytical formulas, one can use generic function approximators such as neural networks to capture the system dynamics directly from sensor measurements. As for now, current methods to learn the parameters of these neural networks are highly sensitive to the inherent instability of most dynamical systems of interest, which in turn prevents the study of very long sequences. In this work, we introduce a generic and scalable method based on multiple shooting to learn latent representations of indirectly observed dynamical systems. We achieve state-of-the-art performances on systems observed directly from raw images. Further, we demonstrate that our method is robust to noisy measurements and can handle complex dynamical systems, such as chaotic ones.
    Privacy Amplification via Iteration for Shuffled and Online PNSGD. (arXiv:2106.11767v1 [cs.CR])
    (2 min) In this paper, we consider the framework of privacy amplification via iteration, which is originally proposed by Feldman et al. and subsequently simplified by Asoodeh et al. in their analysis via the contraction coefficient. This line of work focuses on the study of the privacy guarantees obtained by the projected noisy stochastic gradient descent (PNSGD) algorithm with hidden intermediate updates. A limitation in the existing literature is that only the early stopped PNSGD has been studied, while no result has been proved on the more widely-used PNSGD applied on a shuffled dataset. Moreover, no scheme has been yet proposed regarding how to decrease the injected noise when new data are received in an online fashion. In this work, we first prove a privacy guarantee for shuffled PNSGD, which is investigated asymptotically when the noise is fixed for each sample size $n$ but reduced at a predetermined rate when $n$ increases, in order to achieve the convergence of privacy loss. We then analyze the online setting and provide a faster decaying scheme for the magnitude of the injected noise that also guarantees the convergence of privacy loss.
    Rank-one matrix estimation with groupwise heteroskedasticity. (arXiv:2106.11950v1 [stat.ML])
    (2 min) We study the problem of estimating a rank-one matrix from Gaussian observations where different blocks of the matrix are observed under different noise levels. This problem is motivated by applications in clustering and community detection where latent variables can be partitioned into a fixed number of known groups (e.g., users and items) and the blocks of the matrix correspond to different types of pairwise interactions (e.g., user-user, user-item, or item-item interactions). In the setting where the number of blocks is fixed while the number of variables tends to infinity, we prove asymptotically exact formulas for the minimum mean-squared error in estimating both the matrix and the latent variables. These formulas describe the weak recovery thresholds for the problem and reveal invariance properties with respect to certain scalings of the noise variance. We also derive an approximate message passing algorithm and a gradient descent algorithm and show empirically that these algorithms achieve the information-theoretic limits in certain regimes.
    Off-Policy Reinforcement Learning with Delayed Rewards. (arXiv:2106.11854v1 [cs.LG])
    (2 min) We study deep reinforcement learning (RL) algorithms with delayed rewards. In many real-world tasks, instant rewards are often not readily accessible or even defined immediately after the agent performs actions. In this work, we first formally define the environment with delayed rewards and discuss the challenges raised due to the non-Markovian nature of such environments. Then, we introduce a general off-policy RL framework with a new Q-function formulation that can handle the delayed rewards with theoretical convergence guarantees. For practical tasks with high dimensional state spaces, we further introduce the HC-decomposition rule of the Q-function in our framework which naturally leads to an approximation scheme that helps boost the training efficiency and stability. We finally conduct extensive experiments to demonstrate the superior performance of our algorithms over the existing work and their variants.
    RootPainter3D: Interactive-machine-learning enables rapid and accurate contouring for radiotherapy. (arXiv:2106.11942v1 [cs.CV])
    (2 min) Organ-at-risk contouring is still a bottleneck in radiotherapy, with many deep learning methods falling short of promised results when evaluated on clinical data. We investigate the accuracy and time-savings resulting from the use of an interactive-machine-learning method for an organ-at-risk contouring task. We compare the method to the Eclipse contouring software and find strong agreement with manual delineations, with a dice score of 0.95. The annotations created using corrective-annotation also take less time to create as more images are annotated, resulting in substantial time savings compared to manual methods, with hearts that take 2 minutes and 2 seconds to delineate on average, after 923 images have been delineated, compared to 7 minutes and 1 seconds when delineating manually. Our experiment demonstrates that interactive-machine-learning with corrective-annotation provides a fast and accessible way for non computer-scientists to train deep-learning models to segment their own structures of interest as part of routine clinical workflows. Source code is available at \href{https://github.com/Abe404/RootPainter3D}{this HTTPS URL}.
    FLRA: A Reference Architecture for Federated Learning Systems. (arXiv:2106.11570v1 [cs.LG])
    (2 min) Federated learning is an emerging machine learning paradigm that enables multiple devices to train models locally and formulate a global model, without sharing the clients' local data. A federated learning system can be viewed as a large-scale distributed system, involving different components and stakeholders with diverse requirements and constraints. Hence, developing a federated learning system requires both software system design thinking and machine learning knowledge. Although much effort has been put into federated learning from the machine learning perspectives, our previous systematic literature review on the area shows that there is a distinct lack of considerations for software architecture design for federated learning. In this paper, we propose FLRA, a reference architecture for federated learning systems, which provides a template design for federated learning-based solutions. The proposed FLRA reference architecture is based on an extensive review of existing patterns of federated learning systems found in the literature and existing industrial implementation. The FLRA reference architecture consists of a pool of architectural patterns that could address the frequently recurring design problems in federated learning architectures. The FLRA reference architecture can serve as a design guideline to assist architects and developers with practical solutions for their problems, which can be further customised.
    Dangers of Bayesian Model Averaging under Covariate Shift. (arXiv:2106.11905v1 [cs.LG])
    (2 min) Approximate Bayesian inference for neural networks is considered a robust alternative to standard training, often providing good performance on out-of-distribution data. However, Bayesian neural networks (BNNs) with high-fidelity approximate inference via full-batch Hamiltonian Monte Carlo achieve poor generalization under covariate shift, even underperforming classical estimation. We explain this surprising result, showing how a Bayesian model average can in fact be problematic under covariate shift, particularly in cases where linear dependencies in the input features cause a lack of posterior contraction. We additionally show why the same issue does not affect many approximate inference procedures, or classical maximum a-posteriori (MAP) training. Finally, we propose novel priors that improve the robustness of BNNs to many sources of covariate shift.
    On Adversarial Robustness of Synthetic Code Generation. (arXiv:2106.11629v1 [cs.LG])
    (2 min) Automatic code synthesis from natural language descriptions is a challenging task. We witness massive progress in developing code generation systems for domain-specific languages (DSLs) employing sequence-to-sequence deep learning techniques in the recent past. In this paper, we specifically experiment with \textsc{AlgoLisp} DSL-based generative models and showcase the existence of significant dataset bias through different classes of adversarial examples. We also experiment with two variants of Transformer-based models that outperform all existing \textsc{AlgoLisp} DSL-based code generation baselines. Consistent with the current state-of-the-art systems, our proposed models, too, achieve poor performance under adversarial settings. Therefore, we propose several dataset augmentation techniques to reduce bias and showcase their efficacy using robust experimentation.
    Recent Deep Semi-supervised Learning Approaches and Related Works. (arXiv:2106.11528v1 [cs.LG])
    (2 min) The author of this work proposes an overview of the recent semi-supervised learning approaches and related works. Despite the remarkable success of neural networks in various applications, there exist few formidable constraints including the need for a large amount of labeled data. Therefore, semi-supervised learning, which is a learning scheme in which the scarce labels and a larger amount of unlabeled data are utilized to train models (e.g., deep neural networks) is getting more important. Based on the key assumptions of semi-supervised learning, which are the manifold assumption, cluster assumption, and continuity assumption, the work reviews the recent semi-supervised learning approaches. In particular, the methods in regard to using deep neural networks in a semi-supervised learning setting are primarily discussed. In addition, the existing works are first classified based on the underlying idea and explained, and then the holistic approaches that unify the aforementioned ideas are detailed.
    Machine Learning for Model Order Selection in MIMO OFDM Systems. (arXiv:2106.11633v1 [eess.SP])
    (2 min) A variety of wireless channel estimation methods, e.g., MUSIC and ESPRIT, rely on prior knowledge of the model order. Therefore, it is important to correctly estimate the number of multipath components (MPCs) which compose such channels. However, environments with many scatterers may generate MPCs which are closely spaced. This clustering of MPCs in addition to noise makes the model order selection task difficult in practice to currently known algorithms. In this paper, we exploit the multidimensional characteristics of MIMO orthogonal frequency division multiplexing (OFDM) systems and propose a machine learning (ML) method capable of determining the number of MPCs with a higher accuracy than state of the art methods in almost coherent scenarios. Moreover, our results show that our proposed ML method has an enhanced reliability.
    A Vertical Federated Learning Framework for Graph Convolutional Network. (arXiv:2106.11593v1 [cs.LG])
    (2 min) Recently, Graph Neural Network (GNN) has achieved remarkable success in various real-world problems on graph data. However in most industries, data exists in the form of isolated islands and the data privacy and security is also an important issue. In this paper, we propose FedVGCN, a federated GCN learning paradigm for privacy-preserving node classification task under data vertically partitioned setting, which can be generalized to existing GCN models. Specifically, we split the computation graph data into two parts. For each iteration of the training process, the two parties transfer intermediate results to each other under homomorphic encryption. We conduct experiments on benchmark data and the results demonstrate the effectiveness of FedVGCN in the case of GraphSage.
    Differentiable Architecture Search Without Training Nor Labels: A Pruning Perspective. (arXiv:2106.11542v1 [cs.LG])
    (2 min) With leveraging the weight-sharing and continuous relaxation to enable gradient-descent to alternately optimize the supernet weights and the architecture parameters through a bi-level optimization paradigm, \textit{Differentiable ARchiTecture Search} (DARTS) has become the mainstream method in Neural Architecture Search (NAS) due to its simplicity and efficiency. However, more recent works found that the performance of the searched architecture barely increases with the optimization proceeding in DARTS. In addition, several concurrent works show that the NAS could find more competitive architectures without labels. The above observations reveal that the supervision signal in DARTS may be a poor indicator for architecture optimization, inspiring a foundational question: instead of using the supervision signal to perform bi-level optimization, \textit{can we find high-quality architectures \textbf{without any training nor labels}}? We provide an affirmative answer by customizing the NAS as a network pruning at initialization problem. By leveraging recent techniques on the network pruning at initialization, we designed a FreeFlow proxy to score the importance of candidate operations in NAS without any training nor labels, and proposed a novel framework called \textit{training and label free neural architecture search} (\textbf{FreeNAS}) accordingly. We show that, without any training nor labels, FreeNAS with the proposed FreeFlow proxy can outperform most NAS baselines. More importantly, our framework is extremely efficient, which completes the architecture search within only \textbf{3.6s} and \textbf{79s} on a single GPU for the NAS-Bench-201 and DARTS search space, respectively. We hope our work inspires more attempts in solving NAS from the perspective of pruning at initialization.
    A Logical Neural Network Structure With More Direct Mapping From Logical Relations. (arXiv:2106.11463v1 [cs.NE])
    (2 min) Logical relations widely exist in human activities. Human use them for making judgement and decision according to various conditions, which are embodied in the form of \emph{if-then} rules. As an important kind of cognitive intelligence, it is prerequisite of representing and storing logical relations rightly into computer systems so as to make automatic judgement and decision, especially for high-risk domains like medical diagnosis. However, current numeric ANN (Artificial Neural Network) models are good at perceptual intelligence such as image recognition while they are not good at cognitive intelligence such as logical representation, blocking the further application of ANN. To solve it, researchers have tried to design logical ANN models to represent and store logical relations. Although there are some advances in this research area, recent works still have disadvantages because the structures of these logical ANN models still don't map more directly with logical relations which will cause the corresponding logical relations cannot be read out from their network structures. Therefore, in order to represent logical relations more clearly by the neural network structure and to read out logical relations from it, this paper proposes a novel logical ANN model by designing the new logical neurons and links in demand of logical representation. Compared with the recent works on logical ANN models, this logical ANN model has more clear corresponding with logical relations using the more direct mapping method herein, thus logical relations can be read out following the connection patterns of the network structure. Additionally, less neurons are used.
    Repulsive Deep Ensembles are Bayesian. (arXiv:2106.11642v1 [cs.LG])
    (2 min) Deep ensembles have recently gained popularity in the deep learning community for their conceptual simplicity and efficiency. However, maintaining functional diversity between ensemble members that are independently trained with gradient descent is challenging. This can lead to pathologies when adding more ensemble members, such as a saturation of the ensemble performance, which converges to the performance of a single model. Moreover, this does not only affect the quality of its predictions, but even more so the uncertainty estimates of the ensemble, and thus its performance on out-of-distribution data. We hypothesize that this limitation can be overcome by discouraging different ensemble members from collapsing to the same function. To this end, we introduce a kernelized repulsive term in the update rule of the deep ensembles. We show that this simple modification not only enforces and maintains diversity among the members but, even more importantly, transforms the maximum a posteriori inference into proper Bayesian inference. Namely, we show that the training dynamics of our proposed repulsive ensembles follow a Wasserstein gradient flow of the KL divergence with the true posterior. We study repulsive terms in weight and function space and empirically compare their performance to standard ensembles and Bayesian baselines on synthetic and real-world prediction tasks.
    Kernel Clustering with Sigmoid-based Regularization for Efficient Segmentation of Sequential Data. (arXiv:2106.11541v1 [cs.LG])
    (2 min) Kernel segmentation aims at partitioning a data sequence into several non-overlapping segments that may have nonlinear and complex structures. In general, it is formulated as a discrete optimization problem with combinatorial constraints. A popular algorithm for optimally solving this problem is dynamic programming (DP), which has quadratic computation and memory requirements. Given that sequences in practice are too long, this algorithm is not a practical approach. Although many heuristic algorithms have been proposed to approximate the optimal segmentation, they have no guarantee on the quality of their solutions. In this paper, we take a differentiable approach to alleviate the aforementioned issues. First, we introduce a novel sigmoid-based regularization to smoothly approximate the combinatorial constraints. Combining it with objective of the balanced kernel clustering, we formulate a differentiable model termed Kernel clustering with sigmoid-based regularization (KCSR), where the gradient-based algorithm can be exploited to obtain the optimal segmentation. Second, we develop a stochastic variant of the proposed model. By using the stochastic gradient descent algorithm, which has much lower time and space complexities, for optimization, the second model can perform segmentation on overlong data sequences. Finally, for simultaneously segmenting multiple data sequences, we slightly modify the sigmoid-based regularization to further introduce an extended variant of the proposed model. Through extensive experiments on various types of data sequences performances of our models are evaluated and compared with those of the existing methods. The experimental results validate advantages of the proposed models. Our Matlab source code is available on github.
    Lifted Model Checking for Relational MDPs. (arXiv:2106.11735v1 [cs.LG])
    (2 min) Model checking has been developed for verifying the behaviour of systems with stochastic and non-deterministic behavior. It is used to provide guarantees about such systems. While most model checking methods focus on propositional models, various probabilistic planning and reinforcement frameworks deal with relational domains, for instance, STRIPS planning and relational Markov Decision Processes. Using propositional model checking in relational settings requires one to ground the model, which leads to the well known state explosion problem and intractability. We present pCTL-REBEL, a lifted model checking approach for verifying pCTL properties on relational MDPs. It extends REBEL, the relational Bellman update operator, which is a lifted value iteration approach for model-based relational reinforcement learning, toward relational model-checking. PCTL-REBEL is lifted, which means that rather than grounding, the model exploits symmetries and reasons at an abstract relational level. Theoretically, we show that the pCTL model checking approach is decidable for relational MDPs even for possibly infinite domains provided that the states have a bounded size. Practically, we contribute algorithms and an implementation of lifted relational model checking, and we show that the lifted approach improves the scalability of the model checking approach.
    Reinforcement learning for PHY layer communications. (arXiv:2106.11595v1 [cs.AI])
    (2 min) In this chapter, we will give comprehensive examples of applying RL in optimizing the physical layer of wireless communications by defining different class of problems and the possible solutions to handle them. In Section 9.2, we present all the basic theory needed to address a RL problem, i.e. Markov decision process (MDP), Partially observable Markov decision process (POMDP), but also two very important and widely used algorithms for RL, i.e. the Q-learning and SARSA algorithms. We also introduce the deep reinforcement learning (DRL) paradigm and the section ends with an introduction to the multi-armed bandits (MAB) framework. Section 9.3 focuses on some toy examples to illustrate how the basic concepts of RL are employed in communication systems. We present applications extracted from literature with simplified system models using similar notation as in Section 9.2 of this Chapter. In Section 9.3, we also focus on modeling RL problems, i.e. how action and state spaces and rewards are chosen. The Chapter is concluded in Section 9.4 with a prospective thought on RL trends and it ends with a review of a broader state of the art in Section 9.5.
    Information Retrieval for ZeroSpeech 2021: The Submission by University of Wroclaw. (arXiv:2106.11603v1 [cs.LG])
    (2 min) We present a number of low-resource approaches to the tasks of the Zero Resource Speech Challenge 2021. We build on the unsupervised representations of speech proposed by the organizers as a baseline, derived from CPC and clustered with the k-means algorithm. We demonstrate that simple methods of refining those representations can narrow the gap, or even improve upon the solutions which use a high computational budget. The results lead to the conclusion that the CPC-derived representations are still too noisy for training language models, but stable enough for simpler forms of pattern matching and retrieval.
    Particle Cloud Generation with Message Passing Generative Adversarial Networks. (arXiv:2106.11535v1 [cs.LG])
    (2 min) In high energy physics (HEP), jets are collections of correlated particles produced ubiquitously in particle collisions such as those at the CERN Large Hadron Collider (LHC). Machine-learning-based generative models, such as generative adversarial networks (GANs), have the potential to significantly accelerate LHC jet simulations. However, despite jets having a natural representation as a set of particles in momentum-space, a.k.a. a particle cloud, to our knowledge there exist no generative models applied to such a dataset. We introduce a new particle cloud dataset (JetNet), and, due to similarities between particle and point clouds, apply to it existing point cloud GANs. Results are evaluated using (1) the 1-Wasserstein distance between high- and low-level feature distributions, (2) a newly developed Fr\'{e}chet ParticleNet Distance, and (3) the coverage and (4) minimum matching distance metrics. Existing GANs are found to be inadequate for physics applications, hence we develop a new message passing GAN (MPGAN), which outperforms existing point cloud GANs on virtually every metric and shows promise for use in HEP. We propose JetNet as a novel point-cloud-style dataset for the machine learning community to experiment with, and set MPGAN as a benchmark to improve upon for future generative models.
    Finding Valid Adjustments under Non-ignorability with Minimal DAG Knowledge. (arXiv:2106.11560v1 [cs.LG])
    (2 min) Treatment effect estimation from observational data is a fundamental problem in causal inference. There are two very different schools of thought that have tackled this problem. On the one hand, the Pearlian framework commonly assumes structural knowledge (provided by an expert) in the form of Directed Acyclic Graphs (DAGs) and provides graphical criteria such as the back-door criterion to identify the valid adjustment sets. On the other hand, the potential outcomes (PO) framework commonly assumes that all the observed features satisfy ignorability (i.e., no hidden confounding), which in general is untestable. In this work, we take steps to bridge these two frameworks. We show that even if we know only one parent of the treatment variable (provided by an expert), then quite remarkably it suffices to test a broad class of (but not all) back-door criteria. Importantly, we also cover the non-trivial case where the entire set of observed features is not ignorable (generalizing the PO framework) without requiring all the parents of the treatment variable to be observed. Our key technical idea involves a more general result -- Given a synthetic sub-sampling (or environment) variable that is a function of the parent variable, we show that an invariance test involving this sub-sampling variable is equivalent to testing a broad class of back-door criteria. We demonstrate our approach on synthetic data as well as real causal effect estimation benchmarks.
    Feedback Shaping: A Modeling Approach to Nurture Content Creation. (arXiv:2106.11312v1 [cs.CY])
    (2 min) Social media platforms bring together content creators and content consumers through recommender systems like newsfeed. The focus of such recommender systems has thus far been primarily on modeling the content consumer preferences and optimizing for their experience. However, it is equally critical to nurture content creation by prioritizing the creators' interests, as quality content forms the seed for sustainable engagement and conversations, bringing in new consumers while retaining existing ones. In this work, we propose a modeling approach to predict how feedback from content consumers incentivizes creators. We then leverage this model to optimize the newsfeed experience for content creators by reshaping the feedback distribution, leading to a more active content ecosystem. Practically, we discuss how we balance the user experience for both consumers and creators, and how we carry out online A/B tests with strong network effects. We present a deployed use case on the LinkedIn newsfeed, where we used this approach to improve content creation significantly without compromising the consumers' experience.
    MODETR: Moving Object Detection with Transformers. (arXiv:2106.11422v1 [cs.CV])
    (2 min) Moving Object Detection (MOD) is a crucial task for the Autonomous Driving pipeline. MOD is usually handled via 2-stream convolutional architectures that incorporates both appearance and motion cues, without considering the inter-relations between the spatial or motion features. In this paper, we tackle this problem through multi-head attention mechanisms, both across the spatial and motion streams. We propose MODETR; a Moving Object DEtection TRansformer network, comprised of multi-stream transformer encoders for both spatial and motion modalities, and an object transformer decoder that produces the moving objects bounding boxes using set predictions. The whole architecture is trained end-to-end using bi-partite loss. Several methods of incorporating motion cues with the Transformer model are explored, including two-stream RGB and Optical Flow (OF) methods, and multi-stream architectures that take advantage of sequence information. To incorporate the temporal information, we propose a new Temporal Positional Encoding (TPE) approach to extend the Spatial Positional Encoding(SPE) in DETR. We explore two architectural choices for that, balancing between speed and time. To evaluate the our network, we perform the MOD task on the KITTI MOD [6] data set. Results show significant 5% mAP of the Transformer network for MOD over the state-of-the art methods. Moreover, the proposed TPE encoding provides 10% mAP improvement over the SPE baseline.
    Physics-constrained deep neural network method for estimating parameters in a redox flow battery. (arXiv:2106.11451v1 [physics.chem-ph])
    (2 min) In this paper, we present a physics-constrained deep neural network (PCDNN) method for parameter estimation in the zero-dimensional (0D) model of the vanadium redox flow battery (VRFB). In this approach, we use deep neural networks (DNNs) to approximate the model parameters as functions of the operating conditions. This method allows the integration of the VRFB computational models as the physical constraints in the parameter learning process, leading to enhanced accuracy of parameter estimation and cell voltage prediction. Using an experimental dataset, we demonstrate that the PCDNN method can estimate model parameters for a range of operating conditions and improve the 0D model prediction of voltage compared to the 0D model prediction with constant operation-condition-independent parameters estimated with traditional inverse methods. We also demonstrate that the PCDNN approach has an improved generalization ability for estimating parameter values for operating conditions not used in the DNN training.
    An Accurate Non-accelerometer-based PPG Motion Artifact Removal Technique using CycleGAN. (arXiv:2106.11512v1 [cs.LG])
    (2 min) A photoplethysmography (PPG) is an uncomplicated and inexpensive optical technique widely used in the healthcare domain to extract valuable health-related information, e.g., heart rate variability, blood pressure, and respiration rate. PPG signals can easily be collected continuously and remotely using portable wearable devices. However, these measuring devices are vulnerable to motion artifacts caused by daily life activities. The most common ways to eliminate motion artifacts use extra accelerometer sensors, which suffer from two limitations: i) high power consumption and ii) the need to integrate an accelerometer sensor in a wearable device (which is not required in certain wearables). This paper proposes a low-power non-accelerometer-based PPG motion artifacts removal method outperforming the accuracy of the existing methods. We use Cycle Generative Adversarial Network to reconstruct clean PPG signals from noisy PPG signals. Our novel machine-learning-based technique achieves 9.5 times improvement in motion artifact removal compared to the state-of-the-art without using extra sensors such as an accelerometer.
    Policy Smoothing for Provably Robust Reinforcement Learning. (arXiv:2106.11420v1 [cs.LG])
    (2 min) The study of provable adversarial robustness for deep neural network (DNN) models has mainly focused on static supervised learning tasks such as image classification. However, DNNs have been used extensively in real-world adaptive tasks such as reinforcement learning (RL), making RL systems vulnerable to adversarial attacks. The key challenge in adversarial RL is that the attacker can adapt itself to the defense strategy used by the agent in previous time-steps to strengthen its attack in future steps. In this work, we study the provable robustness of RL against norm-bounded adversarial perturbations of the inputs. We focus on smoothing-based provable defenses and propose policy smoothing where the agent adds a Gaussian noise to its observation at each time-step before applying the policy network to make itself less sensitive to adversarial perturbations of its inputs. Our main theoretical contribution is to prove an adaptive version of the Neyman-Pearson Lemma where the adversarial perturbation at a particular time can be a stochastic function of current and previous observations and states as well as previously observed actions. Using this lemma, we adapt the robustness certificates produced by randomized smoothing in the static setting of image classification to the dynamic setting of RL. We generate certificates that guarantee that the total reward obtained by the smoothed policy will not fall below a certain threshold under a norm-bounded adversarial perturbation of the input. We show that our certificates are tight by constructing a worst-case setting that achieves the bounds derived in our analysis. In our experiments, we show that this method can yield meaningful certificates in complex environments demonstrating its effectiveness against adversarial attacks.
    Learn Like The Pro: Norms from Theory to Size Neural Computation. (arXiv:2106.11409v1 [cs.LG])
    (2 min) The optimal design of neural networks is a critical problem in many applications. Here, we investigate how dynamical systems with polynomial nonlinearities can inform the design of neural systems that seek to emulate them. We propose a Learnability metric and its associated features to quantify the near-equilibrium behavior of learning dynamics. Equating the Learnability of neural systems with equivalent parameter estimation metric of the reference system establishes bounds on network structure. In this way, norms from theory provide a good first guess for neural structure, which may then further adapt with data. The proposed approach neither requires training nor training data. It reveals exact sizing for a class of neural networks with multiplicative nodes that mimic continuous- or discrete-time polynomial dynamics. It also provides relatively tight lower size bounds for classical feed-forward networks that is consistent with simulated assessments.
    Incremental Deep Neural Network Learning using Classification Confidence Thresholding. (arXiv:2106.11437v1 [cs.LG])
    (2 min) Most modern neural networks for classification fail to take into account the concept of the unknown. Trained neural networks are usually tested in an unrealistic scenario with only examples from a closed set of known classes. In an attempt to develop a more realistic model, the concept of working in an open set environment has been introduced. This in turn leads to the concept of incremental learning where a model with its own architecture and initial trained set of data can identify unknown classes during the testing phase and autonomously update itself if evidence of a new class is detected. Some problems that arise in incremental learning are inefficient use of resources to retrain the classifier repeatedly and the decrease of classification accuracy as multiple classes are added over time. This process of instantiating new classes is repeated as many times as necessary, accruing errors. To address these problems, this paper proposes the Classification Confidence Threshold approach to prime neural networks for incremental learning to keep accuracies high by limiting forgetting. A lean method is also used to reduce resources used in the retraining of the neural network. The proposed method is based on the idea that a network is able to incrementally learn a new class even when exposed to a limited number samples associated with the new class. This method can be applied to most existing neural networks with minimal changes to network architecture.
    Interpretable Model-based Hierarchical Reinforcement Learning using Inductive Logic Programming. (arXiv:2106.11417v1 [cs.LG])
    (2 min) Recently deep reinforcement learning has achieved tremendous success in wide ranges of applications. However, it notoriously lacks data-efficiency and interpretability. Data-efficiency is important as interacting with the environment is expensive. Further, interpretability can increase the transparency of the black-box-style deep RL models and hence gain trust from the users. In this work, we propose a new hierarchical framework via symbolic RL, leveraging a symbolic transition model to improve the data-efficiency and introduce the interpretability for learned policy. This framework consists of a high-level agent, a subtask solver and a symbolic transition model. Without assuming any prior knowledge on the state transition, we adopt inductive logic programming (ILP) to learn the rules of symbolic state transitions, introducing interpretability and making the learned behavior understandable to users. In empirical experiments, we confirmed that the proposed framework offers approximately between 30\% to 40\% more data efficiency over previous methods.
    How well do you know your summarization datasets?. (arXiv:2106.11388v1 [cs.CL])
    (2 min) State-of-the-art summarization systems are trained and evaluated on massive datasets scraped from the web. Despite their prevalence, we know very little about the underlying characteristics (data noise, summarization complexity, etc.) of these datasets, and how these affect system performance and the reliability of automatic metrics like ROUGE. In this study, we manually analyze 600 samples from three popular summarization datasets. Our study is driven by a six-class typology which captures different noise types (missing facts, entities) and degrees of summarization difficulty (extractive, abstractive). We follow with a thorough analysis of 27 state-of-the-art summarization models and 5 popular metrics, and report our key insights: (1) Datasets have distinct data quality and complexity distributions, which can be traced back to their collection process. (2) The performance of models and reliability of metrics is dependent on sample complexity. (3) Faithful summaries often receive low scores because of the poor diversity of references. We release the code, annotated data and model outputs.
    Graph Routing between Capsules. (arXiv:2106.11531v1 [cs.LG])
    (2 min) Routing methods in capsule networks often learn a hierarchical relationship for capsules in successive layers, but the intra-relation between capsules in the same layer is less studied, while this intra-relation is a key factor for the semantic understanding in text data. Therefore, in this paper, we introduce a new capsule network with graph routing to learn both relationships, where capsules in each layer are treated as the nodes of a graph. We investigate strategies to yield adjacency and degree matrix with three different distances from a layer of capsules, and propose the graph routing mechanism between those capsules. We validate our approach on five text classification datasets, and our findings suggest that the approach combining bottom-up routing and top-down attention performs the best. Such an approach demonstrates generalization capability across datasets. Compared to the state-of-the-art routing methods, the improvements in accuracy in the five datasets we used were 0.82, 0.39, 0.07, 1.01, and 0.02, respectively.
    Adaptive Learning Rate and Momentum for Training Deep Neural Networks. (arXiv:2106.11548v1 [cs.LG])
    (2 min) Recent progress on deep learning relies heavily on the quality and efficiency of training algorithms. In this paper, we develop a fast training method motivated by the nonlinear Conjugate Gradient (CG) framework. We propose the Conjugate Gradient with Quadratic line-search (CGQ) method. On the one hand, a quadratic line-search determines the step size according to current loss landscape. On the other hand, the momentum factor is dynamically updated in computing the conjugate gradient parameter (like Polak-Ribiere). Theoretical results to ensure the convergence of our method in strong convex settings is developed. And experiments in image classification datasets show that our method yields faster convergence than other local solvers and has better generalization capability (test set accuracy). One major advantage of the paper method is that tedious hand tuning of hyperparameters like the learning rate and momentum is avoided.
    Local convexity of the TAP free energy and AMP convergence for Z2-synchronization. (arXiv:2106.11428v1 [math.ST])
    (2 min) We study mean-field variational Bayesian inference using the TAP approach, for Z2-synchronization as a prototypical example of a high-dimensional Bayesian model. We show that for any signal strength $\lambda > 1$ (the weak-recovery threshold), there exists a unique local minimizer of the TAP free energy functional near the mean of the Bayes posterior law. Furthermore, the TAP free energy in a local neighborhood of this minimizer is strongly convex. Consequently, a natural-gradient/mirror-descent algorithm achieves linear convergence to this minimizer from a local initialization, which may be obtained by a finite number of iterates of Approximate Message Passing (AMP). This provides a rigorous foundation for variational inference in high dimensions via minimization of the TAP free energy. We also analyze the finite-sample convergence of AMP, showing that AMP is asymptotically stable at the TAP minimizer for any $\lambda > 1$, and is linearly convergent to this minimizer from a spectral initialization for sufficiently large $\lambda$. Such a guarantee is stronger than results obtainable by state evolution analyses, which only describe a fixed number of AMP iterations in the infinite-sample limit. Our proofs combine the Kac-Rice formula and Sudakov-Fernique Gaussian comparison inequality to analyze the complexity of critical points that satisfy strong convexity and stability conditions within their local neighborhoods.
    Hardness of Samples Is All You Need: Protecting Deep Learning Models Using Hardness of Samples. (arXiv:2106.11424v1 [cs.LG])
    (2 min) Several recent studies have shown that Deep Neural Network (DNN)-based classifiers are vulnerable against model extraction attacks. In model extraction attacks, an adversary exploits the target classifier to create a surrogate classifier imitating the target classifier with respect to some criteria. In this paper, we investigate the hardness degree of samples and demonstrate that the hardness degree histogram of model extraction attacks samples is distinguishable from the hardness degree histogram of normal samples. Normal samples come from the target classifier's training data distribution. As the training process of DNN-based classifiers is done in several epochs, we can consider this process as a sequence of subclassifiers so that each subclassifier is created at the end of an epoch. We use the sequence of subclassifiers to calculate the hardness degree of samples. We investigate the relation between hardness degree of samples and the trust in the classifier outputs. We propose Hardness-Oriented Detection Approach (HODA) to detect the sample sequences of model extraction attacks. The results demonstrate that HODA can detect the sample sequences of model extraction attacks with a high success rate by only watching 100 attack samples. We also investigate the hardness degree of adversarial examples and indicate that the hardness degree histogram of adversarial examples is distinct from the hardness degree histogram of normal samples.
    SeqNetVLAD vs PointNetVLAD: Image Sequence vs 3D Point Clouds for Day-Night Place Recognition. (arXiv:2106.11481v1 [cs.CV])
    (2 min) Place Recognition is a crucial capability for mobile robot localization and navigation. Image-based or Visual Place Recognition (VPR) is a challenging problem as scene appearance and camera viewpoint can change significantly when places are revisited. Recent VPR methods based on ``sequential representations'' have shown promising results as compared to traditional sequence score aggregation or single image based techniques. In parallel to these endeavors, 3D point clouds based place recognition is also being explored following the advances in deep learning based point cloud processing. However, a key question remains: is an explicit 3D structure based place representation always superior to an implicit ``spatial'' representation based on sequence of RGB images which can inherently learn scene structure. In this extended abstract, we attempt to compare these two types of methods by considering a similar ``metric span'' to represent places. We compare a 3D point cloud based method (PointNetVLAD) with image sequence based methods (SeqNet and others) and showcase that image sequence based techniques approach, and can even surpass, the performance achieved by point cloud based methods for a given metric span. These performance variations can be attributed to differences in data richness of input sensors as well as data accumulation strategies for a mobile robot. While a perfect apple-to-apple comparison may not be feasible for these two different modalities, the presented comparison takes a step in the direction of answering deeper questions regarding spatial representations, relevant to several applications like Autonomous Driving and Augmented/Virtual Reality. Source code available publicly https://github.com/oravus/seqNet.
    Understanding top-down attention using task-oriented ablation design. (arXiv:2106.11339v1 [cs.CV])
    (2 min) Top-down attention allows neural networks, both artificial and biological, to focus on the information most relevant for a given task. This is known to enhance performance in visual perception. But it remains unclear how attention brings about its perceptual boost, especially when it comes to naturalistic settings like recognising an object in an everyday scene. What aspects of a visual task does attention help to deal with? We aim to answer this with a computational experiment based on a general framework called task-oriented ablation design. First we define a broad range of visual tasks and identify six factors that underlie task variability. Then on each task we compare the performance of two neural networks, one with top-down attention and one without. These comparisons reveal the task-dependence of attention's perceptual boost, giving a clearer idea of the role attention plays. Whereas many existing cognitive accounts link attention to stimulus-level variables, such as visual clutter and object scale, we find greater explanatory power in system-level variables that capture the interaction between the model, the distribution of training data and the task format. This finding suggests a shift in how attention is studied could be fruitful. We make publicly available our code and results, along with statistics relevant to ImageNet-based experiments beyond this one. Our contribution serves to support the development of more human-like vision models and the design of more informative machine-learning experiments.
    Hi-BEHRT: Hierarchical Transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. (arXiv:2106.11360v1 [cs.LG])
    (2 min) Electronic health records represent a holistic overview of patients' trajectories. Their increasing availability has fueled new hopes to leverage them and develop accurate risk prediction models for a wide range of diseases. Given the complex interrelationships of medical records and patient outcomes, deep learning models have shown clear merits in achieving this goal. However, a key limitation of these models remains their capacity in processing long sequences. Capturing the whole history of medical encounters is expected to lead to more accurate predictions, but the inclusion of records collected for decades and from multiple resources can inevitably exceed the receptive field of the existing deep learning architectures. This can result in missing crucial, long-term dependencies. To address this gap, we present Hi-BEHRT, a hierarchical Transformer-based model that can significantly expand the receptive field of Transformers and extract associations from much longer sequences. Using a multimodal large-scale linked longitudinal electronic health records, the Hi-BEHRT exceeds the state-of-the-art BEHRT 1% to 5% for area under the receiver operating characteristic (AUROC) curve and 3% to 6% for area under the precision recall (AUPRC) curve on average, and 3% to 6% (AUROC) and 3% to 11% (AUPRC) for patients with long medical history for 5-year heart failure, diabetes, chronic kidney disease, and stroke risk prediction. Additionally, because pretraining for hierarchical Transformer is not well-established, we provide an effective end-to-end contrastive pre-training strategy for Hi-BEHRT using EHR, improving its transferability on predicting clinical events with relatively small training dataset.
    Cogment: Open Source Framework For Distributed Multi-actor Training, Deployment & Operations. (arXiv:2106.11345v1 [cs.AI])
    (2 min) Involving humans directly for the benefit of AI agents' training is getting traction thanks to several advances in reinforcement learning and human-in-the-loop learning. Humans can provide rewards to the agent, demonstrate tasks, design a curriculum, or act in the environment, but these benefits also come with architectural, functional design and engineering complexities. We present Cogment, a unifying open-source framework that introduces an actor formalism to support a variety of humans-agents collaboration typologies and training approaches. It is also scalable out of the box thanks to a distributed micro service architecture, and offers solutions to the aforementioned complexities.
    Sequential Late Fusion Technique for Multi-modal Sentiment Analysis. (arXiv:2106.11473v1 [cs.LG])
    (2 min) Multi-modal sentiment analysis plays an important role for providing better interactive experiences to users. Each modality in multi-modal data can provide different viewpoints or reveal unique aspects of a user's emotional state. In this work, we use text, audio and visual modalities from MOSI dataset and we propose a novel fusion technique using a multi-head attention LSTM network. Finally, we perform a classification task and evaluate its performance.
    ConvDySAT: Deep Neural Representation Learning on Dynamic Graphs via Self-Attention and Convolutional Neural Networks. (arXiv:2106.11430v1 [cs.LG])
    (2 min) Learning node representations on temporal graphs is a fundamental step to learn real-word dynamic graphs efficiently. Real-world graphs have the nature of continuously evolving over time, such as changing edges weights, removing and adding nodes and appearing and disappearing of edges, while previous graph representation learning methods focused generally on static graphs. We present ConvDySAT as an enhancement of DySAT, one of the state-of-the-art dynamic methods, by augmenting convolution neural networks with the self-attention mechanism, the employed method in DySAT to express the structural and temporal evolution. We conducted single-step link prediction on a communication network and rating network, Experimental results show significant performance gains for ConvDySAT over various state-of-the-art methods.
    Encoder-Decoder Architectures for Clinically Relevant Coronary Artery Segmentation. (arXiv:2106.11447v1 [eess.IV])
    (2 min) Coronary X-ray angiography is a crucial clinical procedure for the diagnosis and treatment of coronary artery disease, which accounts for roughly 16% of global deaths every year. However, the images acquired in these procedures have low resolution and poor contrast, making lesion detection and assessment challenging. Accurate coronary artery segmentation not only helps mitigate these problems, but also allows the extraction of relevant anatomical features for further analysis by quantitative methods. Although automated segmentation of coronary arteries has been proposed before, previous approaches have used non-optimal segmentation criteria, leading to less useful results. Most methods either segment only the major vessel, discarding important information from the remaining ones, or segment the whole coronary tree based mostly on contrast information, producing a noisy output that includes vessels that are not relevant for diagnosis. We adopt a better-suited clinical criterion and segment vessels according to their clinical relevance. Additionally, we simultaneously perform catheter segmentation, which may be useful for diagnosis due to the scale factor provided by the catheter's known diameter, and is a task that has not yet been performed with good results. To derive the optimal approach, we conducted an extensive comparative study of encoder-decoder architectures trained on a combination of focal loss and a variant of generalized dice loss. Based on the EfficientNet and the UNet++ architectures, we propose a line of efficient and high-performance segmentation models using a new decoder architecture, the EfficientUNet++, whose best-performing version achieved average dice scores of 0.8904 and 0.7526 for the artery and catheter classes, respectively, and an average generalized dice score of 0.9234.
    Membership Inference on Word Embedding and Beyond. (arXiv:2106.11384v1 [cs.CL])
    (2 min) In the text processing context, most ML models are built on word embeddings. These embeddings are themselves trained on some datasets, potentially containing sensitive data. In some cases this training is done independently, in other cases, it occurs as part of training a larger, task-specific model. In either case, it is of interest to consider membership inference attacks based on the embedding layer as a way of understanding sensitive information leakage. But, somewhat surprisingly, membership inference attacks on word embeddings and their effect in other natural language processing (NLP) tasks that use these embeddings, have remained relatively unexplored. In this work, we show that word embeddings are vulnerable to black-box membership inference attacks under realistic assumptions. Furthermore, we show that this leakage persists through two other major NLP applications: classification and text-generation, even when the embedding layer is not exposed to the attacker. We show that our MI attack achieves high attack accuracy against a classifier model and an LSTM-based language model. Indeed, our attack is a cheaper membership inference attack on text-generative models, which does not require the knowledge of the target model or any expensive training of text-generative models as shadow models.
    BiAdam: Fast Adaptive Bilevel Optimization Methods. (arXiv:2106.11396v1 [math.OC])
    (2 min) Bilevel optimization recently has attracted increased interest in machine learning due to its many applications such as hyper-parameter optimization and policy optimization. Although some methods recently have been proposed to solve the bilevel problems, these methods do not consider using adaptive learning rates. To fill this gap, in the paper, we propose a class of fast and effective adaptive methods for solving bilevel optimization problems that the outer problem is possibly nonconvex and the inner problem is strongly-convex. Specifically, we propose a fast single-loop BiAdam algorithm based on the basic momentum technique, which achieves a sample complexity of $\tilde{O}(\epsilon^{-4})$ for finding an $\epsilon$-stationary point. At the same time, we propose an accelerated version of BiAdam algorithm (VR-BiAdam) by using variance reduced technique, which reaches the best known sample complexity of $\tilde{O}(\epsilon^{-3})$. To further reduce computation in estimating derivatives, we propose a fast single-loop stochastic approximated BiAdam algorithm (saBiAdam) by avoiding the Hessian inverse, which still achieves a sample complexity of $\tilde{O}(\epsilon^{-4})$ without large batches. We further present an accelerated version of saBiAdam algorithm (VR-saBiAdam), which also reaches the best known sample complexity of $\tilde{O}(\epsilon^{-3})$. We apply the unified adaptive matrices to our methods as the SUPER-ADAM \citep{huang2021super}, which including many types of adaptive learning rates. Moreover, our framework can flexibly use the momentum and variance reduced techniques. In particular, we provide a useful convergence analysis framework for both the constrained and unconstrained bilevel optimization. To the best of our knowledge, we first study the adaptive bilevel optimization methods with adaptive learning rates.
    Instance-Optimal Compressed Sensing via Posterior Sampling. (arXiv:2106.11438v1 [cs.LG])
    (2 min) We characterize the measurement complexity of compressed sensing of signals drawn from a known prior distribution, even when the support of the prior is the entire space (rather than, say, sparse vectors). We show for Gaussian measurements and \emph{any} prior distribution on the signal, that the posterior sampling estimator achieves near-optimal recovery guarantees. Moreover, this result is robust to model mismatch, as long as the distribution estimate (e.g., from an invertible generative model) is close to the true distribution in Wasserstein distance. We implement the posterior sampling estimator for deep generative priors using Langevin dynamics, and empirically find that it produces accurate estimates with more diversity than MAP.
    Dive into Deep Learning. (arXiv:2106.11342v1 [cs.LG])
    (2 min) This open-source book represents our attempt to make deep learning approachable, teaching readers the concepts, the context, and the code. The entire book is drafted in Jupyter notebooks, seamlessly integrating exposition figures, math, and interactive examples with self-contained code. Our goal is to offer a resource that could (i) be freely available for everyone; (ii) offer sufficient technical depth to provide a starting point on the path to actually becoming an applied machine learning scientist; (iii) include runnable code, showing readers how to solve problems in practice; (iv) allow for rapid updates, both by us and also by the community at large; (v) be complemented by a forum for interactive discussion of technical details and to answer questions.
    f-Domain-Adversarial Learning: Theory and Algorithms. (arXiv:2106.11344v1 [cs.LG])
    (2 min) Unsupervised domain adaptation is used in many machine learning applications where, during training, a model has access to unlabeled data in the target domain, and a related labeled dataset. In this paper, we introduce a novel and general domain-adversarial framework. Specifically, we derive a novel generalization bound for domain adaptation that exploits a new measure of discrepancy between distributions based on a variational characterization of f-divergences. It recovers the theoretical results from Ben-David et al. (2010a) as a special case and supports divergences used in practice. Based on this bound, we derive a new algorithmic framework that introduces a key correction in the original adversarial training method of Ganin et al. (2016). We show that many regularizers and ad-hoc objectives introduced over the last years in this framework are then not required to achieve performance comparable to (if not better than) state-of-the-art domain-adversarial methods. Experimental analysis conducted on real-world natural language and computer vision datasets show that our framework outperforms existing baselines, and obtains the best results for f-divergences that were not considered previously in domain-adversarial learning.
    Tensor Learning-based Precoder Codebooks for FD-MIMO Systems. (arXiv:2106.11374v1 [eess.SP])
    (2 min) This paper develops an efficient procedure for designing low-complexity codebooks for precoding in a full-dimension (FD) multiple-input multiple-output (MIMO) system with a uniform planar array (UPA) antenna at the transmitter (Tx) using tensor learning. In particular, instead of using statistical channel models, we utilize a model-free data-driven approach with foundations in machine learning to generate codebooks that adapt to the surrounding propagation conditions. We use a tensor representation of the FD-MIMO channel and exploit its properties to design quantized version of the channel precoders. We find the best representation of the optimal precoder as a function of Kronecker Product (KP) of two low-dimensional precoders, respectively corresponding to the horizontal and vertical dimensions of the UPA, obtained from the tensor decomposition of the channel. We then quantize this precoder to design product codebooks such that an average loss in mutual information due to quantization of channel state information (CSI) is minimized. The key technical contribution lies in exploiting the constraints on the precoders to reduce the product codebook design problem to an unsupervised clustering problem on a Cartesian Product Grassmann manifold (CPM), where the cluster centroids form a finite-sized precoder codebook. This codebook can be found efficiently by running a $K$-means clustering on the CPM. With a suitable induced distance metric on the CPM, we show that the construction of product codebooks is equivalent to finding the optimal set of centroids on the factor manifolds corresponding to the horizontal and vertical dimensions. Simulation results are presented to demonstrate the capability of the proposed design criterion in learning the codebooks and the attractive performance of the designed codebooks.
    Efficient Inference via Universal LSH Kernel. (arXiv:2106.11426v1 [cs.LG])
    (2 min) Large machine learning models achieve unprecedented performance on various tasks and have evolved as the go-to technique. However, deploying these compute and memory hungry models on resource constraint environments poses new challenges. In this work, we propose mathematically provable Representer Sketch, a concise set of count arrays that can approximate the inference procedure with simple hashing computations and aggregations. Representer Sketch builds upon the popular Representer Theorem from kernel literature, hence the name, providing a generic fundamental alternative to the problem of efficient inference that goes beyond the popular approach such as quantization, iterative pruning and knowledge distillation. A neural network function is transformed to its weighted kernel density representation, which can be very efficiently estimated with our sketching algorithm. Empirically, we show that Representer Sketch achieves up to 114x reduction in storage requirement and 59x reduction in computation complexity without any drop in accuracy.
    Photozilla: A Large-Scale Photography Dataset and Visual Embedding for 20 Photography Styles. (arXiv:2106.11359v1 [cs.CV])
    (2 min) The advent of social media platforms has been a catalyst for the development of digital photography that engendered a boom in vision applications. With this motivation, we introduce a large-scale dataset termed 'Photozilla', which includes over 990k images belonging to 10 different photographic styles. The dataset is then used to train 3 classification models to automatically classify the images into the relevant style which resulted in an accuracy of ~96%. With the rapid evolution of digital photography, we have seen new types of photography styles emerging at an exponential rate. On that account, we present a novel Siamese-based network that uses the trained classification models as the base architecture to adapt and classify unseen styles with only 25 training samples. We report an accuracy of over 68% for identifying 10 other distinct types of photography styles. This dataset can be found at https://trisha025.github.io/Photozilla/
    Spatio-Temporal Multi-Task Learning Transformer for Joint Moving Object Detection and Segmentation. (arXiv:2106.11401v1 [cs.CV])
    (2 min) Moving objects have special importance for Autonomous Driving tasks. Detecting moving objects can be posed as Moving Object Segmentation, by segmenting the object pixels, or Moving Object Detection, by generating a bounding box for the moving targets. In this paper, we present a Multi-Task Learning architecture, based on Transformers, to jointly perform both tasks through one network. Due to the importance of the motion features to the task, the whole setup is based on a Spatio-Temporal aggregation. We evaluate the performance of the individual tasks architecture versus the MTL setup, both with early shared encoders, and late shared encoder-decoder transformers. For the latter, we present a novel joint tasks query decoder transformer, that enables us to have tasks dedicated heads out of the shared model. To evaluate our approach, we use the KITTI MOD [29] data set. Results show1.5% mAP improvement for Moving Object Detection, and 2%IoU improvement for Moving Object Segmentation, over the individual tasks networks.

2021-06-22

  • cs.CL updates on arXiv.org

    Seeing is Knowing! Fact-based Visual Question Answering using Knowledge Graph Embeddings. (arXiv:2012.15484v2 [cs.CL] UPDATED)
    (2 min) Fact-based Visual Question Answering (FVQA), a challenging variant of VQA, requires a QA-system to include facts from a diverse knowledge graph (KG) in its reasoning process to produce an answer. Large KGs, especially common-sense KGs, are known to be incomplete, i.e., not all non-existent facts are always incorrect. Therefore, being able to reason over incomplete KGs for QA is a critical requirement in real-world applications that has not been addressed extensively in the literature. We develop a novel QA architecture that allows us to reason over incomplete KGs, something current FVQA state-of-the-art (SOTA) approaches lack due to their critical reliance on fact retrieval. We use KG Embeddings, a technique widely used for KG completion, for the downstream task of FVQA. We also employ a new image representation technique we call 'Image-as-Knowledge' to enable this capability, alongside a simple one-step CoAttention mechanism to attend to text and image during QA. Our FVQA architecture is faster during inference time, being O(m), as opposed to existing FVQA SOTA methods which are O(N log N), where m = number of vertices, N = number of edges = O(m^2). KG embeddings are shown to hold complementary information to word embeddings: a combination of both metrics permits performance comparable to SOTA methods in the standard answer retrieval task, and significantly better (26% absolute) in the proposed missing-edge reasoning task.
    Out of Context: A New Clue for Context Modeling of Aspect-based Sentiment Analysis. (arXiv:2106.10816v1 [cs.CL])
    (2 min) Aspect-based sentiment analysis (ABSA) aims to predict the sentiment expressed in a review with respect to a given aspect. The core of ABSA is to model the interaction between the context and given aspect to extract the aspect-related information. In prior work, attention mechanisms and dependency graph networks are commonly adopted to capture the relations between the context and given aspect. And the weighted sum of context hidden states is used as the final representation fed to the classifier. However, the information related to the given aspect may be already discarded and adverse information may be retained in the context modeling processes of existing models. This problem cannot be solved by subsequent modules and there are two reasons: first, their operations are conducted on the encoder-generated context hidden states, whose value cannot change after the encoder; second, existing encoders only consider the context while not the given aspect. To address this problem, we argue the given aspect should be considered as a new clue out of context in the context modeling process. As for solutions, we design several aspect-aware context encoders based on different backbones: an aspect-aware LSTM and three aspect-aware BERTs. They are dedicated to generate aspect-aware hidden states which are tailored for ABSA task. In these aspect-aware context encoders, the semantics of the given aspect is used to regulate the information flow. Consequently, the aspect-related information can be retained and aspect-irrelevant information can be excluded in the generated hidden states. We conduct extensive experiments on several benchmark datasets with empirical analysis, demonstrating the efficacies and advantages of our proposed aspect-aware context encoders.
    A Review of Speaker Diarization: Recent Advances with Deep Learning. (arXiv:2101.09624v2 [eess.AS] UPDATED)
    (2 min) Speaker diarization is a task to label audio or video recordings with classes that correspond to speaker identity, or in short, a task to identify "who spoke when". In the early years, speaker diarization algorithms were developed for speech recognition on multispeaker audio recordings to enable speaker adaptive processing. These algorithms also gained their own value as a standalone application over time to provide speaker-specific metainformation for downstream tasks such as audio retrieval. More recently, with the emergence of deep learning technology, which has driven revolutionary changes in research and practices across speech application domains, rapid advancements have been made for speaker diarization. In this paper, we review not only the historical development of speaker diarization technology but also the recent advancements in neural speaker diarization approaches. Furthermore, we discuss how speaker diarization systems have been integrated with speech recognition applications and how the recent surge of deep learning is leading the way of jointly modeling these two components to be complementary to each other. By considering such exciting technical trends, we believe that this paper is a valuable contribution to the community to provide a survey work by consolidating the recent developments with neural methods and thus facilitating further progress toward a more efficient speaker diarization.
    Structure-Grounded Pretraining for Text-to-SQL. (arXiv:2010.12773v2 [cs.CL] UPDATED)
    (2 min) Learning to capture text-table alignment is essential for tasks like text-to-SQL. A model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel prediction tasks: column grounding, value grounding and column-value mapping, and leverage them to pretrain a text-table encoder. Additionally, to evaluate different methods under more realistic text-table alignment settings, we create a new evaluation set Spider-Realistic based on Spider dev set with explicit mentions of column names removed, and adopt eight existing text-to-SQL datasets for cross-database evaluation. STRUG brings significant improvement over BERT-LARGE in all settings. Compared with existing pretraining methods such as GRAPPA, STRUG achieves similar performance on Spider, and outperforms all baselines on more realistic sets. All the code and data used in this work is public available at https://aka.ms/strug.
    A Disentangled Adversarial Neural Topic Model for Separating Opinions from Plots in User Reviews. (arXiv:2010.11384v2 [cs.CL] UPDATED)
    (2 min) The flexibility of the inference process in Variational Autoencoders (VAEs) has recently led to revising traditional probabilistic topic models giving rise to Neural Topic Models (NTMs). Although these approaches have achieved significant results, surprisingly very little work has been done on how to disentangle the latent topics. Existing topic models when applied to reviews may extract topics associated with writers' subjective opinions mixed with those related to factual descriptions such as plot summaries in movie and book reviews. It is thus desirable to automatically separate opinion topics from plot/neutral ones enabling a better interpretability. In this paper, we propose a neural topic model combined with adversarial training to disentangle opinion topics from plot and neutral ones. We conduct an extensive experimental assessment introducing a new collection of movie and book reviews paired with their plots, namely MOBO dataset, showing an improved coherence and variety of topics, a consistent disentanglement rate, and sentiment classification performance superior to other supervised topic models.
    FNet: Mixing Tokens with Fourier Transforms. (arXiv:2105.03824v2 [cs.CL] UPDATED)
    (2 min) We show that Transformer encoder architectures can be massively sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear transformations, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains nearly seven times faster on GPUs and twice as fast on TPUs. The resulting model, FNet, also scales very efficiently to long inputs. Specifically, when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, but is faster than the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes: for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
    "Subverting the Jewtocracy": Online Antisemitism Detection Using Multimodal Deep Learning. (arXiv:2104.05947v3 [cs.MM] UPDATED)
    (2 min) The exponential rise of online social media has enabled the creation, distribution, and consumption of information at an unprecedented rate. However, it has also led to the burgeoning of various forms of online abuse. Increasing cases of online antisemitism have become one of the major concerns because of its socio-political consequences. Unlike other major forms of online abuse like racism, sexism, etc., online antisemitism has not been studied much from a machine learning perspective. To the best of our knowledge, we present the first work in the direction of automated multimodal detection of online antisemitism. The task poses multiple challenges that include extracting signals across multiple modalities, contextual references, and handling multiple aspects of antisemitism. Unfortunately, there does not exist any publicly available benchmark corpus for this critical task. Hence, we collect and label two datasets with 3,102 and 3,509 social media posts from Twitter and Gab respectively. Further, we present a multimodal deep learning system that detects the presence of antisemitic content and its specific antisemitism category using text and images from posts. We perform an extensive set of experiments on the two datasets to evaluate the efficacy of the proposed system. Finally, we also present a qualitative analysis of our study.
    Does Robustness Improve Fairness? Approaching Fairness with Word Substitution Robustness Methods for Text Classification. (arXiv:2106.10826v1 [cs.CL])
    (2 min) Existing bias mitigation methods to reduce disparities in model outcomes across cohorts have focused on data augmentation, debiasing model embeddings, or adding fairness-based optimization objectives during training. Separately, certified word substitution robustness methods have been developed to decrease the impact of spurious features and synonym substitutions on model predictions. While their end goals are different, they both aim to encourage models to make the same prediction for certain changes in the input. In this paper, we investigate the utility of certified word substitution robustness methods to improve equality of odds and equality of opportunity on multiple text classification tasks. We observe that certified robustness methods improve fairness, and using both robustness and bias mitigation methods in training results in an improvement in both fronts
    Pay Better Attention to Attention: Head Selection in Multilingual and Multi-Domain Sequence Modeling. (arXiv:2106.10840v1 [cs.CL])
    (2 min) Multi-head attention has each of the attention heads collect salient information from different parts of an input sequence, making it a powerful mechanism for sequence modeling. Multilingual and multi-domain learning are common scenarios for sequence modeling, where the key challenge is to maximize positive transfer and mitigate negative transfer across languages and domains. In this paper, we find that non-selective attention sharing is sub-optimal for achieving good generalization across all languages and domains. We further propose attention sharing strategies to facilitate parameter sharing and specialization in multilingual and multi-domain sequence modeling. Our approach automatically learns shared and specialized attention heads for different languages and domains to mitigate their interference. Evaluated in various tasks including speech recognition, text-to-text and speech-to-text translation, the proposed attention sharing strategies consistently bring gains to sequence models built upon multi-head attention. For speech-to-text translation, our approach yields an average of $+2.0$ BLEU over $13$ language directions in multilingual setting and $+2.0$ BLEU over $3$ domains in multi-domain setting.
    ROPE: Reading Order Equivariant Positional Encoding for Graph-based Document Information Extraction. (arXiv:2106.10786v1 [cs.CL])
    (2 min) Natural reading orders of words are crucial for information extraction from form-like documents. Despite recent advances in Graph Convolutional Networks (GCNs) on modeling spatial layout patterns of documents, they have limited ability to capture reading orders of given word-level node representations in a graph. We propose Reading Order Equivariant Positional Encoding (ROPE), a new positional encoding technique designed to apprehend the sequential presentation of words in documents. ROPE generates unique reading order codes for neighboring words relative to the target word given a word-level graph connectivity. We study two fundamental document entity extraction tasks including word labeling and word grouping on the public FUNSD dataset and a large-scale payment dataset. We show that ROPE consistently improves existing GCNs with a margin up to 8.4% F1-score.
    Improving Dialog Systems for Negotiation with Personality Modeling. (arXiv:2010.09954v2 [cs.CL] UPDATED)
    (2 min) In this paper, we explore the ability to model and infer personality types of opponents, predict their responses, and use this information to adapt a dialog agent's high-level strategy in negotiation tasks. Inspired by the idea of incorporating a theory of mind (ToM) into machines, we introduce a probabilistic formulation to encapsulate the opponent's personality type during both learning and inference. We test our approach on the CraigslistBargain dataset and show that our method using ToM inference achieves a 20% higher dialog agreement rate compared to baselines on a mixed population of opponents. We also find that our model displays diverse negotiation behavior with different types of opponents.
    CoreGen: Contextualized Code Representation Learning for Commit Message Generation. (arXiv:2007.06934v3 [cs.CL] UPDATED)
    (2 min) Automatic generation of high-quality commit messages for code commits can substantially facilitate software developers' works and coordination. However, the semantic gap between source code and natural language poses a major challenge for the task. Several studies have been proposed to alleviate the challenge but none explicitly involves code contextual information during commit message generation. Specifically, existing research adopts static embedding for code tokens, which maps a token to the same vector regardless of its context. In this paper, we propose a novel Contextualized code representation learning strategy for commit message Generation (CoreGen). CoreGen first learns contextualized code representations which exploit the contextual information behind code commit sequences. The learned representations of code commits built upon Transformer are then fine-tuned for downstream commit message generation. Experiments on the benchmark dataset demonstrate the superior effectiveness of our model over the baseline models with at least 28.18% improvement in terms of BLEU-4 score. Furthermore, we also highlight the future opportunities in training contextualized code representations on larger code corpus as a solution to low-resource tasks and adapting the contextualized code representation framework to other code-to-text generation tasks.
    DISCO PAL: Diachronic Spanish Sonnet Corpus with Psychological and Affective Labels. (arXiv:2007.04626v3 [cs.CL] UPDATED)
    (3 min) Nowadays, there are many applications of text mining over corpora from different languages. However, most of them are based on texts in prose, lacking applications that work with poetry texts. An example of an application of text mining in poetry is the usage of features derived from their individual words in order to capture the lexical, sublexical and interlexical meaning, and infer the General Affective Meaning (GAM) of the text. However, even though this proposal has been proved as useful for poetry in some languages, there is a lack of studies for both Spanish poetry and for highly-structured poetic compositions such as sonnets. This article presents a study over an annotated corpus of Spanish sonnets, in order to analyse if it is possible to build features from their individual words for predicting their GAM. The purpose of this is to model sonnets at an affective level. The article also analyses the relationship between the GAM of the sonnets and the content itself. For this, we consider the content from a psychological perspective, identifying with tags when a sonnet is related to a specific term. Then, we study how GAM changes according to each of those psychological terms. The corpus used contains 274 Spanish sonnets from authors of different centuries, from 15th to 19th. This corpus was annotated by different domain experts. The experts annotated the poems with affective and lexico-semantic features, as well as with domain concepts that belong to psychology. Thanks to this, the corpus of sonnets can be used in different applications, such as poetry recommender systems, personality text mining studies of the authors, or the usage of poetry for therapeutic purposes.
    Order in the Court: Explainable AI Methods Prone to Disagreement. (arXiv:2105.03287v2 [cs.LG] UPDATED)
    (2 min) By computing the rank correlation between attention weights and feature-additive explanation methods, previous analyses either invalidate or support the role of attention-based explanations as a faithful and plausible measure of salience. To investigate whether this approach is appropriate, we compare LIME, Integrated Gradients, DeepLIFT, Grad-SHAP, Deep-SHAP, and attention-based explanations, applied to two neural architectures trained on single- and pair-sequence language tasks. In most cases, we find that none of our chosen methods agree. Based on our empirical observations and theoretical objections, we conclude that rank correlation does not measure the quality of feature-additive methods. Practitioners should instead use the numerous and rigorous diagnostic methods proposed by the community.
    Fine-grained Fact Verification with Kernel Graph Attention Network. (arXiv:1910.09796v4 [cs.CL] UPDATED)
    (2 min) Fact Verification requires fine-grained natural language inference capability that finds subtle clues to identify the syntactical and semantically correct but not well-supported claims. This paper presents Kernel Graph Attention Network (KGAT), which conducts more fine-grained fact verification with kernel-based attentions. Given a claim and a set of potential evidence sentences that form an evidence graph, KGAT introduces node kernels, which better measure the importance of the evidence node, and edge kernels, which conduct fine-grained evidence propagation in the graph, into Graph Attention Networks for more accurate fact verification. KGAT achieves a 70.38% FEVER score and significantly outperforms existing fact verification models on FEVER, a large-scale benchmark for fact verification. Our analyses illustrate that, compared to dot-product attentions, the kernel-based attention concentrates more on relevant evidence sentences and meaningful clues in the evidence graph, which is the main source of KGAT's effectiveness.
    SUPERB: Speech processing Universal PERformance Benchmark. (arXiv:2105.01051v2 [cs.CL] UPDATED)
    (2 min) Self-supervised learning (SSL) has proven vital for advancing research in natural language processing (NLP) and computer vision (CV). The paradigm pretrains a shared model on large volumes of unlabeled data and achieves state-of-the-art (SOTA) for various tasks with minimal adaptation. However, the speech processing community lacks a similar setup to systematically explore the paradigm. To bridge this gap, we introduce Speech processing Universal PERformance Benchmark (SUPERB). SUPERB is a leaderboard to benchmark the performance of a shared model across a wide range of speech processing tasks with minimal architecture changes and labeled data. Among multiple usages of the shared model, we especially focus on extracting the representation learned from SSL due to its preferable re-usability. We present a simple framework to solve SUPERB tasks by learning task-specialized lightweight prediction heads on top of the frozen shared model. Our results demonstrate that the framework is promising as SSL representations show competitive generalizability and accessibility across SUPERB tasks. We release SUPERB as a challenge with a leaderboard and a benchmark toolkit to fuel the research in representation learning and general speech processing.
    A Sequence-to-Set Network for Nested Named Entity Recognition. (arXiv:2105.08901v2 [cs.CL] UPDATED)
    (2 min) Named entity recognition (NER) is a widely studied task in natural language processing. Recently, a growing number of studies have focused on the nested NER. The span-based methods, considering the entity recognition as a span classification task, can deal with nested entities naturally. But they suffer from the huge search space and the lack of interactions between entities. To address these issues, we propose a novel sequence-to-set neural network for nested NER. Instead of specifying candidate spans in advance, we provide a fixed set of learnable vectors to learn the patterns of the valuable spans. We utilize a non-autoregressive decoder to predict the final set of entities in one pass, in which we are able to capture dependencies between entities. Compared with the sequence-to-sequence method, our model is more suitable for such unordered recognition task as it is insensitive to the label order. In addition, we utilize the loss function based on bipartite matching to compute the overall training loss. Experimental results show that our proposed model achieves state-of-the-art on three nested NER corpora: ACE 2004, ACE 2005 and KBP 2017. The code is available at https://github.com/zqtan1024/sequence-to-set.
    Empower Distantly Supervised Relation Extraction with Collaborative Adversarial Training. (arXiv:2106.10835v1 [cs.CL])
    (2 min) With recent advances in distantly supervised (DS) relation extraction (RE), considerable attention is attracted to leverage multi-instance learning (MIL) to distill high-quality supervision from the noisy DS. Here, we go beyond label noise and identify the key bottleneck of DS-MIL to be its low data utilization: as high-quality supervision being refined by MIL, MIL abandons a large amount of training instances, which leads to a low data utilization and hinders model training from having abundant supervision. In this paper, we propose collaborative adversarial training to improve the data utilization, which coordinates virtual adversarial training (VAT) and adversarial training (AT) at different levels. Specifically, since VAT is label-free, we employ the instance-level VAT to recycle instances abandoned by MIL. Besides, we deploy AT at the bag-level to unleash the full potential of the high-quality supervision got by MIL. Our proposed method brings consistent improvements (~ 5 absolute AUC score) to the previous state of the art, which verifies the importance of the data utilization issue and the effectiveness of our method.
    Unsupervised Learning of Disentangled Speech Content and Style Representation. (arXiv:2010.12973v2 [cs.CL] UPDATED)
    (2 min) We present an approach for unsupervised learning of speech representation disentangling contents and styles. Our model consists of: (1) a local encoder that captures per-frame information; (2) a global encoder that captures per-utterance information; and (3) a conditional decoder that reconstructs speech given local and global latent variables. Our experiments show that (1) the local latent variables encode speech contents, as reconstructed speech can be recognized by ASR with low word error rates (WER), even with a different global encoding; (2) the global latent variables encode speaker style, as reconstructed speech shares speaker identity with the source utterance of the global encoding. Additionally, we demonstrate an useful application from our pre-trained model, where we can train a speaker recognition model from the global latent variables and achieve high accuracy by fine-tuning with as few data as one label per speaker.
    Learning to Augment for Data-Scarce Domain BERT Knowledge Distillation. (arXiv:2101.08106v2 [cs.CL] UPDATED)
    (2 min) Despite pre-trained language models such as BERT have achieved appealing performance in a wide range of natural language processing tasks, they are computationally expensive to be deployed in real-time applications. A typical method is to adopt knowledge distillation to compress these large pre-trained models (teacher models) to small student models. However, for a target domain with scarce training data, the teacher can hardly pass useful knowledge to the student, which yields performance degradation for the student models. To tackle this problem, we propose a method to learn to augment for data-scarce domain BERT knowledge distillation, by learning a cross-domain manipulation scheme that automatically augments the target with the help of resource-rich source domains. Specifically, the proposed method generates samples acquired from a stationary distribution near the target data and adopts a reinforced selector to automatically refine the augmentation strategy according to the performance of the student. Extensive experiments demonstrate that the proposed method significantly outperforms state-of-the-art baselines on four different tasks, and for the data-scarce domains, the compressed student models even perform better than the original large teacher model, with much fewer parameters (only ${\sim}13.3\%$) when only a few labeled examples available.
    Context-Aware Legal Citation Recommendation using Deep Learning. (arXiv:2106.10776v1 [cs.IR])
    (2 min) Lawyers and judges spend a large amount of time researching the proper legal authority to cite while drafting decisions. In this paper, we develop a citation recommendation tool that can help improve efficiency in the process of opinion drafting. We train four types of machine learning models, including a citation-list based method (collaborative filtering) and three context-based methods (text similarity, BiLSTM and RoBERTa classifiers). Our experiments show that leveraging local textual context improves recommendation, and that deep neural models achieve decent performance. We show that non-deep text-based methods benefit from access to structured case metadata, but deep models only benefit from such access when predicting from context of insufficient length. We also find that, even after extensive training, RoBERTa does not outperform a recurrent neural model, despite its benefits of pretraining. Our behavior analysis of the RoBERTa model further shows that predictive performance is stable across time and citation classes.
    Efficient Urdu Caption Generation using Attention based LSTM. (arXiv:2008.01663v4 [cs.CL] UPDATED)
    (2 min) Recent advancements in deep learning have created many opportunities to solve real-world problems that remained unsolved for more than a decade. Automatic caption generation is a major research field, and the research community has done a lot of work on it in most common languages like English. Urdu is the national language of Pakistan and also much spoken and understood in the sub-continent region of Pakistan-India, and yet no work has been done for Urdu language caption generation. Our research aims to fill this gap by developing an attention-based deep learning model using techniques of sequence modeling specialized for the Urdu language. We have prepared a dataset in the Urdu language by translating a subset of the "Flickr8k" dataset containing 700 'man' images. We evaluate our proposed technique on this dataset and show that it can achieve a BLEU score of 0.83 in the Urdu language. We improve on the previous state-of-the-art by using better CNN architectures and optimization techniques. Furthermore, we provide a discussion on how the generated captions can be made correct grammar-wise.
    Inducing Language-Agnostic Multilingual Representations. (arXiv:2008.09112v2 [cs.CL] UPDATED)
    (2 min) Cross-lingual representations have the potential to make NLP techniques available to the vast majority of languages in the world. However, they currently require large pretraining corpora or access to typologically similar languages. In this work, we address these obstacles by removing language identity signals from multilingual embeddings. We examine three approaches for this: (i) re-aligning the vector spaces of target languages (all together) to a pivot source language; (ii) removing language-specific means and variances, which yields better discriminativeness of embeddings as a by-product; and (iii) increasing input similarity across languages by removing morphological contractions and sentence reordering. We evaluate on XNLI and reference-free MT across 19 typologically diverse languages. Our findings expose the limitations of these approaches -- unlike vector normalization, vector space re-alignment and text normalization do not achieve consistent gains across encoders and languages. Due to the approaches' additive effects, their combination decreases the cross-lingual transfer gap by 8.9 points (m-BERT) and 18.2 points (XLM-R) on average across all tasks and languages, however. Our code and models are publicly available.
    Speech2Phone: A Novel and Efficient Method for Training Speaker Recognition Models. (arXiv:2002.11213v2 [cs.CL] UPDATED)
    (2 min) In this paper we present an efficient method for training models for speaker recognition using small or under-resourced datasets. This method requires less data than other SOTA (State-Of-The-Art) methods, e.g. the Angular Prototypical and GE2E loss functions, while achieving similar results to those methods. This is done using the knowledge of the reconstruction of a phoneme in the speaker's voice. For this purpose, a new dataset was built, composed of 40 male speakers, who read sentences in Portuguese, totaling approximately 3h. We compare the three best architectures trained using our method to select the best one, which is the one with a shallow architecture. Then, we compared this model with the SOTA method for the speaker recognition task: the Fast ResNet-34 trained with approximately 2,000 hours, using the loss functions Angular Prototypical and GE2E. Three experiments were carried out with datasets in different languages. Among these three experiments, our model achieved the second best result in two experiments and the best result in one of them. This highlights the importance of our method, which proved to be a great competitor to SOTA speaker recognition models, with 500x less data and a simpler approach.
    Dialogue Relation Extraction with Document-level Heterogeneous Graph Attention Networks. (arXiv:2009.05092v3 [cs.CL] UPDATED)
    (2 min) Dialogue relation extraction (DRE) aims to detect the relation between two entities mentioned in a multi-party dialogue. It plays an important role in constructing knowledge graphs from conversational data increasingly abundant on the internet and facilitating intelligent dialogue system development. The prior methods of DRE do not meaningfully leverage speaker information-they just prepend the utterances with the respective speaker names. Thus, they fail to model the crucial inter-speaker relations that may give additional context to relevant argument entities through pronouns and triggers. We, however, present a graph attention network-based method for DRE where a graph, that contains meaningfully connected speaker, entity, entity-type, and utterance nodes, is constructed. This graph is fed to a graph attention network for context propagation among relevant nodes, which effectively captures the dialogue context. We empirically show that this graph-based approach quite effectively captures the relations between different entity pairs in a dialogue as it outperforms the state-of-the-art approaches by a significant margin on the benchmark dataset DialogRE. Our code is released at: https://github.com/declare-lab/dialog-HGAT
    Institutional Grammar 2.0 Codebook. (arXiv:2008.08937v3 [cs.MA] UPDATED)
    (3 min) The Grammar of Institutions, or Institutional Grammar, is an established approach to encode policy information in terms of institutional statements based on a set of pre-defined syntactic components. This codebook provides coding guidelines for a revised version of the Institutional Grammar, the Institutional Grammar 2.0 (IG 2.0). IG 2.0 is a specification that aims at facilitating the encoding of policy to meet varying analytical objectives. To this end, it revises the grammar with respect to comprehensiveness, flexibility, and specificity by offering multiple levels of expressiveness (IG Core, IG Extended, IG Logico). In addition to the encoding of regulative statements, it further introduces the encoding of constitutive institutional statements, as well as statements that exhibit both constitutive and regulative characteristics. Introducing those aspects, the codebook initially covers fundamental concepts of IG 2.0, before providing an overview of pre-coding steps relevant for document preparation. Detailed coding guidelines are provided for both regulative and constitutive statements across all levels of expressiveness, along with the encoding guidelines for statements of mixed form -- hybrid and polymorphic institutional statements. The document further provides an overview of taxonomies used in the encoding process and referred to throughout the codebook. The codebook concludes with a summary and discussion of relevant considerations to facilitate the coding process. An initial Reader's Guide helps the reader tailor the content to her interest. Note that this codebook specifically focuses on operational aspects of IG 2.0 in the context of policy coding. Links to additional resources such as the underlying scientific literature (that offers a comprehensive treatment of the underlying theoretical concepts) are referred to in the concluding section of the codebook.
    CPM-2: Large-scale Cost-effective Pre-trained Language Models. (arXiv:2106.10715v1 [cs.CL])
    (2 min) In recent years, the size of pre-trained language models (PLMs) has grown by leaps and bounds. However, efficiency issues of these large-scale PLMs limit their utilization in real-world scenarios. We present a suite of cost-effective techniques for the use of PLMs to deal with the efficiency issues of pre-training, fine-tuning, and inference. (1) We introduce knowledge inheritance to accelerate the pre-training process by exploiting existing PLMs instead of training models from scratch. (2) We explore the best practice of prompt tuning with large-scale PLMs. Compared with conventional fine-tuning, prompt tuning significantly reduces the number of task-specific parameters. (3) We implement a new inference toolkit, namely InfMoE, for using large-scale PLMs with limited computational resources. Based on our cost-effective pipeline, we pre-train two models: an encoder-decoder bilingual model with 11 billion parameters (CPM-2) and its corresponding MoE version with 198 billion parameters. In our experiments, we compare CPM-2 with mT5 on downstream tasks. Experimental results show that CPM-2 has excellent general language intelligence. Moreover, we validate the efficiency of InfMoE when conducting inference of large-scale models having tens of billions of parameters on a single GPU. All source code and model parameters are available at https://github.com/TsinghuaAI/CPM.
    Ensemble of MRR and NDCG models for Visual Dialog. (arXiv:2104.07511v2 [cs.AI] UPDATED)
    (2 min) Assessing an AI agent that can converse in human language and understand visual content is challenging. Generation metrics, such as BLEU scores favor correct syntax over semantics. Hence a discriminative approach is often used, where an agent ranks a set of candidate options. The mean reciprocal rank (MRR) metric evaluates the model performance by taking into account the rank of a single human-derived answer. This approach, however, raises a new challenge: the ambiguity and synonymy of answers, for instance, semantic equivalence (e.g., `yeah' and `yes'). To address this, the normalized discounted cumulative gain (NDCG) metric has been used to capture the relevance of all the correct answers via dense annotations. However, the NDCG metric favors the usually applicable uncertain answers such as `I don't know. Crafting a model that excels on both MRR and NDCG metrics is challenging. Ideally, an AI agent should answer a human-like reply and validate the correctness of any answer. To address this issue, we describe a two-step non-parametric ranking approach that can merge strong MRR and NDCG models. Using our approach, we manage to keep most MRR state-of-the-art performance (70.41% vs. 71.24%) and the NDCG state-of-the-art performance (72.16% vs. 75.35%). Moreover, our approach won the recent Visual Dialog 2020 challenge. Source code is available at https://github.com/idansc/mrr-ndcg.
    Do Encoder Representations of Generative Dialogue Models Encode Sufficient Information about the Task ?. (arXiv:2106.10622v1 [cs.CL])
    (2 min) Predicting the next utterance in dialogue is contingent on encoding of users' input text to generate appropriate and relevant response in data-driven approaches. Although the semantic and syntactic quality of the language generated is evaluated, more often than not, the encoded representation of input is not evaluated. As the representation of the encoder is essential for predicting the appropriate response, evaluation of encoder representation is a challenging yet important problem. In this work, we showcase evaluating the text generated through human or automatic metrics is not sufficient to appropriately evaluate soundness of the language understanding of dialogue models and, to that end, propose a set of probe tasks to evaluate encoder representation of different language encoders commonly used in dialogue models. From experiments, we observe that some of the probe tasks are easier and some are harder for even sophisticated model architectures to learn. And, through experiments we observe that RNN based architectures have lower performance on automatic metrics on text generation than transformer model but perform better than the transformer model on the probe tasks indicating that RNNs might preserve task information better than the Transformers.
    Enhancing Question Generation with Commonsense Knowledge. (arXiv:2106.10454v1 [cs.CL])
    (2 min) Question generation (QG) is to generate natural and grammatical questions that can be answered by a specific answer for a given context. Previous sequence-to-sequence models suffer from a problem that asking high-quality questions requires commonsense knowledge as backgrounds, which in most cases can not be learned directly from training data, resulting in unsatisfactory questions deprived of knowledge. In this paper, we propose a multi-task learning framework to introduce commonsense knowledge into question generation process. We first retrieve relevant commonsense knowledge triples from mature databases and select triples with the conversion information from source context to question. Based on these informative knowledge triples, we design two auxiliary tasks to incorporate commonsense knowledge into the main QG model, where one task is Concept Relation Classification and the other is Tail Concept Generation. Experimental results on SQuAD show that our proposed methods are able to noticeably improve the QG performance on both automatic and human evaluation metrics, demonstrating that incorporating external commonsense knowledge with multi-task learning can help the model generate human-like and high-quality questions.
    Calliar: An Online Handwritten Dataset for Arabic Calligraphy. (arXiv:2106.10745v1 [cs.CL])
    (2 min) Calligraphy is an essential part of the Arabic heritage and culture. It has been used in the past for the decoration of houses and mosques. Usually, such calligraphy is designed manually by experts with aesthetic insights. In the past few years, there has been a considerable effort to digitize such type of art by either taking a photo of decorated buildings or drawing them using digital devices. The latter is considered an online form where the drawing is tracked by recording the apparatus movement, an electronic pen for instance, on a screen. In the literature, there are many offline datasets collected with a diversity of Arabic styles for calligraphy. However, there is no available online dataset for Arabic calligraphy. In this paper, we illustrate our approach for the collection and annotation of an online dataset for Arabic calligraphy called Calliar that consists of 2,500 sentences. Calliar is annotated for stroke, character, word and sentence level prediction.
    Challenges in Translation of Emotions in Multilingual User-Generated Content: Twitter as a Case Study. (arXiv:2106.10719v1 [cs.CL])
    (2 min) Although emotions are universal concepts, transferring the different shades of emotion from one language to another may not always be straightforward for human translators, let alone for machine translation systems. Moreover, the cognitive states are established by verbal explanations of experience which is shaped by both the verbal and cultural contexts. There are a number of verbal contexts where expression of emotions constitutes the pivotal component of the message. This is particularly true for User-Generated Content (UGC) which can be in the form of a review of a product or a service, a tweet, or a social media post. Recently, it has become common practice for multilingual websites such as Twitter to provide an automatic translation of UGC to reach out to their linguistically diverse users. In such scenarios, the process of translating the user's emotion is entirely automatic with no human intervention, neither for post-editing nor for accuracy checking. In this research, we assess whether automatic translation tools can be a successful real-life utility in transferring emotion in user-generated multilingual data such as tweets. We show that there are linguistic phenomena specific of Twitter data that pose a challenge in translation of emotions in different languages. We summarise these challenges in a list of linguistic features and show how frequent these features are in different language pairs. We also assess the capacity of commonly used methods for evaluating the performance of an MT system with respect to the preservation of emotion in the source text.
    Transformers for Headline Selection for Russian News Clusters. (arXiv:2106.10487v1 [cs.CL])
    (2 min) In this paper, we explore various multilingual and Russian pre-trained transformer-based models for the Dialogue Evaluation 2021 shared task on headline selection. Our experiments show that the combined approach is superior to individual multilingual and monolingual models. We present an analysis of a number of ways to obtain sentence embeddings and learn a ranking model on top of them. We achieve the result of 87.28% and 86.60% accuracy for the public and private test sets respectively.
    Multi-Pair Text Style Transfer on Unbalanced Data. (arXiv:2106.10608v1 [cs.CL])
    (2 min) Text-style transfer aims to convert text given in one domain into another by paraphrasing the sentence or substituting the keywords without altering the content. By necessity, state-of-the-art methods have evolved to accommodate nonparallel training data, as it is frequently the case there are multiple data sources of unequal size, with a mixture of labeled and unlabeled sentences. Moreover, the inherent style defined within each source might be distinct. A generic bidirectional (e.g., formal $\Leftrightarrow$ informal) style transfer regardless of different groups may not generalize well to different applications. In this work, we developed a task adaptive meta-learning framework that can simultaneously perform a multi-pair text-style transfer using a single model. The proposed method can adaptively balance the difference of meta-knowledge across multiple tasks. Results show that our method leads to better quantitative performance as well as coherent style variations. Common challenges of unbalanced data and mismatched domains are handled well by this method.
    Hybrid approach to detecting symptoms of depression in social media entries. (arXiv:2106.10485v1 [cs.CL])
    (2 min) Sentiment and lexical analyses are widely used to detect depression or anxiety disorders. It has been documented that there are significant differences in the language used by a person with emotional disorders in comparison to a healthy individual. Still, the effectiveness of these lexical approaches could be improved further because the current analysis focuses on what the social media entries are about, and not how they are written. In this study, we focus on aspects in which these short texts are similar to each other, and how they were created. We present an innovative approach to the depression screening problem by applying Collgram analysis, which is a known effective method of obtaining linguistic information from texts. We compare these results with sentiment analysis based on the BERT architecture. Finally, we create a hybrid model achieving a diagnostic accuracy of 71%.
    Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets. (arXiv:2106.10328v1 [cs.CL])
    (2 min) Language models can generate harmful and biased outputs and exhibit undesirable behavior. We propose a Process for Adapting Language Models to Society (PALMS) with Values-Targeted Datasets, an iterative process to significantly change model behavior by crafting and fine-tuning on a dataset that reflects a predetermined set of target values. We evaluate our process using three metrics: quantitative metrics with human evaluations that score output adherence to a target value, and toxicity scoring on outputs; and qualitative metrics analyzing the most common word associated with a given social category. Through each iteration, we add additional training dataset examples based on observed shortcomings from evaluations. PALMS performs significantly better on all metrics compared to baseline and control models for a broad range of GPT-3 language model sizes without compromising capability integrity. We find that the effectiveness of PALMS increases with model size. We show that significantly adjusting language model behavior is feasible with a small, hand-curated dataset.
    A Condense-then-Select Strategy for Text Summarization. (arXiv:2106.10468v1 [cs.CL])
    (2 min) Select-then-compress is a popular hybrid, framework for text summarization due to its high efficiency. This framework first selects salient sentences and then independently condenses each of the selected sentences into a concise version. However, compressing sentences separately ignores the context information of the document, and is therefore prone to delete salient information. To address this limitation, we propose a novel condense-then-select framework for text summarization. Our framework first concurrently condenses each document sentence. Original document sentences and their compressed versions then become the candidates for extraction. Finally, an extractor utilizes the context information of the document to select candidates and assembles them into a summary. If salient information is deleted during condensing, the extractor can select an original sentence to retain the information. Thus, our framework helps to avoid the loss of salient information, while preserving the high efficiency of sentence-level compression. Experiment results on the CNN/DailyMail, DUC-2002, and Pubmed datasets demonstrate that our framework outperforms the select-then-compress framework and other strong baselines.
    JointGT: Graph-Text Joint Representation Learning for Text Generation from Knowledge Graphs. (arXiv:2106.10502v1 [cs.CL])
    (2 min) Existing pre-trained models for knowledge-graph-to-text (KG-to-text) generation simply fine-tune text-to-text pre-trained models such as BART or T5 on KG-to-text datasets, which largely ignore the graph structure during encoding and lack elaborate pre-training tasks to explicitly model graph-text alignments. To tackle these problems, we propose a graph-text joint representation learning model called JointGT. During encoding, we devise a structure-aware semantic aggregation module which is plugged into each Transformer layer to preserve the graph structure. Furthermore, we propose three new pre-training tasks to explicitly enhance the graph-text alignment including respective text / graph reconstruction, and graph-text alignment in the embedding space via Optimal Transport. Experiments show that JointGT obtains new state-of-the-art performance on various KG-to-text datasets.
    A Brief Study on the Effects of Training Generative Dialogue Models with a Semantic loss. (arXiv:2106.10619v1 [cs.CL])
    (2 min) Neural models trained for next utterance generation in dialogue task learn to mimic the n-gram sequences in the training set with training objectives like negative log-likelihood (NLL) or cross-entropy. Such commonly used training objectives do not foster generating alternate responses to a context. But, the effects of minimizing an alternate training objective that fosters a model to generate alternate response and score it on semantic similarity has not been well studied. We hypothesize that a language generation model can improve on its diversity by learning to generate alternate text during training and minimizing a semantic loss as an auxiliary objective. We explore this idea on two different sized data sets on the task of next utterance generation in goal oriented dialogues. We make two observations (1) minimizing a semantic objective improved diversity in responses in the smaller data set (Frames) but only as-good-as minimizing the NLL in the larger data set (MultiWoZ) (2) large language model embeddings can be more useful as a semantic loss objective than as initialization for token embeddings.
    TweeNLP: A Twitter Exploration Portal for Natural Language Processing. (arXiv:2106.10512v1 [cs.CL])
    (2 min) We present TweeNLP, a one-stop portal that organizes Twitter's natural language processing (NLP) data and builds a visualization and exploration platform. It curates 19,395 tweets (as of April 2021) from various NLP conferences and general NLP discussions. It supports multiple features such as TweetExplorer to explore tweets by topics, visualize insights from Twitter activity throughout the organization cycle of conferences, discover popular research papers and researchers. It also builds a timeline of conference and workshop submission deadlines. We envision TweeNLP to function as a collective memory unit for the NLP community by integrating the tweets pertaining to research papers with the NLPExplorer scientific literature search engine. The current system is hosted at this http URL .
    Improving Compositional Generalization in Classification Tasks via Structure Annotations. (arXiv:2106.10434v1 [cs.LG])
    (2 min) Compositional generalization is the ability to generalize systematically to a new data distribution by combining known components. Although humans seem to have a great ability to generalize compositionally, state-of-the-art neural models struggle to do so. In this work, we study compositional generalization in classification tasks and present two main contributions. First, we study ways to convert a natural language sequence-to-sequence dataset to a classification dataset that also requires compositional generalization. Second, we show that providing structural hints (specifically, providing parse trees and entity links as attention masks for a Transformer model) helps compositional generalization.
  • cs.CV updates on arXiv.org

    Interpretable Face Manipulation Detection via Feature Whitening. (arXiv:2106.10834v1 [cs.CV])
    (2 min) Why should we trust the detections of deep neural networks for manipulated faces? Understanding the reasons is important for users in improving the fairness, reliability, privacy and trust of the detection models. In this work, we propose an interpretable face manipulation detection approach to achieve the trustworthy and accurate inference. The approach could make the face manipulation detection process transparent by embedding the feature whitening module. This module aims to whiten the internal working mechanism of deep networks through feature decorrelation and feature constraint. The experimental results demonstrate that our proposed approach can strike a balance between the detection accuracy and the model interpretability.
    ToAlign: Task-oriented Alignment for Unsupervised Domain Adaptation. (arXiv:2106.10812v1 [cs.CV])
    (2 min) Unsupervised domain adaptive classification intends to improve theclassification performance on unlabeled target domain. To alleviate the adverse effect of domain shift, many approaches align the source and target domains in the feature space. However, a feature is usually taken as a whole for alignment without explicitly making domain alignment proactively serve the classification task, leading to sub-optimal solution. What sub-feature should be aligned for better adaptation is under-explored. In this paper, we propose an effective Task-oriented Alignment (ToAlign) for unsupervised domain adaptation (UDA). We study what features should be aligned across domains and propose to make the domain alignment proactively serve classification by performing feature decomposition and alignment under the guidance of the prior knowledge induced from the classification taskitself. Particularly, we explicitly decompose a feature in the source domain intoa task-related/discriminative feature that should be aligned, and a task-irrelevant feature that should be avoided/ignored, based on the classification meta-knowledge. Extensive experimental results on various benchmarks (e.g., Office-Home, Visda-2017, and DomainNet) under different domain adaptation settings demonstrate theeffectiveness of ToAlign which helps achieve the state-of-the-art performance.
    S2-BNN: Bridging the Gap Between Self-Supervised Real and 1-bit Neural Networks via Guided Distribution Calibration. (arXiv:2102.08946v2 [cs.CV] UPDATED)
    (2 min) Previous studies dominantly target at self-supervised learning on real-valued networks and have achieved many promising results. However, on the more challenging binary neural networks (BNNs), this task has not yet been fully explored in the community. In this paper, we focus on this more difficult scenario: learning networks where both weights and activations are binary, meanwhile, without any human annotated labels. We observe that the commonly used contrastive objective is not satisfying on BNNs for competitive accuracy, since the backbone network contains relatively limited capacity and representation ability. Hence instead of directly applying existing self-supervised methods, which cause a severe decline in performance, we present a novel guided learning paradigm from real-valued to distill binary networks on the final prediction distribution, to minimize the loss and obtain desirable accuracy. Our proposed method can boost the simple contrastive learning baseline by an absolute gain of 5.5~15% on BNNs. We further reveal that it is difficult for BNNs to recover the similar predictive distributions as real-valued models when training without labels. Thus, how to calibrate them is key to address the degradation in performance. Extensive experiments are conducted on the large-scale ImageNet and downstream datasets. Our method achieves substantial improvement over the simple contrastive learning baseline, and is even comparable to many mainstream supervised BNN methods. Code is available at https://github.com/szq0214/S2-BNN.
    Which Parts Determine the Impression of the Font?. (arXiv:2103.14216v3 [cs.CV] UPDATED)
    (2 min) Various fonts give different impressions, such as legible, rough, and comic-text.This paper aims to analyze the correlation between the local shapes, or parts, and the impression of fonts. By focusing on local shapes instead of the whole letter shape, we can realize letter-shape independent and more general analysis. The analysis is performed by newly combining SIFT and DeepSets, to extract an arbitrary number of essential parts from a particular font and aggregate them to infer the font impressions by nonlinear regression. Our qualitative and quantitative analyses prove that (1)fonts with similar parts have similar impressions, (2)many impressions, such as legible and rough, largely depend on specific parts, (3)several impressions are very irrelevant to parts.
    3D Object Classification on Partial Point Clouds: A Practical Perspective. (arXiv:2012.10042v3 [cs.CV] UPDATED)
    (2 min) As a 3D counterpart of object classification in images, object point cloud classification is fundamental to 3D scene understanding, and has drawn great research attention since the release of benchmarking datasets, such as the ModelNet and the ShapeNet. These benchmarks assume point clouds covering complete surfaces of object instances, for which plenty of high-performing methods have been developed. However, their settings deviate from those often met in practice, where, due to (self-)occlusion, a point cloud covering partial surface of an object is captured from an arbitrary view. We show in this paper that performance of existing point cloud classification methods drops drastically under the considered practical single-view, partial setting; the phenomenon is consistent with the observation that semantic category of a partial object surface is less ambiguous only when its distribution on the whole surface is clearly specified. To this end, we argue for a single-view, partial setting where supervised learning of object pose estimation should be accompanied with classification. Technically, we propose a baseline method of Pose-Accompanied Point cloud classification Network (PAPNet); built upon SE(3)-equivariant convolutions, the PAPNet learns intermediate pose transformations for equivariant features defined on vector fields, which makes the subsequent classification easier (ideally) in the category-level, canonical pose. We adapt existing ModelNet40 and ScanNet datasets on point set classification to the introduced single-view, partial setting to verify our hypothesis. Thorough experiments confirm the necessity of object pose estimation; our PAPNet also outperforms existing methods greatly on the new benchmarks.
    This Looks Like That... Does it? Shortcomings of Latent Space Prototype Interpretability in Deep Networks. (arXiv:2105.02968v3 [cs.CV] UPDATED)
    (2 min) Deep neural networks that yield human interpretable decisions by architectural design have lately become an increasingly popular alternative to post hoc interpretation of traditional black-box models. Among these networks, the arguably most widespread approach is so-called prototype learning, where similarities to learned latent prototypes serve as the basis of classifying an unseen data point. In this work, we point to an important shortcoming of such approaches. Namely, there is a semantic gap between similarity in latent space and similarity in input space, which can corrupt interpretability. We design two experiments that exemplify this issue on the so-called ProtoPNet. Specifically, we find that this network's interpretability mechanism can be led astray by intentionally crafted or even JPEG compression artefacts, which can produce incomprehensible decisions. We argue that practitioners ought to have this shortcoming in mind when deploying prototype-based models in practice.
    Space-time Neural Irradiance Fields for Free-Viewpoint Video. (arXiv:2011.12950v2 [cs.CV] UPDATED)
    (2 min) We present a method that learns a spatiotemporal neural irradiance field for dynamic scenes from a single video. Our learned representation enables free-viewpoint rendering of the input video. Our method builds upon recent advances in implicit representations. Learning a spatiotemporal irradiance field from a single video poses significant challenges because the video contains only one observation of the scene at any point in time. The 3D geometry of a scene can be legitimately represented in numerous ways since varying geometry (motion) can be explained with varying appearance and vice versa. We address this ambiguity by constraining the time-varying geometry of our dynamic scene representation using the scene depth estimated from video depth estimation methods, aggregating contents from individual frames into a single global representation. We provide an extensive quantitative evaluation and demonstrate compelling free-viewpoint rendering results.
    Learning to Localize in New Environments from Synthetic Training Data. (arXiv:2011.04539v2 [cs.RO] UPDATED)
    (2 min) Most existing approaches for visual localization either need a detailed 3D model of the environment or, in the case of learning-based methods, must be retrained for each new scene. This can either be very expensive or simply impossible for large, unknown environments, for example in search-and-rescue scenarios. Although there are learning-based approaches that operate scene-agnostically, the generalization capability of these methods is still outperformed by classical approaches. In this paper, we present an approach that can generalize to new scenes by applying specific changes to the model architecture, including an extended regression part, the use of hierarchical correlation layers, and the exploitation of scale and uncertainty information. Our approach outperforms the 5-point algorithm using SIFT features on equally big images and additionally surpasses all previous learning-based approaches that were trained on different data. It is also superior to most of the approaches that were specifically trained on the respective scenes. We also evaluate our approach in a scenario where only very few reference images are available, showing that under such more realistic conditions our learning-based approach considerably exceeds both existing learning-based and classical methods.
    Edge, Ridge, and Blob Detection with Symmetric Molecules. (arXiv:1901.09723v3 [cs.CV] UPDATED)
    (2 min) We present a novel approach to the detection and characterization of edges, ridges, and blobs in two-dimensional images which exploits the symmetry properties of directionally sensitive analyzing functions in multiscale systems that are constructed in the framework of alpha-molecules. The proposed feature detectors are inspired by the notion of phase congruency, stable in the presence of noise, and by definition invariant to changes in contrast. We also show how the behavior of coefficients corresponding to differently scaled and oriented analyzing functions can be used to obtain a comprehensive characterization of the geometry of features in terms of local tangent directions, widths, and heights. The accuracy and robustness of the proposed measures are validated and compared to various state-of-the-art algorithms in extensive numerical experiments in which we consider sets of clean and distorted synthetic images that are associated with reliable ground truths. To further demonstrate the applicability, we show how the proposed ridge measure can be used to detect and characterize blood vessels in digital retinal images and how the proposed blob measure can be applied to automatically count the number of cell colonies in a Petri dish.
    Revisiting Model's Uncertainty and Confidences for Adversarial Example Detection. (arXiv:2103.05354v2 [cs.CR] UPDATED)
    (2 min) Security-sensitive applications that rely on Deep Neural Networks (DNNs) are vulnerable to small perturbations that are crafted to generate Adversarial Examples(AEs). The AEs are imperceptible to humans and cause DNN to misclassify them. Many defense and detection techniques have been proposed. Model's confidences and Dropout, as a popular way to estimate the model's uncertainty, have been used for AE detection but they showed limited success against black- and gray-box attacks. Moreover, the state-of-the-art detection techniques have been designed for specific attacks or broken by others, need knowledge about the attacks, are not consistent, increase model parameters overhead, are time-consuming, or have latency in inference time. To trade off these factors, we revisit the model's uncertainty and confidences and propose a novel unsupervised ensemble AE detection mechanism that 1) uses the uncertainty method called SelectiveNet, 2) processes model layers outputs, i.e.feature maps, to generate new confidence probabilities. The detection method is called Selective and Feature based Adversarial Detection (SFAD). Experimental results show that the proposed approach achieves better performance against black- and gray-box attacks than the state-of-the-art methods and achieves comparable performance against white-box attacks. Moreover, results show that SFAD is fully robust against High Confidence Attacks (HCAs) for MNIST and partially robust for CIFAR10 datasets.
    Ensemble of MRR and NDCG models for Visual Dialog. (arXiv:2104.07511v2 [cs.AI] UPDATED)
    (2 min) Assessing an AI agent that can converse in human language and understand visual content is challenging. Generation metrics, such as BLEU scores favor correct syntax over semantics. Hence a discriminative approach is often used, where an agent ranks a set of candidate options. The mean reciprocal rank (MRR) metric evaluates the model performance by taking into account the rank of a single human-derived answer. This approach, however, raises a new challenge: the ambiguity and synonymy of answers, for instance, semantic equivalence (e.g., `yeah' and `yes'). To address this, the normalized discounted cumulative gain (NDCG) metric has been used to capture the relevance of all the correct answers via dense annotations. However, the NDCG metric favors the usually applicable uncertain answers such as `I don't know. Crafting a model that excels on both MRR and NDCG metrics is challenging. Ideally, an AI agent should answer a human-like reply and validate the correctness of any answer. To address this issue, we describe a two-step non-parametric ranking approach that can merge strong MRR and NDCG models. Using our approach, we manage to keep most MRR state-of-the-art performance (70.41% vs. 71.24%) and the NDCG state-of-the-art performance (72.16% vs. 75.35%). Moreover, our approach won the recent Visual Dialog 2020 challenge. Source code is available at https://github.com/idansc/mrr-ndcg.
    Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge. (arXiv:2012.11696v2 [cs.CV] UPDATED)
    (2 min) Image captioning has recently demonstrated impressive progress largely owing to the introduction of neural network algorithms trained on curated dataset like MS-COCO. Often work in this field is motivated by the promise of deployment of captioning systems in practical applications. However, the scarcity of data and contexts in many competition datasets renders the utility of systems trained on these datasets limited as an assistive technology in real-world settings, such as helping visually impaired people navigate and accomplish everyday tasks. This gap motivated the introduction of the novel VizWiz dataset, which consists of images taken by the visually impaired and captions that have useful, task-oriented information. In an attempt to help the machine learning computer vision field realize its promise of producing technologies that have positive social impact, the curators of the VizWiz dataset host several competitions, including one for image captioning. This work details the theory and engineering from our winning submission to the 2020 captioning competition. Our work provides a step towards improved assistive image captioning systems.
    Mixed-Privacy Forgetting in Deep Networks. (arXiv:2012.13431v2 [cs.LG] UPDATED)
    (2 min) We show that the influence of a subset of the training samples can be removed -- or "forgotten" -- from the weights of a network trained on large-scale image classification tasks, and we provide strong computable bounds on the amount of remaining information after forgetting. Inspired by real-world applications of forgetting techniques, we introduce a novel notion of forgetting in mixed-privacy setting, where we know that a "core" subset of the training samples does not need to be forgotten. While this variation of the problem is conceptually simple, we show that working in this setting significantly improves the accuracy and guarantees of forgetting methods applied to vision classification tasks. Moreover, our method allows efficient removal of all information contained in non-core data by simply setting to zero a subset of the weights with minimal loss in performance. We achieve these results by replacing a standard deep network with a suitable linear approximation. With opportune changes to the network architecture and training procedure, we show that such linear approximation achieves comparable performance to the original network and that the forgetting problem becomes quadratic and can be solved efficiently even for large models. Unlike previous forgetting methods on deep networks, ours can achieve close to the state-of-the-art accuracy on large scale vision tasks. In particular, we show that our method allows forgetting without having to trade off the model accuracy.
    Domain Invariant Adversarial Learning. (arXiv:2104.00322v2 [cs.LG] UPDATED)
    (2 min) The phenomenon of adversarial examples illustrates one of the most basic vulnerabilities of deep neural networks. Among the variety of techniques introduced to surmount this inherent weakness, adversarial training has emerged as the most common and efficient strategy to achieve robustness. Typically, this is achieved by balancing robust and natural objectives. In this work, we aim to achieve better trade-off between robust and natural performances by enforcing a domain-invariant feature representation. We present a new adversarial training method, Domain Invariant Adversarial Learning (DIAL), which learns a feature representation which is both robust and domain invariant. DIAL uses a variant of Domain Adversarial Neural Network (DANN) on the natural domain and its corresponding adversarial domain. In a case where the source domain consists of natural examples and the target domain is the adversarially perturbed examples, our method learns a feature representation constrained not to discriminate between the natural and adversarial examples, and can therefore achieve a more robust representation. Our experiments indicate that our method improves both robustness and natural accuracy, when compared to current state-of-the-art adversarial training methods.
    TDA-Net: Fusion of Persistent Homology and Deep Learning Features for COVID-19 Detection in Chest X-Ray Images. (arXiv:2101.08398v2 [cs.CV] UPDATED)
    (2 min) Topological Data Analysis (TDA) has emerged recently as a robust tool to extract and compare the structure of datasets. TDA identifies features in data such as connected components and holes and assigns a quantitative measure to these features. Several studies reported that topological features extracted by TDA tools provide unique information about the data, discover new insights, and determine which feature is more related to the outcome. On the other hand, the overwhelming success of deep neural networks in learning patterns and relationships has been proven on a vast array of data applications, images in particular. To capture the characteristics of both powerful tools, we propose \textit{TDA-Net}, a novel ensemble network that fuses topological and deep features for the purpose of enhancing model generalizability and accuracy. We apply the proposed \textit{TDA-Net} to a critical application, which is the automated detection of COVID-19 from CXR images. The experimental results showed that the proposed network achieved excellent performance and suggests the applicability of our method in practice.
    Unlocking Pixels for Reinforcement Learning via Implicit Attention. (arXiv:2102.04353v3 [cs.LG] UPDATED)
    (2 min) There has recently been significant interest in training reinforcement learning (RL) agents in vision-based environments. This poses many challenges, such as high dimensionality and potential for observational overfitting through spurious correlations. A promising approach to solve both of these problems is a self-attention bottleneck, which provides a simple and effective framework for learning high performing policies, even in the presence of distractions. However, due to poor scalability of attention architectures, these methods do not scale beyond low resolution visual inputs, using large patches (thus small attention matrices). In this paper we make use of new efficient attention algorithms, recently shown to be highly effective for Transformers, and demonstrate that these new techniques can be applied in the RL setting. This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches, even individual pixels, improving generalization. In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features, leveraging the theory of angular kernels. We show theoretically and empirically that hybrid random features is a promising approach when using attention for vision-based RL.
    SiamSNN: Siamese Spiking Neural Networks for Energy-Efficient Object Tracking. (arXiv:2003.07584v3 [cs.CV] UPDATED)
    (2 min) Recently spiking neural networks (SNNs), the third-generation of neural networks has shown remarkable capabilities of energy-efficient computing, which is a promising alternative for deep neural networks (DNNs) with high energy consumption. SNNs have reached competitive results compared to DNNs in relatively simple tasks and small datasets such as image classification and MNIST/CIFAR, while few studies on more challenging vision tasks on complex datasets. In this paper, we focus on extending deep SNNs to object tracking, a more advanced vision task with embedded applications and energy-saving requirements, and present a spike-based Siamese network called SiamSNN. Specifically, we propose an optimized hybrid similarity estimation method to exploit temporal information in the SNNs, and introduce a novel two-status coding scheme to optimize the temporal distribution of output spike trains for further improvements. SiamSNN is the first deep SNN tracker that achieves short latency and low precision loss on the visual object tracking benchmarks OTB2013/2015, VOT2016/2018, and GOT-10k. Moreover, SiamSNN achieves notably low energy consumption and real-time on Neuromorphic chip TrueNorth.
    Self-Supervised Learning for Gastritis Detection with Gastric X-ray Images. (arXiv:2104.02864v2 [cs.CV] UPDATED)
    (2 min) Background and Objective: Manually annotating gastric X-ray images for gastritis detection is time-consuming and expensive because it typically requires expert knowledge. This paper proposes a self-supervised learning method to solve this problem. This study aims to verify the effectiveness of the proposed self-supervised learning method in gastritis detection using a few annotated gastric X-ray images. Methods: In this paper, we propose a novel self-supervised learning method that can perform explicit self-supervised learning and learn discriminative representations from gastric X-ray images. Models trained with the proposed method were fine-tuned on datasets with a few annotated gastric X-ray images. For comparison, several state-of-the-art self-supervised learning methods, i.e., containing SimSiam, BYOL, PIRL-jigsaw, PIRL-rotation, and SimCLR, were compared with the proposed method. Furthermore, two baseline methods, one pretrained on ImageNet and the other trained from scratch, were compared with the proposed method. Results: The proposed method's harmonic mean score of sensitivity and specificity after fine-tuning with the annotated data of 10, 20, 30, and 40 patients were 0.875, 0.911, 0.915, and 0.931, respectively. The proposed method outperformed all comparative methods, including the five state-of-the-art self-supervised learning and two baseline methods. Experimental results showed the effectiveness of the proposed method in gastritis detection with a few annotated gastric X-ray images. Conclusions: The proposed self-supervised learning method shows potential for clinical use in gastritis detection using a few annotated gastric X-ray images.
    Recent Advances in Large Margin Learning. (arXiv:2103.13598v2 [cs.LG] UPDATED)
    (2 min) This paper serves as a survey of recent advances in large margin training and its theoretical foundations, mostly for (nonlinear) deep neural networks (DNNs) that are probably the most prominent machine learning models for large-scale data in the community over the past decade. We generalize the formulation of classification margins from classical research to latest DNNs, summarize theoretical connections between the margin, network generalization, and robustness, and introduce recent efforts in enlarging the margins for DNNs comprehensively. Since the viewpoint of different methods is discrepant, we categorize them into groups for ease of comparison and discussion in the paper. Hopefully, our discussions and overview inspire new research work in the community that aim to improve the performance of DNNs, and we also point to directions where the large margin principle can be verified to provide theoretical evidence why certain regularizations for DNNs function well in practice. We managed to shorten the paper such that the crucial spirit of large margin learning and related methods are better emphasized.
    RetiNerveNet: Using Recursive Deep Learning to Estimate Pointwise 24-2 Visual Field Data based on Retinal Structure. (arXiv:2010.07488v2 [cs.LG] UPDATED)
    (2 min) Glaucoma is the leading cause of irreversible blindness in the world, affecting over 70 million people. The cumbersome Standard Automated Perimetry (SAP) test is most frequently used to detect visual loss due to glaucoma. Due to the SAP test's innate difficulty and its high test-retest variability, we propose the RetiNerveNet, a deep convolutional recursive neural network for obtaining estimates of the SAP visual field. RetiNerveNet uses information from the more objective Spectral-Domain Optical Coherence Tomography (SDOCT). RetiNerveNet attempts to trace-back the arcuate convergence of the retinal nerve fibers, starting from the Retinal Nerve Fiber Layer (RNFL) thickness around the optic disc, to estimate individual age-corrected 24-2 SAP values. Recursive passes through the proposed network sequentially yield estimates of the visual locations progressively farther from the optic disc. While all the methods used for our experiments exhibit lower performance for the advanced disease group, the proposed network is observed to be more accurate than all the baselines for estimating the individual visual field values. We further augment RetiNerveNet to additionally predict the SAP Mean Deviation values and also create an ensemble of RetiNerveNets that further improves the performance, by increasingly weighting-up underrepresented parts of the training data.
    Unconstrained Facial Action Unit Detection via Latent Feature Domain. (arXiv:1903.10143v4 [cs.CV] UPDATED)
    (2 min) Facial action unit (AU) detection in the wild is a challenging problem, due to the unconstrained variability in facial appearances and the lack of accurate annotations. Most existing methods depend on either impractical labor-intensive labeling or inaccurate pseudo labels. In this paper, we propose an end-to-end unconstrained facial AU detection framework based on domain adaptation, which transfers accurate AU labels from a constrained source domain to an unconstrained target domain by exploiting labels of AU-related facial landmarks. Specifically, we map a source image with label and a target image without label into a latent feature domain by combining source landmark-related feature with target landmark-free feature. Due to the combination of source AU-related information and target AU-free information, the latent feature domain with transferred source label can be learned by maximizing the target-domain AU detection performance. Moreover, we introduce a novel landmark adversarial loss to disentangle the landmark-free feature from the landmark-related feature by treating the adversarial learning as a multi-player minimax game. Our framework can also be naturally extended for use with target-domain pseudo AU labels. Extensive experiments show that our method soundly outperforms lower-bounds and upper-bounds of the basic model, as well as state-of-the-art approaches on the challenging in-the-wild benchmarks. The code is available at https://github.com/ZhiwenShao/ADLD.
    AINet: Association Implantation for Superpixel Segmentation. (arXiv:2101.10696v2 [cs.CV] UPDATED)
    (2 min) Recently, some approaches are proposed to harness deep convolutional networks to facilitate superpixel segmentation. The common practice is to first evenly divide the image into a pre-defined number of grids and then learn to associate each pixel with its surrounding grids. However, simply applying a series of convolution operations with limited receptive fields can only implicitly perceive the relations between the pixel and its surrounding grids. Consequently, existing methods often fail to provide an effective context when inferring the association map. To remedy this issue, we propose a novel \textbf{A}ssociation \textbf{I}mplantation (AI) module to enable the network to explicitly capture the relations between the pixel and its surrounding grids. The proposed AI module directly implants the features of grid cells to the surrounding of its corresponding central pixel, and conducts convolution on the padded window to adaptively transfer knowledge between them. With such an implantation operation, the network could explicitly harvest the pixel-grid level context, which is more in line with the target of superpixel segmentation comparing to the pixel-wise relation. Furthermore, to pursue better boundary precision, we design a boundary-perceiving loss to help the network discriminate the pixels around boundaries in hidden feature level, which could benefit the subsequent inferring modules to accurately identify more boundary pixels. Extensive experiments on BSDS500 and NYUv2 datasets show that our method could not only achieve state-of-the-art performance but maintain satisfactory inference efficiency.
    Analysis Towards Classification of Infection and Ischaemia of Diabetic Foot Ulcers. (arXiv:2104.03068v2 [cs.CV] UPDATED)
    (2 min) This paper introduces the Diabetic Foot Ulcers dataset (DFUC2021) for analysis of pathology, focusing on infection and ischaemia. We describe the data preparation of DFUC2021 for ground truth annotation, data curation and data analysis. The final release of DFUC2021 consists of 15,683 DFU patches, with 5,955 training, 5,734 for testing and 3,994 unlabeled DFU patches. The ground truth labels are four classes, i.e. control, infection, ischaemia and both conditions. We curate the dataset using image hashing techniques and analyse the separability using UMAP projection. We benchmark the performance of five key backbones of deep learning, i.e. VGG16, ResNet101, InceptionV3, DenseNet121 and EfficientNet on DFUC2021. We report the optimised results of these key backbones with different strategies. Based on our observations, we conclude that EfficientNetB0 with data augmentation and transfer learning provided the best results for multi-class (4-class) classification with macro-average Precision, Recall and F1-score of 0.57, 0.62 and 0.55, respectively. In ischaemia and infection recognition, when trained on one-versus-all, EfficientNetB0 achieved comparable results with the state of the art. Finally, we interpret the results with statistical analysis and Grad-CAM visualisation.
    Learning a Universal Template for Few-shot Dataset Generalization. (arXiv:2105.07029v2 [cs.LG] UPDATED)
    (2 min) Few-shot dataset generalization is a challenging variant of the well-studied few-shot classification problem where a diverse training set of several datasets is given, for the purpose of training an adaptable model that can then learn classes from new datasets using only a few examples. To this end, we propose to utilize the diverse training set to construct a universal template: a partial model that can define a wide array of dataset-specialized models, by plugging in appropriate components. For each new few-shot classification problem, our approach therefore only requires inferring a small number of parameters to insert into the universal template. We design a separate network that produces an initialization of those parameters for each given task, and we then fine-tune its proposed initialization via a few steps of gradient descent. Our approach is more parameter-efficient, scalable and adaptable compared to previous methods, and achieves the state-of-the-art on the challenging Meta-Dataset benchmark.
    A non-alternating graph hashing algorithm for large scale image search. (arXiv:2012.13138v2 [cs.CV] UPDATED)
    (2 min) In the era of big data, methods for improving memory and computational efficiency have become crucial for successful deployment of technologies. Hashing is one of the most effective approaches to deal with computational limitations that come with big data. One natural way for formulating this problem is spectral hashing that directly incorporates affinity to learn binary codes. However, due to binary constraints, the optimization becomes intractable. To mitigate this challenge, different relaxation approaches have been proposed to reduce the computational load of obtaining binary codes and still attain a good solution. The problem with all existing relaxation methods is resorting to one or more additional auxiliary variables to attain high quality binary codes while relaxing the problem. The existence of auxiliary variables leads to coordinate descent approach which increases the computational complexity. We argue that introducing these variables is unnecessary. To this end, we propose a novel relaxed formulation for spectral hashing that adds no additional variables to the problem. Furthermore, instead of solving the problem in original space where number of variables is equal to the data points, we solve the problem in a much smaller space and retrieve the binary codes from this solution. This trick reduces both the memory and computational complexity at the same time. We apply two optimization techniques, namely projected gradient and optimization on manifold, to obtain the solution. Using comprehensive experiments on four public datasets, we show that the proposed efficient spectral hashing (ESH) algorithm achieves highly competitive retrieval performance compared with state of the art at low complexity.
    Channel Pruning Guided by Spatial and Channel Attention for DNNs in Intelligent Edge Computing. (arXiv:2011.03891v2 [cs.CV] UPDATED)
    (2 min) Deep Neural Networks (DNNs) have achieved remarkable success in many computer vision tasks recently, but the huge number of parameters and the high computation overhead hinder their deployments on resource-constrained edge devices. It is worth noting that channel pruning is an effective approach for compressing DNN models. A critical challenge is to determine which channels are to be removed, so that the model accuracy will not be negatively affected. In this paper, we first propose Spatial and Channel Attention (SCA), a new attention module combining both spatial and channel attention that respectively focuses on "where" and "what" are the most informative parts. Guided by the scale values generated by SCA for measuring channel importance, we further propose a new channel pruning approach called Channel Pruning guided by Spatial and Channel Attention (CPSCA). Experimental results indicate that SCA achieves the best inference accuracy, while incurring negligibly extra resource consumption, compared to other state-of-the-art attention modules. Our evaluation on two benchmark datasets shows that, with the guidance of SCA, our CPSCA approach achieves higher inference accuracy than other state-of-the-art pruning methods under the same pruning ratios.
    MIA-COV19D: COVID-19 Detection through 3-D Chest CT Image Analysis. (arXiv:2106.07524v2 [eess.IV] UPDATED)
    (2 min) Early and reliable COVID-19 diagnosis based on chest 3-D CT scans can assist medical specialists in vital circumstances. Deep learning methodologies constitute a main approach for chest CT scan analysis and disease prediction. However, large annotated databases are necessary for developing deep learning models that are able to provide COVID-19 diagnosis across various medical environments in different countries. Due to privacy issues, publicly available COVID-19 CT datasets are highly difficult to obtain, which hinders the research and development of AI-enabled diagnosis methods of COVID-19 based on CT scans. In this paper we present the COV19-CT-DB database which is annotated for COVID-19, consisting of about 5,000 3-D CT scans, We have split the database in training, validation and test datasets. The former two datasets can be used for training and validation of machine learning models, while the latter will be used for evaluation of the developed models. We also present a deep learning approach, based on a CNN-RNN network and report its performance on the COVID19-CT-DB database.
    LayoutParser: A Unified Toolkit for Deep Learning Based Document Image Analysis. (arXiv:2103.15348v2 [cs.CV] UPDATED)
    (2 min) Recent advances in document image analysis (DIA) have been primarily driven by the application of neural networks. Ideally, research outcomes could be easily deployed in production and extended for further investigation. However, various factors like loosely organized codebases and sophisticated model configurations complicate the easy reuse of important innovations by a wide audience. Though there have been on-going efforts to improve reusability and simplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none of them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA is central to academic research across a wide range of disciplines in the social sciences and humanities. This paper introduces layoutparser, an open-source library for streamlining the usage of DL in DIA research and applications. The core layoutparser library comes with a set of simple and intuitive interfaces for applying and customizing DL models for layout detection, character recognition, and many other document processing tasks. To promote extensibility, layoutparser also incorporates a community platform for sharing both pre-trained models and full document digitization pipelines. We demonstrate that layoutparser is helpful for both lightweight and large-scale digitization pipelines in real-word use cases. The library is publicly available at https://layout-parser.github.io/.
    Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction. (arXiv:2103.04174v3 [cs.CV] UPDATED)
    (2 min) A video prediction model that generalizes to diverse scenes would enable intelligent agents such as robots to perform a variety of tasks via planning with the model. However, while existing video prediction models have produced promising results on small datasets, they suffer from severe underfitting when trained on large and diverse datasets. To address this underfitting challenge, we first observe that the ability to train larger video prediction models is often bottlenecked by the memory constraints of GPUs or TPUs. In parallel, deep hierarchical latent variable models can produce higher quality predictions by capturing the multi-level stochasticity of future observations, but end-to-end optimization of such models is notably difficult. Our key insight is that greedy and modular optimization of hierarchical autoencoders can simultaneously address both the memory constraints and the optimization challenges of large-scale video prediction. We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder. In comparison to state-of-the-art models, GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
    Mobile Sensing for Multipurpose Applications in Transportation. (arXiv:2106.10733v1 [cs.CV])
    (2 min) Routine and consistent data collection is required to address contemporary transportation issues.The cost of data collection increases significantly when sophisticated machines are used to collect data. Due to this constraint, State Departments of Transportation struggles to collect consistent data for analyzing and resolving transportation problems in a timely manner. Recent advancements in the sensors integrated into smartphones have resulted in a more affordable method of data collection.The primary objective of this study is to develop and implement a smartphone application for data collection.The currently designed app consists of three major modules: a frontend graphical user interface (GUI), a sensor module, and a backend module. While the frontend user interface enables interaction with the app, the sensor modules collect relevant data such as video and accelerometer readings while the app is in use. The backend, on the other hand, is made up of firebase storage, which is used to store the gathered data.In comparison to other developed apps for collecting pavement information, this current app is not overly reliant on the internet enabling the app to be used in areas of restricted internet access.The developed application was evaluated by collecting data on the i70W highway connecting Columbia, Missouri, and Kansas City, Missouri.The data was analyzed for a variety of purposes, including calculating the International Roughness Index (IRI), identifying pavement distresses, and understanding driver's behaviour and environment .The results of the application indicate that the data collected by the app is of high quality.
    A Stitching Algorithm for Automated Surface Inspection of Rotationally Symmetric Components. (arXiv:2012.00308v3 [cs.CV] UPDATED)
    (2 min) This paper provides a novel approach to stitching surface images of rotationally symmetric parts. It presents a process pipeline that uses a feature-based stitching approach to create a distortion-free and true-to-life image from a video file. The developed process thus enables, for example, condition monitoring without having to view many individual images. For validation purposes, this will be demonstrated in the paper using the concrete example of a worn ball screw drive spindle. The developed algorithm aims at reproducing the functional principle of a line scan camera system, whereby the physical measuring systems are replaced by a feature-based approach. For evaluation of the stitching algorithms, metrics are used, some of which have only been developed in this work or have been supplemented by test procedures already in use. The applicability of the developed algorithm is not only limited to machine tool spindles. Instead, the developed method allows a general approach to the surface inspection of various rotationally symmetric components and can therefore be used in a variety of industrial applications. Deep-learning-based detection Algorithms can easily be implemented to generate a complete pipeline for failure detection and condition monitoring on rotationally symmetric parts.
    More than Encoder: Introducing Transformer Decoder to Upsample. (arXiv:2106.10637v1 [cs.CV])
    (2 min) General segmentation models downsample images and then upsample to restore resolution for pixel level prediction. In such schema, upsample technique is vital in maintaining information for better performance. In this paper, we present a new upsample approach, Attention Upsample (AU), that could serve as general upsample method and be incorporated into any segmentation model that possesses lateral connections. AU leverages pixel-level attention to model long range dependency and global information for better reconstruction. It consists of Attention Decoder (AD) and bilinear upsample as residual connection to complement the upsampled features. AD adopts the idea of decoder from transformer which upsamples features conditioned on local and detailed information from contracting path. Moreover, considering the extensive memory and computation cost of pixel-level attention, we further propose to use window attention scheme to restrict attention computation in local windows instead of global range. Incorporating window attention, we denote our decoder as Window Attention Decoder (WAD) and our upsample method as Window Attention Upsample (WAU). We test our method on classic U-Net structure with lateral connection to deliver information from contracting path and achieve state-of-the-arts performance on Synapse (80.30 DSC and 23.12 HD) and MSD Brain (74.75 DSC) datasets.
    GAN Inversion: A Survey. (arXiv:2101.05278v3 [cs.CV] UPDATED)
    (2 min) GAN inversion aims to invert a given image back into the latent space of a pretrained GAN model, for the image to be faithfully reconstructed from the inverted code by the generator. As an emerging technique to bridge the real and fake image domains, GAN inversion plays an essential role in enabling the pretrained GAN models such as StyleGAN and BigGAN to be used for real image editing applications. Meanwhile, GAN inversion also provides insights on the interpretation of GAN's latent space and how the realistic images can be generated. In this paper, we provide an overview of GAN inversion with a focus on its recent algorithms and applications. We cover important techniques of GAN inversion and their applications to image restoration and image manipulation. We further elaborate on some trends and challenges for future directions.
    Intriguing Properties of Contrastive Losses. (arXiv:2011.02803v2 [cs.LG] UPDATED)
    (2 min) Contrastive loss and its variants have become very popular recently for learning visual representations without supervision. In this work, we study three intriguing properties of contrastive learning. We first generalize the standard contrastive loss to a broader family of losses, and we find that various instantiations of the generalized loss perform similarly under the presence of a multi-layer non-linear projection head. We then study if instance-based contrastive learning (such as in SimCLR, MoCo, BYOL, and so on, which are based on global image representation) can learn well on images with multiple objects present. We find that meaningful hierarchical local features can be learned despite the fact that these objectives operate on global instance-level features. Finally, we study an intriguing phenomenon of feature suppression among competing features shared across augmented views, such as "color distribution" vs "object class". We construct datasets with explicit and controllable competing features, and show that, for contrastive learning, a few bits of easy-to-learn shared features can suppress, and even fully prevent, the learning of other sets of competing features. In scenarios where there are multiple objects in an image, the dominant object would suppress the learning of smaller objects. Existing contrastive learning methods critically rely on data augmentation to favor certain sets of features over others, and face potential limitation for scenarios where existing augmentations cannot fully address the feature suppression. This poses open challenges to existing contrastive learning techniques.
    Measuring breathing induced oesophageal motion and its dosimetric impact. (arXiv:2010.09391v3 [physics.med-ph] UPDATED)
    (2 min) Stereotactic body radiation therapy allows for a precise and accurate dose delivery. Organ motion during treatment bears the risk of undetected high dose healthy tissue exposure. An organ very susceptible to high dose is the oesophagus. Its low contrast on CT and the oblong shape renders motion estimation difficult. We tackle this issue by modern algorithms to measure the oesophageal motion voxel-wise and to estimate motion related dosimetric impact. Oesophageal motion was measured using deformable image registration and 4DCT of 11 internal and 5 public datasets. Current clinical practice of contouring the organ on 3DCT was compared to timely resolved 4DCT contours. The dosimetric impact of the motion was estimated by analysing the trajectory of each voxel in the 4D dose distribution. Finally an organ motion model was built, allowing for easier patient-wise comparisons. Motion analysis showed mean absolute maximal motion amplitudes of 4.55 +/- 1.81 mm left-right, 5.29 +/- 2.67 mm anterior-posterior and 10.78 +/- 5.30 mm superior-inferior. Motion between the cohorts differed significantly. In around 50 % of the cases the dosimetric passing criteria was violated. Contours created on 3DCT did not cover 14 % of the organ for 50 % of the respiratory cycle and the 3D contour is around 38 % smaller than the union of all 4D contours. The motion model revealed that the maximal motion is not limited to the lower part of the organ. Our results showed motion amplitudes higher than most reported values in the literature and that motion is very heterogeneous across patients. Therefore, individual motion information should be considered in contouring and planning.
    Deep Evaluation Metric: Learning to Evaluate Simulated Radar Point Clouds for Virtual Testing of Autonomous Driving. (arXiv:2104.06772v2 [cs.CV] UPDATED)
    (2 min) The usage of environment sensor models for virtual testing is a promising approach to reduce the testing effort of autonomous driving. However, in order to deduce any statements regarding the performance of an autonomous driving function based on simulation, the sensor model has to be validated to determine the discrepancy between the synthetic and real sensor data. Since a certain degree of divergence can be assumed to exist, the sufficient level of fidelity must be determined, which poses a major challenge. In particular, a method for quantifying the fidelity of a sensor model does not exist and the problem of defining an appropriate metric remains. In this work, we train a neural network to distinguish real and simulated radar sensor data with the purpose of learning the latent features of real radar point clouds. Furthermore, we propose the classifier's confidence score for the `real radar point cloud' class as a metric to determine the degree of fidelity of synthetically generated radar data. The presented approach is evaluated and it can be demonstrated that the proposed deep evaluation metric outperforms conventional metrics in terms of its capability to identify characteristic differences between real and simulated radar data.
    Structured Sparse R-CNN for Direct Scene Graph Generation. (arXiv:2106.10815v1 [cs.CV])
    (2 min) Scene graph generation (SGG) is to detect entity pairs with their relations in an image. Existing SGG approaches often use multi-stage pipelines to decompose this task into object detection, relation graph construction, and dense or dense-to-sparse relation prediction. Instead, from a perspective on SGG as a direct set prediction, this paper presents a simple, sparse, and unified framework for relation detection, termed as Structured Sparse R-CNN. The key to our method is a set of learnable triplet queries and structured triplet detectors which could be jointly optimized from the training set in an end-to-end manner. Specifically, the triplet queries encode the general prior for entity pair locations, categories, and their relations, and provide an initial guess of relation detection for subsequent refinement. The triplet detector presents a cascaded dynamic head design to progressively refine the results of relation detection. In addition, to relieve the training difficulty of Structured Sparse R-CNN, we propose a relaxed and enhanced training strategy based on knowledge distillation from a Siamese Sparse R-CNN. We also propose adaptive focusing parameter and average logit approach for imbalance data distribution. We perform experiments on two benchmarks: Visual Genome and Open Images, and the results demonstrate that our method achieves the state-of-the-art performance. Meanwhile, we perform in-depth ablation studies to provide insights on our structured modeling in triplet detector design and training strategies.
    Solution for Large-scale Long-tailed Recognition with Noisy Labels. (arXiv:2106.10683v1 [cs.CV])
    (2 min) This is a technical report for CVPR 2021 AliProducts Challenge. AliProducts Challenge is a competition proposed for studying the large-scale and fine-grained commodity image recognition problem encountered by worldleading ecommerce companies. The large-scale product recognition simultaneously meets the challenge of noisy annotations, imbalanced (long-tailed) data distribution and fine-grained classification. In our solution, we adopt stateof-the-art model architectures of both CNNs and Transformer, including ResNeSt, EfficientNetV2, and DeiT. We found that iterative data cleaning, classifier weight normalization, high-resolution finetuning, and test time augmentation are key components to improve the performance of training with the noisy and imbalanced dataset. Finally, we obtain 6.4365% mean class error rate in the leaderboard with our ensemble model.
    Practical Assessment of Generalization Performance Robustness for Deep Networks via Contrastive Examples. (arXiv:2106.10653v1 [cs.LG])
    (2 min) Training images with data transformations have been suggested as contrastive examples to complement the testing set for generalization performance evaluation of deep neural networks (DNNs). In this work, we propose a practical framework ContRE (The word "contre" means "against" or "versus" in French.) that uses Contrastive examples for DNN geneRalization performance Estimation. Specifically, ContRE follows the assumption in contrastive learning that robust DNN models with good generalization performance are capable of extracting a consistent set of features and making consistent predictions from the same image under varying data transformations. Incorporating with a set of randomized strategies for well-designed data transformations over the training set, ContRE adopts classification errors and Fisher ratios on the generated contrastive examples to assess and analyze the generalization performance of deep models in complement with a testing set. To show the effectiveness and the efficiency of ContRE, extensive experiments have been done using various DNN models on three open source benchmark datasets with thorough ablation studies and applicability analyses. Our experiment results confirm that (1) behaviors of deep models on contrastive examples are strongly correlated to what on the testing set, and (2) ContRE is a robust measure of generalization performance complementing to the testing set in various settings.
    Image Segmentation, Compression and Reconstruction from Edge Distribution Estimation with Random Field and Random Cluster Theories. (arXiv:2104.10762v7 [eess.IV] UPDATED)
    (2 min) Random field and random cluster theory are used to prove certain mathematical results concerning the probability distribution of image pixel intensities characterized as generic $2D$ integer arrays. The size of the smallest bounded region within an image is estimated for segmenting an image, from which, the equilibrium distribution of intensities can be recovered. From the estimated bounded regions, properties of the sub-optimal and equilibrium distributions of intensities are derived, which leads to an image compression methodology whereby only slightly more than half of all pixels are required for a worst-case reconstruction of the original image. An example in unsupervised object detection illustrates the mathematical results.
    FloorPP-Net: Reconstructing Floor Plans using Point Pillars for Scan-to-BIM. (arXiv:2106.10635v1 [cs.CV])
    (2 min) This paper presents a deep learning-based point cloud processing method named FloorPP-Net for the task of Scan-to-BIM (building information model). FloorPP-Net first converts the input point cloud of a building story into point pillars (PP), then predicts the corners and edges to output the floor plan. Altogether, FloorPP-Net establishes an end-to-end supervised learning framework for the Scan-to-Floor-Plan (Scan2FP) task. In the 1st International Scan-to-BIM Challenge held in conjunction with CVPR 2021, FloorPP-Net was ranked the second runner-up in the floor plan reconstruction track. Future work includes general edge proposals, 2D plan regularization, and 3D BIM reconstruction.
    3D Object Detection with Pointformer. (arXiv:2012.11409v2 [cs.CV] UPDATED)
    (2 min) Feature learning for 3D object detection from point clouds is very challenging due to the irregularity of 3D point cloud data. In this paper, we propose Pointformer, a Transformer backbone designed for 3D point clouds to learn features effectively. Specifically, a Local Transformer module is employed to model interactions among points in a local region, which learns context-dependent region features at an object level. A Global Transformer is designed to learn context-aware representations at the scene level. To further capture the dependencies among multi-scale representations, we propose Local-Global Transformer to integrate local features with global features from higher resolution. In addition, we introduce an efficient coordinate refinement module to shift down-sampled points closer to object centroids, which improves object proposal generation. We use Pointformer as the backbone for state-of-the-art object detection models and demonstrate significant improvements over original models on both indoor and outdoor datasets.
    Adversarial Distortion for Learned Video Compression. (arXiv:2004.09508v3 [eess.IV] UPDATED)
    (2 min) In this paper, we present a novel adversarial lossy video compression model. At extremely low bit-rates, standard video coding schemes suffer from unpleasant reconstruction artifacts such as blocking, ringing etc. Existing learned neural approaches to video compression have achieved reasonable success on reducing the bit-rate for efficient transmission and reduce the impact of artifacts to an extent. However, they still tend to produce blurred results under extreme compression. In this paper, we present a deep adversarial learned video compression model that minimizes an auxiliary adversarial distortion objective. We find this adversarial objective to correlate better with human perceptual quality judgement relative to traditional quality metrics such as MS-SSIM and PSNR. Our experiments using a state-of-the-art learned video compression system demonstrate a reduction of perceptual artifacts and reconstruction of detail lost especially under extremely high compression.
    Underwater Image Restoration via Contrastive Learning and a Real-world Dataset. (arXiv:2106.10718v1 [eess.IV])
    (2 min) Underwater image restoration is of significant importance in unveiling the underwater world. Numerous techniques and algorithms have been developed in the past decades. However, due to fundamental difficulties associated with imaging/sensing, lighting, and refractive geometric distortions, in capturing clear underwater images, no comprehensive evaluations have been conducted of underwater image restoration. To address this gap, we have constructed a large-scale real underwater image dataset, dubbed `HICRD' (Heron Island Coral Reef Dataset), for the purpose of benchmarking existing methods and supporting the development of new deep-learning based methods. We employ accurate water parameter (diffuse attenuation coefficient) in generating reference images. There are 2000 reference restored images and 6003 original underwater images in the unpaired training set. Further, we present a novel method for underwater image restoration based on unsupervised image-to-image translation framework. Our proposed method leveraged contrastive learning and generative adversarial networks to maximize the mutual information between raw and restored images. Extensive experiments with comparisons to recent approaches further demonstrate the superiority of our proposed method. Our code and dataset are publicly available at GitHub.
    Attack to Fool and Explain Deep Networks. (arXiv:2106.10606v1 [cs.CV])
    (2 min) Deep visual models are susceptible to adversarial perturbations to inputs. Although these signals are carefully crafted, they still appear noise-like patterns to humans. This observation has led to the argument that deep visual representation is misaligned with human perception. We counter-argue by providing evidence of human-meaningful patterns in adversarial perturbations. We first propose an attack that fools a network to confuse a whole category of objects (source class) with a target label. Our attack also limits the unintended fooling by samples from non-sources classes, thereby circumscribing human-defined semantic notions for network fooling. We show that the proposed attack not only leads to the emergence of regular geometric patterns in the perturbations, but also reveals insightful information about the decision boundaries of deep models. Exploring this phenomenon further, we alter the `adversarial' objective of our attack to use it as a tool to `explain' deep visual representation. We show that by careful channeling and projection of the perturbations computed by our method, we can visualize a model's understanding of human-defined semantic notions. Finally, we exploit the explanability properties of our perturbations to perform image generation, inpainting and interactive image manipulation by attacking adversarialy robust `classifiers'.In all, our major contribution is a novel pragmatic adversarial attack that is subsequently transformed into a tool to interpret the visual models. The article also makes secondary contributions in terms of establishing the utility of our attack beyond the adversarial objective with multiple interesting applications.
    Piano Skills Assessment. (arXiv:2101.04884v2 [cs.CV] UPDATED)
    (2 min) Can a computer determine a piano player's skill level? Is it preferable to base this assessment on visual analysis of the player's performance or should we trust our ears over our eyes? Since current CNNs have difficulty processing long video videos, how can shorter clips be sampled to best reflect the players skill level? In this work, we collect and release a first-of-its-kind dataset for multimodal skill assessment focusing on assessing piano player's skill level, answer the asked questions, initiate work in automated evaluation of piano playing skills and provide baselines for future work. Dataset is available from: https://github.com/ParitoshParmar/Piano-Skills-Assessment.
    Cross-Modal learning for Audio-Visual Video Parsing. (arXiv:2104.04598v2 [cs.SD] UPDATED)
    (2 min) In this paper, we present a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities. The proposed parsing approach simultaneously detects the temporal boundaries in terms of start and end times of such events. We show how AVVP can benefit from the following techniques geared towards effective cross-modal learning: (i) adversarial training and skip connections (ii) global context aware attention and, (iii) self-supervised pretraining using an audio-video grounding objective to obtain cross-modal audio-video representations. We present extensive experimental evaluations on the Look, Listen, and Parse (LLP) dataset and show that we outperform the state-of-the-art Hybrid Attention Network (HAN) on all five metrics proposed for AVVP. We also present several ablations to validate the effect of pretraining, global attention and adversarial training.
    Augmented 2D-TAN: A Two-stage Approach for Human-centric Spatio-Temporal Video Grounding. (arXiv:2106.10634v1 [cs.CV])
    (2 min) We propose an effective two-stage approach to tackle the problem of language-based Human-centric Spatio-Temporal Video Grounding (HC-STVG) task. In the first stage, we propose an Augmented 2D Temporal Adjacent Network (Augmented 2D-TAN) to temporally ground the target moment corresponding to the given description. Primarily, we improve the original 2D-TAN from two aspects: First, a temporal context-aware Bi-LSTM Aggregation Module is developed to aggregate clip-level representations, replacing the original max-pooling. Second, we propose to employ Random Concatenation Augmentation (RCA) mechanism during the training phase. In the second stage, we use pretrained MDETR model to generate per-frame bounding boxes via language query, and design a set of hand-crafted rules to select the best matching bounding box outputted by MDETR for each frame within the grounded moment.
    Meta Faster R-CNN: Towards Accurate Few-Shot Object Detection with Attentive Feature Alignment. (arXiv:2104.07719v2 [cs.CV] UPDATED)
    (2 min) Few-shot object detection (FSOD) aims to detect objects using only few examples. It's critically needed for many practical applications but so far remains challenging. We propose a meta-learning based few-shot object detection method by transferring meta-knowledge learned from data-abundant base classes to data-scarce novel classes. Our method incorporates a coarse-to-fine approach into the proposal based object detection framework and integrates prototype based classifiers into both the proposal generation and classification stages. To improve proposal generation for few-shot novel classes, we propose to learn a lightweight matching network to measure the similarity between each spatial position in the query image feature map and spatially-pooled class features, instead of the traditional object/nonobject classifier, thus generating category-specific proposals and improving proposal recall for novel classes. To address the spatial misalignment between generated proposals and few-shot class examples, we propose a novel attentive feature alignment method, thus improving the performance of few-shot object detection. Meanwhile we jointly learn a Faster R-CNN detection head for base classes. Extensive experiments conducted on multiple FSOD benchmarks show our proposed approach achieves state of the art results under (incremental) few-shot learning settings.
    Do Input Gradients Highlight Discriminative Features?. (arXiv:2102.12781v2 [cs.LG] UPDATED)
    (2 min) Post-hoc gradient-based interpretability methods [Simonyan et al., 2013, Smilkov et al., 2017] that provide instance-specific explanations of model predictions are often based on assumption (A): magnitude of input gradients -- gradients of logits with respect to input -- noisily highlight discriminative task-relevant features. In this work, we test the validity of assumption (A) using a three-pronged approach. First, we develop an evaluation framework, DiffROAR, to test assumption (A) on four image classification benchmarks. Our results suggest that (i) input gradients of standard models (i.e., trained on original data) may grossly violate (A), whereas (ii) input gradients of adversarially robust models satisfy (A). Second, we then introduce BlockMNIST, an MNIST-based semi-real dataset, that by design encodes a priori knowledge of discriminative features. Our analysis on BlockMNIST leverages this information to validate as well as characterize differences between input gradient attributions of standard and robust models. Finally, we theoretically prove that our empirical findings hold on a simplified version of the BlockMNIST dataset. Specifically, we prove that input gradients of standard one-hidden-layer MLPs trained on this dataset do not highlight instance-specific signal coordinates, thus grossly violating assumption (A). Our findings motivate the need to formalize and test common assumptions in interpretability in a falsifiable manner [Leavitt and Morcos, 2020]. Additionally, we believe that the DiffROAR evaluation framework and BlockMNIST-based datasets can serve as sanity checks to audit instance-specific interpretability methods.
    Supervised learning for crop/weed classification based on color and texture features. (arXiv:2106.10581v1 [cs.CV])
    (2 min) Computer vision techniques have attracted a great interest in precision agriculture, recently. The common goal of all computer vision-based precision agriculture tasks is to detect the objects of interest (e.g., crop, weed) and discriminating them from the background. The Weeds are unwanted plants growing among crops competing for nutrients, water, and sunlight, causing losses to crop yields. Weed detection and mapping is critical for site-specific weed management to reduce the cost of labor and impact of herbicides. This paper investigates the use of color and texture features for discrimination of Soybean crops and weeds. Feature extraction methods including two color spaces (RGB, HSV), gray level Co-occurrence matrix (GLCM), and Local Binary Pattern (LBP) are used to train the Support Vector Machine (SVM) classifier. The experiment was carried out on image dataset of soybean crop, obtained from an unmanned aerial vehicle (UAV), which is publicly available. The results from the experiment showed that the highest accuracy (above 96%) was obtained from the combination of color and LBP features.
    Active Learning for Deep Neural Networks on Edge Devices. (arXiv:2106.10836v1 [cs.LG])
    (2 min) When dealing with deep neural network (DNN) applications on edge devices, continuously updating the model is important. Although updating a model with real incoming data is ideal, using all of them is not always feasible due to limits, such as labeling and communication costs. Thus, it is necessary to filter and select the data to use for training (i.e., active learning) on the device. In this paper, we formalize a practical active learning problem for DNNs on edge devices and propose a general task-agnostic framework to tackle this problem, which reduces it to a stream submodular maximization. This framework is light enough to be run with low computational resources, yet provides solutions whose quality is theoretically guaranteed thanks to the submodular property. Through this framework, we can configure data selection criteria flexibly, including using methods proposed in previous active learning studies. We evaluate our approach on both classification and object detection tasks in a practical setting to simulate a real-life scenario. The results of our study show that the proposed framework outperforms all other methods in both tasks, while running at a practical speed on real devices.
    TGRNet: A Table Graph Reconstruction Network for Table Structure Recognition. (arXiv:2106.10598v1 [cs.CV])
    (2 min) A table arranging data in rows and columns is a very effective data structure, which has been widely used in business and scientific research. Considering large-scale tabular data in online and offline documents, automatic table recognition has attracted increasing attention from the document analysis community. Though human can easily understand the structure of tables, it remains a challenge for machines to understand that, especially due to a variety of different table layouts and styles. Existing methods usually model a table as either the markup sequence or the adjacency matrix between different table cells, failing to address the importance of the logical location of table cells, e.g., a cell is located in the first row and the second column of the table. In this paper, we reformulate the problem of table structure recognition as the table graph reconstruction, and propose an end-to-end trainable table graph reconstruction network (TGRNet) for table structure recognition. Specifically, the proposed method has two main branches, a cell detection branch and a cell logical location branch, to jointly predict the spatial location and the logical location of different cells. Experimental results on three popular table recognition datasets and a new dataset with table graph annotations (TableGraph-350K) demonstrate the effectiveness of the proposed TGRNet for table structure recognition. Code and annotations will be made publicly available.
    Artificial Intelligence in the Creative Industries: A Review. (arXiv:2007.12391v5 [cs.CV] UPDATED)
    (2 min) This paper reviews the current state of the art in Artificial Intelligence (AI) technologies and applications in the context of the creative industries. A brief background of AI, and specifically Machine Learning (ML) algorithms, is provided including Convolutional Neural Network (CNNs), Generative Adversarial Networks (GANs), Recurrent Neural Networks (RNNs) and Deep Reinforcement Learning (DRL). We categorise creative applications into five groups related to how AI technologies are used: i) content creation, ii) information analysis, iii) content enhancement and post production workflows, iv) information extraction and enhancement, and v) data compression. We critically examine the successes and limitations of this rapidly advancing technology in each of these areas. We further differentiate between the use of AI as a creative tool and its potential as a creator in its own right. We foresee that, in the near future, machine learning-based AI will be adopted widely as a tool or collaborative assistant for creativity. In contrast, we observe that the successes of machine learning in domains with fewer constraints, where AI is the `creator', remain modest. The potential of AI (or its developers) to win awards for its original creations in competition with human creatives is also limited, based on contemporary technologies. We therefore conclude that, in the context of creative industries, maximum benefit from AI will be derived where its focus is human centric -- where it is designed to augment, rather than replace, human creativity.
    GLIB: Towards Automated Test Oracle for Graphically-Rich Applications. (arXiv:2106.10507v1 [cs.SE])
    (2 min) Graphically-rich applications such as games are ubiquitous with attractive visual effects of Graphical User Interface (GUI) that offers a bridge between software applications and end-users. However, various types of graphical glitches may arise from such GUI complexity and have become one of the main component of software compatibility issues. Our study on bug reports from game development teams in NetEase Inc. indicates that graphical glitches frequently occur during the GUI rendering and severely degrade the quality of graphically-rich applications such as video games. Existing automated testing techniques for such applications focus mainly on generating various GUI test sequences and check whether the test sequences can cause crashes. These techniques require constant human attention to captures non-crashing bugs such as bugs causing graphical glitches. In this paper, we present the first step in automating the test oracle for detecting non-crashing bugs in graphically-rich applications. Specifically, we propose \texttt{GLIB} based on a code-based data augmentation technique to detect game GUI glitches. We perform an evaluation of \texttt{GLIB} on 20 real-world game apps (with bug reports available) and the result shows that \texttt{GLIB} can achieve 100\% precision and 99.5\% recall in detecting non-crashing bugs such as game GUI glitches. Practical application of \texttt{GLIB} on another 14 real-world games (without bug reports) further demonstrates that \texttt{GLIB} can effectively uncover GUI glitches, with 48 of 53 bugs reported by \texttt{GLIB} having been confirmed and fixed so far.
    Plant Disease Detection Using Image Processing and Machine Learning. (arXiv:2106.10698v1 [cs.CV])
    (2 min) One of the important and tedious task in agricultural practices is the detection of the disease on crops. It requires huge time as well as skilled labor. This paper proposes a smart and efficient technique for detection of crop disease which uses computer vision and machine learning techniques. The proposed system is able to detect 20 different diseases of 5 common plants with 93% accuracy.
    Quality-Aware Memory Network for Interactive Volumetric Image Segmentation. (arXiv:2106.10686v1 [cs.CV])
    (2 min) Despite recent progress of automatic medical image segmentation techniques, fully automatic results usually fail to meet the clinical use and typically require further refinement. In this work, we propose a quality-aware memory network for interactive segmentation of 3D medical images. Provided by user guidance on an arbitrary slice, an interaction network is firstly employed to obtain an initial 2D segmentation. The quality-aware memory network subsequently propagates the initial segmentation estimation bidirectionally over the entire volume. Subsequent refinement based on additional user guidance on other slices can be incorporated in the same manner. To further facilitate interactive segmentation, a quality assessment module is introduced to suggest the next slice to segment based on the current segmentation quality of each slice. The proposed network has two appealing characteristics: 1) The memory-augmented network offers the ability to quickly encode past segmentation information, which will be retrieved for the segmentation of other slices; 2) The quality assessment module enables the model to directly estimate the qualities of segmentation predictions, which allows an active learning paradigm where users preferentially label the lowest-quality slice for multi-round refinement. The proposed network leads to a robust interactive segmentation engine, which can generalize well to various types of user annotations (e.g., scribbles, boxes). Experimental results on various medical datasets demonstrate the superiority of our approach in comparison with existing techniques.
    Robust Representation Learning with Feedback for Single Image Deraining. (arXiv:2101.12463v3 [eess.IV] UPDATED)
    (2 min) A deraining network can be interpreted as a conditional generator that aims at removing rain streaks from image. Most existing image deraining methods ignore model errors caused by uncertainty that reduces embedding quality. Unlike existing image deraining methods that embed low-quality features into the model directly, we replace low-quality features by latent high-quality features. The spirit of closed-loop feedback in the automatic control field is borrowed to obtain latent high-quality features. A new method for error detection and feature compensation is proposed to address model errors. Extensive experiments on benchmark datasets as well as specific real datasets demonstrate that the proposed method outperforms recent state-of-the-art methods. Code is available at: \\ https://github.com/LI-Hao-SJTU/DerainRLNet
    LEGAN: Disentangled Manipulation of Directional Lighting and Facial Expressions by Leveraging Human Perceptual Judgements. (arXiv:2010.01464v3 [cs.CV] UPDATED)
    (2 min) Building facial analysis systems that generalize to extreme variations in lighting and facial expressions is a challenging problem that can potentially be alleviated using natural-looking synthetic data. Towards that, we propose LEGAN, a novel synthesis framework that leverages perceptual quality judgments for jointly manipulating lighting and expressions in face images, without requiring paired training data. LEGAN disentangles the lighting and expression subspaces and performs transformations in the feature space before upscaling to the desired output image. The fidelity of the synthetic image is further refined by integrating a perceptual quality estimation model, trained with face images rendered using multiple synthesis methods and their crowd-sourced naturalness ratings, into the LEGAN framework as an auxiliary discriminator. Using objective metrics like FID and LPIPS, LEGAN is shown to generate higher quality face images when compared with popular GAN models like StarGAN and StarGAN-v2 for lighting and expression synthesis. We also conduct a perceptual study using images synthesized by LEGAN and other GAN models and show the correlation between our quality estimation and visual fidelity. Finally, we demonstrate the effectiveness of LEGAN as training data augmenter for expression recognition and face verification tasks.
    VQA-Aid: Visual Question Answering for Post-Disaster Damage Assessment and Analysis. (arXiv:2106.10548v1 [cs.CV])
    (2 min) Visual Question Answering system integrated with Unmanned Aerial Vehicle (UAV) has a lot of potentials to advance the post-disaster damage assessment purpose. Providing assistance to affected areas is highly dependent on real-time data assessment and analysis. Scope of the Visual Question Answering is to understand the scene and provide query related answer which certainly faster the recovery process after any disaster. In this work, we address the importance of \textit{visual question answering (VQA)} task for post-disaster damage assessment by presenting our recently developed VQA dataset called \textit{HurMic-VQA} collected during hurricane Michael, and comparing the performances of baseline VQA models.
    CAMERAS: Enhanced Resolution And Sanity preserving Class Activation Mapping for image saliency. (arXiv:2106.10649v1 [cs.CV])
    (2 min) Backpropagation image saliency aims at explaining model predictions by estimating model-centric importance of individual pixels in the input. However, class-insensitivity of the earlier layers in a network only allows saliency computation with low resolution activation maps of the deeper layers, resulting in compromised image saliency. Remedifying this can lead to sanity failures. We propose CAMERAS, a technique to compute high-fidelity backpropagation saliency maps without requiring any external priors and preserving the map sanity. Our method systematically performs multi-scale accumulation and fusion of the activation maps and backpropagated gradients to compute precise saliency maps. From accurate image saliency to articulation of relative importance of input features for different models, and precise discrimination between model perception of visually similar objects, our high-resolution mapping offers multiple novel insights into the black-box deep visual models, which are presented in the paper. We also demonstrate the utility of our saliency maps in adversarial setup by drastically reducing the norm of attack signals by focusing them on the precise regions identified by our maps. Our method also inspires new evaluation metrics and a sanity check for this developing research direction. Code is available here https://github.com/VisMIL/CAMERAS
    Humble Teachers Teach Better Students for Semi-Supervised Object Detection. (arXiv:2106.10456v1 [cs.CV])
    (2 min) We propose a semi-supervised approach for contemporary object detectors following the teacher-student dual model framework. Our method is featured with 1) the exponential moving averaging strategy to update the teacher from the student online, 2) using plenty of region proposals and soft pseudo-labels as the student's training targets, and 3) a light-weighted detection-specific data ensemble for the teacher to generate more reliable pseudo-labels. Compared to the recent state-of-the-art -- STAC, which uses hard labels on sparsely selected hard pseudo samples, the teacher in our model exposes richer information to the student with soft-labels on many proposals. Our model achieves COCO-style AP of 53.04% on VOC07 val set, 8.4% better than STAC, when using VOC12 as unlabeled data. On MS-COCO, it outperforms prior work when only a small percentage of data is taken as labeled. It also reaches 53.8% AP on MS-COCO test-dev with 3.1% gain over the fully supervised ResNet-152 Cascaded R-CNN, by tapping into unlabeled data of a similar size to the labeled data.
    Video Summarization through Reinforcement Learning with a 3D Spatio-Temporal U-Net. (arXiv:2106.10528v1 [cs.CV])
    (2 min) Intelligent video summarization algorithms allow to quickly convey the most relevant information in videos through the identification of the most essential and explanatory content while removing redundant video frames. In this paper, we introduce the 3DST-UNet-RL framework for video summarization. A 3D spatio-temporal U-Net is used to efficiently encode spatio-temporal information of the input videos for downstream reinforcement learning (RL). An RL agent learns from spatio-temporal latent scores and predicts actions for keeping or rejecting a video frame in a video summary. We investigate if real/inflated 3D spatio-temporal CNN features are better suited to learn representations from videos than commonly used 2D image features. Our framework can operate in both, a fully unsupervised mode and a supervised training mode. We analyse the impact of prescribed summary lengths and show experimental evidence for the effectiveness of 3DST-UNet-RL on two commonly used general video summarization benchmarks. We also applied our method on a medical video summarization task. The proposed video summarization method has the potential to save storage costs of ultrasound screening videos as well as to increase efficiency when browsing patient video data during retrospective analysis or audit without loosing essential information
    Reversible Colour Density Compression of Images using cGANs. (arXiv:2106.10542v1 [eess.IV])
    (2 min) Image compression using colour densities is historically impractical to decompress losslessly. We examine the use of conditional generative adversarial networks in making this transformation more feasible, through learning a mapping between the images and a loss function to train on. We show that this method is effective at producing visually lossless generations, indicating that efficient colour compression is viable.
    Practical Transferability Estimation for Image Classification Tasks. (arXiv:2106.10479v1 [cs.CV])
    (2 min) Transferability estimation is an essential problem in transfer learning to predict how good the performance is when transfer a source model (source task) to a target task. Recent analytical transferability metrics have been widely used for source model selection and multi-task learning. Earlier metrics does not work sufficiently well under the challenging cross-domain cross-task transfer settings, but recent OTCE score achieves a noteworthy performance using auxiliary tasks. A simplified version named OT-based NCE score sacrifices accuracy to be more efficient, but it can be further improved. Consequently, we propose a practical transferability metric called JC-NCE score to further improve the cross-domain cross-task transferability estimation performance, which is more efficient than the OTCE score and more accurate than the OT-based NCE score. Specifically, we build the joint correspondences between source and target data via solving an optimal transport problem with considering both the sample distance and label distance, and then compute the transferability score as the negative conditional entropy. Extensive validations under the intra-dataset and inter-dataset transfer settings demonstrate that our JC-NCE score outperforms the OT-based NCE score with about 7% and 12% gains, respectively.
    Implementing a Detection System for COVID-19 based on Lung Ultrasound Imaging and Deep Learning. (arXiv:2106.10651v1 [eess.IV])
    (2 min) The COVID-19 pandemic started in China in December 2019 and quickly spread to several countries. The consequences of this pandemic are incalculable, causing the death of millions of people and damaging the global economy. To achieve large-scale control of this pandemic, fast tools for detection and treatment of patients are needed. Thus, the demand for alternative tools for the diagnosis of COVID-19 has increased dramatically since accurated and automated tools are not available. In this paper we present the ongoing work on a system for COVID-19 detection using ultrasound imaging and using Deep Learning techniques. Furthermore, such a system is implemented on a Raspberry Pi to make it portable and easy to use in remote regions without an Internet connection.
    Spatial Contrastive Learning for Few-Shot Classification. (arXiv:2012.13831v3 [cs.CV] UPDATED)
    (2 min) In this paper, we explore contrastive learning for few-shot classification, in which we propose to use it as an additional auxiliary training objective acting as a data-dependent regularizer to promote more general and transferable features. In particular, we present a novel attention-based spatial contrastive objective to learn locally discriminative and class-agnostic features. As a result, our approach overcomes some of the limitations of the cross-entropy loss, such as its excessive discrimination towards seen classes, which reduces the transferability of features to unseen classes. With extensive experiments, we show that the proposed method outperforms state-of-the-art approaches, confirming the importance of learning good and transferable embeddings for few-shot learning.
    Learning to Track Object Position through Occlusion. (arXiv:2106.10766v1 [cs.CV])
    (2 min) Occlusion is one of the most significant challenges encountered by object detectors and trackers. While both object detection and tracking has received a lot of attention in the past, most existing methods in this domain do not target detecting or tracking objects when they are occluded. However, being able to detect or track an object of interest through occlusion has been a long standing challenge for different autonomous tasks. Traditional methods that employ visual object trackers with explicit occlusion modeling experience drift and make several fundamental assumptions about the data. We propose to address this with a `tracking-by-detection` approach that builds upon the success of region based video object detectors. Our video level object detector uses a novel recurrent computational unit at its core that enables long term propagation of object features even under occlusion. Finally, we compare our approach with existing state-of-the-art video object detectors and show that our approach achieves superior results on a dataset of furniture assembly videos collected from the internet, where small objects like screws, nuts, and bolts often get occluded from the camera viewpoint.
    NeuS: Learning Neural Implicit Surfaces by Volume Rendering for Multi-view Reconstruction. (arXiv:2106.10689v1 [cs.CV])
    (2 min) We present a novel neural surface reconstruction method, called NeuS, for reconstructing objects and scenes with high fidelity from 2D image inputs. Existing neural surface reconstruction approaches, such as DVR and IDR, require foreground mask as supervision, easily get trapped in local minima, and therefore struggle with the reconstruction of objects with severe self-occlusion or thin structures. Meanwhile, recent neural methods for novel view synthesis, such as NeRF and its variants, use volume rendering to produce a neural scene representation with robustness of optimization, even for highly complex objects. However, extracting high-quality surfaces from this learned implicit representation is difficult because there are not sufficient surface constraints in the representation. In NeuS, we propose to represent a surface as the zero-level set of a signed distance function (SDF) and develop a new volume rendering method to train a neural SDF representation. We observe that the conventional volume rendering method causes inherent geometric errors (i.e. bias) for surface reconstruction, and therefore propose a new formulation that is free of bias in the first order of approximation, thus leading to more accurate surface reconstruction even without the mask supervision. Experiments on the DTU dataset and the BlendedMVS dataset show that NeuS outperforms the state-of-the-arts in high-quality surface reconstruction, especially for objects and scenes with complex structures and self-occlusion.
    Remote Sensing Images Semantic Segmentation with General Remote Sensing Vision Model via a Self-Supervised Contrastive Learning Method. (arXiv:2106.10605v1 [cs.CV])
    (2 min) A new learning paradigm, self-supervised learning (SSL), can be used to solve such problems by pre-training a general model with large unlabeled images and then fine-tuning on a downstream task with very few labeled samples. Contrastive learning is a typical method of SSL, which can learn general invariant features. However, most of the existing contrastive learning is designed for classification tasks to obtain an image-level representation, which may be sub-optimal for semantic segmentation tasks requiring pixel-level discrimination. Therefore, we propose Global style and Local matching Contrastive Learning Network (GLCNet) for remote sensing semantic segmentation. Specifically, the global style contrastive module is used to learn an image-level representation better, as we consider the style features can better represent the overall image features; The local features matching contrastive module is designed to learn representations of local regions which is beneficial for semantic segmentation. We evaluate four remote sensing semantic segmentation datasets, and the experimental results show that our method mostly outperforms state-of-the-art self-supervised methods and ImageNet pre-training. Specifically, with 1\% annotation from the original dataset, our approach improves Kappa by 6\% on the ISPRS Potsdam dataset and 3\% on Deep Globe Land Cover Classification dataset relative to the existing baseline. Moreover, our method outperforms supervised learning when there are some differences between the datasets of upstream tasks and downstream tasks. Our study promotes the development of self-supervised learning in the field of remote sensing semantic segmentation. The source code is available at https://github.com/GeoX-Lab/G-RSIM.
    Two-Stream Consensus Network: Submission to HACS Challenge 2021 Weakly-Supervised Learning Track. (arXiv:2106.10829v1 [cs.CV])
    (2 min) This technical report presents our solution to the HACS Temporal Action Localization Challenge 2021, Weakly-Supervised Learning Track. The goal of weakly-supervised temporal action localization is to temporally locate and classify action of interest in untrimmed videos given only video-level labels. We adopt the two-stream consensus network (TSCN) as the main framework in this challenge. The TSCN consists of a two-stream base model training procedure and a pseudo ground truth learning procedure. The base model training encourages the model to predict reliable predictions based on single modality (i.e., RGB or optical flow), based on the fusion of which a pseudo ground truth is generated and in turn used as supervision to train the base models. On the HACS v1.1.1 dataset, without fine-tuning the feature-extraction I3D models, our method achieves 22.20% on the validation set and 21.68% on the testing set in terms of average mAP. Our solution ranked the 2rd in this challenge, and we hope our method can serve as a baseline for future academic research.
    Trainable Class Prototypes for Few-Shot Learning. (arXiv:2106.10846v1 [cs.CV])
    (2 min) Metric learning is a widely used method for few shot learning in which the quality of prototypes plays a key role in the algorithm. In this paper we propose the trainable prototypes for distance measure instead of the artificial ones within the meta-training and task-training framework. Also to avoid the disadvantages that the episodic meta-training brought, we adopt non-episodic meta-training based on self-supervised learning. Overall we solve the few-shot tasks in two phases: meta-training a transferable feature extractor via self-supervised learning and training the prototypes for metric classification. In addition, the simple attention mechanism is used in both meta-training and task-training. Our method achieves state-of-the-art performance in a variety of established few-shot tasks on the standard few-shot visual classification dataset, with about 20% increase compared to the available unsupervised few-shot learning methods.
    Using Shape to Categorize: Low-Shot Learning with an Explicit Shape Bias. (arXiv:2101.07296v2 [cs.CV] UPDATED)
    (2 min) It is widely accepted that reasoning about object shape is important for object recognition. However, the most powerful object recognition methods today do not explicitly make use of object shape during learning. In this work, motivated by recent developments in low-shot learning, findings in developmental psychology, and the increased use of synthetic data in computer vision research, we investigate how reasoning about 3D shape can be used to improve low-shot learning methods' generalization performance. We propose a new way to improve existing low-shot learning approaches by learning a discriminative embedding space using 3D object shape, and using this embedding by learning how to map images into it. Our new approach improves the performance of image-only low-shot learning approaches on multiple datasets. We also introduce Toys4K, a 3D object dataset with the largest number of object categories currently available, which supports low-shot learning.
    Large-scale image segmentation based on distributed clustering algorithms. (arXiv:2106.10795v1 [cs.CV])
    (2 min) Many approaches to 3D image segmentation are based on hierarchical clustering of supervoxels into image regions. Here we describe a distributed algorithm capable of handling a tremendous number of supervoxels. The algorithm works recursively, the regions are divided into chunks that are processed independently in parallel by multiple workers. At each round of the recursive procedure, the chunk size in all dimensions are doubled until a single chunk encompasses the entire image. The final result is provably independent of the chunking scheme, and the same as if the entire image were processed without division into chunks. This is nontrivial because a pair of adjacent regions is scored by some statistical property (e.g. mean or median) of the affinities at the interface, and the interface may extend over arbitrarily many chunks. The trick is to delay merge decisions for regions that touch chunk boundaries, and only complete them in a later round after the regions are fully contained within a chunk. We demonstrate the algorithm by clustering an affinity graph with over 1.5 trillion edges between 135 billion supervoxels derived from a 3D electron microscopic brain image.
    Neighborhood Contrastive Learning for Novel Class Discovery. (arXiv:2106.10731v1 [cs.CV])
    (2 min) In this paper, we address Novel Class Discovery (NCD), the task of unveiling new classes in a set of unlabeled samples given a labeled dataset with known classes. We exploit the peculiarities of NCD to build a new framework, named Neighborhood Contrastive Learning (NCL), to learn discriminative representations that are important to clustering performance. Our contribution is twofold. First, we find that a feature extractor trained on the labeled set generates representations in which a generic query sample and its neighbors are likely to share the same class. We exploit this observation to retrieve and aggregate pseudo-positive pairs with contrastive learning, thus encouraging the model to learn more discriminative representations. Second, we notice that most of the instances are easily discriminated by the network, contributing less to the contrastive loss. To overcome this issue, we propose to generate hard negatives by mixing labeled and unlabeled samples in the feature space. We experimentally demonstrate that these two ingredients significantly contribute to clustering performance and lead our model to outperform state-of-the-art methods by a large margin (e.g., clustering accuracy +13% on CIFAR-100 and +8% on ImageNet).
    3D Object Detection for Autonomous Driving: A Survey. (arXiv:2106.10823v1 [cs.CV])
    (2 min) Autonomous driving is regarded as one of the most promising remedies to shield human beings from severe crashes. To this end, 3D object detection serves as the core basis of such perception system especially for the sake of path planning, motion prediction, collision avoidance, etc. Generally, stereo or monocular images with corresponding 3D point clouds are already standard layout for 3D object detection, out of which point clouds are increasingly prevalent with accurate depth information being provided. Despite existing efforts, 3D object detection on point clouds is still in its infancy due to high sparseness and irregularity of point clouds by nature, misalignment view between camera view and LiDAR bird's eye of view for modality synergies, occlusions and scale variations at long distances, etc. Recently, profound progress has been made in 3D object detection, with a large body of literature being investigated to address this vision task. As such, we present a comprehensive review of the latest progress in this field covering all the main topics including sensors, fundamentals, and the recent state-of-the-art detection methods with their pros and cons. Furthermore, we introduce metrics and provide quantitative comparisons on popular public datasets. The avenues for future work are going to be judiciously identified after an in-deep analysis of the surveyed works. Finally, we conclude this paper.
    Long-term Pedestrian Trajectory Prediction using Mutable Intention Filter and Warp LSTM. (arXiv:2007.00113v3 [cs.RO] UPDATED)
    (2 min) Trajectory prediction is one of the key capabilities for robots to safely navigate and interact with pedestrians. Critical insights from human intention and behavioral patterns need to be integrated to effectively forecast long-term pedestrian behavior. Thus, we propose a framework incorporating a Mutable Intention Filter and a Warp LSTM (MIF-WLSTM) to simultaneously estimate human intention and perform trajectory prediction. The Mutable Intention Filter is inspired by particle filtering and genetic algorithms, where particles represent intention hypotheses that can be mutated throughout the pedestrian motion. Instead of predicting sequential displacement over time, our Warp LSTM learns to generate offsets on a full trajectory predicted by a nominal intention-aware linear model, which considers the intention hypotheses during filtering process. Through experiments on a publicly available dataset, we show that our method outperforms baseline approaches and demonstrate the robust performance of our method under abnormal intention-changing scenarios. Code is available at https://github.com/tedhuang96/mifwlstm.
    Global Semantic Description of Objects based on Prototype Theory. (arXiv:1906.03365v4 [cs.CV] UPDATED)
    (2 min) In this paper, we introduce a novel semantic description approach inspired on Prototype Theory foundations. We propose a Computational Prototype Model (CPM) that encodes and stores the central semantic meaning of objects category: the semantic prototype. Also, we introduce a Prototype-based Description Model that encodes the semantic meaning of an object while describing its features using our CPM model. Our description method uses semantic prototypes computed by CNN-classifications models to create discriminative signatures that describe an object highlighting its most distinctive features within the category. Our experiments show that: i) our CPM model (semantic prototype + distance metric) is able to describe the internal semantic structure of objects categories; ii) our semantic distance metric can be understood as the object visual typicality score within a category; iii) our descriptor encoding is semantically interpretable and significantly outperforms other image global encodings in clustering and classification tasks.
    DiGS : Divergence guided shape implicit neural representation for unoriented point clouds. (arXiv:2106.10811v1 [cs.CV])
    (2 min) Neural shape representations have recently shown to be effective in shape analysis and reconstruction tasks. Existing neural network methods require point coordinates and corresponding normal vectors to learn the implicit level sets of the shape. Normal vectors are often not provided as raw data, therefore, approximation and reorientation are required as pre-processing stages, both of which can introduce noise. In this paper, we propose a divergence guided shape representation learning approach that does not require normal vectors as input. We show that incorporating a soft constraint on the divergence of the distance function favours smooth solutions that reliably orients gradients to match the unknown normal at each point, in some cases even better than approaches that use ground truth normal vectors directly. Additionally, we introduce a novel geometric initialization method for sinusoidal shape representation networks that further improves convergence to the desired solution. We evaluate the effectiveness of our approach on the task of surface reconstruction and show state-of-the-art performance compared to other unoriented methods and on-par performance compared to oriented methods.
    Task Attended Meta-Learning for Few-Shot Learning. (arXiv:2106.10642v1 [cs.LG])
    (2 min) Meta-learning (ML) has emerged as a promising direction in learning models under constrained resource settings like few-shot learning. The popular approaches for ML either learn a generalizable initial model or a generic parametric optimizer through episodic training. The former approaches leverage the knowledge from a batch of tasks to learn an optimal prior. In this work, we study the importance of a batch for ML. Specifically, we first incorporate a batch episodic training regimen to improve the learning of the generic parametric optimizer. We also hypothesize that the common assumption in batch episodic training that each task in a batch has an equal contribution to learning an optimal meta-model need not be true. We propose to weight the tasks in a batch according to their "importance" in improving the meta-model's learning. To this end, we introduce a training curriculum motivated by selective focus in humans, called task attended meta-training, to weight the tasks in a batch. Task attention is a standalone module that can be integrated with any batch episodic training regimen. The comparisons of the models with their non-task-attended counterparts on complex datasets like miniImageNet and tieredImageNet validate its effectiveness.
    Exploring Vision Transformers for Fine-grained Classification. (arXiv:2106.10587v1 [cs.CV])
    (2 min) Existing computer vision research in categorization struggles with fine-grained attributes recognition due to the inherently high intra-class variances and low inter-class variances. SOTA methods tackle this challenge by locating the most informative image regions and rely on them to classify the complete image. The most recent work, Vision Transformer (ViT), shows its strong performance in both traditional and fine-grained classification tasks. In this work, we propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions without requiring architectural changes using the inherent multi-head self-attention mechanism. We also introduce attention-guided augmentations for improving the model's capabilities. We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology. We also prove our model's interpretability via qualitative results.
    Attention to Warp: Deep Metric Learning for Multivariate Time Series. (arXiv:2103.15074v2 [cs.CV] UPDATED)
    (2 min) Deep time series metric learning is challenging due to the difficult trade-off between temporal invariance to nonlinear distortion and discriminative power in identifying non-matching sequences. This paper proposes a novel neural network-based approach for robust yet discriminative time series classification and verification. This approach adapts a parameterized attention model to time warping for greater and more adaptive temporal invariance. It is robust against not only local but also large global distortions, so that even matching pairs that do not satisfy the monotonicity, continuity, and boundary conditions can still be successfully identified. Learning of this model is further guided by dynamic time warping to impose temporal constraints for stabilized training and higher discriminative power. It can learn to augment the inter-class variation through warping, so that similar but different classes can be effectively distinguished. We experimentally demonstrate the superiority of the proposed approach over previous non-parametric and deep models by combining it with a deep online signature verification framework, after confirming its promising behavior in single-letter handwriting classification on the Unipen dataset.
    ReGO: Reference-Guided Outpainting for Scenery Image. (arXiv:2106.10601v1 [cs.CV])
    (2 min) We aim to tackle the challenging yet practical scenery image outpainting task in this work. Recently, generative adversarial learning has significantly advanced the image outpainting by producing semantic consistent content for the given image. However, the existing methods always suffer from the blurry texture and the artifacts of the generative part, making the overall outpainting results lack authenticity. To overcome the weakness, this work investigates a principle way to synthesize texture-rich results by borrowing pixels from its neighbors (\ie, reference images), named \textbf{Re}ference-\textbf{G}uided \textbf{O}utpainting (ReGO). Particularly, the ReGO designs an Adaptive Content Selection (ACS) module to transfer the pixel of reference images for texture compensating of the target one. To prevent the style of the generated part from being affected by the reference images, a style ranking loss is further proposed to augment the ReGO to synthesize style-consistent results. Extensive experiments on two popular benchmarks, NS6K~\cite{yangzx} and NS8K~\cite{wang}, well demonstrate the effectiveness of our ReGO.
    Automated Deepfake Detection. (arXiv:2106.10705v1 [cs.CV])
    (2 min) In this paper, we propose to utilize Automated Machine Learning to automatically search architecture for deepfake detection. Unlike previous works, our method benefits from the superior capability of deep learning while relieving us from the high labor cost in the manual network design process. It is experimentally proved that our proposed method not only outperforms previous non-deep learning methods but achieves comparable or even better prediction accuracy compared to previous deep learning methods. To improve the generality of our method, especially when training data and testing data are manipulated by different methods, we propose a multi-task strategy in our network learning process, making it estimate potential manipulation regions in given samples as well as predict whether the samples are real. Comparing to previous works using similar strategies, our method depends much less on prior knowledge, such as no need to know which manipulation method is utilized and whether it is utilized already. Extensive experimental results on two benchmark datasets demonstrate the effectiveness of our proposed method on deepfake detection.
    Low-Power Multi-Camera Object Re-Identification using Hierarchical Neural Networks. (arXiv:2106.10588v1 [cs.CV])
    (2 min) Low-power computer vision on embedded devices has many applications. This paper describes a low-power technique for the object re-identification (reID) problem: matching a query image against a gallery of previously seen images. State-of-the-art techniques rely on large, computationally-intensive Deep Neural Networks (DNNs). We propose a novel hierarchical DNN architecture that uses attribute labels in the training dataset to perform efficient object reID. At each node in the hierarchy, a small DNN identifies a different attribute of the query image. The small DNN at each leaf node is specialized to re-identify a subset of the gallery: only the images with the attributes identified along the path from the root to a leaf. Thus, a query image is re-identified accurately after processing with a few small DNNs. We compare our method with state-of-the-art object reID techniques. With a 4% loss in accuracy, our approach realizes significant resource savings: 74% less memory, 72% fewer operations, and 67% lower query latency, yielding 65% less energy consumption.
    Neural Network Libraries: A Deep Learning Framework Designed from Engineers' Perspectives. (arXiv:2102.06725v2 [cs.LG] UPDATED)
    (2 min) While there exist a plethora of deep learning tools and frameworks, the fast-growing complexity of the field brings new demands and challenges, such as more flexible network design, speedy computation on distributed setting, and compatibility between different tools. In this paper, we introduce Neural Network Libraries (https://nnabla.org), a deep learning framework designed from engineer's perspective, with emphasis on usability and compatibility as its core design principles. We elaborate on each of our design principles and its merits, and validate our attempts via experiments.
    Unbalanced Feature Transport for Exemplar-based Image Translation. (arXiv:2106.10482v1 [cs.CV])
    (2 min) Despite the great success of GANs in images translation with different conditioned inputs such as semantic segmentation and edge maps, generating high-fidelity realistic images with reference styles remains a grand challenge in conditional image-to-image translation. This paper presents a general image translation framework that incorporates optimal transport for feature alignment between conditional inputs and style exemplars in image translation. The introduction of optimal transport mitigates the constraint of many-to-one feature matching significantly while building up accurate semantic correspondences between conditional inputs and exemplars. We design a novel unbalanced optimal transport to address the transport between features with deviational distributions which exists widely between conditional inputs and exemplars. In addition, we design a semantic-activation normalization scheme that injects style features of exemplars into the image translation process successfully. Extensive experiments over multiple image translation tasks show that our method achieves superior image translation qualitatively and quantitatively as compared with the state-of-the-art.
    Exploring Semantic Relationships for Unpaired Image Captioning. (arXiv:2106.10658v1 [cs.CV])
    (2 min) Recently, image captioning has aroused great interest in both academic and industrial worlds. Most existing systems are built upon large-scale datasets consisting of image-sentence pairs, which, however, are time-consuming to construct. In addition, even for the most advanced image captioning systems, it is still difficult to realize deep image understanding. In this work, we achieve unpaired image captioning by bridging the vision and the language domains with high-level semantic information. The motivation stems from the fact that the semantic concepts with the same modality can be extracted from both images and descriptions. To further improve the quality of captions generated by the model, we propose the Semantic Relationship Explorer, which explores the relationships between semantic concepts for better understanding of the image. Extensive experiments on MSCOCO dataset show that we can generate desirable captions without paired datasets. Furthermore, the proposed approach boosts five strong baselines under the paired setting, where the most significant improvement in CIDEr score reaches 8%, demonstrating that it is effective and generalizes well to a wide range of models.
    Nuclei Grading of Clear Cell Renal Cell Carcinoma in Histopathological Image by Composite High-Resolution Network. (arXiv:2106.10641v1 [eess.IV])
    (2 min) The grade of clear cell renal cell carcinoma (ccRCC) is a critical prognostic factor, making ccRCC nuclei grading a crucial task in RCC pathology analysis. Computer-aided nuclei grading aims to improve pathologists' work efficiency while reducing their misdiagnosis rate by automatically identifying the grades of tumor nuclei within histopathological images. Such a task requires precisely segment and accurately classify the nuclei. However, most of the existing nuclei segmentation and classification methods can not handle the inter-class similarity property of nuclei grading, thus can not be directly applied to the ccRCC grading task. In this paper, we propose a Composite High-Resolution Network for ccRCC nuclei grading. Specifically, we propose a segmentation network called W-Net that can separate the clustered nuclei. Then, we recast the fine-grained classification of nuclei to two cross-category classification tasks, based on two high-resolution feature extractors (HRFEs) which are proposed for learning these two tasks. The two HRFEs share the same backbone encoder with W-Net by a composite connection so that meaningful features for the segmentation task can be inherited for the classification task. Last, a head-fusion block is applied to generate the predicted label of each nucleus. Furthermore, we introduce a dataset for ccRCC nuclei grading, containing 1000 image patches with 70945 annotated nuclei. We demonstrate that our proposed method achieves state-of-the-art performance compared to existing methods on this large ccRCC grading dataset.
    Adversarial Manifold Matching via Deep Metric Learning for Generative Modeling. (arXiv:2106.10777v1 [cs.CV])
    (2 min) We propose a manifold matching approach to generative models which includes a distribution generator (or data generator) and a metric generator. In our framework, we view the real data set as some manifold embedded in a high-dimensional Euclidean space. The distribution generator aims at generating samples that follow some distribution condensed around the real data manifold. It is achieved by matching two sets of points using their geometric shape descriptors, such as centroid and $p$-diameter, with learned distance metric; the metric generator utilizes both real data and generated samples to learn a distance metric which is close to some intrinsic geodesic distance on the real data manifold. The produced distance metric is further used for manifold matching. The two networks are learned simultaneously during the training process. We apply the approach on both unsupervised and supervised learning tasks: in unconditional image generation task, the proposed method obtains competitive results compared with existing generative models; in super-resolution task, we incorporate the framework in perception-based models and improve visual qualities by producing samples with more natural textures. Both theoretical analysis and real data experiments guarantee the feasibility and effectiveness of the proposed framework.
    Tag, Copy or Predict: A Unified Weakly-Supervised Learning Framework for Visual Information Extraction using Sequences. (arXiv:2106.10681v1 [cs.CV])
    (2 min) Visual information extraction (VIE) has attracted increasing attention in recent years. The existing methods usually first organized optical character recognition (OCR) results into plain texts and then utilized token-level entity annotations as supervision to train a sequence tagging model. However, it expends great annotation costs and may be exposed to label confusion, and the OCR errors will also significantly affect the final performance. In this paper, we propose a unified weakly-supervised learning framework called TCPN (Tag, Copy or Predict Network), which introduces 1) an efficient encoder to simultaneously model the semantic and layout information in 2D OCR results; 2) a weakly-supervised training strategy that utilizes only key information sequences as supervision; and 3) a flexible and switchable decoder which contains two inference modes: one (Copy or Predict Mode) is to output key information sequences of different categories by copying a token from the input or predicting one in each time step, and the other (Tag Mode) is to directly tag the input sequence in a single forward pass. Our method shows new state-of-the-art performance on several public benchmarks, which fully proves its effectiveness.
    Neural Network Facial Authentication for Public Electric Vehicle Charging Station. (arXiv:2106.10432v1 [cs.CV])
    (2 min) This study is to investigate and compare the facial recognition accuracy performance of Dlib ResNet against a K-Nearest Neighbour (KNN) classifier. Particularly when used against a dataset from an Asian ethnicity as Dlib ResNet was reported to have an accuracy deficiency when it comes to Asian faces. The comparisons are both implemented on the facial vectors extracted using the Histogram of Oriented Gradients (HOG) method and use the same dataset for a fair comparison. Authentication of a user by facial recognition in an electric vehicle (EV) charging station demonstrates a practical use case for such an authentication system.
    IsMo-GAN: Adversarial Learning for Monocular Non-Rigid 3D Reconstruction. (arXiv:1904.12144v2 [cs.CV] UPDATED)
    (2 min) The majority of the existing methods for non-rigid 3D surface regression from monocular 2D images require an object template or point tracks over multiple frames as an input, and are still far from real-time processing rates. In this work, we present the Isometry-Aware Monocular Generative Adversarial Network (IsMo-GAN) - an approach for direct 3D reconstruction from a single image, trained for the deformation model in an adversarial manner on a light-weight synthetic dataset. IsMo-GAN reconstructs surfaces from real images under varying illumination, camera poses, textures and shading at over 250 Hz. In multiple experiments, it consistently outperforms several approaches in the reconstruction accuracy, runtime, generalisation to unknown surfaces and robustness to occlusions. In comparison to the state-of-the-art, we reduce the reconstruction error by 10-30% including the textureless case and our surfaces evince fewer artefacts qualitatively.
    CenterAtt: Fast 2-stage Center Attention Network. (arXiv:2106.10493v1 [cs.CV])
    (2 min) In this technical report, we introduce the methods of HIKVISION_LiDAR_Det in the challenge of waymo open dataset real-time 3D detection. Our solution for the competition are built upon Centerpoint 3D detection framework. Several variants of CenterPoint are explored, including center attention head and feature pyramid network neck. In order to achieve real time detection, methods like batchnorm merge, half-precision floating point network and GPU-accelerated voxelization process are adopted. By using these methods, our team ranks 6th among all the methods on real-time 3D detection challenge in the waymo open dataset.
    Sparse Training via Boosting Pruning Plasticity with Neuroregeneration. (arXiv:2106.10404v1 [cs.LG])
    (2 min) Works on lottery ticket hypothesis (LTH) and single-shot network pruning (SNIP) have raised a lot of attention currently on post-training pruning (iterative magnitude pruning), and before-training pruning (pruning at initialization). The former method suffers from an extremely large computation cost and the latter category of methods usually struggles with insufficient performance. In comparison, during-training pruning, a class of pruning methods that simultaneously enjoys the training/inference efficiency and the comparable performance, temporarily, has been less explored. To better understand during-training pruning, we quantitatively study the effect of pruning throughout training from the perspective of pruning plasticity (the ability of the pruned networks to recover the original performance). Pruning plasticity can help explain several other empirical observations about neural network pruning in literature. We further find that pruning plasticity can be substantially improved by injecting a brain-inspired mechanism called neuroregeneration, i.e., to regenerate the same number of connections as pruned. Based on the insights from pruning plasticity, we design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (GraNet), and its dynamic sparse training (DST) variant (GraNet-ST). Both of them advance state of the art. Perhaps most impressively, the latter for the first time boosts the sparse-to-sparse training performance over various dense-to-sparse methods by a large margin with ResNet-50 on ImageNet. We will release all codes.
    Attend What You Need: Motion-Appearance Synergistic Networks for Video Question Answering. (arXiv:2106.10446v1 [cs.CV])
    (2 min) Video Question Answering is a task which requires an AI agent to answer questions grounded in video. This task entails three key challenges: (1) understand the intention of various questions, (2) capturing various elements of the input video (e.g., object, action, causality), and (3) cross-modal grounding between language and vision information. We propose Motion-Appearance Synergistic Networks (MASN), which embed two cross-modal features grounded on motion and appearance information and selectively utilize them depending on the question's intentions. MASN consists of a motion module, an appearance module, and a motion-appearance fusion module. The motion module computes the action-oriented cross-modal joint representations, while the appearance module focuses on the appearance aspect of the input video. Finally, the motion-appearance fusion module takes each output of the motion module and the appearance module as input, and performs question-guided fusion. As a result, MASN achieves new state-of-the-art performance on the TGIF-QA and MSVD-QA datasets. We also conduct qualitative analysis by visualizing the inference results of MASN. The code is available at https://github.com/ahjeongseo/MASN-pytorch.
    Direct Reconstruction of Linear Parametric Images from Dynamic PET Using Nonlocal Deep Image Prior. (arXiv:2106.10359v1 [eess.IV])
    (2 min) Direct reconstruction methods have been developed to estimate parametric images directly from the measured PET sinograms by combining the PET imaging model and tracer kinetics in an integrated framework. Due to limited counts received, signal-to-noise-ratio (SNR) and resolution of parametric images produced by direct reconstruction frameworks are still limited. Recently supervised deep learning methods have been successfully applied to medical imaging denoising/reconstruction when large number of high-quality training labels are available. For static PET imaging, high-quality training labels can be acquired by extending the scanning time. However, this is not feasible for dynamic PET imaging, where the scanning time is already long enough. In this work, we proposed an unsupervised deep learning framework for direct parametric reconstruction from dynamic PET, which was tested on the Patlak model and the relative equilibrium Logan model. The patient's anatomical prior image, which is readily available from PET/CT or PET/MR scans, was supplied as the network input to provide a manifold constraint, and also utilized to construct a kernel layer to perform non-local feature denoising. The linear kinetic model was embedded in the network structure as a 1x1 convolution layer. The training objective function was based on the PET statistical model. Evaluations based on dynamic datasets of 18F-FDG and 11C-PiB tracers show that the proposed framework can outperform the traditional and the kernel method-based direct reconstruction methods.
    MSN: Efficient Online Mask Selection Network for Video Instance Segmentation. (arXiv:2106.10452v1 [cs.CV])
    (2 min) In this work we present a novel solution for Video Instance Segmentation(VIS), that is automatically generating instance level segmentation masks along with object class and tracking them in a video. Our method improves the masks from segmentation and propagation branches in an online manner using the Mask Selection Network (MSN) hence limiting the noise accumulation during mask tracking. We propose an effective design of MSN by using patch-based convolutional neural network. The network is able to distinguish between very subtle differences between the masks and choose the better masks out of the associated masks accurately. Further, we make use of temporal consistency and process the video sequences in both forward and reverse manner as a post processing step to recover lost objects. The proposed method can be used to adapt any video object segmentation method for the task of VIS. Our method achieves a score of 49.1 mAP on 2021 YouTube-VIS Challenge and was ranked third place among more than 30 global teams. Our code will be available at https://github.com/SHI-Labs/Mask-Selection-Networks.
    CompConv: A Compact Convolution Module for Efficient Feature Learning. (arXiv:2106.10486v1 [cs.CV])
    (2 min) Convolutional Neural Networks (CNNs) have achieved remarkable success in various computer vision tasks but rely on tremendous computational cost. To solve this problem, existing approaches either compress well-trained large-scale models or learn lightweight models with carefully designed network structures. In this work, we make a close study of the convolution operator, which is the basic unit used in CNNs, to reduce its computing load. In particular, we propose a compact convolution module, called CompConv, to facilitate efficient feature learning. With the divide-and-conquer strategy, CompConv is able to save a great many computations as well as parameters to produce a certain dimensional feature map. Furthermore, CompConv discreetly integrates the input features into the outputs to efficiently inherit the input information. More importantly, the novel CompConv is a plug-and-play module that can be directly applied to modern CNN structures to replace the vanilla convolution layers without further effort. Extensive experimental results suggest that CompConv can adequately compress the benchmark CNN structures yet barely sacrifice the performance, surpassing other competitors.
    One-to-many Approach for Improving Super-Resolution. (arXiv:2106.10437v1 [eess.IV])
    (2 min) Super-resolution (SR) is a one-to-many task with multiple possible solutions. However, previous works were not concerned about this characteristic. For a one-to-many pipeline, the generator should be able to generate multiple estimates of the reconstruction, and not be penalized for generating similar and equally realistic images. To achieve this, we propose adding weighted pixel-wise noise after every Residual-in-Residual Dense Block (RRDB) to enable the generator to generate various images. We modify the strict content loss to not penalize the stochastic variation in reconstructed images as long as it has consistent content. Additionally, we observe that there are out-of-focus regions in the DIV2K, DIV8K datasets that provide unhelpful guidelines. We filter blurry regions in the training data using the method of [10]. Finally, we modify the discriminator to receive the low-resolution image as a reference image along with the target image to provide better feedback to the generator. Using our proposed methods, we were able to improve the performance of ESRGAN in x4 perceptual SR and achieve the state-of-the-art LPIPS score in x16 perceptual extreme SR.
    Single View Physical Distance Estimation using Human Pose. (arXiv:2106.10335v1 [cs.CV])
    (2 min) We propose a fully automated system that simultaneously estimates the camera intrinsics, the ground plane, and physical distances between people from a single RGB image or video captured by a camera viewing a 3-D scene from a fixed vantage point. To automate camera calibration and distance estimation, we leverage priors about human pose and develop a novel direct formulation for pose-based auto-calibration and distance estimation, which shows state-of-the-art performance on publicly available datasets. The proposed approach enables existing camera systems to measure physical distances without needing a dedicated calibration process or range sensors, and is applicable to a broad range of use cases such as social distancing and workplace safety. Furthermore, to enable evaluation and drive research in this area, we contribute to the publicly available MEVA dataset with additional distance annotations, resulting in MEVADA -- the first evaluation benchmark in the world for the pose-based auto-calibration and distance estimation problem.
    Prediction of the facial growth direction with Machine Learning methods. (arXiv:2106.10464v1 [cs.LG])
    (2 min) First attempts of prediction of the facial growth (FG) direction were made over half of a century ago. Despite numerous attempts and elapsed time, a satisfactory method has not been established yet and the problem still poses a challenge for medical experts. To our knowledge, this paper is the first Machine Learning approach to the prediction of FG direction. Conducted data analysis reveals the inherent complexity of the problem and explains the reasons of difficulty in FG direction prediction based on 2D X-ray images. To perform growth forecasting, we employ a wide range of algorithms, from logistic regression, through tree ensembles to neural networks and consider three, slightly different, problem formulations. The resulting classification accuracy varies between 71% and 75%.
    Informative Class Activation Maps. (arXiv:2106.10472v1 [cs.CV])
    (2 min) We study how to evaluate the quantitative information content of a region within an image for a particular label. To this end, we bridge class activation maps with information theory. We develop an informative class activation map (infoCAM). Given a classification task, infoCAM depict how to accumulate information of partial regions to that of the entire image toward a label. Thus, we can utilise infoCAM to locate the most informative features for a label. When applied to an image classification task, infoCAM performs better than the traditional classification map in the weakly supervised object localisation task. We achieve state-of-the-art results on Tiny-ImageNet.
    The Animal ID Problem: Continual Curation. (arXiv:2106.10377v1 [cs.CV])
    (2 min) Hoping to stimulate new research in individual animal identification from images, we propose to formulate the problem as the human-machine Continual Curation of images and animal identities. This is an open world recognition problem, where most new animals enter the system after its algorithms are initially trained and deployed. Continual Curation, as defined here, requires (1) an improvement in the effectiveness of current recognition methods, (2) a pairwise verification algorithm that allows the possibility of no decision, and (3) an algorithmic decision mechanism that seeks human input to guide the curation process. Error metrics must evaluate the ability of recognition algorithms to identify not only animals that have been seen just once or twice but also recognize new animals not in the database. An important measure of overall system performance is accuracy as a function of the amount of human input required.
    Interactive Object Segmentation with Dynamic Click Transform. (arXiv:2106.10465v1 [cs.CV])
    (2 min) In the interactive segmentation, users initially click on the target object to segment the main body and then provide corrections on mislabeled regions to iteratively refine the segmentation masks. Most existing methods transform these user-provided clicks into interaction maps and concatenate them with image as the input tensor. Typically, the interaction maps are determined by measuring the distance of each pixel to the clicked points, ignoring the relation between clicks and mislabeled regions. We propose a Dynamic Click Transform Network~(DCT-Net), consisting of Spatial-DCT and Feature-DCT, to better represent user interactions. Spatial-DCT transforms each user-provided click with individual diffusion distance according to the target scale, and Feature-DCT normalizes the extracted feature map to a specific distribution predicted from the clicked points. We demonstrate the effectiveness of our proposed method and achieve favorable performance compared to the state-of-the-art on three standard benchmark datasets.
    Towards Single Stage Weakly Supervised Semantic Segmentation. (arXiv:2106.10309v1 [cs.CV])
    (2 min) The costly process of obtaining semantic segmentation labels has driven research towards weakly supervised semantic segmentation (WSSS) methods, using only image-level, point, or box labels. The lack of dense scene representation requires methods to increase complexity to obtain additional semantic information about the scene, often done through multiple stages of training and refinement. Current state-of-the-art (SOTA) models leverage image-level labels to produce class activation maps (CAMs) which go through multiple stages of refinement before they are thresholded to make pseudo-masks for supervision. The multi-stage approach is computationally expensive, and dependency on image-level labels for CAMs generation lacks generalizability to more complex scenes. In contrary, our method offers a single-stage approach generalizable to arbitrary dataset, that is trainable from scratch, without any dependency on pre-trained backbones, classification, or separate refinement tasks. We utilize point annotations to generate reliable, on-the-fly pseudo-masks through refined and filtered features. While our method requires point annotations that are only slightly more expensive than image-level annotations, we are to demonstrate SOTA performance on benchmark datasets (PascalVOC 2012), as well as significantly outperform other SOTA WSSS methods on recent real-world datasets (CRAID, CityPersons, IAD).
    Dynamical Deep Generative Latent Modeling of 3D Skeletal Motion. (arXiv:2106.10393v1 [cs.CV])
    (2 min) In this paper, we propose a Bayesian switching dynamical model for segmentation of 3D pose data over time that uncovers interpretable patterns in the data and is generative. Our model decomposes highly correlated skeleton data into a set of few spatial basis of switching temporal processes in a low-dimensional latent framework. We parameterize these temporal processes with regard to a switching deep vector autoregressive prior in order to accommodate both multimodal and higher-order nonlinear inter-dependencies. This results in a dynamical deep generative latent model that parses the meaningful intrinsic states in the dynamics of 3D pose data using approximate variational inference, and enables a realistic low-level dynamical generation and segmentation of complex skeleton movements. Our experiments on four biological motion data containing bat flight, salsa dance, walking, and golf datasets substantiate superior performance of our model in comparison with the state-of-the-art methods.
    Multi-Contextual Design of Convolutional Neural Network for Steganalysis. (arXiv:2106.10430v1 [cs.MM])
    (2 min) In recent times, deep learning-based steganalysis classifiers became popular due to their state-of-the-art performance. Most deep steganalysis classifiers usually extract noise residuals using high-pass filters as preprocessing steps and feed them to their deep model for classification. It is observed that recent steganographic embedding does not always restrict their embedding in the high-frequency zone; instead, they distribute it as per embedding policy. Therefore, besides noise residual, learning the embedding zone is another challenging task. In this work, unlike the conventional approaches, the proposed model first extracts the noise residual using learned denoising kernels to boost the signal-to-noise ratio. After preprocessing, the sparse noise residuals are fed to a novel Multi-Contextual Convolutional Neural Network (M-CNET) that uses heterogeneous context size to learn the sparse and low-amplitude representation of noise residuals. The model performance is further improved by incorporating the Self-Attention module to focus on the areas prone to steganalytic embedding. A set of comprehensive experiments is performed to show the proposed scheme's efficacy over the prior arts. Besides, an ablation study is given to justify the contribution of various modules of the proposed architecture.
    A system of vision sensor based deep neural networks for complex driving scene analysis in support of crash risk assessment and prevention. (arXiv:2106.10319v1 [cs.CV])
    (2 min) To assist human drivers and autonomous vehicles in assessing crash risks, driving scene analysis using dash cameras on vehicles and deep learning algorithms is of paramount importance. Although these technologies are increasingly available, driving scene analysis for this purpose still remains a challenge. This is mainly due to the lack of annotated large image datasets for analyzing crash risk indicators and crash likelihood, and the lack of an effective method to extract lots of required information from complex driving scenes. To fill the gap, this paper develops a scene analysis system. The Multi-Net of the system includes two multi-task neural networks that perform scene classification to provide four labels for each scene. The DeepLab v3 and YOLO v3 are combined by the system to detect and locate risky pedestrians and the nearest vehicles. All identified information can provide the situational awareness to autonomous vehicles or human drivers for identifying crash risks from the surrounding traffic. To address the scarcity of annotated image datasets for studying traffic crashes, two completely new datasets have been developed by this paper and made available to the public, which were proved to be effective in training the proposed deep neural networks. The paper further evaluates the performance of the Multi-Net and the efficiency of the developed system. Comprehensive scene analysis is further illustrated with representative examples. Results demonstrate the effectiveness of the developed system and datasets for driving scene analysis, and their supportiveness for crash risk assessment and crash prevention.
    Place recognition survey: An update on deep learning approaches. (arXiv:2106.10458v1 [cs.CV])
    (2 min) Autonomous Vehicles (AV) are becoming more capable of navigating in complex environments with dynamic and changing conditions. A key component that enables these intelligent vehicles to overcome such conditions and become more autonomous is the sophistication of the perception and localization systems. As part of the localization system, place recognition has benefited from recent developments in other perception tasks such as place categorization or object recognition, namely with the emergence of deep learning (DL) frameworks. This paper surveys recent approaches and methods used in place recognition, particularly those based on deep learning. The contributions of this work are twofold: surveying recent sensors such as 3D LiDARs and RADARs, applied in place recognition; and categorizing the various DL-based place recognition works into supervised, unsupervised, semi-supervised, parallel, and hierarchical categories. First, this survey introduces key place recognition concepts to contextualize the reader. Then, sensor characteristics are addressed. This survey proceeds by elaborating on the various DL-based works, presenting summaries for each framework. Some lessons learned from this survey include: the importance of NetVLAD for supervised end-to-end learning; the advantages of unsupervised approaches in place recognition, namely for cross-domain applications; or the increasing tendency of recent works to seek, not only for higher performance but also for higher efficiency.
    AdaZoom: Adaptive Zoom Network for Multi-Scale Object Detection in Large Scenes. (arXiv:2106.10409v1 [cs.CV])
    (2 min) Detection in large-scale scenes is a challenging problem due to small objects and extreme scale variation. It is essential to focus on the image regions of small objects. In this paper, we propose a novel Adaptive Zoom (AdaZoom) network as a selective magnifier with flexible shape and focal length to adaptively zoom the focus regions for object detection. Based on policy gradient, we construct a reinforcement learning framework for focus region generation, with the reward formulated by object distributions. The scales and aspect ratios of the generated regions are adaptive to the scales and distribution of objects inside. We apply variable magnification according to the scale of the region for adaptive multi-scale detection. We further propose collaborative training to complementarily promote the performance of AdaZoom and the detection network. To validate the effectiveness, we conduct extensive experiments on VisDrone2019, UAVDT, and DOTA datasets. The experiments show AdaZoom brings a consistent and significant improvement over different detection networks, achieving state-of-the-art performance on these datasets, especially outperforming the existing methods by AP of 4.64% on Vis-Drone2019.
    Exploring Visual Context for Weakly Supervised Person Search. (arXiv:2106.10506v1 [cs.CV])
    (2 min) Person search has recently emerged as a challenging task that jointly addresses pedestrian detection and person re-identification. Existing approaches follow a fully supervised setting where both bounding box and identity annotations are available. However, annotating identities is labor-intensive, limiting the practicability and scalability of current frameworks. This paper inventively considers weakly supervised person search with only bounding box annotations. We proposed the first framework to address this novel task, namely Context-Guided Person Search (CGPS), by investigating three levels of context clues (i.e., detection, memory and scene) in unconstrained natural images. The first two are employed to promote local and global discriminative capabilities, while the latter enhances clustering accuracy. Despite its simple design, our CGPS boosts the baseline model by 8.3% in mAP on CUHK-SYSU. Surprisingly, it even achieves comparable performance to two-step person search models, while displaying higher efficiency. Our code is available at https://github.com/ljpadam/CGPS.
    Deep Generative Learning via Schr\"{o}dinger Bridge. (arXiv:2106.10410v1 [cs.LG])
    (2 min) We propose to learn a generative model via entropy interpolation with a Schr\"{o}dinger Bridge. The generative learning task can be formulated as interpolating between a reference distribution and a target distribution based on the Kullback-Leibler divergence. At the population level, this entropy interpolation is characterized via an SDE on $[0,1]$ with a time-varying drift term. At the sample level, we derive our Schr\"{o}dinger Bridge algorithm by plugging the drift term estimated by a deep score estimator and a deep density ratio estimator into the Euler-Maruyama method. Under some mild smoothness assumptions of the target distribution, we prove the consistency of both the score estimator and the density ratio estimator, and then establish the consistency of the proposed Schr\"{o}dinger Bridge approach. Our theoretical results guarantee that the distribution learned by our approach converges to the target distribution. Experimental results on multimodal synthetic data and benchmark data support our theoretical findings and indicate that the generative model via Schr\"{o}dinger Bridge is comparable with state-of-the-art GANs, suggesting a new formulation of generative learning. We demonstrate its usefulness in image interpolation and image inpainting.
  • cs.IR updates on arXiv.org

    Explainable Graph-based Search for Lessons-Learned Documents in the Semiconductor Industry. (arXiv:2105.08442v2 [cs.IR] UPDATED)
    (3 min) Industrial processes produce a considerable volume of data and thus information. Whether it is structured sensory data or semi- to unstructured textual data, the knowledge that can be derived from it is critical to the sustainable development of the industrial process. A key challenge of this sustainability is the intelligent management of the generated data, as well as the knowledge extracted from it, in order to utilize this knowledge for improving future procedures. This challenge is a result of the tailored documentation methods and domain-specific requirements, which include the need for quick visibility of the documented knowledge. In this paper, we utilize the expert knowledge documented in chip-design failure reports in supporting user access to information that is relevant to a current chip design. Unstructured, free, textual data in previous failure documentations provides a valuable source of lessons-learned, which expert design-engineers have experienced, solved and documented. To achieve a sustainable utilization of knowledge within the company, not only the inherent knowledge has to be mined from unstructured textual data, but also the relations between the lessons-learned, uncovering potentially unknown links. In this research, a knowledge graph is constructed, in order to represent and use the interconnections between reported design failures. A search engine is developed and applied onto the graph to answer queries. In contrast to mere keyword-based searching, the searchability of the knowledge graph offers enhanced search results beyond direct matches and acts as a mean for generating explainable results and result recommendations. Results are provided to the design engineer through an interactive search interface, in which, the feedback from the user is used to further optimize relations for future iterations of the knowledge graph.
    Ensemble of MRR and NDCG models for Visual Dialog. (arXiv:2104.07511v2 [cs.AI] UPDATED)
    (2 min) Assessing an AI agent that can converse in human language and understand visual content is challenging. Generation metrics, such as BLEU scores favor correct syntax over semantics. Hence a discriminative approach is often used, where an agent ranks a set of candidate options. The mean reciprocal rank (MRR) metric evaluates the model performance by taking into account the rank of a single human-derived answer. This approach, however, raises a new challenge: the ambiguity and synonymy of answers, for instance, semantic equivalence (e.g., `yeah' and `yes'). To address this, the normalized discounted cumulative gain (NDCG) metric has been used to capture the relevance of all the correct answers via dense annotations. However, the NDCG metric favors the usually applicable uncertain answers such as `I don't know. Crafting a model that excels on both MRR and NDCG metrics is challenging. Ideally, an AI agent should answer a human-like reply and validate the correctness of any answer. To address this issue, we describe a two-step non-parametric ranking approach that can merge strong MRR and NDCG models. Using our approach, we manage to keep most MRR state-of-the-art performance (70.41% vs. 71.24%) and the NDCG state-of-the-art performance (72.16% vs. 75.35%). Moreover, our approach won the recent Visual Dialog 2020 challenge. Source code is available at https://github.com/idansc/mrr-ndcg.
    A Comprehensive Review on Non-Neural Networks Collaborative Filtering Recommendation Systems. (arXiv:2106.10679v1 [cs.IR])
    (2 min) Over the past two decades, recommender systems have attracted a lot of interest due to the explosion in the amount of data in online applications. A particular attention has been paid to collaborative filtering, which is the most widely used in applications that involve information recommendations. Collaborative filtering (CF) uses the known preference of a group of users to make predictions and recommendations about the unknown preferences of other users (recommendations are made based on the past behavior of users). First introduced in the 1990s, a wide variety of increasingly successful models have been proposed. Due to the success of machine learning techniques in many areas, there has been a growing emphasis on the application of such algorithms in recommendation systems. In this article, we present an overview of the CF approaches for recommender systems, their two main categories, and their evaluation metrics. We focus on the application of classical Machine Learning algorithms to CF recommender systems by presenting their evolution from their first use-cases to advanced Machine Learning models. We attempt to provide a comprehensive and comparative overview of CF systems (with python implementations) that can serve as a guideline for research and practice in this area.
    Context-Aware Legal Citation Recommendation using Deep Learning. (arXiv:2106.10776v1 [cs.IR])
    (2 min) Lawyers and judges spend a large amount of time researching the proper legal authority to cite while drafting decisions. In this paper, we develop a citation recommendation tool that can help improve efficiency in the process of opinion drafting. We train four types of machine learning models, including a citation-list based method (collaborative filtering) and three context-based methods (text similarity, BiLSTM and RoBERTa classifiers). Our experiments show that leveraging local textual context improves recommendation, and that deep neural models achieve decent performance. We show that non-deep text-based methods benefit from access to structured case metadata, but deep models only benefit from such access when predicting from context of insufficient length. We also find that, even after extensive training, RoBERTa does not outperform a recurrent neural model, despite its benefits of pretraining. Our behavior analysis of the RoBERTa model further shows that predictive performance is stable across time and citation classes.
    Tag, Copy or Predict: A Unified Weakly-Supervised Learning Framework for Visual Information Extraction using Sequences. (arXiv:2106.10681v1 [cs.CV])
    (2 min) Visual information extraction (VIE) has attracted increasing attention in recent years. The existing methods usually first organized optical character recognition (OCR) results into plain texts and then utilized token-level entity annotations as supervision to train a sequence tagging model. However, it expends great annotation costs and may be exposed to label confusion, and the OCR errors will also significantly affect the final performance. In this paper, we propose a unified weakly-supervised learning framework called TCPN (Tag, Copy or Predict Network), which introduces 1) an efficient encoder to simultaneously model the semantic and layout information in 2D OCR results; 2) a weakly-supervised training strategy that utilizes only key information sequences as supervision; and 3) a flexible and switchable decoder which contains two inference modes: one (Copy or Predict Mode) is to output key information sequences of different categories by copying a token from the input or predicting one in each time step, and the other (Tag Mode) is to directly tag the input sequence in a single forward pass. Our method shows new state-of-the-art performance on several public benchmarks, which fully proves its effectiveness.
    Sequential Recommendation in Online Games with Multiple Sequences, Tasks and User Levels. (arXiv:2102.06950v2 [cs.AI] UPDATED)
    (2 min) Online gaming is growing faster than ever before, with increasing challenges of providing better user experience. Recommender systems (RS) for online games face unique challenges since they must fulfill players' distinct desires, at different user levels, based on their action sequences of various action types. Although many sequential RS already exist, they are mainly single-sequence, single-task, and single-user-level. In this paper, we introduce a new sequential recommendation model for multiple sequences, multiple tasks, and multiple user levels (abbreviated as M$^3$Rec) in Tencent Games platform, which can fully utilize complex data in online games. We leverage Graph Neural Network and multi-task learning to design M$^3$Rec in order to model the complex information in the heterogeneous sequential recommendation scenario of Tencent Games. We verify the effectiveness of M$^3$Rec on three online games of Tencent Games platform, in both offline and online evaluations. The results show that M$^3$Rec successfully addresses the challenges of recommendation in online games, and it generates superior recommendations compared with state-of-the-art sequential recommendation approaches.
    Leveraging Multiple Online Sources for Accurate Income Verification. (arXiv:2106.10547v1 [cs.IR])
    (2 min) Income verification is the problem of validating a person's stated income given basic identity information such as name, location, job title and employer. It is widely used in the context of mortgage lending, rental applications and other financial risk models. However, the current processes surrounding verification involve significant human effort and document gathering which can be both time-consuming and expensive. In this paper, we propose a novel model for verifying an individual's income given very limited identity information typically available in loan applications. Our model is a combination of a deep neural network and hand-engineered features. The hand engineered features are based upon matching the input information against income records extracted automatically from various publicly available online sources (e.g. payscale.com, H-1B filings, government employee salaries). We conduct experiments on two data sets, one simulated from H-1B records and the other from a real-world data set of peer-to-peer (P2P) loan applications obtained from one of the world's largest P2P lending platform. Our results show a significant reduction in error of 3-6% relative to several strong baselines. We also perform ablation studies to demonstrate that a combined model is indeed necessary to achieve state-of-the-art performance on this task.
    On Sampling Top-K Recommendation Evaluation. (arXiv:2106.10621v1 [cs.IR])
    (2 min) Recently, Rendle has warned that the use of sampling-based top-$k$ metrics might not suffice. This throws a number of recent studies on deep learning-based recommendation algorithms, and classic non-deep-learning algorithms using such a metric, into jeopardy. In this work, we thoroughly investigate the relationship between the sampling and global top-$K$ Hit-Ratio (HR, or Recall), originally proposed by Koren[2] and extensively used by others. By formulating the problem of aligning sampling top-$k$ ($SHR@k$) and global top-$K$ ($HR@K$) Hit-Ratios through a mapping function $f$, so that $SHR@k\approx HR@f(k)$, we demonstrate both theoretically and experimentally that the sampling top-$k$ Hit-Ratio provides an accurate approximation of its global (exact) counterpart, and can consistently predict the correct winners (the same as indicate by their corresponding global Hit-Ratios).
  • cs.LG updates on arXiv.org

    Explaining Inference Queries with Bayesian Optimization. (arXiv:2102.05308v2 [cs.DB] UPDATED)
    (2 min) Obtaining an explanation for an SQL query result can enrich the analysis experience, reveal data errors, and provide deeper insight into the data. Inference query explanation seeks to explain unexpected aggregate query results on inference data; such queries are challenging to explain because an explanation may need to be derived from the source, training, or inference data in an ML pipeline. In this paper, we model an objective function as a black-box function and propose BOExplain, a novel framework for explaining inference queries using Bayesian optimization (BO). An explanation is a predicate defining the input tuples that should be removed so that the query result of interest is significantly affected. BO - a technique for finding the global optimum of a black-box function - is used to find the best predicate. We develop two new techniques (individual contribution encoding and warm start) to handle categorical variables. We perform experiments showing that the predicates found by BOExplain have a higher degree of explanation compared to those found by the state-of-the-art query explanation engines. We also show that BOExplain is effective at deriving explanations for inference queries from source and training data on a variety of real-world datasets. BOExplain is open-sourced as a Python package at https://github.com/sfu-db/BOExplain.
    DiGS : Divergence guided shape implicit neural representation for unoriented point clouds. (arXiv:2106.10811v1 [cs.CV])
    (2 min) Neural shape representations have recently shown to be effective in shape analysis and reconstruction tasks. Existing neural network methods require point coordinates and corresponding normal vectors to learn the implicit level sets of the shape. Normal vectors are often not provided as raw data, therefore, approximation and reorientation are required as pre-processing stages, both of which can introduce noise. In this paper, we propose a divergence guided shape representation learning approach that does not require normal vectors as input. We show that incorporating a soft constraint on the divergence of the distance function favours smooth solutions that reliably orients gradients to match the unknown normal at each point, in some cases even better than approaches that use ground truth normal vectors directly. Additionally, we introduce a novel geometric initialization method for sinusoidal shape representation networks that further improves convergence to the desired solution. We evaluate the effectiveness of our approach on the task of surface reconstruction and show state-of-the-art performance compared to other unoriented methods and on-par performance compared to oriented methods.
    Representations and Strategies for Transferable Machine Learning Models in Chemical Discovery. (arXiv:2106.10768v1 [physics.chem-ph])
    (2 min) Strategies for machine-learning(ML)-accelerated discovery that are general across materials composition spaces are essential, but demonstrations of ML have been primarily limited to narrow composition variations. By addressing the scarcity of data in promising regions of chemical space for challenging targets like open-shell transition-metal complexes, general representations and transferable ML models that leverage known relationships in existing data will accelerate discovery. Over a large set (ca. 1000) of isovalent transition-metal complexes, we quantify evident relationships for different properties (i.e., spin-splitting and ligand dissociation) between rows of the periodic table (i.e., 3d/4d metals and 2p/3p ligands). We demonstrate an extension to graph-based revised autocorrelation (RAC) representation (i.e., eRAC) that incorporates the effective nuclear charge alongside the nuclear charge heuristic that otherwise overestimates dissimilarity of isovalent complexes. To address the common challenge of discovery in a new space where data is limited, we introduce a transfer learning approach in which we seed models trained on a large amount of data from one row of the periodic table with a small number of data points from the additional row. We demonstrate the synergistic value of the eRACs alongside this transfer learning strategy to consistently improve model performance. Analysis of these models highlights how the approach succeeds by reordering the distances between complexes to be more consistent with the periodic table, a property we expect to be broadly useful for other materials domains.
    Artificial Intelligence in the Creative Industries: A Review. (arXiv:2007.12391v5 [cs.CV] UPDATED)
    (2 min) This paper reviews the current state of the art in Artificial Intelligence (AI) technologies and applications in the context of the creative industries. A brief background of AI, and specifically Machine Learning (ML) algorithms, is provided including Convolutional Neural Network (CNNs), Generative Adversarial Networks (GANs), Recurrent Neural Networks (RNNs) and Deep Reinforcement Learning (DRL). We categorise creative applications into five groups related to how AI technologies are used: i) content creation, ii) information analysis, iii) content enhancement and post production workflows, iv) information extraction and enhancement, and v) data compression. We critically examine the successes and limitations of this rapidly advancing technology in each of these areas. We further differentiate between the use of AI as a creative tool and its potential as a creator in its own right. We foresee that, in the near future, machine learning-based AI will be adopted widely as a tool or collaborative assistant for creativity. In contrast, we observe that the successes of machine learning in domains with fewer constraints, where AI is the `creator', remain modest. The potential of AI (or its developers) to win awards for its original creations in competition with human creatives is also limited, based on contemporary technologies. We therefore conclude that, in the context of creative industries, maximum benefit from AI will be derived where its focus is human centric -- where it is designed to augment, rather than replace, human creativity.
    A Nonconvex Framework for Structured Dynamic Covariance Recovery. (arXiv:2011.05601v2 [stat.ML] UPDATED)
    (2 min) We propose a flexible yet interpretable model for high-dimensional data with time-varying second order statistics, motivated and applied to functional neuroimaging data. Motivated by the neuroscience literature, we factorize the covariances into sparse spatial and smooth temporal components. While this factorization results in both parsimony and domain interpretability, the resulting estimation problem is nonconvex. To this end, we design a two-stage optimization scheme with a carefully tailored spectral initialization, combined with iteratively refined alternating projected gradient descent. We prove a linear convergence rate up to a nontrivial statistical error for the proposed descent scheme and establish sample complexity guarantees for the estimator. We further quantify the statistical error for the multivariate Gaussian case. Empirical results using simulated and real brain imaging data illustrate that our approach outperforms existing baselines.
    Improved Generalization Bounds of Group Invariant / Equivariant Deep Networks via Quotient Feature Spaces. (arXiv:1910.06552v3 [stat.ML] UPDATED)
    (2 min) Numerous invariant (or equivariant) neural networks have succeeded in handling invariant data such as point clouds and graphs. However, a generalization theory for the neural networks has not been well developed, because several essential factors for the theory, such as network size and margin distribution, are not deeply connected to the invariance and equivariance. In this study, we develop a novel generalization error bound for invariant and equivariant deep neural networks. To describe the effect of invariance and equivariance on generalization, we develop a notion of a \textit{quotient feature space}, which measures the effect of group actions for the properties. Our main result proves that the volume of quotient feature spaces can describe the generalization error. Furthermore, the bound shows that the invariance and equivariance significantly improve the leading term of the bound. We apply our result to specific invariant and equivariant networks, such as DeepSets (Zaheer et al. (2017)), and show that their generalization bound is considerably improved by $\sqrt{n!}$, where $n!$ is the number of permutations. We also discuss the expressive power of invariant DNNs and show that they can achieve an optimal approximation rate. Our experimental result supports our theoretical claims.
    Estimation of Causal Effects in the Presence of Unobserved Confounding in the Alzheimer's Continuum. (arXiv:2006.13135v4 [stat.ME] UPDATED)
    (2 min) Studying the relationship between neuroanatomy and cognitive decline due to Alzheimer's has been a major research focus in the last decade. However, to infer cause-effect relationships rather than simple associations from observational data, we need to (i) express the causal relationships leading to cognitive decline in a graphical model, and (ii) ensure the causal effect of interest is identifiable from the collected data. We derive a causal graph from the current clinical knowledge on cause and effect in the Alzheimer's disease continuum, and show that identifiability of the causal effect requires all confounders to be known and measured. However, in complex neuroimaging studies, we neither know all potential confounders nor do we have data on them. To alleviate this requirement, we leverage the dependencies among multiple causes by deriving a substitute confounder via a probabilistic latent factor model. In our theoretical analysis, we prove that using the substitute confounder enables identifiability of the causal effect of neuroanatomy on cognition. We quantitatively evaluate the effectiveness of our approach on semi-synthetic data, where we know the true causal effects, and illustrate its use on real data on the Alzheimer's disease continuum, where it reveals important causes that otherwise would have been missed.
    Neural Spectral Marked Point Processes. (arXiv:2106.10773v1 [cs.LG])
    (2 min) Self- and mutually-exciting point processes are popular models in machine learning and statistics for dependent discrete event data. To date, most existing models assume stationary kernels (including the classical Hawkes processes) and simple parametric models. Modern applications with complex event data require more general point process models that can incorporate contextual information of the events, called marks, besides the temporal and location information. Moreover, such applications often require non-stationary models to capture more complex spatio-temporal dependence. To tackle these challenges, a key question is to devise a versatile influence kernel in the point process model. In this paper, we introduce a novel and general neural network-based non-stationary influence kernel with high expressiveness for handling complex discrete events data while providing theoretical performance guarantees. We demonstrate the superior performance of our proposed method compared with the state-of-the-art on synthetic and real data.
    Designing Interpretable Approximations to Deep Reinforcement Learning. (arXiv:2010.14785v2 [cs.LG] UPDATED)
    (2 min) In an ever expanding set of research and application areas, deep neural networks (DNNs) set the bar for algorithm performance. However, depending upon additional constraints such as processing power and execution time limits, or requirements such as verifiable safety guarantees, it may not be feasible to actually use such high-performing DNNs in practice. Many techniques have been developed in recent years to compress or distill complex DNNs into smaller, faster or more understandable models and controllers. This work seeks to identify reduced models that not only preserve a desired performance level, but also, for example, succinctly explain the latent knowledge represented by a DNN. We illustrate the effectiveness of the proposed approach on the evaluation of decision tree variants and kernel machines in the context of benchmark reinforcement learning tasks.
    Right for the Right Concept: Revising Neuro-Symbolic Concepts by Interacting with their Explanations. (arXiv:2011.12854v6 [cs.LG] UPDATED)
    (2 min) Most explanation methods in deep learning map importance estimates for a model's prediction back to the original input space. These "visual" explanations are often insufficient, as the model's actual concept remains elusive. Moreover, without insights into the model's semantic concept, it is difficult -- if not impossible -- to intervene on the model's behavior via its explanations, called Explanatory Interactive Learning. Consequently, we propose to intervene on a Neuro-Symbolic scene representation, which allows one to revise the model on the semantic level, e.g. "never focus on the color to make your decision". We compiled a novel confounded visual scene data set, the CLEVR-Hans data set, capturing complex compositions of different objects. The results of our experiments on CLEVR-Hans demonstrate that our semantic explanations, i.e. compositional explanations at a per-object level, can identify confounders that are not identifiable using "visual" explanations only. More importantly, feedback on this semantic level makes it possible to revise the model from focusing on these factors.
    Two-Faced Humans on Twitter and Facebook: Harvesting Social Multimedia for Human Personality Profiling. (arXiv:2106.10673v1 [cs.SI])
    (2 min) Human personality traits are the key drivers behind our decision-making, influencing our life path on a daily basis. Inference of personality traits, such as Myers-Briggs Personality Type, as well as an understanding of dependencies between personality traits and users' behavior on various social media platforms is of crucial importance to modern research and industry applications. The emergence of diverse and cross-purpose social media avenues makes it possible to perform user personality profiling automatically and efficiently based on data represented across multiple data modalities. However, the research efforts on personality profiling from multi-source multi-modal social media data are relatively sparse, and the level of impact of different social network data on machine learning performance has yet to be comprehensively evaluated. Furthermore, there is not such dataset in the research community to benchmark. This study is one of the first attempts towards bridging such an important research gap. Specifically, in this work, we infer the Myers-Briggs Personality Type indicators, by applying a novel multi-view fusion framework, called "PERS" and comparing the performance results not just across data modalities but also with respect to different social network data sources. Our experimental results demonstrate the PERS's ability to learn from multi-view data for personality profiling by efficiently leveraging on the significantly different data arriving from diverse social multimedia sources. We have also found that the selection of a machine learning approach is of crucial importance when choosing social network data sources and that people tend to reveal multiple facets of their personality in different social media avenues. Our released social multimedia dataset facilitates future research on this direction.
    Fair Bayesian Optimization. (arXiv:2006.05109v3 [stat.ML] UPDATED)
    (2 min) Given the increasing importance of machine learning (ML) in our lives, several algorithmic fairness techniques have been proposed to mitigate biases in the outcomes of the ML models. However, most of these techniques are specialized to cater to a single family of ML models and a specific definition of fairness, limiting their adaptibility in practice. We introduce a general constrained Bayesian optimization (BO) framework to optimize the performance of any ML model while enforcing one or multiple fairness constraints. BO is a model-agnostic optimization method that has been successfully applied to automatically tune the hyperparameters of ML models. We apply BO with fairness constraints to a range of popular models, including random forests, gradient boosting, and neural networks, showing that we can obtain accurate and fair solutions by acting solely on the hyperparameters. We also show empirically that our approach is competitive with specialized techniques that enforce model-specific fairness constraints, and outperforms preprocessing methods that learn fair representations of the input data. Moreover, our method can be used in synergy with such specialized fairness techniques to tune their hyperparameters. Finally, we study the relationship between fairness and the hyperparameters selected by BO. We observe a correlation between regularization and unbiased models, explaining why acting on the hyperparameters leads to ML models that generalize well and are fair.
    Adversarial Attack on Graph Neural Networks as An Influence Maximization Problem. (arXiv:2106.10785v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) have attracted increasing interests. With broad deployments of GNNs in real-world applications, there is an urgent need for understanding the robustness of GNNs under adversarial attacks, especially in realistic setups. In this work, we study the problem of attacking GNNs in a restricted and realistic setup, by perturbing the features of a small set of nodes, with no access to model parameters and model predictions. Our formal analysis draws a connection between this type of attacks and an influence maximization problem on the graph. This connection not only enhances our understanding on the problem of adversarial attack on GNNs, but also allows us to propose a group of effective and practical attack strategies. Our experiments verify that the proposed attack strategies significantly degrade the performance of three popular GNN models and outperform baseline adversarial attack strategies.
    Can Self Reported Symptoms Predict Daily COVID-19 Cases?. (arXiv:2105.08321v2 [cs.LG] UPDATED)
    (3 min) The COVID-19 pandemic has impacted lives and economies across the globe, leading to many deaths. While vaccination is an important intervention, its roll-out is slow and unequal across the globe. Therefore, extensive testing still remains one of the key methods to monitor and contain the virus. Testing on a large scale is expensive and arduous. Hence, we need alternate methods to estimate the number of cases. Online surveys have been shown to be an effective method for data collection amidst the pandemic. In this work, we develop machine learning models to estimate the prevalence of COVID-19 using self-reported symptoms. Our best model predicts the daily cases with a mean absolute error (MAE) of 226.30 (normalized MAE of 27.09%) per state, which demonstrates the possibility of predicting the actual number of confirmed cases by utilizing self-reported symptoms. The models are developed at two levels of data granularity - local models, which are trained at the state level, and a single global model which is trained on the combined data aggregated across all states. Our results indicate a lower error on the local models as opposed to the global model. In addition, we also show that the most important symptoms (features) vary considerably from state to state. This work demonstrates that the models developed on crowd-sourced data, curated via online platforms, can complement the existing epidemiological surveillance infrastructure in a cost-effective manner. The code is publicly available at https://github.com/parthpatwa/Can-Self-Reported-Symptoms-Predict-Daily-COVID-19-Cases.
    Firefly Neural Architecture Descent: a General Approach for Growing Neural Networks. (arXiv:2102.08574v2 [cs.LG] UPDATED)
    (2 min) We propose firefly neural architecture descent, a general framework for progressively and dynamically growing neural networks to jointly optimize the networks' parameters and architectures. Our method works in a steepest descent fashion, which iteratively finds the best network within a functional neighborhood of the original network that includes a diverse set of candidate network structures. By using Taylor approximation, the optimal network structure in the neighborhood can be found with a greedy selection procedure. We show that firefly descent can flexibly grow networks both wider and deeper, and can be applied to learn accurate but resource-efficient neural architectures that avoid catastrophic forgetting in continual learning. Empirically, firefly descent achieves promising results on both neural architecture search and continual learning. In particular, on a challenging continual image classification task, it learns networks that are smaller in size but have higher average accuracy than those learned by the state-of-the-art methods.
    A compressive multi-kernel method for privacy-preserving machine learning. (arXiv:2106.10671v1 [cs.LG])
    (2 min) As the analytic tools become more powerful, and more data are generated on a daily basis, the issue of data privacy arises. This leads to the study of the design of privacy-preserving machine learning algorithms. Given two objectives, namely, utility maximization and privacy-loss minimization, this work is based on two previously non-intersecting regimes -- Compressive Privacy and multi-kernel method. Compressive Privacy is a privacy framework that employs utility-preserving lossy-encoding scheme to protect the privacy of the data, while multi-kernel method is a kernel based machine learning regime that explores the idea of using multiple kernels for building better predictors. The compressive multi-kernel method proposed consists of two stages -- the compression stage and the multi-kernel stage. The compression stage follows the Compressive Privacy paradigm to provide the desired privacy protection. Each kernel matrix is compressed with a lossy projection matrix derived from the Discriminant Component Analysis (DCA). The multi-kernel stage uses the signal-to-noise ratio (SNR) score of each kernel to non-uniformly combine multiple compressive kernels. The proposed method is evaluated on two mobile-sensing datasets -- MHEALTH and HAR -- where activity recognition is defined as utility and person identification is defined as privacy. The results show that the compression regime is successful in privacy preservation as the privacy classification accuracies are almost at the random-guess level in all experiments. On the other hand, the novel SNR-based multi-kernel shows utility classification accuracy improvement upon the state-of-the-art in both datasets. These results indicate a promising direction for research in privacy-preserving machine learning.
    Ensemble of MRR and NDCG models for Visual Dialog. (arXiv:2104.07511v2 [cs.AI] UPDATED)
    (2 min) Assessing an AI agent that can converse in human language and understand visual content is challenging. Generation metrics, such as BLEU scores favor correct syntax over semantics. Hence a discriminative approach is often used, where an agent ranks a set of candidate options. The mean reciprocal rank (MRR) metric evaluates the model performance by taking into account the rank of a single human-derived answer. This approach, however, raises a new challenge: the ambiguity and synonymy of answers, for instance, semantic equivalence (e.g., `yeah' and `yes'). To address this, the normalized discounted cumulative gain (NDCG) metric has been used to capture the relevance of all the correct answers via dense annotations. However, the NDCG metric favors the usually applicable uncertain answers such as `I don't know. Crafting a model that excels on both MRR and NDCG metrics is challenging. Ideally, an AI agent should answer a human-like reply and validate the correctness of any answer. To address this issue, we describe a two-step non-parametric ranking approach that can merge strong MRR and NDCG models. Using our approach, we manage to keep most MRR state-of-the-art performance (70.41% vs. 71.24%) and the NDCG state-of-the-art performance (72.16% vs. 75.35%). Moreover, our approach won the recent Visual Dialog 2020 challenge. Source code is available at https://github.com/idansc/mrr-ndcg.
    Whole MILC: generalizing learned dynamics across tasks, datasets, and populations. (arXiv:2007.16041v2 [cs.LG] UPDATED)
    (2 min) Behavioral changes are the earliest signs of a mental disorder, but arguably, the dynamics of brain function gets affected even earlier. Subsequently, spatio-temporal structure of disorder-specific dynamics is crucial for early diagnosis and understanding the disorder mechanism. A common way of learning discriminatory features relies on training a classifier and evaluating feature importance. Classical classifiers, based on handcrafted features are quite powerful, but suffer the curse of dimensionality when applied to large input dimensions of spatio-temporal data. Deep learning algorithms could handle the problem and a model introspection could highlight discriminatory spatio-temporal regions but need way more samples to train. In this paper we present a novel self supervised training schema which reinforces whole sequence mutual information local to context (whole MILC). We pre-train the whole MILC model on unlabeled and unrelated healthy control data. We test our model on three different disorders (i) Schizophrenia (ii) Autism and (iii) Alzheimers and four different studies. Our algorithm outperforms existing self-supervised pre-training methods and provides competitive classification results to classical machine learning algorithms. Importantly, whole MILC enables attribution of subject diagnosis to specific spatio-temporal regions in the fMRI signal.
    RetiNerveNet: Using Recursive Deep Learning to Estimate Pointwise 24-2 Visual Field Data based on Retinal Structure. (arXiv:2010.07488v2 [cs.LG] UPDATED)
    (2 min) Glaucoma is the leading cause of irreversible blindness in the world, affecting over 70 million people. The cumbersome Standard Automated Perimetry (SAP) test is most frequently used to detect visual loss due to glaucoma. Due to the SAP test's innate difficulty and its high test-retest variability, we propose the RetiNerveNet, a deep convolutional recursive neural network for obtaining estimates of the SAP visual field. RetiNerveNet uses information from the more objective Spectral-Domain Optical Coherence Tomography (SDOCT). RetiNerveNet attempts to trace-back the arcuate convergence of the retinal nerve fibers, starting from the Retinal Nerve Fiber Layer (RNFL) thickness around the optic disc, to estimate individual age-corrected 24-2 SAP values. Recursive passes through the proposed network sequentially yield estimates of the visual locations progressively farther from the optic disc. While all the methods used for our experiments exhibit lower performance for the advanced disease group, the proposed network is observed to be more accurate than all the baselines for estimating the individual visual field values. We further augment RetiNerveNet to additionally predict the SAP Mean Deviation values and also create an ensemble of RetiNerveNets that further improves the performance, by increasingly weighting-up underrepresented parts of the training data.
    Capturing Label Characteristics in VAEs. (arXiv:2006.10102v2 [cs.LG] UPDATED)
    (2 min) We present a principled approach to incorporating labels in VAEs that captures the rich characteristic information associated with those labels. While prior work has typically conflated these by learning latent variables that directly correspond to label values, we argue this is contrary to the intended effect of supervision in VAEs-capturing rich label characteristics with the latents. For example, we may want to capture the characteristics of a face that make it look young, rather than just the age of the person. To this end, we develop the CCVAE, a novel VAE model and concomitant variational objective which captures label characteristics explicitly in the latent space, eschewing direct correspondences between label values and latents. Through judicious structuring of mappings between such characteristic latents and labels, we show that the CCVAE can effectively learn meaningful representations of the characteristics of interest across a variety of supervision schemes. In particular, we show that the CCVAE allows for more effective and more general interventions to be performed, such as smooth traversals within the characteristics for a given label, diverse conditional generation, and transferring characteristics across datapoints.
    Algorithmic Instabilities of Accelerated Gradient Descent. (arXiv:2102.02167v2 [cs.LG] UPDATED)
    (2 min) We study the algorithmic stability of Nesterov's accelerated gradient method. For convex quadratic objectives, Chen et al. (2018) proved that the uniform stability of the method grows quadratically with the number of optimization steps, and conjectured that the same is true for the general convex and smooth case. We disprove this conjecture and show, for two notions of algorithmic stability (including uniform stability), that the stability of Nesterov's accelerated method in fact deteriorates exponentially fast with the number of gradient steps. This stands in sharp contrast to the bounds in the quadratic case, but also to known results for non-accelerated gradient methods where stability typically grows linearly with the number of steps.
    Active Learning for Deep Neural Networks on Edge Devices. (arXiv:2106.10836v1 [cs.LG])
    (2 min) When dealing with deep neural network (DNN) applications on edge devices, continuously updating the model is important. Although updating a model with real incoming data is ideal, using all of them is not always feasible due to limits, such as labeling and communication costs. Thus, it is necessary to filter and select the data to use for training (i.e., active learning) on the device. In this paper, we formalize a practical active learning problem for DNNs on edge devices and propose a general task-agnostic framework to tackle this problem, which reduces it to a stream submodular maximization. This framework is light enough to be run with low computational resources, yet provides solutions whose quality is theoretically guaranteed thanks to the submodular property. Through this framework, we can configure data selection criteria flexibly, including using methods proposed in previous active learning studies. We evaluate our approach on both classification and object detection tasks in a practical setting to simulate a real-life scenario. The results of our study show that the proposed framework outperforms all other methods in both tasks, while running at a practical speed on real devices.
    Neural Network Libraries: A Deep Learning Framework Designed from Engineers' Perspectives. (arXiv:2102.06725v2 [cs.LG] UPDATED)
    (2 min) While there exist a plethora of deep learning tools and frameworks, the fast-growing complexity of the field brings new demands and challenges, such as more flexible network design, speedy computation on distributed setting, and compatibility between different tools. In this paper, we introduce Neural Network Libraries (https://nnabla.org), a deep learning framework designed from engineer's perspective, with emphasis on usability and compatibility as its core design principles. We elaborate on each of our design principles and its merits, and validate our attempts via experiments.
    Fairness in Credit Scoring: Assessment, Implementation and Profit Implications. (arXiv:2103.01907v3 [stat.ML] UPDATED)
    (2 min) The rise of algorithmic decision-making has spawned much research on fair machine learning (ML). Financial institutions use ML for building risk scorecards that support a range of credit-related decisions. Yet, the literature on fair ML in credit scoring is scarce. The paper makes three contributions. First, we revisit statistical fairness criteria and examine their adequacy for credit scoring. Second, we catalog algorithmic options for incorporating fairness goals in the ML model development pipeline. Last, we empirically compare different fairness processors in a profit-oriented credit scoring context using real-world data. The empirical results substantiate the evaluation of fairness measures, identify suitable options to implement fair credit scoring, and clarify the profit-fairness trade-off in lending decisions. We find that multiple fairness criteria can be approximately satisfied at once and recommend separation as a proper criterion for measuring the fairness of a scorecard. We also find fair in-processors to deliver a good balance between profit and fairness and show that algorithmic discrimination can be reduced to a reasonable level at a relatively low cost. The codes corresponding to the paper are available on GitHub.
    Adaptive-Control-Oriented Meta-Learning for Nonlinear Systems. (arXiv:2103.04490v2 [cs.RO] UPDATED)
    (2 min) Real-time adaptation is imperative to the control of robots operating in complex, dynamic environments. Adaptive control laws can endow even nonlinear systems with good trajectory tracking performance, provided that any uncertain dynamics terms are linearly parameterizable with known nonlinear features. However, it is often difficult to specify such features a priori, such as for aerodynamic disturbances on rotorcraft or interaction forces between a manipulator arm and various objects. In this paper, we turn to data-driven modeling with neural networks to learn, offline from past data, an adaptive controller with an internal parametric model of these nonlinear features. Our key insight is that we can better prepare the controller for deployment with control-oriented meta-learning of features in closed-loop simulation, rather than regression-oriented meta-learning of features to fit input-output data. Specifically, we meta-learn the adaptive controller with closed-loop tracking simulation as the base-learner and the average tracking error as the meta-objective. With a nonlinear planar rotorcraft subject to wind, we demonstrate that our adaptive controller outperforms other controllers trained with regression-oriented meta-learning when deployed in closed-loop for trajectory tracking control.
    Domain Invariant Adversarial Learning. (arXiv:2104.00322v2 [cs.LG] UPDATED)
    (2 min) The phenomenon of adversarial examples illustrates one of the most basic vulnerabilities of deep neural networks. Among the variety of techniques introduced to surmount this inherent weakness, adversarial training has emerged as the most common and efficient strategy to achieve robustness. Typically, this is achieved by balancing robust and natural objectives. In this work, we aim to achieve better trade-off between robust and natural performances by enforcing a domain-invariant feature representation. We present a new adversarial training method, Domain Invariant Adversarial Learning (DIAL), which learns a feature representation which is both robust and domain invariant. DIAL uses a variant of Domain Adversarial Neural Network (DANN) on the natural domain and its corresponding adversarial domain. In a case where the source domain consists of natural examples and the target domain is the adversarially perturbed examples, our method learns a feature representation constrained not to discriminate between the natural and adversarial examples, and can therefore achieve a more robust representation. Our experiments indicate that our method improves both robustness and natural accuracy, when compared to current state-of-the-art adversarial training methods.
    Using Shape to Categorize: Low-Shot Learning with an Explicit Shape Bias. (arXiv:2101.07296v2 [cs.CV] UPDATED)
    (2 min) It is widely accepted that reasoning about object shape is important for object recognition. However, the most powerful object recognition methods today do not explicitly make use of object shape during learning. In this work, motivated by recent developments in low-shot learning, findings in developmental psychology, and the increased use of synthetic data in computer vision research, we investigate how reasoning about 3D shape can be used to improve low-shot learning methods' generalization performance. We propose a new way to improve existing low-shot learning approaches by learning a discriminative embedding space using 3D object shape, and using this embedding by learning how to map images into it. Our new approach improves the performance of image-only low-shot learning approaches on multiple datasets. We also introduce Toys4K, a 3D object dataset with the largest number of object categories currently available, which supports low-shot learning.
    TDA-Net: Fusion of Persistent Homology and Deep Learning Features for COVID-19 Detection in Chest X-Ray Images. (arXiv:2101.08398v2 [cs.CV] UPDATED)
    (2 min) Topological Data Analysis (TDA) has emerged recently as a robust tool to extract and compare the structure of datasets. TDA identifies features in data such as connected components and holes and assigns a quantitative measure to these features. Several studies reported that topological features extracted by TDA tools provide unique information about the data, discover new insights, and determine which feature is more related to the outcome. On the other hand, the overwhelming success of deep neural networks in learning patterns and relationships has been proven on a vast array of data applications, images in particular. To capture the characteristics of both powerful tools, we propose \textit{TDA-Net}, a novel ensemble network that fuses topological and deep features for the purpose of enhancing model generalizability and accuracy. We apply the proposed \textit{TDA-Net} to a critical application, which is the automated detection of COVID-19 from CXR images. The experimental results showed that the proposed network achieved excellent performance and suggests the applicability of our method in practice.
    Variance-Dependent Best Arm Identification. (arXiv:2106.10417v1 [cs.LG])
    (2 min) We study the problem of identifying the best arm in a stochastic multi-armed bandit game. Given a set of $n$ arms indexed from $1$ to $n$, each arm $i$ is associated with an unknown reward distribution supported on $[0,1]$ with mean $\theta_i$ and variance $\sigma_i^2$. Assume $\theta_1 > \theta_2 \geq \cdots \geq\theta_n$. We propose an adaptive algorithm which explores the gaps and variances of the rewards of the arms and makes future decisions based on the gathered information using a novel approach called \textit{grouped median elimination}. The proposed algorithm guarantees to output the best arm with probability $(1-\delta)$ and uses at most $O \left(\sum_{i = 1}^n \left(\frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i}\right)(\ln \delta^{-1} + \ln \ln \Delta_i^{-1})\right)$ samples, where $\Delta_i$ ($i \geq 2$) denotes the reward gap between arm $i$ and the best arm and we define $\Delta_1 = \Delta_2$. This achieves a significant advantage over the variance-independent algorithms in some favorable scenarios and is the first result that removes the extra $\ln n$ factor on the best arm compared with the state-of-the-art. We further show that $\Omega \left( \sum_{i = 1}^n \left( \frac{\sigma_i^2}{\Delta_i^2} + \frac{1}{\Delta_i} \right) \ln \delta^{-1} \right)$ samples are necessary for an algorithm to achieve the same goal, thereby illustrating that our algorithm is optimal up to doubly logarithmic terms.
    Order in the Court: Explainable AI Methods Prone to Disagreement. (arXiv:2105.03287v2 [cs.LG] UPDATED)
    (2 min) By computing the rank correlation between attention weights and feature-additive explanation methods, previous analyses either invalidate or support the role of attention-based explanations as a faithful and plausible measure of salience. To investigate whether this approach is appropriate, we compare LIME, Integrated Gradients, DeepLIFT, Grad-SHAP, Deep-SHAP, and attention-based explanations, applied to two neural architectures trained on single- and pair-sequence language tasks. In most cases, we find that none of our chosen methods agree. Based on our empirical observations and theoretical objections, we conclude that rank correlation does not measure the quality of feature-additive methods. Practitioners should instead use the numerous and rigorous diagnostic methods proposed by the community.
    DPlis: Boosting Utility of Differentially Private Deep Learning via Randomized Smoothing. (arXiv:2103.01496v2 [cs.LG] UPDATED)
    (2 min) Deep learning techniques have achieved remarkable performance in wide-ranging tasks. However, when trained on privacy-sensitive datasets, the model parameters may expose private information in training data. Prior attempts for differentially private training, although offering rigorous privacy guarantees, lead to much lower model performance than the non-private ones. Besides, different runs of the same training algorithm produce models with large performance variance. To address these issues, we propose DPlis--Differentially Private Learning wIth Smoothing. The core idea of DPlis is to construct a smooth loss function that favors noise-resilient models lying in large flat regions of the loss landscape. We provide theoretical justification for the utility improvements of DPlis. Extensive experiments also demonstrate that DPlis can effectively boost model quality and training stability under a given privacy budget.
    A Topological Framework for Deep Learning. (arXiv:2008.13697v13 [cs.LG] UPDATED)
    (2 min) We utilize classical facts from topology to show that the classification problem in machine learning is always solvable under very mild conditions. Furthermore, we show that a softmax classification network acts on an input topological space by a finite sequence of topological moves to achieve the classification task. Moreover, given a training dataset, we show how topological formalism can be used to suggest the appropriate architectural choices for neural networks designed to be trained as classifiers on the data. Finally, we show how the architecture of a neural network cannot be chosen independently from the shape of the underlying data. To demonstrate these results, we provide example datasets and show how they are acted upon by neural nets from this topological perspective.
    Automatic heterogeneous quantization of deep neural networks for low-latency inference on the edge for particle detectors. (arXiv:2006.10159v3 [physics.ins-det] UPDATED)
    (2 min) Although the quest for more accurate solutions is pushing deep learning research towards larger and more complex algorithms, edge devices demand efficient inference and therefore reduction in model size, latency and energy consumption. One technique to limit model size is quantization, which implies using fewer bits to represent weights and biases. Such an approach usually results in a decline in performance. Here, we introduce a method for designing optimally heterogeneously quantized versions of deep neural network models for minimum-energy, high-accuracy, nanosecond inference and fully automated deployment on chip. With a per-layer, per-parameter type automatic quantization procedure, sampling from a wide range of quantizers, model energy consumption and size are minimized while high accuracy is maintained. This is crucial for the event selection procedure in proton-proton collisions at the CERN Large Hadron Collider, where resources are strictly limited and a latency of ${\mathcal O}(1)~\mu$s is required. Nanosecond inference and a resource consumption reduced by a factor of 50 when implemented on field-programmable gate array hardware are achieved.
    Plant Disease Detection Using Image Processing and Machine Learning. (arXiv:2106.10698v1 [cs.CV])
    (2 min) One of the important and tedious task in agricultural practices is the detection of the disease on crops. It requires huge time as well as skilled labor. This paper proposes a smart and efficient technique for detection of crop disease which uses computer vision and machine learning techniques. The proposed system is able to detect 20 different diseases of 5 common plants with 93% accuracy.
    The Power of the Weisfeiler-Leman Algorithm for Machine Learning with Graphs. (arXiv:2105.05911v2 [cs.LG] UPDATED)
    (2 min) In recent years, algorithms and neural architectures based on the Weisfeiler-Leman algorithm, a well-known heuristic for the graph isomorphism problem, emerged as a powerful tool for (supervised) machine learning with graphs and relational data. Here, we give a comprehensive overview of the algorithm's use in a machine learning setting. We discuss the theoretical background, show how to use it for supervised graph- and node classification, discuss recent extensions, and its connection to neural architectures. Moreover, we give an overview of current applications and future directions to stimulate research.
    This Looks Like That... Does it? Shortcomings of Latent Space Prototype Interpretability in Deep Networks. (arXiv:2105.02968v3 [cs.CV] UPDATED)
    (2 min) Deep neural networks that yield human interpretable decisions by architectural design have lately become an increasingly popular alternative to post hoc interpretation of traditional black-box models. Among these networks, the arguably most widespread approach is so-called prototype learning, where similarities to learned latent prototypes serve as the basis of classifying an unseen data point. In this work, we point to an important shortcoming of such approaches. Namely, there is a semantic gap between similarity in latent space and similarity in input space, which can corrupt interpretability. We design two experiments that exemplify this issue on the so-called ProtoPNet. Specifically, we find that this network's interpretability mechanism can be led astray by intentionally crafted or even JPEG compression artefacts, which can produce incomprehensible decisions. We argue that practitioners ought to have this shortcoming in mind when deploying prototype-based models in practice.
    Compressive Sensing and Neural Networks from a Statistical Learning Perspective. (arXiv:2010.15658v3 [math.ST] UPDATED)
    (2 min) Various iterative reconstruction algorithms for inverse problems can be unfolded as neural networks. Empirically, this approach has often led to improved results, but theoretical guarantees are still scarce. While some progress on generalization properties of neural networks have been made, great challenges remain. In this chapter, we discuss and combine these topics to present a generalization error analysis for a class of neural networks suitable for sparse reconstruction from few linear measurements. The hypothesis class considered is inspired by the classical iterative soft-thresholding algorithm (ISTA). The neural networks in this class are obtained by unfolding iterations of ISTA and learning some of the weights. Based on training samples, we aim at learning the optimal network parameters via empirical risk minimization and thereby the optimal network that reconstructs signals from their compressive linear measurements. In particular, we may learn a sparsity basis that is shared by all of the iterations/layers and thereby obtain a new approach for dictionary learning. For this class of networks, we present a generalization bound, which is based on bounding the Rademacher complexity of hypothesis classes consisting of such deep networks via Dudley's integral. Remarkably, under realistic conditions, the generalization error scales only logarithmically in the number of layers, and at most linear in number of measurements.
    Benchmarking Perturbation-based Saliency Maps for Explaining Atari Agents. (arXiv:2101.07312v2 [cs.LG] UPDATED)
    (2 min) Recent years saw a plethora of work on explaining complex intelligent agents. One example is the development of several algorithms that generate saliency maps which show how much each pixel attributed to the agents' decision. However, most evaluations of such saliency maps focus on image classification tasks. As far as we know, there is no work that thoroughly compares different saliency maps for Deep Reinforcement Learning agents. This paper compares four perturbation-based approaches to create saliency maps for Deep Reinforcement Learning agents trained on four different Atari 2600 games. All four approaches work by perturbing parts of the input and measuring how much this affects the agent's output. The approaches are compared using three computational metrics: dependence on the learned parameters of the agent (sanity checks), faithfulness to the agent's reasoning (input degradation), and run-time. In particular, during the sanity checks we find issues with two approaches and propose a solution to fix one of those issues.
    Cross-Modal learning for Audio-Visual Video Parsing. (arXiv:2104.04598v2 [cs.SD] UPDATED)
    (2 min) In this paper, we present a novel approach to the audio-visual video parsing (AVVP) task that demarcates events from a video separately for audio and visual modalities. The proposed parsing approach simultaneously detects the temporal boundaries in terms of start and end times of such events. We show how AVVP can benefit from the following techniques geared towards effective cross-modal learning: (i) adversarial training and skip connections (ii) global context aware attention and, (iii) self-supervised pretraining using an audio-video grounding objective to obtain cross-modal audio-video representations. We present extensive experimental evaluations on the Look, Listen, and Parse (LLP) dataset and show that we outperform the state-of-the-art Hybrid Attention Network (HAN) on all five metrics proposed for AVVP. We also present several ablations to validate the effect of pretraining, global attention and adversarial training.
    A non-alternating graph hashing algorithm for large scale image search. (arXiv:2012.13138v2 [cs.CV] UPDATED)
    (2 min) In the era of big data, methods for improving memory and computational efficiency have become crucial for successful deployment of technologies. Hashing is one of the most effective approaches to deal with computational limitations that come with big data. One natural way for formulating this problem is spectral hashing that directly incorporates affinity to learn binary codes. However, due to binary constraints, the optimization becomes intractable. To mitigate this challenge, different relaxation approaches have been proposed to reduce the computational load of obtaining binary codes and still attain a good solution. The problem with all existing relaxation methods is resorting to one or more additional auxiliary variables to attain high quality binary codes while relaxing the problem. The existence of auxiliary variables leads to coordinate descent approach which increases the computational complexity. We argue that introducing these variables is unnecessary. To this end, we propose a novel relaxed formulation for spectral hashing that adds no additional variables to the problem. Furthermore, instead of solving the problem in original space where number of variables is equal to the data points, we solve the problem in a much smaller space and retrieve the binary codes from this solution. This trick reduces both the memory and computational complexity at the same time. We apply two optimization techniques, namely projected gradient and optimization on manifold, to obtain the solution. Using comprehensive experiments on four public datasets, we show that the proposed efficient spectral hashing (ESH) algorithm achieves highly competitive retrieval performance compared with state of the art at low complexity.
    Revisiting Model's Uncertainty and Confidences for Adversarial Example Detection. (arXiv:2103.05354v2 [cs.CR] UPDATED)
    (2 min) Security-sensitive applications that rely on Deep Neural Networks (DNNs) are vulnerable to small perturbations that are crafted to generate Adversarial Examples(AEs). The AEs are imperceptible to humans and cause DNN to misclassify them. Many defense and detection techniques have been proposed. Model's confidences and Dropout, as a popular way to estimate the model's uncertainty, have been used for AE detection but they showed limited success against black- and gray-box attacks. Moreover, the state-of-the-art detection techniques have been designed for specific attacks or broken by others, need knowledge about the attacks, are not consistent, increase model parameters overhead, are time-consuming, or have latency in inference time. To trade off these factors, we revisit the model's uncertainty and confidences and propose a novel unsupervised ensemble AE detection mechanism that 1) uses the uncertainty method called SelectiveNet, 2) processes model layers outputs, i.e.feature maps, to generate new confidence probabilities. The detection method is called Selective and Feature based Adversarial Detection (SFAD). Experimental results show that the proposed approach achieves better performance against black- and gray-box attacks than the state-of-the-art methods and achieves comparable performance against white-box attacks. Moreover, results show that SFAD is fully robust against High Confidence Attacks (HCAs) for MNIST and partially robust for CIFAR10 datasets.
    Deep splitting method for parabolic PDEs. (arXiv:1907.03452v2 [math.NA] UPDATED)
    (2 min) In this paper we introduce a numerical method for nonlinear parabolic PDEs that combines operator splitting with deep learning. It divides the PDE approximation problem into a sequence of separate learning problems. Since the computational graph for each of the subproblems is comparatively small, the approach can handle extremely high-dimensional PDEs. We test the method on different examples from physics, stochastic control and mathematical finance. In all cases, it yields very good results in up to 10,000 dimensions with short run times.
    Predicting Critical Nodes in Temporal Networks by Dynamic Graph Convolutional Networks. (arXiv:2106.10419v1 [cs.SI])
    (2 min) Many real-world systems can be expressed in temporal networks with nodes playing far different roles in structure and function and edges representing the relationships between nodes. Identifying critical nodes can help us control the spread of public opinions or epidemics, predict leading figures in academia, conduct advertisements for various commodities, and so on. However, it is rather difficult to identify critical nodes because the network structure changes over time in temporal networks. In this paper, considering the sequence topological information of temporal networks, a novel and effective learning framework based on the combination of special GCNs and RNNs is proposed to identify nodes with the best spreading ability. The effectiveness of the approach is evaluated by weighted Susceptible-Infected-Recovered model. Experimental results on four real-world temporal networks demonstrate that the proposed method outperforms both traditional and deep learning benchmark methods in terms of the Kendall $\tau$ coefficient and top $k$ hit rate.
    Piano Skills Assessment. (arXiv:2101.04884v2 [cs.CV] UPDATED)
    (2 min) Can a computer determine a piano player's skill level? Is it preferable to base this assessment on visual analysis of the player's performance or should we trust our ears over our eyes? Since current CNNs have difficulty processing long video videos, how can shorter clips be sampled to best reflect the players skill level? In this work, we collect and release a first-of-its-kind dataset for multimodal skill assessment focusing on assessing piano player's skill level, answer the asked questions, initiate work in automated evaluation of piano playing skills and provide baselines for future work. Dataset is available from: https://github.com/ParitoshParmar/Piano-Skills-Assessment.
    Image Captioning as an Assistive Technology: Lessons Learned from VizWiz 2020 Challenge. (arXiv:2012.11696v2 [cs.CV] UPDATED)
    (2 min) Image captioning has recently demonstrated impressive progress largely owing to the introduction of neural network algorithms trained on curated dataset like MS-COCO. Often work in this field is motivated by the promise of deployment of captioning systems in practical applications. However, the scarcity of data and contexts in many competition datasets renders the utility of systems trained on these datasets limited as an assistive technology in real-world settings, such as helping visually impaired people navigate and accomplish everyday tasks. This gap motivated the introduction of the novel VizWiz dataset, which consists of images taken by the visually impaired and captions that have useful, task-oriented information. In an attempt to help the machine learning computer vision field realize its promise of producing technologies that have positive social impact, the curators of the VizWiz dataset host several competitions, including one for image captioning. This work details the theory and engineering from our winning submission to the 2020 captioning competition. Our work provides a step towards improved assistive image captioning systems.
    Smooth Exploration for Robotic Reinforcement Learning. (arXiv:2005.05719v2 [cs.LG] UPDATED)
    (2 min) Reinforcement learning (RL) enables robots to learn skills from interactions with the real world. In practice, the unstructured step-based exploration used in Deep RL -- often very successful in simulation -- leads to jerky motion patterns on real robots. Consequences of the resulting shaky behavior are poor exploration, or even damage to the robot. We address these issues by adapting state-dependent exploration (SDE) to current Deep RL algorithms. To enable this adaptation, we propose two extensions to the original SDE, using more general features and re-sampling the noise periodically, which leads to a new exploration method generalized state-dependent exploration (gSDE). We evaluate gSDE both in simulation, on PyBullet continuous control tasks, and directly on three different real robots: a tendon-driven elastic robot, a quadruped and an RC car. The noise sampling interval of gSDE permits to have a compromise between performance and smoothness, which allows training directly on the real robots without loss of performance. The code is available at https://github.com/DLR-RM/stable-baselines3.
    PIVEN: A Deep Neural Network for Prediction Intervals with Specific Value Prediction. (arXiv:2006.05139v3 [cs.LG] UPDATED)
    (2 min) Improving the robustness of neural nets in regression tasks is key to their application in multiple domains. Deep learning-based approaches aim to achieve this goal either by improving their prediction of specific values (i.e., point prediction), or by producing prediction intervals (PIs) that quantify uncertainty. We present PIVEN, a deep neural network for producing both a PI and a value prediction. Our loss function expresses the value prediction as a function of the upper and lower bounds, thus ensuring that it falls within the interval without increasing model complexity. Moreover, our approach makes no assumptions regarding data distribution within the PI, making its value prediction more effective for various real-world problems. Experiments and ablation tests on known benchmarks show that our approach produces tighter uncertainty bounds than the current state-of-the-art approaches for producing PIs, while maintaining comparable performance to the state-of-the-art approach for value-prediction. Additionally, we go beyond previous work and include large image datasets in our evaluation, where PIVEN is combined with modern neural nets.
    Approximation in shift-invariant spaces with deep ReLU neural networks. (arXiv:2005.11949v2 [cs.LG] UPDATED)
    (2 min) We study the expressive power of deep ReLU neural networks for approximating functions in dilated shift-invariant spaces, which are widely used in signal processing, image processing, communications and so on. Approximation error bounds are estimated with respect to the width and depth of neural networks. The network construction is based on the bit extraction and data-fitting capacity of deep neural networks. As applications of our main results, the approximation rates of classical function spaces such as Sobolev spaces and Besov spaces are obtained. We also give lower bounds of the $L^p (1\le p \le \infty)$ approximation error for Sobolev spaces, which show that our construction of neural network is asymptotically optimal up to a logarithmic factor.
    Pointwise Binary Classification with Pairwise Confidence Comparisons. (arXiv:2010.01875v3 [cs.LG] UPDATED)
    (2 min) To alleviate the data requirement for training effective binary classifiers in binary classification, many weakly supervised learning settings have been proposed. Among them, some consider using pairwise but not pointwise labels, when pointwise labels are not accessible due to privacy, confidentiality, or security reasons. However, as a pairwise label denotes whether or not two data points share a pointwise label, it cannot be easily collected if either point is equally likely to be positive or negative. Thus, in this paper, we propose a novel setting called pairwise comparison (Pcomp) classification, where we have only pairs of unlabeled data that we know one is more likely to be positive than the other. Firstly, we give a Pcomp data generation process, derive an unbiased risk estimator (URE) with theoretical guarantee, and further improve URE using correction functions. Secondly, we link Pcomp classification to noisy-label learning to develop a progressive URE and improve it by imposing consistency regularization. Finally, we demonstrate by experiments the effectiveness of our methods, which suggests Pcomp is a valuable and practically useful type of pairwise supervision besides the pairwise label.
    Efficient Urdu Caption Generation using Attention based LSTM. (arXiv:2008.01663v4 [cs.CL] UPDATED)
    (2 min) Recent advancements in deep learning have created many opportunities to solve real-world problems that remained unsolved for more than a decade. Automatic caption generation is a major research field, and the research community has done a lot of work on it in most common languages like English. Urdu is the national language of Pakistan and also much spoken and understood in the sub-continent region of Pakistan-India, and yet no work has been done for Urdu language caption generation. Our research aims to fill this gap by developing an attention-based deep learning model using techniques of sequence modeling specialized for the Urdu language. We have prepared a dataset in the Urdu language by translating a subset of the "Flickr8k" dataset containing 700 'man' images. We evaluate our proposed technique on this dataset and show that it can achieve a BLEU score of 0.83 in the Urdu language. We improve on the previous state-of-the-art by using better CNN architectures and optimization techniques. Furthermore, we provide a discussion on how the generated captions can be made correct grammar-wise.
    On Stein Variational Neural Network Ensembles. (arXiv:2106.10760v1 [cs.LG])
    (2 min) Ensembles of deep neural networks have achieved great success recently, but they do not offer a proper Bayesian justification. Moreover, while they allow for averaging of predictions over several hypotheses, they do not provide any guarantees for their diversity, leading to redundant solutions in function space. In contrast, particle-based inference methods, such as Stein variational gradient descent (SVGD), offer a Bayesian framework, but rely on the choice of a kernel to measure the similarity between ensemble members. In this work, we study different SVGD methods operating in the weight space, function space, and in a hybrid setting. % Defining the kernel directly on the neural network functions seems promising to overcome the limitations of deep ensembles. % However, ensuring diversity in function space while maintaining SVGD's theoretical guarantees is not trivial. % In this work, we provide an overview over different ensembling and SVGD methods in weight space and function space and propose new and assess their theoretical and empirical properties on synthetic and real-world tasks. We compare the SVGD approaches to other ensembling-based methods in terms of their theoretical properties and assess their empirical performance on synthetic and real-world tasks. We find that SVGD using functional and hybrid kernels can overcome the limitations of deep ensembles. It improves on functional diversity and uncertainty estimation and approaches the true Bayesian posterior more closely. Moreover, we show that using stochastic SVGD updates, as opposed to the standard deterministic ones, can further improve the performance.
    IG-RL: Inductive Graph Reinforcement Learning for Massive-Scale Traffic Signal Control. (arXiv:2003.05738v5 [cs.LG] UPDATED)
    (2 min) Scaling adaptive traffic-signal control involves dealing with combinatorial state and action spaces. Multi-agent reinforcement learning attempts to address this challenge by distributing control to specialized agents. However, specialization hinders generalization and transferability, and the computational graphs underlying neural-networks architectures -- dominating in the multi-agent setting -- do not offer the flexibility to handle an arbitrary number of entities which changes both between road networks, and over time as vehicles traverse the network. We introduce Inductive Graph Reinforcement Learning (IG-RL) based on graph-convolutional networks which adapts to the structure of any road network, to learn detailed representations of traffic-controllers and their surroundings. Our decentralized approach enables learning of a transferable-adaptive-traffic-signal-control policy. After being trained on an arbitrary set of road networks, our model can generalize to new road networks, traffic distributions, and traffic regimes, with no additional training and a constant number of parameters, enabling greater scalability compared to prior methods. Furthermore, our approach can exploit the granularity of available data by capturing the (dynamic) demand at both the lane and the vehicle levels. The proposed method is tested on both road networks and traffic settings never experienced during training. We compare IG-RL to multi-agent reinforcement learning and domain-specific baselines. In both synthetic road networks and in a larger experiment involving the control of the 3,971 traffic signals of Manhattan, we show that different instantiations of IG-RL outperform baselines.
    STEM: A Stochastic Two-Sided Momentum Algorithm Achieving Near-Optimal Sample and Communication Complexities for Federated Learning. (arXiv:2106.10435v1 [cs.LG])
    (2 min) Federated Learning (FL) refers to the paradigm where multiple worker nodes (WNs) build a joint model by using local data. Despite extensive research, for a generic non-convex FL problem, it is not clear, how to choose the WNs' and the server's update directions, the minibatch sizes, and the local update frequency, so that the WNs use the minimum number of samples and communication rounds to achieve the desired solution. This work addresses the above question and considers a class of stochastic algorithms where the WNs perform a few local updates before communication. We show that when both the WN's and the server's directions are chosen based on a stochastic momentum estimator, the algorithm requires $\tilde{\mathcal{O}}(\epsilon^{-3/2})$ samples and $\tilde{\mathcal{O}}(\epsilon^{-1})$ communication rounds to compute an $\epsilon$-stationary solution. To the best of our knowledge, this is the first FL algorithm that achieves such {\it near-optimal} sample and communication complexities simultaneously. Further, we show that there is a trade-off curve between local update frequencies and local minibatch sizes, on which the above sample and communication complexities can be maintained. Finally, we show that for the classical FedAvg (a.k.a. Local SGD, which is a momentum-less special case of the STEM), a similar trade-off curve exists, albeit with worse sample and communication complexities. Our insights on this trade-off provides guidelines for choosing the four important design elements for FL algorithms, the update frequency, directions, and minibatch sizes to achieve the best performance.
    Exponential Lower Bounds for Batch Reinforcement Learning: Batch RL can be Exponentially Harder than Online RL. (arXiv:2012.08005v4 [cs.LG] UPDATED)
    (2 min) Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration. Often these applications require us to 1) identify a near optimal policy or to 2) estimate the value of a target policy. For both tasks we derive \emph{exponential} information-theoretic lower bounds in discounted infinite horizon MDPs with a linear function representation for the action value function even if 1) \emph{realizability} holds, 2) the batch algorithm observes the exact reward and transition \emph{functions}, and 3) the batch algorithm is given the \emph{best} a priori data distribution for the problem class. Our work introduces a new `oracle + batch algorithm' framework to prove lower bounds that hold for every distribution. The work shows an exponential separation between batch and online reinforcement learning.
    ROPE: Reading Order Equivariant Positional Encoding for Graph-based Document Information Extraction. (arXiv:2106.10786v1 [cs.CL])
    (2 min) Natural reading orders of words are crucial for information extraction from form-like documents. Despite recent advances in Graph Convolutional Networks (GCNs) on modeling spatial layout patterns of documents, they have limited ability to capture reading orders of given word-level node representations in a graph. We propose Reading Order Equivariant Positional Encoding (ROPE), a new positional encoding technique designed to apprehend the sequential presentation of words in documents. ROPE generates unique reading order codes for neighboring words relative to the target word given a word-level graph connectivity. We study two fundamental document entity extraction tasks including word labeling and word grouping on the public FUNSD dataset and a large-scale payment dataset. We show that ROPE consistently improves existing GCNs with a margin up to 8.4% F1-score.
    Steepest Descent Neural Architecture Optimization: Escaping Local Optimum with Signed Neural Splitting. (arXiv:2003.10392v5 [cs.LG] UPDATED)
    (2 min) Developing efficient and principled neural architecture optimization methods is a critical challenge of modern deep learning. Recently, Liu et al.[19] proposed a splitting steepest descent (S2D) method that jointly optimizes the neural parameters and architectures based on progressively growing network structures by splitting neurons into multiple copies in a steepest descent fashion. However, S2D suffers from a local optimality issue when all the neurons become "splitting stable", a concept akin to local stability in parametric optimization. In this work, we develop a significant and surprising extension of the splitting descent framework that addresses the local optimality issue. The idea is to observe that the original S2D is unnecessarily restricted to splitting neurons into positive weighted copies. By simply allowing both positive and negative weights during splitting, we can eliminate the appearance of splitting stability in S2D and hence escape the local optima to obtain better performance. By incorporating signed splittings, we significantly extend the optimization power of splitting steepest descent both theoretically and empirically. We verify our method on various challenging benchmarks such as CIFAR-100, ImageNet and ModelNet40, on which we outperform S2D and other advanced methods on learning accurate and energy-efficient neural networks.
    Multirate Training of Neural Networks. (arXiv:2106.10771v1 [cs.LG])
    (2 min) We propose multirate training of neural networks: partitioning neural network parameters into "fast" and "slow" parts which are trained simultaneously using different learning rates. By choosing appropriate partitionings we can obtain large computational speed-ups for transfer learning tasks. We show that for various transfer learning applications in vision and NLP we can fine-tune deep neural networks in almost half the time, without reducing the generalization performance of the resulting model. We also discuss other splitting choices for the neural network parameters which are beneficial in enhancing generalization performance in settings where neural networks are trained from scratch. Finally, we propose an additional multirate technique which can learn different features present in the data by training the full network on different time scales simultaneously. The benefits of using this approach are illustrated for ResNet architectures on image data. Our paper unlocks the potential of using multirate techniques for neural network training and provides many starting points for future work in this area.
    Addressing Catastrophic Forgetting in Few-Shot Problems. (arXiv:2005.00146v3 [cs.LG] UPDATED)
    (2 min) Neural networks are known to suffer from catastrophic forgetting when trained on sequential datasets. While there have been numerous attempts to solve this problem in large-scale supervised classification, little has been done to overcome catastrophic forgetting in few-shot classification problems. We demonstrate that the popular gradient-based model-agnostic meta-learning algorithm (MAML) indeed suffers from catastrophic forgetting and introduce a Bayesian online meta-learning framework that tackles this problem. Our framework utilises Bayesian online learning and meta-learning along with Laplace approximation and variational inference to overcome catastrophic forgetting in few-shot classification problems. The experimental evaluations demonstrate that our framework can effectively achieve this goal in comparison with various baselines. As an additional utility, we also demonstrate empirically that our framework is capable of meta-learning on sequentially arriving few-shot tasks from a stationary task distribution.
    Generative Model Adversarial Training for Deep Compressed Sensing. (arXiv:2106.10696v1 [eess.IV])
    (2 min) Deep compressed sensing assumes the data has sparse representation in a latent space, i.e., it is intrinsically of low-dimension. The original data is assumed to be mapped from a low-dimensional space through a low-to-high-dimensional generator. In this work, we propound how to design such a low-to-high dimensional deep learning-based generator suiting for compressed sensing, while satisfying robustness to universal adversarial perturbations in the latent domain. We also justify why the noise is considered in the latent space. The work is also buttressed with theoretical analysis on the robustness of the trained generator to adversarial perturbations. Experiments on real-world datasets are provided to substantiate the efficacy of the proposed \emph{generative model adversarial training for deep compressed sensing.}
    Opportunities and challenges in partitioning the graph measure space of real-world networks. (arXiv:2106.10753v1 [cs.LG])
    (2 min) Based on a large dataset containing thousands of real-world networks ranging from genetic, protein interaction, and metabolic networks to brain, language, ecology, and social networks we search for defining structural measures of the different complex network domains (CND). We calculate 208 measures for all networks and using a comprehensive and scrupulous workflow of statistical and machine learning methods we investigated the limitations and possibilities of identifying the key graph measures of CNDs. Our approach managed to identify well distinguishable groups of network domains and confer their relevant features. These features turn out to be CND specific and not unique even at the level of individual CNDs. The presented methodology may be applied to other similar scenarios involving highly unbalanced and skewed datasets.
    TinyML: Analysis of Xtensa LX6 microprocessor for Neural Network Applications by ESP32 SoC. (arXiv:2106.10652v1 [cs.LG])
    (2 min) In recent decades, Machine Learning (ML) has become extremely important for many computing applications. The pervasiveness of ultra-low-power embedded devices such as ESP32 or ESP32 Cam with tiny Machine Learning (tinyML) applications will enable the mass proliferation of Artificial Intelligent powered Embedded IoT Devices. In the last few years, the microcontroller device (Espressif ESP32) became powerful enough to be used for small/tiny machine learning (tinyML) tasks. The ease of use of platforms like Arduino IDE, MicroPython and TensorFlow Lite (TF) with tinyML application make it an indispensable topic of research for mobile robotics, modern computer science and electrical engineering. The goal of this paper is to analyze the speed of the Xtensa dual core 32-bit LX6 microprocessor by running a neural network application. The different number of inputs (9, 36, 144 and 576) inputted through the different number of neurons in neural networks with one and two hidden layers. Xtensa LX6 microprocessor has been analyzed because it comes inside with Espressif ESP32 and ESP32 Cam which are very easy to use, plug and play IoT device. In this paper speed of the Xtensa LX6 microprocessor in feed-forward mode has been analyzed.
    Exoskeleton-Based Multimodal Action and Movement Recognition: Identifying and Developing the Optimal Boosted Learning Approach. (arXiv:2106.10331v1 [cs.RO])
    (2 min) This paper makes two scientific contributions to the field of exoskeleton-based action and movement recognition. First, it presents a novel machine learning and pattern recognition-based framework that can detect a wide range of actions and movements - walking, walking upstairs, walking downstairs, sitting, standing, lying, stand to sit, sit to stand, sit to lie, lie to sit, stand to lie, and lie to stand, with an overall accuracy of 82.63%. Second, it presents a comprehensive comparative study of different learning approaches - Random Forest, Artificial Neural Network, Decision Tree, Multiway Decision Tree, Support Vector Machine, k-NN, Gradient Boosted Trees, Decision Stump, Auto MLP, Linear Regression, Vector Linear Regression, Random Tree, Na\"ive Bayes, Na\"ive Bayes (Kernel), Linear Discriminant Analysis, Quadratic Discriminant Analysis, and Deep Learning applied to this framework. The performance of each of these learning approaches was boosted by using the AdaBoost algorithm, and the Cross Validation approach was used for training and testing. The results show that in boosted form, the k- NN classifier outperforms all the other boosted learning approaches and is, therefore, the optimal learning method for this purpose. The results presented and discussed uphold the importance of this work to contribute towards augmenting the abilities of exoskeleton-based assisted and independent living of the elderly in the future of Internet of Things-based living environments, such as Smart Homes. As a specific use case, we also discuss how the findings of our work are relevant for augmenting the capabilities of the Hybrid Assistive Limb exoskeleton, a highly functional lower limb exoskeleton.
    Algorithm Unrolling for Massive Access via Deep Neural Network with Theoretical Guarantee. (arXiv:2106.10426v1 [cs.IT])
    (2 min) Massive access is a critical design challenge of Internet of Things (IoT) networks. In this paper, we consider the grant-free uplink transmission of an IoT network with a multiple-antenna base station (BS) and a large number of single-antenna IoT devices. Taking into account the sporadic nature of IoT devices, we formulate the joint activity detection and channel estimation (JADCE) problem as a group-sparse matrix estimation problem. This problem can be solved by applying the existing compressed sensing techniques, which however either suffer from high computational complexities or lack of algorithm robustness. To this end, we propose a novel algorithm unrolling framework based on the deep neural network to simultaneously achieve low computational complexity and high robustness for solving the JADCE problem. Specifically, we map the original iterative shrinkage thresholding algorithm (ISTA) into an unrolled recurrent neural network (RNN), thereby improving the convergence rate and computational efficiency through end-to-end training. Moreover, the proposed algorithm unrolling approach inherits the structure and domain knowledge of the ISTA, thereby maintaining the algorithm robustness, which can handle non-Gaussian preamble sequence matrix in massive access. With rigorous theoretical analysis, we further simplify the unrolled network structure by reducing the redundant training parameters. Furthermore, we prove that the simplified unrolled deep neural network structures enjoy a linear convergence rate. Extensive simulations based on various preamble signatures show that the proposed unrolled networks outperform the existing methods in terms of the convergence rate, robustness and estimation accuracy.
    Attack to Fool and Explain Deep Networks. (arXiv:2106.10606v1 [cs.CV])
    (2 min) Deep visual models are susceptible to adversarial perturbations to inputs. Although these signals are carefully crafted, they still appear noise-like patterns to humans. This observation has led to the argument that deep visual representation is misaligned with human perception. We counter-argue by providing evidence of human-meaningful patterns in adversarial perturbations. We first propose an attack that fools a network to confuse a whole category of objects (source class) with a target label. Our attack also limits the unintended fooling by samples from non-sources classes, thereby circumscribing human-defined semantic notions for network fooling. We show that the proposed attack not only leads to the emergence of regular geometric patterns in the perturbations, but also reveals insightful information about the decision boundaries of deep models. Exploring this phenomenon further, we alter the `adversarial' objective of our attack to use it as a tool to `explain' deep visual representation. We show that by careful channeling and projection of the perturbations computed by our method, we can visualize a model's understanding of human-defined semantic notions. Finally, we exploit the explanability properties of our perturbations to perform image generation, inpainting and interactive image manipulation by attacking adversarialy robust `classifiers'.In all, our major contribution is a novel pragmatic adversarial attack that is subsequently transformed into a tool to interpret the visual models. The article also makes secondary contributions in terms of establishing the utility of our attack beyond the adversarial objective with multiple interesting applications.
    Multi-Task Learning for User Engagement and Adoption in Live Video Streaming Events. (arXiv:2106.10305v1 [cs.AI])
    (2 min) Nowadays, live video streaming events have become a mainstay in viewer's communication in large international enterprises. Provided that viewers are distributed worldwide, the main challenge resides on how to schedule the optimal event's time so as to improve both the viewer's engagement and adoption. In this paper we present a multi-task deep reinforcement learning model to select the time of a live video streaming event, aiming to optimize the viewer's engagement and adoption at the same time. We consider the engagement and adoption of the viewers as independent tasks and formulate a unified loss function to learn a common policy. In addition, we account for the fact that each task might have different contribution to the training strategy of the agent. Therefore, to determine the contribution of each task to the agent's training, we design a Transformer's architecture for the state-action transitions of each task. We evaluate our proposed model on four real-world datasets, generated by the live video streaming events of four large enterprises spanning from January 2019 until March 2021. Our experiments demonstrate the effectiveness of the proposed model when compared with several state-of-the-art strategies. For reproduction purposes, our evaluation datasets and implementation are publicly available at https://github.com/stefanosantaris/merlin.
    Generalization in the Face of Adaptivity: A Bayesian Perspective. (arXiv:2106.10761v1 [cs.LG])
    (2 min) Repeated use of a data sample via adaptively chosen queries can rapidly lead to overfitting, wherein the issued queries yield answers on the sample that differ wildly from the values of those queries on the underlying data distribution. Differential privacy provides a tool to ensure generalization despite adaptively-chosen queries, but its worst-case nature means that it cannot, for example, yield improved results for low-variance queries. In this paper, we give a simple new characterization that illuminates the core problem of adaptive data analysis. We show explicitly that the harms of adaptivity come from the covariance between the behavior of future queries and a Bayes factor-based measure of how much information about the data sample was encoded in the responses given to past queries. We leverage this intuition to introduce a new stability notion; we then use it to prove new generalization results for the most basic noise-addition mechanisms (Laplace and Gaussian noise addition), with guarantees that scale with the variance of the queries rather than the square of their range. Our characterization opens the door to new insights and new algorithms for the fundamental problem of achieving generalization in adaptive data analysis.
    TD-GEN: Graph Generation With Tree Decomposition. (arXiv:2106.10656v1 [cs.LG])
    (2 min) We propose TD-GEN, a graph generation framework based on tree decomposition, and introduce a reduced upper bound on the maximum number of decisions needed for graph generation. The framework includes a permutation invariant tree generation model which forms the backbone of graph generation. Tree nodes are supernodes, each representing a cluster of nodes in the graph. Graph nodes and edges are incrementally generated inside the clusters by traversing the tree supernodes, respecting the structure of the tree decomposition, and following node sharing decisions between the clusters. Finally, we discuss the shortcomings of standard evaluation criteria based on statistical properties of the generated graphs as performance measures. We propose to compare the performance of models based on likelihood. Empirical results on a variety of standard graph generation datasets demonstrate the superior performance of our method.
    Accelerated Policy Evaluation: Learning Adversarial Environments with Adaptive Importance Sampling. (arXiv:2106.10566v1 [cs.LG])
    (2 min) The evaluation of rare but high-stakes events remains one of the main difficulties in obtaining reliable policies from intelligent agents, especially in large or continuous state/action spaces where limited scalability enforces the use of a prohibitively large number of testing iterations. On the other hand, a biased or inaccurate policy evaluation in a safety-critical system could potentially cause unexpected catastrophic failures during deployment. In this paper, we propose the Accelerated Policy Evaluation (APE) method, which simultaneously uncovers rare events and estimates the rare event probability in Markov decision processes. The APE method treats the environment nature as an adversarial agent and learns towards, through adaptive importance sampling, the zero-variance sampling distribution for the policy evaluation. Moreover, APE is scalable to large discrete or continuous spaces by incorporating function approximators. We investigate the convergence properties of proposed algorithms under suitable regularity conditions. Our empirical studies show that APE estimates rare event probability with a smaller variance while only using orders of magnitude fewer samples compared to baseline methods in both multi-agent and single-agent environments.
    OptiDICE: Offline Policy Optimization via Stationary Distribution Correction Estimation. (arXiv:2106.10783v1 [cs.LG])
    (2 min) We consider the offline reinforcement learning (RL) setting where the agent aims to optimize the policy solely from the data without further environment interactions. In offline RL, the distributional shift becomes the primary source of difficulty, which arises from the deviation of the target policy being optimized from the behavior policy used for data collection. This typically causes overestimation of action values, which poses severe problems for model-free algorithms that use bootstrapping. To mitigate the problem, prior offline RL algorithms often used sophisticated techniques that encourage underestimation of action values, which introduces an additional set of hyperparameters that need to be tuned properly. In this paper, we present an offline RL algorithm that prevents overestimation in a more principled way. Our algorithm, OptiDICE, directly estimates the stationary distribution corrections of the optimal policy and does not rely on policy-gradients, unlike previous offline RL algorithms. Using an extensive set of benchmark datasets for offline RL, we show that OptiDICE performs competitively with the state-of-the-art methods.
    Robust Regression via Model Based Methods. (arXiv:2106.10759v1 [cs.LG])
    (2 min) The mean squared error loss is widely used in many applications, including auto-encoders, multi-target regression, and matrix factorization, to name a few. Despite computational advantages due to its differentiability, it is not robust to outliers. In contrast, l_p norms are known to be robust, but cannot be optimized via, e.g., stochastic gradient descent, as they are non-differentiable. We propose an algorithm inspired by so-called model-based optimization (MBO) [35, 36], which replaces a non-convex objective with a convex model function and alternates between optimizing the model function and updating the solution. We apply this to robust regression, proposing SADM, a stochastic variant of the Online Alternating Direction Method of Multipliers (OADM) [50] to solve the inner optimization in MBO. We show that SADM converges with the rate O(log T/T). Finally, we demonstrate experimentally (a) the robustness of l_p norms to outliers and (b) the efficiency of our proposed model-based algorithms in comparison with gradient methods on autoencoders and multi-target regression.
    DiffLoop: Tuning PID controllers by differentiating through the feedback loop. (arXiv:2106.10516v1 [eess.SY])
    (2 min) Since most industrial control applications use PID controllers, PID tuning and anti-windup measures are significant problems. This paper investigates tuning the feedback gains of a PID controller via back-calculation and automatic differentiation tools. In particular, we episodically use a cost function to generate gradients and perform gradient descent to improve controller performance. We provide a theoretical framework for analyzing this non-convex optimization and establish a relationship between back-calculation and disturbance feedback policies. We include numerical experiments on linear systems with actuator saturation to show the efficacy of this approach.
    Score-Based Explanations in Data Management and Machine Learning: An Answer-Set Programming Approach to Counterfactual Analysis. (arXiv:2106.10562v1 [cs.AI])
    (2 min) We describe some recent approaches to score-based explanations for query answers in databases and outcomes from classification models in machine learning. The focus is on work done by the author and collaborators. Special emphasis is placed on declarative approaches based on answer-set programming to the use of counterfactual reasoning for score specification and computation. Several examples that illustrate the flexibility of these methods are shown.
    Exploring Vision Transformers for Fine-grained Classification. (arXiv:2106.10587v1 [cs.CV])
    (2 min) Existing computer vision research in categorization struggles with fine-grained attributes recognition due to the inherently high intra-class variances and low inter-class variances. SOTA methods tackle this challenge by locating the most informative image regions and rely on them to classify the complete image. The most recent work, Vision Transformer (ViT), shows its strong performance in both traditional and fine-grained classification tasks. In this work, we propose a multi-stage ViT framework for fine-grained image classification tasks, which localizes the informative image regions without requiring architectural changes using the inherent multi-head self-attention mechanism. We also introduce attention-guided augmentations for improving the model's capabilities. We demonstrate the value of our approach by experimenting with four popular fine-grained benchmarks: CUB-200-2011, Stanford Cars, Stanford Dogs, and FGVC7 Plant Pathology. We also prove our model's interpretability via qualitative results.
    Rayleigh-Gauss-Newton optimization with enhanced sampling for variational Monte Carlo. (arXiv:2106.10558v1 [stat.ML])
    (2 min) Variational Monte Carlo (VMC) is an approach for computing ground-state wavefunctions that has recently become more powerful due to the introduction of neural network-based wavefunction parametrizations. However, efficiently training neural wavefunctions to converge to an energy minimum remains a difficult problem. In this work, we analyze optimization and sampling methods used in VMC and introduce alterations to improve their performance. First, based on theoretical convergence analysis in a noiseless setting, we motivate a new optimizer that we call the Rayleigh-Gauss-Newton method, which can improve upon gradient descent and natural gradient descent to achieve superlinear convergence. Second, in order to realize this favorable comparison in the presence of stochastic noise, we analyze the effect of sampling error on VMC parameter updates and experimentally demonstrate that it can be reduced by the parallel tempering method. In particular, we demonstrate that RGN can be made robust to energy spikes that occur when new regions of configuration space become available to the sampler over the course of optimization. Finally, putting theory into practice, we apply our enhanced optimization and sampling methods to the transverse-field Ising and XXZ models on large lattices, yielding ground-state energy estimates with remarkably high accuracy after just 200-500 parameter updates.
    Practical Transferability Estimation for Image Classification Tasks. (arXiv:2106.10479v1 [cs.CV])
    (2 min) Transferability estimation is an essential problem in transfer learning to predict how good the performance is when transfer a source model (source task) to a target task. Recent analytical transferability metrics have been widely used for source model selection and multi-task learning. Earlier metrics does not work sufficiently well under the challenging cross-domain cross-task transfer settings, but recent OTCE score achieves a noteworthy performance using auxiliary tasks. A simplified version named OT-based NCE score sacrifices accuracy to be more efficient, but it can be further improved. Consequently, we propose a practical transferability metric called JC-NCE score to further improve the cross-domain cross-task transferability estimation performance, which is more efficient than the OTCE score and more accurate than the OT-based NCE score. Specifically, we build the joint correspondences between source and target data via solving an optimal transport problem with considering both the sample distance and label distance, and then compute the transferability score as the negative conditional entropy. Extensive validations under the intra-dataset and inter-dataset transfer settings demonstrate that our JC-NCE score outperforms the OT-based NCE score with about 7% and 12% gains, respectively.
    Parallel frequency function-deep neural network for efficient complex broadband signal approximation. (arXiv:2106.10401v1 [eess.SP])
    (2 min) A neural network is essentially a high-dimensional complex mapping model by adjusting network weights for feature fitting. However, the spectral bias in network training leads to unbearable training epochs for fitting the high-frequency components in broadband signals. To improve the fitting efficiency of high-frequency components, the PhaseDNN was proposed recently by combining complex frequency band extraction and frequency shift techniques [Cai et al. SIAM J. SCI. COMPUT. 42, A3285 (2020)]. Our paper is devoted to an alternative candidate for fitting complex signals with high-frequency components. Here, a parallel frequency function-deep neural network (PFF-DNN) is proposed to suppress computational overhead while ensuring fitting accuracy by utilizing fast Fourier analysis of broadband signals and the spectral bias nature of neural networks. The effectiveness and efficiency of the proposed PFF-DNN method are verified based on detailed numerical experiments for six typical broadband signals.
    Signal Processing Based Deep Learning for Blind Symbol Decoding and Modulation Classification. (arXiv:2106.10543v1 [eess.SP])
    (2 min) Blindly decoding a signal requires estimating its unknown transmit parameters, compensating for the wireless channel impairments, and identifying the modulation type. While deep learning can solve complex problems, digital signal processing (DSP) is interpretable and can be more computationally efficient. To combine both, we propose the dual path network (DPN). It consists of a signal path of DSP operations that recover the signal, and a feature path of neural networks that estimate the unknown transmit parameters. By interconnecting the paths over several recovery stages, later stages benefit from the recovered signals and reuse all the previously extracted features. The proposed design is demonstrated to provide 5% improvement in modulation classification compared to alternative designs lacking either feature sharing or access to recovered signals. The estimation results of DPN along with its blind decoding performance are shown to outperform a blind signal processing algorithm for BPSK and QPSK on a simulated dataset. An over-the-air software-defined-radio capture was used to verify DPN results at high SNRs. DPN design can process variable length inputs and is shown to outperform relying on fixed length inputs with prediction averaging on longer signals by up to 15% in modulation classification.
    Semi-supervised Optimal Transport with Self-paced Ensemble for Cross-hospital Sepsis Early Detection. (arXiv:2106.10352v1 [cs.LG])
    (2 min) The utilization of computer technology to solve problems in medical scenarios has attracted considerable attention in recent years, which still has great potential and space for exploration. Among them, machine learning has been widely used in the prediction, diagnosis and even treatment of Sepsis. However, state-of-the-art methods require large amounts of labeled medical data for supervised learning. In real-world applications, the lack of labeled data will cause enormous obstacles if one hospital wants to deploy a new Sepsis detection system. Different from the supervised learning setting, we need to use known information (e.g., from another hospital with rich labeled data) to help build a model with acceptable performance, i.e., transfer learning. In this paper, we propose a semi-supervised optimal transport with self-paced ensemble framework for Sepsis early detection, called SPSSOT, to transfer knowledge from the other that has rich labeled data. In SPSSOT, we first extract the same clinical indicators from the source domain (e.g., hospital with rich labeled data) and the target domain (e.g., hospital with little labeled data), then we combine the semi-supervised domain adaptation based on optimal transport theory with self-paced under-sampling to avoid a negative transfer possibly caused by covariate shift and class imbalance. On the whole, SPSSOT is an end-to-end transfer learning method for Sepsis early detection which can automatically select suitable samples from two domains respectively according to the number of iterations and align feature space of two domains. Extensive experiments on two open clinical datasets demonstrate that comparing with other methods, our proposed SPSSOT, can significantly improve the AUC values with only 1% labeled data in the target domain in two transfer learning scenarios, MIMIC $rightarrow$ Challenge and Challenge $rightarrow$ MIMIC.
    Differentiable Particle Filtering without Modifying the Forward Pass. (arXiv:2106.10314v1 [stat.ML])
    (2 min) In recent years particle filters have being used as components in systems optimized end-to-end with gradient descent. However, the resampling step in a particle filter is not differentiable, which biases gradients and interferes with optimization. To remedy this problem, several differentiable variants of resampling have been proposed, all of which modify the behavior of the particle filter in significant and potentially undesirable ways. In this paper, we show how to obtain unbiased estimators of the gradient of the marginal likelihood by only modifying messages used in backpropagation, leaving the standard forward pass of a particle filter unchanged. Our method is simple to implement, has a low computational overhead, does not introduce additional hyperparameters, and extends to derivatives of higher orders. We call it stop-gradient resampling, since it can easily be implemented with automatic differentiation libraries using the stop-gradient operator instead of explicitly modifying the backward messages.
    A Max-Min Entropy Framework for Reinforcement Learning. (arXiv:2106.10517v1 [cs.LG])
    (2 min) In this paper, we propose a max-min entropy framework for reinforcement learning (RL) to overcome the limitation of the maximum entropy RL framework in model-free sample-based learning. Whereas the maximum entropy RL framework guides learning for policies to reach states with high entropy in the future, the proposed max-min entropy framework aims to learn to visit states with low entropy and maximize the entropy of these low-entropy states to promote exploration. For general Markov decision processes (MDPs), an efficient algorithm is constructed under the proposed max-min entropy framework based on disentanglement of exploration and exploitation. Numerical results show that the proposed algorithm yields drastic performance improvement over the current state-of-the-art RL algorithms.
    EvoGrad: Efficient Gradient-Based Meta-Learning and Hyperparameter Optimization. (arXiv:2106.10575v1 [cs.LG])
    (2 min) Gradient-based meta-learning and hyperparameter optimization have seen significant progress recently, enabling practical end-to-end training of neural networks together with many hyperparameters. Nevertheless, existing approaches are relatively expensive as they need to compute second-order derivatives and store a longer computational graph. This cost prevents scaling them to larger network architectures. We present EvoGrad, a new approach to meta-learning that draws upon evolutionary techniques to more efficiently compute hypergradients. EvoGrad estimates hypergradient with respect to hyperparameters without calculating second-order gradients, or storing a longer computational graph, leading to significant improvements in efficiency. We evaluate EvoGrad on two substantial recent meta-learning applications, namely cross-domain few-shot learning with feature-wise transformations and noisy label learning with MetaWeightNet. The results show that EvoGrad significantly improves efficiency and enables scaling meta-learning to bigger CNN architectures such as from ResNet18 to ResNet34.
    Sample Efficient Social Navigation Using Inverse Reinforcement Learning. (arXiv:2106.10318v1 [cs.RO])
    (2 min) In this paper, we present an algorithm to efficiently learn socially-compliant navigation policies from observations of human trajectories. As mobile robots come to inhabit and traffic social spaces, they must account for social cues and behave in a socially compliant manner. We focus on learning such cues from examples. We describe an inverse reinforcement learning based algorithm which learns from human trajectory observations without knowing their specific actions. We increase the sample-efficiency of our approach over alternative methods by leveraging the notion of a replay buffer (found in many off-policy reinforcement learning methods) to eliminate the additional sample complexity associated with inverse reinforcement learning. We evaluate our method by training agents using publicly available pedestrian motion data sets and compare it to related methods. We show that our approach yields better performance while also decreasing training time and sample complexity.
    On the benefits of maximum likelihood estimation for Regression and Forecasting. (arXiv:2106.10370v1 [stat.ML])
    (2 min) We advocate for a practical Maximum Likelihood Estimation (MLE) approach for regression and forecasting, as an alternative to the typical approach of Empirical Risk Minimization (ERM) for a specific target metric. This approach is better suited to capture inductive biases such as prior domain knowledge in datasets, and can output post-hoc estimators at inference time that can optimize different types of target metrics. We present theoretical results to demonstrate that our approach is always competitive with any estimator for the target metric under some general conditions, and in many practical settings (such as Poisson Regression) can actually be much superior to ERM. We demonstrate empirically that our method instantiated with a well-designed general purpose mixture likelihood family can obtain superior performance over ERM for a variety of tasks across time-series forecasting and regression datasets with different data distributions.
    Unlocking Pixels for Reinforcement Learning via Implicit Attention. (arXiv:2102.04353v3 [cs.LG] UPDATED)
    (2 min) There has recently been significant interest in training reinforcement learning (RL) agents in vision-based environments. This poses many challenges, such as high dimensionality and potential for observational overfitting through spurious correlations. A promising approach to solve both of these problems is a self-attention bottleneck, which provides a simple and effective framework for learning high performing policies, even in the presence of distractions. However, due to poor scalability of attention architectures, these methods do not scale beyond low resolution visual inputs, using large patches (thus small attention matrices). In this paper we make use of new efficient attention algorithms, recently shown to be highly effective for Transformers, and demonstrate that these new techniques can be applied in the RL setting. This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches, even individual pixels, improving generalization. In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features, leveraging the theory of angular kernels. We show theoretically and empirically that hybrid random features is a promising approach when using attention for vision-based RL.
    iDARTS: Differentiable Architecture Search with Stochastic Implicit Gradients. (arXiv:2106.10784v1 [cs.LG])
    (2 min) \textit{Differentiable ARchiTecture Search} (DARTS) has recently become the mainstream of neural architecture search (NAS) due to its efficiency and simplicity. With a gradient-based bi-level optimization, DARTS alternately optimizes the inner model weights and the outer architecture parameter in a weight-sharing supernet. A key challenge to the scalability and quality of the learned architectures is the need for differentiating through the inner-loop optimisation. While much has been discussed about several potentially fatal factors in DARTS, the architecture gradient, a.k.a. hypergradient, has received less attention. In this paper, we tackle the hypergradient computation in DARTS based on the implicit function theorem, making it only depends on the obtained solution to the inner-loop optimization and agnostic to the optimization path. To further reduce the computational requirements, we formulate a stochastic hypergradient approximation for differentiable NAS, and theoretically show that the architecture optimization with the proposed method, named iDARTS, is expected to converge to a stationary point. Comprehensive experiments on two NAS benchmark search spaces and the common NAS search space verify the effectiveness of our proposed method. It leads to architectures outperforming, with large margins, those learned by the baseline methods.
    OLIVAW: Mastering Othello with neither Humans nor a Penny. (arXiv:2103.17228v2 [cs.LG] UPDATED)
    (2 min) We introduce OLIVAW, an AI Othello player adopting the design principles of the famous AlphaGo series. The main motivation behind OLIVAW was to attain exceptional competence in a non-trivial board game at a tiny fraction of the cost of its illustrious predecessors. In this paper, we show how the AlphaGo Zero's paradigm can be successfully applied to the popular game of Othello using only commodity hardware and free cloud services. While being simpler than Chess or Go, Othello maintains a considerable search space and difficulty in evaluating board positions. To achieve this result, OLIVAW implements some improvements inspired by recent works to accelerate the standard AlphaGo Zero learning process. The main modification implies doubling the positions collected per game during the training phase, by including also positions not played but largely explored by the agent. We tested the strength of OLIVAW in three different ways: by pitting it against Edax, the strongest open-source Othello engine, by playing anonymous games on the web platform OthelloQuest, and finally in two in-person matches against top-notch human players: a national champion and a former world champion.
    What Kinds of Functions do Deep Neural Networks Learn? Insights from Variational Spline Theory. (arXiv:2105.03361v2 [stat.ML] UPDATED)
    (2 min) We develop a variational framework to understand the properties of functions learned by deep neural networks with ReLU activation functions fit to data. We propose a new function space, which is reminiscent of classical bounded variation spaces, that captures the compositional structure associated with deep neural networks. We derive a representer theorem showing that deep ReLU networks are solutions to regularized data fitting problems in this function space. The function space consists of compositions of functions from the (non-reflexive) Banach spaces of second-order bounded variation in the Radon domain. These are Banach spaces with sparsity-promoting norms, giving insight into the role of sparsity in deep neural networks. The neural network solutions have skip connections and rank bounded weight matrices, providing new theoretical support for these common architectural choices. The variational problem we study can be recast as a finite-dimensional neural network training problem with regularization schemes related to the notions of weight decay and path-norm regularization. Finally, our analysis builds on techniques from variational spline theory, providing new connections between deep neural networks and splines.
    Spatial Contrastive Learning for Few-Shot Classification. (arXiv:2012.13831v3 [cs.CV] UPDATED)
    (2 min) In this paper, we explore contrastive learning for few-shot classification, in which we propose to use it as an additional auxiliary training objective acting as a data-dependent regularizer to promote more general and transferable features. In particular, we present a novel attention-based spatial contrastive objective to learn locally discriminative and class-agnostic features. As a result, our approach overcomes some of the limitations of the cross-entropy loss, such as its excessive discrimination towards seen classes, which reduces the transferability of features to unseen classes. With extensive experiments, we show that the proposed method outperforms state-of-the-art approaches, confirming the importance of learning good and transferable embeddings for few-shot learning.
    Constrained plasticity reserve as a natural way to control frequency and weights in spiking neural networks. (arXiv:2103.08143v2 [q-bio.NC] UPDATED)
    (2 min) Biological neurons have adaptive nature and perform complex computations involving the filtering of redundant information. However, most common neural cell models, including biologically plausible, such as Hodgkin-Huxley or Izhikevich, do not possess predictive dynamics on a single-cell level. Moreover, the modern rules of synaptic plasticity or interconnections weights adaptation also do not provide grounding for the ability of neurons to adapt to the ever-changing input signal intensity. While natural neuron synaptic growth is precisely controlled and restricted by protein supply and recycling, weight correction rules such as widely used STDP are efficiently unlimited in change rate and scale. The present article introduces new mechanics of interconnection between neuron firing rate homeostasis and weight change through STDP growth bounded by abstract protein reserve, controlled by the intracellular optimization algorithm. We show how these cellular dynamics help neurons filter out the intense noise signals to help neurons keep a stable firing rate. We also examine that such filtering does not affect the ability of neurons to recognize the correlated inputs in unsupervised mode. Such an approach might be used in the machine learning domain to improve the robustness of AI systems.
    Defense Against Reward Poisoning Attacks in Reinforcement Learning. (arXiv:2102.05776v2 [cs.LG] UPDATED)
    (2 min) We study defense strategies against reward poisoning attacks in reinforcement learning. As a threat model, we consider attacks that minimally alter rewards to make the attacker's target policy uniquely optimal under the poisoned rewards, with the optimality gap specified by an attack parameter. Our goal is to design agents that are robust against such attacks in terms of the worst-case utility w.r.t. the true, unpoisoned, rewards while computing their policies under the poisoned rewards. We propose an optimization framework for deriving optimal defense policies, both when the attack parameter is known and unknown. Moreover, we show that defense policies that are solutions to the proposed optimization problems have provable performance guarantees. In particular, we provide the following bounds with respect to the true, unpoisoned, rewards: a) lower bounds on the expected return of the defense policies, and b) upper bounds on how suboptimal these defense policies are compared to the attacker's target policy. We conclude the paper by illustrating the intuitions behind our formal results, and showing that the derived bounds are non-trivial.
    Constraint-Based Regularization of Neural Networks. (arXiv:2006.10114v2 [cs.LG] UPDATED)
    (2 min) We propose a method for efficiently incorporating constraints into a stochastic gradient Langevin framework for the training of deep neural networks. Constraints allow direct control of the parameter space of the model. Appropriately designed, they reduce the vanishing/exploding gradient problem, control weight magnitudes and stabilize deep neural networks and thus improve the robustness of training algorithms and the generalization capabilities of the trained neural network. We present examples of constrained training methods motivated by orthogonality preservation for weight matrices and explicit weight normalizations. We describe the methods in the overdamped formulation of Langevin dynamics and the underdamped form, in which momenta help to improve sampling efficiency. The methods are explored in test examples in image classification and natural language processing.
    Adversarial Examples Make Strong Poisons. (arXiv:2106.10807v1 [cs.LG])
    (2 min) The adversarial machine learning literature is largely partitioned into evasion attacks on testing data and poisoning attacks on training data. In this work, we show that adversarial examples, originally intended for attacking pre-trained models, are even more effective for data poisoning than recent methods designed specifically for poisoning. Our findings indicate that adversarial examples, when assigned the original label of their natural base image, cannot be used to train a classifier for natural images. Furthermore, when adversarial examples are assigned their adversarial class label, they are useful for training. This suggests that adversarial examples contain useful semantic content, just with the ``wrong'' labels (according to a network, but not a human). Our method, adversarial poisoning, is substantially more effective than existing poisoning methods for secure dataset release, and we release a poisoned version of ImageNet, ImageNet-P, to encourage research into the strength of this form of data obfuscation.
    Lossy Compression for Lossless Prediction. (arXiv:2106.10800v1 [cs.LG])
    (2 min) Most data is automatically collected and only ever "seen" by algorithms. Yet, data compressors preserve perceptual fidelity rather than just the information needed by algorithms performing downstream tasks. In this paper, we characterize the bit-rate required to ensure high performance on all predictive tasks that are invariant under a set of transformations, such as data augmentations. Based on our theory, we design unsupervised objectives for training neural compressors. Using these objectives, we train a generic image compressor that achieves substantial rate savings (more than $1000\times$ on ImageNet) compared to JPEG on 8 datasets, without decreasing downstream classification performance.
    Tight Differential Privacy for Discrete-Valued Mechanisms and for the Subsampled Gaussian Mechanism Using FFT. (arXiv:2006.07134v3 [stat.ML] UPDATED)
    (2 min) We propose a numerical accountant for evaluating the tight $(\varepsilon,\delta)$-privacy loss for algorithms with discrete one dimensional output. The method is based on the privacy loss distribution formalism and it uses the recently introduced fast Fourier transform based accounting technique. We carry out an error analysis of the method in terms of moment bounds of the privacy loss distribution which leads to rigorous lower and upper bounds for the true $(\varepsilon,\delta)$-values. As an application, we present a novel approach to accurate privacy accounting of the subsampled Gaussian mechanism. This completes the previously proposed analysis by giving strict lower and upper bounds for the privacy parameters. We demonstrate the performance of the accountant on the binomial mechanism and show that our approach allows decreasing noise variance up to 75 percent at equal privacy compared to existing bounds in the literature. We also illustrate how to compute tight bounds for the exponential mechanism applied to counting queries.
    Trainable Class Prototypes for Few-Shot Learning. (arXiv:2106.10846v1 [cs.CV])
    (2 min) Metric learning is a widely used method for few shot learning in which the quality of prototypes plays a key role in the algorithm. In this paper we propose the trainable prototypes for distance measure instead of the artificial ones within the meta-training and task-training framework. Also to avoid the disadvantages that the episodic meta-training brought, we adopt non-episodic meta-training based on self-supervised learning. Overall we solve the few-shot tasks in two phases: meta-training a transferable feature extractor via self-supervised learning and training the prototypes for metric classification. In addition, the simple attention mechanism is used in both meta-training and task-training. Our method achieves state-of-the-art performance in a variety of established few-shot tasks on the standard few-shot visual classification dataset, with about 20% increase compared to the available unsupervised few-shot learning methods.
    Mixed-Privacy Forgetting in Deep Networks. (arXiv:2012.13431v2 [cs.LG] UPDATED)
    (2 min) We show that the influence of a subset of the training samples can be removed -- or "forgotten" -- from the weights of a network trained on large-scale image classification tasks, and we provide strong computable bounds on the amount of remaining information after forgetting. Inspired by real-world applications of forgetting techniques, we introduce a novel notion of forgetting in mixed-privacy setting, where we know that a "core" subset of the training samples does not need to be forgotten. While this variation of the problem is conceptually simple, we show that working in this setting significantly improves the accuracy and guarantees of forgetting methods applied to vision classification tasks. Moreover, our method allows efficient removal of all information contained in non-core data by simply setting to zero a subset of the weights with minimal loss in performance. We achieve these results by replacing a standard deep network with a suitable linear approximation. With opportune changes to the network architecture and training procedure, we show that such linear approximation achieves comparable performance to the original network and that the forgetting problem becomes quadratic and can be solved efficiently even for large models. Unlike previous forgetting methods on deep networks, ours can achieve close to the state-of-the-art accuracy on large scale vision tasks. In particular, we show that our method allows forgetting without having to trade off the model accuracy.
    I-MAD: Interpretable Malware Detector Using Galaxy Transformer. (arXiv:1909.06865v3 [cs.LG] UPDATED)
    (2 min) Malware currently presents a number of serious threats to computer users. Signature-based malware detection methods are limited in detecting new malware samples that are significantly different from known ones. Therefore, machine learning-based methods have been proposed, but there are two challenges these methods face. The first is to model the full semantics behind the assembly code of malware. The second challenge is to provide interpretable results while keeping excellent detection performance. In this paper, we propose an Interpretable MAlware Detector (I-MAD) that outperforms state-of-the-art static malware detection models regarding accuracy with excellent interpretability. To improve the detection performance, I-MAD incorporates a novel network component called the Galaxy Transformer network that can understand assembly code at the basic block, function, and executable levels. It also incorporates our proposed interpretable feed-forward neural network to provide interpretations for its detection results by quantifying the impact of each feature with respect to the prediction. Experiment results show that our model significantly outperforms existing state-of-the-art static malware detection models and presents meaningful interpretations.
    Analytical confidence intervals for the number of different objects in data streams. (arXiv:1909.11564v3 [math.ST] UPDATED)
    (2 min) This paper develops a new mathematical-statistical approach to analyze a class of Flajolet-Martin algorithms (FMa), and provides analytical confidence intervals for the number F0 of distinct elements in a stream, based on Chernoff bounds. The class of FMa has reached a significant popularity in bigdata stream learning, and the attention of the literature has mainly been based on algorithmic aspects, basically complexity optimality, while the statistical analysis of these class of algorithms has been often faced heuristically. The analysis provided here shows deep connections with mathematical special functions and with extreme value theory. The latter connection may help in explaining heuristic considerations, while the first opens many numerical issues, faced at the end of the present paper. Finally, the algorithms are tested on an anonymized real data stream and MonteCarlo simulations are provided to support our analytical choice in this context.
    Practical Assessment of Generalization Performance Robustness for Deep Networks via Contrastive Examples. (arXiv:2106.10653v1 [cs.LG])
    (2 min) Training images with data transformations have been suggested as contrastive examples to complement the testing set for generalization performance evaluation of deep neural networks (DNNs). In this work, we propose a practical framework ContRE (The word "contre" means "against" or "versus" in French.) that uses Contrastive examples for DNN geneRalization performance Estimation. Specifically, ContRE follows the assumption in contrastive learning that robust DNN models with good generalization performance are capable of extracting a consistent set of features and making consistent predictions from the same image under varying data transformations. Incorporating with a set of randomized strategies for well-designed data transformations over the training set, ContRE adopts classification errors and Fisher ratios on the generated contrastive examples to assess and analyze the generalization performance of deep models in complement with a testing set. To show the effectiveness and the efficiency of ContRE, extensive experiments have been done using various DNN models on three open source benchmark datasets with thorough ablation studies and applicability analyses. Our experiment results confirm that (1) behaviors of deep models on contrastive examples are strongly correlated to what on the testing set, and (2) ContRE is a robust measure of generalization performance complementing to the testing set in various settings.
    Computing Differential Privacy Guarantees for Heterogeneous Compositions Using FFT. (arXiv:2102.12412v2 [cs.CR] UPDATED)
    (2 min) The recently proposed Fast Fourier Transform (FFT)-based accountant for evaluating $(\varepsilon,\delta)$-differential privacy guarantees using the privacy loss distribution formalism has been shown to give tighter bounds than commonly used methods such as R\'enyi accountants when applied to homogeneous compositions, i.e., to compositions of identical mechanisms. In this paper, we extend this approach to heterogeneous compositions. We carry out a full error analysis that allows choosing the parameters of the algorithm such that a desired accuracy is obtained. The analysis also extends previous results by taking into account all the parameters of the algorithm. Using the error analysis, we also give a bound for the computational complexity in terms of the error which is analogous to and slightly tightens the one given by Murtagh and Vadhan (2018). We also show how to speed up the evaluation of tight privacy guarantees using the Plancherel theorem at the cost of increased pre-computation and memory usage.
    Prediction-Free, Real-Time Flexible Control of Tidal Lagoons through Proximal Policy Optimisation: A Case Study for the Swansea Lagoon. (arXiv:2106.10360v1 [cs.LG])
    (2 min) Tidal range structures have been considered for large scale electricity generation for their potential ability to produce reasonable predictable energy without the emission of greenhouse gases. Once the main forcing components for driving the tides have deterministic dynamics, the available energy in a given tidal power plant has been estimated, through analytical and numerical optimisation routines, as a mostly predictable event. This constraint imposes state-of-art flexible operation methods to rely on tidal predictions (concurrent with measured data and up to a multiple of half-tidal cycles into the future) to infer best operational strategies for tidal lagoons, with the additional cost of requiring to run optimisation routines for every new tide. In this paper, we propose a novel optimised operation of tidal lagoons with proximal policy optimisation through Unity ML-Agents. We compare this technique with 6 different operation optimisation approaches (baselines) devised from the literature, utilising the Swansea Bay Tidal Lagoon as a case study. We show that our approach is successful in maximising energy generation through an optimised operational policy of turbines and sluices, yielding competitive results with state-of-the-art methods of optimisation, regardless of test data used, requiring training once and performing real-time flexible control with measured ocean data only.
    Machine learning in the social and health sciences. (arXiv:2106.10716v1 [cs.LG])
    (2 min) The uptake of machine learning (ML) approaches in the social and health sciences has been rather slow, and research using ML for social and health research questions remains fragmented. This may be due to the separate development of research in the computational/data versus social and health sciences as well as a lack of accessible overviews and adequate training in ML techniques for non data science researchers. This paper provides a meta-mapping of research questions in the social and health sciences to appropriate ML approaches, by incorporating the necessary requirements to statistical analysis in these disciplines. We map the established classification into description, prediction, and causal inference to common research goals, such as estimating prevalence of adverse health or social outcomes, predicting the risk of an event, and identifying risk factors or causes of adverse outcomes. This meta-mapping aims at overcoming disciplinary barriers and starting a fluid dialogue between researchers from the social and health sciences and methodologically trained researchers. Such mapping may also help to fully exploit the benefits of ML while considering domain-specific aspects relevant to the social and health sciences, and hopefully contribute to the acceleration of the uptake of ML applications to advance both basic and applied social and health sciences research.
    Learning audio sequence representations for acoustic event classification. (arXiv:1707.08729v2 [cs.SD] UPDATED)
    (2 min) Acoustic Event Classification (AEC) has become a significant task for machines to perceive the surrounding auditory scene. However, extracting effective representations that capture the underlying characteristics of the acoustic events is still challenging. Previous methods mainly focused on designing the audio features in a `hand-crafted' manner. Interestingly, data-learnt features have been recently reported to show better performance. Up to now, these were only considered on the frame level. In this article, we propose an unsupervised learning framework to learn a vector representation of an audio sequence for AEC. This framework consists of a Recurrent Neural Network (RNN) encoder and an RNN decoder, which respectively transforms the variable-length audio sequence into a fixed-length vector and reconstructs the input sequence on the generated vector. After training the encoder-decoder, we feed the audio sequences to the encoder and then take the learnt vectors as the audio sequence representations. Compared with previous methods, the proposed method can not only deal with the problem of arbitrary-lengths of audio streams, but also learn the salient information of the sequence. Extensive evaluation on a large-size acoustic event database is performed, and the empirical results demonstrate that the learnt audio sequence representation yields a significant performance improvement by a large margin compared with other state-of-the-art hand-crafted sequence features for AEC.
    A Comprehensive Review on Non-Neural Networks Collaborative Filtering Recommendation Systems. (arXiv:2106.10679v1 [cs.IR])
    (2 min) Over the past two decades, recommender systems have attracted a lot of interest due to the explosion in the amount of data in online applications. A particular attention has been paid to collaborative filtering, which is the most widely used in applications that involve information recommendations. Collaborative filtering (CF) uses the known preference of a group of users to make predictions and recommendations about the unknown preferences of other users (recommendations are made based on the past behavior of users). First introduced in the 1990s, a wide variety of increasingly successful models have been proposed. Due to the success of machine learning techniques in many areas, there has been a growing emphasis on the application of such algorithms in recommendation systems. In this article, we present an overview of the CF approaches for recommender systems, their two main categories, and their evaluation metrics. We focus on the application of classical Machine Learning algorithms to CF recommender systems by presenting their evolution from their first use-cases to advanced Machine Learning models. We attempt to provide a comprehensive and comparative overview of CF systems (with python implementations) that can serve as a guideline for research and practice in this area.
    DiffMG: Differentiable Meta Graph Search for Heterogeneous Graph Neural Networks. (arXiv:2010.03250v2 [cs.LG] UPDATED)
    (2 min) In this paper, we propose a novel framework to automatically utilize task-dependent semantic information which is encoded in heterogeneous information networks (HINs). Specifically, we search for a meta graph, which can capture more complex semantic relations than a meta path, to determine how graph neural networks (GNNs) propagate messages along different types of edges. We formalize the problem within the framework of neural architecture search (NAS) and then perform the search in a differentiable manner. We design an expressive search space in the form of a directed acyclic graph (DAG) to represent candidate meta graphs for a HIN, and we propose task-dependent type constraint to filter out those edge types along which message passing has no effect on the representations of nodes that are related to the downstream task. The size of the search space we define is huge, so we further propose a novel and efficient search algorithm to make the total search cost on a par with training a single GNN once. Compared with existing popular NAS algorithms, our proposed search algorithm improves the search efficiency. We conduct extensive experiments on different HINs and downstream tasks to evaluate our method, and experimental results show that our method can outperform state-of-the-art heterogeneous GNNs and also improves efficiency compared with those methods which can implicitly learn meta paths.
    Model-agnostic Feature Importance and Effects with Dependent Features -- A Conditional Subgroup Approach. (arXiv:2006.04628v2 [stat.ML] UPDATED)
    (2 min) The interpretation of feature importance in machine learning models is challenging when features are dependent. Permutation feature importance (PFI) ignores such dependencies, which can cause misleading interpretations due to extrapolation. A possible remedy is more advanced conditional PFI approaches that enable the assessment of feature importance conditional on all other features. Due to this shift in perspective and in order to enable correct interpretations, it is therefore important that the conditioning is transparent and humanly comprehensible. In this paper, we propose a new sampling mechanism for the conditional distribution based on permutations in conditional subgroups. As these subgroups are constructed using decision trees (transformation trees), the conditioning becomes inherently interpretable. This not only provides a simple and effective estimator of conditional PFI, but also local PFI estimates within the subgroups. In addition, we apply the conditional subgroups approach to partial dependence plots (PDP), a popular method for describing feature effects that can also suffer from extrapolation when features are dependent and interactions are present in the model. We show that PFI and PDP based on conditional subgroups often outperform methods such as conditional PFI based on knockoffs, or accumulated local effect plots. Furthermore, our approach allows for a more fine-grained interpretation of feature effects and importance within the conditional subgroups.
    Reinforcement learning for pursuit and evasion of microswimmers at low Reynolds number. (arXiv:2106.08609v1 [physics.flu-dyn] CROSS LISTED)
    (2 min) Aquatic organisms can use hydrodynamic cues to navigate, find their preys and escape from predators. We consider a model of two competing microswimmers engaged in a pursue-evasion task while immersed in a low-Reynolds-number environment. The players have limited abilities: they can only sense hydrodynamic disturbances, which provide some cue about the opponent's position, and perform simple manoeuvres. The goal of the pursuer is to capturethe evader in the shortest possible time. Conversely the evader aims at deferring capture as much as possible. We show that by means of Reinforcement Learning the players find efficient and physically explainable strategies which non-trivially exploit the hydrodynamic environment. This Letter offers a proof-of-concept for the use of Reinforcement Learning to discover prey-predator strategies in aquatic environments, with potential applications to underwater robotics.
    Learned Factor Graphs for Inference from Stationary Time Sequences. (arXiv:2006.03258v3 [cs.LG] UPDATED)
    (2 min) The design of methods for inference from time sequences has traditionally relied on statistical models that describe the relation between a latent desired sequence and the observed one. A broad family of model-based algorithms have been derived to carry out inference at controllable complexity using recursive computations over the factor graph representing the underlying distribution. An alternative model-agnostic approach utilizes machine learning (ML) methods. Here we propose a framework that combines model-based algorithms and data-driven ML tools for stationary time sequences. In the proposed approach, neural networks are developed to separately learn specific components of a factor graph describing the distribution of the time sequence, rather than the complete inference task. By exploiting stationary properties of this distribution, the resulting approach can be applied to sequences of varying temporal duration. Learned factor graph can be realized using compact neural networks that are trainable using small training sets, or alternatively, be used to improve upon existing deep inference systems. We present an inference algorithm based on learned stationary factor graphs, which learns to implement the sum-product scheme from labeled data, and can be applied to sequences of different lengths. Our experimental results demonstrate the ability of the proposed learned factor graphs to learn to carry out accurate inference from small training sets for sleep stage detection using the Sleep-EDF dataset, as well as for symbol detection in digital communications with unknown channels.
    High-level Features for Resource Economy and Fast Learning in Skill Transfer. (arXiv:2106.10354v1 [cs.RO])
    (2 min) Abstraction is an important aspect of intelligence which enables agents to construct robust representations for effective decision making. In the last decade, deep networks are proven to be effective due to their ability to form increasingly complex abstractions. However, these abstractions are distributed over many neurons, making the re-use of a learned skill costly. Previous work either enforced formation of abstractions creating a designer bias, or used a large number of neural units without investigating how to obtain high-level features that may more effectively capture the source task. For avoiding designer bias and unsparing resource use, we propose to exploit neural response dynamics to form compact representations to use in skill transfer. For this, we consider two competing methods based on (1) maximum information compression principle and (2) the notion that abstract events tend to generate slowly changing signals, and apply them to the neural signals generated during task execution. To be concrete, in our simulation experiments, we either apply principal component analysis (PCA) or slow feature analysis (SFA) on the signals collected from the last hidden layer of a deep network while it performs a source task, and use these features for skill transfer in a new target task. We compare the generalization performance of these alternatives with the baselines of skill transfer with full layer output and no-transfer settings. Our results show that SFA units are the most successful for skill transfer. SFA as well as PCA, incur less resources compared to usual skill transfer, whereby many units formed show a localized response reflecting end-effector-obstacle-goal relations. Finally, SFA units with lowest eigenvalues resembles symbolic representations that highly correlate with high-level features such as joint angles which might be thought of precursors for fully symbolic systems.
    Topological obstructions in neural networks learning. (arXiv:2012.15834v1 [cs.LG] CROSS LISTED)
    (2 min) We apply methods of topological data analysis to loss functions to gain insights on learning of deep neural networks and their generalization properties. We study global properties of the loss function gradient flow. We use topological data analysis of the loss function and its Morse complex to relate local behavior along gradient trajectories with global properties of the loss surface. We define neural network Topological Obstructions score, TO-score, with help of robust topological invariants, barcodes of loss function, that quantify the badness of local minima for gradient-based optimization. We have made several experiments for computing these invariants, for small neural networks, and for fully connected, convolutional and ResNet-like neural networks on different datasets: MNIST, Fashion MNIST, CIFAR10, SVHN. Our two principal observations are as follows. Firstly, the neural network barcode and TO-score decrease with the increase of the neural network depth and width. Secondly, there is an intriguing connection between the length of minima segments in the barcode and the minima generalization error.
    A Probabilistic State Space Model for Joint Inference from Differential Equations and Data. (arXiv:2103.10153v2 [stat.ML] UPDATED)
    (2 min) Mechanistic models with differential equations are a key component of scientific applications of machine learning. Inference in such models is usually computationally demanding, because it involves repeatedly solving the differential equation. The main problem here is that the numerical solver is hard to combine with standard inference techniques. Recent work in probabilistic numerics has developed a new class of solvers for ordinary differential equations (ODEs) that phrase the solution process directly in terms of Bayesian filtering. We here show that this allows such methods to be combined very directly, with conceptual and numerical ease, with latent force models in the ODE itself. It then becomes possible to perform approximate Bayesian inference on the latent force as well as the ODE solution in a single, linear complexity pass of an extended Kalman filter / smoother - that is, at the cost of computing a single ODE solution. We demonstrate the expressiveness and performance of the algorithm by training, among others, a non-parametric SIRD model on data from the COVID-19 outbreak.
    On the intrinsic robustness to noise of some leading classifiers and symmetric loss function -- an empirical evaluation. (arXiv:2010.13570v5 [cs.LG] UPDATED)
    (2 min) In some industrial applications such as fraud detection, the performance of common supervision techniques may be affected by the poor quality of the available labels : in actual operational use-cases, these labels may be weak in quantity, quality or trustworthiness. We propose a benchmark to evaluate the natural robustness of different algorithms taken from various paradigms on artificially corrupted datasets, with a focus on noisy labels. This paper studies the intrinsic robustness of some leading classifiers. The algorithms under scrutiny include SVM, logistic regression, random forests, XGBoost, Khiops. Furthermore, building on results from recent literature, the study is supplemented with an investigation into the opportunity to enhance some algorithms with symmetric loss functions.
    Learning to Reach, Swim, Walk and Fly in One Trial: Data-Driven Control with Scarce Data and Side Information. (arXiv:2106.10533v1 [eess.SY])
    (2 min) We develop a learning-based control algorithm for unknown dynamical systems under very severe data limitations. Specifically, the algorithm has access to streaming data only from a single and ongoing trial. Despite the scarcity of data, we show -- through a series of examples -- that the algorithm can provide performance comparable to reinforcement learning algorithms trained over millions of environment interactions. It accomplishes such performance by effectively leveraging various forms of side information on the dynamics to reduce the sample complexity. Such side information typically comes from elementary laws of physics and qualitative properties of the system. More precisely, the algorithm approximately solves an optimal control problem encoding the system's desired behavior. To this end, it constructs and refines a differential inclusion that contains the unknown vector field of the dynamics. The differential inclusion, used in an interval Taylor-based method, enables to over-approximate the set of states the system may reach. Theoretically, we establish a bound on the suboptimality of the approximate solution with respect to the case of known dynamics. We show that the longer the trial or the more side information is available, the tighter the bound. Empirically, experiments in a high-fidelity F-16 aircraft simulator and MuJoCo's environments such as the Reacher, Swimmer, and Cheetah illustrate the algorithm's effectiveness.
    Transfer Bayesian Meta-learning via Weighted Free Energy Minimization. (arXiv:2106.10711v1 [cs.LG])
    (2 min) Meta-learning optimizes the hyperparameters of a training procedure, such as its initialization, kernel, or learning rate, based on data sampled from a number of auxiliary tasks. A key underlying assumption is that the auxiliary tasks, known as meta-training tasks, share the same generating distribution as the tasks to be encountered at deployment time, known as meta-test tasks. This may, however, not be the case when the test environment differ from the meta-training conditions. To address shifts in task generating distribution between meta-training and meta-testing phases, this paper introduces weighted free energy minimization (WFEM) for transfer meta-learning. We instantiate the proposed approach for non-parametric Bayesian regression and classification via Gaussian Processes (GPs). The method is validated on a toy sinusoidal regression problem, as well as on classification using miniImagenet and CUB data sets, through comparison with standard meta-learning of GP priors as implemented by PACOH.
    Outlier Detection and Spatial Analysis Algorithms. (arXiv:2106.10669v1 [stat.ML])
    (2 min) Outlier detection is a significant area in data mining. It can be either used to pre-process the data prior to an analysis or post the processing phase (before visualization) depending on the effectiveness of the outlier and its importance. Outlier detection extends to several fields such as detection of credit card fraud, network intrusions, machine failure prediction, potential terrorist attacks, and so on. Outliers are those data points with characteristics considerably different. They deviate from the data set causing inconsistencies, noise and anomalies during analysis and result in modification of the original points However, a common misconception is that outliers have to be immediately eliminated or replaced from the data set. Such points could be considered useful if analyzed separately as they could be obtained from a separate mechanism entirely making it important to the research question. This study surveys the different methods of outlier detection for spatial analysis. Spatial data or geospatial data are those that exhibit geographic properties or attributes such as position or areas. An example would be weather data such as precipitation, temperature, wind velocity, and so on collected for a defined region.
    Tag, Copy or Predict: A Unified Weakly-Supervised Learning Framework for Visual Information Extraction using Sequences. (arXiv:2106.10681v1 [cs.CV])
    (2 min) Visual information extraction (VIE) has attracted increasing attention in recent years. The existing methods usually first organized optical character recognition (OCR) results into plain texts and then utilized token-level entity annotations as supervision to train a sequence tagging model. However, it expends great annotation costs and may be exposed to label confusion, and the OCR errors will also significantly affect the final performance. In this paper, we propose a unified weakly-supervised learning framework called TCPN (Tag, Copy or Predict Network), which introduces 1) an efficient encoder to simultaneously model the semantic and layout information in 2D OCR results; 2) a weakly-supervised training strategy that utilizes only key information sequences as supervision; and 3) a flexible and switchable decoder which contains two inference modes: one (Copy or Predict Mode) is to output key information sequences of different categories by copying a token from the input or predicting one in each time step, and the other (Tag Mode) is to directly tag the input sequence in a single forward pass. Our method shows new state-of-the-art performance on several public benchmarks, which fully proves its effectiveness.
    Adversarial Distortion for Learned Video Compression. (arXiv:2004.09508v3 [eess.IV] UPDATED)
    (2 min) In this paper, we present a novel adversarial lossy video compression model. At extremely low bit-rates, standard video coding schemes suffer from unpleasant reconstruction artifacts such as blocking, ringing etc. Existing learned neural approaches to video compression have achieved reasonable success on reducing the bit-rate for efficient transmission and reduce the impact of artifacts to an extent. However, they still tend to produce blurred results under extreme compression. In this paper, we present a deep adversarial learned video compression model that minimizes an auxiliary adversarial distortion objective. We find this adversarial objective to correlate better with human perceptual quality judgement relative to traditional quality metrics such as MS-SSIM and PSNR. Our experiments using a state-of-the-art learned video compression system demonstrate a reduction of perceptual artifacts and reconstruction of detail lost especially under extremely high compression.
    Statistical Inference of the Value Function for Reinforcement Learning in Infinite Horizon Settings. (arXiv:2001.04515v2 [stat.ML] UPDATED)
    (2 min) Reinforcement learning is a general technique that allows an agent to learn an optimal policy and interact with an environment in sequential decision making problems. The goodness of a policy is measured by its value function starting from some initial state. The focus of this paper is to construct confidence intervals (CIs) for a policy's value in infinite horizon settings where the number of decision points diverges to infinity. We propose to model the action-value state function (Q-function) associated with a policy based on series/sieve method to derive its confidence interval. When the target policy depends on the observed data as well, we propose a SequentiAl Value Evaluation (SAVE) method to recursively update the estimated policy and its value estimator. As long as either the number of trajectories or the number of decision points diverges to infinity, we show that the proposed CI achieves nominal coverage even in cases where the optimal policy is not unique. Simulation studies are conducted to back up our theoretical findings. We apply the proposed method to a dataset from mobile health studies and find that reinforcement learning algorithms could help improve patient's health status. A Python implementation of the proposed procedure is available at https://github.com/shengzhang37/SAVE.
    Fast Neural Network Verification via Shadow Prices. (arXiv:1902.07247v3 [cs.LG] UPDATED)
    (2 min) To use neural networks in safety-critical settings it is paramount to provide assurances on their runtime operation. Recent work on ReLU networks has sought to verify whether inputs belonging to a bounded box can ever yield some undesirable output. Input-splitting procedures, a particular type of verification mechanism, do so by recursively partitioning the input set into smaller sets. The efficiency of these methods is largely determined by the number of splits the box must undergo before the property can be verified. In this work, we propose a new technique based on shadow prices that fully exploits the information of the problem yielding a more efficient generation of splits than the state-of-the-art. Results on the Airborne Collision Avoidance System (ACAS) benchmark verification tasks show a considerable reduction in the partitions generated which substantially reduces computation times. These results open the door to improved verification methods for a wide variety of machine learning applications including vision and control.
    COVID-19 Outbreak Prediction and Analysis using Self Reported Symptoms. (arXiv:2101.10266v2 [cs.LG] UPDATED)
    (3 min) It is crucial for policymakers to understand the community prevalence of COVID-19 so combative resources can be effectively allocated and prioritized during the COVID-19 pandemic. Traditionally, community prevalence has been assessed through diagnostic and antibody testing data. However, despite the increasing availability of COVID-19 testing, the required level has not been met in most parts of the globe, introducing a need for an alternative method for communities to determine disease prevalence. This is further complicated by the observation that COVID-19 prevalence and spread varies across different spatial, temporal, and demographics. In this study, we understand trends in the spread of COVID-19 by utilizing the results of self-reported COVID-19 symptoms surveys as an alternative to COVID-19 testing reports. This allows us to assess community disease prevalence, even in areas with low COVID-19 testing ability. Using individually reported symptom data from various populations, our method predicts the likely percentage of the population that tested positive for COVID-19. We do so with a Mean Absolute Error (MAE) of 1.14 and Mean Relative Error (MRE) of 60.40\% with 95\% confidence interval as (60.12, 60.67). This implies that our model predicts +/- 1140 cases than the original in a population of 1 million. In addition, we forecast the location-wise percentage of the population testing positive for the next 30 days using self-reported symptoms data from previous days. The MAE for this method is as low as 0.15 (MRE of 23.61\% with 95\% confidence interval as (23.6, 13.7)) for New York. We present an analysis of these results, exposing various clinical attributes of interest across different demographics. Lastly, we qualitatively analyze how various policy enactments (testing, curfew) affect the prevalence of COVID-19 in a community.
    FNet: Mixing Tokens with Fourier Transforms. (arXiv:2105.03824v2 [cs.CL] UPDATED)
    (2 min) We show that Transformer encoder architectures can be massively sped up, with limited accuracy costs, by replacing the self-attention sublayers with simple linear transformations that "mix" input tokens. These linear transformations, along with standard nonlinearities in feed-forward layers, prove competent at modeling semantic relationships in several text classification tasks. Most surprisingly, we find that replacing the self-attention sublayer in a Transformer encoder with a standard, unparameterized Fourier Transform achieves 92-97% of the accuracy of BERT counterparts on the GLUE benchmark, but trains nearly seven times faster on GPUs and twice as fast on TPUs. The resulting model, FNet, also scales very efficiently to long inputs. Specifically, when compared to the "efficient" Transformers on the Long Range Arena benchmark, FNet matches the accuracy of the most accurate models, but is faster than the fastest models across all sequence lengths on GPUs (and across relatively shorter lengths on TPUs). Finally, FNet has a light memory footprint and is particularly efficient at smaller model sizes: for a fixed speed and accuracy budget, small FNet models outperform Transformer counterparts.
    Noise Learning Based Denoising Autoencoder. (arXiv:2101.07937v2 [cs.LG] UPDATED)
    (2 min) This letter introduces a new denoiser that modifies the structure of denoising autoencoder (DAE), namely noise learning based DAE (nlDAE). The proposed nlDAE learns the noise of the input data. Then, the denoising is performed by subtracting the regenerated noise from the noisy input. Hence, nlDAE is more effective than DAE when the noise is simpler to regenerate than the original data. To validate the performance of nlDAE, we provide three case studies: signal restoration, symbol demodulation, and precise localization. Numerical results suggest that nlDAE requires smaller latent space dimension and smaller training dataset compared to DAE.
    Neighborhood Contrastive Learning for Novel Class Discovery. (arXiv:2106.10731v1 [cs.CV])
    (2 min) In this paper, we address Novel Class Discovery (NCD), the task of unveiling new classes in a set of unlabeled samples given a labeled dataset with known classes. We exploit the peculiarities of NCD to build a new framework, named Neighborhood Contrastive Learning (NCL), to learn discriminative representations that are important to clustering performance. Our contribution is twofold. First, we find that a feature extractor trained on the labeled set generates representations in which a generic query sample and its neighbors are likely to share the same class. We exploit this observation to retrieve and aggregate pseudo-positive pairs with contrastive learning, thus encouraging the model to learn more discriminative representations. Second, we notice that most of the instances are easily discriminated by the network, contributing less to the contrastive loss. To overcome this issue, we propose to generate hard negatives by mixing labeled and unlabeled samples in the feature space. We experimentally demonstrate that these two ingredients significantly contribute to clustering performance and lead our model to outperform state-of-the-art methods by a large margin (e.g., clustering accuracy +13% on CIFAR-100 and +8% on ImageNet).
    Optimal Strategies for Decision Theoretic Online Learning. (arXiv:2106.10717v1 [cs.LG])
    (2 min) We extend the drifting games analysis to continuous time and show that the optimal adversary, if the value function has strictly positive derivative up to fourth order is bronian motion.
    Learning Signal Representations for EEG Cross-Subject Channel Selection and Trial Classification. (arXiv:2106.10633v1 [eess.SP])
    (2 min) EEG technology finds applications in several domains. Currently, most EEG systems require subjects to wear several electrodes on the scalp to be effective. However, several channels might include noisy information, redundant signals, induce longer preparation times and increase computational times of any automated system for EEG decoding. One way to reduce the signal-to-noise ratio and improve classification accuracy is to combine channel selection with feature extraction, but EEG signals are known to present high inter-subject variability. In this work we introduce a novel algorithm for subject-independent channel selection of EEG recordings. Considering multi-channel trial recordings as statistical units and the EEG decoding task as the class of reference, the algorithm (i) exploits channel-specific 1D-Convolutional Neural Networks (1D-CNNs) as feature extractors in a supervised fashion to maximize class separability; (ii) it reduces a high dimensional multi-channel trial representation into a unique trial vector by concatenating the channels' embeddings and (iii) recovers the complex inter-channel relationships during channel selection, by exploiting an ensemble of AutoEncoders (AE) to identify from these vectors the most relevant channels to perform classification. After training, the algorithm can be exploited by transferring only the parametrized subgroup of selected channel-specific 1D-CNNs to new signals from new subjects and obtain low-dimensional and highly informative trial vectors to be fed to any classifier.
    Memory Augmented Optimizers for Deep Learning. (arXiv:2106.10708v1 [cs.LG])
    (2 min) Popular approaches for minimizing loss in data-driven learning often involve an abstraction or an explicit retention of the history of gradients for efficient parameter updates. The aggregated history of gradients nudges the parameter updates in the right direction even when the gradients at any given step are not informative. Although the history of gradients summarized in meta-parameters or explicitly stored in memory has been shown effective in theory and practice, the question of whether $all$ or only a subset of the gradients in the history are sufficient in deciding the parameter updates remains unanswered. In this paper, we propose a framework of memory-augmented gradient descent optimizers that retain a limited view of their gradient history in their internal memory. Such optimizers scale well to large real-life datasets, and our experiments show that the memory augmented extensions of standard optimizers enjoy accelerated convergence and improved performance on a majority of computer vision and language tasks that we considered. Additionally, we prove that the proposed class of optimizers with fixed-size memory converge under assumptions of strong convexity, regardless of which gradients are selected or how they are linearly combined to form the update step.
    Stochastic Graph Neural Networks. (arXiv:2006.02684v2 [eess.SP] UPDATED)
    (2 min) Graph neural networks (GNNs) model nonlinear representations in graph data with applications in distributed agent coordination, control, and planning among others. Current GNN architectures assume ideal scenarios and ignore link fluctuations that occur due to environment, human factors, or external attacks. In these situations, the GNN fails to address its distributed task if the topological randomness is not considered accordingly. To overcome this issue, we put forth the stochastic graph neural network (SGNN) model: a GNN where the distributed graph convolution module accounts for the random network changes. Since stochasticity brings in a new learning paradigm, we conduct a statistical analysis on the SGNN output variance to identify conditions the learned filters should satisfy for achieving robust transference to perturbed scenarios, ultimately revealing the explicit impact of random link losses. We further develop a stochastic gradient descent (SGD) based learning process for the SGNN and derive conditions on the learning rate under which this learning process converges to a stationary point. Numerical results corroborate our theoretical findings and compare the benefits of SGNN robust transference with a conventional GNN that ignores graph perturbations during learning.
    Neural Network Classifier as Mutual Information Evaluator. (arXiv:2106.10471v1 [cs.LG])
    (2 min) Cross-entropy loss with softmax output is a standard choice to train neural network classifiers. We give a new view of neural network classifiers with softmax and cross-entropy as mutual information evaluators. We show that when the dataset is balanced, training a neural network with cross-entropy maximises the mutual information between inputs and labels through a variational form of mutual information. Thereby, we develop a new form of softmax that also converts a classifier to a mutual information evaluator when the dataset is imbalanced. Experimental results show that the new form leads to better classification accuracy, in particular for imbalanced datasets.
    Learning and Generalization in Overparameterized Normalizing Flows. (arXiv:2106.10535v1 [cs.LG])
    (2 min) In supervised learning, it is known that overparameterized neural networks with one hidden layer provably and efficiently learn and generalize, when trained using stochastic gradient descent with sufficiently small learning rate and suitable initialization. In contrast, the benefit of overparameterization in unsupervised learning is not well understood. Normalizing flows (NFs) constitute an important class of models in unsupervised learning for sampling and density estimation. In this paper, we theoretically and empirically analyze these models when the underlying neural network is one-hidden-layer overparameterized network. Our main contributions are two-fold: (1) On the one hand, we provide theoretical and empirical evidence that for a class of NFs containing most of the existing NF models, overparametrization hurts training. (2) On the other hand, we prove that unconstrained NFs, a recently introduced model, can efficiently learn any reasonable data distribution under minimal assumptions when the underlying network is overparametrized.
    Cogradient Descent for Dependable Learning. (arXiv:2106.10617v1 [cs.LG])
    (2 min) Conventional gradient descent methods compute the gradients for multiple variables through the partial derivative. Treating the coupled variables independently while ignoring the interaction, however, leads to an insufficient optimization for bilinear models. In this paper, we propose a dependable learning based on Cogradient Descent (CoGD) algorithm to address the bilinear optimization problem, providing a systematic way to coordinate the gradients of coupling variables based on a kernelized projection function. CoGD is introduced to solve bilinear problems when one variable is with sparsity constraint, as often occurs in modern learning paradigms. CoGD can also be used to decompose the association of features and weights, which further generalizes our method to better train convolutional neural networks (CNNs) and improve the model capacity. CoGD is applied in representative bilinear problems, including image reconstruction, image inpainting, network pruning and CNN training. Extensive experiments show that CoGD improves the state-of-the-arts by significant margins. Code is available at {https://github.com/bczhangbczhang/CoGD}.
    A Unified View of Algorithms for Path Planning Using Probabilistic Inference on Factor Graphs. (arXiv:2106.10442v1 [cs.LG])
    (2 min) Even if path planning can be solved using standard techniques from dynamic programming and control, the problem can also be approached using probabilistic inference. The algorithms that emerge using the latter framework bear some appealing characteristics that qualify the probabilistic approach as a powerful alternative to the more traditional control formulations. The idea of using estimation on stochastic models to solve control problems is not new and the inference approach considered here falls under the rubric of Active Inference (AI) and Control as Inference (CAI). In this work, we look at the specific recursions that arise from various cost functions that, although they may appear similar in scope, bear noticeable differences, at least when applied to typical path planning problems. We start by posing the path planning problem on a probabilistic factor graph, and show how the various algorithms translate into specific message composition rules. We then show how this unified approach, presented both in probability space and in log space, provides a very general framework that includes the Sum-product, the Max-product, Dynamic programming and mixed Reward/Entropy criteria-based algorithms. The framework also expands algorithmic design options for smoother or sharper policy distributions, including generalized Sum/Max-product algorithm, a Smooth Dynamic programming algorithm and modified versions of the Reward/Entropy recursions. We provide a comprehensive table of recursions and a comparison through simulations, first on a synthetic small grid with a single goal with obstacles, and then on a grid extrapolated from a real-world scene with multiple goals and a semantic map.
    Large-Scale Network Embedding in Apache Spark. (arXiv:2106.10620v1 [cs.SI])
    (2 min) Network embedding has been widely used in social recommendation and network analysis, such as recommendation systems and anomaly detection with graphs. However, most of previous approaches cannot handle large graphs efficiently, due to that (i) computation on graphs is often costly and (ii) the size of graph or the intermediate results of vectors could be prohibitively large, rendering it difficult to be processed on a single machine. In this paper, we propose an efficient and effective distributed algorithm for network embedding on large graphs using Apache Spark, which recursively partitions a graph into several small-sized subgraphs to capture the internal and external structural information of nodes, and then computes the network embedding for each subgraph in parallel. Finally, by aggregating the outputs on all subgraphs, we obtain the embeddings of nodes in a linear cost. After that, we demonstrate in various experiments that our proposed approach is able to handle graphs with billions of edges within a few hours and is at least 4 times faster than the state-of-the-art approaches. Besides, it achieves up to $4.25\%$ and $4.27\%$ improvements on link prediction and node classification tasks respectively. In the end, we deploy the proposed algorithms in two online games of Tencent with the applications of friend recommendation and item recommendation, which improve the competitors by up to $91.11\%$ in running time and up to $12.80\%$ in the corresponding evaluation metrics.
    Heterogeneous Multi-task Learning with Expert Diversity. (arXiv:2106.10595v1 [cs.LG])
    (2 min) Predicting multiple heterogeneous biological and medical targets is a challenge for traditional deep learning models. In contrast to single-task learning, in which a separate model is trained for each target, multi-task learning (MTL) optimizes a single model to predict multiple related targets simultaneously. To address this challenge, we propose the Multi-gate Mixture-of-Experts with Exclusivity (MMoEEx). Our work aims to tackle the heterogeneous MTL setting, in which the same model optimizes multiple tasks with different characteristics. Such a scenario can overwhelm current MTL approaches due to the challenges in balancing shared and task-specific representations and the need to optimize tasks with competing optimization paths. Our method makes two key contributions: first, we introduce an approach to induce more diversity among experts, thus creating representations more suitable for highly imbalanced and heterogenous MTL learning; second, we adopt a two-step optimization [6, 11] approach to balancing the tasks at the gradient level. We validate our method on three MTL benchmark datasets, including Medical Information Mart for Intensive Care (MIMIC-III) and PubChem BioAssay (PCBA).
    Improving Compositional Generalization in Classification Tasks via Structure Annotations. (arXiv:2106.10434v1 [cs.LG])
    (2 min) Compositional generalization is the ability to generalize systematically to a new data distribution by combining known components. Although humans seem to have a great ability to generalize compositionally, state-of-the-art neural models struggle to do so. In this work, we study compositional generalization in classification tasks and present two main contributions. First, we study ways to convert a natural language sequence-to-sequence dataset to a classification dataset that also requires compositional generalization. Second, we show that providing structural hints (specifically, providing parse trees and entity links as attention masks for a Transformer model) helps compositional generalization.
    CoreGen: Contextualized Code Representation Learning for Commit Message Generation. (arXiv:2007.06934v3 [cs.CL] UPDATED)
    (2 min) Automatic generation of high-quality commit messages for code commits can substantially facilitate software developers' works and coordination. However, the semantic gap between source code and natural language poses a major challenge for the task. Several studies have been proposed to alleviate the challenge but none explicitly involves code contextual information during commit message generation. Specifically, existing research adopts static embedding for code tokens, which maps a token to the same vector regardless of its context. In this paper, we propose a novel Contextualized code representation learning strategy for commit message Generation (CoreGen). CoreGen first learns contextualized code representations which exploit the contextual information behind code commit sequences. The learned representations of code commits built upon Transformer are then fine-tuned for downstream commit message generation. Experiments on the benchmark dataset demonstrate the superior effectiveness of our model over the baseline models with at least 28.18% improvement in terms of BLEU-4 score. Furthermore, we also highlight the future opportunities in training contextualized code representations on larger code corpus as a solution to low-resource tasks and adapting the contextualized code representation framework to other code-to-text generation tasks.
    Amazon SageMaker Automatic Model Tuning: Scalable Gradient-Free Optimization. (arXiv:2012.08489v2 [cs.LG] UPDATED)
    (2 min) Tuning complex machine learning systems is challenging. Machine learning typically requires to set hyperparameters, be it regularization, architecture, or optimization parameters, whose tuning is critical to achieve good predictive performance. To democratize access to machine learning systems, it is essential to automate the tuning. This paper presents Amazon SageMaker Automatic Model Tuning (AMT), a fully managed system for gradient-free optimization at scale. AMT finds the best version of a trained machine learning model by repeatedly evaluating it with different hyperparameter configurations. It leverages either random search or Bayesian optimization to choose the hyperparameter values resulting in the best model, as measured by the metric chosen by the user. AMT can be used with built-in algorithms, custom algorithms, and Amazon SageMaker pre-built containers for machine learning frameworks. We discuss the core functionality, system architecture, our design principles, and lessons learned. We also describe more advanced features of AMT, such as automated early stopping and warm-starting, showing in experiments their benefits to users.
    Intriguing Properties of Contrastive Losses. (arXiv:2011.02803v2 [cs.LG] UPDATED)
    (2 min) Contrastive loss and its variants have become very popular recently for learning visual representations without supervision. In this work, we study three intriguing properties of contrastive learning. We first generalize the standard contrastive loss to a broader family of losses, and we find that various instantiations of the generalized loss perform similarly under the presence of a multi-layer non-linear projection head. We then study if instance-based contrastive learning (such as in SimCLR, MoCo, BYOL, and so on, which are based on global image representation) can learn well on images with multiple objects present. We find that meaningful hierarchical local features can be learned despite the fact that these objectives operate on global instance-level features. Finally, we study an intriguing phenomenon of feature suppression among competing features shared across augmented views, such as "color distribution" vs "object class". We construct datasets with explicit and controllable competing features, and show that, for contrastive learning, a few bits of easy-to-learn shared features can suppress, and even fully prevent, the learning of other sets of competing features. In scenarios where there are multiple objects in an image, the dominant object would suppress the learning of smaller objects. Existing contrastive learning methods critically rely on data augmentation to favor certain sets of features over others, and face potential limitation for scenarios where existing augmentations cannot fully address the feature suppression. This poses open challenges to existing contrastive learning techniques.
    Dependency Structure Misspecification in Multi-Source Weak Supervision Models. (arXiv:2106.10302v1 [cs.LG])
    (2 min) Data programming (DP) has proven to be an attractive alternative to costly hand-labeling of data. In DP, users encode domain knowledge into \emph{labeling functions} (LF), heuristics that label a subset of the data noisily and may have complex dependencies. A label model is then fit to the LFs to produce an estimate of the unknown class label. The effects of label model misspecification on test set performance of a downstream classifier are understudied. This presents a serious awareness gap to practitioners, in particular since the dependency structure among LFs is frequently ignored in field applications of DP. We analyse modeling errors due to structure over-specification. We derive novel theoretical bounds on the modeling error and empirically show that this error can be substantial, even when modeling a seemingly sensible structure.
    Task Attended Meta-Learning for Few-Shot Learning. (arXiv:2106.10642v1 [cs.LG])
    (2 min) Meta-learning (ML) has emerged as a promising direction in learning models under constrained resource settings like few-shot learning. The popular approaches for ML either learn a generalizable initial model or a generic parametric optimizer through episodic training. The former approaches leverage the knowledge from a batch of tasks to learn an optimal prior. In this work, we study the importance of a batch for ML. Specifically, we first incorporate a batch episodic training regimen to improve the learning of the generic parametric optimizer. We also hypothesize that the common assumption in batch episodic training that each task in a batch has an equal contribution to learning an optimal meta-model need not be true. We propose to weight the tasks in a batch according to their "importance" in improving the meta-model's learning. To this end, we introduce a training curriculum motivated by selective focus in humans, called task attended meta-training, to weight the tasks in a batch. Task attention is a standalone module that can be integrated with any batch episodic training regimen. The comparisons of the models with their non-task-attended counterparts on complex datasets like miniImageNet and tieredImageNet validate its effectiveness.
    Nearly Minimax Optimal Adversarial Imitation Learning with Known and Unknown Transitions. (arXiv:2106.10424v1 [cs.LG])
    (2 min) This paper is dedicated to designing provably efficient adversarial imitation learning (AIL) algorithms that directly optimize policies from expert demonstrations. Firstly, we develop a transition-aware AIL algorithm named TAIL with an expert sample complexity of $\tilde{O}(H^{3/2} |S|/\varepsilon)$ under the known transition setting, where $H$ is the planning horizon, $|S|$ is the state space size and $\varepsilon$ is desired policy value gap. This improves upon the previous best bound of $\tilde{O}(H^2 |S| / \varepsilon^2)$ for AIL methods and matches the lower bound of $\tilde{\Omega} (H^{3/2} |S|/\varepsilon)$ in [Rajaraman et al., 2021] up to a logarithmic factor. The key ingredient of TAIL is a fine-grained estimator for expert state-action distribution, which explicitly utilizes the transition function information. Secondly, considering practical settings where the transition functions are usually unknown but environment interaction is allowed, we accordingly develop a model-based transition-aware AIL algorithm named MB-TAIL. In particular, MB-TAIL builds an empirical transition model by interacting with the environment and performs imitation under the recovered empirical model. The interaction complexity of MB-TAIL is $\tilde{O} (H^3 |S|^2 |A| / \varepsilon^2)$, which improves the best known result of $\tilde{O} (H^4 |S|^2 |A| / \varepsilon^2)$ in [Shani et al., 2021]. Finally, our theoretical results are supported by numerical evaluation and detailed analysis on two challenging MDPs.
    Stability of Graph Convolutional Neural Networks to Stochastic Perturbations. (arXiv:2106.10526v1 [cs.LG])
    (2 min) Graph convolutional neural networks (GCNNs) are nonlinear processing tools to learn representations from network data. A key property of GCNNs is their stability to graph perturbations. Current analysis considers deterministic perturbations but fails to provide relevant insights when topological changes are random. This paper investigates the stability of GCNNs to stochastic graph perturbations induced by link losses. In particular, it proves the expected output difference between the GCNN over random perturbed graphs and the GCNN over the nominal graph is upper bounded by a factor that is linear in the link loss probability. We perform the stability analysis in the graph spectral domain such that the result holds uniformly for any graph. This result also shows the role of the nonlinearity and the architecture width and depth, and allows identifying handle to improve the GCNN robustness. Numerical simulations on source localization and robot swarm control corroborate our theoretical findings.
    Better Training using Weight-Constrained Stochastic Dynamics. (arXiv:2106.10704v1 [cs.LG])
    (2 min) We employ constraints to control the parameter space of deep neural networks throughout training. The use of customized, appropriately designed constraints can reduce the vanishing/exploding gradients problem, improve smoothness of classification boundaries, control weight magnitudes and stabilize deep neural networks, and thus enhance the robustness of training algorithms and the generalization capabilities of neural networks. We provide a general approach to efficiently incorporate constraints into a stochastic gradient Langevin framework, allowing enhanced exploration of the loss landscape. We also present specific examples of constrained training methods motivated by orthogonality preservation for weight matrices and explicit weight normalizations. Discretization schemes are provided both for the overdamped formulation of Langevin dynamics and the underdamped form, in which momenta further improve sampling efficiency. These optimization schemes can be used directly, without needing to adapt neural network architecture design choices or to modify the objective with regularization terms, and see performance improvements in classification tasks.
    A Multilayered Block Network Model to Forecast Large Dynamic Transportation Graphs: an Application to US Air Transport. (arXiv:1911.13136v3 [stat.ML] UPDATED)
    (2 min) Dynamic transportation networks have been analyzed for years by means of static graph-based indicators in order to study the temporal evolution of relevant network components, and to reveal complex dependencies that would not be easily detected by a direct inspection of the data. This paper presents a state-of-the-art latent network model to forecast multilayer dynamic graphs that are increasingly common in transportation and proposes a community-based extension to reduce the computational burden. Flexible time series analysis is obtained by modeling the probability of edges between vertices through latent Gaussian processes. The models and Bayesian inference are illustrated on a sample of 10-year data from four major airlines within the US air transportation system. Results show how the estimated latent parameters from the models are related to the airline's connectivity dynamics, and their ability to project the multilayer graph into the future for out-of-sample full network forecasts, while stochastic blockmodeling allows for the identification of relevant communities. Reliable network predictions would allow policy-makers to better understand the dynamics of the transport system, and help in their planning on e.g. route development, or the deployment of new regulations.
    Combinatorial Semi-Bandit in the Non-Stationary Environment. (arXiv:2002.03580v2 [cs.LG] UPDATED)
    (2 min) In this paper, we investigate the non-stationary combinatorial semi-bandit problem, both in the switching case and in the dynamic case. In the general case where (a) the reward function is non-linear, (b) arms may be probabilistically triggered, and (c) only approximate offline oracle exists \cite{wang2017improving}, our algorithm achieves $\tilde{\mathcal{O}}(\sqrt{\mathcal{S} T})$ distribution-dependent regret in the switching case, and $\tilde{\mathcal{O}}(\mathcal{V}^{1/3}T^{2/3})$ in the dynamic case, where $\mathcal S$ is the number of switchings and $\mathcal V$ is the sum of the total ``distribution changes''. The regret bounds in both scenarios are nearly optimal, but our algorithm needs to know the parameter $\mathcal S$ or $\mathcal V$ in advance. We further show that by employing another technique, our algorithm no longer needs to know the parameters $\mathcal S$ or $\mathcal V$ but the regret bounds could become suboptimal. In a special case where the reward function is linear and we have an exact oracle, we design a parameter-free algorithm that achieves nearly optimal regret both in the switching case and in the dynamic case without knowing the parameters in advance.
    Non-parametric Differentially Private Confidence Intervals for the Median. (arXiv:2106.10333v1 [cs.CR])
    (2 min) Differential privacy is a restriction on data processing algorithms that provides strong confidentiality guarantees for individual records in the data. However, research on proper statistical inference, that is, research on properly quantifying the uncertainty of the (noisy) sample estimate regarding the true value in the population, is currently still limited. This paper proposes and evaluates several strategies to compute valid differentially private confidence intervals for the median. Instead of computing a differentially private point estimate and deriving its uncertainty, we directly estimate the interval bounds and discuss why this approach is superior if ensuring privacy is important. We also illustrate that addressing both sources of uncertainty--the error from sampling and the error from protecting the output--simultaneously should be preferred over simpler approaches that incorporate the uncertainty in a sequential fashion. We evaluate the performance of the different algorithms under various parameter settings in extensive simulation studies and demonstrate how the findings could be applied in practical settings using data from the 1940 Decennial Census.
    On predicting research grants productivity. (arXiv:2106.10700v1 [cs.DL])
    (2 min) Understanding the reasons associated with successful proposals is of paramount importance to improve evaluation processes. In this context, we analyzed whether bibliometric features are able to predict the success of research grants. We extracted features aiming at characterizing the academic history of Brazilian researchers, including research topics, affiliations, number of publications and visibility. The extracted features were then used to predict grants productivity via machine learning in three major research areas, namely Medicine, Dentistry and Veterinary Medicine. We found that research subject and publication history play a role in predicting productivity. In addition, institution-based features turned out to be relevant when combined with other features. While the best results outperformed text-based attributes, the evaluated features were not highly discriminative. Our findings indicate that predicting grants success, at least with the considered set of bibliometric features, is not a trivial task.
    Mostly Harmless Machine Learning: Learning Optimal Instruments in Linear IV Models. (arXiv:2011.06158v3 [econ.EM] UPDATED)
    (2 min) We offer straightforward theoretical results that justify incorporating machine learning in the standard linear instrumental variable setting. The key idea is to use machine learning, combined with sample-splitting, to predict the treatment variable from the instrument and any exogenous covariates, and then use this predicted treatment and the covariates as technical instruments to recover the coefficients in the second-stage. This allows the researcher to extract non-linear co-variation between the treatment and instrument that may dramatically improve estimation precision and robustness by boosting instrument strength. Importantly, we constrain the machine-learned predictions to be linear in the exogenous covariates, thus avoiding spurious identification arising from non-linear relationships between the treatment and the covariates. We show that this approach delivers consistent and asymptotically normal estimates under weak conditions and that it may be adapted to be semiparametrically efficient (Chamberlain, 1992). Our method preserves standard intuitions and interpretations of linear instrumental variable methods, including under weak identification, and provides a simple, user-friendly upgrade to the applied economics toolbox. We illustrate our method with an example in law and criminal justice, examining the causal effect of appellate court reversals on district court sentencing decisions.
    On the Cryptographic Hardness of Learning Single Periodic Neurons. (arXiv:2106.10744v1 [cs.LG])
    (2 min) We show a simple reduction which demonstrates the cryptographic hardness of learning a single periodic neuron over isotropic Gaussian distributions in the presence of noise. More precisely, our reduction shows that any polynomial-time algorithm (not necessarily gradient-based) for learning such functions under small noise implies a polynomial-time quantum algorithm for solving worst-case lattice problems, whose hardness form the foundation of lattice-based cryptography. Our core hard family of functions, which are well-approximated by one-layer neural networks, take the general form of a univariate periodic function applied to an affine projection of the data. These functions have appeared in previous seminal works which demonstrate their hardness against gradient-based (Shamir'18), and Statistical Query (SQ) algorithms (Song et al.'17). We show that if (polynomially) small noise is added to the labels, the intractability of learning these functions applies to all polynomial-time algorithms under the aforementioned cryptographic assumptions. Moreover, we demonstrate the necessity of noise in the hardness result by designing a polynomial-time algorithm for learning certain families of such functions under exponentially small adversarial noise. Our proposed algorithm is not a gradient-based or an SQ algorithm, but is rather based on the celebrated Lenstra-Lenstra-Lov\'asz (LLL) lattice basis reduction algorithm. Furthermore, in the absence of noise, this algorithm can be directly applied to solve CLWE detection (Bruna et al.'21) and phase retrieval with an optimal sample complexity of $d+1$ samples. In the former case, this improves upon the quadratic-in-$d$ sample complexity required in (Bruna et al.'21). In the latter case, this improves upon the state-of-the-art AMP-based algorithm, which requires approximately $1.128d$ samples (Barbier et al.'19).
    Is Shapley Value fair? Improving Client Selection for Mavericks in Federated Learning. (arXiv:2106.10734v1 [cs.LG])
    (2 min) Shapley Value is commonly adopted to measure and incentivize client participation in federated learning. In this paper, we show -- theoretically and through simulations -- that Shapley Value underestimates the contribution of a common type of client: the Maverick. Mavericks are clients that differ both in data distribution and data quantity and can be the sole owners of certain types of data. Selecting the right clients at the right moment is important for federated learning to reduce convergence times and improve accuracy. We propose FedEMD, an adaptive client selection strategy based on the Wasserstein distance between the local and global data distributions. As FedEMD adapts the selection probability such that Mavericks are preferably selected when the model benefits from improvement on rare classes, it consistently ensures the fast convergence in the presence of different types of Mavericks. Compared to existing strategies, including Shapley Value-based ones, FedEMD improves the convergence of neural network classifiers by at least 26.9% for FedAvg aggregation compared with the state of the art.
    Quantum Machine Learning: Fad or Future?. (arXiv:2106.10714v1 [quant-ph])
    (2 min) For the last few decades, classical machine learning has allowed us to improve the lives of many through automation, natural language processing, predictive analytics and much more. However, a major concern is the fact that we're fast approach the threshold of the maximum possible computational capacity available to us by the means of classical computing devices including CPUs, GPUs and Application Specific Integrated Circuits (ASICs). This is due to the exponential increase in model sizes which now have parameters in the magnitude of billions and trillions, requiring a significant amount of computing resources across a significant amount of time, just to converge one single model. To observe the efficacy of using quantum computing for certain machine learning tasks and explore the improved potential of convergence, error reduction and robustness to noisy data, this paper will look forth to test and verify the aspects in which quantum machine learning can help improve over classical machine learning approaches while also shedding light on the likely limitations that have prevented quantum approaches to become the mainstream. A major focus will be to recreate the work by Farhi et al and conduct experiments using their theory of performing machine learning in a quantum context, with assistance from the Tensorflow Quantum documentation.
    Group-Structured Adversarial Training. (arXiv:2106.10324v1 [cs.LG])
    (2 min) Robust training methods against perturbations to the input data have received great attention in the machine learning literature. A standard approach in this direction is adversarial training which learns a model using adversarially-perturbed training samples. However, adversarial training performs suboptimally against perturbations structured across samples such as universal and group-sparse shifts that are commonly present in biological data such as gene expression levels of different tissues. In this work, we seek to close this optimality gap and introduce Group-Structured Adversarial Training (GSAT) which learns a model robust to perturbations structured across samples. We formulate GSAT as a non-convex concave minimax optimization problem which minimizes a group-structured optimal transport cost. Specifically, we focus on the applications of GSAT for group-sparse and rank-constrained perturbations modeled using group and nuclear norm penalties. In order to solve GSAT's non-smooth optimization problem in those cases, we propose a new minimax optimization algorithm called GDADMM by combining Gradient Descent Ascent (GDA) and Alternating Direction Method of Multipliers (ADMM). We present several applications of the GSAT framework to gain robustness against structured perturbations for image recognition and computational biology datasets.
    CAMERAS: Enhanced Resolution And Sanity preserving Class Activation Mapping for image saliency. (arXiv:2106.10649v1 [cs.CV])
    (2 min) Backpropagation image saliency aims at explaining model predictions by estimating model-centric importance of individual pixels in the input. However, class-insensitivity of the earlier layers in a network only allows saliency computation with low resolution activation maps of the deeper layers, resulting in compromised image saliency. Remedifying this can lead to sanity failures. We propose CAMERAS, a technique to compute high-fidelity backpropagation saliency maps without requiring any external priors and preserving the map sanity. Our method systematically performs multi-scale accumulation and fusion of the activation maps and backpropagated gradients to compute precise saliency maps. From accurate image saliency to articulation of relative importance of input features for different models, and precise discrimination between model perception of visually similar objects, our high-resolution mapping offers multiple novel insights into the black-box deep visual models, which are presented in the paper. We also demonstrate the utility of our saliency maps in adversarial setup by drastically reducing the norm of attack signals by focusing them on the precise regions identified by our maps. Our method also inspires new evaluation metrics and a sanity check for this developing research direction. Code is available here https://github.com/VisMIL/CAMERAS
    Overcoming Catastrophic Forgetting by Generative Regularization. (arXiv:1912.01238v3 [cs.LG] UPDATED)
    (2 min) In this paper, we propose a new method to overcome catastrophic forgetting by adding generative regularization to Bayesian inference framework. Bayesian method provides a general framework for continual learning. We could further construct a generative regularization term for all given classification models by leveraging energy-based models and Langevin-dynamic sampling to enrich the features learned in each task. By combining discriminative and generative loss together, we empirically show that the proposed method outperforms state-of-the-art methods on a variety of tasks, avoiding catastrophic forgetting in continual learning. In particular, the proposed method outperforms baseline methods over 15% on the Fashion-MNIST dataset and 10% on the CUB dataset
    Graph Neural Networks for Learning Real-Time Prices in Electricity Market. (arXiv:2106.10529v1 [cs.LG])
    (2 min) Solving the optimal power flow (OPF) problem in real-time electricity market improves the efficiency and reliability in the integration of low-carbon energy resources into the power grids. To address the scalability and adaptivity issues of existing end-to-end OPF learning solutions, we propose a new graph neural network (GNN) framework for predicting the electricity market prices from solving OPFs. The proposed GNN-for-OPF framework innovatively exploits the locality property of prices and introduces physics-aware regularization, while attaining reduced model complexity and fast adaptivity to varying grid topology. Numerical tests have validated the learning efficiency and adaptivity improvements of our proposed method over existing approaches.
    Sparse Training via Boosting Pruning Plasticity with Neuroregeneration. (arXiv:2106.10404v1 [cs.LG])
    (2 min) Works on lottery ticket hypothesis (LTH) and single-shot network pruning (SNIP) have raised a lot of attention currently on post-training pruning (iterative magnitude pruning), and before-training pruning (pruning at initialization). The former method suffers from an extremely large computation cost and the latter category of methods usually struggles with insufficient performance. In comparison, during-training pruning, a class of pruning methods that simultaneously enjoys the training/inference efficiency and the comparable performance, temporarily, has been less explored. To better understand during-training pruning, we quantitatively study the effect of pruning throughout training from the perspective of pruning plasticity (the ability of the pruned networks to recover the original performance). Pruning plasticity can help explain several other empirical observations about neural network pruning in literature. We further find that pruning plasticity can be substantially improved by injecting a brain-inspired mechanism called neuroregeneration, i.e., to regenerate the same number of connections as pruned. Based on the insights from pruning plasticity, we design a novel gradual magnitude pruning (GMP) method, named gradual pruning with zero-cost neuroregeneration (GraNet), and its dynamic sparse training (DST) variant (GraNet-ST). Both of them advance state of the art. Perhaps most impressively, the latter for the first time boosts the sparse-to-sparse training performance over various dense-to-sparse methods by a large margin with ResNet-50 on ImageNet. We will release all codes.
    A Pilot Study on Visually Stimulated Cognitive Tasks for EEG-Based Dementia Recognition. (arXiv:2103.03854v2 [cs.LG] UPDATED)
    (2 min) In the status quo, dementia is yet to be cured. Precise diagnosis prior to the onset of the symptoms can prevent the rapid progression of the emerging cognitive impairment. Recent progress has shown that Electroencephalography (EEG) is the promising and cost-effective test to facilitate the detection of neurocognitive disorders. However, most of the existing works have been using only resting-state EEG. The efficiencies of EEG signals from various cognitive tasks, for dementia classification, have yet to be thoroughly investigated. In this study, we designed four cognitive tasks that engage different cognitive performances: attention, working memory, and executive function. We investigated these tasks by using statistical analysis on both time and frequency domains of EEG signals from three classes of human subjects: Dementia (DEM), Mild Cognitive Impairment (MCI), and Normal Control (NC). We also further evaluated the classification performances of two features extraction methods: Principal Component Analysis (PCA) and Filter Bank Common Spatial Pattern (FBCSP). We found that the working memory related tasks yielded good performances for dementia recognition in both cases using PCA and FBCSP. Moreover, FBCSP with features combination from four tasks revealed the best sensitivity of 0.87 and the specificity of 0.80. To our best knowledge, this is the first work that concurrently investigated several cognitive tasks for dementia recognition using both statistical analysis and classification scores. Our results give essential information to design and aid in conducting further experimental tasks to early diagnose dementia patients.
    Boosting Offline Reinforcement Learning with Residual Generative Modeling. (arXiv:2106.10411v1 [cs.LG])
    (2 min) Offline reinforcement learning (RL) tries to learn the near-optimal policy with recorded offline experience without online exploration. Current offline RL research includes: 1) generative modeling, i.e., approximating a policy using fixed data; and 2) learning the state-action value function. While most research focuses on the state-action function part through reducing the bootstrapping error in value function approximation induced by the distribution shift of training data, the effects of error propagation in generative modeling have been neglected. In this paper, we analyze the error in generative modeling. We propose AQL (action-conditioned Q-learning), a residual generative model to reduce policy approximation error for offline RL. We show that our method can learn more accurate policy approximations in different benchmark datasets. In addition, we show that the proposed offline RL method can learn more competitive AI agents in complex control tasks under the multiplayer online battle arena (MOBA) game Honor of Kings.
    Model-Agnostic Explanations using Minimal Forcing Subsets. (arXiv:2011.00639v3 [cs.LG] UPDATED)
    (2 min) How can we find a subset of training samples that are most responsible for a specific prediction made by a complex black-box machine learning model? More generally, how can we explain the model's decisions to end-users in a transparent way? We propose a new model-agnostic algorithm to identify a minimal set of training samples that are indispensable for a given model's decision at a particular test point, i.e., the model's decision would have changed upon the removal of this subset from the training dataset. Our algorithm identifies such a set of "indispensable" samples iteratively by solving a constrained optimization problem. Further, we speed up the algorithm through efficient approximations and provide theoretical justification for its performance. To demonstrate the applicability and effectiveness of our approach, we apply it to a variety of tasks including data poisoning detection, training set debugging and understanding loan decisions. The results show that our algorithm is an effective and easy-to-comprehend tool that helps to better understand local model behavior, and therefore facilitates the adoption of machine learning in domains where such understanding is a requisite.
    Towards a Query-Optimal and Time-Efficient Algorithm for Clustering with a Faulty Oracle. (arXiv:2106.10374v1 [cs.LG])
    (2 min) Motivated by applications in crowdsourced entity resolution in database, signed edge prediction in social networks and correlation clustering, Mazumdar and Saha [NIPS 2017] proposed an elegant theoretical model for studying clustering with a faulty oracle. In this model, given a set of $n$ items which belong to $k$ unknown groups (or clusters), our goal is to recover the clusters by asking pairwise queries to an oracle. This oracle can answer the query that ``do items $u$ and $v$ belong to the same cluster?''. However, the answer to each pairwise query errs with probability $\varepsilon$, for some $\varepsilon\in(0,\frac12)$. Mazumdar and Saha provided two algorithms under this model: one algorithm is query-optimal while time-inefficient (i.e., running in quasi-polynomial time), the other is time efficient (i.e., in polynomial time) while query-suboptimal. Larsen, Mitzenmacher and Tsourakakis [WWW 2020] then gave a new time-efficient algorithm for the special case of $2$ clusters, which is query-optimal if the bias $\delta:=1-2\varepsilon$ of the model is large. It was left as an open question whether one can obtain a query-optimal, time-efficient algorithm for the general case of $k$ clusters and other regimes of $\delta$. In this paper, we make progress on the above question and provide a time-efficient algorithm with nearly-optimal query complexity (up to a factor of $O(\log^2 n)$) for all constant $k$ and any $\delta$ in the regime when information-theoretic recovery is possible. Our algorithm is built on a connection to the stochastic block model.
    MIA-COV19D: COVID-19 Detection through 3-D Chest CT Image Analysis. (arXiv:2106.07524v2 [eess.IV] UPDATED)
    (2 min) Early and reliable COVID-19 diagnosis based on chest 3-D CT scans can assist medical specialists in vital circumstances. Deep learning methodologies constitute a main approach for chest CT scan analysis and disease prediction. However, large annotated databases are necessary for developing deep learning models that are able to provide COVID-19 diagnosis across various medical environments in different countries. Due to privacy issues, publicly available COVID-19 CT datasets are highly difficult to obtain, which hinders the research and development of AI-enabled diagnosis methods of COVID-19 based on CT scans. In this paper we present the COV19-CT-DB database which is annotated for COVID-19, consisting of about 5,000 3-D CT scans, We have split the database in training, validation and test datasets. The former two datasets can be used for training and validation of machine learning models, while the latter will be used for evaluation of the developed models. We also present a deep learning approach, based on a CNN-RNN network and report its performance on the COVID19-CT-DB database.
    Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability. (arXiv:2103.00065v2 [cs.LG] UPDATED)
    (2 min) We empirically demonstrate that full-batch gradient descent on neural network training objectives typically operates in a regime we call the Edge of Stability. In this regime, the maximum eigenvalue of the training loss Hessian hovers just above the numerical value $2 / \text{(step size)}$, and the training loss behaves non-monotonically over short timescales, yet consistently decreases over long timescales. Since this behavior is inconsistent with several widespread presumptions in the field of optimization, our findings raise questions as to whether these presumptions are relevant to neural network training. We hope that our findings will inspire future efforts aimed at rigorously understanding optimization at the Edge of Stability. Code is available at https://github.com/locuslab/edge-of-stability.
    Universal Rate-Distortion-Perception Representations for Lossy Compression. (arXiv:2106.10311v1 [cs.IT])
    (2 min) In the context of lossy compression, Blau & Michaeli (2019) adopt a mathematical notion of perceptual quality and define the information rate-distortion-perception function, generalizing the classical rate-distortion tradeoff. We consider the notion of universal representations in which one may fix an encoder and vary the decoder to achieve any point within a collection of distortion and perception constraints. We prove that the corresponding information-theoretic universal rate-distortion-perception function is operationally achievable in an approximate sense. Under MSE distortion, we show that the entire distortion-perception tradeoff of a Gaussian source can be achieved by a single encoder of the same rate asymptotically. We then characterize the achievable distortion-perception region for a fixed representation in the case of arbitrary distributions, identify conditions under which the aforementioned results continue to hold approximately, and study the case when the rate is not fixed in advance. This motivates the study of practical constructions that are approximately universal across the RDP tradeoff, thereby alleviating the need to design a new encoder for each objective. We provide experimental results on MNIST and SVHN suggesting that on image compression tasks, the operational tradeoffs achieved by machine learning models with a fixed encoder suffer only a small penalty when compared to their variable encoder counterparts.
    Time series forecasting with Gaussian Processes needs priors. (arXiv:2009.08102v2 [stat.ML] UPDATED)
    (2 min) Automatic forecasting is the task of receiving a time series and returning a forecast for the next time steps without any human intervention. Gaussian Processes (GPs) are a powerful tool for modeling time series, but so far there are no competitive approaches for automatic forecasting based on GPs. We propose practical solutions to two problems: automatic selection of the optimal kernel and reliable estimation of the hyperparameters. We propose a fixed composition of kernels, which contains the components needed to model most time series: linear trend, periodic patterns, and other flexible kernel for modeling the non-linear trend. Not all components are necessary to model each time series; during training the unnecessary components are automatically made irrelevant via automatic relevance determination (ARD). We moreover assign priors to the hyperparameters, in order to keep the inference within a plausible range; we design such priors through an empirical Bayes approach. We present results on many time series of different types; our GP model is more accurate than state-of-the-art time series models. Thanks to the priors, a single restart is enough the estimate the hyperparameters; hence the model is also fast to train.
    IGANI: Iterative Generative Adversarial Networks for Imputation with Application to Traffic Data. (arXiv:2008.04847v3 [stat.ML] UPDATED)
    (2 min) Increasing use of sensor data in intelligent transportation systems calls for accurate imputation algorithms that can enable reliable traffic management in the occasional absence of data. As one of the effective imputation approaches, generative adversarial networks (GANs) are implicit generative models that can be used for data imputation, which is formulated as an unsupervised learning problem. This work introduces a novel iterative GAN architecture, called Iterative Generative Adversarial Networks for Imputation (IGANI), for data imputation. IGANI imputes data in two steps and maintains the invertibility of the generative imputer, which will be shown to be a sufficient condition for the convergence of the proposed GAN-based imputation. The performance of our proposed method is evaluated on (1) the imputation of traffic speed data collected in the city of Guangzhou in China, and the training of short-term traffic prediction models using imputed data, and (2) the imputation of multi-variable traffic data of highways in Portland-Vancouver metropolitan region which includes volume, occupancy, and speed with different missing rates for each of them. It is shown that our proposed algorithm mostly produces more accurate results compared to those of previous GAN-based imputation architectures.
    Fast and Robust Online Inference with Stochastic Gradient Descent via Random Scaling. (arXiv:2106.03156v2 [stat.ML] UPDATED)
    (2 min) We develop a new method of online inference for a vector of parameters estimated by the Polyak-Ruppert averaging procedure of stochastic gradient descent (SGD) algorithms. We leverage insights from time series regression in econometrics and construct asymptotically pivotal statistics via random scaling. Our approach is fully operational with online data and is rigorously underpinned by a functional central limit theorem. Our proposed inference method has a couple of key advantages over the existing methods. First, the test statistic is computed in an online fashion with only SGD iterates and the critical values can be obtained without any resampling methods, thereby allowing for efficient implementation suitable for massive online data. Second, there is no need to estimate the asymptotic variance and our inference method is shown to be robust to changes in the tuning parameters for SGD algorithms in simulation experiments with synthetic data.
    Multiplicative Reweighting for Robust Neural Network Optimization. (arXiv:2102.12192v2 [cs.LG] UPDATED)
    (2 min) Deep neural networks are widespread due to their powerful performance. Yet, they suffer from degraded performance in the presence of noisy labels at train time or adversarial examples during inference. Inspired by the setting of learning with expert advice, where multiplicative weights (MW) updates were recently shown to be robust to moderate adversarial corruptions, we propose to use MW for reweighting examples during neural networks optimization. We establish the convergence of our method when used with gradient descent and show its advantage in two simple examples. We then validate empirically our findings by demonstrating that MW improve networks accuracy in the presence of label noise on CIFAR-10, CIFAR-100 and Clothing1M, and leads to better robustness to adversarial attacks.
    Seeing is Knowing! Fact-based Visual Question Answering using Knowledge Graph Embeddings. (arXiv:2012.15484v2 [cs.CL] UPDATED)
    (2 min) Fact-based Visual Question Answering (FVQA), a challenging variant of VQA, requires a QA-system to include facts from a diverse knowledge graph (KG) in its reasoning process to produce an answer. Large KGs, especially common-sense KGs, are known to be incomplete, i.e., not all non-existent facts are always incorrect. Therefore, being able to reason over incomplete KGs for QA is a critical requirement in real-world applications that has not been addressed extensively in the literature. We develop a novel QA architecture that allows us to reason over incomplete KGs, something current FVQA state-of-the-art (SOTA) approaches lack due to their critical reliance on fact retrieval. We use KG Embeddings, a technique widely used for KG completion, for the downstream task of FVQA. We also employ a new image representation technique we call 'Image-as-Knowledge' to enable this capability, alongside a simple one-step CoAttention mechanism to attend to text and image during QA. Our FVQA architecture is faster during inference time, being O(m), as opposed to existing FVQA SOTA methods which are O(N log N), where m = number of vertices, N = number of edges = O(m^2). KG embeddings are shown to hold complementary information to word embeddings: a combination of both metrics permits performance comparable to SOTA methods in the standard answer retrieval task, and significantly better (26% absolute) in the proposed missing-edge reasoning task.
    CD-SGD: Distributed Stochastic Gradient Descent with Compression and Delay Compensation. (arXiv:2106.10796v1 [cs.LG])
    (2 min) Communication overhead is the key challenge for distributed training. Gradient compression is a widely used approach to reduce communication traffic. When combining with parallel communication mechanism method like pipeline, gradient compression technique can greatly alleviate the impact of communication overhead. However, there exists two problems of gradient compression technique to be solved. Firstly, gradient compression brings in extra computation cost, which will delay the next training iteration. Secondly, gradient compression usually leads to the decrease of convergence accuracy.
    Hybrid approach to detecting symptoms of depression in social media entries. (arXiv:2106.10485v1 [cs.CL])
    (2 min) Sentiment and lexical analyses are widely used to detect depression or anxiety disorders. It has been documented that there are significant differences in the language used by a person with emotional disorders in comparison to a healthy individual. Still, the effectiveness of these lexical approaches could be improved further because the current analysis focuses on what the social media entries are about, and not how they are written. In this study, we focus on aspects in which these short texts are similar to each other, and how they were created. We present an innovative approach to the depression screening problem by applying Collgram analysis, which is a known effective method of obtaining linguistic information from texts. We compare these results with sentiment analysis based on the BERT architecture. Finally, we create a hybrid model achieving a diagnostic accuracy of 71%.
    MTC: Multiresolution Tensor Completion from Partial and Coarse Observations. (arXiv:2106.07135v2 [math.NA] UPDATED)
    (2 min) Existing tensor completion formulation mostly relies on partial observations from a single tensor. However, tensors extracted from real-world data are often more complex due to: (i) Partial observation: Only a small subset (e.g., 5%) of tensor elements are available. (ii) Coarse observation: Some tensor modes only present coarse and aggregated patterns (e.g., monthly summary instead of daily reports). In this paper, we are given a subset of the tensor and some aggregated/coarse observations (along one or more modes) and seek to recover the original fine-granular tensor with low-rank factorization. We formulate a coupled tensor completion problem and propose an efficient Multi-resolution Tensor Completion model (MTC) to solve the problem. Our MTC model explores tensor mode properties and leverages the hierarchy of resolutions to recursively initialize an optimization setup, and optimizes on the coupled system using alternating least squares. MTC ensures low computational and space complexity. We evaluate our model on two COVID-19 related spatio-temporal tensors. The experiments show that MTC could provide 65.20% and 75.79% percentage of fitness (PoF) in tensor completion with only 5% fine granular observations, which is 27.96% relative improvement over the best baseline. To evaluate the learned low-rank factors, we also design a tensor prediction task for daily and cumulative disease case predictions, where MTC achieves 50% in PoF and 30% relative improvements over the best baseline.
    Learning Space Partitions for Path Planning. (arXiv:2106.10544v1 [cs.AI])
    (2 min) Path planning, the problem of efficiently discovering high-reward trajectories, often requires optimizing a high-dimensional and multimodal reward function. Popular approaches like CEM and CMA-ES greedily focus on promising regions of the search space and may get trapped in local maxima. DOO and VOOT balance exploration and exploitation, but use space partitioning strategies independent of the reward function to be optimized. Recently, LaMCTS empirically learns to partition the search space in a reward-sensitive manner for black-box optimization. In this paper, we develop a novel formal regret analysis for when and why such an adaptive region partitioning scheme works. We also propose a new path planning method PlaLaM which improves the function value estimation within each sub-region, and uses a latent representation of the search space. Empirically, PlaLaM outperforms existing path planning methods in 2D navigation tasks, especially in the presence of difficult-to-escape local optima, and shows benefits when plugged into model-based RL with planning components such as PETS. These gains transfer to highly multimodal real-world tasks, where we outperform strong baselines in compiler phase ordering by up to 245% and in molecular design by up to 0.4 on properties on a 0-1 scale.
    FedXGBoost: Privacy-Preserving XGBoost for Federated Learning. (arXiv:2106.10662v1 [cs.LG])
    (2 min) Federated learning is the distributed machine learning framework that enables collaborative training across multiple parties while ensuring data privacy. Practical adaptation of XGBoost, the state-of-the-art tree boosting framework, to federated learning remains limited due to high cost incurred by conventional privacy-preserving methods. To address the problem, we propose two variants of federated XGBoost with privacy guarantee: FedXGBoost-SMM and FedXGBoost-LDP. Our first protocol FedXGBoost-SMM deploys enhanced secure matrix multiplication method to preserve privacy with lossless accuracy and lower overhead than encryption-based techniques. Developed independently, the second protocol FedXGBoost-LDP is heuristically designed with noise perturbation for local differential privacy, and empirically evaluated on real-world and synthetic datasets.
    Deep Learning for Functional Data Analysis with Adaptive Basis Layers. (arXiv:2106.10414v1 [stat.ML])
    (2 min) Despite their widespread success, the application of deep neural networks to functional data remains scarce today. The infinite dimensionality of functional data means standard learning algorithms can be applied only after appropriate dimension reduction, typically achieved via basis expansions. Currently, these bases are chosen a priori without the information for the task at hand and thus may not be effective for the designated task. We instead propose to adaptively learn these bases in an end-to-end fashion. We introduce neural networks that employ a new Basis Layer whose hidden units are each basis functions themselves implemented as a micro neural network. Our architecture learns to apply parsimonious dimension reduction to functional inputs that focuses only on information relevant to the target rather than irrelevant variation in the input function. Across numerous classification/regression tasks with functional data, our method empirically outperforms other types of neural networks, and we prove that our approach is statistically consistent with low generalization error. Code is available at: \url{https://github.com/jwyyy/AdaFNN}.
    The Perils of Learning Before Optimizing. (arXiv:2106.10349v1 [cs.LG])
    (2 min) Formulating real-world optimization problems often begins with making predictions from historical data (e.g., an optimizer that aims to recommend fast routes relies upon travel-time predictions). Typically, learning the prediction model used to generate the optimization problem and solving that problem are performed in two separate stages. Recent work has showed how such prediction models can be learned end-to-end by differentiating through the optimization task. Such methods often yield empirical improvements, which are typically attributed to end-to-end making better error tradeoffs than the standard loss function used in a two-stage solution. We refine this explanation and more precisely characterize when end-to-end can improve performance. When prediction targets are stochastic, a two-stage solution must make an a priori choice about which statistics of the target distribution to model -- we consider expectations over prediction targets -- while an end-to-end solution can make this choice adaptively. We show that the performance gap between a two-stage and end-to-end approach is closely related to the \emph{price of correlation} concept in stochastic optimization and show the implications of some existing POC results for our predict-then-optimize problem. We then consider a novel and particularly practical setting, where coefficients in the objective function depend on multiple prediction targets. We give explicit constructions where (1) two-stage performs unboundedly worse than end-to-end; and (2) two-stage is optimal. We identify a large set of real-world applications whose objective functions rely on multiple prediction targets but which nevertheless deploy two-stage solutions. We also use simulations to experimentally quantify performance gaps.
    Informative Class Activation Maps. (arXiv:2106.10472v1 [cs.CV])
    (2 min) We study how to evaluate the quantitative information content of a region within an image for a particular label. To this end, we bridge class activation maps with information theory. We develop an informative class activation map (infoCAM). Given a classification task, infoCAM depict how to accumulate information of partial regions to that of the entire image toward a label. Thus, we can utilise infoCAM to locate the most informative features for a label. When applied to an image classification task, infoCAM performs better than the traditional classification map in the weakly supervised object localisation task. We achieve state-of-the-art results on Tiny-ImageNet.
    Learning the Preferences of Uncertain Humans with Inverse Decision Theory. (arXiv:2106.10394v1 [stat.ML])
    (2 min) Existing observational approaches for learning human preferences, such as inverse reinforcement learning, usually make strong assumptions about the observability of the human's environment. However, in reality, people make many important decisions under uncertainty. To better understand preference learning in these cases, we study the setting of inverse decision theory (IDT), a previously proposed framework where a human is observed making non-sequential binary decisions under uncertainty. In IDT, the human's preferences are conveyed through their loss function, which expresses a tradeoff between different types of mistakes. We give the first statistical analysis of IDT, providing conditions necessary to identify these preferences and characterizing the sample complexity -- the number of decisions that must be observed to learn the tradeoff the human is making to a desired precision. Interestingly, we show that it is actually easier to identify preferences when the decision problem is more uncertain. Furthermore, uncertain decision problems allow us to relax the unrealistic assumption that the human is an optimal decision maker but still identify their exact preferences; we give sample complexities in this suboptimal case as well. Our analysis contradicts the intuition that partial observability should make preference learning more difficult. It also provides a first step towards understanding and improving preference learning methods for uncertain and suboptimal humans.
    Scenic4RL: Programmatic Modeling and Generation of Reinforcement Learning Environments. (arXiv:2106.10365v1 [cs.LG])
    (2 min) The capability of reinforcement learning (RL) agent directly depends on the diversity of learning scenarios the environment generates and how closely it captures real-world situations. However, existing environments/simulators lack the support to systematically model distributions over initial states and transition dynamics. Furthermore, in complex domains such as soccer, the space of possible scenarios is infinite, which makes it impossible for one research group to provide a comprehensive set of scenarios to train, test, and benchmark RL algorithms. To address this issue, for the first time, we adopt an existing formal scenario specification language, SCENIC, to intuitively model and generate interactive scenarios. We interfaced SCENIC to Google Research Soccer environment to create a platform called SCENIC4RL. Using this platform, we provide a dataset consisting of 36 scenario programs encoded in SCENIC and demonstration data generated from a subset of them. We share our experimental results to show the effectiveness of our dataset and the platform to train, test, and benchmark RL algorithms. More importantly, we open-source our platform to enable RL community to collectively contribute to constructing a comprehensive set of scenarios.
    The Animal ID Problem: Continual Curation. (arXiv:2106.10377v1 [cs.CV])
    (2 min) Hoping to stimulate new research in individual animal identification from images, we propose to formulate the problem as the human-machine Continual Curation of images and animal identities. This is an open world recognition problem, where most new animals enter the system after its algorithms are initially trained and deployed. Continual Curation, as defined here, requires (1) an improvement in the effectiveness of current recognition methods, (2) a pairwise verification algorithm that allows the possibility of no decision, and (3) an algorithmic decision mechanism that seeks human input to guide the curation process. Error metrics must evaluate the ability of recognition algorithms to identify not only animals that have been seen just once or twice but also recognize new animals not in the database. An important measure of overall system performance is accuracy as a function of the amount of human input required.
    How COVID-19 Have Changed Crowdfunding: Evidence From GoFundMe. (arXiv:2106.09981v1 [cs.CY] CROSS LISTED)
    (2 min) While the long-term effects of COVID-19 are yet to be determined, its immediate impact on crowdfunding is nonetheless significant. This study takes a computational approach to more deeply comprehend this change. Using a unique data set of all the campaigns published over the past two years on GoFundMe, we explore the factors that have led to the successful funding of a crowdfunding project. In particular, we study a corpus of crowdfunded projects, analyzing cover images and other variables commonly present on crowdfunding sites. Furthermore, we construct a classifier and a regression model to assess the significance of features based on XGBoost. In addition, we employ counterfactual analysis to investigate the causality between features and the success of crowdfunding. More importantly, sentiment analysis and the paired sample t-test are performed to examine the differences in crowdfunding campaigns before and after the COVID-19 outbreak that started in March 2020. First, we note that there is significant racial disparity in crowdfunding success. Second, we find that sad emotion expressed through the campaign's description became significant after the COVID-19 outbreak. Considering all these factors, our findings shed light on the impact of COVID-19 on crowdfunding campaigns.
    S2-BNN: Bridging the Gap Between Self-Supervised Real and 1-bit Neural Networks via Guided Distribution Calibration. (arXiv:2102.08946v2 [cs.CV] UPDATED)
    (2 min) Previous studies dominantly target at self-supervised learning on real-valued networks and have achieved many promising results. However, on the more challenging binary neural networks (BNNs), this task has not yet been fully explored in the community. In this paper, we focus on this more difficult scenario: learning networks where both weights and activations are binary, meanwhile, without any human annotated labels. We observe that the commonly used contrastive objective is not satisfying on BNNs for competitive accuracy, since the backbone network contains relatively limited capacity and representation ability. Hence instead of directly applying existing self-supervised methods, which cause a severe decline in performance, we present a novel guided learning paradigm from real-valued to distill binary networks on the final prediction distribution, to minimize the loss and obtain desirable accuracy. Our proposed method can boost the simple contrastive learning baseline by an absolute gain of 5.5~15% on BNNs. We further reveal that it is difficult for BNNs to recover the similar predictive distributions as real-valued models when training without labels. Thus, how to calibrate them is key to address the degradation in performance. Extensive experiments are conducted on the large-scale ImageNet and downstream datasets. Our method achieves substantial improvement over the simple contrastive learning baseline, and is even comparable to many mainstream supervised BNN methods. Code is available at https://github.com/szq0214/S2-BNN.
    Uncertainty-Aware COVID-19 Detection from Imbalanced Sound Data. (arXiv:2104.02005v2 [cs.SD] UPDATED)
    (2 min) Recently, sound-based COVID-19 detection studies have shown great promise to achieve scalable and prompt digital pre-screening. However, there are still two unsolved issues hindering the practice. First, collected datasets for model training are often imbalanced, with a considerably smaller proportion of users tested positive, making it harder to learn representative and robust features. Second, deep learning models are generally overconfident in their predictions. Clinically, false predictions aggravate healthcare costs. Estimation of the uncertainty of screening would aid this. To handle these issues, we propose an ensemble framework where multiple deep learning models for sound-based COVID-19 detection are developed from different but balanced subsets from original data. As such, data are utilized more effectively compared to traditional up-sampling and down-sampling approaches: an AUC of 0.74 with a sensitivity of 0.68 and a specificity of 0.69 is achieved. Simultaneously, we estimate uncertainty from the disagreement across multiple models. It is shown that false predictions often yield higher uncertainty, enabling us to suggest the users with certainty higher than a threshold to repeat the audio test on their phones or to take clinical tests if digital diagnosis still fails. This study paves the way for a more robust sound-based COVID-19 automated screening system.
    Do Input Gradients Highlight Discriminative Features?. (arXiv:2102.12781v2 [cs.LG] UPDATED)
    (2 min) Post-hoc gradient-based interpretability methods [Simonyan et al., 2013, Smilkov et al., 2017] that provide instance-specific explanations of model predictions are often based on assumption (A): magnitude of input gradients -- gradients of logits with respect to input -- noisily highlight discriminative task-relevant features. In this work, we test the validity of assumption (A) using a three-pronged approach. First, we develop an evaluation framework, DiffROAR, to test assumption (A) on four image classification benchmarks. Our results suggest that (i) input gradients of standard models (i.e., trained on original data) may grossly violate (A), whereas (ii) input gradients of adversarially robust models satisfy (A). Second, we then introduce BlockMNIST, an MNIST-based semi-real dataset, that by design encodes a priori knowledge of discriminative features. Our analysis on BlockMNIST leverages this information to validate as well as characterize differences between input gradient attributions of standard and robust models. Finally, we theoretically prove that our empirical findings hold on a simplified version of the BlockMNIST dataset. Specifically, we prove that input gradients of standard one-hidden-layer MLPs trained on this dataset do not highlight instance-specific signal coordinates, thus grossly violating assumption (A). Our findings motivate the need to formalize and test common assumptions in interpretability in a falsifiable manner [Leavitt and Morcos, 2020]. Additionally, we believe that the DiffROAR evaluation framework and BlockMNIST-based datasets can serve as sanity checks to audit instance-specific interpretability methods.
    A cappella: Audio-visual Singing Voice Separation. (arXiv:2104.09946v2 [cs.SD] UPDATED)
    (2 min) Music source separation can be interpreted as the estimation of the constituent music sources that a music clip is composed of. In this work, we explore the single-channel singing voice separation problem from a multimodal perspective, by jointly learning from audio and visual modalities. To do so, we present Acappella, a dataset spanning around 46 hours of a cappella solo singing videos sourced from YouTube. We propose Y-Net, an audio-visual convolutional neural network which achieves state-of-the-art singing voice separation results on the Acappella dataset and compare it against its audio-only counterpart, U-Net, and a state-of-the-art audio-visual speech separation model. Singing voice separation can be particularly challenging when the audio mixture also comprises of other accompaniment voices and background sounds along with the target voice of interest. We demonstrate that our model can outperform the baseline models in the singing voice separation task in such challenging scenarios. The code, the pre-trained models and the dataset will be publicly available at https://ipcv.github.io/Acappella/
    Improving Dialog Systems for Negotiation with Personality Modeling. (arXiv:2010.09954v2 [cs.CL] UPDATED)
    (2 min) In this paper, we explore the ability to model and infer personality types of opponents, predict their responses, and use this information to adapt a dialog agent's high-level strategy in negotiation tasks. Inspired by the idea of incorporating a theory of mind (ToM) into machines, we introduce a probabilistic formulation to encapsulate the opponent's personality type during both learning and inference. We test our approach on the CraigslistBargain dataset and show that our method using ToM inference achieves a 20% higher dialog agreement rate compared to baselines on a mixed population of opponents. We also find that our model displays diverse negotiation behavior with different types of opponents.
    Memory compression and thermal efficiency of quantum implementations of non-deterministic hidden Markov models. (arXiv:2105.06285v2 [quant-ph] UPDATED)
    (2 min) Stochastic modelling is an essential component of the quantitative sciences, with hidden Markov models (HMMs) often playing a central role. Concurrently, the rise of quantum technologies promises a host of advantages in computational problems, typically in terms of the scaling of requisite resources such as time and memory. HMMs are no exception to this, with recent results highlighting quantum implementations of deterministic HMMs exhibiting superior memory and thermal efficiency relative to their classical counterparts. In many contexts however, non-deterministic HMMs are viable alternatives; compared to them the advantages of current quantum implementations do not always hold. Here, we provide a systematic prescription for constructing quantum implementations of non-deterministic HMMs that re-establish the quantum advantages against this broader class. Crucially, we show that whenever the classical implementation suffers from thermal dissipation due to its need to process information in a time-local manner, our quantum implementations will both mitigate some of this dissipation, and achieve an advantage in memory compression.
    Learning to Represent Action Values as a Hypergraph on the Action Vertices. (arXiv:2010.14680v2 [cs.LG] UPDATED)
    (2 min) Action-value estimation is a critical component of many reinforcement learning (RL) methods whereby sample complexity relies heavily on how fast a good estimator for action value can be learned. By viewing this problem through the lens of representation learning, good representations of both state and action can facilitate action-value estimation. While advances in deep learning have seamlessly driven progress in learning state representations, given the specificity of the notion of agency to RL, little attention has been paid to learning action representations. We conjecture that leveraging the combinatorial structure of multi-dimensional action spaces is a key ingredient for learning good representations of action. To test this, we set forth the action hypergraph networks framework -- a class of functions for learning action representations in multi-dimensional discrete action spaces with a structural inductive bias. Using this framework we realise an agent class based on a combination with deep Q-networks, which we dub hypergraph Q-networks. We show the effectiveness of our approach on a myriad of domains: illustrative prediction problems under minimal confounding effects, Atari 2600 games, and discretised physical control benchmarks.
    Greedy Hierarchical Variational Autoencoders for Large-Scale Video Prediction. (arXiv:2103.04174v3 [cs.CV] UPDATED)
    (2 min) A video prediction model that generalizes to diverse scenes would enable intelligent agents such as robots to perform a variety of tasks via planning with the model. However, while existing video prediction models have produced promising results on small datasets, they suffer from severe underfitting when trained on large and diverse datasets. To address this underfitting challenge, we first observe that the ability to train larger video prediction models is often bottlenecked by the memory constraints of GPUs or TPUs. In parallel, deep hierarchical latent variable models can produce higher quality predictions by capturing the multi-level stochasticity of future observations, but end-to-end optimization of such models is notably difficult. Our key insight is that greedy and modular optimization of hierarchical autoencoders can simultaneously address both the memory constraints and the optimization challenges of large-scale video prediction. We introduce Greedy Hierarchical Variational Autoencoders (GHVAEs), a method that learns high-fidelity video predictions by greedily training each level of a hierarchical autoencoder. In comparison to state-of-the-art models, GHVAEs provide 17-55% gains in prediction performance on four video datasets, a 35-40% higher success rate on real robot tasks, and can improve performance monotonically by simply adding more modules.
    A Review of Autonomous Road Vehicle Integrated Approaches to an Emergency Obstacle Avoidance Maneuver. (arXiv:2105.09446v3 [cs.RO] UPDATED)
    (2 min) As passenger vehicle technologies have advanced, so have their capabilities to avoid obstacles, especially with developments in tires, suspensions, steering, as well as safety technologies like ABS, ESC, and more recently, ADAS systems. However, environments around passenger vehicles have also become more complex, and dangerous. There have previously been studies that outline driver tendencies and performance capabilities when attempting to avoid obstacles while driving passenger vehicles. Now that autonomous vehicles are being developed with obstacle avoidance capabilities, it is important to target performance that meets or exceeds that of human drivers. This manuscript highlights systems that are crucial for an emergency obstacle avoidance maneuver (EOAM) and identifies the state-of-the-art for each of the related systems, while considering the nuances of traveling at highway speeds. Some of the primary EOAM-related systems/areas that are discussed in this review are: general path planning methods, system hierarchies, decision-making, trajectory generation, and trajectory-tracking control methods. After concluding remarks, suggestions for future work which could lead to an ideal EOAM development, are discussed.
    Neural Rough Differential Equations for Long Time Series. (arXiv:2009.08295v4 [cs.LG] UPDATED)
    (2 min) Neural controlled differential equations (CDEs) are the continuous-time analogue of recurrent neural networks, as Neural ODEs are to residual networks, and offer a memory-efficient continuous-time way to model functions of potentially irregular time series. Existing methods for computing the forward pass of a Neural CDE involve embedding the incoming time series into path space, often via interpolation, and using evaluations of this path to drive the hidden state. Here, we use rough path theory to extend this formulation. Instead of directly embedding into path space, we instead represent the input signal over small time intervals through its \textit{log-signature}, which are statistics describing how the signal drives a CDE. This is the approach for solving \textit{rough differential equations} (RDEs), and correspondingly we describe our main contribution as the introduction of Neural RDEs. This extension has a purpose: by generalising the Neural CDE approach to a broader class of driving signals, we demonstrate particular advantages for tackling long time series. In this regime, we demonstrate efficacy on problems of length up to 17k observations and observe significant training speed-ups, improvements in model performance, and reduced memory requirements compared to existing approaches.
    Honey, I Shrunk The Actor: A Case Study on Preserving Performance with Smaller Actors in Actor-Critic RL. (arXiv:2102.11893v3 [cs.LG] UPDATED)
    (2 min) Actors and critics in actor-critic reinforcement learning algorithms are functionally separate, yet they often use the same network architectures. This case study explores the performance impact of network sizes when considering actor and critic architectures independently. By relaxing the assumption of architectural symmetry, it is often possible for smaller actors to achieve comparable policy performance to their symmetric counterparts. Our experiments show up to 99% reduction in the number of network weights with an average reduction of 77% over multiple actor-critic algorithms on 9 independent tasks. Given that reducing actor complexity results in a direct reduction of run-time inference cost, we believe configurations of actors and critics are aspects of actor-critic design that deserve to be considered independently, particularly in resource-constrained applications or when deploying multiple actors simultaneously.
    Teacher's pet: understanding and mitigating biases in distillation. (arXiv:2106.10494v1 [cs.LG])
    (2 min) Knowledge distillation is widely used as a means of improving the performance of a relatively simple student model using the predictions from a complex teacher model. Several works have shown that distillation significantly boosts the student's overall performance; however, are these gains uniform across all data subgroups? In this paper, we show that distillation can harm performance on certain subgroups, e.g., classes with few associated samples. We trace this behaviour to errors made by the teacher distribution being transferred to and amplified by the student model. To mitigate this problem, we present techniques which soften the teacher influence for subgroups where it is less reliable. Experiments on several image classification benchmarks show that these modifications of distillation maintain boost in overall accuracy, while additionally ensuring improvement in subgroup performance.
    Deep Generative Learning via Schr\"{o}dinger Bridge. (arXiv:2106.10410v1 [cs.LG])
    (2 min) We propose to learn a generative model via entropy interpolation with a Schr\"{o}dinger Bridge. The generative learning task can be formulated as interpolating between a reference distribution and a target distribution based on the Kullback-Leibler divergence. At the population level, this entropy interpolation is characterized via an SDE on $[0,1]$ with a time-varying drift term. At the sample level, we derive our Schr\"{o}dinger Bridge algorithm by plugging the drift term estimated by a deep score estimator and a deep density ratio estimator into the Euler-Maruyama method. Under some mild smoothness assumptions of the target distribution, we prove the consistency of both the score estimator and the density ratio estimator, and then establish the consistency of the proposed Schr\"{o}dinger Bridge approach. Our theoretical results guarantee that the distribution learned by our approach converges to the target distribution. Experimental results on multimodal synthetic data and benchmark data support our theoretical findings and indicate that the generative model via Schr\"{o}dinger Bridge is comparable with state-of-the-art GANs, suggesting a new formulation of generative learning. We demonstrate its usefulness in image interpolation and image inpainting.
    Robust M-estimation-based Tensor Ring Completion: a Half-quadratic Minimization Approach. (arXiv:2106.10422v1 [cs.LG])
    (2 min) Tensor completion is the problem of estimating the missing values of high-order data from partially observed entries. Among several definitions of tensor rank, tensor ring rank affords the flexibility and accuracy needed to model tensors of different orders, which motivated recent efforts on tensor-ring completion. However, data corruption due to prevailing outliers poses major challenges to existing algorithms. In this paper, we develop a robust approach to tensor ring completion that uses an M-estimator as its error statistic, which can significantly alleviate the effect of outliers. Leveraging a half-quadratic (HQ) method, we reformulate the problem as one of weighted tensor completion. We present two HQ-based algorithms based on truncated singular value decomposition and matrix factorization along with their convergence and complexity analysis. Extendibility of the proposed approach to alternative definitions of tensor rank is also discussed. The experimental results demonstrate the superior performance of the proposed approach over state-of-the-art robust algorithms for tensor completion.
    MSN: Efficient Online Mask Selection Network for Video Instance Segmentation. (arXiv:2106.10452v1 [cs.CV])
    (2 min) In this work we present a novel solution for Video Instance Segmentation(VIS), that is automatically generating instance level segmentation masks along with object class and tracking them in a video. Our method improves the masks from segmentation and propagation branches in an online manner using the Mask Selection Network (MSN) hence limiting the noise accumulation during mask tracking. We propose an effective design of MSN by using patch-based convolutional neural network. The network is able to distinguish between very subtle differences between the masks and choose the better masks out of the associated masks accurately. Further, we make use of temporal consistency and process the video sequences in both forward and reverse manner as a post processing step to recover lost objects. The proposed method can be used to adapt any video object segmentation method for the task of VIS. Our method achieves a score of 49.1 mAP on 2021 YouTube-VIS Challenge and was ranked third place among more than 30 global teams. Our code will be available at https://github.com/SHI-Labs/Mask-Selection-Networks.
    Compressing Deep ODE-Nets using Basis Function Expansions. (arXiv:2106.10820v1 [cs.LG])
    (2 min) The recently-introduced class of ordinary differential equation networks (ODE-Nets) establishes a fruitful connection between deep learning and dynamical systems. In this work, we reconsider formulations of the weights as continuous-depth functions using linear combinations of basis functions. This perspective allows us to compress the weights through a change of basis, without retraining, while maintaining near state-of-the-art performance. In turn, both inference time and the memory footprint are reduced, enabling quick and rigorous adaptation between computational environments. Furthermore, our framework enables meaningful continuous-in-time batch normalization layers using function projections. The performance of basis function compression is demonstrated by applying continuous-depth models to (a) image classification tasks using convolutional units and (b) sentence-tagging tasks using transformer encoder units.
    Why flatness does and does not correlate with generalization for deep neural networks. (arXiv:2103.06219v2 [cs.LG] UPDATED)
    (2 min) The intuition that local flatness of the loss landscape is correlated with better generalization for deep neural networks (DNNs) has been explored for decades, spawning many different flatness measures. Recently, this link with generalization has been called into question by a demonstration that many measures of flatness are vulnerable to parameter re-scaling which arbitrarily changes their value without changing neural network outputs. Here we show that, in addition, some popular variants of SGD such as Adam and Entropy-SGD, can also break the flatness-generalization correlation. As an alternative to flatness measures, we use a function based picture and propose using the log of Bayesian prior upon initialization, $\log P(f)$, as a predictor of the generalization when a DNN converges on function $f$ after training to zero error. The prior is directly proportional to the Bayesian posterior for functions that give zero error on a test set. For the case of image classification, we show that $\log P(f)$ is a significantly more robust predictor of generalization than flatness measures are. Whilst local flatness measures fail under parameter re-scaling, the prior/posterior, which is global quantity, remains invariant under re-scaling. Moreover, the correlation with generalization as a function of data complexity remains good for different variants of SGD.
    Demonstration of Panda: A Weakly Supervised Entity Matching System. (arXiv:2106.10821v1 [cs.DB])
    (2 min) Entity matching (EM) refers to the problem of identifying tuple pairs in one or more relations that refer to the same real world entities. Supervised machine learning (ML) approaches, and deep learning based approaches in particular, typically achieve state-of-the-art matching results. However, these approaches require many labeled examples, in the form of matching and non-matching pairs, which are expensive and time-consuming to label. In this paper, we introduce Panda, a weakly supervised system specifically designed for EM. Panda uses the same labeling function abstraction as Snorkel, where labeling functions (LF) are user-provided programs that can generate large amounts of (somewhat noisy) labels quickly and cheaply, which can then be combined via a labeling model to generate accurate final predictions. To support users developing LFs for EM, Panda provides an integrated development environment (IDE) that lives in a modern browser architecture. Panda's IDE facilitates the development, debugging, and life-cycle management of LFs in the context of EM tasks, similar to how IDEs such as Visual Studio or Eclipse excel in general-purpose programming. Panda's IDE includes many novel features purpose-built for EM, such as smart data sampling, a builtin library of EM utility functions, automatically generated LFs, visual debugging of LFs, and finally, an EM-specific labeling model. We show in this demo that Panda IDE can greatly accelerate the development of high-quality EM solutions using weak supervision.
    Prediction of the facial growth direction with Machine Learning methods. (arXiv:2106.10464v1 [cs.LG])
    (2 min) First attempts of prediction of the facial growth (FG) direction were made over half of a century ago. Despite numerous attempts and elapsed time, a satisfactory method has not been established yet and the problem still poses a challenge for medical experts. To our knowledge, this paper is the first Machine Learning approach to the prediction of FG direction. Conducted data analysis reveals the inherent complexity of the problem and explains the reasons of difficulty in FG direction prediction based on 2D X-ray images. To perform growth forecasting, we employ a wide range of algorithms, from logistic regression, through tree ensembles to neural networks and consider three, slightly different, problem formulations. The resulting classification accuracy varies between 71% and 75%.
    GLIB: Towards Automated Test Oracle for Graphically-Rich Applications. (arXiv:2106.10507v1 [cs.SE])
    (2 min) Graphically-rich applications such as games are ubiquitous with attractive visual effects of Graphical User Interface (GUI) that offers a bridge between software applications and end-users. However, various types of graphical glitches may arise from such GUI complexity and have become one of the main component of software compatibility issues. Our study on bug reports from game development teams in NetEase Inc. indicates that graphical glitches frequently occur during the GUI rendering and severely degrade the quality of graphically-rich applications such as video games. Existing automated testing techniques for such applications focus mainly on generating various GUI test sequences and check whether the test sequences can cause crashes. These techniques require constant human attention to captures non-crashing bugs such as bugs causing graphical glitches. In this paper, we present the first step in automating the test oracle for detecting non-crashing bugs in graphically-rich applications. Specifically, we propose \texttt{GLIB} based on a code-based data augmentation technique to detect game GUI glitches. We perform an evaluation of \texttt{GLIB} on 20 real-world game apps (with bug reports available) and the result shows that \texttt{GLIB} can achieve 100\% precision and 99.5\% recall in detecting non-crashing bugs such as game GUI glitches. Practical application of \texttt{GLIB} on another 14 real-world games (without bug reports) further demonstrates that \texttt{GLIB} can effectively uncover GUI glitches, with 48 of 53 bugs reported by \texttt{GLIB} having been confirmed and fixed so far.
    Defending against Backdoor Attack on Deep Neural Networks. (arXiv:2002.12162v2 [cs.CR] UPDATED)
    (2 min) Although deep neural networks (DNNs) have achieved a great success in various computer vision tasks, it is recently found that they are vulnerable to adversarial attacks. In this paper, we focus on the so-called \textit{backdoor attack}, which injects a backdoor trigger to a small portion of training data (also known as data poisoning) such that the trained DNN induces misclassification while facing examples with this trigger. To be specific, we carefully study the effect of both real and synthetic backdoor attacks on the internal response of vanilla and backdoored DNNs through the lens of Gard-CAM. Moreover, we show that the backdoor attack induces a significant bias in neuron activation in terms of the $\ell_\infty$ norm of an activation map compared to its $\ell_1$ and $\ell_2$ norm. Spurred by our results, we propose the \textit{$\ell_\infty$-based neuron pruning} to remove the backdoor from the backdoored DNN. Experiments show that our method could effectively decrease the attack success rate, and also hold a high classification accuracy for clean images.
    Low-rank Characteristic Tensor Density Estimation Part II: Compression and Latent Density Estimation. (arXiv:2106.10591v1 [stat.ML])
    (2 min) Learning generative probabilistic models is a core problem in machine learning, which presents significant challenges due to the curse of dimensionality. This paper proposes a joint dimensionality reduction and non-parametric density estimation framework, using a novel estimator that can explicitly capture the underlying distribution of appropriate reduced-dimension representations of the input data. The idea is to jointly design a nonlinear dimensionality reducing auto-encoder to model the training data in terms of a parsimonious set of latent random variables, and learn a canonical low-rank tensor model of the joint distribution of the latent variables in the Fourier domain. The proposed latent density model is non-parametric and universal, as opposed to the predefined prior that is assumed in variational auto-encoders. Joint optimization of the auto-encoder and the latent density estimator is pursued via a formulation which learns both by minimizing a combination of the negative log-likelihood in the latent domain and the auto-encoder reconstruction loss. We demonstrate that the proposed model achieves very promising results on toy, tabular, and image datasets on regression tasks, sampling, and anomaly detection.
    On Characterizing GAN Convergence Through Proximal Duality Gap. (arXiv:2105.04801v2 [cs.LG] UPDATED)
    (2 min) Despite the accomplishments of Generative Adversarial Networks (GANs) in modeling data distributions, training them remains a challenging task. A contributing factor to this difficulty is the non-intuitive nature of the GAN loss curves, which necessitates a subjective evaluation of the generated output to infer training progress. Recently, motivated by game theory, duality gap has been proposed as a domain agnostic measure to monitor GAN training. However, it is restricted to the setting when the GAN converges to a Nash equilibrium. But GANs need not always converge to a Nash equilibrium to model the data distribution. In this work, we extend the notion of duality gap to proximal duality gap that is applicable to the general context of training GANs where Nash equilibria may not exist. We show theoretically that the proximal duality gap is capable of monitoring the convergence of GANs to a wider spectrum of equilibria that subsumes Nash equilibria. We also theoretically establish the relationship between the proximal duality gap and the divergence between the real and generated data distributions for different GAN formulations. Our results provide new insights into the nature of GAN convergence. Finally, we validate experimentally the usefulness of proximal duality gap for monitoring and influencing GAN training.
    Proper Value Equivalence. (arXiv:2106.10316v1 [cs.AI])
    (2 min) One of the main challenges in model-based reinforcement learning (RL) is to decide which aspects of the environment should be modeled. The value-equivalence (VE) principle proposes a simple answer to this question: a model should capture the aspects of the environment that are relevant for value-based planning. Technically, VE distinguishes models based on a set of policies and a set of functions: a model is said to be VE to the environment if the Bellman operators it induces for the policies yield the correct result when applied to the functions. As the number of policies and functions increase, the set of VE models shrinks, eventually collapsing to a single point corresponding to a perfect model. A fundamental question underlying the VE principle is thus how to select the smallest sets of policies and functions that are sufficient for planning. In this paper we take an important step towards answering this question. We start by generalizing the concept of VE to order-$k$ counterparts defined with respect to $k$ applications of the Bellman operator. This leads to a family of VE classes that increase in size as $k \rightarrow \infty$. In the limit, all functions become value functions, and we have a special instantiation of VE which we call proper VE or simply PVE. Unlike VE, the PVE class may contain multiple models even in the limit when all value functions are used. Crucially, all these models are sufficient for planning, meaning that they will yield an optimal policy despite the fact that they may ignore many aspects of the environment. We construct a loss function for learning PVE models and argue that popular algorithms such as MuZero and Muesli can be understood as minimizing an upper bound for this loss. We leverage this connection to propose a modification to MuZero and show that it can lead to improved performance in practice.
    Learning on a Budget via Teacher Imitation. (arXiv:2104.08440v2 [cs.LG] UPDATED)
    (2 min) Deep Reinforcement Learning (RL) techniques can benefit greatly from leveraging prior experience, which can be either self-generated or acquired from other entities. Action advising is a framework that provides a flexible way to transfer such knowledge in the form of actions between teacher-student peers. However, due to the realistic concerns, the number of these interactions is limited with a budget; therefore, it is crucial to perform these in the most appropriate moments. There have been several promising studies recently that address this problem setting especially from the student's perspective. Despite their success, they have some shortcomings when it comes to the practical applicability and integrity as an overall solution to the learning from advice challenge. In this paper, we extend the idea of advice reusing via teacher imitation to construct a unified approach that addresses both advice collection and advice utilisation problems. We also propose a method to automatically tune the relevant hyperparameters of these components on-the-fly to make it able to adapt to any task with minimal human intervention. The experiments we performed in $5$ different Atari games verify that our algorithm either surpasses or performs on-par with its top competitors while being far simpler to be employed. Furthermore, its individual components are also found to be providing significant advantages alone.
    Neural network interpretability for forecasting of aggregated renewable generatiion. (arXiv:2106.10476v1 [cs.LG])
    (2 min) With the rapid growth of renewable energy, lots of small photovoltaic (PV) prosumers emerge. Due to the uncertainty of solar power generation, there is a need for aggregated prosumers to predict solar power generation and whether solar power generation will be larger than load. This paper presents two interpretable neural networks to solve the problem: one binary classification neural network and one regression neural network. The neural networks are built using TensorFlow. The global feature importance and local feature contributions are examined by three gradient-based methods: Integrated Gradients, Expected Gradients, and DeepLIFT. Moreover, we detect abnormal cases when predictions might fail by estimating the prediction uncertainty using Bayesian neural networks. Neural networks, which are interpreted by gradient-based methods and complemented with uncertainty estimation, provide robust and explainable forecasting for decision-makers.
    Forecasting the Olympic medal distribution during a pandemic: a socio-economic machine learning model. (arXiv:2012.04378v2 [cs.LG] UPDATED)
    (2 min) Forecasting the number of Olympic medals for each nation is highly relevant for different stakeholders: Ex ante, sports betting companies can determine the odds while sponsors and media companies can allocate their resources to promising teams. Ex post, sports politicians and managers can benchmark the performance of their teams and evaluate the drivers of success. To significantly increase the Olympic medal forecasting accuracy, we apply machine learning, more specifically a two-staged Random Forest, thus outperforming more traditional na\"ive forecast for three previous Olympics held between 2008 and 2016 for the first time. Regarding the Tokyo 2020 Games in 2021, our model suggests that the United States will lead the Olympic medal table, winning 120 medals, followed by China (87) and Great Britain (74). Intriguingly, we predict that the current COVID-19 pandemic will not significantly alter the medal count as all countries suffer from the pandemic to some extent (data inherent) and limited historical data points on comparable diseases (model inherent).
    Leveraging directed causal discovery to detect latent common causes. (arXiv:1910.10174v3 [stat.ML] UPDATED)
    (2 min) The discovery of causal relationships is a fundamental problem in science and medicine. In recent years, many elegant approaches to discovering causal relationships between two variables from observational data have been proposed. However, most of these deal only with purely directed causal relationships and cannot detect latent common causes. Here, we devise a general heuristic which takes a causal discovery algorithm that can only distinguish purely directed causal relations and modifies it to also detect latent common causes. We apply our method to two directed causal discovery algorithms, the Information Geometric Causal Inference of (Daniusis et al., 2010) and the Kernel Conditional Deviance for Causal Inference of (Mitrovic, Sejdinovic, & Teh, 2018), and extensively test on synthetic data -- detecting latent common causes in additive, multiplicative and complex noise regimes -- and on real data, where we are able to detect known common causes. In addition to detecting latent common causes, our experiments demonstrate that both the modified algorithms preserve the performance of the original in distinguishing directed causal relations.
    Recent Advances in Large Margin Learning. (arXiv:2103.13598v2 [cs.LG] UPDATED)
    (2 min) This paper serves as a survey of recent advances in large margin training and its theoretical foundations, mostly for (nonlinear) deep neural networks (DNNs) that are probably the most prominent machine learning models for large-scale data in the community over the past decade. We generalize the formulation of classification margins from classical research to latest DNNs, summarize theoretical connections between the margin, network generalization, and robustness, and introduce recent efforts in enlarging the margins for DNNs comprehensively. Since the viewpoint of different methods is discrepant, we categorize them into groups for ease of comparison and discussion in the paper. Hopefully, our discussions and overview inspire new research work in the community that aim to improve the performance of DNNs, and we also point to directions where the large margin principle can be verified to provide theoretical evidence why certain regularizations for DNNs function well in practice. We managed to shorten the paper such that the crucial spirit of large margin learning and related methods are better emphasized.
    Densities of Almost Surely Terminating Probabilistic Programs are Differentiable Almost Everywhere. (arXiv:2004.03924v2 [cs.LO] UPDATED)
    (2 min) We study the differential properties of higher-order statistical probabilistic programs with recursion and conditioning. Our starting point is an open problem posed by Hongseok Yang: what class of statistical probabilistic programs have densities that are differentiable almost everywhere? To formalise the problem, we consider Statistical PCF (SPCF), an extension of call-by-value PCF with real numbers, and constructs for sampling and conditioning. We give SPCF a sampling-style operational semantics a la Borgstrom et al., and study the associated weight (commonly referred to as the density) function and value function on the set of possible execution traces. Our main result is that almost-surely terminating SPCF programs, generated from a set of primitive functions (e.g. the set of analytic functions) satisfying mild closure properties, have weight and value functions that are almost-everywhere differentiable. We use a stochastic form of symbolic execution to reason about almost-everywhere differentiability. A by-product of this work is that almost-surely terminating deterministic (S)PCF programs with real parameters denote functions that are almost-everywhere differentiable. Our result is of practical interest, as almost-everywhere differentiability of the density function is required to hold for the correctness of major gradient-based inference algorithms.
    Fast PDN Impedance Prediction Using Deep Learning. (arXiv:2106.10693v1 [cs.LG])
    (2 min) Modeling and simulating a power distribution network (PDN) for printed circuit boards (PCBs) with irregular board shapes and multi-layer stackup is computationally inefficient using full-wave simulations. This paper presents a new concept of using deep learning for PDN impedance prediction. A boundary element method (BEM) is applied to efficiently calculate the impedance for arbitrary board shape and stackup. Then over one million boards with different shapes, stackup, IC location, and decap placement are randomly generated to train a deep neural network (DNN). The trained DNN can predict the impedance accurately for new board configurations that have not been used for training. The consumed time using the trained DNN is only 0.1 seconds, which is over 100 times faster than the BEM method and 5000 times faster than full-wave simulations.
    One-to-many Approach for Improving Super-Resolution. (arXiv:2106.10437v1 [eess.IV])
    (2 min) Super-resolution (SR) is a one-to-many task with multiple possible solutions. However, previous works were not concerned about this characteristic. For a one-to-many pipeline, the generator should be able to generate multiple estimates of the reconstruction, and not be penalized for generating similar and equally realistic images. To achieve this, we propose adding weighted pixel-wise noise after every Residual-in-Residual Dense Block (RRDB) to enable the generator to generate various images. We modify the strict content loss to not penalize the stochastic variation in reconstructed images as long as it has consistent content. Additionally, we observe that there are out-of-focus regions in the DIV2K, DIV8K datasets that provide unhelpful guidelines. We filter blurry regions in the training data using the method of [10]. Finally, we modify the discriminator to receive the low-resolution image as a reference image along with the target image to provide better feedback to the generator. Using our proposed methods, we were able to improve the performance of ESRGAN in x4 perceptual SR and achieve the state-of-the-art LPIPS score in x16 perceptual extreme SR.
    Learning Timestamp-Level Representations for Time Series with Hierarchical Contrastive Loss. (arXiv:2106.10466v1 [cs.LG])
    (2 min) This paper presents TS2Vec, a universal framework for learning timestamp-level representations of time series. Unlike existing methods, TS2Vec performs timestamp-wise discrimination, which learns a contextual representation vector directly for each timestamp. We find that the learned representations have superior predictive ability. A linear regression trained on top of the learned representations outperforms previous SOTAs for supervised time series forecasting. Also, the instance-level representations can be simply obtained by applying a max pooling layer on top of learned representations of all timestamps. We conduct extensive experiments on time series classification tasks to evaluate the quality of instance-level representations. As a result, TS2Vec achieves significant improvement compared with existing SOTAs of unsupervised time series representation on 125 UCR datasets and 29 UEA datasets. The source code is publicly available at https://github.com/yuezhihan/ts2vec.
    GPLA-12: An Acoustic Signal Dataset of Gas Pipeline Leakage. (arXiv:2106.10277v1 [eess.AS])
    (2 min) In this paper, we introduce a new acoustic leakage dataset of gas pipelines, called as GPLA-12, which has 12 categories over 684 training/testing acoustic signals. Unlike massive image and voice datasets, there have relatively few acoustic signal datasets, especially for engineering fault detection. In order to enhance the development of fault diagnosis, we collect acoustic leakage signals on the basis of an intact gas pipe system with external artificial leakages, and then preprocess the collected data with structured tailoring which are turned into GPLA-12. GPLA-12 dedicates to serve as a feature learning dataset for time-series tasks and classifications. To further understand the dataset, we train both shadow and deep learning algorithms to observe the performance. The dataset as well as the pretrained models have been released at both www.daip.club and github.com/Deep-AI-Application-DAIP
    Learning a Universal Template for Few-shot Dataset Generalization. (arXiv:2105.07029v2 [cs.LG] UPDATED)
    (2 min) Few-shot dataset generalization is a challenging variant of the well-studied few-shot classification problem where a diverse training set of several datasets is given, for the purpose of training an adaptable model that can then learn classes from new datasets using only a few examples. To this end, we propose to utilize the diverse training set to construct a universal template: a partial model that can define a wide array of dataset-specialized models, by plugging in appropriate components. For each new few-shot classification problem, our approach therefore only requires inferring a small number of parameters to insert into the universal template. We design a separate network that produces an initialization of those parameters for each given task, and we then fine-tune its proposed initialization via a few steps of gradient descent. Our approach is more parameter-efficient, scalable and adaptable compared to previous methods, and achieves the state-of-the-art on the challenging Meta-Dataset benchmark.
    Liquid Sensing Using WiFi Signals. (arXiv:2106.10356v1 [eess.SP])
    (2 min) The popularity of Internet-of-Things (IoT) has provided us with unprecedented opportunities to enable a variety of emerging services in a smart home environment. Among those services, sensing the liquid level in a container is critical to building many smart home and mobile healthcare applications that improve the quality of life. This paper presents LiquidSense, a liquid-level sensing system that is low-cost, high accuracy, widely applicable to different daily liquids and containers, and can be easily integrated with existing smart home networks. LiquidSense uses an existing home WiFi network and a low-cost transducer that attached to the container to sense the resonance of the container for liquid level detection. In particular, our system mounts a low-cost transducer on the surface of the container and emits a well-designed chirp signal to make the container resonant, which introduces subtle changes to the home WiFi signals. By analyzing the subtle phase changes of the WiFi signals, LiquidSense extracts the resonance frequency as a feature for liquid level detection. Our system constructs prediction models for both continuous and discrete predictions using curve fitting and SVM respectively. We evaluate LiquidSense in home environments with containers of three different materials and six types of liquids. Results show that LiquidSense achieves an overall accuracy of 97% for continuous prediction and an overall F-score of 0.968 for discrete prediction. Results also show that our system has a large coverage in a home environment and works well under non-line-of-sight (NLOS) scenarios.

2021-06-21

  • cs.CL updates on arXiv.org

    End-to-end Speech Translation via Cross-modal Progressive Training. (arXiv:2104.10380v2 [cs.CL] UPDATED)
    (2 min) End-to-end speech translation models have become a new trend in research due to their potential of reducing error propagation. However, these models still suffer from the challenge of data scarcity. How to effectively use unlabeled or other parallel corpora from machine translation is promising but still an open problem. In this paper, we propose Cross Speech-Text Network (XSTNet), an end-to-end model for speech-to-text translation. XSTNet takes both speech and text as input and outputs both transcription and translation text. The model benefits from its three key design aspects: a self-supervised pre-trained sub-network as the audio encoder, a multi-task training objective to exploit additional parallel bilingual text, and a progressive training procedure. We evaluate the performance of XSTNet and baselines on the MuST-C En-X and LibriSpeech En-Fr datasets. In particular, XSTNet achieves state-of-the-art results on all language directions with an average BLEU of 28.8, outperforming the previous best method by 3.2 BLEU. Code, models, cases, and more detailed analysis are available at https://github.com/ReneeYe/XSTNet.
    Modeling Profanity and Hate Speech in Social Media with Semantic Subspaces. (arXiv:2106.07505v2 [cs.CL] UPDATED)
    (2 min) Hate speech and profanity detection suffer from data sparsity, especially for languages other than English, due to the subjective nature of the tasks and the resulting annotation incompatibility of existing corpora. In this study, we identify profane subspaces in word and sentence representations and explore their generalization capability on a variety of similar and distant target tasks in a zero-shot setting. This is done monolingually (German) and cross-lingually to closely-related (English), distantly-related (French) and non-related (Arabic) tasks. We observe that, on both similar and distant target tasks and across all languages, the subspace-based representations transfer more effectively than standard BERT representations in the zero-shot setting, with improvements between F1 +10.9 and F1 +42.9 over the baselines across all tested monolingual and cross-lingual scenarios.
    CSFCube -- A Test Collection of Computer Science Research Articles for Faceted Query by Example. (arXiv:2103.12906v2 [cs.IR] UPDATED)
    (2 min) Query by Example is a well-known information retrieval task in which a document is chosen by the user as the search query and the goal is to retrieve relevant documents from a large collection. However, a document often covers multiple aspects of a topic. To address this scenario we introduce the task of faceted Query by Example in which users can also specify a finer grained aspect in addition to the input query document. We focus on the application of this task in scientific literature search. We envision models which are able to retrieve scientific papers analogous to a query scientific paper along specifically chosen rhetorical structure elements as one solution to this problem. In this work, the rhetorical structure elements, which we refer to as facets, indicate backgrounds, methods, or results of a scientific paper. We introduce and describe an expert annotated test collection to evaluate models trained to perform this task. Our test collection consists of a diverse set of 50 query documents, drawn from computational linguistics and machine learning venues. We carefully followed the annotation guideline used by TREC for depth-k pooling (k = 100 or 250) and the resulting data collection consists of graded relevance scores with high annotation agreement. The data is freely available for research purposes.
    Label Mask for Multi-Label Text Classification. (arXiv:2106.10076v1 [cs.CL])
    (2 min) One of the key problems in multi-label text classification is how to take advantage of the correlation among labels. However, it is very challenging to directly model the correlations among labels in a complex and unknown label space. In this paper, we propose a Label Mask multi-label text classification model (LM-MTC), which is inspired by the idea of cloze questions of language model. LM-MTC is able to capture implicit relationships among labels through the powerful ability of pre-train language models. On the basis, we assign a different token to each potential label, and randomly mask the token with a certain probability to build a label based Masked Language Model (MLM). We train the MTC and MLM together, further improving the generalization ability of the model. A large number of experiments on multiple datasets demonstrate the effectiveness of our method.
    Predicting gender of Brazilian names using deep learning. (arXiv:2106.10156v1 [cs.LG])
    (2 min) Predicting gender by the name is not a simple task. In many applications, especially in the natural language processing (NLP) field, this task may be necessary, mainly when considering foreign names. Some machine learning algorithms can satisfactorily perform the prediction. In this paper, we examined and implemented feedforward and recurrent deep neural network models, such as MLP, RNN, GRU, CNN, and BiLSTM, to classify gender through the first name. A dataset of Brazilian names is used to train and evaluate the models. We analyzed the accuracy, recall, precision, and confusion matrix to measure the models' performances. The results indicate that the gender prediction can be performed from the feature extraction strategy looking at the names as a set of strings. Some models accurately predict the gender in more than 90% of the cases. The recurrent models overcome the feedforward models in this binary classification problem.
    Discriminative Self-training for Punctuation Prediction. (arXiv:2104.10339v2 [cs.CL] UPDATED)
    (2 min) Punctuation prediction for automatic speech recognition (ASR) output transcripts plays a crucial role for improving the readability of the ASR transcripts and for improving the performance of downstream natural language processing applications. However, achieving good performance on punctuation prediction often requires large amounts of labeled speech transcripts, which is expensive and laborious. In this paper, we propose a Discriminative Self-Training approach with weighted loss and discriminative label smoothing to exploit unlabeled speech transcripts. Experimental results on the English IWSLT2011 benchmark test set and an internal Chinese spoken language dataset demonstrate that the proposed approach achieves significant improvement on punctuation prediction accuracy over strong baselines including BERT, RoBERTa, and ELECTRA models. The proposed Discriminative Self-Training approach outperforms the vanilla self-training approach. We establish a new state-of-the-art (SOTA) on the IWSLT2011 test set, outperforming the current SOTA model by 1.3% absolute gain on F$_1$.
    Pre-training for Spoken Language Understanding with Joint Textual and Phonetic Representation Learning. (arXiv:2104.10357v2 [cs.CL] UPDATED)
    (2 min) In the traditional cascading architecture for spoken language understanding (SLU), it has been observed that automatic speech recognition errors could be detrimental to the performance of natural language understanding. End-to-end (E2E) SLU models have been proposed to directly map speech input to desired semantic frame with a single model, hence mitigating ASR error propagation. Recently, pre-training technologies have been explored for these E2E models. In this paper, we propose a novel joint textual-phonetic pre-training approach for learning spoken language representations, aiming at exploring the full potentials of phonetic information to improve SLU robustness to ASR errors. We explore phoneme labels as high-level speech features, and design and compare pre-training tasks based on conditional masked language model objectives and inter-sentence relation objectives. We also investigate the efficacy of combining textual and phonetic information during fine-tuning. Experimental results on spoken language understanding benchmarks, Fluent Speech Commands and SNIPS, show that the proposed approach significantly outperforms strong baseline models and improves robustness of spoken language understanding to ASR errors.
    Back Attention Knowledge Transfer for Low-Resource Named Entity Recognition. (arXiv:1906.01183v3 [cs.CL] UPDATED)
    (2 min) In recent years, great success has been achieved in the field of natural language processing (NLP), thanks in part to the considerable amount of annotated resources. For named entity recognition (NER), most languages do not have such an abundance of labeled data as English, so the performances of those languages are relatively lower. To improve the performance, we propose a general approach called Back Attention Network (BAN). BAN uses a translation system to translate other language sentences into English and then applies a new mechanism named back attention knowledge transfer to obtain task-specific information from pre-trained high-resource languages NER model. This strategy can transfer high-layer features of well-trained model and enrich the semantic representations of the original language. Experiments on three different language datasets indicate that the proposed approach outperforms other state-of-the-art methods.
    Weakly Supervised Pre-Training for Multi-Hop Retriever. (arXiv:2106.09983v1 [cs.CL])
    (2 min) In multi-hop QA, answering complex questions entails iterative document retrieval for finding the missing entity of the question. The main steps of this process are sub-question detection, document retrieval for the sub-question, and generation of a new query for the final document retrieval. However, building a dataset that contains complex questions with sub-questions and their corresponding documents requires costly human annotation. To address the issue, we propose a new method for weakly supervised multi-hop retriever pre-training without human efforts. Our method includes 1) a pre-training task for generating vector representations of complex questions, 2) a scalable data generation method that produces the nested structure of question and sub-question as weak supervision for pre-training, and 3) a pre-training model structure based on dense encoders. We conduct experiments to compare the performance of our pre-trained retriever with several state-of-the-art models on end-to-end multi-hop QA as well as document retrieval. The experimental results show that our pre-trained retriever is effective and also robust on limited data and computational resources.
    Enhancing user creativity: Semantic measures for idea generation. (arXiv:2106.10131v1 [cs.CL])
    (2 min) Human creativity generates novel ideas to solve real-world problems. This thereby grants us the power to transform the surrounding world and extend our human attributes beyond what is currently possible. Creative ideas are not just new and unexpected, but are also successful in providing solutions that are useful, efficient and valuable. Thus, creativity optimizes the use of available resources and increases wealth. The origin of human creativity, however, is poorly understood, and semantic measures that could predict the success of generated ideas are currently unknown. Here, we analyze a dataset of design problem-solving conversations in real-world settings by using 49 semantic measures based on WordNet 3.1 and demonstrate that a divergence of semantic similarity, an increased information content, and a decreased polysemy predict the success of generated ideas. The first feedback from clients also enhances information content and leads to a divergence of successful ideas in creative problem solving. These results advance cognitive science by identifying real-world processes in human problem solving that are relevant to the success of produced solutions and provide tools for real-time monitoring of problem solving, student training and skill acquisition. A selected subset of information content (IC S\'anchez-Batet) and semantic similarity (Lin/S\'anchez-Batet) measures, which are both statistically powerful and computationally fast, could support the development of technologies for computer-assisted enhancements of human creativity or for the implementation of creativity in machines endowed with general artificial intelligence.
    Synchronising speech segments with musical beats in Mandarin and English singing. (arXiv:2106.10045v1 [cs.SD])
    (2 min) Generating synthesised singing voice with models trained on speech data has many advantages due to the models' flexibility and controllability. However, since the information about the temporal relationship between segments and beats are lacking in speech training data, the synthesised singing may sound off-beat at times. Therefore, the availability of the information on the temporal relationship between speech segments and music beats is crucial. The current study investigated the segment-beat synchronisation in singing data, with hypotheses formed based on the linguistics theories of P-centre and sonority hierarchy. A Mandarin corpus and an English corpus of professional singing data were manually annotated and analysed. The results showed that the presence of musical beats was more dependent on segment duration than sonority. However, the sonority hierarchy and the P-centre theory were highly related to the location of beats. Mandarin and English demonstrated cross-linguistic variations despite exhibiting common patterns.
    On-Device Personalization of Automatic Speech Recognition Models for Disordered Speech. (arXiv:2106.10259v1 [eess.AS])
    (2 min) While current state-of-the-art Automatic Speech Recognition (ASR) systems achieve high accuracy on typical speech, they suffer from significant performance degradation on disordered speech and other atypical speech patterns. Personalization of ASR models, a commonly applied solution to this problem, is usually performed in a server-based training environment posing problems around data privacy, delayed model-update times, and communication cost for copying data and models between mobile device and server infrastructure. In this paper, we present an approach to on-device based ASR personalization with very small amounts of speaker-specific data. We test our approach on a diverse set of 100 speakers with disordered speech and find median relative word error rate improvement of 71% with only 50 short utterances required per speaker. When tested on a voice-controlled home automation platform, on-device personalized models show a median task success rate of 81%, compared to only 40% of the unadapted models.
    Challenges and Limitations with the Metrics Measuring the Complexity of Code-Mixed Text. (arXiv:2106.10123v1 [cs.CL])
    (2 min) Code-mixing is a frequent communication style among multilingual speakers where they mix words and phrases from two different languages in the same utterance of text or speech. Identifying and filtering code-mixed text is a challenging task due to its co-existence with monolingual and noisy text. Over the years, several code-mixing metrics have been extensively used to identify and validate code-mixed text quality. This paper demonstrates several inherent limitations of code-mixing metrics with examples from the already existing datasets that are popularly used across various experiments.
    Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition. (arXiv:2106.10169v1 [cs.LG])
    (2 min) By implicitly recognizing a user based on his/her speech input, speaker identification enables many downstream applications, such as personalized system behavior and expedited shopping checkouts. Based on whether the speech content is constrained or not, both text-dependent (TD) and text-independent (TI) speaker recognition models may be used. We wish to combine the advantages of both types of models through an ensemble system to make more reliable predictions. However, any such combined approach has to be robust to incomplete inputs, i.e., when either TD or TI input is missing. As a solution we propose a fusion of embeddings network foenet architecture, combining joint learning with neural attention. We compare foenet with four competitive baseline methods on a dataset of voice assistant inputs, and show that it achieves higher accuracy than the baseline and score fusion methods, especially in the presence of incomplete inputs.
    DEUS: A Data-driven Approach to Estimate User Satisfaction in Multi-turn Dialogues. (arXiv:2103.01287v2 [cs.CL] UPDATED)
    (2 min) Digital assistants are experiencing rapid growth due to their ability to assist users with day-to-day tasks where most dialogues are happening multi-turn. However, evaluating multi-turn dialogues remains challenging, especially at scale. We suggest a context-sensitive method to estimate the turn-level satisfaction for dialogue considering various types of user preferences. The costs of interactions between users and dialogue systems are formulated using a budget consumption concept. We assume users have an initial interaction budget for a dialogue formed based on the task complexity and that each turn has a cost. When the task is completed, or the budget has been exhausted, users quit the dialogue. We demonstrate our method's effectiveness by extensive experimentation with a simulated dialogue platform and real multi-turn dialogues.
    BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. (arXiv:2106.10199v1 [cs.LG])
    (2 min) We show that with small-to-medium training data, fine-tuning only the bias terms (or a subset of the bias terms) of pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model. For larger data, bias-only fine-tuning is competitive with other sparse fine-tuning methods. Besides their practical utility, these findings are relevant for the question of understanding the commonly-used process of finetuning: they support the hypothesis that finetuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge.
    Towards Financial Sentiment Analysis in a South African Landscape. (arXiv:2106.10004v1 [cs.CL])
    (2 min) Sentiment analysis as a sub-field of natural language processing has received increased attention in the past decade enabling organisations to more effectively manage their reputation through online media monitoring. Many drivers impact reputation, however, this thesis focuses only the aspect of financial performance and explores the gap with regards to financial sentiment analysis in a South African context. Results showed that pre-trained sentiment analysers are least effective for this task and that traditional lexicon-based and machine learning approaches are best suited to predict financial sentiment of news articles. The evaluated methods produced accuracies of 84\%-94\%. The predicted sentiments correlated quite well with share price and highlighted the potential use of sentiment as an indicator of financial performance. A main contribution of the study was updating an existing sentiment dictionary for financial sentiment analysis. Model generalisation was less acceptable due to the limited amount of training data used. Future work includes expanding the data set to improve general usability and contribute to an open-source financial sentiment analyser for South African data.
    A Neural Edge-Editing Approach for Document-Level Relation Graph Extraction. (arXiv:2106.09900v1 [cs.CL])
    (2 min) In this paper, we propose a novel edge-editing approach to extract relation information from a document. We treat the relations in a document as a relation graph among entities in this approach. The relation graph is iteratively constructed by editing edges of an initial graph, which might be a graph extracted by another system or an empty graph. The way to edit edges is to classify them in a close-first manner using the document and temporally-constructed graph information; each edge is represented with a document context information by a pretrained transformer model and a graph context information by a graph convolutional neural network model. We evaluate our approach on the task to extract material synthesis procedures from materials science texts. The experimental results show the effectiveness of our approach in editing the graphs initialized by our in-house rule-based system and empty graphs.
    Recurrent Stacking of Layers in Neural Networks: An Application to Neural Machine Translation. (arXiv:2106.10002v1 [cs.CL])
    (2 min) In deep neural network modeling, the most common practice is to stack a number of recurrent, convolutional, or feed-forward layers in order to obtain high-quality continuous space representations which in turn improves the quality of the network's prediction. Conventionally, each layer in the stack has its own parameters which leads to a significant increase in the number of model parameters. In this paper, we propose to share parameters across all layers thereby leading to a recurrently stacked neural network model. We report on an extensive case study on neural machine translation (NMT), where we apply our proposed method to an encoder-decoder based neural network model, i.e., the Transformer model, and experiment with three Japanese--English translation datasets. We empirically demonstrate that the translation quality of a model that recurrently stacks a single layer 6 times, despite having significantly fewer parameters, approaches that of a model that stacks 6 layers where each layer has different parameters. We also explore the limits of recurrent stacking where we train extremely deep NMT models. This paper also examines the utility of our recurrently stacked model as a student model through transfer learning via leveraging pre-trained parameters and knowledge distillation, and shows that it compensates for the performance drops in translation quality that the direct training of recurrently stacked model brings. We also show how transfer learning helps in faster decoding on top of the already reduced number of parameters due to recurrent stacking. Finally, we analyze the effects of recurrently stacked layers by visualizing the attentions of models that use recurrently stacked layers and models that do not.
    Investigating the Role of Negatives in Contrastive Representation Learning. (arXiv:2106.09943v1 [cs.LG])
    (2 min) Noise contrastive learning is a popular technique for unsupervised representation learning. In this approach, a representation is obtained via reduction to supervised learning, where given a notion of semantic similarity, the learner tries to distinguish a similar (positive) example from a collection of random (negative) examples. The success of modern contrastive learning pipelines relies on many parameters such as the choice of data augmentation, the number of negative examples, and the batch size; however, there is limited understanding as to how these parameters interact and affect downstream performance. We focus on disambiguating the role of one of these parameters: the number of negative examples. Theoretically, we show the existence of a collision-coverage trade-off suggesting that the optimal number of negative examples should scale with the number of underlying concepts in the data. Empirically, we scrutinize the role of the number of negatives in both NLP and vision tasks. In the NLP task, we find that the results broadly agree with our theory, while our vision experiments are murkier with performance sometimes even being insensitive to the number of negatives. We discuss plausible explanations for this behavior and suggest future directions to better align theory and practice.
    Continuity of Topic, Interaction, and Query: Learning to Quote in Online Conversations. (arXiv:2106.09896v1 [cs.CL])
    (2 min) Quotations are crucial for successful explanations and persuasions in interpersonal communications. However, finding what to quote in a conversation is challenging for both humans and machines. This work studies automatic quotation generation in an online conversation and explores how language consistency affects whether a quotation fits the given context. Here, we capture the contextual consistency of a quotation in terms of latent topics, interactions with the dialogue history, and coherence to the query turn's existing content. Further, an encoder-decoder neural framework is employed to continue the context with a quotation via language generation. Experiment results on two large-scale datasets in English and Chinese demonstrate that our quotation generation model outperforms the state-of-the-art models. Further analysis shows that topic, interaction, and query consistency are all helpful to learn how to quote in online conversations.
    Multi-mode Transformer Transducer with Stochastic Future Context. (arXiv:2106.09760v1 [eess.AS])
    (2 min) Automatic speech recognition (ASR) models make fewer errors when more surrounding speech information is presented as context. Unfortunately, acquiring a larger future context leads to higher latency. There exists an inevitable trade-off between speed and accuracy. Naively, to fit different latency requirements, people have to store multiple models and pick the best one under the constraints. Instead, a more desirable approach is to have a single model that can dynamically adjust its latency based on different constraints, which we refer to as Multi-mode ASR. A Multi-mode ASR model can fulfill various latency requirements during inference -- when a larger latency becomes acceptable, the model can process longer future context to achieve higher accuracy and when a latency budget is not flexible, the model can be less dependent on future context but still achieve reliable accuracy. In pursuit of Multi-mode ASR, we propose Stochastic Future Context, a simple training procedure that samples one streaming configuration in each iteration. Through extensive experiments on AISHELL-1 and LibriSpeech datasets, we show that a Multi-mode ASR model rivals, if not surpasses, a set of competitive streaming baselines trained with different latency budgets.
    PRGC: Potential Relation and Global Correspondence Based Joint Relational Triple Extraction. (arXiv:2106.09895v1 [cs.CL])
    (2 min) Joint extraction of entities and relations from unstructured texts is a crucial task in information extraction. Recent methods achieve considerable performance but still suffer from some inherent limitations, such as redundancy of relation prediction, poor generalization of span-based extraction and inefficiency. In this paper, we decompose this task into three subtasks, Relation Judgement, Entity Extraction and Subject-object Alignment from a novel perspective and then propose a joint relational triple extraction framework based on Potential Relation and Global Correspondence (PRGC). Specifically, we design a component to predict potential relations, which constrains the following entity extraction to the predicted relation subset rather than all relations; then a relation-specific sequence tagging component is applied to handle the overlapping problem between subjects and objects; finally, a global correspondence component is designed to align the subject and object into a triple with low-complexity. Extensive experiments show that PRGC achieves state-of-the-art performance on public benchmarks with higher efficiency and delivers consistent performance gain on complex scenarios of overlapping triples.
    LNN-EL: A Neuro-Symbolic Approach to Short-text Entity Linking. (arXiv:2106.09795v1 [cs.CL])
    (2 min) Entity linking (EL), the task of disambiguating mentions in text by linking them to entities in a knowledge graph, is crucial for text understanding, question answering or conversational systems. Entity linking on short text (e.g., single sentence or question) poses particular challenges due to limited context. While prior approaches use either heuristics or black-box neural methods, here we propose LNN-EL, a neuro-symbolic approach that combines the advantages of using interpretable rules based on first-order logic with the performance of neural learning. Even though constrained to using rules, LNN-EL performs competitively against SotA black-box neural approaches, with the added benefits of extensibility and transferability. In particular, we show that we can easily blend existing rule templates given by a human expert, with multiple types of features (priors, BERT encodings, box embeddings, etc), and even scores resulting from previous EL methods, thus improving on such methods. For instance, on the LC-QuAD-1.0 dataset, we show more than $4$\% increase in F1 score over previous SotA. Finally, we show that the inductive bias offered by using logic results in learned rules that transfer well across datasets, even without fine tuning, while maintaining high accuracy.
    Multi-Task Learning and Adapted Knowledge Models for Emotion-Cause Extraction. (arXiv:2106.09790v1 [cs.CL])
    (2 min) Detecting what emotions are expressed in text is a well-studied problem in natural language processing. However, research on finer grained emotion analysis such as what causes an emotion is still in its infancy. We present solutions that tackle both emotion recognition and emotion cause detection in a joint fashion. Considering that common-sense knowledge plays an important role in understanding implicitly expressed emotions and the reasons for those emotions, we propose novel methods that combine common-sense knowledge via adapted knowledge models with multi-task learning to perform joint emotion classification and emotion cause tagging. We show performance improvement on both tasks when including common-sense reasoning and a multitask framework. We provide a thorough analysis to gain insights into model performance.
    Subjective Bias in Abstractive Summarization. (arXiv:2106.10084v1 [cs.CL])
    (2 min) Due to the subjectivity of the summarization, it is a good practice to have more than one gold summary for each training document. However, many modern large-scale abstractive summarization datasets have only one-to-one samples written by different human with different styles. The impact of this phenomenon is understudied. We formulate the differences among possible multiple expressions summarizing the same content as subjective bias and examine the role of this bias in the context of abstractive summarization. In this paper a lightweight and effective method to extract the feature embeddings of subjective styles is proposed. Results of summarization models trained on style-clustered datasets show that there are certain types of styles that lead to better convergence, abstraction and generalization. The reproducible code and generated summaries are available online.
    Graph-based Joint Pandemic Concern and Relation Extraction on Twitter. (arXiv:2106.09929v1 [cs.CL])
    (2 min) Public concern detection provides potential guidance to the authorities for crisis management before or during a pandemic outbreak. Detecting people's concerns and attention from online social media platforms has been widely acknowledged as an effective approach to relieve public panic and prevent a social crisis. However, detecting concerns in time from massive information in social media turns out to be a big challenge, especially when sufficient manually labeled data is in the absence of public health emergencies, e.g., COVID-19. In this paper, we propose a novel end-to-end deep learning model to identify people's concerns and the corresponding relations based on Graph Convolutional Network and Bi-directional Long Short Term Memory integrated with Concern Graph. Except for the sequential features from BERT embeddings, the regional features of tweets can be extracted by the Concern Graph module, which not only benefits the concern detection but also enables our model to be high noise-tolerant. Thus, our model can address the issue of insufficient manually labeled data. We conduct extensive experiments to evaluate the proposed model by using both manually labeled tweets and automatically labeled tweets. The experimental results show that our model can outperform the state-of-art models on real-world datasets.
    An Information Retrieval Approach to Building Datasets for Hate Speech Detection. (arXiv:2106.09775v1 [cs.CL])
    (2 min) Building a benchmark dataset for hate speech detection presents several challenges. Firstly, because hate speech is relatively rare -- e.g., less than 3\% of Twitter posts are hateful \citep{founta2018large} -- random sampling of tweets to annotate is inefficient in capturing hate speech. A common practice is to only annotate tweets containing known ``hate words'', but this risks yielding a biased benchmark that only partially captures the real-world phenomenon of interest. A second challenge is that definitions of hate speech tend to be highly variable and subjective. Annotators having diverse prior notions of hate speech may not only disagree with one another but also struggle to conform to specified labeling guidelines. Our key insight is that the rarity and subjectivity of hate speech are akin to that of relevance in information retrieval (IR). This connection suggests that well-established methodologies for creating IR test collections might also be usefully applied to create better benchmark datasets for hate speech detection. Firstly, to intelligently and efficiently select which tweets to annotate, we apply established IR techniques of {\em pooling} and {\em active learning}. Secondly, to improve both consistency and value of annotations, we apply {\em task decomposition} \cite{Zhang-sigir14} and {\em annotator rationale} \cite{mcdonnell16-hcomp} techniques. Using the above techniques, we create and share a new benchmark dataset\footnote{We will release the dataset upon publication.} for hate speech detection with broader coverage than prior datasets. We also show a dramatic drop in accuracy of existing detection models when tested on these broader forms of hate. Collected annotator rationales not only provide documented support for labeling decisions but also create exciting future work opportunities for dual-supervision and/or explanation generation in modeling.
    GEM: A General Evaluation Benchmark for Multimodal Tasks. (arXiv:2106.09889v1 [cs.CL])
    (2 min) In this paper, we present GEM as a General Evaluation benchmark for Multimodal tasks. Different from existing datasets such as GLUE, SuperGLUE, XGLUE and XTREME that mainly focus on natural language tasks, GEM is a large-scale vision-language benchmark, which consists of GEM-I for image-language tasks and GEM-V for video-language tasks. Comparing with existing multimodal datasets such as MSCOCO and Flicker30K for image-language tasks, YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language tasks and video-language tasks at the same time, but also labeled in multiple languages. We also provide two baseline models for this benchmark. We will release the dataset, code and baseline models, aiming to advance the development of multilingual multimodal research.
    SPBERT: Pre-training BERT on SPARQL Queries for End-to-end Question Answering over Knowledge Graphs. (arXiv:2106.09997v1 [cs.CL])
    (2 min) We aim to create an unprecedented attempt to build an end-to-end Question Answering (QA) over Knowledge Graphs (KGs), which can construct SPARQL queries from natural language questions and generate a verbalized answer to its queries. Hence, we introduce SPBERT, a Transformer-based language model pre-trained on massive SPARQL query logs. By incorporating masked language modelling objective and word structural objective, SPBERT can learn general-purpose representations in both natural language and SPARQL query language and make the most of the sequential order of words that are crucial for structured language like SPARQL. In this paper, we investigate how SPBERT and encoder-decoder architecture can be adapted for Knowledge-based QA corpora. We conduct exhaustive experiments on two auxiliary tasks, including SPARQL Query Construction and Answer Verbalization Generation. Results show that SPBERT obtains promising performance and achieves state-of-the-art results on several of these tasks.
    Bad Characters: Imperceptible NLP Attacks. (arXiv:2106.09898v1 [cs.CL])
    (2 min) Several years of research have shown that machine-learning systems are vulnerable to adversarial examples, both in theory and in practice. Until now, such attacks have primarily targeted visual models, exploiting the gap between human and machine perception. Although text-based models have also been attacked with adversarial examples, such attacks struggled to preserve semantic meaning and indistinguishability. In this paper, we explore a large class of adversarial examples that can be used to attack text-based models in a black-box setting without making any human-perceptible visual modification to inputs. We use encoding-specific perturbations that are imperceptible to the human eye to manipulate the outputs of a wide range of Natural Language Processing (NLP) systems from neural machine-translation pipelines to web search engines. We find that with a single imperceptible encoding injection -- representing one invisible character, homoglyph, reordering, or deletion -- an attacker can significantly reduce the performance of vulnerable models, and with three injections most models can be functionally broken. Our attacks work against currently-deployed commercial systems, including those produced by Microsoft and Google, in addition to open source models published by Facebook and IBM. This novel series of attacks presents a significant threat to many language processing systems: an attacker can affect systems in a targeted manner without any assumptions about the underlying model. We conclude that text-based NLP systems require careful input sanitization, just like conventional applications, and that given such systems are now being deployed rapidly at scale, the urgent attention of architects and operators is required.
  • cs.CV updates on arXiv.org

    ResDepth: Learned Residual Stereo Reconstruction. (arXiv:2001.08026v3 [cs.CV] UPDATED)
    (2 min) We propose an embarrassingly simple but very effective scheme for high-quality dense stereo reconstruction: (i) generate an approximate reconstruction with your favourite stereo matcher; (ii) rewarp the input images with that approximate model; (iii) with the initial reconstruction and the warped images as input, train a deep network to enhance the reconstruction by regressing a residual correction; and (iv) if desired, iterate the refinement with the new, improved reconstruction. The strategy to only learn the residual greatly simplifies the learning problem. A standard Unet without bells and whistles is enough to reconstruct even small surface details, like dormers and roof substructures in satellite images. We also investigate residual reconstruction with less information and find that even a single image is enough to greatly improve an approximate reconstruction. Our full model reduces the mean absolute error of state-of-the-art stereo reconstruction systems by >50%, both in our target domain of satellite stereo and on stereo pairs from the ETH3D benchmark.
    Partition-Guided GANs. (arXiv:2104.00816v2 [cs.LG] UPDATED)
    (2 min) Despite the success of Generative Adversarial Networks (GANs), their training suffers from several well-known problems, including mode collapse and difficulties learning a disconnected set of manifolds. In this paper, we break down the challenging task of learning complex high dimensional distributions, supporting diverse data samples, to simpler sub-tasks. Our solution relies on designing a partitioner that breaks the space into smaller regions, each having a simpler distribution, and training a different generator for each partition. This is done in an unsupervised manner without requiring any labels. We formulate two desired criteria for the space partitioner that aid the training of our mixture of generators: 1) to produce connected partitions and 2) provide a proxy of distance between partitions and data samples, along with a direction for reducing that distance. These criteria are developed to avoid producing samples from places with non-existent data density, and also facilitate training by providing additional direction to the generators. We develop theoretical constraints for a space partitioner to satisfy the above criteria. Guided by our theoretical analysis, we design an effective neural architecture for the space partitioner that empirically assures these conditions. Experimental results on various standard benchmarks show that the proposed unsupervised model outperforms several recent methods.
    Delving Deep into the Generalization of Vision Transformers under Distribution Shifts. (arXiv:2106.07617v2 [cs.CV] UPDATED)
    (2 min) Recently, Vision Transformers (ViTs) have achieved impressive results on various vision tasks. Yet, their generalization ability under different distribution shifts is rarely understood. In this work, we provide a comprehensive study on the out-of-distribution generalization of ViTs. To support a systematic investigation, we first present a taxonomy of distribution shifts by categorizing them into five conceptual groups: corruption shift, background shift, texture shift, destruction shift, and style shift. Then we perform extensive evaluations of ViT variants under different groups of distribution shifts and compare their generalization ability with CNNs. Several important observations are obtained: 1) ViTs generalize better than CNNs under multiple distribution shifts. With the same or fewer parameters, ViTs are ahead of corresponding CNNs by more than 5% in top-1 accuracy under most distribution shifts. 2) Larger ViTs gradually narrow the in-distribution and out-of-distribution performance gap. To further improve the generalization of ViTs, we design the Generalization-Enhanced ViTs by integrating adversarial learning, information theory, and self-supervised learning. By investigating three types of generalization-enhanced ViTs, we observe their gradient-sensitivity and design a smoother learning strategy to achieve a stable training process. With modified training schemes, we achieve improvements on performance towards out-of-distribution data by 4% from vanilla ViTs. We comprehensively compare three generalization-enhanced ViTs with their corresponding CNNs, and observe that: 1) For the enhanced model, larger ViTs still benefit more for the out-of-distribution generalization. 2) generalization-enhanced ViTs are more sensitive to the hyper-parameters than corresponding CNNs. We hope our comprehensive study could shed light on the design of more generalizable learning architectures.
    Exponential Moving Average Normalization for Self-supervised and Semi-supervised Learning. (arXiv:2101.08482v2 [cs.LG] UPDATED)
    (2 min) We present a plug-in replacement for batch normalization (BN) called exponential moving average normalization (EMAN), which improves the performance of existing student-teacher based self- and semi-supervised learning techniques. Unlike the standard BN, where the statistics are computed within each batch, EMAN, used in the teacher, updates its statistics by exponential moving average from the BN statistics of the student. This design reduces the intrinsic cross-sample dependency of BN and enhances the generalization of the teacher. EMAN improves strong baselines for self-supervised learning by 4-6/1-2 points and semi-supervised learning by about 7/2 points, when 1%/10% supervised labels are available on ImageNet. These improvements are consistent across methods, network architectures, training duration, and datasets, demonstrating the general effectiveness of this technique. The code is available at https://github.com/amazon-research/exponential-moving-average-normalization.
    Radar-to-Lidar: Heterogeneous Place Recognition via Joint Learning. (arXiv:2102.04960v2 [cs.CV] UPDATED)
    (2 min) Place recognition is critical for both offline mapping and online localization. However, current single-sensor based place recognition still remains challenging in adverse conditions. In this paper, a heterogeneous measurements based framework is proposed for long-term place recognition, which retrieves the query radar scans from the existing lidar maps. To achieve this, a deep neural network is built with joint training in the learning stage, and then in the testing stage, shared embeddings of radar and lidar are extracted for heterogeneous place recognition. To validate the effectiveness of the proposed method, we conduct tests and generalization experiments on the multi-session public datasets compared to other competitive methods. The experimental results indicate that our model is able to perform multiple place recognitions: lidar-to-lidar, radar-to-radar and radar-to-lidar, while the learned model is trained only once. We also release the source code publicly: https://github.com/ZJUYH/radar-to-lidar-place-recognition.
    A Consensual Collaborative Learning Method for Remote Sensing Image Classification Under Noisy Multi-Labels. (arXiv:2105.05496v2 [cs.CV] UPDATED)
    (2 min) Collecting a large number of reliable training images annotated by multiple land-cover class labels in the framework of multi-label classification is time-consuming and costly in remote sensing (RS). To address this problem, publicly available thematic products are often used for annotating RS images with zero-labeling-cost. However, such an approach may result in constructing a training set with noisy multi-labels, distorting the learning process. To address this problem, we propose a Consensual Collaborative Multi-Label Learning (CCML) method. The proposed CCML identifies, ranks and corrects training images with noisy multi-labels through four main modules: 1) discrepancy module; 2) group lasso module; 3) flipping module; and 4) swap module. The discrepancy module ensures that the two networks learn diverse features, while obtaining the same predictions. The group lasso module detects the potentially noisy labels by estimating the label uncertainty based on the aggregation of two collaborative networks. The flipping module corrects the identified noisy labels, whereas the swap module exchanges the ranking information between the two networks. The experimental results confirm the success of the proposed CCML under high (synthetically added) multi-label noise rates. The code of the proposed method is publicly available at https://noisy-labels-in-rs.org
    Learning Diverse-Structured Networks for Adversarial Robustness. (arXiv:2102.01886v4 [cs.LG] UPDATED)
    (2 min) In adversarial training (AT), the main focus has been the objective and optimizer while the model has been less studied, so that the models being used are still those classic ones in standard training (ST). Classic network architectures (NAs) are generally worse than searched NAs in ST, which should be the same in AT. In this paper, we argue that NA and AT cannot be handled independently, since given a dataset, the optimal NA in ST would be no longer optimal in AT. That being said, AT is time-consuming itself; if we directly search NAs in AT over large search spaces, the computation will be practically infeasible. Thus, we propose a diverse-structured network (DS-Net), to significantly reduce the size of the search space: instead of low-level operations, we only consider predefined atomic blocks, where an atomic block is a time-tested building block like the residual block. There are only a few atomic blocks and thus we can weight all atomic blocks rather than find the best one in a searched block of DS-Net, which is an essential trade-off between exploring diverse structures and exploiting the best structures. Empirical results demonstrate the advantages of DS-Net, i.e., weighting the atomic blocks.
    Few-Shot Semantic Segmentation Augmented with Image-Level Weak Annotations. (arXiv:2007.01496v2 [cs.CV] UPDATED)
    (2 min) Despite the great progress made by deep neural networks in the semantic segmentation task, traditional neural-networkbased methods typically suffer from a shortage of large amounts of pixel-level annotations. Recent progress in fewshot semantic segmentation tackles the issue by only a few pixel-level annotated examples. However, these few-shot approaches cannot easily be applied to multi-way or weak annotation settings. In this paper, we advance the few-shot segmentation paradigm towards a scenario where image-level annotations are available to help the training process of a few pixel-level annotations. Our key idea is to learn a better prototype representation of the class by fusing the knowledge from the image-level labeled data. Specifically, we propose a new framework, called PAIA, to learn the class prototype representation in a metric space by integrating image-level annotations. Furthermore, by considering the uncertainty of pseudo-masks, a distilled soft masked average pooling strategy is designed to handle distractions in image-level annotations. Extensive empirical results on two datasets show superior performance of PAIA.
    Facial Expressions as a Vulnerability in Face Recognition. (arXiv:2011.08809v2 [cs.CV] UPDATED)
    (2 min) This work explores facial expression bias as a security vulnerability of face recognition systems. Despite the great performance achieved by state-of-the-art face recognition systems, the algorithms are still sensitive to a large range of covariates. We present a comprehensive analysis of how facial expression bias impacts the performance of face recognition technologies. Our study analyzes: i) facial expression biases in the most popular face recognition databases; and ii) the impact of facial expression in face recognition performances. Our experimental framework includes two face detectors, three face recognition models, and three different databases. Our results demonstrate a huge facial expression bias in the most widely used databases, as well as a related impact of face expression in the performance of state-of-the-art algorithms. This work opens the door to new research lines focused on mitigating the observed vulnerability.
    Consistent Posterior Distributions under Vessel-Mixing: A Regularization for Cross-Domain Retinal Artery/Vein Classification. (arXiv:2103.09097v2 [cs.CV] UPDATED)
    (2 min) Retinal artery/vein (A/V) classification is a critical technique for diagnosing diabetes and cardiovascular diseases. Although deep learning based methods achieve impressive results in A/V classification, their performances usually degrade severely when being directly applied to another database, due to the domain shift, e.g., caused by the variations in imaging protocols. In this paper, we propose a novel vessel-mixing based consistency regularization framework, for cross-domain learning in retinal A/V classification. Specially, to alleviate the severe bias to source domain, based on the label smooth prior, the model is regularized to give consistent predictions for unlabeled target-domain inputs that are under perturbation. This consistency regularization implicitly introduces a mechanism where the model and the perturbation is opponent to each other, where the model is pushed to be robust enough to cope with the perturbation. Thus, we investigate a more difficult opponent to further inspire the robustness of model, in the scenario of retinal A/V, called vessel-mixing perturbation. Specially, it effectively disturbs the fundus images especially the vessel structures by mixing two images regionally. We conduct extensive experiments on cross-domain A/V classification using four public datasets, which are collected by diverse institutions and imaging devices. The results demonstrate that our method achieves the state-of-the-art cross-domain performance, which is also close to the upper bound obtained by fully supervised learning on target domain.
    SLSNet: Skin lesion segmentation using a lightweight generative adversarial network. (arXiv:1907.00856v3 [eess.IV] UPDATED)
    (3 min) The determination of precise skin lesion boundaries in dermoscopic images using automated methods faces many challenges, most importantly, the presence of hair, inconspicuous lesion edges and low contrast in dermoscopic images, and variability in the color, texture and shapes of skin lesions. Existing deep learning-based skin lesion segmentation algorithms are expensive in terms of computational time and memory. Consequently, running such segmentation algorithms requires a powerful GPU and high bandwidth memory, which are not available in dermoscopy devices. Thus, this article aims to achieve precise skin lesion segmentation with minimum resources: a lightweight, efficient generative adversarial network (GAN) model called SLSNet, which combines 1-D kernel factorized networks, position and channel attention, and multiscale aggregation mechanisms with a GAN model. The 1-D kernel factorized network reduces the computational cost of 2D filtering. The position and channel attention modules enhance the discriminative ability between the lesion and non-lesion feature representations in spatial and channel dimensions, respectively. A multiscale block is also used to aggregate the coarse-to-fine features of input skin images and reduce the effect of the artifacts. SLSNet is evaluated on two publicly available datasets: ISBI 2017 and the ISIC 2018. Although SLSNet has only 2.35 million parameters, the experimental results demonstrate that it achieves segmentation results on a par with the state-of-the-art skin lesion segmentation methods with an accuracy of 97.61%, and Dice and Jaccard similarity coefficients of 90.63% and 81.98%, respectively. SLSNet can run at more than 110 frames per second (FPS) in a single GTX1080Ti GPU, which is faster than well-known deep learning-based image segmentation models, such as FCN. Therefore, SLSNet can be used for practical dermoscopic applications.
    Level Set Stereo for Cooperative Grouping with Occlusion. (arXiv:2006.16094v3 [cs.CV] UPDATED)
    (2 min) Localizing stereo boundaries is difficult because matching cues are absent in the occluded regions that are adjacent to them. We introduce an energy and level-set optimizer that improves boundaries by encoding the essential geometry of occlusions: The spatial extent of an occlusion must equal the amplitude of the disparity jump that causes it. In a collection of figure-ground scenes from Middlebury and Falling Things stereo datasets, the model provides more accurate boundaries than previous occlusion-handling techniques.
    CompositeTasking: Understanding Images by Spatial Composition of Tasks. (arXiv:2012.09030v2 [cs.CV] UPDATED)
    (2 min) We define the concept of CompositeTasking as the fusion of multiple, spatially distributed tasks, for various aspects of image understanding. Learning to perform spatially distributed tasks is motivated by the frequent availability of only sparse labels across tasks, and the desire for a compact multi-tasking network. To facilitate CompositeTasking, we introduce a novel task conditioning model -- a single encoder-decoder network that performs multiple, spatially varying tasks at once. The proposed network takes an image and a set of pixel-wise dense task requests as inputs, and performs the requested prediction task for each pixel. Moreover, we also learn the composition of tasks that needs to be performed according to some CompositeTasking rules, which includes the decision of where to apply which task. It not only offers us a compact network for multi-tasking, but also allows for task-editing. Another strength of the proposed method is demonstrated by only having to supply sparse supervision per task. The obtained results are on par with our baselines that use dense supervision and a multi-headed multi-tasking design. The source code will be made publicly available at www.github.com/nikola3794/composite-tasking.
    Semantic segmentation of multispectral photoacoustic images using deep learning. (arXiv:2105.09624v2 [eess.IV] UPDATED)
    (2 min) Photoacoustic imaging has the potential to revolutionise healthcare due to the valuable information on tissue physiology that is contained in multispectral photoacoustic measurements. Clinical translation of the technology requires conversion of the high-dimensional acquired data into clinically relevant and interpretable information. In this work, we present a deep learning-based approach to semantic segmentation of multispectral photoacoustic images to facilitate the interpretability of recorded images. Manually annotated multispectral photoacoustic imaging data are used as gold standard reference annotations and enable the training of a deep learning-based segmentation algorithm in a supervised manner. Based on a validation study with experimentally acquired data of healthy human volunteers, we show that automatic tissue segmentation can be used to create powerful analyses and visualisations of multispectral photoacoustic images. Due to the intuitive representation of high-dimensional information, such a processing algorithm could be a valuable means to facilitate the clinical translation of photoacoustic imaging.
    Radar Camera Fusion via Representation Learning in Autonomous Driving. (arXiv:2103.07825v3 [cs.CV] UPDATED)
    (2 min) Radars and cameras are mature, cost-effective, and robust sensors and have been widely used in the perception stack of mass-produced autonomous driving systems. Due to their complementary properties, outputs from radar detection (radar pins) and camera perception (2D bounding boxes) are usually fused to generate the best perception results. The key to successful radar-camera fusion is the accurate data association. The challenges in the radar-camera association can be attributed to the complexity of driving scenes, the noisy and sparse nature of radar measurements, and the depth ambiguity from 2D bounding boxes. Traditional rule-based association methods are susceptible to performance degradation in challenging scenarios and failure in corner cases. In this study, we propose to address radar-camera association via deep representation learning, to explore feature-level interaction and global reasoning. Additionally, we design a loss sampling mechanism and an innovative ordinal loss to overcome the difficulty of imperfect labeling and to enforce critical human-like reasoning. Despite being trained with noisy labels generated by a rule-based algorithm, our proposed method achieves a performance of 92.2% F1 score, which is 11.6% higher than the rule-based teacher. Moreover, this data-driven method also lends itself to continuous improvement via corner case mining.
    End-to-End 3D Point Cloud Learning for Registration Task Using Virtual Correspondences. (arXiv:2011.14579v2 [cs.CV] UPDATED)
    (2 min) 3D Point cloud registration is still a very challenging topic due to the difficulty in finding the rigid transformation between two point clouds with partial correspondences, and it's even harder in the absence of any initial estimation information. In this paper, we present an end-to-end deep-learning based approach to resolve the point cloud registration problem. Firstly, the revised LPD-Net is introduced to extract features and aggregate them with the graph network. Secondly, the self-attention mechanism is utilized to enhance the structure information in the point cloud and the cross-attention mechanism is designed to enhance the corresponding information between the two input point clouds. Based on which, the virtual corresponding points can be generated by a soft pointer based method, and finally, the point cloud registration problem can be solved by implementing the SVD method. Comparison results in ModelNet40 dataset validate that the proposed approach reaches the state-of-the-art in point cloud registration tasks and experiment resutls in KITTI dataset validate the effectiveness of the proposed approach in real applications.Our source code is available at \url{https://github.com/qiaozhijian/VCR-Net.git}
    VSAC: Efficient and Accurate Estimator for H and F. (arXiv:2106.10240v1 [cs.CV])
    (2 min) We present VSAC, a RANSAC-type robust estimator with a number of novelties. It benefits from the introduction of the concept of independent inliers that improves significantly the efficacy of the dominant plane handling and, also, allows near error-free rejection of incorrect models, without false positives. The local optimization process and its application is improved so that it is run on average only once. Further technical improvements include adaptive sequential hypothesis verification and efficient model estimation via Gaussian elimination. Experiments on four standard datasets show that VSAC is significantly faster than all its predecessors and runs on average in 1-2 ms, on a CPU. It is two orders of magnitude faster and yet as precise as MAGSAC++, the currently most accurate estimator of two-view geometry. In the repeated runs on EVD, HPatches, PhotoTourism, and Kusvod2 datasets, it never failed.
    Embodied Language Grounding with 3D Visual Feature Representations. (arXiv:1910.01210v3 [cs.CV] UPDATED)
    (2 min) We propose associating language utterances to 3D visual abstractions of the scene they describe. The 3D visual abstractions are encoded as 3-dimensional visual feature maps. We infer these 3D visual scene feature maps from RGB images of the scene via view prediction: when the generated 3D scene feature map is neurally projected from a camera viewpoint, it should match the corresponding RGB image. We present generative models that condition on the dependency tree of an utterance and generate a corresponding visual 3D feature map as well as reason about its plausibility, and detector models that condition on both the dependency tree of an utterance and a related image and localize the object referents in the 3D feature map inferred from the image. Our model outperforms models of language and vision that associate language with 2D CNN activations or 2D images by a large margin in a variety of tasks, such as, classifying plausibility of utterances, detecting referential expressions, and supplying rewards for trajectory optimization of object placement policies from language instructions. We perform numerous ablations and show the improved performance of our detectors is due to its better generalization across camera viewpoints and lack of object interferences in the inferred 3D feature space, and the improved performance of our generators is due to their ability to spatially reason about objects and their configurations in 3D when mapping from language to scenes.
    Go with the Flows: Mixtures of Normalizing Flows for Point Cloud Generation and Reconstruction. (arXiv:2106.03135v2 [cs.CV] UPDATED)
    (2 min) Recently normalizing flows (NFs) have demonstrated state-of-the-art performance on modeling 3D point clouds while allowing sampling with arbitrary resolution at inference time. However, these flow-based models still require long training times and large models for representing complicated geometries. This work enhances their representational power by applying mixtures of NFs to point clouds. We show that in this more general framework each component learns to specialize in a particular subregion of an object in a completely unsupervised fashion. By instantiating each mixture component with a comparatively small NF we generate point clouds with improved details compared to single-flow-based models while using fewer parameters and considerably reducing the inference runtime. We further demonstrate that by adding data augmentation, individual mixture components can learn to specialize in a semantically meaningful manner. We evaluate mixtures of NFs on generation, autoencoding and single-view reconstruction based on the ShapeNet dataset.
    Steerable Partial Differential Operators for Equivariant Neural Networks. (arXiv:2106.10163v1 [cs.LG])
    (2 min) Recent work in equivariant deep learning bears strong similarities to physics. Fields over a base space are fundamental entities in both subjects, as are equivariant maps between these fields. In deep learning, however, these maps are usually defined by convolutions with a kernel, whereas they are partial differential operators (PDOs) in physics. Developing the theory of equivariant PDOs in the context of deep learning could bring these subjects even closer together and lead to a stronger flow of ideas. In this work, we derive a $G$-steerability constraint that completely characterizes when a PDO between feature vector fields is equivariant, for arbitrary symmetry groups $G$. We then fully solve this constraint for several important groups. We use our solutions as equivariant drop-in replacements for convolutional layers and benchmark them in that role. Finally, we develop a framework for equivariant maps based on Schwartz distributions that unifies classical convolutions and differential operators and gives insight about the relation between the two.
    Self-Supervised Longitudinal Neighbourhood Embedding. (arXiv:2103.03840v3 [cs.CV] UPDATED)
    (2 min) Longitudinal MRIs are often used to capture the gradual deterioration of brain structure and function caused by aging or neurological diseases. Analyzing this data via machine learning generally requires a large number of ground-truth labels, which are often missing or expensive to obtain. Reducing the need for labels, we propose a self-supervised strategy for representation learning named Longitudinal Neighborhood Embedding (LNE). Motivated by concepts in contrastive learning, LNE explicitly models the similarity between trajectory vectors across different subjects. We do so by building a graph in each training iteration defining neighborhoods in the latent space so that the progression direction of a subject follows the direction of its neighbors. This results in a smooth trajectory field that captures the global morphological change of the brain while maintaining the local continuity. We apply LNE to longitudinal T1w MRIs of two neuroimaging studies: a dataset composed of 274 healthy subjects, and Alzheimer's Disease Neuroimaging Initiative (ADNI, N=632). The visualization of the smooth trajectory vector field and superior performance on downstream tasks demonstrate the strength of the proposed method over existing self-supervised methods in extracting information associated with normal aging and in revealing the impact of neurodegenerative disorders. The code is available at \url{https://github.com/ouyangjiahong/longitudinal-neighbourhood-embedding.git}.
    SIR: Self-supervised Image Rectification via Seeing the Same Scene from Multiple Different Lenses. (arXiv:2011.14611v2 [cs.CV] UPDATED)
    (2 min) Deep learning has demonstrated its power in image rectification by leveraging the representation capacity of deep neural networks via supervised training based on a large-scale synthetic dataset. However, the model may overfit the synthetic images and generalize not well on real-world fisheye images due to the limited universality of a specific distortion model and the lack of explicitly modeling the distortion and rectification process. In this paper, we propose a novel self-supervised image rectification (SIR) method based on an important insight that the rectified results of distorted images of a same scene from different lens should be the same. Specifically, we devise a new network architecture with a shared encoder and several prediction heads, each of which predicts the distortion parameter of a specific distortion model. We further leverage a differentiable warping module to generate the rectified images and re-distorted images from the distortion parameters and exploit the intra- and inter-model consistency between them during training, thereby leading to a self-supervised learning scheme without the need for ground-truth distortion parameters or normal images. Experiments on synthetic dataset and real-world fisheye images demonstrate that our method achieves comparable or even better performance than the supervised baseline method and representative state-of-the-art methods. Self-supervised learning also improves the universality of distortion models while keeping their self-consistency.
    Multi-Source Domain Adaptation with Collaborative Learning for Semantic Segmentation. (arXiv:2103.04717v3 [cs.CV] UPDATED)
    (2 min) Multi-source unsupervised domain adaptation~(MSDA) aims at adapting models trained on multiple labeled source domains to an unlabeled target domain. In this paper, we propose a novel multi-source domain adaptation framework based on collaborative learning for semantic segmentation. Firstly, a simple image translation method is introduced to align the pixel value distribution to reduce the gap between source domains and target domain to some extent. Then, to fully exploit the essential semantic information across source domains, we propose a collaborative learning method for domain adaptation without seeing any data from target domain. In addition, similar to the setting of unsupervised domain adaptation, unlabeled target domain data is leveraged to further improve the performance of domain adaptation. This is achieved by additionally constraining the outputs of multiple adaptation models with pseudo labels online generated by an ensembled model. Extensive experiments and ablation studies are conducted on the widely-used domain adaptation benchmark datasets in semantic segmentation. Our proposed method achieves 59.0\% mIoU on the validation set of Cityscapes by training on the labeled Synscapes and GTA5 datasets and unlabeled training set of Cityscapes. It significantly outperforms all previous state-of-the-arts single-source and multi-source unsupervised domain adaptation methods.
    Noise2Sim -- Similarity-based Self-Learning for Image Denoising. (arXiv:2011.03384v4 [cs.LG] UPDATED)
    (2 min) Despite its best performance in image denoising, the supervised deep denoising methods require paired noise-clean data, which are often unavailable. To address this challenge, Noise2Noise was designed based on the fact that paired noise-clean images can be replaced by paired noise-noise images that are easier to collect. However, in many scenarios the collection of paired noise-noise images is still impractical. To bypass labeled images, Noise2Void methods predict masked pixels from their surroundings with single noisy images only and give improved denoising results that still need improvements. An observation on classic denoising methods is that non-local mean (NLM) outcomes are typically superior to locally denoised results. In contrast, Noise2Void and its variants do not utilize self-similarities in an image as the NLM-based methods do. Here we propose Noise2Sim, an NLM-inspired self-learning method for image denoising. Specifically, Noise2Sim leverages the self-similarity of image pixels to train the denoising network, requiring single noisy images only. Our theoretical analysis shows that Noise2Sim tends to be equivalent to Noise2Noise under mild conditions. To efficiently manage the computational burden for globally searching similar pixels, we design a two-step procedure to provide data for Noise2Sim training. Extensive experiments demonstrate the superiority of Noise2Sim on common benchmark datasets.
    A Novel Graph based Trajectory Predictor with Pseudo Oracle. (arXiv:2002.00391v2 [cs.CV] UPDATED)
    (2 min) Pedestrian trajectory prediction in dynamic scenes remains a challenging and critical problem in numerous applications, such as self-driving cars and socially aware robots. Challenges concentrate on capturing pedestrians' motion patterns and social interactions, as well as handling the future uncertainties. Recent studies focus on modeling pedestrians' motion patterns with recurrent neural networks, capturing social interactions with pooling-based or graph-based methods, and handling future uncertainties by using random Gaussian noise as the latent variable. However, they do not integrate specific obstacle avoidance experience (OAE) that may improve prediction performance. For example, pedestrians' future trajectories are always influenced by others in front. Here we propose GTPPO (Graph-based Trajectory Predictor with Pseudo Oracle), an encoder-decoder-based method conditioned on pedestrians' future behaviors. Pedestrians' motion patterns are encoded with a long short-term memory unit, which introduces the temporal attention to highlight specific time steps. Their interactions are captured by a graph-based attention mechanism, which draws OAE into the data-driven learning process of graph attention. Future uncertainties are handled by generating multi-modal outputs with an informative latent variable. Such a variable is generated by a novel pseudo oracle predictor, which minimizes the knowledge gap between historical and ground-truth trajectories. Finally, the GTPPO is evaluated on ETH, UCY and Stanford Drone datasets, and the results demonstrate state-of-the-art performance. Besides, the qualitative evaluations show successful cases of handling sudden motion changes in the future. Such findings indicate that GTPPO can peek into the future.
    Edge Computing for Real-Time Near-Crash Detection for Smart Transportation Applications. (arXiv:2008.00549v2 [cs.RO] UPDATED)
    (2 min) Traffic near-crash events serve as critical data sources for various smart transportation applications, such as being surrogate safety measures for traffic safety research and corner case data for automated vehicle testing. However, there are several key challenges for near-crash detection. First, extracting near-crashes from original data sources requires significant computing, communication, and storage resources. Also, existing methods lack efficiency and transferability, which bottlenecks prospective large-scale applications. To this end, this paper leverages the power of edge computing to address these challenges by processing the video streams from existing dashcams onboard in a real-time manner. We design a multi-thread system architecture that operates on edge devices and model the bounding boxes generated by object detection and tracking in linear complexity. The method is insensitive to camera parameters and backward compatible with different vehicles. The edge computing system has been evaluated with recorded videos and real-world tests on two cars and four buses for over ten thousand hours. It filters out irrelevant videos in real-time thereby saving labor cost, processing time, network bandwidth, and data storage. It collects not only event videos but also other valuable data such as road user type, event location, time to collision, vehicle trajectory, vehicle speed, brake switch, and throttle. The experiments demonstrate the promising performance of the system regarding efficiency, accuracy, reliability, and transferability. It is among the first efforts in applying edge computing for real-time traffic video analytics and is expected to benefit multiple sub-fields in smart transportation research and applications.
    Model Generalization in Deep Learning Applications for Land Cover Mapping. (arXiv:2008.10351v3 [cs.CV] UPDATED)
    (2 min) Recent work has shown that deep learning models can be used to classify land-use data from geospatial satellite imagery. We show that when these deep learning models are trained on data from specific continents/seasons, there is a high degree of variability in model performance on out-of-sample continents/seasons. This suggests that just because a model accurately predicts land-use classes in one continent or season does not mean that the model will accurately predict land-use classes in a different continent or season. We then use clustering techniques on satellite imagery from different continents to visualize the differences in landscapes that make geospatial generalization particularly difficult, and summarize our takeaways for future satellite imagery-related applications.
    How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. (arXiv:2106.10270v1 [cs.CV])
    (2 min) Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation (``AugReg'' for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
    A Survey on Deep Hashing Methods. (arXiv:2003.03369v2 [cs.CV] UPDATED)
    (2 min) Nearest neighbor search is to find the data points in the database such that the distances from them to the query are the smallest, which is a fundamental problem in various domains, such as computer vision, recommendation systems and machine learning. Hashing is one of the most widely used methods for its computational and storage efficiency. With the development of deep learning, deep hashing methods show more advantages than traditional methods. In this paper, we present a comprehensive survey of the deep hashing algorithms. Specifically, we categorize deep supervised hashing methods into pairwise similarity preserving, multiwise similarity preserving, implicit similarity preserving, classification-oriented preserving as well as quantization according to the manners of preserving the similarities. In addition, we also introduce some other topics such as deep unsupervised hashing and multi-modal deep hashing methods. Meanwhile, we also present some commonly used public datasets and the scheme to measure the performance of deep hashing algorithms. Finally, we discussed some potential research directions in conclusion.
    CT Image Synthesis Using Weakly Supervised Segmentation and Geometric Inter-Label Relations For COVID Image Analysis. (arXiv:2106.10230v1 [eess.IV])
    (2 min) While medical image segmentation is an important task for computer aided diagnosis, the high expertise requirement for pixelwise manual annotations makes it a challenging and time consuming task. Since conventional data augmentations do not fully represent the underlying distribution of the training set, the trained models have varying performance when tested on images captured from different sources. Most prior work on image synthesis for data augmentation ignore the interleaved geometric relationship between different anatomical labels. We propose improvements over previous GAN-based medical image synthesis methods by learning the relationship between different anatomical labels. We use a weakly supervised segmentation method to obtain pixel level semantic label map of images which is used learn the intrinsic relationship of geometry and shape across semantic labels. Latent space variable sampling results in diverse generated images from a base image and improves robustness. We use the synthetic images from our method to train networks for segmenting COVID-19 infected areas from lung CT images. The proposed method outperforms state-of-the-art segmentation methods on a public dataset. Ablation studies also demonstrate benefits of integrating geometry and diversity.
    Pseudo-healthy synthesis with pathology disentanglement and adversarial learning. (arXiv:2005.01607v3 [eess.IV] UPDATED)
    (2 min) Pseudo-healthy synthesis is the task of creating a subject-specific `healthy' image from a pathological one. Such images can be helpful in tasks such as anomaly detection and understanding changes induced by pathology and disease. In this paper, we present a model that is encouraged to disentangle the information of pathology from what seems to be healthy. We disentangle what appears to be healthy and where disease is as a segmentation map, which are then recombined by a network to reconstruct the input disease image. We train our models adversarially using either paired or unpaired settings, where we pair disease images and maps when available. We quantitatively and subjectively, with a human study, evaluate the quality of pseudo-healthy images using several criteria. We show in a series of experiments, performed on ISLES, BraTS and Cam-CAN datasets, that our method is better than several baselines and methods from the literature. We also show that due to better training processes we could recover deformations, on surrounding tissue, caused by disease. Our implementation is publicly available at https://github.com/xiat0616/pseudo-healthy-synthesis. This paper has been accepted by Medical Image Analysis: https://doi.org/10.1016/j.media.2020.101719.
    No Routing Needed Between Capsules. (arXiv:2001.09136v6 [cs.CV] UPDATED)
    (2 min) Most capsule network designs rely on traditional matrix multiplication between capsule layers and computationally expensive routing mechanisms to deal with the capsule dimensional entanglement that the matrix multiplication introduces. By using Homogeneous Vector Capsules (HVCs), which use element-wise multiplication rather than matrix multiplication, the dimensions of the capsules remain unentangled. In this work, we study HVCs as applied to the highly structured MNIST dataset in order to produce a direct comparison to the capsule research direction of Geoffrey Hinton, et al. In our study, we show that a simple convolutional neural network using HVCs performs as well as the prior best performing capsule network on MNIST using 5.5x fewer parameters, 4x fewer training epochs, no reconstruction sub-network, and requiring no routing mechanism. The addition of multiple classification branches to the network establishes a new state of the art for the MNIST dataset with an accuracy of 99.87% for an ensemble of these models, as well as establishing a new state of the art for a single model (99.83% accurate).
    Toward Fault Detection in Industrial Welding Processes with Deep Learning and Data Augmentation. (arXiv:2106.10160v1 [cs.CV])
    (2 min) With the rise of deep learning models in the field of computer vision, new possibilities for their application in industrial processes proves to return great benefits. Nevertheless, the actual fit of machine learning for highly standardised industrial processes is still under debate. This paper addresses the challenges on the industrial realization of the AI tools, considering the use case of Laser Beam Welding quality control as an example. We use object detection algorithms from the TensorFlow object detection API and adapt them to our use case using transfer learning. The baseline models we develop are used as benchmarks and evaluated and compared to models that undergo dataset scaling and hyperparameter tuning. We find that moderate scaling of the dataset via image augmentation leads to improvements in intersection over union (IoU) and recall, whereas high levels of augmentation and scaling may lead to deterioration of results. Finally, we put our results into perspective of the underlying use case and evaluate their fit.
    End-to-end Temporal Action Detection with Transformer. (arXiv:2106.10271v1 [cs.CV])
    (2 min) Temporal action detection (TAD) aims to determine the semantic label and the boundaries of every action instance in an untrimmed video. It is a fundamental task in video understanding and significant progress has been made in TAD. Previous methods involve multiple stages or networks and hand-designed rules or operations, which fall short in efficiency and flexibility. Here, we construct an end-to-end framework for TAD upon Transformer, termed \textit{TadTR}, which simultaneously predicts all action instances as a set of labels and temporal locations in parallel. TadTR is able to adaptively extract temporal context information needed for making action predictions, by selectively attending to a number of snippets in a video. It greatly simplifies the pipeline of TAD and runs much faster than previous detectors. Our method achieves state-of-the-art performance on HACS Segments and THUMOS14 and competitive performance on ActivityNet-1.3. Our code will be made available at \url{https://github.com/xlliu7/TadTR}.
    Self-supervised Video Representation Learning with Cross-Stream Prototypical Contrasting. (arXiv:2106.10137v1 [cs.CV])
    (2 min) Instance-level contrastive learning techniques, which rely on data augmentation and a contrastive loss function, have found great success in the domain of visual representation learning. They are not suitable for exploiting the rich dynamical structure of video however, as operations are done on many augmented instances. In this paper we propose "Video Cross-Stream Prototypical Contrasting", a novel method which predicts consistent prototype assignments from both RGB and optical flow views, operating on sets of samples. Specifically, we alternate the optimization process; while optimizing one of the streams, all views are mapped to one set of stream prototype vectors. Each of the assignments is predicted with all views except the one matching the prediction, pushing representations closer to their assigned prototypes. As a result, more efficient video embeddings with ingrained motion information are learned, without the explicit need for optical flow computation during inference. We obtain state-of-the-art results on nearest neighbour video retrieval and action recognition, outperforming previous best by +3.2% on UCF101 using the S3D backbone (90.5% Top-1 acc), and by +7.2% on UCF101 and +15.1% on HMDB51 using the R(2+1)D backbone.
    Residual Contrastive Learning for Joint Demosaicking and Denoising. (arXiv:2106.10070v1 [cs.CV])
    (2 min) The breakthrough of contrastive learning (CL) has fueled the recent success of self-supervised learning (SSL) in high-level vision tasks on RGB images. However, CL is still ill-defined for low-level vision tasks, such as joint demosaicking and denoising (JDD), in the RAW domain. To bridge this methodological gap, we present a novel CL approach on RAW images, residual contrastive learning (RCL), which aims to learn meaningful representations for JDD. Our work is built on the assumption that noise contained in each RAW image is signal-dependent, thus two crops from the same RAW image should have more similar noise distribution than two crops from different RAW images. We use residuals as a discriminative feature and the earth mover's distance to measure the distribution divergence for the contrastive loss. To evaluate the proposed CL strategy, we simulate a series of unsupervised JDD experiments with large-scale data corrupted by synthetic signal-dependent noise, where we set a new benchmark for unsupervised JDD tasks with unknown (random) noise variance. Our empirical study not only validates that CL can be applied on distributions (c.f. features), but also exposes the lack of robustness of previous non-ML and SSL JDD methods when the statistics of the noise are unknown, thus providing some further insight into signal-dependent noise problems.
    Advanced Hough-based method for on-device document localization. (arXiv:2106.09987v1 [cs.CV])
    (2 min) The demand for on-device document recognition systems increases in conjunction with the emergence of more strict privacy and security requirements. In such systems, there is no data transfer from the end device to a third-party information processing servers. The response time is vital to the user experience of on-device document recognition. Combined with the unavailability of discrete GPUs, powerful CPUs, or a large RAM capacity on consumer-grade end devices such as smartphones, the time limitations put significant constraints on the computational complexity of the applied algorithms for on-device execution. In this work, we consider document location in an image without prior knowledge of the document content or its internal structure. In accordance with the published works, at least 5 systems offer solutions for on-device document location. All these systems use a location method which can be considered Hough-based. The precision of such systems seems to be lower than that of the state-of-the-art solutions which were not designed to account for the limited computational resources. We propose an advanced Hough-based method. In contrast with other approaches, it accounts for the geometric invariants of the central projection model and combines both edge and color features for document boundary detection. The proposed method allowed for the second best result for SmartDoc dataset in terms of precision, surpassed by U-net like neural network. When evaluated on a more challenging MIDV-500 dataset, the proposed algorithm guaranteed the best precision compared to published methods. Our method retained the applicability to on-device computations.
    A Coarse-to-Fine Instance Segmentation Network with Learning Boundary Representation. (arXiv:2106.10213v1 [cs.CV])
    (2 min) Boundary-based instance segmentation has drawn much attention since of its attractive efficiency. However, existing methods suffer from the difficulty in long-distance regression. In this paper, we propose a coarse-to-fine module to address the problem. Approximate boundary points are generated at the coarse stage and then features of these points are sampled and fed to a refined regressor for fine prediction. It is end-to-end trainable since differential sampling operation is well supported in the module. Furthermore, we design a holistic boundary-aware branch and introduce instance-agnostic supervision to assist regression. Equipped with ResNet-101, our approach achieves 31.7\% mask AP on COCO dataset with single-scale training and testing, outperforming the baseline 1.3\% mask AP with less than 1\% additional parameters and GFLOPs. Experiments also show that our proposed method achieves competitive performance compared to existing boundary-based methods with a lightweight design and a simple pipeline.
    All You Can Embed: Natural Language based Vehicle Retrieval with Spatio-Temporal Transformers. (arXiv:2106.10153v1 [cs.CV])
    (2 min) Combining Natural Language with Vision represents a unique and interesting challenge in the domain of Artificial Intelligence. The AI City Challenge Track 5 for Natural Language-Based Vehicle Retrieval focuses on the problem of combining visual and textual information, applied to a smart-city use case. In this paper, we present All You Can Embed (AYCE), a modular solution to correlate single-vehicle tracking sequences with natural language. The main building blocks of the proposed architecture are (i) BERT to provide an embedding of the textual descriptions, (ii) a convolutional backbone along with a Transformer model to embed the visual information. For the training of the retrieval model, a variation of the Triplet Margin Loss is proposed to learn a distance measure between the visual and language embeddings. The code is publicly available at https://github.com/cscribano/AYCE_2021.
    Contrastive Learning of Generalized Game Representations. (arXiv:2106.10060v1 [cs.CV])
    (2 min) Representing games through their pixels offers a promising approach for building general-purpose and versatile game models. While games are not merely images, neural network models trained on game pixels often capture differences of the visual style of the image rather than the content of the game. As a result, such models cannot generalize well even within similar games of the same genre. In this paper we build on recent advances in contrastive learning and showcase its benefits for representation learning in games. Learning to contrast images of games not only classifies games in a more efficient manner; it also yields models that separate games in a more meaningful fashion by ignoring the visual style and focusing, instead, on their content. Our results in a large dataset of sports video games containing 100k images across 175 games and 10 game genres suggest that contrastive learning is better suited for learning generalized game representations compared to conventional supervised learning. The findings of this study bring us closer to universal visual encoders for games that can be reused across previously unseen games without requiring retraining or fine-tuning.
    Bridging the Gap Between Object Detection and User Intent via Query-Modulation. (arXiv:2106.10258v1 [cs.CV])
    (2 min) When interacting with objects through cameras, or pictures, users often have a specific intent. For example, they may want to perform a visual search. However, most object detection models ignore the user intent, relying on image pixels as their only input. This often leads to incorrect results, such as lack of a high-confidence detection on the object of interest, or detection with a wrong class label. In this paper we investigate techniques to modulate standard object detectors to explicitly account for the user intent, expressed as an embedding of a simple query. Compared to standard object detectors, query-modulated detectors show superior performance at detecting objects for a given label of interest. Thanks to large-scale training data synthesized from standard object detection annotations, query-modulated detectors can also outperform specialized referring expression recognition systems. Furthermore, they can be simultaneously trained to solve for both query-modulated detection and standard object detection.
    Development of a conversing and body temperature scanning autonomously navigating robot to help screen for COVID-19. (arXiv:2106.09894v1 [cs.RO])
    (2 min) Throughout the COVID-19 pandemic, the most common symptom displayed by patients has been a fever, leading to the use of temperature scanning as a preemptive measure to detect potential carriers of the virus. Human employees with handheld thermometers have been used to fulfill this task, however this puts them at risk as they cannot be physically distanced and the sequential nature of this method leads to great inconveniences and inefficiency. The proposed solution is an autonomously navigating robot capable of conversing and scanning people's temperature to detect fevers and help screen for COVID-19. To satisfy this objective, the robot must be able to (1) navigate autonomously, (2) detect and track people, and (3) get individuals' temperature reading and converse with them if it exceeds 38{\deg}C. An autonomously navigating mobile robot is used with a manipulator controlled using a face tracking algorithm, and an end effector consisting of a thermal camera, smartphone, and chatbot. The goal is to develop a functioning solution that performs the above tasks. In addition, technical challenges encountered and their engineering solutions will be presented, and recommendations will be made for enhancements that could be incorporated when approaching commercialization.
    A Dynamic Spatial-temporal Attention Network for Early Anticipation of Traffic Accidents. (arXiv:2106.10197v1 [cs.CV])
    (2 min) Recently, autonomous vehicles and those equipped with an Advanced Driver Assistance System (ADAS) are emerging. They share the road with regular ones operated by human drivers entirely. To ensure guaranteed safety for passengers and other road users, it becomes essential for autonomous vehicles and ADAS to anticipate traffic accidents from natural driving scenes. The dynamic spatial-temporal interaction of the traffic agents is complex, and visual cues for predicting a future accident are embedded deeply in dashcam video data. Therefore, early anticipation of traffic accidents remains a challenge. To this end, the paper presents a dynamic spatial-temporal attention (DSTA) network for early anticipation of traffic accidents from dashcam videos. The proposed DSTA-network learns to select discriminative temporal segments of a video sequence with a module named Dynamic Temporal Attention (DTA). It also learns to focus on the informative spatial regions of frames with another module named Dynamic Spatial Attention (DSA). The spatial-temporal relational features of accidents, along with scene appearance features, are learned jointly with a Gated Recurrent Unit (GRU) network. The experimental evaluation of the DSTA-network on two benchmark datasets confirms that it has exceeded the state-of-the-art performance. A thorough ablation study evaluates the contributions of individual components of the DSTA-network, revealing how the network achieves such performance. Furthermore, this paper proposes a new strategy that fuses the prediction scores from two complementary models and verifies its effectiveness in further boosting the performance of early accident anticipation.
    Training or Architecture? How to Incorporate Invariance in Neural Networks. (arXiv:2106.10044v1 [cs.CV])
    (2 min) Many applications require the robustness, or ideally the invariance, of a neural network to certain transformations of input data. Most commonly, this requirement is addressed by either augmenting the training data, using adversarial training, or defining network architectures that include the desired invariance automatically. Unfortunately, the latter often relies on the ability to enlist all possible transformations, which make such approaches largely infeasible for infinite sets of transformations, such as arbitrary rotations or scaling. In this work, we propose a method for provably invariant network architectures with respect to group actions by choosing one element from a (possibly continuous) orbit based on a fixed criterion. In a nutshell, we intend to 'undo' any possible transformation before feeding the data into the actual network. We analyze properties of such approaches, extend them to equivariant networks, and demonstrate their advantages in terms of robustness as well as computational efficiency in several numerical examples. In particular, we investigate the robustness with respect to rotations of images (which can possibly hold up to discretization artifacts only) as well as the provable rotational and scaling invariance of 3D point cloud classification.
    Light Pollution Reduction in Nighttime Photography. (arXiv:2106.10046v1 [cs.CV])
    (2 min) Nighttime photographers are often troubled by light pollution of unwanted artificial lights. Artificial lights, after scattered by aerosols in the atmosphere, can inundate the starlight and degrade the quality of nighttime images, by reducing contrast and dynamic range and causing hazes. In this paper we develop a physically-based light pollution reduction (LPR) algorithm that can substantially alleviate the aforementioned degradations of perceptual quality and restore the pristine state of night sky. The key to the success of the proposed LPR algorithm is an inverse method to estimate the spatial radiance distribution and spectral signature of ground artificial lights. Extensive experiments are carried out to evaluate the efficacy and limitations of the LPR algorithm.
    Residual Error: a New Performance Measure for Adversarial Robustness. (arXiv:2106.10212v1 [cs.LG])
    (2 min) Despite the significant advances in deep learning over the past decade, a major challenge that limits the wide-spread adoption of deep learning has been their fragility to adversarial attacks. This sensitivity to making erroneous predictions in the presence of adversarially perturbed data makes deep neural networks difficult to adopt for certain real-world, mission-critical applications. While much of the research focus has revolved around adversarial example creation and adversarial hardening, the area of performance measures for assessing adversarial robustness is not well explored. Motivated by this, this study presents the concept of residual error, a new performance measure for not only assessing the adversarial robustness of a deep neural network at the individual sample level, but also can be used to differentiate between adversarial and non-adversarial examples to facilitate for adversarial example detection. Furthermore, we introduce a hybrid model for approximating the residual error in a tractable manner. Experimental results using the case of image classification demonstrates the effectiveness and efficacy of the proposed residual error metric for assessing several well-known deep neural network architectures. These results thus illustrate that the proposed measure could be a useful tool for not only assessing the robustness of deep neural networks used in mission-critical scenarios, but also in the design of adversarially robust models.
    Learning and Meshing from Deep Implicit Surface Networks Using an Efficient Implementation of Analytic Marching. (arXiv:2106.10031v1 [cs.CV])
    (2 min) Reconstruction of object or scene surfaces has tremendous applications in computer vision, computer graphics, and robotics. In this paper, we study a fundamental problem in this context about recovering a surface mesh from an implicit field function whose zero-level set captures the underlying surface. To achieve the goal, existing methods rely on traditional meshing algorithms; while promising, they suffer from loss of precision learned in the implicit surface networks, due to the use of discrete space sampling in marching cubes. Given that an MLP with activations of Rectified Linear Unit (ReLU) partitions its input space into a number of linear regions, we are motivated to connect this local linearity with a same property owned by the desired result of polygon mesh. More specifically, we identify from the linear regions, partitioned by an MLP based implicit function, the analytic cells and analytic faces that are associated with the function's zero-level isosurface. We prove that under mild conditions, the identified analytic faces are guaranteed to connect and form a closed, piecewise planar surface. Based on the theorem, we propose an algorithm of analytic marching, which marches among analytic cells to exactly recover the mesh captured by an implicit surface network. We also show that our theory and algorithm are equally applicable to advanced MLPs with shortcut connections and max pooling. Given the parallel nature of analytic marching, we contribute AnalyticMesh, a software package that supports efficient meshing of implicit surface networks via CUDA parallel computing, and mesh simplification for efficient downstream processing. We apply our method to different settings of generative shape modeling using implicit surface networks. Extensive experiments demonstrate our advantages over existing methods in terms of both meshing accuracy and efficiency.
    Medical Matting: A New Perspective on Medical Segmentation with Uncertainty. (arXiv:2106.09887v1 [cs.CV])
    (2 min) In medical image segmentation, it is difficult to mark ambiguous areas accurately with binary masks, especially when dealing with small lesions. Therefore, it is a challenge for radiologists to reach a consensus by using binary masks under the condition of multiple annotations. However, these areas may contain anatomical structures that are conducive to diagnosis. Uncertainty is introduced to study these situations. Nevertheless, the uncertainty is usually measured by the variances between predictions in a multiple trial way. It is not intuitive, and there is no exact correspondence in the image. Inspired by image matting, we introduce matting as a soft segmentation method and a new perspective to deal with and represent uncertain regions into medical scenes, namely medical matting. More specifically, because there is no available medical matting dataset, we first labeled two medical datasets with alpha matte. Secondly, the matting method applied to the natural image is not suitable for the medical scene, so we propose a new architecture to generate binary masks and alpha matte in a row. Thirdly, the uncertainty map is introduced to highlight the ambiguous regions from the binary results and improve the matting performance. Evaluated on these datasets, the proposed model outperformed state-of-the-art matting algorithms by a large margin, and alpha matte is proved to be a more efficient labeling form than a binary mask.
    Combined Person Classification with Airborne Optical Sectioning. (arXiv:2106.10077v1 [cs.CV])
    (2 min) Fully autonomous drones have been demonstrated to find lost or injured persons under strongly occluding forest canopy. Airborne Optical Sectioning (AOS), a novel synthetic aperture imaging technique, together with deep-learning-based classification enables high detection rates under realistic search-and-rescue conditions. We demonstrate that false detections can be significantly suppressed and true detections boosted by combining classifications from multiple AOS rather than single integral images. This improves classification rates especially in the presence of occlusion. To make this possible, we modified the AOS imaging process to support large overlaps between subsequent integrals, enabling real-time and on-board scanning and processing of groundspeeds up to 10 m/s.
    Virtual Temporal Samples for Recurrent Neural Networks: applied to semantic segmentation in agriculture. (arXiv:2106.10118v1 [cs.CV])
    (2 min) This paper explores the potential for performing temporal semantic segmentation in the context of agricultural robotics without temporally labelled data. We achieve this by proposing to generate virtual temporal samples from labelled still images. This allows us, with no extra annotation effort, to generate virtually labelled temporal sequences. Normally, to train a recurrent neural network (RNN), labelled samples from a video (temporal) sequence are required which is laborious and has stymied work in this direction. By generating virtual temporal samples, we demonstrate that it is possible to train a lightweight RNN to perform semantic segmentation on two challenging agricultural datasets. Our results show that by training a temporal semantic segmenter using virtual samples we can increase the performance by an absolute amount of 4.6 and 4.9 on sweet pepper and sugar beet datasets, respectively. This indicates that our virtual data augmentation technique is able to accurately classify agricultural images temporally without the use of complicated synthetic data generation techniques nor with the overhead of labelling large amounts of temporal sequences.
    Non-Iterative Phase Retrieval With Cascaded Neural Networks. (arXiv:2106.10195v1 [eess.IV])
    (2 min) Fourier phase retrieval is the problem of reconstructing a signal given only the magnitude of its Fourier transformation. Optimization-based approaches, like the well-established Gerchberg-Saxton or the hybrid input output algorithm, struggle at reconstructing images from magnitudes that are not oversampled. This motivates the application of learned methods, which allow reconstruction from non-oversampled magnitude measurements after a learning phase. In this paper, we want to push the limits of these learned methods by means of a deep neural network cascade that reconstructs the image successively on different resolutions from its non-oversampled Fourier magnitude. We evaluate our method on four different datasets (MNIST, EMNIST, Fashion-MNIST, and KMNIST) and demonstrate that it yields improved performance over other non-iterative methods and optimization-based methods.
    Discerning Generic Event Boundaries in Long-Form Wild Videos. (arXiv:2106.10090v1 [cs.CV])
    (2 min) Detecting generic, taxonomy-free event boundaries invideos represents a major stride forward towards holisticvideo understanding. In this paper we present a technique forgeneric event boundary detection based on a two stream in-flated 3D convolutions architecture, which can learn spatio-temporal features from videos. Our work is inspired from theGeneric Event Boundary Detection Challenge (part of CVPR2021 Long Form Video Understanding- LOVEU Workshop).Throughout the paper we provide an in-depth analysis ofthe experiments performed along with an interpretation ofthe results obtained.
    World-GAN: a Generative Model for Minecraft Worlds. (arXiv:2106.10155v1 [cs.LG])
    (2 min) This work introduces World-GAN, the first method to perform data-driven Procedural Content Generation via Machine Learning in Minecraft from a single example. Based on a 3D Generative Adversarial Network (GAN) architecture, we are able to create arbitrarily sized world snippets from a given sample. We evaluate our approach on creations from the community as well as structures generated with the Minecraft World Generator. Our method is motivated by the dense representations used in Natural Language Processing (NLP) introduced with word2vec [1]. The proposed block2vec representations make World-GAN independent from the number of different blocks, which can vary a lot in Minecraft, and enable the generation of larger levels. Finally, we demonstrate that changing this new representation space allows us to change the generated style of an already trained generator. World-GAN enables its users to generate Minecraft worlds based on parts of their creations.
    EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition 2021: Team M3EM Technical Report. (arXiv:2106.10026v1 [cs.CV])
    (2 min) In this report, we describe the technical details of our submission to the 2021 EPIC-KITCHENS-100 Unsupervised Domain Adaptation Challenge for Action Recognition. Leveraging multiple modalities has been proved to benefit the Unsupervised Domain Adaptation (UDA) task. In this work, we present Multi-Modal Mutual Enhancement Module (M3EM), a deep module for jointly considering information from multiple modalities to find the most transferable representations across domains. We achieve this by implementing two sub-modules for enhancing each modality using the context of other modalities. The first sub-module exchanges information across modalities through the semantic space, while the second sub-module finds the most transferable spatial region based on the consensus of all modalities.
    GEM: A General Evaluation Benchmark for Multimodal Tasks. (arXiv:2106.09889v1 [cs.CL])
    (2 min) In this paper, we present GEM as a General Evaluation benchmark for Multimodal tasks. Different from existing datasets such as GLUE, SuperGLUE, XGLUE and XTREME that mainly focus on natural language tasks, GEM is a large-scale vision-language benchmark, which consists of GEM-I for image-language tasks and GEM-V for video-language tasks. Comparing with existing multimodal datasets such as MSCOCO and Flicker30K for image-language tasks, YouCook2 and MSR-VTT for video-language tasks, GEM is not only the largest vision-language dataset covering image-language tasks and video-language tasks at the same time, but also labeled in multiple languages. We also provide two baseline models for this benchmark. We will release the dataset, code and baseline models, aiming to advance the development of multilingual multimodal research.
    Shape Prior Non-Uniform Sampling Guided Real-time Stereo 3D Object Detection. (arXiv:2106.10013v1 [cs.CV])
    (2 min) Pseudo-LiDAR based 3D object detectors have gained popularity due to their high accuracy. However, these methods need dense depth supervision and suffer from inferior speed. To solve these two issues, a recently introduced RTS3D builds an efficient 4D Feature-Consistency Embedding (FCE) space for the intermediate representation of object without depth supervision. FCE space splits the entire object region into 3D uniform grid latent space for feature sampling point generation, which ignores the importance of different object regions. However, we argue that, compared with the inner region, the outer region plays a more important role for accurate 3D detection. To encode more information from the outer region, we propose a shape prior non-uniform sampling strategy that performs dense sampling in outer region and sparse sampling in inner region. As a result, more points are sampled from the outer region and more useful features are extracted for 3D detection. Further, to enhance the feature discrimination of each sampling point, we propose a high-level semantic enhanced FCE module to exploit more contextual information and suppress noise better. Experiments on the KITTI dataset are performed to show the effectiveness of the proposed method. Compared with the baseline RTS3D, our proposed method has 2.57% improvement on AP3d almost without extra network parameters. Moreover, our proposed method outperforms the state-of-the-art methods without extra supervision at a real-time speed.
    Debiased Subjective Assessment of Real-World Image Enhancement. (arXiv:2106.10080v1 [eess.IV])
    (2 min) In real-world image enhancement, it is often challenging (if not impossible) to acquire ground-truth data, preventing the adoption of distance metrics for objective quality assessment. As a result, one often resorts to subjective quality assessment, the most straightforward and reliable means of evaluating image enhancement. Conventional subjective testing requires manually pre-selecting a small set of visual examples, which may suffer from three sources of biases: 1) sampling bias due to the extremely sparse distribution of the selected samples in the image space; 2) algorithmic bias due to potential overfitting the selected samples; 3) subjective bias due to further potential cherry-picking test results. This eventually makes the field of real-world image enhancement more of an art than a science. Here we take steps towards debiasing conventional subjective assessment by automatically sampling a set of adaptive and diverse images for subsequent testing. This is achieved by casting sample selection into a joint maximization of the discrepancy between the enhancers and the diversity among the selected input images. Careful visual inspection on the resulting enhanced images provides a debiased ranking of the enhancement algorithms. We demonstrate our subjective assessment method using three popular and practically demanding image enhancement tasks: dehazing, super-resolution, and low-light enhancement.
    Towards interpreting computer vision based on transformation invariant optimization. (arXiv:2106.09982v1 [cs.CV])
    (2 min) Interpreting how does deep neural networks (DNNs) make predictions is a vital field in artificial intelligence, which hinders wide applications of DNNs. Visualization of learned representations helps we humans understand the vision of DNNs. In this work, visualized images that can activate the neural network to the target classes are generated by back-propagation method. Here, rotation and scaling operations are applied to introduce the transformation invariance in the image generating process, which we find a significant improvement on visualization effect. Finally, we show some cases that such method can help us to gain insight into neural networks.
    Accumulative Poisoning Attacks on Real-time Data. (arXiv:2106.09993v1 [cs.LG])
    (2 min) Collecting training data from untrusted sources exposes machine learning services to poisoning adversaries, who maliciously manipulate training data to degrade the model accuracy. When trained on offline datasets, poisoning adversaries have to inject the poisoned data in advance before training, and the order of feeding these poisoned batches into the model is stochastic. In contrast, practical systems are more usually trained/fine-tuned on sequentially captured real-time data, in which case poisoning adversaries could dynamically poison each data batch according to the current model state. In this paper, we focus on the real-time settings and propose a new attacking strategy, which affiliates an accumulative phase with poisoning attacks to secretly (i.e., without affecting accuracy) magnify the destructive effect of a (poisoned) trigger batch. By mimicking online learning and federated learning on CIFAR-10, we show that the model accuracy will significantly drop by a single update step on the trigger batch after the accumulative phase. Our work validates that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects, with no need to explore complex techniques.
    hSMAL: Detailed Horse Shape and Pose Reconstruction for Motion Pattern Recognition. (arXiv:2106.10102v1 [cs.CV])
    (2 min) In this paper we present our preliminary work on model-based behavioral analysis of horse motion. Our approach is based on the SMAL model, a 3D articulated statistical model of animal shape. We define a novel SMAL model for horses based on a new template, skeleton and shape space learned from $37$ horse toys. We test the accuracy of our hSMAL model in reconstructing a horse from 3D mocap data and images. We apply the hSMAL model to the problem of lameness detection from video, where we fit the model to images to recover 3D pose and train an ST-GCN network on pose data. A comparison with the same network trained on mocap points illustrates the benefit of our approach.
    Equivariance-bridged SO(2)-Invariant Representation Learning using Graph Convolutional Network. (arXiv:2106.09996v1 [cs.CV])
    (2 min) Training a Convolutional Neural Network (CNN) to be robust against rotation has mostly been done with data augmentation. In this paper, another progressive vision of research direction is highlighted to encourage less dependence on data augmentation by achieving structural rotational invariance of a network. The deep equivariance-bridged SO(2) invariant network is proposed to echo such vision. First, Self-Weighted Nearest Neighbors Graph Convolutional Network (SWN-GCN) is proposed to implement Graph Convolutional Network (GCN) on the graph representation of an image to acquire rotationally equivariant representation, as GCN is more suitable for constructing deeper network than spectral graph convolution-based approaches. Then, invariant representation is eventually obtained with Global Average Pooling (GAP), a permutation-invariant operation suitable for aggregating high-dimensional representations, over the equivariant set of vertices retrieved from SWN-GCN. Our method achieves the state-of-the-art image classification performance on rotated MNIST and CIFAR-10 images, where the models are trained with a non-augmented dataset only. Quantitative validations over invariance of the representations also demonstrate strong invariance of deep representations of SWN-GCN over rotations.
    Towards Distraction-Robust Active Visual Tracking. (arXiv:2106.10110v1 [cs.CV])
    (2 min) In active visual tracking, it is notoriously difficult when distracting objects appear, as distractors often mislead the tracker by occluding the target or bringing a confusing appearance. To address this issue, we propose a mixed cooperative-competitive multi-agent game, where a target and multiple distractors form a collaborative team to play against a tracker and make it fail to follow. Through learning in our game, diverse distracting behaviors of the distractors naturally emerge, thereby exposing the tracker's weakness, which helps enhance the distraction-robustness of the tracker. For effective learning, we then present a bunch of practical methods, including a reward function for distractors, a cross-modal teacher-student learning strategy, and a recurrent attention mechanism for the tracker. The experimental results show that our tracker performs desired distraction-robust active visual tracking and can be well generalized to unseen environments. We also show that the multi-agent game can be used to adversarially test the robustness of trackers.
    Multi-Granularity Network with Modal Attention for Dense Affective Understanding. (arXiv:2106.09964v1 [cs.CV])
    (2 min) Video affective understanding, which aims to predict the evoked expressions by the video content, is desired for video creation and recommendation. In the recent EEV challenge, a dense affective understanding task is proposed and requires frame-level affective prediction. In this paper, we propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features for better description of the target frame. Specifically, the multi-granularity features could be divided into frame-level, clips-level and video-level features, which corresponds to visual-salient content, semantic-context and video theme information. Then the modal attention fusion module is designed to fuse the multi-granularity features and emphasize more affection-relevant modals. Finally, the fused feature is fed into a Mixtures Of Experts (MOE) classifier to predict the expressions. Further employing model-ensemble post-processing, the proposed method achieves the correlation score of 0.02292 in the EEV challenge.
    Towards Clustering-friendly Representations: Subspace Clustering via Graph Filtering. (arXiv:2106.09874v1 [cs.CV])
    (2 min) Finding a suitable data representation for a specific task has been shown to be crucial in many applications. The success of subspace clustering depends on the assumption that the data can be separated into different subspaces. However, this simple assumption does not always hold since the raw data might not be separable into subspaces. To recover the ``clustering-friendly'' representation and facilitate the subsequent clustering, we propose a graph filtering approach by which a smooth representation is achieved. Specifically, it injects graph similarity into data features by applying a low-pass filter to extract useful data representations for clustering. Extensive experiments on image and document clustering datasets demonstrate that our method improves upon state-of-the-art subspace clustering techniques. Especially, its comparable performance with deep learning methods emphasizes the effectiveness of the simple graph filtering scheme for many real-world applications. An ablation study shows that graph filtering can remove noise, preserve structure in the image, and increase the separability of classes.
    Novelty Detection via Contrastive Learning with Negative Data Augmentation. (arXiv:2106.09958v1 [cs.CV])
    (2 min) Novelty detection is the process of determining whether a query example differs from the learned training distribution. Previous methods attempt to learn the representation of the normal samples via generative adversarial networks (GANs). However, they will suffer from instability training, mode dropping, and low discriminative ability. Recently, various pretext tasks (e.g. rotation prediction and clustering) have been proposed for self-supervised learning in novelty detection. However, the learned latent features are still low discriminative. We overcome such problems by introducing a novel decoder-encoder framework. Firstly, a generative network (a.k.a. decoder) learns the representation by mapping the initialized latent vector to an image. In particular, this vector is initialized by considering the entire distribution of training data to avoid the problem of mode-dropping. Secondly, a contrastive network (a.k.a. encoder) aims to ``learn to compare'' through mutual information estimation, which directly helps the generative network to obtain a more discriminative representation by using a negative data augmentation strategy. Extensive experiments show that our model has significant superiority over cutting-edge novelty detectors and achieves new state-of-the-art results on some novelty detection benchmarks, e.g. CIFAR10 and DCASE. Moreover, our model is more stable for training in a non-adversarial manner, compared to other adversarial based novelty detection methods.
    Improved Radar Localization on Lidar Maps Using Shared Embedding. (arXiv:2106.10000v1 [cs.RO])
    (2 min) We present a heterogeneous localization framework for solving radar global localization and pose tracking on pre-built lidar maps. To bridge the gap of sensing modalities, deep neural networks are constructed to create shared embedding space for radar scans and lidar maps. Herein learned feature embeddings are supportive for similarity measurement, thus improving map retrieval and data matching respectively. In RobotCar and MulRan datasets, we demonstrate the effectiveness of the proposed framework with the comparison to Scan Context and RaLL. In addition, the proposed pose tracking pipeline is with less neural networks compared to the original RaLL.
    HifiFace: 3D Shape and Semantic Prior Guided High Fidelity Face Swapping. (arXiv:2106.09965v1 [cs.CV])
    (2 min) In this work, we propose a high fidelity face swapping method, called HifiFace, which can well preserve the face shape of the source face and generate photo-realistic results. Unlike other existing face swapping works that only use face recognition model to keep the identity similarity, we propose 3D shape-aware identity to control the face shape with the geometric supervision from 3DMM and 3D face reconstruction method. Meanwhile, we introduce the Semantic Facial Fusion module to optimize the combination of encoder and decoder features and make adaptive blending, which makes the results more photo-realistic. Extensive experiments on faces in the wild demonstrate that our method can preserve better identity, especially on the face shape, and can generate more photo-realistic results than previous state-of-the-art methods.
    Medical Image Analysis on Left Atrial LGE MRI for Atrial Fibrillation Studies: A Review. (arXiv:2106.09862v1 [cs.CV])
    (2 min) Late gadolinium enhancement magnetic resonance imaging (LGE MRI) is commonly used to visualize and quantify left atrial (LA) scars. The position and extent of scars provide important information of the pathophysiology and progression of atrial fibrillation (AF). Hence, LA scar segmentation and quantification from LGE MRI can be useful in computer-assisted diagnosis and treatment stratification of AF patients. Since manual delineation can be time-consuming and subject to intra- and inter-expert variability, automating this computing is highly desired, which nevertheless is still challenging and under-researched. This paper aims to provide a systematic review on computing methods for LA cavity, wall, scar and ablation gap segmentation and quantification from LGE MRI, and the related literature for AF studies. Specifically, we first summarize AF-related imaging techniques, particularly LGE MRI. Then, we review the methodologies of the four computing tasks in detail, and summarize the validation strategies applied in each task. Finally, the possible future developments are outlined, with a brief survey on the potential clinical applications of the aforementioned methods. The review shows that the research into this topic is still in early stages. Although several methods have been proposed, especially for LA segmentation, there is still large scope for further algorithmic developments due to performance issues related to the high variability of enhancement appearance and differences in image acquisition.
    A Framework for Real-time Traffic Trajectory Tracking, Speed Estimation, and Driver Behavior Calibration at Urban Intersections Using Virtual Traffic Lanes. (arXiv:2106.09932v1 [cs.CV])
    (2 min) In a previous study, we presented VT-Lane, a three-step framework for real-time vehicle detection, tracking, and turn movement classification at urban intersections. In this study, we present a case study incorporating the highly accurate trajectories and movement classification obtained via VT-Lane for the purpose of speed estimation and driver behavior calibration for traffic at urban intersections. First, we use a highly instrumented vehicle to verify the estimated speeds obtained from video inference. The results of the speed validation show that our method can estimate the average travel speed of detected vehicles in real-time with an error of 0.19 m/sec, which is equivalent to 2% of the average observed travel speeds in the intersection of the study. Instantaneous speeds (at the resolution of 30 Hz) were found to be estimated with an average error of 0.21 m/sec and 0.86 m/sec respectively for free-flowing and congested traffic conditions. We then use the estimated speeds to calibrate the parameters of a driver behavior model for the vehicles in the area of study. The results show that the calibrated model replicates the driving behavior with an average error of 0.45 m/sec, indicating the high potential for using this framework for automated, large-scale calibration of car-following models from roadside traffic video data, which can lead to substantial improvements in traffic modeling via microscopic simulation.
    Evolving GANs: When Contradictions Turn into Compliance. (arXiv:2106.09946v1 [cs.LG])
    (2 min) Limited availability of labeled-data makes any supervised learning problem challenging. Alternative learning settings like semi-supervised and universum learning alleviate the dependency on labeled data, but still require a large amount of unlabeled data, which may be unavailable or expensive to acquire. GAN-based synthetic data generation methods have recently shown promise by generating synthetic samples to improve task at hand. However, these samples cannot be used for other purposes. In this paper, we propose a GAN game which provides improved discriminator accuracy under limited data settings, while generating realistic synthetic data. This provides the added advantage that now the generated data can be used for other similar tasks. We provide the theoretical guarantees and empirical results in support of our approach.
    AI-Enabled Ultra-Low-Dose CT Reconstruction. (arXiv:2106.09834v1 [eess.IV])
    (2 min) By the ALARA (As Low As Reasonably Achievable) principle, ultra-low-dose CT reconstruction is a holy grail to minimize cancer risks and genetic damages, especially for children. With the development of medical CT technologies, the iterative algorithms are widely used to reconstruct decent CT images from a low-dose scan. Recently, artificial intelligence (AI) techniques have shown a great promise in further reducing CT radiation dose to the next level. In this paper, we demonstrate that AI-powered CT reconstruction offers diagnostic image quality at an ultra-low-dose level comparable to that of radiography. Specifically, here we develop a Split Unrolled Grid-like Alternative Reconstruction (SUGAR) network, in which deep learning, physical modeling and image prior are integrated. The reconstruction results from clinical datasets show that excellent images can be reconstructed using SUGAR from 36 projections. This approach has a potential to change future healthcare.
    Effective Model Sparsification by Scheduled Grow-and-Prune Methods. (arXiv:2106.09857v1 [cs.CV])
    (2 min) Deep neural networks (DNNs) are effective in solving many real-world problems. Larger DNN models usually exhibit better quality (e.g., accuracy) but their excessive computation results in long training and inference time. Model sparsification can reduce the computation and memory cost while maintaining model quality. Most existing sparsification algorithms unidirectionally remove weights, while others randomly or greedily explore a small subset of weights in each layer. The inefficiency of the algorithms reduces the achievable sparsity level. In addition, many algorithms still require pre-trained dense models and thus suffer from large memory footprint and long training time. In this paper, we propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models. It addresses the shortcomings of the previous works by repeatedly growing a subset of layers to dense and then pruning back to sparse after some training. Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks, such as image classification, objective detection, 3D object part segmentation, and translation. They also outperform other state-of-the-art (SOTA) pruning methods, including pruning from pre-trained dense models. As an example, a 90% sparse ResNet-50 obtained via GaP achieves 77.9% top-1 accuracy on ImageNet, improving the SOTA results by 1.5%.
    Discovering Relationships between Object Categories via Universal Canonical Maps. (arXiv:2106.09758v1 [cs.CV])
    (2 min) We tackle the problem of learning the geometry of multiple categories of deformable objects jointly. Recent work has shown that it is possible to learn a unified dense pose predictor for several categories of related objects. However, training such models requires to initialize inter-category correspondences by hand. This is suboptimal and the resulting models fail to maintain correct correspondences as individual categories are learned. In this paper, we show that improved correspondences can be learned automatically as a natural byproduct of learning category-specific dense pose predictors. To do this, we express correspondences between different categories and between images and categories using a unified embedding. Then, we use the latter to enforce two constraints: symmetric inter-category cycle consistency and a new asymmetric image-to-category cycle consistency. Without any manual annotations for the inter-category correspondences, we obtain state-of-the-art alignment results, outperforming dedicated methods for matching 3D shapes. Moreover, the new model is also better at the task of dense pose prediction than prior work.
    Hybrid graph convolutional neural networks for landmark-based anatomical segmentation. (arXiv:2106.09832v1 [eess.IV])
    (2 min) In this work we address the problem of landmark-based segmentation for anatomical structures. We propose HybridGNet, an encoder-decoder neural architecture which combines standard convolutions for image feature encoding, with graph convolutional neural networks to decode plausible representations of anatomical structures. We benchmark the proposed architecture considering other standard landmark and pixel-based models for anatomical segmentation in chest x-ray images, and found that HybridGNet is more robust to image occlusions. We also show that it can be used to construct landmark-based segmentations from pixel level annotations. Our experimental results suggest that HybridGNet produces accurate and anatomically plausible landmark-based segmentations, by naturally incorporating shape constraints within the decoding process via spectral convolutions.
    Deep reinforcement learning with automated label extraction from clinical reports accurately classifies 3D MRI brain volumes. (arXiv:2106.09812v1 [cs.CV])
    (3 min) Purpose: Image classification is perhaps the most fundamental task in imaging AI. However, labeling images is time-consuming and tedious. We have recently demonstrated that reinforcement learning (RL) can classify 2D slices of MRI brain images with high accuracy. Here we make two important steps toward speeding image classification: Firstly, we automatically extract class labels from the clinical reports. Secondly, we extend our prior 2D classification work to fully 3D image volumes from our institution. Hence, we proceed as follows: in Part 1, we extract labels from reports automatically using the SBERT natural language processing approach. Then, in Part 2, we use these labels with RL to train a classification Deep-Q Network (DQN) for 3D image volumes. Methods: For Part 1, we trained SBERT with 90 radiology report impressions. We then used the trained SBERT to predict class labels for use in Part 2. In Part 2, we applied multi-step image classification to allow for combined Deep-Q learning using 3D convolutions and TD(0) Q learning. We trained on a set of 90 images. We tested on a separate set of 61 images, again using the classes predicted from patient reports by the trained SBERT in Part 1. For comparison, we also trained and tested a supervised deep learning classification network on the same set of training and testing images using the same labels. Results: Part 1: Upon training with the corpus of radiology reports, the SBERT model had 100% accuracy for both normal and metastasis-containing scans. Part 2: Then, using these labels, whereas the supervised approach quickly overfit the training data and as expected performed poorly on the testing set (66% accuracy, just over random guessing), the reinforcement learning approach achieved an accuracy of 92%. The results were found to be statistically significant, with a p-value of 3.1 x 10^-5.
    RSG: A Simple but Effective Module for Learning Imbalanced Datasets. (arXiv:2106.09859v1 [cs.CV])
    (2 min) Imbalanced datasets widely exist in practice and area great challenge for training deep neural models with agood generalization on infrequent classes. In this work, wepropose a new rare-class sample generator (RSG) to solvethis problem. RSG aims to generate some new samplesfor rare classes during training, and it has in particularthe following advantages: (1) it is convenient to use andhighly versatile, because it can be easily integrated intoany kind of convolutional neural network, and it works wellwhen combined with different loss functions, and (2) it isonly used during the training phase, and therefore, no ad-ditional burden is imposed on deep neural networks duringthe testing phase. In extensive experimental evaluations, weverify the effectiveness of RSG. Furthermore, by leveragingRSG, we obtain competitive results on Imbalanced CIFARand new state-of-the-art results on Places-LT, ImageNet-LT, and iNaturalist 2018. The source code is available at https://github.com/Jianf-Wang/RSG.
    Indicators of Attack Failure: Debugging and Improving Optimization of Adversarial Examples. (arXiv:2106.09947v1 [cs.LG])
    (2 min) Evaluating robustness of machine-learning models to adversarial examples is a challenging problem. Many defenses have been shown to provide a false sense of security by causing gradient-based attacks to fail, and they have been broken under more rigorous evaluations. Although guidelines and best practices have been suggested to improve current adversarial robustness evaluations, the lack of automatic testing and debugging tools makes it difficult to apply these recommendations in a systematic manner. In this work, we overcome these limitations by (i) defining a set of quantitative indicators which unveil common failures in the optimization of gradient-based attacks, and (ii) proposing specific mitigation strategies within a systematic evaluation protocol. Our extensive experimental analysis shows that the proposed indicators of failure can be used to visualize, debug and improve current adversarial robustness evaluations, providing a first concrete step towards automatizing and systematizing current adversarial robustness evaluations. Our open-source code is available at: https://github.com/pralab/IndicatorsOfAttackFailure.
    A Unified Generative Adversarial Network Training via Self-Labeling and Self-Attention. (arXiv:2106.09914v1 [cs.LG])
    (2 min) We propose a novel GAN training scheme that can handle any level of labeling in a unified manner. Our scheme introduces a form of artificial labeling that can incorporate manually defined labels, when available, and induce an alignment between them. To define the artificial labels, we exploit the assumption that neural network generators can be trained more easily to map nearby latent vectors to data with semantic similarities, than across separate categories. We use generated data samples and their corresponding artificial conditioning labels to train a classifier. The classifier is then used to self-label real data. To boost the accuracy of the self-labeling, we also use the exponential moving average of the classifier. However, because the classifier might still make mistakes, especially at the beginning of the training, we also refine the labels through self-attention, by using the labeling of real data samples only when the classifier outputs a high classification probability score. We evaluate our approach on CIFAR-10, STL-10 and SVHN, and show that both self-labeling and self-attention consistently improve the quality of generated data. More surprisingly, we find that the proposed scheme can even outperform class-conditional GANs.
    Guided Integrated Gradients: An Adaptive Path Method for Removing Noise. (arXiv:2106.09788v1 [cs.CV])
    (2 min) Integrated Gradients (IG) is a commonly used feature attribution method for deep neural networks. While IG has many desirable properties, the method often produces spurious/noisy pixel attributions in regions that are not related to the predicted class when applied to visual models. While this has been previously noted, most existing solutions are aimed at addressing the symptoms by explicitly reducing the noise in the resulting attributions. In this work, we show that one of the causes of the problem is the accumulation of noise along the IG path. To minimize the effect of this source of noise, we propose adapting the attribution path itself -- conditioning the path not just on the image but also on the model being explained. We introduce Adaptive Path Methods (APMs) as a generalization of path methods, and Guided IG as a specific instance of an APM. Empirically, Guided IG creates saliency maps better aligned with the model's prediction and the input image that is being explained. We show through qualitative and quantitative experiments that Guided IG outperforms other, related methods in nearly every experiment.
    DeepLab2: A TensorFlow Library for Deep Labeling. (arXiv:2106.09748v1 [cs.CV])
    (2 min) DeepLab2 is a TensorFlow library for deep labeling, aiming to provide a state-of-the-art and easy-to-use TensorFlow codebase for general dense pixel prediction problems in computer vision. DeepLab2 includes all our recently developed DeepLab model variants with pretrained checkpoints as well as model training and evaluation code, allowing the community to reproduce and further improve upon the state-of-art systems. To showcase the effectiveness of DeepLab2, our Panoptic-DeepLab employing Axial-SWideRNet as network backbone achieves 68.0% PQ or 83.5% mIoU on Cityscaspes validation set, with only single-scale inference and ImageNet-1K pretrained checkpoints. We hope that publicly sharing our library could facilitate future research on dense pixel labeling tasks and envision new applications of this technology. Code is made publicly available at \url{https://github.com/google-research/deeplab2}.
    Analyzing Adversarial Robustness of Deep Neural Networks in Pixel Space: a Semantic Perspective. (arXiv:2106.09872v1 [cs.CV])
    (2 min) The vulnerability of deep neural networks to adversarial examples, which are crafted maliciously by modifying the inputs with imperceptible perturbations to misled the network produce incorrect outputs, reveals the lack of robustness and poses security concerns. Previous works study the adversarial robustness of image classifiers on image level and use all the pixel information in an image indiscriminately, lacking of exploration of regions with different semantic meanings in the pixel space of an image. In this work, we fill this gap and explore the pixel space of the adversarial image by proposing an algorithm to looking for possible perturbations pixel by pixel in different regions of the segmented image. The extensive experimental results on CIFAR-10 and ImageNet verify that searching for the modified pixel in only some pixels of an image can successfully launch the one-pixel adversarial attacks without requiring all the pixels of the entire image, and there exist multiple vulnerable points scattered in different regions of an image. We also demonstrate that the adversarial robustness of different regions on the image varies with the amount of semantic information contained.
    Quantized Neural Networks via {-1, +1} Encoding Decomposition and Acceleration. (arXiv:2106.09886v1 [cs.CV])
    (2 min) The training of deep neural networks (DNNs) always requires intensive resources for both computation and data storage. Thus, DNNs cannot be efficiently applied to mobile phones and embedded devices, which severely limits their applicability in industrial applications. To address this issue, we propose a novel encoding scheme using {-1, +1} to decompose quantized neural networks (QNNs) into multi-branch binary networks, which can be efficiently implemented by bitwise operations (i.e., xnor and bitcount) to achieve model compression, computational acceleration, and resource saving. By using our method, users can achieve different encoding precisions arbitrarily according to their requirements and hardware resources. The proposed mechanism is highly suitable for the use of FPGA and ASIC in terms of data storage and computation, which provides a feasible idea for smart chips. We validate the effectiveness of our method on large-scale image classification (e.g., ImageNet), object detection, and semantic segmentation tasks. In particular, our method with low-bit encoding can still achieve almost the same performance as its high-bit counterparts.
    A Distance-based Separability Measure for Internal Cluster Validation. (arXiv:2106.09794v1 [cs.LG])
    (2 min) To evaluate clustering results is a significant part of cluster analysis. Since there are no true class labels for clustering in typical unsupervised learning, many internal cluster validity indices (CVIs), which use predicted labels and data, have been created. Without true labels, to design an effective CVI is as difficult as to create a clustering method. And it is crucial to have more CVIs because there are no universal CVIs that can be used to measure all datasets and no specific methods of selecting a proper CVI for clusters without true labels. Therefore, to apply a variety of CVIs to evaluate clustering results is necessary. In this paper, we propose a novel internal CVI -- the Distance-based Separability Index (DSI), based on a data separability measure. We compared the DSI with eight internal CVIs including studies from early Dunn (1974) to most recent CVDD (2019) and an external CVI as ground truth, by using clustering results of five clustering algorithms on 12 real and 97 synthetic datasets. Results show DSI is an effective, unique, and competitive CVI to other compared CVIs. We also summarized the general process to evaluate CVIs and created the rank-difference metric for comparison of CVIs' results.
    PyKale: Knowledge-Aware Machine Learning from Multiple Sources in Python. (arXiv:2106.09756v1 [cs.LG])
    (2 min) Machine learning is a general-purpose technology holding promises for many interdisciplinary research problems. However, significant barriers exist in crossing disciplinary boundaries when most machine learning tools are developed in different areas separately. We present Pykale - a Python library for knowledge-aware machine learning on graphs, images, texts, and videos to enable and accelerate interdisciplinary research. We formulate new green machine learning guidelines based on standard software engineering practices and propose a novel pipeline-based application programming interface (API). PyKale focuses on leveraging knowledge from multiple sources for accurate and interpretable prediction, thus supporting multimodal learning and transfer learning (particularly domain adaptation) with latest deep learning and dimensionality reduction models. We build PyKale on PyTorch and leverage the rich PyTorch ecosystem. Our pipeline-based API design enforces standardization and minimalism, embracing green machine learning concepts via reducing repetitions and redundancy, reusing existing resources, and recycling learning models across areas. We demonstrate its interdisciplinary nature via examples in bioinformatics, knowledge graph, image/video recognition, and medical imaging.
    Efficient Self-supervised Vision Transformers for Representation Learning. (arXiv:2106.09785v1 [cs.CV])
    (2 min) This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity but with a cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies and as a result significantly improves the quality of the learned vision representations. Our results show that combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior arts with around an order magnitude of higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and models will be publicly available.
    Light Lies: Optical Adversarial Attack. (arXiv:2106.09908v1 [cs.CV])
    (2 min) A significant amount of work has been done on adversarial attacks that inject imperceptible noise to images to deteriorate the image classification performance of deep models. However, most of the existing studies consider attacks in the digital (pixel) domain where an image acquired by an image sensor with sampling and quantization has been recorded. This paper, for the first time, introduces an optical adversarial attack, which physically alters the light field information arriving at the image sensor so that the classification model yields misclassification. More specifically, we modulate the phase of the light in the Fourier domain using a spatial light modulator placed in the photographic system. The operative parameters of the modulator are obtained by gradient-based optimization to maximize cross-entropy and minimize distortions. We present experiments based on both simulation and a real hardware optical system, from which the feasibility of the proposed optical attack is demonstrated. It is also verified that the proposed attack is completely different from common optical-domain distortions such as spherical aberration, defocus, and astigmatism in terms of both perturbation patterns and classification results.
    Synthetic COVID-19 Chest X-ray Dataset for Computer-Aided Diagnosis. (arXiv:2106.09759v1 [eess.IV])
    (2 min) We introduce a new dataset called Synthetic COVID-19 Chest X-ray Dataset for training machine learning models. The dataset consists of 21,295 synthetic COVID-19 chest X-ray images to be used for computer-aided diagnosis. These images, generated via an unsupervised domain adaptation approach, are of high quality. We find that the synthetic images not only improve performance of various deep learning architectures when used as additional training data under heavy imbalance conditions, but also detect the target class with high confidence. We also find that comparable performance can also be achieved when trained only on synthetic images. Further, salient features of the synthetic COVID-19 images indicate that the distribution is significantly different from Non-COVID-19 classes, enabling a proper decision boundary. We hope the availability of such high fidelity chest X-ray images of COVID-19 will encourage advances in the development of diagnostic and/or management tools.
    Smoothed Multi-View Subspace Clustering. (arXiv:2106.09875v1 [cs.CV])
    (2 min) In recent years, multi-view subspace clustering has achieved impressive performance due to the exploitation of complementary imformation across multiple views. However, multi-view data can be very complicated and are not easy to cluster in real-world applications. Most existing methods operate on raw data and may not obtain the optimal solution. In this work, we propose a novel multi-view clustering method named smoothed multi-view subspace clustering (SMVSC) by employing a novel technique, i.e., graph filtering, to obtain a smooth representation for each view, in which similar data points have similar feature values. Specifically, it retains the graph geometric features through applying a low-pass filter. Consequently, it produces a ``clustering-friendly" representation and greatly facilitates the downstream clustering task. Extensive experiments on benchmark datasets validate the superiority of our approach. Analysis shows that graph filtering increases the separability of classes.
    Dual-Teacher Class-Incremental Learning With Data-Free Generative Replay. (arXiv:2106.09835v1 [cs.CV])
    (2 min) This paper proposes two novel knowledge transfer techniques for class-incremental learning (CIL). First, we propose data-free generative replay (DF-GR) to mitigate catastrophic forgetting in CIL by using synthetic samples from a generative model. In the conventional generative replay, the generative model is pre-trained for old data and shared in extra memory for later incremental learning. In our proposed DF-GR, we train a generative model from scratch without using any training data, based on the pre-trained classification model from the past, so we curtail the cost of sharing pre-trained generative models. Second, we introduce dual-teacher information distillation (DT-ID) for knowledge distillation from two teachers to one student. In CIL, we use DT-ID to learn new classes incrementally based on the pre-trained model for old classes and another model (pre-)trained on the new data for new classes. We implemented the proposed schemes on top of one of the state-of-the-art CIL methods and showed the performance improvement on CIFAR-100 and ImageNet datasets.
  • cs.IR updates on arXiv.org

    CSFCube -- A Test Collection of Computer Science Research Articles for Faceted Query by Example. (arXiv:2103.12906v2 [cs.IR] UPDATED)
    (2 min) Query by Example is a well-known information retrieval task in which a document is chosen by the user as the search query and the goal is to retrieve relevant documents from a large collection. However, a document often covers multiple aspects of a topic. To address this scenario we introduce the task of faceted Query by Example in which users can also specify a finer grained aspect in addition to the input query document. We focus on the application of this task in scientific literature search. We envision models which are able to retrieve scientific papers analogous to a query scientific paper along specifically chosen rhetorical structure elements as one solution to this problem. In this work, the rhetorical structure elements, which we refer to as facets, indicate backgrounds, methods, or results of a scientific paper. We introduce and describe an expert annotated test collection to evaluate models trained to perform this task. Our test collection consists of a diverse set of 50 query documents, drawn from computational linguistics and machine learning venues. We carefully followed the annotation guideline used by TREC for depth-k pooling (k = 100 or 250) and the resulting data collection consists of graded relevance scores with high annotation agreement. The data is freely available for research purposes.
    Self-supervised Graph Learning for Recommendation. (arXiv:2010.10783v4 [cs.IR] UPDATED)
    (2 min) Representation learning on user-item graph for recommendation has evolved from using single ID or interaction history to exploiting higher-order neighbors. This leads to the success of graph convolution networks (GCNs) for recommendation such as PinSage and LightGCN. Despite effectiveness, we argue that they suffer from two limitations: (1) high-degree nodes exert larger impact on the representation learning, deteriorating the recommendations of low-degree (long-tail) items; and (2) representations are vulnerable to noisy interactions, as the neighborhood aggregation scheme further enlarges the impact of observed edges. In this work, we explore self-supervised learning on user-item graph, so as to improve the accuracy and robustness of GCNs for recommendation. The idea is to supplement the classical supervised task of recommendation with an auxiliary self-supervised task, which reinforces node representation learning via self-discrimination. Specifically, we generate multiple views of a node, maximizing the agreement between different views of the same node compared to that of other nodes. We devise three operators to generate the views -- node dropout, edge dropout, and random walk -- that change the graph structure in different manners. We term this new learning paradigm as \textit{Self-supervised Graph Learning} (SGL), implementing it on the state-of-the-art model LightGCN. Through theoretical analyses, we find that SGL has the ability of automatically mining hard negatives. Empirical studies on three benchmark datasets demonstrate the effectiveness of SGL, which improves the recommendation accuracy, especially on long-tail items, and the robustness against interaction noises. Our implementations are available at \url{https://github.com/wujcan/SGL}.
    B-PROP: Bootstrapped Pre-training with Representative Words Prediction for Ad-hoc Retrieval. (arXiv:2104.09791v3 [cs.IR] UPDATED)
    (2 min) Pre-training and fine-tuning have achieved remarkable success in many downstream natural language processing (NLP) tasks. Recently, pre-training methods tailored for information retrieval (IR) have also been explored, and the latest success is the PROP method which has reached new SOTA on a variety of ad-hoc retrieval benchmarks. The basic idea of PROP is to construct the \textit{representative words prediction} (ROP) task for pre-training inspired by the query likelihood model. Despite its exciting performance, the effectiveness of PROP might be bounded by the classical unigram language model adopted in the ROP task construction process. To tackle this problem, we propose a bootstrapped pre-training method (namely B-PROP) based on BERT for ad-hoc retrieval. The key idea is to use the powerful contextual language model BERT to replace the classical unigram language model for the ROP task construction, and re-train BERT itself towards the tailored objective for IR. Specifically, we introduce a novel contrastive method, inspired by the divergence-from-randomness idea, to leverage BERT's self-attention mechanism to sample representative words from the document. By further fine-tuning on downstream ad-hoc retrieval tasks, our method achieves significant improvements over baselines without pre-training or with other pre-training methods, and further pushes forward the SOTA on a variety of ad-hoc retrieval tasks.
    An Information Retrieval Approach to Building Datasets for Hate Speech Detection. (arXiv:2106.09775v1 [cs.CL])
    (2 min) Building a benchmark dataset for hate speech detection presents several challenges. Firstly, because hate speech is relatively rare -- e.g., less than 3\% of Twitter posts are hateful \citep{founta2018large} -- random sampling of tweets to annotate is inefficient in capturing hate speech. A common practice is to only annotate tweets containing known ``hate words'', but this risks yielding a biased benchmark that only partially captures the real-world phenomenon of interest. A second challenge is that definitions of hate speech tend to be highly variable and subjective. Annotators having diverse prior notions of hate speech may not only disagree with one another but also struggle to conform to specified labeling guidelines. Our key insight is that the rarity and subjectivity of hate speech are akin to that of relevance in information retrieval (IR). This connection suggests that well-established methodologies for creating IR test collections might also be usefully applied to create better benchmark datasets for hate speech detection. Firstly, to intelligently and efficiently select which tweets to annotate, we apply established IR techniques of {\em pooling} and {\em active learning}. Secondly, to improve both consistency and value of annotations, we apply {\em task decomposition} \cite{Zhang-sigir14} and {\em annotator rationale} \cite{mcdonnell16-hcomp} techniques. Using the above techniques, we create and share a new benchmark dataset\footnote{We will release the dataset upon publication.} for hate speech detection with broader coverage than prior datasets. We also show a dramatic drop in accuracy of existing detection models when tested on these broader forms of hate. Collected annotator rationales not only provide documented support for labeling decisions but also create exciting future work opportunities for dual-supervision and/or explanation generation in modeling.
    Point-of-Interest Recommender Systems: A Survey from an Experimental Perspective. (arXiv:2106.10069v1 [cs.IR])
    (2 min) Point-of-Interest recommendation is an increasing research and developing area within the widely adopted technologies known as Recommender Systems. Among them, those that exploit information coming from Location-Based Social Networks (LBSNs) are very popular nowadays and could work with different information sources, which pose several challenges and research questions to the community as a whole. We present a systematic review focused on the research done in the last 10 years about this topic. We discuss and categorize the algorithms and evaluation methodologies used in these works and point out the opportunities and challenges that remain open in the field. More specifically, we report the leading recommendation techniques and information sources that have been exploited more often (such as the geographical signal and deep learning approaches) while we also alert about the lack of reproducibility in the field that may hinder real performance improvements.
    Heuristic Stopping Rules For Technology-Assisted Review. (arXiv:2106.09871v1 [cs.IR])
    (2 min) Technology-assisted review (TAR) refers to human-in-the-loop active learning workflows for finding relevant documents in large collections. These workflows often must meet a target for the proportion of relevant documents found (i.e. recall) while also holding down costs. A variety of heuristic stopping rules have been suggested for striking this tradeoff in particular settings, but none have been tested against a range of recall targets and tasks. We propose two new heuristic stopping rules, Quant and QuantCI based on model-based estimation techniques from survey research. We compare them against a range of proposed heuristics and find they are accurate at hitting a range of recall targets while substantially reducing review costs.
    FinGAT: Financial Graph Attention Networks for Recommending Top-K Profitable Stocks. (arXiv:2106.10159v1 [cs.LG])
    (2 min) Financial technology (FinTech) has drawn much attention among investors and companies. While conventional stock analysis in FinTech targets at predicting stock prices, less effort is made for profitable stock recommendation. Besides, in existing approaches on modeling time series of stock prices, the relationships among stocks and sectors (i.e., categories of stocks) are either neglected or pre-defined. Ignoring stock relationships will miss the information shared between stocks while using pre-defined relationships cannot depict the latent interactions or influence of stock prices between stocks. In this work, we aim at recommending the top-K profitable stocks in terms of return ratio using time series of stock prices and sector information. We propose a novel deep learning-based model, Financial Graph Attention Networks (FinGAT), to tackle the task under the setting that no pre-defined relationships between stocks are given. The idea of FinGAT is three-fold. First, we devise a hierarchical learning component to learn short-term and long-term sequential patterns from stock time series. Second, a fully-connected graph between stocks and a fully-connected graph between sectors are constructed, along with graph attention networks, to learn the latent interactions among stocks and sectors. Third, a multi-task objective is devised to jointly recommend the profitable stocks and predict the stock movement. Experiments conducted on Taiwan Stock, S&P 500, and NASDAQ datasets exhibit remarkable recommendation performance of our FinGAT, comparing to state-of-the-art methods.
    On Minimizing Cost in Legal Document Review Workflows. (arXiv:2106.09866v1 [cs.IR])
    (2 min) Technology-assisted review (TAR) refers to human-in-the-loop machine learning workflows for document review in legal discovery and other high recall review tasks. Attorneys and legal technologists have debated whether review should be a single iterative process (one-phase TAR workflows) or whether model training and review should be separate (two-phase TAR workflows), with implications for the choice of active learning algorithm. The relative cost of manual labeling for different purposes (training vs. review) and of different documents (positive vs. negative examples) is a key and neglected factor in this debate. Using a novel cost dynamics analysis, we show analytically and empirically that these relative costs strongly impact whether a one-phase or two-phase workflow minimizes cost. We also show how category prevalence, classification task difficulty, and collection size impact the optimal choice not only of workflow type, but of active learning method and stopping point.
  • cs.LG updates on arXiv.org

    Data Assimilation Predictive GAN (DA-PredGAN): applied to determine the spread of COVID-19. (arXiv:2105.07729v2 [cs.LG] UPDATED)
    (2 min) We propose the novel use of a generative adversarial network (GAN) (i) to make predictions in time (PredGAN) and (ii) to assimilate measurements (DA-PredGAN). In the latter case, we take advantage of the natural adjoint-like properties of generative models and the ability to simulate forwards and backwards in time. GANs have received much attention recently, after achieving excellent results for their generation of realistic-looking images. We wish to explore how this property translates to new applications in computational modelling and to exploit the adjoint-like properties for efficient data assimilation. To predict the spread of COVID-19 in an idealised town, we apply these methods to a compartmental model in epidemiology that is able to model space and time variations. To do this, the GAN is set within a reduced-order model (ROM), which uses a low-dimensional space for the spatial distribution of the simulation states. Then the GAN learns the evolution of the low-dimensional states over time. The results show that the proposed methods can accurately predict the evolution of the high-fidelity numerical simulation, and can efficiently assimilate observed data and determine the corresponding model parameters.
    Neural Pharmacodynamic State Space Modeling. (arXiv:2102.11218v3 [cs.LG] UPDATED)
    (2 min) Modeling the time-series of high-dimensional, longitudinal data is important for predicting patient disease progression. However, existing neural network based approaches that learn representations of patient state, while very flexible, are susceptible to overfitting. We propose a deep generative model that makes use of a novel attention-based neural architecture inspired by the physics of how treatments affect disease state. The result is a scalable and accurate model of high-dimensional patient biomarkers as they vary over time. Our proposed model yields significant improvements in generalization and, on real-world clinical data, provides interpretable insights into the dynamics of cancer progression.
    Noise2Sim -- Similarity-based Self-Learning for Image Denoising. (arXiv:2011.03384v4 [cs.LG] UPDATED)
    (2 min) Despite its best performance in image denoising, the supervised deep denoising methods require paired noise-clean data, which are often unavailable. To address this challenge, Noise2Noise was designed based on the fact that paired noise-clean images can be replaced by paired noise-noise images that are easier to collect. However, in many scenarios the collection of paired noise-noise images is still impractical. To bypass labeled images, Noise2Void methods predict masked pixels from their surroundings with single noisy images only and give improved denoising results that still need improvements. An observation on classic denoising methods is that non-local mean (NLM) outcomes are typically superior to locally denoised results. In contrast, Noise2Void and its variants do not utilize self-similarities in an image as the NLM-based methods do. Here we propose Noise2Sim, an NLM-inspired self-learning method for image denoising. Specifically, Noise2Sim leverages the self-similarity of image pixels to train the denoising network, requiring single noisy images only. Our theoretical analysis shows that Noise2Sim tends to be equivalent to Noise2Noise under mild conditions. To efficiently manage the computational burden for globally searching similar pixels, we design a two-step procedure to provide data for Noise2Sim training. Extensive experiments demonstrate the superiority of Noise2Sim on common benchmark datasets.
    Diffusion Approximations for a Class of Sequential Testing Problems. (arXiv:2102.07030v2 [stat.ML] UPDATED)
    (2 min) We consider a decision maker who must choose an action in order to maximize a reward function that depends also on an unknown parameter {\Theta}. The decision maker can delay taking the action in order to experiment and gather additional information on {\Theta}. We model the decision maker's problem using a Bayesian sequential experimentation framework and use dynamic programming and diffusion-asymptotic analysis to solve it. For that, we scale our problem in a way that both the average number of experiments that is conducted per unit of time is large and the informativeness of each individual experiment is low. Under such regime, we derive a diffusion approximation for the sequential experimentation problem, which provides a number of important insights about the nature of the problem and its solution. Our solution method also shows that the complexity of the problem grows only quadratically with the cardinality of the set of actions from which the decision maker can choose. We illustrate our methodology and results using a concrete application in the context of assortment selection and new product introduction. Specifically, we study the problem of a seller who wants to select an optimal assortment of products to launch into the marketplace and is uncertain about consumers' preferences. Motivated by emerging practices in e-commerce, we assume that the seller is able to use a crowdvoting system to learn these preferences before a final assortment decision is made. In this context, we undertake an extensive numerical analysis to assess the value of learning and demonstrate the effectiveness and robustness of the heuristics derived from the diffusion approximation.
    Predicting Water Temperature Dynamics of Unmonitored Lakes with Meta Transfer Learning. (arXiv:2011.05369v2 [cs.LG] UPDATED)
    (2 min) Most environmental data come from a minority of well-monitored sites. An ongoing challenge in the environmental sciences is transferring knowledge from monitored sites to unmonitored sites. Here, we demonstrate a novel transfer learning framework that accurately predicts depth-specific temperature in unmonitored lakes (targets) by borrowing models from well-monitored lakes (sources). This method, Meta Transfer Learning (MTL), builds a meta-learning model to predict transfer performance from candidate source models to targets using lake attributes and candidates' past performance. We constructed source models at 145 well-monitored lakes using calibrated process-based modeling (PB) and a recently developed approach called process-guided deep learning (PGDL). We applied MTL to either PB or PGDL source models (PB-MTL or PGDL-MTL, respectively) to predict temperatures in 305 target lakes treated as unmonitored in the Upper Midwestern United States. We show significantly improved performance relative to the uncalibrated process-based General Lake Model, where the median RMSE for the target lakes is $2.52^{\circ}C$. PB-MTL yielded a median RMSE of $2.43^{\circ}C$; PGDL-MTL yielded $2.16^{\circ}C$; and a PGDL-MTL ensemble of nine sources per target yielded $1.88^{\circ}C$. For sparsely monitored target lakes, PGDL-MTL often outperformed PGDL models trained on the target lakes themselves. Differences in maximum depth between the source and target were consistently the most important predictors. Our approach readily scales to thousands of lakes in the Midwestern United States, demonstrating that MTL with meaningful predictor variables and high-quality source models is a promising approach for many kinds of unmonitored systems and environmental variables.
    Structured Dropout Variational Inference for Bayesian Neural Networks. (arXiv:2102.07927v2 [cs.LG] UPDATED)
    (2 min) Approximate inference in deep Bayesian networks exhibits a dilemma of how to yield high fidelity posterior approximations while maintaining computational efficiency and scalability. We tackle this challenge by introducing a novel variational structured approximation inspired by the Bayesian interpretation of Dropout regularization. Concretely, we focus on the inflexibility of the factorized structure in Dropout posterior and then propose an improved method called Variational Structured Dropout (VSD). VSD employs an orthogonal transformation to learn a structured representation on the variational noise and consequently induces statistical dependencies in the approximate posterior. Theoretically, VSD successfully addresses the pathologies of previous Variational Dropout methods and thus offers a standard Bayesian justification. We further show that VSD induces an adaptive regularization term with several desirable properties which contribute to better generalization. Finally, we conduct extensive experiments on standard benchmarks to demonstrate the effectiveness of VSD over state-of-the-art variational methods on predictive accuracy, uncertainty estimation, and out-of-distribution detection.
    Learning Diverse-Structured Networks for Adversarial Robustness. (arXiv:2102.01886v4 [cs.LG] UPDATED)
    (2 min) In adversarial training (AT), the main focus has been the objective and optimizer while the model has been less studied, so that the models being used are still those classic ones in standard training (ST). Classic network architectures (NAs) are generally worse than searched NAs in ST, which should be the same in AT. In this paper, we argue that NA and AT cannot be handled independently, since given a dataset, the optimal NA in ST would be no longer optimal in AT. That being said, AT is time-consuming itself; if we directly search NAs in AT over large search spaces, the computation will be practically infeasible. Thus, we propose a diverse-structured network (DS-Net), to significantly reduce the size of the search space: instead of low-level operations, we only consider predefined atomic blocks, where an atomic block is a time-tested building block like the residual block. There are only a few atomic blocks and thus we can weight all atomic blocks rather than find the best one in a searched block of DS-Net, which is an essential trade-off between exploring diverse structures and exploiting the best structures. Empirical results demonstrate the advantages of DS-Net, i.e., weighting the atomic blocks.
    Consensus Control for Decentralized Deep Learning. (arXiv:2102.04828v2 [cs.LG] UPDATED)
    (2 min) Decentralized training of deep learning models enables on-device learning over networks, as well as efficient scaling to large compute clusters. Experiments in earlier works reveal that, even in a data-center setup, decentralized training often suffers from the degradation in the quality of the model: the training and test performance of models trained in a decentralized fashion is in general worse than that of models trained in a centralized fashion, and this performance drop is impacted by parameters such as network size, communication topology and data partitioning. We identify the changing consensus distance between devices as a key parameter to explain the gap between centralized and decentralized training. We show in theory that when the training consensus distance is lower than a critical quantity, decentralized training converges as fast as the centralized counterpart. We empirically validate that the relation between generalization performance and consensus distance is consistent with this theoretical observation. Our empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop. To this end, we provide practical training guidelines and exemplify its effectiveness on the data-center setup as the important first step.
    End-To-End Bias Mitigation: Removing Gender Bias in Deep Learning. (arXiv:2104.02532v2 [cs.LG] UPDATED)
    (2 min) Machine Learning models have been deployed across many different aspects of society, often in situations that affect social welfare. Although these models offer streamlined solutions to large problems, they may contain biases and treat groups or individuals unfairly based on protected attributes such as gender. In this paper, we introduce several examples of machine learning gender bias in practice followed by formalizations of fairness. We provide a survey of fairness research by detailing influential pre-processing, in-processing, and post-processing bias mitigation algorithms. We then propose an \textup{end-to-end bias mitigation} framework, which employs a fusion of pre-, in-, and post-processing methods to leverage the strengths of each individual technique. We test this method, along with the standard techniques we review, on a deep neural network to analyze bias mitigation in a deep learning setting. We find that our end-to-end bias mitigation framework outperforms the baselines with respect to several fairness metrics, suggesting its promise as a method for improving fairness. As society increasingly relies on artificial intelligence to help in decision-making, addressing gender biases present in deep learning models is imperative. To provide readers with the tools to assess the fairness of machine learning models and mitigate the biases present in them, we discuss multiple open source packages for fairness in AI.
    Exponential Moving Average Normalization for Self-supervised and Semi-supervised Learning. (arXiv:2101.08482v2 [cs.LG] UPDATED)
    (2 min) We present a plug-in replacement for batch normalization (BN) called exponential moving average normalization (EMAN), which improves the performance of existing student-teacher based self- and semi-supervised learning techniques. Unlike the standard BN, where the statistics are computed within each batch, EMAN, used in the teacher, updates its statistics by exponential moving average from the BN statistics of the student. This design reduces the intrinsic cross-sample dependency of BN and enhances the generalization of the teacher. EMAN improves strong baselines for self-supervised learning by 4-6/1-2 points and semi-supervised learning by about 7/2 points, when 1%/10% supervised labels are available on ImageNet. These improvements are consistent across methods, network architectures, training duration, and datasets, demonstrating the general effectiveness of this technique. The code is available at https://github.com/amazon-research/exponential-moving-average-normalization.
    Robust Implicit Networks via Non-Euclidean Contractions. (arXiv:2106.03194v2 [cs.LG] UPDATED)
    (2 min) Implicit neural networks, a.k.a., deep equilibrium networks, are a class of implicit-depth learning models where function evaluation is performed by solving a fixed point equation. They generalize classic feedforward models and are equivalent to infinite-depth weight-tied feedforward networks. While implicit models show improved accuracy and significant reduction in memory consumption, they can suffer from ill-posedness and convergence instability. This paper provides a new framework to design well-posed and robust implicit neural networks based upon contraction theory for the non-Euclidean norm $\ell_\infty$. Our framework includes (i) a novel condition for well-posedness based on one-sided Lipschitz constants, (ii) an average iteration for computing fixed-points, and (iii) explicit estimates on input-output Lipschitz constants. Additionally, we design a training problem with the well-posedness condition and the average iteration as constraints and, to achieve robust models, with the input-output Lipschitz constant as a regularizer. Our $\ell_\infty$ well-posedness condition leads to a larger polytopic training search space than existing conditions and our average iteration enjoys accelerated convergence. Finally, we perform several numerical experiments for function estimation and digit classification through the MNIST data set. Our numerical results demonstrate improved accuracy and robustness of the implicit models with smaller input-output Lipschitz bounds.
    Optimising simulations for diphoton production at hadron colliders using amplitude neural networks. (arXiv:2106.09474v1 [hep-ph] CROSS LISTED)
    (2 min) Machine learning technology has the potential to dramatically optimise event generation and simulations. We continue to investigate the use of neural networks to approximate matrix elements for high-multiplicity scattering processes. We focus on the case of loop-induced diphoton production through gluon fusion and develop a realistic simulation method that can be applied to hadron collider observables. Neural networks are trained using the one-loop amplitudes implemented in the NJet C++ library and interfaced to the Sherpa Monte Carlo event generator where we perform a detailed study for $2\to3$ and $2\to4$ scattering problems. We also consider how the trained networks perform when varying the kinematic cuts effecting the phase space and the reliability of the neural network simulations.
    Bootstrap an end-to-end ASR system by multilingual training, transfer learning, text-to-text mapping and synthetic audio. (arXiv:2011.12696v2 [eess.AS] UPDATED)
    (2 min) Bootstrapping speech recognition on limited data resources has been an area of active research for long. The recent transition to all-neural models and end-to-end (E2E) training brought along particular challenges as these models are known to be data hungry, but also came with opportunities around language-agnostic representations derived from multilingual data as well as shared word-piece output representations across languages that share script and roots. We investigate here the effectiveness of different strategies to bootstrap an RNN-Transducer (RNN-T) based automatic speech recognition (ASR) system in the low resource regime, while exploiting the abundant resources available in other languages as well as the synthetic audio from a text-to-speech (TTS) engine. Our experiments demonstrate that transfer learning from a multilingual model, using a post-ASR text-to-text mapping and synthetic audio deliver additive improvements, allowing us to bootstrap a model for a new language with a fraction of the data that would otherwise be needed. The best system achieved a 46% relative word error rate (WER) reduction compared to the monolingual baseline, among which 25% relative WER improvement is attributed to the post-ASR text-to-text mappings and the TTS synthetic data.
    Learning to Plan via a Multi-Step Policy Regression Method. (arXiv:2106.10075v1 [cs.LG])
    (2 min) We propose a new approach to increase inference performance in environments that require a specific sequence of actions in order to be solved. This is for example the case for maze environments where ideally an optimal path is determined. Instead of learning a policy for a single step, we want to learn a policy that can predict n actions in advance. Our proposed method called policy horizon regression (PHR) uses knowledge of the environment sampled by A2C to learn an n dimensional policy vector in a policy distillation setup which yields n sequential actions per observation. We test our method on the MiniGrid and Pong environments and show drastic speedup during inference time by successfully predicting sequences of actions on a single observation.
    SoK: Privacy-Preserving Collaborative Tree-based Model Learning. (arXiv:2103.08987v2 [cs.CR] UPDATED)
    (2 min) Tree-based models are among the most efficient machine learning techniques for data mining nowadays due to their accuracy, interpretability, and simplicity. The recent orthogonal needs for more data and privacy protection call for collaborative privacy-preserving solutions. In this work, we survey the literature on distributed and privacy-preserving training of tree-based models and we systematize its knowledge based on four axes: the learning algorithm, the collaborative model, the protection mechanism, and the threat model. We use this to identify the strengths and limitations of these works and provide for the first time a framework analyzing the information leakage occurring in distributed tree-based model learning.
    Faster Kernel Matrix Algebra via Density Estimation. (arXiv:2102.08341v2 [cs.DS] UPDATED)
    (2 min) We study fast algorithms for computing fundamental properties of a positive semidefinite kernel matrix $K \in \mathbb{R}^{n \times n}$ corresponding to $n$ points $x_1,\ldots,x_n \in \mathbb{R}^d$. In particular, we consider estimating the sum of kernel matrix entries, along with its top eigenvalue and eigenvector. We show that the sum of matrix entries can be estimated to $1+\epsilon$ relative error in time $sublinear$ in $n$ and linear in $d$ for many popular kernels, including the Gaussian, exponential, and rational quadratic kernels. For these kernels, we also show that the top eigenvalue (and an approximate eigenvector) can be approximated to $1+\epsilon$ relative error in time $subquadratic$ in $n$ and linear in $d$. Our algorithms represent significant advances in the best known runtimes for these problems. They leverage the positive definiteness of the kernel matrix, along with a recent line of work on efficient kernel density estimation.
    Ray-based framework for state identification in quantum dot devices. (arXiv:2102.11784v2 [quant-ph] UPDATED)
    (2 min) Quantum dots (QDs) defined with electrostatic gates are a leading platform for a scalable quantum computing implementation. However, with increasing numbers of qubits, the complexity of the control parameter space also grows. Traditional measurement techniques, relying on complete or near-complete exploration via two-parameter scans (images) of the device response, quickly become impractical with increasing numbers of gates. Here we propose to circumvent this challenge by introducing a measurement technique relying on one-dimensional projections of the device response in the multidimensional parameter space. Dubbed the ``ray-based classification (RBC) framework,'' we use this machine learning approach to implement a classifier for QD states, enabling automated recognition of qubit-relevant parameter regimes. We show that RBC surpasses the 82 % accuracy benchmark from the experimental implementation of image-based classification techniques from prior work while reducing the number of measurement points needed by up to 70 %. The reduction in measurement cost is a significant gain for time-intensive QD measurements and is a step forward toward the scalability of these devices. We also discuss how the RBC-based optimizer, which tunes the device to a multiqubit regime, performs when tuning in the two-dimensional and three-dimensional parameter spaces defined by plunger and barrier gates that control the QDs.This work provides experimental validation of both efficient state identification and optimization with machine learning techniques for non-traditional measurements in quantum systems with high-dimensional parameter spaces and time-intensive measurements.
    Dual-view Molecule Pre-training. (arXiv:2106.10234v1 [q-bio.QM])
    (2 min) Inspired by its success in natural language processing and computer vision, pre-training has attracted substantial attention in cheminformatics and bioinformatics, especially for molecule based tasks. A molecule can be represented by either a graph (where atoms are connected by bonds) or a SMILES sequence (where depth-first-search is applied to the molecular graph with specific rules). Existing works on molecule pre-training use either graph representations only or SMILES representations only. In this work, we propose to leverage both the representations and design a new pre-training algorithm, dual-view molecule pre-training (briefly, DMP), that can effectively combine the strengths of both types of molecule representations. The model of DMP consists of two branches: a Transformer branch that takes the SMILES sequence of a molecule as input, and a GNN branch that takes a molecular graph as input. The training of DMP contains three tasks: (1) predicting masked tokens in a SMILES sequence by the Transformer branch, (2) predicting masked atoms in a molecular graph by the GNN branch, and (3) maximizing the consistency between the two high-level representations output by the Transformer and GNN branches separately. After pre-training, we can use either the Transformer branch (this one is recommended according to empirical results), the GNN branch, or both for downstream tasks. DMP is tested on nine molecular property prediction tasks and achieves state-of-the-art performances on seven of them. Furthermore, we test DMP on three retrosynthesis tasks and achieve state-of-the-result on the USPTO-full dataset. Our code will be released soon.
    A Distance-based Separability Measure for Internal Cluster Validation. (arXiv:2106.09794v1 [cs.LG])
    (2 min) To evaluate clustering results is a significant part of cluster analysis. Since there are no true class labels for clustering in typical unsupervised learning, many internal cluster validity indices (CVIs), which use predicted labels and data, have been created. Without true labels, to design an effective CVI is as difficult as to create a clustering method. And it is crucial to have more CVIs because there are no universal CVIs that can be used to measure all datasets and no specific methods of selecting a proper CVI for clusters without true labels. Therefore, to apply a variety of CVIs to evaluate clustering results is necessary. In this paper, we propose a novel internal CVI -- the Distance-based Separability Index (DSI), based on a data separability measure. We compared the DSI with eight internal CVIs including studies from early Dunn (1974) to most recent CVDD (2019) and an external CVI as ground truth, by using clustering results of five clustering algorithms on 12 real and 97 synthetic datasets. Results show DSI is an effective, unique, and competitive CVI to other compared CVIs. We also summarized the general process to evaluate CVIs and created the rank-difference metric for comparison of CVIs' results.
    Residual Error: a New Performance Measure for Adversarial Robustness. (arXiv:2106.10212v1 [cs.LG])
    (2 min) Despite the significant advances in deep learning over the past decade, a major challenge that limits the wide-spread adoption of deep learning has been their fragility to adversarial attacks. This sensitivity to making erroneous predictions in the presence of adversarially perturbed data makes deep neural networks difficult to adopt for certain real-world, mission-critical applications. While much of the research focus has revolved around adversarial example creation and adversarial hardening, the area of performance measures for assessing adversarial robustness is not well explored. Motivated by this, this study presents the concept of residual error, a new performance measure for not only assessing the adversarial robustness of a deep neural network at the individual sample level, but also can be used to differentiate between adversarial and non-adversarial examples to facilitate for adversarial example detection. Furthermore, we introduce a hybrid model for approximating the residual error in a tractable manner. Experimental results using the case of image classification demonstrates the effectiveness and efficacy of the proposed residual error metric for assessing several well-known deep neural network architectures. These results thus illustrate that the proposed measure could be a useful tool for not only assessing the robustness of deep neural networks used in mission-critical scenarios, but also in the design of adversarially robust models.
    Continuous Doubly Constrained Batch Reinforcement Learning. (arXiv:2102.09225v3 [cs.LG] UPDATED)
    (2 min) Reliant on too many experiments to learn good actions, current Reinforcement Learning (RL) algorithms have limited applicability in real-world settings, which can be too expensive to allow exploration. We propose an algorithm for batch RL, where effective policies are learned using only a fixed offline dataset instead of online interactions with the environment. The limited data in batch RL produces inherent uncertainty in value estimates of states/actions that were insufficiently represented in the training data. This leads to particularly severe extrapolation when our candidate policies diverge from one that generated the data. We propose to mitigate this issue via two straightforward penalties: a policy-constraint to reduce this divergence and a value-constraint that discourages overly optimistic estimates. Over a comprehensive set of 32 continuous-action batch RL benchmarks, our approach compares favorably to state-of-the-art methods, regardless of how the offline data were collected.
    Leakage of Dataset Properties in Multi-Party Machine Learning. (arXiv:2006.07267v3 [cs.LG] UPDATED)
    (2 min) Secure multi-party machine learning allows several parties to build a model on their pooled data to increase utility while not explicitly sharing data with each other. We show that such multi-party computation can cause leakage of global dataset properties between the parties even when parties obtain only black-box access to the final model. In particular, a ``curious'' party can infer the distribution of sensitive attributes in other parties' data with high accuracy. This raises concerns regarding the confidentiality of properties pertaining to the whole dataset as opposed to individual data records. We show that our attack can leak population-level properties in datasets of different types, including tabular, text, and graph data. To understand and measure the source of leakage, we consider several models of correlation between a sensitive attribute and the rest of the data. Using multiple machine learning models, we show that leakage occurs even if the sensitive attribute is not included in the training data and has a low correlation with other attributes or the target variable.
    pyWATTS: Python Workflow Automation Tool for Time Series. (arXiv:2106.10157v1 [cs.LG])
    (2 min) Time series data are fundamental for a variety of applications, ranging from financial markets to energy systems. Due to their importance, the number and complexity of tools and methods used for time series analysis is constantly increasing. However, due to unclear APIs and a lack of documentation, researchers struggle to integrate them into their research projects and replicate results. Additionally, in time series analysis there exist many repetitive tasks, which are often re-implemented for each project, unnecessarily costing time. To solve these problems we present \texttt{pyWATTS}, an open-source Python-based package that is a non-sequential workflow automation tool for the analysis of time series data. pyWATTS includes modules with clearly defined interfaces to enable seamless integration of new or existing methods, subpipelining to easily reproduce repetitive tasks, load and save functionality to simply replicate results, and native support for key Python machine learning libraries such as scikit-learn, PyTorch, and Keras.
    Locally Private Graph Neural Networks. (arXiv:2006.05535v8 [cs.LG] UPDATED)
    (3 min) Graph Neural Networks (GNNs) have demonstrated superior performance in learning node representations for various graph inference tasks. However, learning over graph data can raise privacy concerns when nodes represent people or human-related variables that involve sensitive or personal information. While numerous techniques have been proposed for privacy-preserving deep learning over non-relational data, there is less work addressing the privacy issues pertained to applying deep learning algorithms on graphs. In this paper, we study the problem of node data privacy, where graph nodes have potentially sensitive data that is kept private, but they could be beneficial for a central server for training a GNN over the graph. To address this problem, we develop a privacy-preserving, architecture-agnostic GNN learning algorithm with formal privacy guarantees based on Local Differential Privacy (LDP). Specifically, we propose an LDP encoder and an unbiased rectifier, by which the server can communicate with the graph nodes to privately collect their data and approximate the GNN's first layer. To further reduce the effect of the injected noise, we propose to prepend a simple graph convolution layer, called KProp, which is based on the multi-hop aggregation of the nodes' features acting as a denoising mechanism. Finally, we propose a robust training framework, in which we benefit from KProp's denoising capability to increase the accuracy of inference in the presence of noisy labels. Extensive experiments conducted over real-world datasets demonstrate that our method can maintain a satisfying level of accuracy with low privacy loss.
    Wide stochastic networks: Gaussian limit and PAC-Bayesian training. (arXiv:2106.09798v1 [stat.ML])
    (2 min) The limit of infinite width allows for substantial simplifications in the analytical study of overparameterized neural networks. With a suitable random initialization, an extremely large network is well approximated by a Gaussian process, both before and during training. In the present work, we establish a similar result for a simple stochastic architecture whose parameters are random variables. The explicit evaluation of the output distribution allows for a PAC-Bayesian training procedure that directly optimizes the generalization bound. For a large but finite-width network, we show empirically on MNIST that this training approach can outperform standard PAC-Bayesian methods.
    Autoencoder-based cleaning in probabilistic databases. (arXiv:2106.09764v1 [cs.DB])
    (2 min) In the field of data integration, data quality problems are often encountered when extracting, combining, and merging data. The probabilistic data integration approach represents information about such problems as uncertainties in a probabilistic database. In this paper, we propose a data-cleaning autoencoder capable of near-automatic data quality improvement. It learns the structure and dependencies in the data to identify and correct doubtful values. A theoretical framework is provided, and experiments show that it can remove significant amounts of noise from categorical and numeric probabilistic data. Our method does not require clean data. We do, however, show that manually cleaning a small fraction of the data significantly improves performance.
    On-Device Personalization of Automatic Speech Recognition Models for Disordered Speech. (arXiv:2106.10259v1 [eess.AS])
    (2 min) While current state-of-the-art Automatic Speech Recognition (ASR) systems achieve high accuracy on typical speech, they suffer from significant performance degradation on disordered speech and other atypical speech patterns. Personalization of ASR models, a commonly applied solution to this problem, is usually performed in a server-based training environment posing problems around data privacy, delayed model-update times, and communication cost for copying data and models between mobile device and server infrastructure. In this paper, we present an approach to on-device based ASR personalization with very small amounts of speaker-specific data. We test our approach on a diverse set of 100 speakers with disordered speech and find median relative word error rate improvement of 71% with only 50 short utterances required per speaker. When tested on a voice-controlled home automation platform, on-device personalized models show a median task success rate of 81%, compared to only 40% of the unadapted models.
    Deep Reinforcement Learning Models Predict Visual Responses in the Brain: A Preliminary Result. (arXiv:2106.10112v1 [cs.LG])
    (2 min) Supervised deep convolutional neural networks (DCNNs) are currently one of the best computational models that can explain how the primate ventral visual stream solves object recognition. However, embodied cognition has not been considered in the existing visual processing models. From the ecological standpoint, humans learn to recognize objects by interacting with them, allowing better classification, specialization, and generalization. Here, we ask if computational models under the embodied learning framework can explain mechanisms underlying object recognition in the primate visual system better than the existing supervised models? To address this question, we use reinforcement learning to train neural network models to play a 3D computer game and we find that these reinforcement learning models achieve neural response prediction accuracy scores in the early visual areas (e.g., V1 and V2) in the levels that are comparable to those accomplished by the supervised neural network model. In contrast, the supervised neural network models yield better neural response predictions in the higher visual areas, compared to the reinforcement learning models. Our preliminary results suggest the future direction of visual neuroscience in which deep reinforcement learning should be included to fill the missing embodiment concept.
    Probabilistic Sequential Shrinking: A Best Arm Identification Algorithm for Stochastic Bandits with Corruptions. (arXiv:2010.07904v4 [cs.LG] UPDATED)
    (2 min) We consider a best arm identification (BAI) problem for stochastic bandits with adversarial corruptions in the fixed-budget setting of T steps. We design a novel randomized algorithm, Probabilistic Sequential Shrinking($u$) (PSS($u$)), which is agnostic to the amount of corruptions. When the amount of corruptions per step (CPS) is below a threshold, PSS($u$) identifies the best arm or item with probability tending to $1$ as $T\rightarrow \infty$. Otherwise, the optimality gap of the identified item degrades gracefully with the CPS.We argue that such a bifurcation is necessary. In PSS($u$), the parameter $u$ serves to balance between the optimality gap and success probability. The injection of randomization is shown to be essential to mitigate the impact of corruptions. To demonstrate this, we design two attack strategies that are applicable to any algorithm. We apply one of them to a deterministic analogue of PSS($u$) known as Successive Halving (SH) by Karnin et al. (2013). The attack strategy results in a high failure probability for SH, but PSS($u$) remains robust. In the absence of corruptions, PSS($2$)'s performance guarantee matches SH's. We show that when the CPS is sufficiently large, no algorithm can achieve a BAI probability tending to $1$ as $T\rightarrow \infty$. Numerical experiments corroborate our theoretical findings.
    Riemannian Convex Potential Maps. (arXiv:2106.10272v1 [cs.LG])
    (2 min) Modeling distributions on Riemannian manifolds is a crucial component in understanding non-Euclidean data that arises, e.g., in physics and geology. The budding approaches in this space are limited by representational and computational tradeoffs. We propose and study a class of flows that uses convex potentials from Riemannian optimal transport. These are universal and can model distributions on any compact Riemannian manifold without requiring domain knowledge of the manifold to be integrated into the architecture. We demonstrate that these flows can model standard distributions on spheres, and tori, on synthetic and geological data. Our source code is freely available online at this http URL
    Deep State Space Models for Nonlinear System Identification. (arXiv:2003.14162v3 [eess.SY] UPDATED)
    (2 min) Deep state space models (SSMs) are an actively researched model class for temporal models developed in the deep learning community which have a close connection to classic SSMs. The use of deep SSMs as a black-box identification model can describe a wide range of dynamics due to the flexibility of deep neural networks. Additionally, the probabilistic nature of the model class allows the uncertainty of the system to be modelled. In this work a deep SSM class and its parameter learning algorithm are explained in an effort to extend the toolbox of nonlinear identification methods with a deep learning based method. Six recent deep SSMs are evaluated in a first unified implementation on nonlinear system identification benchmarks.
    Active Offline Policy Selection. (arXiv:2106.10251v1 [cs.LG])
    (2 min) This paper addresses the problem of policy selection in domains with abundant logged data, but with a very restricted interaction budget. Solving this problem would enable safe evaluation and deployment of offline reinforcement learning policies in industry, robotics, and healthcare domain among others. Several off-policy evaluation (OPE) techniques have been proposed to assess the value of policies using only logged data. However, there is still a big gap between the evaluation by OPE and the full online evaluation in the real environment. To reduce this gap, we introduce a novel \emph{active offline policy selection} problem formulation, which combined logged data and limited online interactions to identify the best policy. We rely on the advances in OPE to warm start the evaluation. We build upon Bayesian optimization to iteratively decide which policies to evaluate in order to utilize the limited environment interactions wisely. Many candidate policies could be proposed, thus, we focus on making our approach scalable and introduce a kernel function to model similarity between policies. We use several benchmark environments to show that the proposed approach improves upon state-of-the-art OPE estimates and fully online policy evaluation with limited budget. Additionally, we show that each component of the proposed method is important, it works well with various number and quality of OPE estimates and even with a large number of candidate policies.
    On Invariance Penalties for Risk Minimization. (arXiv:2106.09777v1 [cs.LG])
    (2 min) The Invariant Risk Minimization (IRM) principle was first proposed by Arjovsky et al. [2019] to address the domain generalization problem by leveraging data heterogeneity from differing experimental conditions. Specifically, IRM seeks to find a data representation under which an optimal classifier remains invariant across all domains. Despite the conceptual appeal of IRM, the effectiveness of the originally proposed invariance penalty has recently been brought into question. In particular, there exists counterexamples for which that invariance penalty can be arbitrarily small for non-invariant data representations. We propose an alternative invariance penalty by revisiting the Gramian matrix of the data representation. We discuss the role of its eigenvalues in the relationship between the risk and the invariance penalty, and demonstrate that it is ill-conditioned for said counterexamples. The proposed approach is guaranteed to recover an invariant representation for linear settings under mild non-degeneracy conditions. Its effectiveness is substantiated by experiments on DomainBed and InvarianceUnitTest, two extensive test beds for domain generalization.
    An Analysis of the Deployment of Models Trained on Private Tabular Synthetic Data: Unexpected Surprises. (arXiv:2106.10241v1 [stat.ML])
    (2 min) Diferentially private (DP) synthetic datasets are a powerful approach for training machine learning models while respecting the privacy of individual data providers. The effect of DP on the fairness of the resulting trained models is not yet well understood. In this contribution, we systematically study the effects of differentially private synthetic data generation on classification. We analyze disparities in model utility and bias caused by the synthetic dataset, measured through algorithmic fairness metrics. Our first set of results show that although there seems to be a clear negative correlation between privacy and utility (the more private, the less accurate) across all data synthesizers we evaluated, more privacy does not necessarily imply more bias. Additionally, we assess the effects of utilizing synthetic datasets for model training and model evaluation. We show that results obtained on synthetic data can misestimate the actual model performance when it is deployed on real data. We hence advocate on the need for defining proper testing protocols in scenarios where differentially private synthetic datasets are utilized for model training and evaluation.
    LSEC: Large-scale spectral ensemble clustering. (arXiv:2106.09852v1 [cs.LG])
    (2 min) Ensemble clustering is a fundamental problem in the machine learning field, combining multiple base clusterings into a better clustering result. However, most of the existing methods are unsuitable for large-scale ensemble clustering tasks due to the efficiency bottleneck. In this paper, we propose a large-scale spectral ensemble clustering (LSEC) method to strike a good balance between efficiency and effectiveness. In LSEC, a large-scale spectral clustering based efficient ensemble generation framework is designed to generate various base clusterings within a low computational complexity. Then all based clustering are combined through a bipartite graph partition based consensus function into a better consensus clustering result. The LSEC method achieves a lower computational complexity than most existing ensemble clustering methods. Experiments conducted on ten large-scale datasets show the efficiency and effectiveness of the LSEC method. The MATLAB code of the proposed method and experimental datasets are available at https://github.com/Li- Hongmin/MyPaperWithCode.
    Deterministic Gibbs Sampling via Ordinary Differential Equations. (arXiv:2106.10188v1 [stat.CO])
    (2 min) Deterministic dynamics is an essential part of many MCMC algorithms, e.g. Hybrid Monte Carlo or samplers utilizing normalizing flows. This paper presents a general construction of deterministic measure-preserving dynamics using autonomous ODEs and tools from differential geometry. We show how Hybrid Monte Carlo and other deterministic samplers follow as special cases of our theory. We then demonstrate the utility of our approach by constructing a continuous non-sequential version of Gibbs sampling in terms of an ODE flow and extending it to discrete state spaces. We find that our deterministic samplers are more sample efficient than stochastic counterparts, even if the latter generate independent samples.
    A Vertical Federated Learning Framework for Horizontally Partitioned Labels. (arXiv:2106.10056v1 [cs.LG])
    (2 min) Vertical federated learning is a collaborative machine learning framework to train deep leaning models on vertically partitioned data with privacy-preservation. It attracts much attention both from academia and industry. Unfortunately, applying most existing vertical federated learning methods in real-world applications still faces two daunting challenges. First, most existing vertical federated learning methods have a strong assumption that at least one party holds the complete set of labels of all data samples, while this assumption is not satisfied in many practical scenarios, where labels are horizontally partitioned and the parties only hold partial labels. Existing vertical federated learning methods can only utilize partial labels, which may lead to inadequate model update in end-to-end backpropagation. Second, computational and communication resources vary in parties. Some parties with limited computational and communication resources will become the stragglers and slow down the convergence of training. Such straggler problem will be exaggerated in the scenarios of horizontally partitioned labels in vertical federated learning. To address these challenges, we propose a novel vertical federated learning framework named Cascade Vertical Federated Learning (CVFL) to fully utilize all horizontally partitioned labels to train neural networks with privacy-preservation. To mitigate the straggler problem, we design a novel optimization objective which can increase straggler's contribution to the trained models. We conduct a series of qualitative experiments to rigorously verify the effectiveness of CVFL. It is demonstrated that CVFL can achieve comparable performance (e.g., accuracy for classification tasks) with centralized training. The new optimization objective can further mitigate the straggler problem comparing with only using the asynchronous aggregation mechanism during training.
    A Probabilistic Representation of DNNs: Bridging Mutual Information and Generalization. (arXiv:2106.10262v1 [cs.LG])
    (2 min) Recently, Mutual Information (MI) has attracted attention in bounding the generalization error of Deep Neural Networks (DNNs). However, it is intractable to accurately estimate the MI in DNNs, thus most previous works have to relax the MI bound, which in turn weakens the information theoretic explanation for generalization. To address the limitation, this paper introduces a probabilistic representation of DNNs for accurately estimating the MI. Leveraging the proposed MI estimator, we validate the information theoretic explanation for generalization, and derive a tighter generalization bound than the state-of-the-art relaxations.
    Efficient Black-Box Importance Sampling for VaR and CVaR Estimation. (arXiv:2106.10236v1 [q-fin.RM])
    (2 min) This paper considers Importance Sampling (IS) for the estimation of tail risks of a loss defined in terms of a sophisticated object such as a machine learning feature map or a mixed integer linear optimisation formulation. Assuming only black-box access to the loss and the distribution of the underlying random vector, the paper presents an efficient IS algorithm for estimating the Value at Risk and Conditional Value at Risk. The key challenge in any IS procedure, namely, identifying an appropriate change-of-measure, is automated with a self-structuring IS transformation that learns and replicates the concentration properties of the conditional excess from less rare samples. The resulting estimators enjoy asymptotically optimal variance reduction when viewed in the logarithmic scale. Simulation experiments highlight the efficacy and practicality of the proposed scheme
    Some Theoretical Insights into Wasserstein GANs. (arXiv:2006.02682v2 [cs.LG] UPDATED)
    (2 min) Generative Adversarial Networks (GANs) have been successful in producing outstanding results in areas as diverse as image, video, and text generation. Building on these successes, a large number of empirical studies have validated the benefits of the cousin approach called Wasserstein GANs (WGANs), which brings stabilization in the training process. In the present paper, we add a new stone to the edifice by proposing some theoretical advances in the properties of WGANs. First, we properly define the architecture of WGANs in the context of integral probability metrics parameterized by neural networks and highlight some of their basic mathematical features. We stress in particular interesting optimization properties arising from the use of a parametric 1-Lipschitz discriminator. Then, in a statistically-driven approach, we study the convergence of empirical WGANs as the sample size tends to infinity, and clarify the adversarial effects of the generator and the discriminator by underlining some trade-off properties. These features are finally illustrated with experiments using both synthetic and real-world datasets.
    Concurrent Neural Network : A model of competition between times series. (arXiv:2009.14610v2 [stat.ML] UPDATED)
    (2 min) Competition between times series often arises in sales prediction, when similar products are on sale on a marketplace. This article provides a model of the presence of cannibalization between times series. This model creates a "competitiveness" function that depends on external features such as price and margin. It also provides a theoretical guaranty on the error of the model under some reasonable conditions, and implement this model using a neural network to compute this competitiveness function. This implementation outperforms other traditional time series methods and classical neural networks for market share prediction on a real-world data set.
    An Investigation into Mini-Batch Rule Learning. (arXiv:2106.10202v1 [cs.LG])
    (2 min) We investigate whether it is possible to learn rule sets efficiently in a network structure with a single hidden layer using iterative refinements over mini-batches of examples. A first rudimentary version shows an acceptable performance on all but one dataset, even though it does not yet reach the performance levels of Ripper.
    Causal Bias Quantification for Continuous Treatment. (arXiv:2106.09762v1 [stat.ME])
    (2 min) In this work we develop a novel characterization of marginal causal effect and causal bias in the continuous treatment setting. We show they can be expressed as an expectation with respect to a conditional probability distribution, which can be estimated via standard statistical and probabilistic methods. All terms in the expectations can be computed via automatic differentiation, also for highly non-linear models. We further develop a new complete criterion for identifiability of causal effects via covariate adjustment, showing the bias equals zero if the criterion is met. We study the effectiveness of our framework in three different scenarios: linear models under confounding, overcontrol and endogenous selection bias; a non-linear model where full identifiability cannot be achieved because of missing data; a simulated medical study of statins and atherosclerotic cardiovascular disease.
    Indicators of Attack Failure: Debugging and Improving Optimization of Adversarial Examples. (arXiv:2106.09947v1 [cs.LG])
    (2 min) Evaluating robustness of machine-learning models to adversarial examples is a challenging problem. Many defenses have been shown to provide a false sense of security by causing gradient-based attacks to fail, and they have been broken under more rigorous evaluations. Although guidelines and best practices have been suggested to improve current adversarial robustness evaluations, the lack of automatic testing and debugging tools makes it difficult to apply these recommendations in a systematic manner. In this work, we overcome these limitations by (i) defining a set of quantitative indicators which unveil common failures in the optimization of gradient-based attacks, and (ii) proposing specific mitigation strategies within a systematic evaluation protocol. Our extensive experimental analysis shows that the proposed indicators of failure can be used to visualize, debug and improve current adversarial robustness evaluations, providing a first concrete step towards automatizing and systematizing current adversarial robustness evaluations. Our open-source code is available at: https://github.com/pralab/IndicatorsOfAttackFailure.
    Distributed Deep Learning in Open Collaborations. (arXiv:2106.10207v1 [cs.LG])
    (2 min) Modern deep learning applications require increasingly more compute to train state-of-the-art models. To address this demand, large corporations and institutions use dedicated High-Performance Computing clusters, whose construction and maintenance are both environmentally costly and well beyond the budget of most organizations. As a result, some research directions become the exclusive domain of a few large industrial and even fewer academic actors. To alleviate this disparity, smaller groups may pool their computational resources and run collaborative experiments that benefit all participants. This paradigm, known as grid- or volunteer computing, has seen successful applications in numerous scientific areas. However, using this approach for machine learning is difficult due to high latency, asymmetric bandwidth, and several challenges unique to volunteer computing. In this work, we carefully analyze these constraints and propose a novel algorithmic framework designed specifically for collaborative training. We demonstrate the effectiveness of our approach for SwAV and ALBERT pretraining in realistic conditions and achieve performance comparable to traditional setups at a fraction of the cost. Finally, we provide a detailed report of successful collaborative language model pretraining with 40 participants.
    Predicting gender of Brazilian names using deep learning. (arXiv:2106.10156v1 [cs.LG])
    (2 min) Predicting gender by the name is not a simple task. In many applications, especially in the natural language processing (NLP) field, this task may be necessary, mainly when considering foreign names. Some machine learning algorithms can satisfactorily perform the prediction. In this paper, we examined and implemented feedforward and recurrent deep neural network models, such as MLP, RNN, GRU, CNN, and BiLSTM, to classify gender through the first name. A dataset of Brazilian names is used to train and evaluate the models. We analyzed the accuracy, recall, precision, and confusion matrix to measure the models' performances. The results indicate that the gender prediction can be performed from the feature extraction strategy looking at the names as a set of strings. Some models accurately predict the gender in more than 90% of the cases. The recurrent models overcome the feedforward models in this binary classification problem.
    A Helmholtz equation solver using unsupervised learning: Application to transcranial ultrasound. (arXiv:2010.15761v2 [physics.comp-ph] UPDATED)
    (2 min) Transcranial ultrasound therapy is increasingly used for the non-invasive treatment of brain disorders. However, conventional numerical wave solvers are currently too computationally expensive to be used online during treatments to predict the acoustic field passing through the skull (e.g., to account for subject-specific dose and targeting variations). As a step towards real-time predictions, in the current work, a fast iterative solver for the heterogeneous Helmholtz equation in 2D is developed using a fully-learned optimizer. The lightweight network architecture is based on a modified UNet that includes a learned hidden state. The network is trained using a physics-based loss function and a set of idealized sound speed distributions with fully unsupervised training (no knowledge of the true solution is required). The learned optimizer shows excellent performance on the test set, and is capable of generalization well outside the training examples, including to much larger computational domains, and more complex source and sound speed distributions, for example, those derived from x-ray computed tomography images of the skull.
    Less is More: Feature Selection for Adversarial Robustness with Compressive Counter-Adversarial Attacks. (arXiv:2106.10252v1 [cs.LG])
    (2 min) A common observation regarding adversarial attacks is that they mostly give rise to false activation at the penultimate layer to fool the classifier. Assuming that these activation values correspond to certain features of the input, the objective becomes choosing the features that are most useful for classification. Hence, we propose a novel approach to identify the important features by employing counter-adversarial attacks, which highlights the consistency at the penultimate layer with respect to perturbations on input samples. First, we empirically show that there exist a subset of features, classification based in which bridge the gap between the clean and robust accuracy. Second, we propose a simple yet efficient mechanism to identify those features by searching the neighborhood of input sample. We then select features by observing the consistency of the activation values at the penultimate layer.
    MARS: Masked Automatic Ranks Selection in Tensor Decompositions. (arXiv:2006.10859v2 [cs.LG] UPDATED)
    (2 min) Tensor decomposition methods are known to be efficient for compressing and accelerating neural networks. However, the problem of optimal decomposition structure determination is still not well studied while being quite important. Specifically, decomposition ranks present the crucial parameter controlling the compression-accuracy trade-off. In this paper, we introduce MARS -- a new efficient method for the automatic selection of ranks in general tensor decompositions. During training, the procedure learns binary masks over decomposition cores that "select" the optimal tensor structure. The learning is performed via relaxed maximum a posteriori (MAP) estimation in a specific Bayesian model. The proposed method achieves better results compared to previous works in various tasks.
    Hybrid graph convolutional neural networks for landmark-based anatomical segmentation. (arXiv:2106.09832v1 [eess.IV])
    (2 min) In this work we address the problem of landmark-based segmentation for anatomical structures. We propose HybridGNet, an encoder-decoder neural architecture which combines standard convolutions for image feature encoding, with graph convolutional neural networks to decode plausible representations of anatomical structures. We benchmark the proposed architecture considering other standard landmark and pixel-based models for anatomical segmentation in chest x-ray images, and found that HybridGNet is more robust to image occlusions. We also show that it can be used to construct landmark-based segmentations from pixel level annotations. Our experimental results suggest that HybridGNet produces accurate and anatomically plausible landmark-based segmentations, by naturally incorporating shape constraints within the decoding process via spectral convolutions.
    Self-supervised Incremental Deep Graph Learning for Ethereum Phishing Scam Detection. (arXiv:2106.10176v1 [cs.LG])
    (2 min) In recent years, phishing scams have become the crime type with the largest money involved on Ethereum, the second-largest blockchain platform. Meanwhile, graph neural network (GNN) has shown promising performance in various node classification tasks. However, for Ethereum transaction data, which could be naturally abstracted to a real-world complex graph, the scarcity of labels and the huge volume of transaction data make it difficult to take advantage of GNN methods. Here in this paper, to address the two challenges, we propose a Self-supervised Incremental deep Graph learning model (SIEGE), for the phishing scam detection problem on Ethereum. In our model, two pretext tasks designed from spatial and temporal perspectives help us effectively learn useful node embedding from the huge amount of unlabelled transaction data. And the incremental paradigm allows us to efficiently handle large-scale transaction data and help the model maintain good performance when the data distribution is drastically changing. We collect transaction records about half a year from Ethereum and our extensive experiments show that our model consistently outperforms strong baselines in both transductive and inductive settings.
    Generalized Learning Vector Quantization for Classification in Randomized Neural Networks and Hyperdimensional Computing. (arXiv:2106.09821v1 [cs.LG])
    (2 min) Machine learning algorithms deployed on edge devices must meet certain resource constraints and efficiency requirements. Random Vector Functional Link (RVFL) networks are favored for such applications due to their simple design and training efficiency. We propose a modified RVFL network that avoids computationally expensive matrix operations during training, thus expanding the network's range of potential applications. Our modification replaces the least-squares classifier with the Generalized Learning Vector Quantization (GLVQ) classifier, which only employs simple vector and distance calculations. The GLVQ classifier can also be considered an improvement upon certain classification algorithms popularly used in the area of Hyperdimensional Computing. The proposed approach achieved state-of-the-art accuracy on a collection of datasets from the UCI Machine Learning Repository - higher than previously proposed RVFL networks. We further demonstrate that our approach still achieves high accuracy while severely limited in training iterations (using on average only 21% of the least-squares classifier computational costs).
    Problem Dependent View on Structured Thresholding Bandit Problems. (arXiv:2106.10166v1 [stat.ML])
    (2 min) We investigate the problem dependent regime in the stochastic Thresholding Bandit problem (TBP) under several shape constraints. In the TBP, the objective of the learner is to output, at the end of a sequential game, the set of arms whose means are above a given threshold. The vanilla, unstructured, case is already well studied in the literature. Taking $K$ as the number of arms, we consider the case where (i) the sequence of arm's means $(\mu_k)_{k=1}^K$ is monotonically increasing (MTBP) and (ii) the case where $(\mu_k)_{k=1}^K$ is concave (CTBP). We consider both cases in the problem dependent regime and study the probability of error - i.e. the probability to mis-classify at least one arm. In the fixed budget setting, we provide upper and lower bounds for the probability of error in both the concave and monotone settings, as well as associated algorithms. In both settings the bounds match in the problem dependent regime up to universal constants in the exponential.
    Embodied Language Grounding with 3D Visual Feature Representations. (arXiv:1910.01210v3 [cs.CV] UPDATED)
    (2 min) We propose associating language utterances to 3D visual abstractions of the scene they describe. The 3D visual abstractions are encoded as 3-dimensional visual feature maps. We infer these 3D visual scene feature maps from RGB images of the scene via view prediction: when the generated 3D scene feature map is neurally projected from a camera viewpoint, it should match the corresponding RGB image. We present generative models that condition on the dependency tree of an utterance and generate a corresponding visual 3D feature map as well as reason about its plausibility, and detector models that condition on both the dependency tree of an utterance and a related image and localize the object referents in the 3D feature map inferred from the image. Our model outperforms models of language and vision that associate language with 2D CNN activations or 2D images by a large margin in a variety of tasks, such as, classifying plausibility of utterances, detecting referential expressions, and supplying rewards for trajectory optimization of object placement policies from language instructions. We perform numerous ablations and show the improved performance of our detectors is due to its better generalization across camera viewpoints and lack of object interferences in the inferred 3D feature space, and the improved performance of our generators is due to their ability to spatially reason about objects and their configurations in 3D when mapping from language to scenes.
    On the Sample Complexity of Batch Reinforcement Learning with Policy-Induced Data. (arXiv:2106.09973v1 [cs.LG])
    (2 min) We study the fundamental question of the sample complexity of learning a good policy in finite Markov decision processes (MDPs) when the data available for learning is obtained by following a logging policy that must be chosen without knowledge of the underlying MDP. Our main results show that the sample complexity, the minimum number of transitions necessary and sufficient to obtain a good policy, is an exponential function of the relevant quantities when the planning horizon $H$ is finite. In particular, we prove that the sample complexity of obtaining $\epsilon$-optimal policies is at least $\Omega(\mathrm{A}^{\min(\mathrm{S}-1, H+1)})$ for $\gamma$-discounted problems, where $\mathrm{S}$ is the number of states, $\mathrm{A}$ is the number of actions, and $H$ is the effective horizon defined as $H=\lfloor \tfrac{\ln(1/\epsilon)}{\ln(1/\gamma)} \rfloor$; and it is at least $\Omega(\mathrm{A}^{\min(\mathrm{S}-1, H)}/\varepsilon^2)$ for finite horizon problems, where $H$ is the planning horizon of the problem. This lower bound is essentially matched by an upper bound. For the average-reward setting we show that there is no algorithm finding $\epsilon$-optimal policies with a finite amount of data.
    Adapting the Function Approximation Architecture in Online Reinforcement Learning. (arXiv:2106.09776v1 [cs.LG])
    (2 min) The performance of a reinforcement learning (RL) system depends on the computational architecture used to approximate a value function. Deep learning methods provide both optimization techniques and architectures for approximating nonlinear functions from noisy, high-dimensional observations. However, prevailing optimization techniques are not designed for strictly-incremental online updates. Nor are standard architectures designed for observations with an a priori unknown structure: for example, light sensors randomly dispersed in space. This paper proposes an online RL prediction algorithm with an adaptive architecture that efficiently finds useful nonlinear features. The algorithm is evaluated in a spatial domain with high-dimensional, stochastic observations. The algorithm outperforms non-adaptive baseline architectures and approaches the performance of an architecture given side-channel information. These results are a step towards scalable RL algorithms for more general problems, where the observation structure is not available.
    CoPhy-PGNN: Learning Physics-guided Neural Networks with Competing Loss Functions for Solving Eigenvalue Problems. (arXiv:2007.01420v5 [cs.LG] UPDATED)
    (2 min) Physics-guided Neural Networks (PGNNs) represent an emerging class of neural networks that are trained using physics-guided (PG) loss functions (capturing violations in network outputs with known physics), along with the supervision contained in data. Existing work in PGNNs have demonstrated the efficacy of adding single PG loss functions in the neural network objectives, using constant trade-off parameters, to ensure better generalizability. However, in the presence of multiple physics loss functions with competing gradient directions, there is a need to adaptively tune the contribution of competing PG loss functions during the course of training to arrive at generalizable solutions. We demonstrate the presence of competing PG losses in the generic neural network problem of solving for the lowest (or highest) eigenvector of a physics-based eigenvalue equation, common to many scientific problems. We present a novel approach to handle competing PG losses and demonstrate its efficacy in learning generalizable solutions in two motivating applications of quantum mechanics and electromagnetic propagation. All the code and data used in this work is available at https://github.com/jayroxis/Cophy-PGNN.
    A learned conditional prior for the VAE acoustic space of a TTS system. (arXiv:2106.10229v1 [eess.AS])
    (2 min) Many factors influence speech yielding different renditions of a given sentence. Generative models, such as variational autoencoders (VAEs), capture this variability and allow multiple renditions of the same sentence via sampling. The degree of prosodic variability depends heavily on the prior that is used when sampling. In this paper, we propose a novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system. By doing so, we aim to sample with more prosodic variability, while gaining controllability over the latent space's structure. By using as prior the posterior distribution of a secondary VAE, which we condition on a speaker vector, we can sample from the primary VAE taking explicitly the conditioning into account and resulting in samples from a specific region of the latent space for each condition (i.e. speaker). A formal preference test demonstrates significant preference of the proposed approach over standard Conditional VAE. We also provide visualisations of the latent space where well-separated condition-specific clusters appear, as well as ablation studies to better understand the behaviour of the system.
    NoiseGrad: enhancing explanations by introducing stochasticity to model weights. (arXiv:2106.10185v1 [cs.LG])
    (2 min) Attribution methods remain a practical instrument that is used in real-world applications to explain the decision-making process of complex learning machines. It has been shown that a simple method called SmoothGrad can effectively reduce the visual diffusion of gradient-based attribution methods and has established itself among both researchers and practitioners. What remains unexplored in research, however, is how explanations can be improved by introducing stochasticity to the model weights. In the light of this, we introduce - NoiseGrad - a stochastic, method-agnostic explanation-enhancing method that adds noise to the weights instead of the input data. We investigate our proposed method through various experiments including different datasets, explanation methods and network architectures and conclude that NoiseGrad (and its extension NoiseGrad++) with multiplicative Gaussian noise offers a clear advantage compared to SmoothGrad on several evaluation criteria. We connect our proposed method to Bayesian Learning and provide the user with a heuristic for choosing hyperparameters.
    Symbolic Parallel Adaptive Importance Sampling for Probabilistic Program Analysis. (arXiv:2010.05050v2 [cs.LG] UPDATED)
    (2 min) Probabilistic software analysis aims at quantifying the probability of a target event occurring during the execution of a program processing uncertain incoming data or written itself using probabilistic programming constructs. Recent techniques combine symbolic execution with model counting or solution space quantification methods to obtain accurate estimates of the occurrence probability of rare target events, such as failures in a mission-critical system. However, they face several scalability and applicability limitations when analyzing software processing with high-dimensional and correlated multivariate input distributions. In this paper, we present SYMbolic Parallel Adaptive Importance Sampling (SYMPAIS), a new inference method tailored to analyze path conditions generated from the symbolic execution of programs with high-dimensional, correlated input distributions. SYMPAIS combines results from importance sampling and constraint solving to produce accurate estimates of the satisfaction probability for a broad class of constraints that cannot be analyzed by current solution space quantification methods. We demonstrate SYMPAIS's generality and performance compared with state-of-the-art alternatives on a set of problems from different application domains.
    The Trimmed Lasso: Sparse Recovery Guarantees and Practical Optimization by the Generalized Soft-Min Penalty. (arXiv:2005.09021v3 [cs.LG] UPDATED)
    (2 min) We present a new approach to solve the sparse approximation or best subset selection problem, namely find a $k$-sparse vector ${\bf x}\in\mathbb{R}^d$ that minimizes the $\ell_2$ residual $\lVert A{\bf x}-{\bf y} \rVert_2$. We consider a regularized approach, whereby this residual is penalized by the non-convex $\textit{trimmed lasso}$, defined as the $\ell_1$-norm of ${\bf x}$ excluding its $k$ largest-magnitude entries. We prove that the trimmed lasso has several appealing theoretical properties, and in particular derive sparse recovery guarantees assuming successful optimization of the penalized objective. Next, we show empirically that directly optimizing this objective can be quite challenging. Instead, we propose a surrogate for the trimmed lasso, called the $\textit{generalized soft-min}$. This penalty smoothly interpolates between the classical lasso and the trimmed lasso, while taking into account all possible $k$-sparse patterns. The generalized soft-min penalty involves summation over $\binom{d}{k}$ terms, yet we derive a polynomial-time algorithm to compute it. This, in turn, yields a practical method for the original sparse approximation problem. Via simulations, we demonstrate its competitive performance compared to current state of the art.
    Gradual Domain Adaptation via Self-Training of Auxiliary Models. (arXiv:2106.09890v1 [cs.LG])
    (2 min) Domain adaptation becomes more challenging with increasing gaps between source and target domains. Motivated from an empirical analysis on the reliability of labeled source data for the use of distancing target domains, we propose self-training of auxiliary models (AuxSelfTrain) that learns models for intermediate domains and gradually combats the distancing shifts across domains. We introduce evolving intermediate domains as combinations of decreasing proportion of source data and increasing proportion of target data, which are sampled to minimize the domain distance between consecutive domains. Then the source model could be gradually adapted for the use in the target domain by self-training of auxiliary models on evolving intermediate domains. We also introduce an enhanced indicator for sample selection via implicit ensemble and extend the proposed method to semi-supervised domain adaptation. Experiments on benchmark datasets of unsupervised and semi-supervised domain adaptation verify its efficacy.
    No Routing Needed Between Capsules. (arXiv:2001.09136v6 [cs.CV] UPDATED)
    (2 min) Most capsule network designs rely on traditional matrix multiplication between capsule layers and computationally expensive routing mechanisms to deal with the capsule dimensional entanglement that the matrix multiplication introduces. By using Homogeneous Vector Capsules (HVCs), which use element-wise multiplication rather than matrix multiplication, the dimensions of the capsules remain unentangled. In this work, we study HVCs as applied to the highly structured MNIST dataset in order to produce a direct comparison to the capsule research direction of Geoffrey Hinton, et al. In our study, we show that a simple convolutional neural network using HVCs performs as well as the prior best performing capsule network on MNIST using 5.5x fewer parameters, 4x fewer training epochs, no reconstruction sub-network, and requiring no routing mechanism. The addition of multiple classification branches to the network establishes a new state of the art for the MNIST dataset with an accuracy of 99.87% for an ensemble of these models, as well as establishing a new state of the art for a single model (99.83% accurate).
    Learning to Generate Code Sketches. (arXiv:2106.10158v1 [cs.LG])
    (2 min) Traditional generative models are limited to predicting sequences of terminal tokens. However, ambiguities in the generation task may lead to incorrect outputs. Towards addressing this, we introduce Grammformers, transformer-based grammar-guided models that learn (without explicit supervision) to generate sketches -- sequences of tokens with holes. Through reinforcement learning, Grammformers learn to introduce holes avoiding the generation of incorrect tokens where there is ambiguity in the target task. We train Grammformers for statement-level source code completion, i.e., the generation of code snippets given an ambiguous user intent, such as a partial code context. We evaluate Grammformers on code completion for C# and Python and show that it generates 10-50% more accurate sketches compared to traditional generative models and 37-50% longer sketches compared to sketch-generating baselines trained with similar techniques.
    Iterative Feature Matching: Toward Provable Domain Generalization with Logarithmic Environments. (arXiv:2106.09913v1 [cs.LG])
    (2 min) Domain generalization aims at performing well on unseen test environments with data from a limited number of training environments. Despite a proliferation of proposal algorithms for this task, assessing their performance, both theoretically and empirically is still very challenging. Moreover, recent approaches such as Invariant Risk Minimization (IRM) require a prohibitively large number of training environments - linear in the dimension of the spurious feature space $d_s$ - even on simple data models like the one proposed by [Rosenfeld et al., 2021]. Under a variant of this model, we show that both ERM and IRM cannot generalize with $o(d_s)$ environments. We then present a new algorithm based on performing iterative feature matching that is guaranteed with high probability to yield a predictor that generalizes after seeing only $O(\log{d_s})$ environments.
    Model Reduction and Neural Networks for Parametric PDEs. (arXiv:2005.03180v2 [math.NA] UPDATED)
    (2 min) We develop a general framework for data-driven approximation of input-output maps between infinite-dimensional spaces. The proposed approach is motivated by the recent successes of neural networks and deep learning, in combination with ideas from model reduction. This combination results in a neural network approximation which, in principle, is defined on infinite-dimensional spaces and, in practice, is robust to the dimension of finite-dimensional approximations of these spaces required for computation. For a class of input-output maps, and suitably chosen probability measures on the inputs, we prove convergence of the proposed approximation methodology. We also include numerical experiments which demonstrate the effectiveness of the method, showing convergence and robustness of the approximation scheme with respect to the size of the discretization, and compare it with existing algorithms from the literature; our examples include the mapping from coefficient to solution in a divergence form elliptic partial differential equation (PDE) problem, and the solution operator for viscous Burgers' equation.
    Labelling Drifts in a Fault Detection System for Wind Turbine Maintenance. (arXiv:2106.09951v1 [cs.LG])
    (2 min) A failure detection system is the first step towards predictive maintenance strategies. A popular data-driven method to detect incipient failures and anomalies is the training of normal behaviour models by applying a machine learning technique like feed-forward neural networks (FFNN) or extreme learning machines (ELM). However, the performance of any of these modelling techniques can be deteriorated by the unexpected rise of non-stationarities in the dynamic environment in which industrial assets operate. This unpredictable statistical change in the measured variable is known as concept drift. In this article a wind turbine maintenance case is presented, where non-stationarities of various kinds can happen unexpectedly. Such concept drift events are desired to be detected by means of statistical detectors and window-based approaches. However, in real complex systems, concept drifts are not as clear and evident as in artificially generated datasets. In order to evaluate the effectiveness of current drift detectors and also to design an appropriate novel technique for this specific industrial application, it is essential to dispose beforehand of a characterization of the existent drifts. Under the lack of information in this regard, a methodology for labelling concept drift events in the lifetime of wind turbines is proposed. This methodology will facilitate the creation of a drift database that will serve both as a training ground for concept drift detectors and as a valuable information to enhance the knowledge about maintenance of complex systems.
    On Contrastive Representations of Stochastic Processes. (arXiv:2106.10052v1 [stat.ML])
    (2 min) Learning representations of stochastic processes is an emerging problem in machine learning with applications from meta-learning to physical object models to time series. Typical methods rely on exact reconstruction of observations, but this approach breaks down as observations become high-dimensional or noise distributions become complex. To address this, we propose a unifying framework for learning contrastive representations of stochastic processes (CRESP) that does away with exact reconstruction. We dissect potential use cases for stochastic process representations, and propose methods that accommodate each. Empirically, we show that our methods are effective for learning representations of periodic functions, 3D objects and dynamical processes. Our methods tolerate noisy high-dimensional observations better than traditional approaches, and the learned representations transfer to a range of downstream tasks.
    Multi-Armed Bandits for Minesweeper: Profiting from Exploration-Exploitation Synergy. (arXiv:2007.12824v2 [cs.LG] UPDATED)
    (2 min) A popular computer puzzle, the game of Minesweeper requires its human players to have a mix of both luck and strategy to succeed. Analyzing these aspects more formally, in our research we assessed the feasibility of a novel methodology based on Reinforcement Learning as an adequate approach to tackle the problem presented by this game. For this purpose we employed Multi-Armed Bandit algorithms which were carefully adapted in order to enable their use to define autonomous computational players, targeting to make the best use of some game peculiarities. After experimental evaluation, results showed that this approach was indeed successful, especially in smaller game boards, such as the standard beginner level. Despite this fact the main contribution of this work is a detailed examination of Minesweeper from a learning perspective, which led to various original insights which are thoroughly discussed.
    Fitting summary statistics of neural data with a differentiable spiking network simulator. (arXiv:2106.10064v1 [stat.ML])
    (2 min) Fitting network models to neural activity is becoming an important tool in neuroscience. A popular approach is to model a brain area with a probabilistic recurrent spiking network whose parameters maximize the likelihood of the recorded activity. Although this is widely used, we show that the resulting model does not produce realistic neural activity and wrongly estimates the connectivity matrix when neurons that are not recorded have a substantial impact on the recorded network. To correct for this, we suggest to augment the log-likelihood with terms that measure the dissimilarity between simulated and recorded activity. This dissimilarity is defined via summary statistics commonly used in neuroscience, and the optimization is efficient because it relies on back-propagation through the stochastically simulated spike trains. We analyze this method theoretically and show empirically that it generates more realistic activity statistics and recovers the connectivity matrix better than other methods.
    Why Mixup Improves the Model Performance. (arXiv:2006.06231v4 [stat.ML] UPDATED)
    (2 min) Machine learning techniques are used in a wide range of domains. However, machine learning models often suffer from the problem of over-fitting. Many data augmentation methods have been proposed to tackle such a problem, and one of them is called mixup. Mixup is a recently proposed regularization procedure, which linearly interpolates a random pair of training examples. This regularization method works very well experimentally, but its theoretical guarantee is not adequately discussed. In this study, we aim to discover why mixup works well from the aspect of the statistical learning theory.
    Evaluating the Robustness of Trigger Set-Based Watermarks Embedded in Deep Neural Networks. (arXiv:2106.10147v1 [cs.CR])
    (2 min) Trigger set-based watermarking schemes have gained emerging attention as they provide a means to prove ownership for deep neural network model owners. In this paper, we argue that state-of-the-art trigger set-based watermarking algorithms do not achieve their designed goal of proving ownership. We posit that this impaired capability stems from two common experimental flaws that the existing research practice has committed when evaluating the robustness of watermarking algorithms: (1) incomplete adversarial evaluation and (2) overlooked adaptive attacks. We conduct a comprehensive adversarial evaluation of 10 representative watermarking schemes against six of the existing attacks and demonstrate that each of these watermarking schemes lacks robustness against at least two attacks. We also propose novel adaptive attacks that harness the adversary's knowledge of the underlying watermarking algorithm of a target model. We demonstrate that the proposed attacks effectively break all of the 10 watermarking schemes, consequently allowing adversaries to obscure the ownership of any watermarked model. We encourage follow-up studies to consider our guidelines when evaluating the robustness of their watermarking schemes via conducting comprehensive adversarial evaluation that include our adaptive attacks to demonstrate a meaningful upper bound of watermark robustness.
    An Empirical Investigation into Deep and Shallow Rule Learning. (arXiv:2106.10254v1 [cs.LG])
    (2 min) Inductive rule learning is arguably among the most traditional paradigms in machine learning. Although we have seen considerable progress over the years in learning rule-based theories, all state-of-the-art learners still learn descriptions that directly relate the input features to the target concept. In the simplest case, concept learning, this is a disjunctive normal form (DNF) description of the positive class. While it is clear that this is sufficient from a logical point of view because every logical expression can be reduced to an equivalent DNF expression, it could nevertheless be the case that more structured representations, which form deep theories by forming intermediate concepts, could be easier to learn, in very much the same way as deep neural networks are able to outperform shallow networks, even though the latter are also universal function approximators. In this paper, we empirically compare deep and shallow rule learning with a uniform general algorithm, which relies on greedy mini-batch based optimization. Our experiments on both artificial and real-world benchmark data indicate that deep rule networks outperform shallow networks.
    BitFit: Simple Parameter-efficient Fine-tuning for Transformer-based Masked Language-models. (arXiv:2106.10199v1 [cs.LG])
    (2 min) We show that with small-to-medium training data, fine-tuning only the bias terms (or a subset of the bias terms) of pre-trained BERT models is competitive with (and sometimes better than) fine-tuning the entire model. For larger data, bias-only fine-tuning is competitive with other sparse fine-tuning methods. Besides their practical utility, these findings are relevant for the question of understanding the commonly-used process of finetuning: they support the hypothesis that finetuning is mainly about exposing knowledge induced by language-modeling training, rather than learning new task-specific linguistic knowledge.
    Solving Stochastic Compositional Optimization is Nearly as Easy as Solving Stochastic Optimization. (arXiv:2008.10847v3 [math.OC] UPDATED)
    (2 min) Stochastic compositional optimization generalizes classic (non-compositional) stochastic optimization to the minimization of compositions of functions. Each composition may introduce an additional expectation. The series of expectations may be nested. Stochastic compositional optimization is gaining popularity in applications such as reinforcement learning and meta learning. This paper presents a new Stochastically Corrected Stochastic Compositional gradient method (SCSC). SCSC runs in a single-time scale with a single loop, uses a fixed batch size, and guarantees to converge at the same rate as the stochastic gradient descent (SGD) method for non-compositional stochastic optimization. This is achieved by making a careful improvement to a popular stochastic compositional gradient method. It is easy to apply SGD-improvement techniques to accelerate SCSC. This helps SCSC achieve state-of-the-art performance for stochastic compositional optimization. In particular, we apply Adam to SCSC, and the exhibited rate of convergence matches that of the original Adam on non-compositional stochastic optimization. We test SCSC using the portfolio management and model-agnostic meta-learning tasks.
    On the Connections between Counterfactual Explanations and Adversarial Examples. (arXiv:2106.09992v1 [cs.LG])
    (2 min) Counterfactual explanations and adversarial examples have emerged as critical research areas for addressing the explainability and robustness goals of machine learning (ML). While counterfactual explanations were developed with the goal of providing recourse to individuals adversely impacted by algorithmic decisions, adversarial examples were designed to expose the vulnerabilities of ML models. While prior research has hinted at the commonalities between these frameworks, there has been little to no work on systematically exploring the connections between the literature on counterfactual explanations and adversarial examples. In this work, we make one of the first attempts at formalizing the connections between counterfactual explanations and adversarial examples. More specifically, we theoretically analyze salient counterfactual explanation and adversarial example generation methods, and highlight the conditions under which they behave similarly. Our analysis demonstrates that several popular counterfactual explanation and adversarial example generation methods such as the ones proposed by Wachter et. al. and Carlini and Wagner (with mean squared error loss), and C-CHVAE and natural adversarial examples by Zhao et. al. are equivalent. We also bound the distance between counterfactual explanations and adversarial examples generated by Wachter et. al. and DeepFool methods for linear models. Finally, we empirically validate our theoretical findings using extensive experimentation with synthetic and real world datasets.
    Learning Mesh-Based Simulation with Graph Networks. (arXiv:2010.03409v4 [cs.LG] UPDATED)
    (2 min) Mesh-based simulations are central to modeling complex physical systems in many disciplines across science and engineering. Mesh representations support powerful numerical integration methods and their resolution can be adapted to strike favorable trade-offs between accuracy and efficiency. However, high-dimensional scientific simulations are very expensive to run, and solvers and parameters must often be tuned individually to each system studied. Here we introduce MeshGraphNets, a framework for learning mesh-based simulations using graph neural networks. Our model can be trained to pass messages on a mesh graph and to adapt the mesh discretization during forward simulation. Our results show it can accurately predict the dynamics of a wide range of physical systems, including aerodynamics, structural mechanics, and cloth. The model's adaptivity supports learning resolution-independent dynamics and can scale to more complex state spaces at test time. Our method is also highly efficient, running 1-2 orders of magnitude faster than the simulation on which it is trained. Our approach broadens the range of problems on which neural network simulators can operate and promises to improve the efficiency of complex, scientific modeling tasks.
    Local Information Agent Modelling in Partially-Observable Environments. (arXiv:2006.09447v3 [cs.LG] UPDATED)
    (2 min) Modelling the behaviours of other agents is essential for understanding how agents interact and making effective decisions. Existing methods for agent modelling commonly assume knowledge of the local observations and chosen actions of the modelled agents during execution. To eliminate this assumption, we extract representations from the local information of the controlled agent using encoder-decoder architectures. Using the observations and actions of the modelled agents during training, our models learn to extract representations about the modelled agents conditioned only on the local observations of the controlled agent. The representations are used to augment the controlled agent's decision policy which is trained via deep reinforcement learning; thus, during execution, the policy does not require access to other agents' information. We provide a comprehensive evaluation and ablations studies in cooperative, competitive and mixed multi-agent environments, showing that our method achieves significantly higher returns than baseline methods which do not use the learned representations.
    Information criteria for non-normalized models. (arXiv:1905.05976v4 [math.ST] UPDATED)
    (2 min) Many statistical models are given in the form of non-normalized densities with an intractable normalization constant. Since maximum likelihood estimation is computationally intensive for these models, several estimation methods have been developed which do not require explicit computation of the normalization constant, such as noise contrastive estimation (NCE) and score matching. However, model selection methods for general non-normalized models have not been proposed so far. In this study, we develop information criteria for non-normalized models estimated by NCE or score matching. They are approximately unbiased estimators of discrepancy measures for non-normalized models. Simulation results and applications to real data demonstrate that the proposed criteria enable selection of the appropriate non-normalized model in a data-driven manner.
    Self-supervised Graph Learning for Recommendation. (arXiv:2010.10783v4 [cs.IR] UPDATED)
    (2 min) Representation learning on user-item graph for recommendation has evolved from using single ID or interaction history to exploiting higher-order neighbors. This leads to the success of graph convolution networks (GCNs) for recommendation such as PinSage and LightGCN. Despite effectiveness, we argue that they suffer from two limitations: (1) high-degree nodes exert larger impact on the representation learning, deteriorating the recommendations of low-degree (long-tail) items; and (2) representations are vulnerable to noisy interactions, as the neighborhood aggregation scheme further enlarges the impact of observed edges. In this work, we explore self-supervised learning on user-item graph, so as to improve the accuracy and robustness of GCNs for recommendation. The idea is to supplement the classical supervised task of recommendation with an auxiliary self-supervised task, which reinforces node representation learning via self-discrimination. Specifically, we generate multiple views of a node, maximizing the agreement between different views of the same node compared to that of other nodes. We devise three operators to generate the views -- node dropout, edge dropout, and random walk -- that change the graph structure in different manners. We term this new learning paradigm as \textit{Self-supervised Graph Learning} (SGL), implementing it on the state-of-the-art model LightGCN. Through theoretical analyses, we find that SGL has the ability of automatically mining hard negatives. Empirical studies on three benchmark datasets demonstrate the effectiveness of SGL, which improves the recommendation accuracy, especially on long-tail items, and the robustness against interaction noises. Our implementations are available at \url{https://github.com/wujcan/SGL}.
    Being Properly Improper. (arXiv:2106.09920v1 [cs.LG])
    (2 min) In today's ML, data can be twisted (changed) in various ways, either for bad or good intent. Such twisted data challenges the founding theory of properness for supervised losses which form the basis for many popular losses for class probability estimation. Unfortunately, at its core, properness ensures that the optimal models also learn the twist. In this paper, we analyse such class probability-based losses when they are stripped off the mandatory properness; we define twist-proper losses as losses formally able to retrieve the optimum (untwisted) estimate off the twists, and show that a natural extension of a half-century old loss introduced by S. Arimoto is twist proper. We then turn to a theory that has provided some of the best off-the-shelf algorithms for proper losses, boosting. Boosting can require access to the derivative of the convex conjugate of a loss to compute examples weights. Such a function can be hard to get, for computational or mathematical reasons; this turns out to be the case for Arimoto's loss. We bypass this difficulty by inverting the problem as follows: suppose a blueprint boosting algorithm is implemented with a general weight update function. What are the losses for which boosting-compliant minimisation happens? Our answer comes as a general boosting algorithm which meets the optimal boosting dependence on the number of calls to the weak learner; when applied to Arimoto's loss, it leads to a simple optimisation algorithm whose performances are showcased on several domains and twists.
    Pseudo-healthy synthesis with pathology disentanglement and adversarial learning. (arXiv:2005.01607v3 [eess.IV] UPDATED)
    (2 min) Pseudo-healthy synthesis is the task of creating a subject-specific `healthy' image from a pathological one. Such images can be helpful in tasks such as anomaly detection and understanding changes induced by pathology and disease. In this paper, we present a model that is encouraged to disentangle the information of pathology from what seems to be healthy. We disentangle what appears to be healthy and where disease is as a segmentation map, which are then recombined by a network to reconstruct the input disease image. We train our models adversarially using either paired or unpaired settings, where we pair disease images and maps when available. We quantitatively and subjectively, with a human study, evaluate the quality of pseudo-healthy images using several criteria. We show in a series of experiments, performed on ISLES, BraTS and Cam-CAN datasets, that our method is better than several baselines and methods from the literature. We also show that due to better training processes we could recover deformations, on surrounding tissue, caused by disease. Our implementation is publicly available at https://github.com/xiat0616/pseudo-healthy-synthesis. This paper has been accepted by Medical Image Analysis: https://doi.org/10.1016/j.media.2020.101719.
    Max-Margin is Dead, Long Live Max-Margin!. (arXiv:2105.15069v2 [cs.LG] UPDATED)
    (2 min) The foundational concept of Max-Margin in machine learning is ill-posed for output spaces with more than two labels such as in structured prediction. In this paper, we show that the Max-Margin loss can only be consistent to the classification task under highly restrictive assumptions on the discrete loss measuring the error between outputs. These conditions are satisfied by distances defined in tree graphs, for which we prove consistency, thus being the first losses shown to be consistent for Max-Margin beyond the binary setting. We finally address these limitations by correcting the concept of Max-Margin and introducing the Restricted-Max-Margin, where the maximization of the loss-augmented scores is maintained, but performed over a subset of the original domain. The resulting loss is also a generalization of the binary support vector machine and it is consistent under milder conditions on the discrete loss.
    Quasi-Global Momentum: Accelerating Decentralized Deep Learning on Heterogeneous Data. (arXiv:2102.04761v2 [cs.LG] UPDATED)
    (2 min) Decentralized training of deep learning models is a key element for enabling data privacy and on-device learning over networks. In realistic learning scenarios, the presence of heterogeneity across different clients' local datasets poses an optimization challenge and may severely deteriorate the generalization performance. In this paper, we investigate and identify the limitation of several decentralized optimization algorithms for different degrees of data heterogeneity. We propose a novel momentum-based method to mitigate this decentralized training difficulty. We show in extensive empirical experiments on various CV/NLP datasets (CIFAR-10, ImageNet, and AG News) and several network topologies (Ring and Social Network) that our method is much more robust to the heterogeneity of clients' data than other existing methods, by a significant improvement in test performance ($1\% \!-\! 20\%$). Our code is publicly available.
    Zero-Shot Federated Learning with New Classes for Audio Classification. (arXiv:2106.10019v1 [cs.LG])
    (2 min) Federated learning is an effective way of extracting insights from different user devices while preserving the privacy of users. However, new classes with completely unseen data distributions can stream across any device in a federated learning setting, whose data cannot be accessed by the global server or other users. To this end, we propose a unified zero-shot framework to handle these aforementioned challenges during federated learning. We simulate two scenarios here -- 1) when the new class labels are not reported by the user, the traditional FL setting is used; 2) when new class labels are reported by the user, we synthesize Anonymized Data Impressions by calculating class similarity matrices corresponding to each device's new classes followed by unsupervised clustering to distinguish between new classes across different users. Moreover, our proposed framework can also handle statistical heterogeneities in both labels and models across the participating users. We empirically evaluate our framework on-device across different communication rounds (FL iterations) with new classes in both local and global updates, along with heterogeneous labels and models, on two widely used audio classification applications -- keyword spotting and urban sound classification, and observe an average deterministic accuracy increase of ~4.041% and ~4.258% respectively.
    LoRMIkA: Local rule-based model interpretability with k-optimal associations. (arXiv:1908.03840v2 [cs.LG] UPDATED)
    (2 min) As we rely more and more on machine learning models for real-life decision-making, being able to understand and trust the predictions becomes ever more important. Local explainer models have recently been introduced to explain the predictions of complex machine learning models at the instance level. In this paper, we propose Local Rule-based Model Interpretability with k-optimal Associations (LoRMIkA), a novel model-agnostic approach that obtains k-optimal association rules from a neighbourhood of the instance to be explained. Compared with other rule-based approaches in the literature, we argue that the most predictive rules are not necessarily the rules that provide the best explanations. Consequently, the LoRMIkA framework provides a flexible way to obtain predictive and interesting rules. It uses an efficient search algorithm guaranteed to find the k-optimal rules with respect to objectives such as confidence, lift, leverage, coverage, and support. It also provides multiple rules which explain the decision and counterfactual rules, which give indications for potential changes to obtain different outputs for given instances. We compare our approach to other state-of-the-art approaches in local model interpretability on three different datasets and achieve competitive results in terms of local accuracy and interpretability.
    It's FLAN time! Summing feature-wise latent representations for interpretability. (arXiv:2106.10086v1 [cs.LG])
    (2 min) Interpretability has become a necessary feature for machine learning models deployed in critical scenarios, e.g. legal systems, healthcare. In these situations, algorithmic decisions may have (potentially negative) long-lasting effects on the end-user affected by the decision. In many cases, the representational power of deep learning models is not needed, therefore simple and interpretable models (e.g. linear models) should be preferred. However, in high-dimensional and/or complex domains (e.g. computer vision), the universal approximation capabilities of neural networks is required. Inspired by linear models and the Kolmogorov-Arnol representation theorem, we propose a novel class of structurally-constrained neural networks, which we call FLANs (Feature-wise Latent Additive Networks). Crucially, FLANs process each input feature separately, computing for each of them a representation in a common latent space. These feature-wise latent representations are then simply summed, and the aggregated representation is used for prediction. These constraints (which are at the core of the interpretability of linear models) allow an user to estimate the effect of each individual feature independently from the others, enhancing interpretability. In a set of experiments across different domains, we show how without compromising excessively the test performance, the structural constraints proposed in FLANs indeed increase the interpretability of deep learning models.
    Combining Pseudo-Point and State Space Approximations for Sum-Separable Gaussian Processes. (arXiv:2106.10210v1 [cs.LG])
    (2 min) Gaussian processes (GPs) are important probabilistic tools for inference and learning in spatio-temporal modelling problems such as those in climate science and epidemiology. However, existing GP approximations do not simultaneously support large numbers of off-the-grid spatial data-points and long time-series which is a hallmark of many applications. Pseudo-point approximations, one of the gold-standard methods for scaling GPs to large data sets, are well suited for handling off-the-grid spatial data. However, they cannot handle long temporal observation horizons effectively reverting to cubic computational scaling in the time dimension. State space GP approximations are well suited to handling temporal data, if the temporal GP prior admits a Markov form, leading to linear complexity in the number of temporal observations, but have a cubic spatial cost and cannot handle off-the-grid spatial data. In this work we show that there is a simple and elegant way to combine pseudo-point methods with the state space GP approximation framework to get the best of both worlds. The approach hinges on a surprising conditional independence property which applies to space--time separable GPs. We demonstrate empirically that the combined approach is more scalable and applicable to a greater range of spatio-temporal problems than either method on its own.
    Federated Robustness Propagation: Sharing Adversarial Robustness in Federated Learning. (arXiv:2106.10196v1 [cs.LG])
    (2 min) Federated learning (FL) emerges as a popular distributed learning schema that learns a model from a set of participating users without requiring raw data to be shared. One major challenge of FL comes from heterogeneity in users, which may have distributionally different (or non-iid) data and varying computation resources. Just like in centralized learning, FL users also desire model robustness against malicious attackers at test time. Whereas adversarial training (AT) provides a sound solution for centralized learning, extending its usage for FL users has imposed significant challenges, as many users may have very limited training data as well as tight computational budgets, to afford the data-hungry and costly AT. In this paper, we study a novel learning setting that propagates adversarial robustness from high-resource users that can afford AT, to those low-resource users that cannot afford it, during the FL process. We show that existing FL techniques cannot effectively propagate adversarial robustness among non-iid users, and propose a simple yet effective propagation approach that transfers robustness through carefully designed batch-normalization statistics. We demonstrate the rationality and effectiveness of our method through extensive experiments. Especially, the proposed method is shown to grant FL remarkable robustness even when only a small portion of users afford AT during learning. Codes will be published upon acceptance.
    A Note on Optimizing Distributions using Kernel Mean Embeddings. (arXiv:2106.09994v1 [cs.LG])
    (2 min) Kernel mean embeddings are a popular tool that consists in representing probability measures by their infinite-dimensional mean embeddings in a reproducing kernel Hilbert space. When the kernel is characteristic, mean embeddings can be used to define a distance between probability measures, known as the maximum mean discrepancy (MMD). A well-known advantage of mean embeddings and MMD is their low computational cost and low sample complexity. However, kernel mean embeddings have had limited applications to problems that consist in optimizing distributions, due to the difficulty of characterizing which Hilbert space vectors correspond to a probability distribution. In this note, we propose to leverage the kernel sums-of-squares parameterization of positive functions of Marteau-Ferey et al. [2020] to fit distributions in the MMD geometry. First, we show that when the kernel is characteristic, distributions with a kernel sum-of-squares density are dense. Then, we provide algorithms to optimize such distributions in the finite-sample setting, which we illustrate in a density fitting numerical experiment.
    On Effects of Compression with Hyperdimensional Computing in Distributed Randomized Neural Networks. (arXiv:2106.09831v1 [cs.LG])
    (2 min) A change of the prevalent supervised learning techniques is foreseeable in the near future: from the complex, computational expensive algorithms to more flexible and elementary training ones. The strong revitalization of randomized algorithms can be framed in this prospect steering. We recently proposed a model for distributed classification based on randomized neural networks and hyperdimensional computing, which takes into account cost of information exchange between agents using compression. The use of compression is important as it addresses the issues related to the communication bottleneck, however, the original approach is rigid in the way the compression is used. Therefore, in this work, we propose a more flexible approach to compression and compare it to conventional compression algorithms, dimensionality reduction, and quantization techniques.
    Batch Multi-Fidelity Bayesian Optimization with Deep Auto-Regressive Networks. (arXiv:2106.09884v1 [cs.LG])
    (2 min) Bayesian optimization (BO) is a powerful approach for optimizing black-box, expensive-to-evaluate functions. To enable a flexible trade-off between the cost and accuracy, many applications allow the function to be evaluated at different fidelities. In order to reduce the optimization cost while maximizing the benefit-cost ratio, in this paper, we propose Batch Multi-fidelity Bayesian Optimization with Deep Auto-Regressive Networks (BMBO-DARN). We use a set of Bayesian neural networks to construct a fully auto-regressive model, which is expressive enough to capture strong yet complex relationships across all the fidelities, so as to improve the surrogate learning and optimization performance. Furthermore, to enhance the quality and diversity of queries, we develop a simple yet efficient batch querying method, without any combinatorial search over the fidelities. We propose a batch acquisition function based on Max-value Entropy Search (MES) principle, which penalizes highly correlated queries and encourages diversity. We use posterior samples and moment matching to fulfill efficient computation of the acquisition function and conduct alternating optimization over every fidelity-input pair, which guarantees an improvement at each step. We demonstrate the advantage of our approach on four real-world hyperparameter optimization applications.
    PyKale: Knowledge-Aware Machine Learning from Multiple Sources in Python. (arXiv:2106.09756v1 [cs.LG])
    (2 min) Machine learning is a general-purpose technology holding promises for many interdisciplinary research problems. However, significant barriers exist in crossing disciplinary boundaries when most machine learning tools are developed in different areas separately. We present Pykale - a Python library for knowledge-aware machine learning on graphs, images, texts, and videos to enable and accelerate interdisciplinary research. We formulate new green machine learning guidelines based on standard software engineering practices and propose a novel pipeline-based application programming interface (API). PyKale focuses on leveraging knowledge from multiple sources for accurate and interpretable prediction, thus supporting multimodal learning and transfer learning (particularly domain adaptation) with latest deep learning and dimensionality reduction models. We build PyKale on PyTorch and leverage the rich PyTorch ecosystem. Our pipeline-based API design enforces standardization and minimalism, embracing green machine learning concepts via reducing repetitions and redundancy, reusing existing resources, and recycling learning models across areas. We demonstrate its interdisciplinary nature via examples in bioinformatics, knowledge graph, image/video recognition, and medical imaging.
    Dual-Teacher Class-Incremental Learning With Data-Free Generative Replay. (arXiv:2106.09835v1 [cs.CV])
    (2 min) This paper proposes two novel knowledge transfer techniques for class-incremental learning (CIL). First, we propose data-free generative replay (DF-GR) to mitigate catastrophic forgetting in CIL by using synthetic samples from a generative model. In the conventional generative replay, the generative model is pre-trained for old data and shared in extra memory for later incremental learning. In our proposed DF-GR, we train a generative model from scratch without using any training data, based on the pre-trained classification model from the past, so we curtail the cost of sharing pre-trained generative models. Second, we introduce dual-teacher information distillation (DT-ID) for knowledge distillation from two teachers to one student. In CIL, we use DT-ID to learn new classes incrementally based on the pre-trained model for old classes and another model (pre-)trained on the new data for new classes. We implemented the proposed schemes on top of one of the state-of-the-art CIL methods and showed the performance improvement on CIFAR-100 and ImageNet datasets.
    BinarizedAttack: Structural Poisoning Attacks to Graph-based Anomaly Detection. (arXiv:2106.09989v1 [cs.LG])
    (2 min) Graph-based Anomaly Detection (GAD) is becoming prevalent due to the powerful representation abilities of graphs as well as recent advances in graph mining techniques. These GAD tools, however, expose a new attacking surface, ironically due to their unique advantage of being able to exploit the relations among data. That is, attackers now can manipulate those relations (i.e., the structure of the graph) to allow some target nodes to evade detection. In this paper, we exploit this vulnerability by designing a new type of targeted structural poisoning attacks to a representative regression-based GAD system termed OddBall. Specially, we formulate the attack against OddBall as a bi-level optimization problem, where the key technical challenge is to efficiently solve the problem in a discrete domain. We propose a novel attack method termed BinarizedAttack based on gradient descent. Comparing to prior arts, BinarizedAttack can better use the gradient information, making it particularly suitable for solving combinatorial optimization problems. Furthermore, we investigate the attack transferability of BinarizedAttack by employing it to attack other representation-learning-based GAD systems. Our comprehensive experiments demonstrate that BinarizedAttack is very effective in enabling target nodes to evade graph-based anomaly detection tools with limited attackers' budget, and in the black-box transfer attack setting, BinarizedAttack is also tested effective and in particular, can significantly change the node embeddings learned by the GAD systems. Our research thus opens the door to studying a new type of attack against security analytic tools that rely on graph data.
    Active Finite Reward Automaton Inference and Reinforcement Learning Using Queries and Counterexamples. (arXiv:2006.15714v3 [cs.LG] UPDATED)
    (2 min) Despite the fact that deep reinforcement learning (RL) has surpassed human-level performances in various tasks, it still has several fundamental challenges. First, most RL methods require intensive data from the exploration of the environment to achieve satisfactory performance. Second, the use of neural networks in RL renders it hard to interpret the internals of the system in a way that humans can understand. To address these two challenges, we propose a framework that enables an RL agent to reason over its exploration process and distill high-level knowledge for effectively guiding its future explorations. Specifically, we propose a novel RL algorithm that learns high-level knowledge in the form of a finite reward automaton by using the L* learning algorithm. We prove that in episodic RL, a finite reward automaton can express any non-Markovian bounded reward functions with finitely many reward values and approximate any non-Markovian bounded reward function (with infinitely many reward values) with arbitrary precision. We also provide a lower bound for the episode length such that the proposed RL approach almost surely converges to an optimal policy in the limit. We test this approach on two RL environments with non-Markovian reward functions, choosing a variety of tasks with increasing complexity for each environment. We compare our algorithm with the state-of-the-art RL algorithms for non-Markovian reward functions, such as Joint Inference of Reward machines and Policies for RL (JIRP), Learning Reward Machine (LRM), and Proximal Policy Optimization (PPO2). Our results show that our algorithm converges to an optimal policy faster than other baseline methods.
    Few-Shot Semantic Segmentation Augmented with Image-Level Weak Annotations. (arXiv:2007.01496v2 [cs.CV] UPDATED)
    (2 min) Despite the great progress made by deep neural networks in the semantic segmentation task, traditional neural-networkbased methods typically suffer from a shortage of large amounts of pixel-level annotations. Recent progress in fewshot semantic segmentation tackles the issue by only a few pixel-level annotated examples. However, these few-shot approaches cannot easily be applied to multi-way or weak annotation settings. In this paper, we advance the few-shot segmentation paradigm towards a scenario where image-level annotations are available to help the training process of a few pixel-level annotations. Our key idea is to learn a better prototype representation of the class by fusing the knowledge from the image-level labeled data. Specifically, we propose a new framework, called PAIA, to learn the class prototype representation in a metric space by integrating image-level annotations. Furthermore, by considering the uncertainty of pseudo-masks, a distilled soft masked average pooling strategy is designed to handle distractions in image-level annotations. Extensive empirical results on two datasets show superior performance of PAIA.
    MADE: Exploration via Maximizing Deviation from Explored Regions. (arXiv:2106.10268v1 [cs.LG])
    (2 min) In online reinforcement learning (RL), efficient exploration remains particularly challenging in high-dimensional environments with sparse rewards. In low-dimensional environments, where tabular parameterization is possible, count-based upper confidence bound (UCB) exploration methods achieve minimax near-optimal rates. However, it remains unclear how to efficiently implement UCB in realistic RL tasks that involve non-linear function approximation. To address this, we propose a new exploration approach via \textit{maximizing} the deviation of the occupancy of the next policy from the explored regions. We add this term as an adaptive regularizer to the standard RL objective to balance exploration vs. exploitation. We pair the new objective with a provably convergent algorithm, giving rise to a new intrinsic reward that adjusts existing bonuses. The proposed intrinsic reward is easy to implement and combine with other existing RL algorithms to conduct exploration. As a proof of concept, we evaluate the new intrinsic reward on tabular examples across a variety of model-based and model-free algorithms, showing improvements over count-only exploration strategies. When tested on navigation and locomotion tasks from MiniGrid and DeepMind Control Suite benchmarks, our approach significantly improves sample efficiency over state-of-the-art methods. Our code is available at https://github.com/tianjunz/MADE.
    A Unified Generative Adversarial Network Training via Self-Labeling and Self-Attention. (arXiv:2106.09914v1 [cs.LG])
    (2 min) We propose a novel GAN training scheme that can handle any level of labeling in a unified manner. Our scheme introduces a form of artificial labeling that can incorporate manually defined labels, when available, and induce an alignment between them. To define the artificial labels, we exploit the assumption that neural network generators can be trained more easily to map nearby latent vectors to data with semantic similarities, than across separate categories. We use generated data samples and their corresponding artificial conditioning labels to train a classifier. The classifier is then used to self-label real data. To boost the accuracy of the self-labeling, we also use the exponential moving average of the classifier. However, because the classifier might still make mistakes, especially at the beginning of the training, we also refine the labels through self-attention, by using the labeling of real data samples only when the classifier outputs a high classification probability score. We evaluate our approach on CIFAR-10, STL-10 and SVHN, and show that both self-labeling and self-attention consistently improve the quality of generated data. More surprisingly, we find that the proposed scheme can even outperform class-conditional GANs.
    RobustSleepNet: Transfer learning for automated sleep staging at scale. (arXiv:2101.02452v2 [stat.ML] UPDATED)
    (2 min) Sleep disorder diagnosis relies on the analysis of polysomnography (PSG) records. As a preliminary step of this examination, sleep stages are systematically determined. In practice, sleep stage classification relies on the visual inspection of 30-second epochs of polysomnography signals. Numerous automatic approaches have been developed to replace this tedious and expensive task. Although these methods demonstrated better performance than human sleep experts on specific datasets, they remain largely unused in sleep clinics. The main reason is that each sleep clinic uses a specific PSG montage that most automatic approaches cannot handle out-of-the-box. Moreover, even when the PSG montage is compatible, publications have shown that automatic approaches perform poorly on unseen data with different demographics. To address these issues, we introduce RobustSleepNet, a deep learning model for automatic sleep stage classification able to handle arbitrary PSG montages. We trained and evaluated this model in a leave-one-out-dataset fashion on a large corpus of 8 heterogeneous sleep staging datasets to make it robust to demographic changes. When evaluated on an unseen dataset, RobustSleepNet reaches 97% of the F1 of a model explicitly trained on this dataset. Hence, RobustSleepNet unlocks the possibility to perform high-quality out-of-the-box automatic sleep staging with any clinical setup. We further show that finetuning RobustSleepNet, using a part of the unseen dataset, increases the F1 by 2% when compared to a model trained specifically for this dataset. Therefore, finetuning might be used to reach a state-of-the-art level of performance on a specific population.
    Partition-Guided GANs. (arXiv:2104.00816v2 [cs.LG] UPDATED)
    (2 min) Despite the success of Generative Adversarial Networks (GANs), their training suffers from several well-known problems, including mode collapse and difficulties learning a disconnected set of manifolds. In this paper, we break down the challenging task of learning complex high dimensional distributions, supporting diverse data samples, to simpler sub-tasks. Our solution relies on designing a partitioner that breaks the space into smaller regions, each having a simpler distribution, and training a different generator for each partition. This is done in an unsupervised manner without requiring any labels. We formulate two desired criteria for the space partitioner that aid the training of our mixture of generators: 1) to produce connected partitions and 2) provide a proxy of distance between partitions and data samples, along with a direction for reducing that distance. These criteria are developed to avoid producing samples from places with non-existent data density, and also facilitate training by providing additional direction to the generators. We develop theoretical constraints for a space partitioner to satisfy the above criteria. Guided by our theoretical analysis, we design an effective neural architecture for the space partitioner that empirically assures these conditions. Experimental results on various standard benchmarks show that the proposed unsupervised model outperforms several recent methods.
    Graph Context Encoder: Graph Feature Inpainting for Graph Generation and Self-supervised Pretraining. (arXiv:2106.10124v1 [cs.LG])
    (2 min) We propose the Graph Context Encoder (GCE), a simple but efficient approach for graph representation learning based on graph feature masking and reconstruction. GCE models are trained to efficiently reconstruct input graphs similarly to a graph autoencoder where node and edge labels are masked. In particular, our model is also allowed to change graph structures by masking and reconstructing graphs augmented by random pseudo-edges. We show that GCE can be used for novel graph generation, with applications for molecule generation. Used as a pretraining method, we also show that GCE improves baseline performances in supervised classification tasks tested on multiple standard benchmark graph datasets.
    Model Generalization in Deep Learning Applications for Land Cover Mapping. (arXiv:2008.10351v3 [cs.CV] UPDATED)
    (2 min) Recent work has shown that deep learning models can be used to classify land-use data from geospatial satellite imagery. We show that when these deep learning models are trained on data from specific continents/seasons, there is a high degree of variability in model performance on out-of-sample continents/seasons. This suggests that just because a model accurately predicts land-use classes in one continent or season does not mean that the model will accurately predict land-use classes in a different continent or season. We then use clustering techniques on satellite imagery from different continents to visualize the differences in landscapes that make geospatial generalization particularly difficult, and summarize our takeaways for future satellite imagery-related applications.
    FedADC: Accelerated Federated Learning with Drift Control. (arXiv:2012.09102v2 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) has become de facto framework for collaborative learning among edge devices with privacy concern. The core of the FL strategy is the use of stochastic gradient descent (SGD) in a distributed manner. Large scale implementation of FL brings new challenges, such as the incorporation of acceleration techniques designed for SGD into the distributed setting, and mitigation of the drift problem due to non-homogeneous distribution of local datasets. These two problems have been separately studied in the literature; whereas, in this paper, we show that it is possible to address both problems using a single strategy without any major alteration to the FL framework, or introducing additional computation and communication load. To achieve this goal, we propose FedADC, which is an accelerated FL algorithm with drift control. We empirically illustrate the advantages of FedADC.
    FinGAT: Financial Graph Attention Networks for Recommending Top-K Profitable Stocks. (arXiv:2106.10159v1 [cs.LG])
    (2 min) Financial technology (FinTech) has drawn much attention among investors and companies. While conventional stock analysis in FinTech targets at predicting stock prices, less effort is made for profitable stock recommendation. Besides, in existing approaches on modeling time series of stock prices, the relationships among stocks and sectors (i.e., categories of stocks) are either neglected or pre-defined. Ignoring stock relationships will miss the information shared between stocks while using pre-defined relationships cannot depict the latent interactions or influence of stock prices between stocks. In this work, we aim at recommending the top-K profitable stocks in terms of return ratio using time series of stock prices and sector information. We propose a novel deep learning-based model, Financial Graph Attention Networks (FinGAT), to tackle the task under the setting that no pre-defined relationships between stocks are given. The idea of FinGAT is three-fold. First, we devise a hierarchical learning component to learn short-term and long-term sequential patterns from stock time series. Second, a fully-connected graph between stocks and a fully-connected graph between sectors are constructed, along with graph attention networks, to learn the latent interactions among stocks and sectors. Third, a multi-task objective is devised to jointly recommend the profitable stocks and predict the stock movement. Experiments conducted on Taiwan Stock, S&P 500, and NASDAQ datasets exhibit remarkable recommendation performance of our FinGAT, comparing to state-of-the-art methods.
    PACOH: Bayes-Optimal Meta-Learning with PAC-Guarantees. (arXiv:2002.05551v5 [stat.ML] UPDATED)
    (2 min) Meta-learning can successfully acquire useful inductive biases from data. Yet, its generalization properties to unseen learning tasks are poorly understood. Particularly if the number of meta-training tasks is small, this raises concerns about overfitting. We provide a theoretical analysis using the PAC-Bayesian framework and derive novel generalization bounds for meta-learning. Using these bounds, we develop a class of PAC-optimal meta-learning algorithms with performance guarantees and a principled meta-level regularization. Unlike previous PAC-Bayesian meta-learners, our method results in a standard stochastic optimization problem which can be solved efficiently and scales well. When instantiating our PAC-optimal hyper-posterior (PACOH) with Gaussian processes and Bayesian Neural Networks as base learners, the resulting methods yield state-of-the-art performance, both in terms of predictive accuracy and the quality of uncertainty estimates. Thanks to their principled treatment of uncertainty, our meta-learners can also be successfully employed for sequential decision problems.
    Message Passing in Graph Convolution Networks via Adaptive Filter Banks. (arXiv:2106.09910v1 [cs.LG])
    (2 min) Graph convolution networks, like message passing graph convolution networks (MPGCNs), have been a powerful tool in representation learning of networked data. However, when data is heterogeneous, most architectures are limited as they employ a single strategy to handle multi-channel graph signals and they typically focus on low-frequency information. In this paper, we present a novel graph convolution operator, termed BankGCN, which keeps benefits of message passing models, but extends their capabilities beyond `low-pass' features. It decomposes multi-channel signals on graphs into subspaces and handles particular information in each subspace with an adapted filter. The filters of all subspaces have different frequency responses and together form a filter bank. Furthermore, each filter in the spectral domain corresponds to a message passing scheme, and diverse schemes are implemented via the filter bank. Importantly, the filter bank and the signal decomposition are jointly learned to adapt to the spectral characteristics of data and to target applications. Furthermore, this is implemented almost without extra parameters in comparison with most existing MPGCNs. Experimental results show that the proposed convolution operator permits to achieve excellent performance in graph classification on a collection of benchmark graph datasets.
    Bridging the Gap Between Object Detection and User Intent via Query-Modulation. (arXiv:2106.10258v1 [cs.CV])
    (2 min) When interacting with objects through cameras, or pictures, users often have a specific intent. For example, they may want to perform a visual search. However, most object detection models ignore the user intent, relying on image pixels as their only input. This often leads to incorrect results, such as lack of a high-confidence detection on the object of interest, or detection with a wrong class label. In this paper we investigate techniques to modulate standard object detectors to explicitly account for the user intent, expressed as an embedding of a simple query. Compared to standard object detectors, query-modulated detectors show superior performance at detecting objects for a given label of interest. Thanks to large-scale training data synthesized from standard object detection annotations, query-modulated detectors can also outperform specialized referring expression recognition systems. Furthermore, they can be simultaneously trained to solve for both query-modulated detection and standard object detection.
    Nonparametric Hamiltonian Monte Carlo. (arXiv:2106.10238v1 [cs.LG])
    (2 min) Probabilistic programming uses programs to express generative models whose posterior probability is then computed by built-in inference engines. A challenging goal is to develop general purpose inference algorithms that work out-of-the-box for arbitrary programs in a universal probabilistic programming language (PPL). The densities defined by such programs, which may use stochastic branching and recursion, are (in general) nonparametric, in the sense that they correspond to models on an infinite-dimensional parameter space. However standard inference algorithms, such as the Hamiltonian Monte Carlo (HMC) algorithm, target distributions with a fixed number of parameters. This paper introduces the Nonparametric Hamiltonian Monte Carlo (NP-HMC) algorithm which generalises HMC to nonparametric models. Inputs to NP-HMC are a new class of measurable functions called "tree representable", which serve as a language-independent representation of the density functions of probabilistic programs in a universal PPL. We provide a correctness proof of NP-HMC, and empirically demonstrate significant performance improvements over existing approaches on several nonparametric examples.
    Residual Contrastive Learning for Joint Demosaicking and Denoising. (arXiv:2106.10070v1 [cs.CV])
    (2 min) The breakthrough of contrastive learning (CL) has fueled the recent success of self-supervised learning (SSL) in high-level vision tasks on RGB images. However, CL is still ill-defined for low-level vision tasks, such as joint demosaicking and denoising (JDD), in the RAW domain. To bridge this methodological gap, we present a novel CL approach on RAW images, residual contrastive learning (RCL), which aims to learn meaningful representations for JDD. Our work is built on the assumption that noise contained in each RAW image is signal-dependent, thus two crops from the same RAW image should have more similar noise distribution than two crops from different RAW images. We use residuals as a discriminative feature and the earth mover's distance to measure the distribution divergence for the contrastive loss. To evaluate the proposed CL strategy, we simulate a series of unsupervised JDD experiments with large-scale data corrupted by synthetic signal-dependent noise, where we set a new benchmark for unsupervised JDD tasks with unknown (random) noise variance. Our empirical study not only validates that CL can be applied on distributions (c.f. features), but also exposes the lack of robustness of previous non-ML and SSL JDD methods when the statistics of the noise are unknown, thus providing some further insight into signal-dependent noise problems.
    Consistency of Extreme Learning Machines and Regression under Non-Stationarity and Dependence for ML-Enhanced Moving Objects. (arXiv:2005.11115v3 [stat.ML] UPDATED)
    (2 min) Supervised learning by extreme learning machines resp. neural networks with random weights is studied under a non-stationary spatial-temporal sampling design which especially addresses settings where an autonomous object moving in a non-stationary spatial environment collects and analyzes data. The stochastic model especially allows for spatial heterogeneity and weak dependence. As efficient and computationally cheap learning methods (unconstrained) least squares, ridge regression and $\ell_s$-penalized least squares (including the LASSO) are studied. Consistency and asymptotic normality of the least squares and ridge regression estimates as well as corresponding consistency results for the $\ell_s$-penalty are shown under weak conditions. The resuts also cover bounds for the sample squared predicition error.
    Steerable Partial Differential Operators for Equivariant Neural Networks. (arXiv:2106.10163v1 [cs.LG])
    (2 min) Recent work in equivariant deep learning bears strong similarities to physics. Fields over a base space are fundamental entities in both subjects, as are equivariant maps between these fields. In deep learning, however, these maps are usually defined by convolutions with a kernel, whereas they are partial differential operators (PDOs) in physics. Developing the theory of equivariant PDOs in the context of deep learning could bring these subjects even closer together and lead to a stronger flow of ideas. In this work, we derive a $G$-steerability constraint that completely characterizes when a PDO between feature vector fields is equivariant, for arbitrary symmetry groups $G$. We then fully solve this constraint for several important groups. We use our solutions as equivariant drop-in replacements for convolutional layers and benchmark them in that role. Finally, we develop a framework for equivariant maps based on Schwartz distributions that unifies classical convolutions and differential operators and gives insight about the relation between the two.
    Boltzmann machine learning and regularization methods for inferring evolutionary fields and couplings from a multiple sequence alignment. (arXiv:1909.05006v3 [q-bio.PE] UPDATED)
    (3 min) The inverse Potts problem to infer a Boltzmann distribution for homologous protein sequences from their single-site and pairwise amino acid frequencies recently attracts a great deal of attention in the studies of protein structure and evolution. We study regularization and learning methods and how to tune regularization parameters to correctly infer interactions in Boltzmann machine learning. Using $L_2$ regularization for fields, group $L_1$ for couplings is shown to be very effective for sparse couplings in comparison with $L_2$ and $L_1$. Two regularization parameters are tuned to yield equal values for both the sample and ensemble averages of evolutionary energy. Both averages smoothly change and converge, but their learning profiles are very different between learning methods. The Adam method is modified to make stepsize proportional to the gradient for sparse couplings and to use a soft-thresholding function for group $L_1$. It is shown by first inferring interactions from protein sequences and then from Monte Carlo samples that the fields and couplings can be well recovered, but that recovering the pairwise correlations in the resolution of a total energy is harder for the natural proteins than for the protein-like sequences. Selective temperature for folding/structural constrains in protein evolution is also estimated.
    Adversarial Training Helps Transfer Learning via Better Representations. (arXiv:2106.10189v1 [cs.LG])
    (2 min) Transfer learning aims to leverage models pre-trained on source data to efficiently adapt to target setting, where only limited data are available for model fine-tuning. Recent works empirically demonstrate that adversarial training in the source data can improve the ability of models to transfer to new domains. However, why this happens is not known. In this paper, we provide a theoretical model to rigorously analyze how adversarial training helps transfer learning. We show that adversarial training in the source data generates provably better representations, so fine-tuning on top of this representation leads to a more accurate predictor of the target data. We further demonstrate both theoretically and empirically that semi-supervised learning in the source data can also improve transfer learning by similarly improving the representation. Moreover, performing adversarial training on top of semi-supervised learning can further improve transferability, suggesting that the two approaches have complementary benefits on representations. We support our theories with experiments on popular data sets and deep learning architectures.
    Optimal Change-Point Detection with Training Sequences in the Large and Moderate Deviations Regimes. (arXiv:2003.06511v3 [cs.IT] UPDATED)
    (2 min) This paper investigates a novel offline change-point detection problem from an information-theoretic perspective. In contrast to most related works, we assume that the knowledge of the underlying pre- and post-change distributions are not known and can only be learned from the training sequences which are available. We further require the probability of the \emph{estimation error} to decay either exponentially or sub-exponentially fast (corresponding respectively to the large and moderate deviations regimes in information theory parlance). Based on the training sequences as well as the test sequence consisting of a single change-point, we design a change-point estimator and further show that this estimator is optimal by establishing matching (strong) converses. This leads to a full characterization of the optimal confidence width (i.e., half the width of the confidence interval within which the true change-point is located at with high probability) as a function of the undetected error, under both the large and moderate deviations regimes.
    Accumulative Poisoning Attacks on Real-time Data. (arXiv:2106.09993v1 [cs.LG])
    (2 min) Collecting training data from untrusted sources exposes machine learning services to poisoning adversaries, who maliciously manipulate training data to degrade the model accuracy. When trained on offline datasets, poisoning adversaries have to inject the poisoned data in advance before training, and the order of feeding these poisoned batches into the model is stochastic. In contrast, practical systems are more usually trained/fine-tuned on sequentially captured real-time data, in which case poisoning adversaries could dynamically poison each data batch according to the current model state. In this paper, we focus on the real-time settings and propose a new attacking strategy, which affiliates an accumulative phase with poisoning attacks to secretly (i.e., without affecting accuracy) magnify the destructive effect of a (poisoned) trigger batch. By mimicking online learning and federated learning on CIFAR-10, we show that the model accuracy will significantly drop by a single update step on the trigger batch after the accumulative phase. Our work validates that a well-designed but straightforward attacking strategy can dramatically amplify the poisoning effects, with no need to explore complex techniques.
    Goal-Directed Planning by Reinforcement Learning and Active Inference. (arXiv:2106.09938v1 [cs.LG])
    (2 min) What is the difference between goal-directed and habitual behavior? We propose a novel computational framework of decision making with Bayesian inference, in which everything is integrated as an entire neural network model. The model learns to predict environmental state transitions by self-exploration and generating motor actions by sampling stochastic internal states $z$. Habitual behavior, which is obtained from the prior distribution of $z$, is acquired by reinforcement learning. Goal-directed behavior is determined from the posterior distribution of $z$ by planning, using active inference, to minimize the free energy for goal observation. We demonstrate the effectiveness of the proposed framework by experiments in a sensorimotor navigation task with camera observations and continuous motor actions.
    How to train your ViT? Data, Augmentation, and Regularization in Vision Transformers. (arXiv:2106.10270v1 [cs.CV])
    (2 min) Vision Transformers (ViT) have been shown to attain highly competitive performance for a wide range of vision applications, such as image classification, object detection and semantic image segmentation. In comparison to convolutional neural networks, the Vision Transformer's weaker inductive bias is generally found to cause an increased reliance on model regularization or data augmentation (``AugReg'' for short) when training on smaller training datasets. We conduct a systematic empirical study in order to better understand the interplay between the amount of training data, AugReg, model size and compute budget. As one result of this study we find that the combination of increased compute and AugReg can yield models with the same performance as models trained on an order of magnitude more training data: we train ViT models of various sizes on the public ImageNet-21k dataset which either match or outperform their counterparts trained on the larger, but not publicly available JFT-300M dataset.
    The Dimpled Manifold Model of Adversarial Examples in Machine Learning. (arXiv:2106.10151v1 [cs.LG])
    (2 min) The extreme fragility of deep neural networks when presented with tiny perturbations in their inputs was independently discovered by several research groups in 2013, but in spite of enormous effort these adversarial examples remained a baffling phenomenon with no clear explanation. In this paper we introduce a new conceptual framework (which we call the Dimpled Manifold Model) which provides a simple explanation for why adversarial examples exist, why their perturbations have such tiny norms, why these perturbations look like random noise, and why a network which was adversarially trained with incorrectly labeled images can still correctly classify test images. In the last part of the paper we describe the results of numerous experiments which strongly support this new model, and in particular our assertion that adversarial perturbations are roughly perpendicular to the low dimensional manifold which contains all the training examples.
    World-GAN: a Generative Model for Minecraft Worlds. (arXiv:2106.10155v1 [cs.LG])
    (2 min) This work introduces World-GAN, the first method to perform data-driven Procedural Content Generation via Machine Learning in Minecraft from a single example. Based on a 3D Generative Adversarial Network (GAN) architecture, we are able to create arbitrarily sized world snippets from a given sample. We evaluate our approach on creations from the community as well as structures generated with the Minecraft World Generator. Our method is motivated by the dense representations used in Natural Language Processing (NLP) introduced with word2vec [1]. The proposed block2vec representations make World-GAN independent from the number of different blocks, which can vary a lot in Minecraft, and enable the generation of larger levels. Finally, we demonstrate that changing this new representation space allows us to change the generated style of an already trained generator. World-GAN enables its users to generate Minecraft worlds based on parts of their creations.
    Rational Shapley Values. (arXiv:2106.10191v1 [cs.LG])
    (2 min) Explaining the predictions of opaque machine learning algorithms is an important and challenging task, especially as complex models are increasingly used to assist in high-stakes decisions such as those arising in healthcare and finance. Most popular tools for post-hoc explainable artificial intelligence (XAI) are either insensitive to context (e.g., feature attributions) or difficult to summarize (e.g., counterfactuals). In this paper, I introduce \emph{rational Shapley values}, a novel XAI method that synthesizes and extends these seemingly incompatible approaches in a rigorous, flexible manner. I leverage tools from decision theory and causal modeling to formalize and implement a pragmatic approach that resolves a number of known challenges in XAI. By pairing the distribution of random variables with the appropriate reference class for a given explanation task, I illustrate through theory and experiments how user goals and knowledge can inform and constrain the solution set in an iterative fashion. The method compares favorably to state of the art XAI tools in a range of quantitative and qualitative comparisons.
    The Principles of Deep Learning Theory. (arXiv:2106.10165v1 [cs.LG])
    (2 min) This book develops an effective theory approach to understanding deep neural networks of practical relevance. Beginning from a first-principles component-level picture of networks, we explain how to determine an accurate description of the output of trained networks by solving layer-to-layer iteration equations and nonlinear learning dynamics. A main result is that the predictions of networks are described by nearly-Gaussian distributions, with the depth-to-width aspect ratio of the network controlling the deviations from the infinite-width Gaussian description. We explain how these effectively-deep networks learn nontrivial representations from training and more broadly analyze the mechanism of representation learning for nonlinear models. From a nearly-kernel-methods perspective, we find that the dependence of such models' predictions on the underlying learning algorithm can be expressed in a simple and universal way. To obtain these results, we develop the notion of representation group flow (RG flow) to characterize the propagation of signals through the network. By tuning networks to criticality, we give a practical solution to the exploding and vanishing gradient problem. We further explain how RG flow leads to near-universal behavior and lets us categorize networks built from different activation functions into universality classes. Altogether, we show that the depth-to-width ratio governs the effective model complexity of the ensemble of trained networks. By using information-theoretic techniques, we estimate the optimal aspect ratio at which we expect the network to be practically most useful and show how residual connections can be used to push this scale to arbitrary depths. With these tools, we can learn in detail about the inductive bias of architectures, hyperparameters, and optimizers.
    Investigating the Role of Negatives in Contrastive Representation Learning. (arXiv:2106.09943v1 [cs.LG])
    (2 min) Noise contrastive learning is a popular technique for unsupervised representation learning. In this approach, a representation is obtained via reduction to supervised learning, where given a notion of semantic similarity, the learner tries to distinguish a similar (positive) example from a collection of random (negative) examples. The success of modern contrastive learning pipelines relies on many parameters such as the choice of data augmentation, the number of negative examples, and the batch size; however, there is limited understanding as to how these parameters interact and affect downstream performance. We focus on disambiguating the role of one of these parameters: the number of negative examples. Theoretically, we show the existence of a collision-coverage trade-off suggesting that the optimal number of negative examples should scale with the number of underlying concepts in the data. Empirically, we scrutinize the role of the number of negatives in both NLP and vision tasks. In the NLP task, we find that the results broadly agree with our theory, while our vision experiments are murkier with performance sometimes even being insensitive to the number of negatives. We discuss plausible explanations for this behavior and suggest future directions to better align theory and practice.
    Gradient-free optimization of chaotic acoustics with reservoir computing. (arXiv:2106.09780v1 [physics.flu-dyn])
    (2 min) We develop a versatile optimization method, which finds the design parameters that minimize time-averaged acoustic cost functionals. The method is gradient-free, model-informed, and data-driven with reservoir computing based on echo state networks. First, we analyse the predictive capabilities of echo state networks both in the short- and long-time prediction of the dynamics. We find that both fully data-driven and model-informed architectures learn the chaotic acoustic dynamics, both time-accurately and statistically. Informing the training with a physical reduced-order model with one acoustic mode markedly improves the accuracy and robustness of the echo state networks, whilst keeping the computational cost low. Echo state networks offer accurate predictions of the long-time dynamics, which would be otherwise expensive by integrating the governing equations to evaluate the time-averaged quantity to optimize. Second, we couple echo state networks with a Bayesian technique to explore the design thermoacoustic parameter space. The computational method is minimally intrusive. Third, we find the set of flame parameters that minimize the time-averaged acoustic energy of chaotic oscillations, which are caused by the positive feedback with a heat source, such as a flame in gas turbines or rocket motors. These oscillations are known as thermoacoustic oscillations. The optimal set of flame parameters is found with the same accuracy as brute-force grid search, but with a convergence rate that is more than one order of magnitude faster. This work opens up new possibilities for non-intrusive (``hands-off'') optimization of chaotic systems, in which the cost of generating data, for example from high-fidelity simulations and experiments, is high.
    Being a Bit Frequentist Improves Bayesian Neural Networks. (arXiv:2106.10065v1 [cs.LG])
    (2 min) Despite their compelling theoretical properties, Bayesian neural networks (BNNs) tend to perform worse than frequentist methods in classification-based uncertainty quantification (UQ) tasks such as out-of-distribution (OOD) detection and dataset-shift robustness. In this work, based on empirical findings in prior works, we hypothesize that this issue is due to the avoidance of Bayesian methods in the so-called "OOD training" -- a family of techniques for incorporating OOD data during training process, which has since been an integral part of state-of-the-art frequentist UQ methods. To validate this, we treat OOD data as a first-class citizen in BNN training by exploring four different ways of incorporating OOD data in Bayesian inference. We show in extensive experiments that OOD-trained BNNs are competitive to, if not better than recent frequentist baselines. This work thus provides strong baselines for future work in both Bayesian and frequentist UQ.
    Smoothed Multi-View Subspace Clustering. (arXiv:2106.09875v1 [cs.CV])
    (2 min) In recent years, multi-view subspace clustering has achieved impressive performance due to the exploitation of complementary imformation across multiple views. However, multi-view data can be very complicated and are not easy to cluster in real-world applications. Most existing methods operate on raw data and may not obtain the optimal solution. In this work, we propose a novel multi-view clustering method named smoothed multi-view subspace clustering (SMVSC) by employing a novel technique, i.e., graph filtering, to obtain a smooth representation for each view, in which similar data points have similar feature values. Specifically, it retains the graph geometric features through applying a low-pass filter. Consequently, it produces a ``clustering-friendly" representation and greatly facilitates the downstream clustering task. Extensive experiments on benchmark datasets validate the superiority of our approach. Analysis shows that graph filtering increases the separability of classes.
    Towards Clustering-friendly Representations: Subspace Clustering via Graph Filtering. (arXiv:2106.09874v1 [cs.CV])
    (2 min) Finding a suitable data representation for a specific task has been shown to be crucial in many applications. The success of subspace clustering depends on the assumption that the data can be separated into different subspaces. However, this simple assumption does not always hold since the raw data might not be separable into subspaces. To recover the ``clustering-friendly'' representation and facilitate the subsequent clustering, we propose a graph filtering approach by which a smooth representation is achieved. Specifically, it injects graph similarity into data features by applying a low-pass filter to extract useful data representations for clustering. Extensive experiments on image and document clustering datasets demonstrate that our method improves upon state-of-the-art subspace clustering techniques. Especially, its comparable performance with deep learning methods emphasizes the effectiveness of the simple graph filtering scheme for many real-world applications. An ablation study shows that graph filtering can remove noise, preserve structure in the image, and increase the separability of classes.
    Boolean Matrix Factorization with SAT and MaxSAT. (arXiv:2106.10105v1 [cs.LG])
    (2 min) The Boolean matrix factorization problem consists in approximating a matrix by the Boolean product of two smaller Boolean matrices. To obtain optimal solutions when the matrices to be factorized are small, we propose SAT and MaxSAT encoding; however, when the matrices to be factorized are large, we propose a heuristic based on the search for maximal biclique edge cover. We experimentally demonstrate that our approaches allow a better factorization than existing approaches while keeping reasonable computation times. Our methods also allow the handling of incomplete matrices with missing entries.
    Efficient Self-supervised Vision Transformers for Representation Learning. (arXiv:2106.09785v1 [cs.CV])
    (2 min) This paper investigates two techniques for developing efficient self-supervised vision transformers (EsViT) for visual representation learning. First, we show through a comprehensive empirical study that multi-stage architectures with sparse self-attentions can significantly reduce modeling complexity but with a cost of losing the ability to capture fine-grained correspondences between image regions. Second, we propose a new pre-training task of region matching which allows the model to capture fine-grained region dependencies and as a result significantly improves the quality of the learned vision representations. Our results show that combining the two techniques, EsViT achieves 81.3% top-1 on the ImageNet linear probe evaluation, outperforming prior arts with around an order magnitude of higher throughput. When transferring to downstream linear classification tasks, EsViT outperforms its supervised counterpart on 17 out of 18 datasets. The code and models will be publicly available.
    Many Agent Reinforcement Learning Under Partial Observability. (arXiv:2106.09825v1 [cs.LG])
    (2 min) Recent renewed interest in multi-agent reinforcement learning (MARL) has generated an impressive array of techniques that leverage deep reinforcement learning, primarily actor-critic architectures, and can be applied to a limited range of settings in terms of observability and communication. However, a continuing limitation of much of this work is the curse of dimensionality when it comes to representations based on joint actions, which grow exponentially with the number of agents. In this paper, we squarely focus on this challenge of scalability. We apply the key insight of action anonymity, which leads to permutation invariance of joint actions, to two recently presented deep MARL algorithms, MADDPG and IA2C, and compare these instantiations to another recent technique that leverages action anonymity, viz., mean-field MARL. We show that our instantiations can learn the optimal behavior in a broader class of agent networks than the mean-field method, using a recently introduced pragmatic domain.
    Evolving GANs: When Contradictions Turn into Compliance. (arXiv:2106.09946v1 [cs.LG])
    (2 min) Limited availability of labeled-data makes any supervised learning problem challenging. Alternative learning settings like semi-supervised and universum learning alleviate the dependency on labeled data, but still require a large amount of unlabeled data, which may be unavailable or expensive to acquire. GAN-based synthetic data generation methods have recently shown promise by generating synthetic samples to improve task at hand. However, these samples cannot be used for other purposes. In this paper, we propose a GAN game which provides improved discriminator accuracy under limited data settings, while generating realistic synthetic data. This provides the added advantage that now the generated data can be used for other similar tasks. We provide the theoretical guarantees and empirical results in support of our approach.
    Heuristic Stopping Rules For Technology-Assisted Review. (arXiv:2106.09871v1 [cs.IR])
    (2 min) Technology-assisted review (TAR) refers to human-in-the-loop active learning workflows for finding relevant documents in large collections. These workflows often must meet a target for the proportion of relevant documents found (i.e. recall) while also holding down costs. A variety of heuristic stopping rules have been suggested for striking this tradeoff in particular settings, but none have been tested against a range of recall targets and tasks. We propose two new heuristic stopping rules, Quant and QuantCI based on model-based estimation techniques from survey research. We compare them against a range of proposed heuristics and find they are accurate at hitting a range of recall targets while substantially reducing review costs.
    Bad Characters: Imperceptible NLP Attacks. (arXiv:2106.09898v1 [cs.CL])
    (2 min) Several years of research have shown that machine-learning systems are vulnerable to adversarial examples, both in theory and in practice. Until now, such attacks have primarily targeted visual models, exploiting the gap between human and machine perception. Although text-based models have also been attacked with adversarial examples, such attacks struggled to preserve semantic meaning and indistinguishability. In this paper, we explore a large class of adversarial examples that can be used to attack text-based models in a black-box setting without making any human-perceptible visual modification to inputs. We use encoding-specific perturbations that are imperceptible to the human eye to manipulate the outputs of a wide range of Natural Language Processing (NLP) systems from neural machine-translation pipelines to web search engines. We find that with a single imperceptible encoding injection -- representing one invisible character, homoglyph, reordering, or deletion -- an attacker can significantly reduce the performance of vulnerable models, and with three injections most models can be functionally broken. Our attacks work against currently-deployed commercial systems, including those produced by Microsoft and Google, in addition to open source models published by Facebook and IBM. This novel series of attacks presents a significant threat to many language processing systems: an attacker can affect systems in a targeted manner without any assumptions about the underlying model. We conclude that text-based NLP systems require careful input sanitization, just like conventional applications, and that given such systems are now being deployed rapidly at scale, the urgent attention of architects and operators is required.
    ScoreGrad: Multivariate Probabilistic Time Series Forecasting with Continuous Energy-based Generative Models. (arXiv:2106.10121v1 [cs.LG])
    (2 min) Multivariate time series prediction has attracted a lot of attention because of its wide applications such as intelligence transportation, AIOps. Generative models have achieved impressive results in time series modeling because they can model data distribution and take noise into consideration. However, many existing works can not be widely used because of the constraints of functional form of generative models or the sensitivity to hyperparameters. In this paper, we propose ScoreGrad, a multivariate probabilistic time series forecasting framework based on continuous energy-based generative models. ScoreGrad is composed of time series feature extraction module and conditional stochastic differential equation based score matching module. The prediction can be achieved by iteratively solving reverse-time SDE. To the best of our knowledge, ScoreGrad is the first continuous energy based generative model used for time series forecasting. Furthermore, ScoreGrad achieves state-of-the-art results on six real-world datasets. The impact of hyperparameters and sampler types on the performance are also explored. Code is available at https://github.com/yantijin/ScoreGradPred.
    Fusion of Embeddings Networks for Robust Combination of Text Dependent and Independent Speaker Recognition. (arXiv:2106.10169v1 [cs.LG])
    (2 min) By implicitly recognizing a user based on his/her speech input, speaker identification enables many downstream applications, such as personalized system behavior and expedited shopping checkouts. Based on whether the speech content is constrained or not, both text-dependent (TD) and text-independent (TI) speaker recognition models may be used. We wish to combine the advantages of both types of models through an ensemble system to make more reliable predictions. However, any such combined approach has to be robust to incomplete inputs, i.e., when either TD or TI input is missing. As a solution we propose a fusion of embeddings network foenet architecture, combining joint learning with neural attention. We compare foenet with four competitive baseline methods on a dataset of voice assistant inputs, and show that it achieves higher accuracy than the baseline and score fusion methods, especially in the presence of incomplete inputs.
    AI-Enabled Ultra-Low-Dose CT Reconstruction. (arXiv:2106.09834v1 [eess.IV])
    (2 min) By the ALARA (As Low As Reasonably Achievable) principle, ultra-low-dose CT reconstruction is a holy grail to minimize cancer risks and genetic damages, especially for children. With the development of medical CT technologies, the iterative algorithms are widely used to reconstruct decent CT images from a low-dose scan. Recently, artificial intelligence (AI) techniques have shown a great promise in further reducing CT radiation dose to the next level. In this paper, we demonstrate that AI-powered CT reconstruction offers diagnostic image quality at an ultra-low-dose level comparable to that of radiography. Specifically, here we develop a Split Unrolled Grid-like Alternative Reconstruction (SUGAR) network, in which deep learning, physical modeling and image prior are integrated. The reconstruction results from clinical datasets show that excellent images can be reconstructed using SUGAR from 36 projections. This approach has a potential to change future healthcare.
    Effective Model Sparsification by Scheduled Grow-and-Prune Methods. (arXiv:2106.09857v1 [cs.CV])
    (2 min) Deep neural networks (DNNs) are effective in solving many real-world problems. Larger DNN models usually exhibit better quality (e.g., accuracy) but their excessive computation results in long training and inference time. Model sparsification can reduce the computation and memory cost while maintaining model quality. Most existing sparsification algorithms unidirectionally remove weights, while others randomly or greedily explore a small subset of weights in each layer. The inefficiency of the algorithms reduces the achievable sparsity level. In addition, many algorithms still require pre-trained dense models and thus suffer from large memory footprint and long training time. In this paper, we propose a novel scheduled grow-and-prune (GaP) methodology without pre-training the dense models. It addresses the shortcomings of the previous works by repeatedly growing a subset of layers to dense and then pruning back to sparse after some training. Experiments have shown that such models can match or beat the quality of highly optimized dense models at 80% sparsity on a variety of tasks, such as image classification, objective detection, 3D object part segmentation, and translation. They also outperform other state-of-the-art (SOTA) pruning methods, including pruning from pre-trained dense models. As an example, a 90% sparse ResNet-50 obtained via GaP achieves 77.9% top-1 accuracy on ImageNet, improving the SOTA results by 1.5%.
    Local AdaGrad-Type Algorithm for Stochastic Convex-Concave Minimax Problems. (arXiv:2106.10022v1 [cs.LG])
    (2 min) Large scale convex-concave minimax problems arise in numerous applications, including game theory, robust training, and training of generative adversarial networks. Despite their wide applicability, solving such problems efficiently and effectively is challenging in the presence of large amounts of data using existing stochastic minimax methods. We study a class of stochastic minimax methods and develop a communication-efficient distributed stochastic extragradient algorithm, LocalAdaSEG, with an adaptive learning rate suitable for solving convex-concave minimax problem in the Parameter-Server model. LocalAdaSEG has three main features: (i) periodic communication strategy reduces the communication cost between workers and the server; (ii) an adaptive learning rate that is computed locally and allows for tuning-free implementation; and (iii) theoretically, a nearly linear speed-up with respect to the dominant variance term, arising from estimation of the stochastic gradient, is proven in both the smooth and nonsmooth convex-concave settings. LocalAdaSEG is used to solve a stochastic bilinear game, and train generative adversarial network. We compare LocalAdaSEG against several existing optimizers for minimax problems and demonstrate its efficacy through several experiments in both the homogeneous and heterogeneous settings.
    PAC Prediction Sets Under Covariate Shift. (arXiv:2106.09848v1 [cs.LG])
    (2 min) An important challenge facing modern machine learning is how to rigorously quantify the uncertainty of model predictions. Conveying uncertainty is especially important when there are changes to the underlying data distribution that might invalidate the predictive model. Yet, most existing uncertainty quantification algorithms break down in the presence of such shifts. We propose a novel approach that addresses this challenge by constructing \emph{probably approximately correct (PAC)} prediction sets in the presence of covariate shift. Our approach focuses on the setting where there is a covariate shift from the source distribution (where we have labeled training examples) to the target distribution (for which we want to quantify uncertainty). Our algorithm assumes given importance weights that encode how the probabilities of the training examples change under the covariate shift. In practice, importance weights typically need to be estimated; thus, we extend our algorithm to the setting where we are given confidence intervals for the importance weights rather than their true value. We demonstrate the effectiveness of our approach on various covariate shifts designed based on the DomainNet and ImageNet datasets.
    Topological Indoor Mapping through WiFi Signals. (arXiv:2106.09789v1 [cs.NI])
    (2 min) The ubiquitous presence of WiFi access points and mobile devices capable of measuring WiFi signal strengths allow for real-world applications in indoor localization and mapping. In particular, no additional infrastructure is required. Previous approaches in this field were, however, often hindered by problems such as effortful map-building processes, changing environments and hardware differences. We tackle these problems focussing on topological maps. These represent discrete locations, such as rooms, and their relations, e.g., distances and transition frequencies. In our unsupervised method, we employ WiFi signal strength distributions, dimension reduction and clustering. It can be used in settings where users carry mobile devices and follow their normal routine. We aim for applications in short-lived indoor events such as conferences.
    Machining Cycle Time Prediction: Data-driven Modelling of Machine Tool Feedrate Behavior with Neural Networks. (arXiv:2106.09719v1 [cs.LG])
    (2 min) Accurate prediction of machining cycle times is important in the manufacturing industry. Usually, Computer Aided Manufacturing (CAM) software estimates the machining times using the commanded feedrate from the toolpath file using basic kinematic settings. Typically, the methods do not account for toolpath geometry or toolpath tolerance and therefore under estimate the machining cycle times considerably. Removing the need for machine specific knowledge, this paper presents a data-driven feedrate and machining cycle time prediction method by building a neural network model for each machine tool axis. In this study, datasets composed of the commanded feedrate, nominal acceleration, toolpath geometry and the measured feedrate were used to train a neural network model. Validation trials using a representative industrial thin wall structure component on a commercial machining centre showed that this method estimated the machining time with more than 90% accuracy. This method showed that neural network models have the capability to learn the behavior of a complex machine tool system and predict cycle times. Further integration of the methods will be critical in the implantation of digital twins in Industry 4.0.
    Anomaly Detection in Dynamic Graphs via Transformer. (arXiv:2106.09876v1 [cs.LG])
    (2 min) Detecting anomalies for dynamic graphs has drawn increasing attention due to their wide applications in social networks, e-commerce, and cybersecurity. The recent deep learning-based approaches have shown promising results over shallow methods. However, they fail to address two core challenges of anomaly detection in dynamic graphs: the lack of informative encoding for unattributed nodes and the difficulty of learning discriminate knowledge from coupled spatial-temporal dynamic graphs. To overcome these challenges, in this paper, we present a novel Transformer-based Anomaly Detection framework for DYnamic graph (TADDY). Our framework constructs a comprehensive node encoding strategy to better represent each node's structural and temporal roles in an evolving graphs stream. Meanwhile, TADDY captures informative representation from dynamic graphs with coupled spatial-temporal patterns via a dynamic graph transformer model. The extensive experimental results demonstrate that our proposed TADDY framework outperforms the state-of-the-art methods by a large margin on four real-world datasets.
    CIRA Guide to Custom Loss Functions for Neural Networks in Environmental Sciences -- Version 1. (arXiv:2106.09757v1 [cs.LG])
    (3 min) Neural networks are increasingly used in environmental science applications. Furthermore, neural network models are trained by minimizing a loss function, and it is crucial to choose the loss function very carefully for environmental science applications, as it determines what exactly is being optimized. Standard loss functions do not cover all the needs of the environmental sciences, which makes it important for scientists to be able to develop their own custom loss functions so that they can implement many of the classic performance measures already developed in environmental science, including measures developed for spatial model verification. However, there are very few resources available that cover the basics of custom loss function development comprehensively, and to the best of our knowledge none that focus on the needs of environmental scientists. This document seeks to fill this gap by providing a guide on how to write custom loss functions targeted toward environmental science applications. Topics include the basics of writing custom loss functions, common pitfalls, functions to use in loss functions, examples such as fractions skill score as loss function, how to incorporate physical constraints, discrete and soft discretization, and concepts such as focal, robust, and adaptive loss. While examples are currently provided in this guide for Python with Keras and the TensorFlow backend, the basic concepts also apply to other environments, such as Python with PyTorch. Similarly, while the sample loss functions provided here are from meteorology, these are just examples of how to create custom loss functions. Other fields in the environmental sciences have very similar needs for custom loss functions, e.g., for evaluating spatial forecasts effectively, and the concepts discussed here can be applied there as well. All code samples are provided in a GitHub repository.
    Guided Integrated Gradients: An Adaptive Path Method for Removing Noise. (arXiv:2106.09788v1 [cs.CV])
    (2 min) Integrated Gradients (IG) is a commonly used feature attribution method for deep neural networks. While IG has many desirable properties, the method often produces spurious/noisy pixel attributions in regions that are not related to the predicted class when applied to visual models. While this has been previously noted, most existing solutions are aimed at addressing the symptoms by explicitly reducing the noise in the resulting attributions. In this work, we show that one of the causes of the problem is the accumulation of noise along the IG path. To minimize the effect of this source of noise, we propose adapting the attribution path itself -- conditioning the path not just on the image but also on the model being explained. We introduce Adaptive Path Methods (APMs) as a generalization of path methods, and Guided IG as a specific instance of an APM. Empirically, Guided IG creates saliency maps better aligned with the model's prediction and the input image that is being explained. We show through qualitative and quantitative experiments that Guided IG outperforms other, related methods in nearly every experiment.
    Shuffle Private Stochastic Convex Optimization. (arXiv:2106.09805v1 [cs.LG])
    (2 min) In shuffle privacy, each user sends a collection of randomized messages to a trusted shuffler, the shuffler randomly permutes these messages, and the resulting shuffled collection of messages must satisfy differential privacy. Prior work in this model has largely focused on protocols that use a single round of communication to compute algorithmic primitives like means, histograms, and counts. In this work, we present interactive shuffle protocols for stochastic convex optimization. Our optimization protocols rely on a new noninteractive protocol for summing vectors of bounded $\ell_2$ norm. By combining this sum subroutine with techniques including mini-batch stochastic gradient descent, accelerated gradient descent, and Nesterov's smoothing method, we obtain loss guarantees for a variety of convex loss functions that significantly improve on those of the local model and sometimes match those of the central model.
    Locally Differentially Private Federated Learning: Efficient Algorithms with Tight Risk Bounds. (arXiv:2106.09779v1 [cs.LG])
    (2 min) Federated learning (FL) is a distributed learning paradigm in which many clients with heterogeneous, unbalanced, and often sensitive local data, collaborate to learn a model. Local Differential Privacy (LDP) provides a strong guarantee that each client's data cannot be leaked during and after training, without relying on a trusted third party. While LDP is often believed to be too stringent to allow for satisfactory utility, our paper challenges this belief. We consider a general setup with unbalanced, heterogeneous data, disparate privacy needs across clients, and unreliable communication, where a random number/subset of clients is available each round. We propose three LDP algorithms for smooth (strongly) convex FL; each are noisy variations of distributed minibatch SGD. One is accelerated and one involves novel time-varying noise, which we use to obtain the first non-trivial LDP excess risk bound for the fully general non-i.i.d. FL problem. Specializing to i.i.d. clients, our risk bounds interpolate between the best known and/or optimal bounds in the centralized setting and the cross-device setting, where each client represents just one person's data. Furthermore, we show that in certain regimes, our convergence rate (nearly) matches the corresponding non-private lower bound or outperforms state of the art non-private algorithms (``privacy for free''). Finally, we validate our theoretical results and illustrate the practical utility of our algorithm with numerical experiments.
    Escaping strict saddle points of the Moreau envelope in nonsmooth optimization. (arXiv:2106.09815v1 [math.OC])
    (2 min) Recent work has shown that stochastically perturbed gradient methods can efficiently escape strict saddle points of smooth functions. We extend this body of work to nonsmooth optimization, by analyzing an inexact analogue of a stochastically perturbed gradient method applied to the Moreau envelope. The main conclusion is that a variety of algorithms for nonsmooth optimization can escape strict saddle points of the Moreau envelope at a controlled rate. The main technical insight is that typical algorithms applied to the proximal subproblem yield directions that approximate the gradient of the Moreau envelope in relative terms.
    Unsupervised Resource Allocation with Graph Neural Networks. (arXiv:2106.09761v1 [cs.LG])
    (2 min) We present an approach for maximizing a global utility function by learning how to allocate resources in an unsupervised way. We expect interactions between allocation targets to be important and therefore propose to learn the reward structure for near-optimal allocation policies with a GNN. By relaxing the resource constraint, we can employ gradient-based optimization in contrast to more standard evolutionary algorithms. Our algorithm is motivated by a problem in modern astronomy, where one needs to select-based on limited initial information-among $10^9$ galaxies those whose detailed measurement will lead to optimal inference of the composition of the universe. Our technique presents a way of flexibly learning an allocation strategy by only requiring forward simulators for the physics of interest and the measurement process. We anticipate that our technique will also find applications in a range of resource allocation problems.
    Synthetic COVID-19 Chest X-ray Dataset for Computer-Aided Diagnosis. (arXiv:2106.09759v1 [eess.IV])
    (2 min) We introduce a new dataset called Synthetic COVID-19 Chest X-ray Dataset for training machine learning models. The dataset consists of 21,295 synthetic COVID-19 chest X-ray images to be used for computer-aided diagnosis. These images, generated via an unsupervised domain adaptation approach, are of high quality. We find that the synthetic images not only improve performance of various deep learning architectures when used as additional training data under heavy imbalance conditions, but also detect the target class with high confidence. We also find that comparable performance can also be achieved when trained only on synthetic images. Further, salient features of the synthetic COVID-19 images indicate that the distribution is significantly different from Non-COVID-19 classes, enabling a proper decision boundary. We hope the availability of such high fidelity chest X-ray images of COVID-19 will encourage advances in the development of diagnostic and/or management tools.
    Low Resource German ASR with Untranscribed Data Spoken by Non-native Children -- INTERSPEECH 2021 Shared Task SPAPL System. (arXiv:2106.09963v1 [eess.AS])
    (2 min) This paper describes the SPAPL system for the INTERSPEECH 2021 Challenge: Shared Task on Automatic Speech Recognition for Non-Native Children's Speech in German. ~ 5 hours of transcribed data and ~ 60 hours of untranscribed data are provided to develop a German ASR system for children. For the training of the transcribed data, we propose a non-speech state discriminative loss (NSDL) to mitigate the influence of long-duration non-speech segments within speech utterances. In order to explore the use of the untranscribed data, various approaches are implemented and combined together to incrementally improve the system performance. First, bidirectional autoregressive predictive coding (Bi-APC) is used to learn initial parameters for acoustic modelling using the provided untranscribed data. Second, incremental semi-supervised learning is further used to iteratively generate pseudo-transcribed data. Third, different data augmentation schemes are used at different training stages to increase the variability and size of the training data. Finally, a recurrent neural network language model (RNNLM) is used for rescoring. Our system achieves a word error rate (WER) of 39.68% on the evaluation data, an approximately 12% relative improvement over the official baseline (45.21%).

2021-06-18

  • cs.CL updates on arXiv.org

    End-to-End Cross-Domain Text-to-SQL Semantic Parsing with Auxiliary Task. (arXiv:2106.09588v1 [cs.CL])
    (2 min) In this work, we focus on two crucial components in the cross-domain text-to-SQL semantic parsing task: schema linking and value filling. To encourage the model to learn better encoding ability, we propose a column selection auxiliary task to empower the encoder with the relevance matching capability by using explicit learning targets. Furthermore, we propose two value filling methods to build the bridge from the existing zero-shot semantic parsers to real-world applications, considering most of the existing parsers ignore the values filling in the synthesized SQL. With experiments on Spider, our proposed framework improves over the baselines on the execution accuracy and exact set match accuracy when database contents are unavailable, and detailed analysis sheds light on future work.
    Can I Be of Further Assistance? Using Unstructured Knowledge Access to Improve Task-oriented Conversational Modeling. (arXiv:2106.09174v1 [cs.CL])
    (2 min) Most prior work on task-oriented dialogue systems are restricted to limited coverage of domain APIs. However, users oftentimes have requests that are out of the scope of these APIs. This work focuses on responding to these beyond-API-coverage user turns by incorporating external, unstructured knowledge sources. Our approach works in a pipelined manner with knowledge-seeking turn detection, knowledge selection, and response generation in sequence. We introduce novel data augmentation methods for the first two steps and demonstrate that the use of information extracted from dialogue context improves the knowledge selection and end-to-end performances. Through experiments, we achieve state-of-the-art performance for both automatic and human evaluation metrics on the DSTC9 Track 1 benchmark dataset, validating the effectiveness of our contributions.
    Multi-Modal Detection of Alzheimer's Disease from Speech and Text. (arXiv:2012.00096v2 [cs.LG] UPDATED)
    (2 min) Reliable detection of the prodromal stages of Alzheimer's disease (AD) remains difficult even today because, unlike other neurocognitive impairments, there is no definitive diagnosis of AD in vivo. In this context, existing research has shown that patients often develop language impairment even in mild AD conditions. We propose a multimodal deep learning method that utilizes speech and the corresponding transcript simultaneously to detect AD. For audio signals, the proposed audio-based network, a convolutional neural network (CNN) based model, predicts the diagnosis for multiple speech segments, which are combined for the final prediction. Similarly, we use contextual embedding extracted from BERT concatenated with a CNN-generated embedding for classifying the transcript. The individual predictions of the two models are then combined to make the final classification. We also perform experiments to analyze the model performance when Automated Speech Recognition (ASR) system generated transcripts are used instead of manual transcription in the text-based model. The proposed method achieves 85.3% 10-fold cross-validation accuracy when trained and evaluated on the Dementiabank Pitt corpus.
    ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling. (arXiv:2106.09532v1 [eess.AS])
    (2 min) Automatic Speech Recognition (ASR) robustness toward slot entities are critical in e-commerce voice assistants that involve monetary transactions and purchases. Along with effective domain adaptation, it is intuitive that cross utterance contextual cues play an important role in disambiguating domain specific content words from speech. In this paper, we investigate various techniques to improve contextualization, content word robustness and domain adaptation of a Transformer-XL neural language model (NLM) to rescore ASR N-best hypotheses. To improve contextualization, we utilize turn level dialogue acts along with cross utterance context carry over. Additionally, to adapt our domain-general NLM towards e-commerce on-the-fly, we use embeddings derived from a finetuned masked LM on in-domain data. Finally, to improve robustness towards in-domain content words, we propose a multi-task model that can jointly perform content word detection and language modeling tasks. Compared to a non-contextual LSTM LM baseline, our best performing NLM rescorer results in a content WER reduction of 19.2% on e-commerce audio test set and a slot labeling F1 improvement of 6.4%.
    Joint Emotion Label Space Modelling for Affect Lexica. (arXiv:1911.08782v2 [cs.CL] UPDATED)
    (2 min) Emotion lexica are commonly used resources to combat data poverty in automatic emotion detection. However, vocabulary coverage issues, differences in construction method and discrepancies in emotion framework and representation result in a heterogeneous landscape of emotion detection resources, calling for a unified approach to utilising them. To combat this, we present an extended emotion lexicon of 30,273 unique entries, which is a result of merging eight existing emotion lexica by means of a multi-view variational autoencoder (VAE). We showed that a VAE is a valid approach for combining lexica with different label spaces into a joint emotion label space with a chosen number of dimensions, and that these dimensions are still interpretable. We tested the utility of the unified VAE lexicon by employing the lexicon values as features in an emotion detection model. We found that the VAE lexicon outperformed individual lexica, but contrary to our expectations, it did not outperform a naive concatenation of lexica, although it did contribute to the naive concatenation when added as an extra lexicon. Furthermore, using lexicon information as additional features next to state-of-the-art language models usually resulted in a better performance than when no lexicon information was used.
    Denoising Distantly Supervised Named Entity Recognition via a Hypergeometric Probabilistic Model. (arXiv:2106.09234v1 [cs.CL])
    (2 min) Denoising is the essential step for distant supervision based named entity recognition. Previous denoising methods are mostly based on instance-level confidence statistics, which ignore the variety of the underlying noise distribution on different datasets and entity types. This makes them difficult to be adapted to high noise rate settings. In this paper, we propose Hypergeometric Learning (HGL), a denoising algorithm for distantly supervised NER that takes both noise distribution and instance-level confidence into consideration. Specifically, during neural network training, we naturally model the noise samples in each batch following a hypergeometric distribution parameterized by the noise-rate. Then each instance in the batch is regarded as either correct or noisy one according to its label confidence derived from previous training step, as well as the noise distribution in this sampled batch. Experiments show that HGL can effectively denoise the weakly-labeled data retrieved from distant supervision, and therefore results in significant improvements on the trained models.
    Element Intervention for Open Relation Extraction. (arXiv:2106.09558v1 [cs.CL])
    (2 min) Open relation extraction aims to cluster relation instances referring to the same underlying relation, which is a critical step for general relation extraction. Current OpenRE models are commonly trained on the datasets generated from distant supervision, which often results in instability and makes the model easily collapsed. In this paper, we revisit the procedure of OpenRE from a causal view. By formulating OpenRE using a structural causal model, we identify that the above-mentioned problems stem from the spurious correlations from entities and context to the relation type. To address this issue, we conduct \emph{Element Intervention}, which intervenes on the context and entities respectively to obtain the underlying causal effects of them. We also provide two specific implementations of the interventions based on entity ranking and context contrasting. Experimental results on unsupervised relation extraction datasets show that our methods outperform previous state-of-the-art methods and are robust across different datasets.
    Optimizing Data Usage via Differentiable Rewards. (arXiv:1911.10088v3 [cs.LG] UPDATED)
    (2 min) To acquire a new skill, humans learn better and faster if a tutor, based on their current knowledge level, informs them of how much attention they should pay to particular content or practice problems. Similarly, a machine learning model could potentially be trained better with a scorer that "adapts" to its current learning state and estimates the importance of each training data instance. Training such an adaptive scorer efficiently is a challenging problem; in order to precisely quantify the effect of a data instance at a given time during the training, it is typically necessary to first complete the entire training process. To efficiently optimize data usage, we propose a reinforcement learning approach called Differentiable Data Selection (DDS). In DDS, we formulate a scorer network as a learnable function of the training data, which can be efficiently updated along with the main model being trained. Specifically, DDS updates the scorer with an intuitive reward signal: it should up-weigh the data that has a similar gradient with a dev set upon which we would finally like to perform well. Without significant computing overhead, DDS delivers strong and consistent improvements over several strong baselines on two very different tasks of machine translation and image classification.
    On Sampling-Based Training Criteria for Neural Language Modeling. (arXiv:2104.10507v2 [cs.CL] UPDATED)
    (2 min) As the vocabulary size of modern word-based language models becomes ever larger, many sampling-based training criteria are proposed and investigated. The essence of these sampling methods is that the softmax-related traversal over the entire vocabulary can be simplified, giving speedups compared to the baseline. A problem we notice about the current landscape of such sampling methods is the lack of a systematic comparison and some myths about preferring one over another. In this work, we consider Monte Carlo sampling, importance sampling, a novel method we call compensated partial summation, and noise contrastive estimation. Linking back to the three traditional criteria, namely mean squared error, binary cross-entropy, and cross-entropy, we derive the theoretical solutions to the training problems. Contrary to some common belief, we show that all these sampling methods can perform equally well, as long as we correct for the intended class posterior probabilities. Experimental results in language modeling and automatic speech recognition on Switchboard and LibriSpeech support our claim, with all sampling-based methods showing similar perplexities and word error rates while giving the expected speedups.
    Pushing the Limits of Non-Autoregressive Speech Recognition. (arXiv:2104.03416v3 [eess.AS] UPDATED)
    (2 min) We combine recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results for multiple datasets: LibriSpeech, Fisher+Switchboard and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% on the Wall Street Journal, all without a language model.
    Investigating Methods to Improve Language Model Integration for Attention-based Encoder-Decoder ASR Models. (arXiv:2104.05544v2 [cs.CL] UPDATED)
    (2 min) Attention-based encoder-decoder (AED) models learn an implicit internal language model (ILM) from the training transcriptions. The integration with an external LM trained on much more unpaired text usually leads to better performance. A Bayesian interpretation as in the hybrid autoregressive transducer (HAT) suggests dividing by the prior of the discriminative acoustic model, which corresponds to this implicit LM, similarly as in the hybrid hidden Markov model approach. The implicit LM cannot be calculated efficiently in general and it is yet unclear what are the best methods to estimate it. In this work, we compare different approaches from the literature and propose several novel methods to estimate the ILM directly from the AED model. Our proposed methods outperform all previous approaches. We also investigate other methods to suppress the ILM mainly by decreasing the capacity of the AED model, limiting the label context, and also by training the AED model together with a pre-existing LM.
    Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence. (arXiv:2004.03974v2 [cs.CL] UPDATED)
    (2 min) Topic models extract groups of words from documents, whose interpretation as a topic hopefully allows for a better understanding of the data. However, the resulting word groups are often not coherent, making them harder to interpret. Recently, neural topic models have shown improvements in overall coherence. Concurrently, contextual embeddings have advanced the state of the art of neural models in general. In this paper, we combine contextualized representations with neural topic models. We find that our approach produces more meaningful and coherent topics than traditional bag-of-words topic models and recent neural models. Our results indicate that future improvements in language models will translate into better topic models.
    STN4DST: A Scalable Dialogue State Tracking based on Slot Tagging Navigation. (arXiv:2010.10811v2 [cs.CL] UPDATED)
    (2 min) Scalability for handling unknown slot values is a important problem in dialogue state tracking (DST). As far as we know, previous scalable DST approaches generally rely on either the candidate generation from slot tagging output or the span extraction in dialogue context. However, the candidate generation based DST often suffers from error propagation due to its pipelined two-stage process; meanwhile span extraction based DST has the risk of generating invalid spans in the lack of semantic constraints between start and end position pointers. To tackle the above drawbacks, in this paper, we propose a novel scalable dialogue state tracking method based on slot tagging navigation, which implements an end-to-end single-step pointer to locate and extract slot value quickly and accurately by the joint learning of slot tagging and slot value position prediction in the dialogue context, especially for unknown slot values. Extensive experiments over several benchmark datasets show that the proposed model performs better than state-of-the-art baselines greatly.
    Scrambled Translation Problem: A Problem of Denoising UNMT. (arXiv:1911.01212v2 [cs.CL] UPDATED)
    (2 min) In this paper, we identify an interesting kind of error in the output of Unsupervised Neural Machine Translation (UNMT) systems like \textit{Undreamt}(footnote). We refer to this error type as \textit{Scrambled Translation problem}. We observe that UNMT models which use \textit{word shuffle} noise (as in case of Undreamt) can generate correct words, but fail to stitch them together to form phrases. As a result, words of the translated sentence look \textit{scrambled}, resulting in decreased BLEU. We hypothesise that the reason behind \textit{scrambled translation problem} is 'shuffling noise' which is introduced in every input sentence as a denoising strategy. To test our hypothesis, we experiment by retraining UNMT models with a simple \textit{retraining} strategy. We stop the training of the Denoising UNMT model after a pre-decided number of iterations and resume the training for the remaining iterations -- which number is also pre-decided -- using original sentence as input without adding any noise. Our proposed solution achieves significant performance improvement UNMT models that train conventionally. We demonstrate these performance gains on four language pairs, \textit{viz.}, English-French, English-German, English-Spanish, Hindi-Punjabi. Our qualitative and quantitative analysis shows that the retraining strategy helps achieve better alignment as observed by attention heatmap and better phrasal translation, leading to statistically significant improvement in BLEU scores.
    Listen with Intent: Improving Speech Recognition with Audio-to-Intent Front-End. (arXiv:2105.07071v2 [eess.AS] UPDATED)
    (2 min) Comprehending the overall intent of an utterance helps a listener recognize the individual words spoken. Inspired by this fact, we perform a novel study of the impact of explicitly incorporating intent representations as additional information to improve a recurrent neural network-transducer (RNN-T) based automatic speech recognition (ASR) system. An audio-to-intent (A2I) model encodes the intent of the utterance in the form of embeddings or posteriors, and these are used as auxiliary inputs for RNN-T training and inference. Experimenting with a 50k-hour far-field English speech corpus, this study shows that when running the system in non-streaming mode, where intent representation is extracted from the entire utterance and then used to bias streaming RNN-T search from the start, it provides a 5.56% relative word error rate reduction (WERR). On the other hand, a streaming system using per-frame intent posteriors as extra inputs for the RNN-T ASR system yields a 3.33% relative WERR. A further detailed analysis of the streaming system indicates that our proposed method brings especially good gain on media-playing related intents (e.g. 9.12% relative WERR on PlayMusicIntent).
    Symmetric Regularization based BERT for Pair-wise Semantic Reasoning. (arXiv:1909.03405v3 [cs.CL] UPDATED)
    (2 min) The ability of semantic reasoning over the sentence pair is essential for many natural language understanding tasks, e.g., natural language inference and machine reading comprehension. A recent significant improvement in these tasks comes from BERT. As reported, the next sentence prediction (NSP) in BERT, which learns the contextual relationship between two sentences, is of great significance for downstream problems with sentence-pair input. Despite the effectiveness of NSP, we suggest that NSP still lacks the essential signal to distinguish between entailment and shallow correlation. To remedy this, we propose to augment the NSP task to a 3-class categorization task, which includes a category for previous sentence prediction (PSP). The involvement of PSP encourages the model to focus on the informative semantics to determine the sentence order, thereby improves the ability of semantic understanding. This simple modification yields remarkable improvement against vanilla BERT. To further incorporate the document-level information, the scope of NSP and PSP is expanded into a broader range, i.e., NSP and PSP also include close but nonsuccessive sentences, the noise of which is mitigated by the label-smoothing technique. Both qualitative and quantitative experimental results demonstrate the effectiveness of the proposed method. Our method consistently improves the performance on the NLI and MRC benchmarks, including the challenging HANS dataset \cite{hans}, suggesting that the document-level task is still promising for the pre-training.
    Text2Event: Controllable Sequence-to-Structure Generation for End-to-end Event Extraction. (arXiv:2106.09232v1 [cs.CL])
    (2 min) Event extraction is challenging due to the complex structure of event records and the semantic gap between text and event. Traditional methods usually extract event records by decomposing the complex structure prediction task into multiple subtasks. In this paper, we propose Text2Event, a sequence-to-structure generation paradigm that can directly extract events from the text in an end-to-end manner. Specifically, we design a sequence-to-structure network for unified event extraction, a constrained decoding algorithm for event knowledge injection during inference, and a curriculum learning algorithm for efficient model learning. Experimental results show that, by uniformly modeling all tasks in a single model and universally predicting different labels, our method can achieve competitive performance using only record-level annotations in both supervised learning and transfer learning settings.
    LoRA: Low-Rank Adaptation of Large Language Models. (arXiv:2106.09685v1 [cs.CL])
    (2 min) The dominant paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, conventional fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example, deploying many independent instances of fine-tuned models, each with 175B parameters, is extremely expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning. LoRA performs on-par or better than fine-tuning in model quality on both GPT-3 and GPT-2, despite having fewer trainable parameters, a higher training throughput, and no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptations, which sheds light on the efficacy of LoRA. We release our implementation in GPT-2 at https://github.com/microsoft/LoRA .
    Do Large Scale Molecular Language Representations Capture Important Structural Information?. (arXiv:2106.09553v1 [cs.LG])
    (2 min) Predicting chemical properties from the structure of a molecule is of great importance in many applications including drug discovery and material design. Machine learning based molecular property prediction holds the promise of enabling accurate predictions at much less complexity, when compared to, for example Density Functional Theory (DFT) calculations. Features extracted from molecular graphs, using graph neural nets in a supervised manner, have emerged as strong baselines for such tasks. However, the vast chemical space together with the limited availability of labels makes supervised learning challenging, calling for learning a general-purpose molecular representation. Recently, pre-trained transformer-based language models (PTLMs) on large unlabeled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, here we present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer. This model was employed with a linear attention mechanism and highly paralleized training on 1D SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation performs competitively, when compared to existing graph-based and fingerprint-based supervised learning baselines, on the challenging tasks of predicting properties of QM8 and QM9 molecules. Further task-specific fine-tuning of the MoLFormerr representation improves performance on several of those property prediction benchmarks. These results provide encouraging evidence that large-scale molecular language models can capture sufficient structural information to be able to accurately predict quantum chemical properties and beyond.
    Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition. (arXiv:2104.01989v2 [cs.CL] UPDATED)
    (2 min) Many neural network speaker recognition systems model each speaker using a fixed-dimensional embedding vector. These embeddings are generally compared using either linear or 2nd-order scoring and, until recently, do not handle utterance-specific uncertainty. In this work we propose scoring these representations in a way that can capture uncertainty, enroll/test asymmetry and additional non-linear information. This is achieved by incorporating a 2nd-stage neural network (known as a decision network) as part of an end-to-end training regimen. In particular, we propose the concept of decision residual networks which involves the use of a compact decision network to leverage cosine scores and to model the residual signal that's needed. Additionally, we present a modification to the generalized end-to-end softmax loss function to target the separation of same/different speaker scores. We observed significant performance gains for the two techniques.
    Topic Modeling and Progression of American Digital News Media During the Onset of the COVID-19 Pandemic. (arXiv:2106.09572v1 [cs.CL])
    (2 min) Currently, the world is in the midst of a severe global pandemic, which has affected all aspects of people's lives. As a result, there is a deluge of COVID-related digital media articles published in the United States, due to the disparate effects of the pandemic. This large volume of information is difficult to consume by the audience in a reasonable amount of time. In this paper, we develop a Natural Language Processing (NLP) pipeline that is capable of automatically distilling various digital articles into manageable pieces of information, while also modelling the progression topics discussed over time in order to aid readers in rapidly gaining holistic perspectives on pressing issues (i.e., the COVID-19 pandemic) from a diverse array of sources. We achieve these goals by first collecting a large corpus of COVID-related articles during the onset of the pandemic. After, we apply unsupervised and semi-supervised learning procedures to summarize articles, then cluster them based on their similarities using the community detection methods. Next, we identify the topic of each cluster of articles using the BART algorithm. Finally, we provide a detailed digital media analysis based on the NLP-pipeline outputs and show how the conversation surrounding COVID-19 evolved over time.
    Multi-head or Single-head? An Empirical Comparison for Transformer Training. (arXiv:2106.09650v1 [cs.CL])
    (2 min) Multi-head attention plays a crucial role in the recent success of Transformer models, which leads to consistent performance improvements over conventional attention in various applications. The popular belief is that this effectiveness stems from the ability of jointly attending multiple positions. In this paper, we first demonstrate that jointly attending multiple positions is not a unique feature of multi-head attention, as multi-layer single-head attention also attends multiple positions and is more effective. Then, we suggest the main advantage of the multi-head attention is the training stability, since it has less number of layers than the single-head attention, when attending the same number of positions. For example, 24-layer 16-head Transformer (BERT-large) and 384-layer single-head Transformer has the same total attention head number and roughly the same model size, while the multi-head one is significantly shallower. Meanwhile, we show that, with recent advances in deep learning, we can successfully stabilize the training of the 384-layer Transformer. As the training difficulty is no longer a bottleneck, substantially deeper single-head Transformer achieves consistent performance improvements without tuning hyper-parameters.
    Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study. (arXiv:2106.09700v1 [cs.CL])
    (2 min) Biomedical knowledge graphs (KGs) hold rich information on entities such as diseases, drugs, and genes. Predicting missing links in these graphs can boost many important applications, such as drug design and repurposing. Recent work has shown that general-domain language models (LMs) can serve as "soft" KGs, and that they can be fine-tuned for the task of KG completion. In this work, we study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction. We evaluate several domain-specific LMs, fine-tuning them on datasets centered on drugs and diseases that we represent as KGs and enrich with textual entity descriptions. We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance. Finally, we demonstrate the advantage of LM models in the inductive setting with novel scientific entities. Our datasets and code are made publicly available.
    Lost in Interpreting: Speech Translation from Source or Interpreter?. (arXiv:2106.09343v1 [cs.CL])
    (2 min) Interpreters facilitate multi-lingual meetings but the affordable set of languages is often smaller than what is needed. Automatic simultaneous speech translation can extend the set of provided languages. We investigate if such an automatic system should rather follow the original speaker, or an interpreter to achieve better translation quality at the cost of increased delay. To answer the question, we release Europarl Simultaneous Interpreting Corpus (ESIC), 10 hours of recordings and transcripts of European Parliament speeches in English, with simultaneous interpreting into Czech and German. We evaluate quality and latency of speaker-based and interpreter-based spoken translation systems from English to Czech. We study the differences in implicit simplification and summarization of the human interpreter compared to a machine translation system trained to shorten the output to some extent. Finally, we perform human evaluation to measure information loss of each of these approaches.
    Modeling Worlds in Text. (arXiv:2106.09578v1 [cs.CL])
    (2 min) We provide a dataset that enables the creation of learning agents that can build knowledge graph-based world models of interactive narratives. Interactive narratives -- or text-adventure games -- are partially observable environments structured as long puzzles or quests in which an agent perceives and interacts with the world purely through textual natural language. Each individual game typically contains hundreds of locations, characters, and objects -- each with their own unique descriptions -- providing an opportunity to study the problem of giving language-based agents the structured memory necessary to operate in such worlds. Our dataset provides 24198 mappings between rich natural language observations and: (1) knowledge graphs that reflect the world state in the form of a map; (2) natural language actions that are guaranteed to cause a change in that particular world state. The training data is collected across 27 games in multiple genres and contains a further 7836 heldout instances over 9 additional games in the test set. We further provide baseline models using rules-based, question-answering, and sequence learning approaches in addition to an analysis of the data and corresponding learning tasks.
    Biomedical Interpretable Entity Representations. (arXiv:2106.09502v1 [cs.CL])
    (2 min) Pre-trained language models induce dense entity representations that offer strong performance on entity-centric NLP tasks, but such representations are not immediately interpretable. This can be a barrier to model uptake in important domains such as biomedicine. There has been recent work on general interpretable representation learning (Onoe and Durrett, 2020), but these domain-agnostic representations do not readily transfer to the important domain of biomedicine. In this paper, we create a new entity type system and training set from a large corpus of biomedical texts by mapping entities to concepts in a medical ontology, and from these to Wikipedia pages whose categories are our types. From this mapping we derive Biomedical Interpretable Entity Representations(BIERs), in which dimensions correspond to fine-grained entity types, and values are predicted probabilities that a given entity is of the corresponding type. We propose a novel method that exploits BIER's final sparse and intermediate dense representations to facilitate model and entity type debugging. We show that BIERs achieve strong performance in biomedical tasks including named entity disambiguation and entity label classification, and we provide error analysis to highlight the utility of their interpretability, particularly in low-supervision settings. Finally, we provide our induced 68K biomedical type system, the corresponding 37 million triples of derived data used to train BIER models and our best performing model.
    De-biasing Distantly Supervised Named Entity Recognition via Causal Intervention. (arXiv:2106.09233v1 [cs.CL])
    (2 min) Distant supervision tackles the data bottleneck in NER by automatically generating training instances via dictionary matching. Unfortunately, the learning of DS-NER is severely dictionary-biased, which suffers from spurious correlations and therefore undermines the effectiveness and the robustness of the learned models. In this paper, we fundamentally explain the dictionary bias via a Structural Causal Model (SCM), categorize the bias into intra-dictionary and inter-dictionary biases, and identify their causes. Based on the SCM, we learn de-biased DS-NER via causal interventions. For intra-dictionary bias, we conduct backdoor adjustment to remove the spurious correlations introduced by the dictionary confounder. For inter-dictionary bias, we propose a causal invariance regularizer which will make DS-NER models more robust to the perturbation of dictionaries. Experiments on four datasets and three DS-NER models show that our method can significantly improve the performance of DS-NER.
    Classifying vaccine sentiment tweets by modelling domain-specific representation and commonsense knowledge into context-aware attentive GRU. (arXiv:2106.09589v1 [cs.CL])
    (2 min) Vaccines are an important public health measure, but vaccine hesitancy and refusal can create clusters of low vaccine coverage and reduce the effectiveness of vaccination programs. Social media provides an opportunity to estimate emerging risks to vaccine acceptance by including geographical location and detailing vaccine-related concerns. Methods for classifying social media posts, such as vaccine-related tweets, use language models (LMs) trained on general domain text. However, challenges to measuring vaccine sentiment at scale arise from the absence of tonal stress and gestural cues and may not always have additional information about the user, e.g., past tweets or social connections. Another challenge in LMs is the lack of commonsense knowledge that are apparent in users metadata, i.e., emoticons, positive and negative words etc. In this study, to classify vaccine sentiment tweets with limited information, we present a novel end-to-end framework consisting of interconnected components that use domain-specific LM trained on vaccine-related tweets and models commonsense knowledge into a bidirectional gated recurrent network (CK-BiGRU) with context-aware attention. We further leverage syntactical, user metadata and sentiment information to capture the sentiment of a tweet. We experimented using two popular vaccine-related Twitter datasets and demonstrate that our proposed approach outperforms state-of-the-art models in identifying pro-vaccine, anti-vaccine and neutral tweets.
    STAN: A stuttering therapy analysis helper. (arXiv:2106.09545v1 [eess.AS])
    (2 min) Stuttering is a complex speech disorder identified by repeti-tions, prolongations of sounds, syllables or words and blockswhile speaking. Specific stuttering behaviour differs strongly,thus needing personalized therapy. Therapy sessions requirea high level of concentration by the therapist. We introduceSTAN, a system to aid speech therapists in stuttering therapysessions. Such an automated feedback system can lower thecognitive load on the therapist and thereby enable a more con-sistent therapy as well as allowing analysis of stuttering overthe span of multiple therapy sessions.
    DravidianCodeMix: Sentiment Analysis and Offensive Language Identification Dataset for Dravidian Languages in Code-Mixed Text. (arXiv:2106.09460v1 [cs.CL])
    (2 min) This paper describes the development of a multilingual, manually annotated dataset for three under-resourced Dravidian languages generated from social media comments. The dataset was annotated for sentiment analysis and offensive language identification for a total of more than 60,000 YouTube comments. The dataset consists of around 44,000 comments in Tamil-English, around 7,000 comments in Kannada-English, and around 20,000 comments in Malayalam-English. The data was manually annotated by volunteer annotators and has a high inter-annotator agreement in Krippendorff's alpha. The dataset contains all types of code-mixing phenomena since it comprises user-generated content from a multilingual country. We also present baseline experiments to establish benchmarks on the dataset using machine learning methods. The dataset is available on Github (https://github.com/bharathichezhiyan/DravidianCodeMix-Dataset) and Zenodo (https://zenodo.org/record/4750858\#.YJtw0SYo\_0M).
    A Self-supervised Method for Entity Alignment. (arXiv:2106.09395v1 [cs.CL])
    (2 min) Entity alignment, aiming to identify equivalent entities across different knowledge graphs (KGs), is a fundamental problem for constructing large-scale KGs. Over the course of its development, supervision has been considered necessary for accurate alignments. Inspired by the recent progress of self-supervised learning, we explore the extent to which we can get rid of supervision for entity alignment. Existing supervised methods for this task focus on pulling each pair of positive (labeled) entities close to each other. However, our analysis suggests that the learning of entity alignment can actually benefit more from pushing sampled (unlabeled) negatives far away than pulling positive aligned pairs close. We present SelfKG by leveraging this discovery to design a contrastive learning strategy across two KGs. Extensive experiments on benchmark datasets demonstrate that SelfKG without supervision can match or achieve comparable results with state-of-the-art supervised baselines. The performance of SelfKG demonstrates self-supervised learning offers great potential for entity alignment in KGs.
    Scalable Approach for Normalizing E-commerce Text Attributes (SANTA). (arXiv:2106.09493v1 [cs.CL])
    (2 min) In this paper, we present SANTA, a scalable framework to automatically normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of pre-defined canonical values (e.g. "Windows 10"). Earlier works on attribute normalization focused on fuzzy string matching (also referred as syntactic matching in this paper). In this work, we first perform an extensive study of nine syntactic matching algorithms and establish that 'cosine' similarity leads to best results, showing 2.7% improvement over commonly used Jaccard index. Next, we argue that string similarity alone is not sufficient for attribute normalization as many surface forms require going beyond syntactic matching (e.g. "720p" and "HD" are synonyms). While semantic techniques like unsupervised embeddings (e.g. word2vec/fastText) have shown good results in word similarity tasks, we observed that they perform poorly to distinguish between close canonical forms, as these close forms often occur in similar contexts. We propose to learn token embeddings using a twin network with triplet loss. We propose an embedding learning task leveraging raw attribute values and product titles to learn these embeddings in a self-supervised fashion. We show that providing supervision using our proposed task improves over both syntactic and unsupervised embeddings based techniques for attribute normalization. Experiments on a real-world attribute normalization dataset of 50 attributes show that the embeddings trained using our proposed approach obtain 2.3% improvement over best string matching and 19.3% improvement over best unsupervised embeddings.
    DocNLI: A Large-scale Dataset for Document-level Natural Language Inference. (arXiv:2106.09449v1 [cs.CL])
    (2 min) Natural language inference (NLI) is formulated as a unified framework for solving various NLP problems such as relation extraction, question answering, summarization, etc. It has been studied intensively in the past few years thanks to the availability of large-scale labeled datasets. However, most existing studies focus on merely sentence-level inference, which limits the scope of NLI's application in downstream NLP problems. This work presents DocNLI -- a newly-constructed large-scale dataset for document-level NLI. DocNLI is transformed from a broad range of NLP problems and covers multiple genres of text. The premises always stay in the document granularity, whereas the hypotheses vary in length from single sentences to passages with hundreds of words. Additionally, DocNLI has pretty limited artifacts which unfortunately widely exist in some popular sentence-level NLI datasets. Our experiments demonstrate that, even without fine-tuning, a model pretrained on DocNLI shows promising performance on popular sentence-level benchmarks, and generalizes well to out-of-domain NLP tasks that rely on inference at document granularity. Task-specific fine-tuning can bring further improvements. Data, code, and pretrained models can be found at https://github.com/salesforce/DocNLI.
    Scaling Laws for Acoustic Models. (arXiv:2106.09488v1 [eess.AS])
    (2 min) There is a recent trend in machine learning to increase model quality by growing models to sizes previously thought to be unreasonable. Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships, or scaling laws, that predict model quality from model size, training set size, and the available compute budget. These scaling laws allow one to choose nearly optimal hyper-parameters given constraints on available training data, model parameter count, or training computation budget. In this paper, we demonstrate that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws. We extend previous work to jointly predict loss due to model size, to training set size, and to the inherent "irreducible loss" of the task. We find that the scaling laws accurately match model performance over two orders of magnitude in both model size and training set size, and make predictions about the limits of model performance.
    Probing Image-Language Transformers for Verb Understanding. (arXiv:2106.09141v1 [cs.CL])
    (2 min) Multimodal image-language transformers have achieved impressive results on a variety of tasks that rely on fine-tuning (e.g., visual question answering and image retrieval). We are interested in shedding light on the quality of their pretrained representations -- in particular, if these models can distinguish different types of verbs or if they rely solely on nouns in a given sentence. To do so, we collect a dataset of image-sentence pairs (in English) consisting of 421 verbs that are either visual or commonly found in the pretraining data (i.e., the Conceptual Captions dataset). We use this dataset to evaluate pretrained image-language transformers and find that they fail more in situations that require verb understanding compared to other parts of speech. We also investigate what category of verbs are particularly challenging.
    Learning Knowledge Graph-based World Models of Textual Environments. (arXiv:2106.09608v1 [cs.LG])
    (2 min) World models improve a learning agent's ability to efficiently operate in interactive and situated environments. This work focuses on the task of building world models of text-based game environments. Text-based games, or interactive narratives, are reinforcement learning environments in which agents perceive and interact with the world using textual natural language. These environments contain long, multi-step puzzles or quests woven through a world that is filled with hundreds of characters, locations, and objects. Our world model learns to simultaneously: (1) predict changes in the world caused by an agent's actions when representing the world as a knowledge graph; and (2) generate the set of contextually relevant natural language actions required to operate in the world. We frame this task as a Set of Sequences generation problem by exploiting the inherent structure of knowledge graphs and actions and introduce both a transformer-based multi-task architecture and a loss function to train it. A zero-shot ablation study on never-before-seen textual worlds shows that our methodology significantly outperforms existing textual world modeling techniques as well as the importance of each of our contributions.
    pysentimiento: A Python Toolkit for Sentiment Analysis and SocialNLP tasks. (arXiv:2106.09462v1 [cs.CL])
    (2 min) Extracting opinions from texts has gathered a lot of interest in the last years, as we are experiencing an unprecedented volume of user-generated content in social networks and other places. A problem that social researchers find in using opinion mining tools is that they are usually behind commercial APIs and unavailable for other languages than English. To address these issues, we present pysentimiento, a multilingual Python toolkit for Sentiment Analysis and other Social NLP tasks. This open-source library brings state-of-the-art models for Spanish and English in a black-box fashion, allowing researchers to easily access these techniques.
    Knowledgeable or Educated Guess? Revisiting Language Models as Knowledge Bases. (arXiv:2106.09231v1 [cs.CL])
    (2 min) Previous literatures show that pre-trained masked language models (MLMs) such as BERT can achieve competitive factual knowledge extraction performance on some datasets, indicating that MLMs can potentially be a reliable knowledge source. In this paper, we conduct a rigorous study to explore the underlying predicting mechanisms of MLMs over different extraction paradigms. By investigating the behaviors of MLMs, we find that previous decent performance mainly owes to the biased prompts which overfit dataset artifacts. Furthermore, incorporating illustrative cases and external contexts improve knowledge prediction mainly due to entity type guidance and golden answer leakage. Our findings shed light on the underlying predicting mechanisms of MLMs, and strongly question the previous conclusion that current MLMs can potentially serve as reliable factual knowledge bases.
    X-FACT: A New Benchmark Dataset for Multilingual Fact Checking. (arXiv:2106.09248v1 [cs.CL])
    (2 min) In this work, we introduce X-FACT: the largest publicly available multilingual dataset for factual verification of naturally existing real-world claims. The dataset contains short statements in 25 languages and is labeled for veracity by expert fact-checkers. The dataset includes a multilingual evaluation benchmark that measures both out-of-domain generalization, and zero-shot capabilities of the multilingual models. Using state-of-the-art multilingual transformer-based models, we develop several automated fact-checking models that, along with textual claims, make use of additional metadata and evidence from news stories retrieved using a search engine. Empirically, our best model attains an F-score of around 40%, suggesting that our dataset is a challenging benchmark for evaluation of multilingual fact-checking models.
    Disentangling Online Chats with DAG-Structured LSTMs. (arXiv:2106.09024v1 [cs.CL])
    (2 min) Many modern messaging systems allow fast and synchronous textual communication among many users. The resulting sequence of messages hides a more complicated structure in which independent sub-conversations are interwoven with one another. This poses a challenge for any task aiming to understand the content of the chat logs or gather information from them. The ability to disentangle these conversations is then tantamount to the success of many downstream tasks such as summarization and question answering. Structured information accompanying the text such as user turn, user mentions, timestamps, is used as a cue by the participants themselves who need to follow the conversation and has been shown to be important for disentanglement. DAG-LSTMs, a generalization of Tree-LSTMs that can handle directed acyclic dependencies, are a natural way to incorporate such information and its non-sequential nature. In this paper, we apply DAG-LSTMs to the conversation disentanglement task. We perform our experiments on the Ubuntu IRC dataset. We show that the novel model we propose achieves state of the art status on the task of recovering reply-to relations and it is competitive on other disentanglement metrics.
    Specializing Multilingual Language Models: An Empirical Study. (arXiv:2106.09063v1 [cs.CL])
    (2 min) Contextualized word representations from pretrained multilingual language models have become the de facto standard for addressing natural language tasks in many different languages, but the success of this approach is far from universal. For languages rarely or never seen by these models, directly using such models often results in suboptimal representation or use of data, motivating additional model adaptations to achieve reasonably strong performance. In this work, we study the performance, extensibility, and interaction of two such adaptations for this low-resource setting: vocabulary augmentation and script transliteration. Our evaluations on a set of three tasks in nine diverse low-resource languages yield a mixed result, upholding the viability of these approaches while raising new questions around how to optimally adapt multilingual models to low-resource settings.
    An Empirical Study on Hyperparameter Optimization for Fine-Tuning Pre-trained Language Models. (arXiv:2106.09204v1 [cs.CL])
    (2 min) The performance of fine-tuning pre-trained language models largely depends on the hyperparameter configuration. In this paper, we investigate the performance of modern hyperparameter optimization methods (HPO) on fine-tuning pre-trained language models. First, we study and report three HPO algorithms' performances on fine-tuning two state-of-the-art language models on the GLUE dataset. We find that using the same time budget, HPO often fails to outperform grid search due to two reasons: insufficient time budget and overfitting. We propose two general strategies and an experimental procedure to systematically troubleshoot HPO's failure cases. By applying the procedure, we observe that HPO can succeed with more appropriate settings in the search space and time budget; however, in certain cases overfitting remains. Finally, we make suggestions for future work. Our implementation can be found in https://github.com/microsoft/FLAML/tree/main/flaml/nlp/.
    Automatic Construction of Evaluation Suites for Natural Language Generation Datasets. (arXiv:2106.09069v1 [cs.CL])
    (2 min) Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly simplifies the complexity of language and encourages overfitting to the head of the data distribution. As such, rare language phenomena or text about underrepresented groups are not equally included in the evaluation. To encourage more in-depth model analyses, researchers have proposed the use of multiple test sets, also called challenge sets, that assess specific capabilities of a model. In this paper, we develop a framework based on this idea which is able to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings. By applying this framework to the GEM generation benchmark, we propose an evaluation suite made of 80 challenge sets, demonstrate the kinds of analyses that it enables and shed light onto the limits of current generation models.
    Layer Pruning on Demand with Intermediate CTC. (arXiv:2106.09216v1 [eess.AS])
    (2 min) Deploying an end-to-end automatic speech recognition (ASR) model on mobile/embedded devices is a challenging task, since the device computational power and energy consumption requirements are dynamically changed in practice. To overcome the issue, we present a training and pruning method for ASR based on the connectionist temporal classification (CTC) which allows reduction of model depth at run-time without any extra fine-tuning. To achieve the goal, we adopt two regularization methods, intermediate CTC and stochastic depth, to train a model whose performance does not degrade much after pruning. We present an in-depth analysis of layer behaviors using singular vector canonical correlation analysis (SVCCA), and efficient strategies for finding layers which are safe to prune. Using the proposed method, we show that a Transformer-CTC model can be pruned in various depth on demand, improving real-time factor from 0.005 to 0.002 on GPU, while each pruned sub-model maintains the accuracy of individually trained model of the same depth.
  • cs.CV updates on arXiv.org

    Real-Time Selfie Video Stabilization. (arXiv:2009.02007v2 [cs.CV] UPDATED)
    (2 min) We propose a novel real-time selfie video stabilization method. Our method is completely automatic and runs at 26 fps. We use a 1D linear convolutional network to directly infer the rigid moving least squares warping which implicitly balances between the global rigidity and local flexibility. Our network structure is specifically designed to stabilize the background and foreground at the same time, while providing optional control of stabilization focus (relative importance of foreground vs. background) to the users. To train our network, we collect a selfie video dataset with 1005 videos, which is significantly larger than previous selfie video datasets. We also propose a grid approximation method to the rigid moving least squares warping that enables the real-time frame warping. Our method is fully automatic and produces visually and quantitatively better results than previous real-time general video stabilization methods. Compared to previous offline selfie video methods, our approach produces comparable quality with a speed improvement of orders of magnitude.
    Unsupervised Video Prediction from a Single Frame by Estimating 3D Dynamic Scene Structure. (arXiv:2106.09051v1 [cs.CV])
    (2 min) Our goal in this work is to generate realistic videos given just one initial frame as input. Existing unsupervised approaches to this task do not consider the fact that a video typically shows a 3D environment, and that this should remain coherent from frame to frame even as the camera and objects move. We address this by developing a model that first estimates the latent 3D structure of the scene, including the segmentation of any moving objects. It then predicts future frames by simulating the object and camera dynamics, and rendering the resulting views. Importantly, it is trained end-to-end using only the unsupervised objective of predicting future frames, without any 3D information nor segmentation annotations. Experiments on two challenging datasets of natural videos show that our model can estimate 3D structure and motion segmentation from a single frame, and hence generate plausible and varied predictions.
    NPAS: A Compiler-aware Framework of Unified Network Pruning and Architecture Search for Beyond Real-Time Mobile Acceleration. (arXiv:2012.00596v3 [cs.LG] UPDATED)
    (2 min) With the increasing demand to efficiently deploy DNNs on mobile edge devices, it becomes much more important to reduce unnecessary computation and increase the execution speed. Prior methods towards this goal, including model compression and network architecture search (NAS), are largely performed independently and do not fully consider compiler-level optimizations which is a must-do for mobile acceleration. In this work, we first propose (i) a general category of fine-grained structured pruning applicable to various DNN layers, and (ii) a comprehensive, compiler automatic code generation framework supporting different DNNs and different pruning schemes, which bridge the gap of model compression and NAS. We further propose NPAS, a compiler-aware unified network pruning, and architecture search. To deal with large search space, we propose a meta-modeling procedure based on reinforcement learning with fast evaluation and Bayesian optimization, ensuring the total number of training epochs comparable with representative NAS frameworks. Our framework achieves 6.7ms, 5.9ms, 3.9ms ImageNet inference times with 78.2%, 75% (MobileNet-V3 level), and 71% (MobileNet-V2 level) Top-1 accuracy respectively on an off-the-shelf mobile phone, consistently outperforming prior work.
    AttDLNet: Attention-based DL Network for 3D LiDAR Place Recognition. (arXiv:2106.09637v1 [cs.CV])
    (2 min) Deep networks have been progressively adapted to new sensor modalities, namely to 3D LiDAR, which led to unprecedented achievements in autonomous vehicle-related applications such as place recognition. One of the main challenges of deep models in place recognition is to extract efficient and descriptive feature representations that relate places based on their similarity. To address the problem of place recognition using LiDAR data, this paper proposes a novel 3D LiDAR-based deep learning network (named AttDLNet) that comprises an encoder network and exploits an attention mechanism to selectively focus on long-range context and interfeature relationships. The proposed network is trained and validated on the KITTI dataset, using the cosine loss for training and a retrieval-based place recognition pipeline for validation. Additionally, an ablation study is presented to assess the best network configuration. Results show that the encoder network features are already very descriptive, but adding attention to the network further improves performance. From the ablation study, results indicate that the middle encoder layers have the highest mean performance, while deeper layers are more robust to orientation change. The code is publicly available on the project website: https://github.com/Cybonic/ AttDLNet
    Episode Adaptive Embedding Networks for Few-shot Learning. (arXiv:2106.09398v1 [cs.CV])
    (2 min) Few-shot learning aims to learn a classifier using a few labelled instances for each class. Metric-learning approaches for few-shot learning embed instances into a high-dimensional space and conduct classification based on distances among instance embeddings. However, such instance embeddings are usually shared across all episodes and thus lack the discriminative power to generalize classifiers according to episode-specific features. In this paper, we propose a novel approach, namely \emph{Episode Adaptive Embedding Network} (EAEN), to learn episode-specific embeddings of instances. By leveraging the probability distributions of all instances in an episode at each channel-pixel embedding dimension, EAEN can not only alleviate the overfitting issue encountered in few-shot learning tasks, but also capture discriminative features specific to an episode. To empirically verify the effectiveness and robustness of EAEN, we have conducted extensive experiments on three widely used benchmark datasets, under various combinations of different generic embedding backbones and different classifiers. The results show that EAEN significantly improves classification accuracy about $10\%$ to $20\%$ in different settings over the state-of-the-art methods.
    NeuroMorph: Unsupervised Shape Interpolation and Correspondence in One Go. (arXiv:2106.09431v1 [cs.CV])
    (2 min) We present NeuroMorph, a new neural network architecture that takes as input two 3D shapes and produces in one go, i.e. in a single feed forward pass, a smooth interpolation and point-to-point correspondences between them. The interpolation, expressed as a deformation field, changes the pose of the source shape to resemble the target, but leaves the object identity unchanged. NeuroMorph uses an elegant architecture combining graph convolutions with global feature pooling to extract local features. During training, the model is incentivized to create realistic deformations by approximating geodesics on the underlying shape space manifold. This strong geometric prior allows to train our model end-to-end and in a fully unsupervised manner without requiring any manual correspondence annotations. NeuroMorph works well for a large variety of input shapes, including non-isometric pairs from different object categories. It obtains state-of-the-art results for both shape correspondence and interpolation tasks, matching or surpassing the performance of recent unsupervised and supervised methods on multiple benchmarks.
    Video Analytics with Zero-streaming Cameras. (arXiv:1904.12342v4 [cs.DB] UPDATED)
    (2 min) Low-cost cameras enable powerful analytics. An unexploited opportunity is that most captured videos remain "cold" without being queried. For efficiency, we advocate for these cameras to be zero streaming: capturing videos to local storage and communicating with the cloud only when analytics is requested. How to query zero-streaming cameras efficiently? Our response is a camera/cloud runtime system called DIVA. It addresses two key challenges: to best use limited camera resource during video capture; to rapidly explore massive videos during query execution. DIVA contributes two unconventional techniques. (1) When capturing videos, a camera builds sparse yet accurate landmark frames, from which it learns reliable knowledge for accelerating future queries. (2) When executing a query, a camera processes frames in multiple passes with increasingly more expensive operators. As such, DIVA presents and keeps refining inexact query results throughout the query's execution. On diverse queries over 15 videos lasting 720 hours in total, DIVA runs at more than 100x video realtime and outperforms competitive alternative designs. To our knowledge, DIVA is the first system for querying large videos stored on low-cost remote cameras.
    BigEarthNet-MM: A Large Scale Multi-Modal Multi-Label Benchmark Archive for Remote Sensing Image Classification and Retrieval. (arXiv:2105.07921v2 [cs.CV] UPDATED)
    (2 min) This paper presents the multi-modal BigEarthNet (BigEarthNet-MM) benchmark archive made up of 590,326 pairs of Sentinel-1 and Sentinel-2 image patches to support the deep learning (DL) studies in multi-modal multi-label remote sensing (RS) image retrieval and classification. Each pair of patches in BigEarthNet-MM is annotated with multi-labels provided by the CORINE Land Cover (CLC) map of 2018 based on its thematically most detailed Level-3 class nomenclature. Our initial research demonstrates that some CLC classes are challenging to be accurately described by only considering (single-date) BigEarthNet-MM images. In this paper, we also introduce an alternative class-nomenclature as an evolution of the original CLC labels to address this problem. This is achieved by interpreting and arranging the CLC Level-3 nomenclature based on the properties of BigEarthNet-MM images in a new nomenclature of 19 classes. In our experiments, we show the potential of BigEarthNet-MM for multi-modal multi-label image retrieval and classification problems by considering several state-of-the-art DL models. We also demonstrate that the DL models trained from scratch on BigEarthNet-MM outperform those pre-trained on ImageNet, especially in relation to some complex classes, including agriculture and other vegetated and natural environments. We make all the data and the DL models publicly available at https://bigearth.net, offering an important resource to support studies on multi-modal image scene classification and retrieval problems in RS.
    Unsupervised Training Data Generation of Handwritten Formulas using Generative Adversarial Networks with Self-Attention. (arXiv:2106.09432v1 [cs.CV])
    (2 min) The recognition of handwritten mathematical expressions in images and video frames is a difficult and unsolved problem yet. Deep convectional neural networks are basically a promising approach, but typically require a large amount of labeled training data. However, such a large training dataset does not exist for the task of handwritten formula recognition. In this paper, we introduce a system that creates a large set of synthesized training examples of mathematical expressions which are derived from LaTeX documents. For this purpose, we propose a novel attention-based generative adversarial network to translate rendered equations to handwritten formulas. The datasets generated by this approach contain hundreds of thousands of formulas, making it ideal for pretraining or the design of more complex models. We evaluate our synthesized dataset and the recognition approach on the CROHME 2014 benchmark dataset. Experimental results demonstrate the feasibility of the approach.
    A Random CNN Sees Objects: One Inductive Bias of CNN and Its Applications. (arXiv:2106.09259v1 [cs.CV])
    (2 min) This paper starts by revealing a surprising finding: without any learning, a randomly initialized CNN can localize objects surprisingly well. That is, a CNN has an inductive bias to naturally focus on objects, named as Tobias (``The object is at sight'') in this paper. This empirical inductive bias is further analyzed and successfully applied to self-supervised learning. A CNN is encouraged to learn representations that focus on the foreground object, by transforming every image into various versions with different backgrounds, where the foreground and background separation is guided by Tobias. Experimental results show that the proposed Tobias significantly improves downstream tasks, especially for object detection. This paper also shows that Tobias has consistent improvements on training sets of different sizes, and is more resilient to changes in image augmentations. Our codes will be available at https://github.com/CupidJay/Tobias.
    Visual Correspondence Hallucination: Towards Geometric Reasoning. (arXiv:2106.09711v1 [cs.CV])
    (2 min) Given a pair of partially overlapping source and target images and a keypoint in the source image, the keypoint's correspondent in the target image can be either visible, occluded or outside the field of view. Local feature matching methods are only able to identify the correspondent's location when it is visible, while humans can also hallucinate its location when it is occluded or outside the field of view through geometric reasoning. In this paper, we bridge this gap by training a network to output a peaked probability distribution over the correspondent's location, regardless of this correspondent being visible, occluded, or outside the field of view. We experimentally demonstrate that this network is indeed able to hallucinate correspondences on unseen pairs of images. We also apply this network to a camera pose estimation problem and find it is significantly more robust than state-of-the-art local feature matching-based competitors.
    Positional Contrastive Learning for VolumetricMedical Image Segmentation. (arXiv:2106.09157v1 [cs.CV])
    (2 min) The success of deep learning heavily depends on the availability of large labeled training sets. However, it is hard to get large labeled datasets in medical image domain because of the strict privacy concern and costly labeling efforts. Contrastive learning, an unsupervised learning technique, has been proved powerful in learning image-level representations from unlabeled data. The learned encoder can then be transferred or fine-tuned to improve the performance of downstream tasks with limited labels. A critical step in contrastive learning is the generation of contrastive data pairs, which is relatively simple for natural image classification but quite challenging for medical image segmentation due to the existence of the same tissue or organ across the dataset. As a result, when applied to medical image segmentation, most state-of-the-art contrastive learning frameworks inevitably introduce a lot of false-negative pairs and result in degraded segmentation quality. To address this issue, we propose a novel positional contrastive learning (PCL) framework to generate contrastive data pairs by leveraging the position information in volumetric medical images. Experimental results on CT and MRI datasets demonstrate that the proposed PCL method can substantially improve the segmentation performance compared to existing methods in both semi-supervised setting and transfer learning setting.
    A Multi-task convolutional neural network for blind stereoscopic image quality assessment using naturalness analysis. (arXiv:2106.09303v1 [eess.IV])
    (2 min) This paper addresses the problem of blind stereoscopic image quality assessment (NR-SIQA) using a new multi-task deep learning based-method. In the field of stereoscopic vision, the information is fairly distributed between the left and right views as well as the binocular phenomenon. In this work, we propose to integrate these characteristics to estimate the quality of stereoscopic images without reference through a convolutional neural network. Our method is based on two main tasks: the first task predicts naturalness analysis based features adapted to stereo images, while the second task predicts the quality of such images. The former, so-called auxiliary task, aims to find more robust and relevant features to improve the quality prediction. To do this, we compute naturalness-based features using a Natural Scene Statistics (NSS) model in the complex wavelet domain. It allows to capture the statistical dependency between pairs of the stereoscopic images. Experiments are conducted on the well known LIVE PHASE I and LIVE PHASE II databases. The results obtained show the relevance of our method when comparing with those of the state-of-the-art. Our code is available online on \url{https://github.com/Bourbia-Salima/multitask-cnn-nrsiqa_2021}.
    CReST: A Class-Rebalancing Self-Training Framework for Imbalanced Semi-Supervised Learning. (arXiv:2102.09559v2 [cs.CV] UPDATED)
    (2 min) Semi-supervised learning on class-imbalanced data, although a realistic problem, has been under studied. While existing semi-supervised learning (SSL) methods are known to perform poorly on minority classes, we find that they still generate high precision pseudo-labels on minority classes. By exploiting this property, in this work, we propose Class-Rebalancing Self-Training (CReST), a simple yet effective framework to improve existing SSL methods on class-imbalanced data. CReST iteratively retrains a baseline SSL model with a labeled set expanded by adding pseudo-labeled samples from an unlabeled set, where pseudo-labeled samples from minority classes are selected more frequently according to an estimated class distribution. We also propose a progressive distribution alignment to adaptively adjust the rebalancing strength dubbed CReST+. We show that CReST and CReST+ improve state-of-the-art SSL algorithms on various class-imbalanced datasets and consistently outperform other popular rebalancing methods. Code has been made available at https://github.com/google-research/crest.
    MoDist: Motion Distillation for Self-supervised Video Representation Learning. (arXiv:2106.09703v1 [cs.CV])
    (2 min) We present MoDist as a novel method to explicitly distill motion information into self-supervised video representations. Compared to previous video representation learning methods that mostly focus on learning motion cues implicitly from RGB inputs, we show that the representation learned with our MoDist method focus more on foreground motion regions and thus generalizes better to downstream tasks. To achieve this, MoDist enriches standard contrastive learning objectives for RGB video clips with a cross-modal learning objective between a Motion pathway and a Visual pathway. We evaluate MoDist on several datasets for both action recognition (UCF101/HMDB51/SSv2) as well as action detection (AVA), and demonstrate state-of-the-art self-supervised performance on all datasets. Furthermore, we show that MoDist representation can be as effective as (in some cases even better than) representations learned with full supervision. Given its simplicity, we hope MoDist could serve as a strong baseline for future research in self-supervised video representation learning.
    On Anytime Learning at Macroscale. (arXiv:2106.09563v1 [cs.LG])
    (2 min) Classical machine learning frameworks assume access to a possibly large dataset in order to train a predictive model. In many practical applications however, data does not arrive all at once, but in batches over time. This creates a natural trade-off between accuracy of a model and time to obtain such a model. A greedy predictor could produce non-trivial predictions by immediately training on batches as soon as these become available but, it may also make sub-optimal use of future data. On the other hand, a tardy predictor could wait for a long time to aggregate several batches into a larger dataset, but ultimately deliver a much better performance. In this work, we consider such a streaming learning setting, which we dub {\em anytime learning at macroscale} (ALMA). It is an instance of anytime learning applied not at the level of a single chunk of data, but at the level of the entire sequence of large batches. We first formalize this learning setting, we then introduce metrics to assess how well learners perform on the given task for a given memory and compute budget, and finally we test several baseline approaches on standard benchmarks repurposed for anytime learning at macroscale. The general finding is that bigger models always generalize better. In particular, it is important to grow model capacity over time if the initial model is relatively small. Moreover, updating the model at an intermediate rate strikes the best trade off between accuracy and time to obtain a useful predictor.
    Adversarial Attack Vulnerability of Medical Image Analysis Systems: Unexplored Factors. (arXiv:2006.06356v3 [cs.CR] UPDATED)
    (3 min) Adversarial attacks are considered a potentially serious security threat for machine learning systems. Medical image analysis (MedIA) systems have recently been argued to be vulnerable to adversarial attacks due to strong financial incentives and the associated technological infrastructure. In this paper, we study previously unexplored factors affecting adversarial attack vulnerability of deep learning MedIA systems in three medical domains: ophthalmology, radiology, and pathology. We focus on adversarial black-box settings, in which the attacker does not have full access to the target model and usually uses another model, commonly referred to as surrogate model, to craft adversarial examples. We consider this to be the most realistic scenario for MedIA systems. Firstly, we study the effect of weight initialization (ImageNet vs. random) on the transferability of adversarial attacks from the surrogate model to the target model. Secondly, we study the influence of differences in development data between target and surrogate models. We further study the interaction of weight initialization and data differences with differences in model architecture. All experiments were done with a perturbation degree tuned to ensure maximal transferability at minimal visual perceptibility of the attacks. Our experiments show that pre-training may dramatically increase the transferability of adversarial examples, even when the target and surrogate's architectures are different: the larger the performance gain using pre-training, the larger the transferability. Differences in the development data between target and surrogate models considerably decrease the performance of the attack; this decrease is further amplified by difference in the model architecture. We believe these factors should be considered when developing security-critical MedIA systems planned to be deployed in clinical practice.
    Causal Contextual Prediction for Learned Image Compression. (arXiv:2011.09704v4 [cs.CV] UPDATED)
    (2 min) Over the past several years, we have witnessed impressive progress in the field of learned image compression. Recent learned image codecs are commonly based on autoencoders, that first encode an image into low-dimensional latent representations and then decode them for reconstruction purposes. To capture spatial dependencies in the latent space, prior works exploit hyperprior and spatial context model to build an entropy model, which estimates the bit-rate for end-to-end rate-distortion optimization. However, such an entropy model is suboptimal from two aspects: (1) It fails to capture spatially global correlations among the latents. (2) Cross-channel relationships of the latents are still underexplored. In this paper, we propose the concept of separate entropy coding to leverage a serial decoding process for causal contextual entropy prediction in the latent space. A causal context model is proposed that separates the latents across channels and makes use of cross-channel relationships to generate highly informative contexts. Furthermore, we propose a causal global prediction model, which is able to find global reference points for accurate predictions of unknown points. Both these two models facilitate entropy estimation without the transmission of overhead. In addition, we further adopt a new separate attention module to build more powerful transform networks. Experimental results demonstrate that our full image compression model outperforms standard VVC/H.266 codec on Kodak dataset in terms of both PSNR and MS-SSIM, yielding the state-of-the-art rate-distortion performance.
    Beyond VQA: Generating Multi-word Answer and Rationale to Visual Questions. (arXiv:2010.12852v2 [cs.CV] UPDATED)
    (2 min) Visual Question Answering is a multi-modal task that aims to measure high-level visual understanding. Contemporary VQA models are restrictive in the sense that answers are obtained via classification over a limited vocabulary (in the case of open-ended VQA), or via classification over a set of multiple-choice-type answers. In this work, we present a completely generative formulation where a multi-word answer is generated for a visual query. To take this a step forward, we introduce a new task: ViQAR (Visual Question Answering and Reasoning), wherein a model must generate the complete answer and a rationale that seeks to justify the generated answer. We propose an end-to-end architecture to solve this task and describe how to evaluate it. We show that our model generates strong answers and rationales through qualitative and quantitative evaluation, as well as through a human Turing Test.
    Self-Supervised Multimodal Domino: in Search of Biomarkers for Alzheimer's Disease. (arXiv:2012.13623v4 [cs.LG] UPDATED)
    (2 min) Sensory input from multiple sources is crucial for robust and coherent human perception. Different sources contribute complementary explanatory factors. Similarly, research studies often collect multimodal imaging data, each of which can provide shared and unique information. This observation motivated the design of powerful multimodal self-supervised representation-learning algorithms. In this paper, we unify recent work on multimodal self-supervised learning under a single framework. Observing that most self-supervised methods optimize similarity metrics between a set of model components, we propose a taxonomy of all reasonable ways to organize this process. We first evaluate models on toy multimodal MNIST datasets and then apply them to a multimodal neuroimaging dataset with Alzheimer's disease patients. We find that (1) multimodal contrastive learning has significant benefits over its unimodal counterpart, (2) the specific composition of multiple contrastive objectives is critical to performance on a downstream task, (3) maximization of the similarity between representations has a regularizing effect on a neural network, which can sometimes lead to reduced downstream performance but still reveal multimodal relations. Results show that the proposed approach outperforms previous self-supervised encoder-decoder methods based on canonical correlation analysis (CCA) or the mixture-of-experts multimodal variational autoEncoder (MMVAE) on various datasets with a linear evaluation protocol. Importantly, we find a promising solution to uncover connections between modalities through a jointly shared subspace that can help advance work in our search for neuroimaging biomarkers.
    CMA-Net: A Cascaded Mutual Attention Network for Light Field Salient Object Detection. (arXiv:2105.00949v3 [cs.CV] UPDATED)
    (2 min) In the past few years, numerous deep learning methods have been proposed to address the task of segmenting salient objects from RGB images. However, these approaches depending on single modality fail to achieve the state-of-the-art performance on widely used light field salient object detection (SOD) datasets, which collect large-scale natural images and provide multiple modalities such as multi-view, micro-lens images and depth maps. Most recently proposed light field SOD methods have acquired improving detecting accuracy, yet still predict rough objects' structures and perform slow inference speed. To this end, we propose CMA-Net, which consists of two novel cascaded mutual attention modules aiming at fusing the high level features from the modalities of all-in-focus and depth. Our proposed CMA-Net outperforms 30 SOD methods (by a large margin) on two widely applied light field benchmark datasets. Besides, the proposed CMA-Net can run at a speed of 53 fps, thus being four times faster than the state-of-the-art multi-modal SOD methods. Extensive quantitative and qualitative experiments illustrate both the effectiveness and efficiency of our CMA-Net, inspiring future development of multi-modal learning for both the RGB-D and light field SOD.
    Invertible Concept-based Explanations for CNN Models with Non-negative Concept Activation Vectors. (arXiv:2006.15417v4 [cs.CV] UPDATED)
    (2 min) Convolutional neural network (CNN) models for computer vision are powerful but lack explainability in their most basic form. This deficiency remains a key challenge when applying CNNs in important domains. Recent work on explanations through feature importance of approximate linear models has moved from input-level features (pixels or segments) to features from mid-layer feature maps in the form of concept activation vectors (CAVs). CAVs contain concept-level information and could be learned via clustering. In this work, we rethink the ACE algorithm of Ghorbani et~al., proposing an alternative invertible concept-based explanation (ICE) framework to overcome its shortcomings. Based on the requirements of fidelity (approximate models to target models) and interpretability (being meaningful to people), we design measurements and evaluate a range of matrix factorization methods with our framework. We find that non-negative concept activation vectors (NCAVs) from non-negative matrix factorization provide superior performance in interpretability and fidelity based on computational and human subject experiments. Our framework provides both local and global concept-level explanations for pre-trained CNN models.
    Transductive Few-Shot Learning: Clustering is All You Need?. (arXiv:2106.09516v1 [cs.LG])
    (2 min) We investigate a general formulation for clustering and transductive few-shot learning, which integrates prototype-based objectives, Laplacian regularization and supervision constraints from a few labeled data points. We propose a concave-convex relaxation of the problem, and derive a computationally efficient block-coordinate bound optimizer, with convergence guarantee. At each iteration,our optimizer computes independent (parallel) updates for each point-to-cluster assignment. Therefore, it could be trivially distributed for large-scale clustering and few-shot tasks. Furthermore, we provides a thorough convergence analysis based on point-to-set maps. Were port comprehensive clustering and few-shot learning experiments over various data sets, showing that our method yields competitive performances, in term of accuracy and optimization quality, while scaling up to large problems. Using standard training on the base classes, without resorting to complex meta-learning and episodic-training strategies, our approach outperforms state-of-the-art few-shot methods by significant margins, across various models, settings and data sets. Surprisingly, we found that even standard clustering procedures (e.g., K-means), which correspond to particular, non-regularized cases of our general model, already achieve competitive performances in comparison to the state-of-the-art in few-shot learning. These surprising results point to the limitations of the current few-shot benchmarks, and question the viability of a large body of convoluted few-shot learning techniques in the recent literature.
    SECANT: Self-Expert Cloning for Zero-Shot Generalization of Visual Policies. (arXiv:2106.09678v1 [cs.LG])
    (2 min) Generalization has been a long-standing challenge for reinforcement learning (RL). Visual RL, in particular, can be easily distracted by irrelevant factors in high-dimensional observation space. In this work, we consider robust policy learning which targets zero-shot generalization to unseen visual environments with large distributional shift. We propose SECANT, a novel self-expert cloning technique that leverages image augmentation in two stages to decouple robust representation learning from policy optimization. Specifically, an expert policy is first trained by RL from scratch with weak augmentations. A student network then learns to mimic the expert policy by supervised learning with strong augmentations, making its representation more robust against visual variations compared to the expert. Extensive experiments demonstrate that SECANT significantly advances the state of the art in zero-shot generalization across 4 challenging domains. Our average reward improvements over prior SOTAs are: DeepMind Control (+26.5%), robotic manipulation (+337.8%), vision-based autonomous driving (+47.7%), and indoor object navigation (+15.8%). Code release and video are available at https://linxifan.github.io/secant-site/.
    BABEL: Bodies, Action and Behavior with English Labels. (arXiv:2106.09696v1 [cs.CV])
    (2 min) Understanding the semantics of human movement -- the what, how and why of the movement -- is an important problem that requires datasets of human actions with semantic labels. Existing datasets take one of two approaches. Large-scale video datasets contain many action labels but do not contain ground-truth 3D human motion. Alternatively, motion-capture (mocap) datasets have precise body motions but are limited to a small number of actions. To address this, we present BABEL, a large dataset with language labels describing the actions being performed in mocap sequences. BABEL consists of action labels for about 43 hours of mocap sequences from AMASS. Action labels are at two levels of abstraction -- sequence labels describe the overall action in the sequence, and frame labels describe all actions in every frame of the sequence. Each frame label is precisely aligned with the duration of the corresponding action in the mocap sequence, and multiple actions can overlap. There are over 28k sequence labels, and 63k frame labels in BABEL, which belong to over 250 unique action categories. Labels from BABEL can be leveraged for tasks like action recognition, temporal action localization, motion synthesis, etc. To demonstrate the value of BABEL as a benchmark, we evaluate the performance of models on 3D action recognition. We demonstrate that BABEL poses interesting learning challenges that are applicable to real-world scenarios, and can serve as a useful benchmark of progress in 3D action recognition. The dataset, baseline method, and evaluation code is made available, and supported for academic research purposes at https://babel.is.tue.mpg.de/.
    Indian Masked Faces in the Wild Dataset. (arXiv:2106.09670v1 [cs.CV])
    (2 min) Due to the COVID-19 pandemic, wearing face masks has become a mandate in public places worldwide. Face masks occlude a significant portion of the facial region. Additionally, people wear different types of masks, from simple ones to ones with graphics and prints. These pose new challenges to face recognition algorithms. Researchers have recently proposed a few masked face datasets for designing algorithms to overcome the challenges of masked face recognition. However, existing datasets lack the cultural diversity and collection in the unrestricted settings. Country like India with attire diversity, people are not limited to wearing traditional masks but also clothing like a thin cotton printed towel (locally called as ``gamcha''), ``stoles'', and ``handkerchiefs'' to cover their faces. In this paper, we present a novel \textbf{Indian Masked Faces in the Wild (IMFW)} dataset which contains images with variations in pose, illumination, resolution, and the variety of masks worn by the subjects. We have also benchmarked the performance of existing face recognition models on the proposed IMFW dataset. Experimental results demonstrate the limitations of existing algorithms in presence of diverse conditions.
    Semi-Autoregressive Transformer for Image Captioning. (arXiv:2106.09436v1 [cs.CV])
    (2 min) Current state-of-the-art image captioning models adopt autoregressive decoders, \ie they generate each word by conditioning on previously generated words, which leads to heavy latency during inference. To tackle this issue, non-autoregressive image captioning models have recently been proposed to significantly accelerate the speed of inference by generating all words in parallel. However, these non-autoregressive models inevitably suffer from large generation quality degradation since they remove words dependence excessively. To make a better trade-off between speed and quality, we introduce a semi-autoregressive model for image captioning~(dubbed as SATIC), which keeps the autoregressive property in global but generates words parallelly in local. Based on Transformer, there are only a few modifications needed to implement SATIC. Extensive experiments on the MSCOCO image captioning benchmark show that SATIC can achieve a better trade-off without bells and whistles. Code is available at {\color{magenta}\url{https://github.com/YuanEZhou/satic}}.
    FG-Net: Fast Large-Scale LiDAR Point Clouds Understanding Network Leveraging Correlated Feature Mining and Geometric-Aware Modelling. (arXiv:2012.09439v2 [cs.CV] UPDATED)
    (2 min) This work presents FG-Net, a general deep learning framework for large-scale point clouds understanding without voxelizations, which achieves accurate and real-time performance with a single NVIDIA GTX 1080 GPU. First, a novel noise and outlier filtering method is designed to facilitate subsequent high-level tasks. For effective understanding purpose, we propose a deep convolutional neural network leveraging correlated feature mining and deformable convolution based geometric-aware modelling, in which the local feature relationships and geometric patterns can be fully exploited. For the efficiency issue, we put forward an inverse density sampling operation and a feature pyramid based residual learning strategy to save the computational cost and memory consumption respectively. Extensive experiments on real-world challenging datasets demonstrated that our approaches outperform state-of-the-art approaches in terms of accuracy and efficiency. Moreover, weakly supervised transfer learning is also conducted to demonstrate the generalization capacity of our method.
    Machine-learning enhanced dark soliton detection in Bose-Einstein condensates. (arXiv:2101.05404v2 [cond-mat.quant-gas] UPDATED)
    (2 min) Most data in cold-atom experiments comes from images, the analysis of which is limited by our preconceptions of the patterns that could be present in the data. We focus on the well-defined case of detecting dark solitons -- appearing as local density depletions in a Bose-Einstein condensate (BEC) -- using a methodology that is extensible to the general task of pattern recognition in images of cold atoms. Studying soliton dynamics over a wide range of parameters requires the analysis of large datasets, making the existing human-inspection-based methodology a significant bottleneck. Here we describe an automated classification and positioning system for identifying localized excitations in atomic BECs utilizing deep convolutional neural networks to eliminate the need for human image examination. Furthermore, we openly publish our labeled dataset of dark solitons, the first of its kind, for further machine learning research.
    Latent Correlation-Based Multiview Learning and Self-Supervision: A Unifying Perspective. (arXiv:2106.07115v2 [cs.LG] UPDATED)
    (2 min) Multiple views of data, both naturally acquired (e.g., image and audio) and artificially produced (e.g., via adding different noise to data samples), have proven useful in enhancing representation learning. Natural views are often handled by multiview analysis tools, e.g., (deep) canonical correlation analysis [(D)CCA], while the artificial ones are frequently used in self-supervised learning (SSL) paradigms, e.g., SimCLR and Barlow Twins. Both types of approaches often involve learning neural feature extractors such that the embeddings of data exhibit high cross-view correlations. Although intuitive, the effectiveness of correlation-based neural embedding is only empirically validated. This work puts forth a theory-backed framework for unsupervised multiview learning. Our development starts with proposing a multiview model, where each view is a nonlinear mixture of shared and private components. Consequently, the learning problem boils down to shared/private component identification and disentanglement. Under this model, latent correlation maximization is shown to guarantee the extraction of the shared components across views (up to certain ambiguities). In addition, the private information in each view can be provably disentangled from the shared using proper regularization design. The method is tested on a series of tasks, e.g., downstream clustering, which all show promising performance. Our development also provides a unifying perspective for understanding various DCCA and SSL schemes.
    Rethinking and Designing a High-performing Automatic License Plate Recognition Approach. (arXiv:2011.14936v2 [cs.CV] UPDATED)
    (2 min) In this paper, we propose a real-time and accurate automatic license plate recognition (ALPR) approach. Our study illustrates the outstanding design of ALPR with four insights: (1) the resampling-based cascaded framework is beneficial to both speed and accuracy; (2) the highly efficient license plate recognition should abundant additional character segmentation and recurrent neural network (RNN), but adopt a plain convolutional neural network (CNN); (3) in the case of CNN, taking advantage of vertex information on license plates improves the recognition performance; and (4) the weight-sharing character classifier addresses the lack of training images in small-scale datasets. Based on these insights, we propose a novel ALPR approach, termed VSNet. Specifically, VSNet includes two CNNs, i.e., VertexNet for license plate detection and SCR-Net for license plate recognition, integrated in a resampling-based cascaded manner. In VertexNet, we propose an efficient integration block to extract the spatial features of license plates. With vertex supervisory information, we propose a vertex-estimation branch in VertexNet such that license plates can be rectified as the input images of SCR-Net. In SCR-Net, we introduce a horizontal encoding technique for left-to-right feature extraction and propose a weight-sharing classifier for character recognition. Experimental results show that the proposed VSNet outperforms state-of-the-art methods by more than 50% relative improvement on error rate, achieving > 99% recognition accuracy on CCPD and AOLP datasets with 149 FPS inference speed. Moreover, our method illustrates an outstanding generalization capability when evaluated on the unseen PKUData and CLPD datasets.
    Learning Personal Style from Few Examples. (arXiv:2105.14457v2 [cs.CV] UPDATED)
    (2 min) A key task in design work is grasping the client's implicit tastes. Designers often do this based on a set of examples from the client. However, recognizing a common pattern among many intertwining variables such as color, texture, and layout and synthesizing them into a composite preference can be challenging. In this paper, we leverage the pattern recognition capability of computational models to aid in this task. We offer a set of principles for computationally learning personal style. The principles are manifested in PseudoClient, a deep learning framework that learns a computational model for personal graphic design style from only a handful of examples. In several experiments, we found that PseudoClient achieves a 79.40% accuracy with only five positive and negative examples, outperforming several alternative methods. Finally, we discuss how PseudoClient can be utilized as a building block to support the development of future design applications.
    Improving Adversarial Transferability with Gradient Refining. (arXiv:2105.04834v2 [cs.CV] UPDATED)
    (2 min) Deep neural networks are vulnerable to adversarial examples, which are crafted by adding human-imperceptible perturbations to original images. Most existing adversarial attack methods achieve nearly 100% attack success rates under the white-box setting, but only achieve relatively low attack success rates under the black-box setting. To improve the transferability of adversarial examples for the black-box setting, several methods have been proposed, e.g., input diversity, translation-invariant attack, and momentum-based attack. In this paper, we propose a method named Gradient Refining, which can further improve the adversarial transferability by correcting useless gradients introduced by input diversity through multiple transformations. Our method is generally applicable to many gradient-based attack methods combined with input diversity. Extensive experiments are conducted on the ImageNet dataset and our method can achieve an average transfer success rate of 82.07% for three different models under single-model setting, which outperforms the other state-of-the-art methods by a large margin of 6.0% averagely. And we have applied the proposed method to the competition CVPR 2021 Unrestricted Adversarial Attacks on ImageNet organized by Alibaba and won the second place in attack success rates among 1558 teams.
    Stochastic Image-to-Video Synthesis using cINNs. (arXiv:2105.04551v2 [cs.CV] UPDATED)
    (2 min) Video understanding calls for a model to learn the characteristic interplay between static scene content and its dynamics: Given an image, the model must be able to predict a future progression of the portrayed scene and, conversely, a video should be explained in terms of its static image content and all the remaining characteristics not present in the initial frame. This naturally suggests a bijective mapping between the video domain and the static content as well as residual information. In contrast to common stochastic image-to-video synthesis, such a model does not merely generate arbitrary videos progressing the initial image. Given this image, it rather provides a one-to-one mapping between the residual vectors and the video with stochastic outcomes when sampling. The approach is naturally implemented using a conditional invertible neural network (cINN) that can explain videos by independently modelling static and other video characteristics, thus laying the basis for controlled video synthesis. Experiments on four diverse video datasets demonstrate the effectiveness of our approach in terms of both the quality and diversity of the synthesized results. Our project page is available at https://bit.ly/3t66bnU.
    Deep Subdomain Adaptation Network for Image Classification. (arXiv:2106.09388v1 [cs.CV])
    (2 min) For a target task where labeled data is unavailable, domain adaptation can transfer a learner from a different source domain. Previous deep domain adaptation methods mainly learn a global domain shift, i.e., align the global source and target distributions without considering the relationships between two subdomains within the same category of different domains, leading to unsatisfying transfer learning performance without capturing the fine-grained information. Recently, more and more researchers pay attention to Subdomain Adaptation which focuses on accurately aligning the distributions of the relevant subdomains. However, most of them are adversarial methods which contain several loss functions and converge slowly. Based on this, we present Deep Subdomain Adaptation Network (DSAN) which learns a transfer network by aligning the relevant subdomain distributions of domain-specific layer activations across different domains based on a local maximum mean discrepancy (LMMD). Our DSAN is very simple but effective which does not need adversarial training and converges fast. The adaptation can be achieved easily with most feed-forward network models by extending them with LMMD loss, which can be trained efficiently via back-propagation. Experiments demonstrate that DSAN can achieve remarkable results on both object recognition tasks and digit classification tasks. Our code will be available at: https://github.com/easezyc/deep-transfer-learning
    BinaryCoP: Binary Neural Network-based COVID-19 Face-Mask Wear and Positioning Predictor on Edge Devices. (arXiv:2102.03456v2 [cs.CV] UPDATED)
    (3 min) Face masks have long been used in many areas of everyday life to protect against the inhalation of hazardous fumes and particles. They also offer an effective solution in healthcare for bi-directional protection against air-borne diseases. Wearing and positioning the mask correctly is essential for its function. Convolutional neural networks (CNNs) offer an excellent solution for face recognition and classification of correct mask wearing and positioning. In the context of the ongoing COVID-19 pandemic, such algorithms can be used at entrances to corporate buildings, airports, shopping areas, and other indoor locations, to mitigate the spread of the virus. These application scenarios impose major challenges to the underlying compute platform. The inference hardware must be cheap, small and energy efficient, while providing sufficient memory and compute power to execute accurate CNNs at a reasonably low latency. To maintain data privacy of the public, all processing must remain on the edge-device, without any communication with cloud servers. To address these challenges, we present a low-power binary neural network classifier for correct facial-mask wear and positioning. The classification task is implemented on an embedded FPGA, performing high-throughput binary operations. Classification can take place at up to ~6400 frames-per-second, easily enabling multi-camera, speed-gate settings or statistics collection in crowd settings. When deployed on a single entrance or gate, the idle power consumption is reduced to 1.6W, improving the battery-life of the device. We achieve an accuracy of up to 98% for four wearing positions of the MaskedFace-Net dataset. To maintain equivalent classification accuracy for all face structures, skin-tones, hair types, and mask types, the algorithms are tested for their ability to generalize the relevant features over all subjects using the Grad-CAM approach.
    JOKR: Joint Keypoint Representation for Unsupervised Cross-Domain Motion Retargeting. (arXiv:2106.09679v1 [cs.CV])
    (2 min) The task of unsupervised motion retargeting in videos has seen substantial advancements through the use of deep neural networks. While early works concentrated on specific object priors such as a human face or body, recent work considered the unsupervised case. When the source and target videos, however, are of different shapes, current methods fail. To alleviate this problem, we introduce JOKR - a JOint Keypoint Representation that captures the motion common to both the source and target videos, without requiring any object prior or data collection. By employing a domain confusion term, we enforce the unsupervised keypoint representations of both videos to be indistinguishable. This encourages disentanglement between the parts of the motion that are common to the two domains, and their distinctive appearance and motion, enabling the generation of videos that capture the motion of the one while depicting the style of the other. To enable cases where the objects are of different proportions or orientations, we apply a learned affine transformation between the JOKRs. This augments the representation to be affine invariant, and in practice broadens the variety of possible retargeting pairs. This geometry-driven representation enables further intuitive control, such as temporal coherence and manual editing. Through comprehensive experimentation, we demonstrate the applicability of our method to different challenging cross-domain video pairs. We evaluate our method both qualitatively and quantitatively, and demonstrate that our method handles various cross-domain scenarios, such as different animals, different flowers, and humans. We also demonstrate superior temporal coherency and visual quality compared to state-of-the-art alternatives, through statistical metrics and a user study. Source code and videos can be found at https://rmokady.github.io/JOKR/ .
    Knowledge distillation from multi-modal to mono-modal segmentation networks. (arXiv:2106.09564v1 [cs.CV])
    (2 min) The joint use of multiple imaging modalities for medical image segmentation has been widely studied in recent years. The fusion of information from different modalities has demonstrated to improve the segmentation accuracy, with respect to mono-modal segmentations, in several applications. However, acquiring multiple modalities is usually not possible in a clinical setting due to a limited number of physicians and scanners, and to limit costs and scan time. Most of the time, only one modality is acquired. In this paper, we propose KD-Net, a framework to transfer knowledge from a trained multi-modal network (teacher) to a mono-modal one (student). The proposed method is an adaptation of the generalized distillation framework where the student network is trained on a subset (1 modality) of the teacher's inputs (n modalities). We illustrate the effectiveness of the proposed framework in brain tumor segmentation with the BraTS 2018 dataset. Using different architectures, we show that the student network effectively learns from the teacher and always outperforms the baseline mono-modal network in terms of segmentation accuracy.
    Privacy-Preserving Eye-tracking Using Deep Learning. (arXiv:2106.09621v1 [cs.CV])
    (2 min) The expanding usage of complex machine learning methods like deep learning has led to an explosion in human activity recognition, particularly applied to health. In particular, as part of a larger body sensor network system, face and full-body analysis is becoming increasingly common for evaluating health status. However, complex models which handle private and sometimes protected data, raise concerns about the potential leak of identifiable data. In this work, we focus on the case of a deep network model trained on images of individual faces. Full-face video recordings taken from 493 individuals undergoing an eye-tracking based evaluation of neurological function were used. Outputs, gradients, intermediate layer outputs, loss, and labels were used as inputs for a deep network with an added support vector machine emission layer to recognize membership in the training data. The inference attack method and associated mathematical analysis indicate that there is a low likelihood of unintended memorization of facial features in the deep learning model. In this study, it is showed that the named model preserves the integrity of training data with reasonable confidence. The same process can be implemented in similar conditions for different models.
    DMN4: Few-shot Learning via Discriminative Mutual Nearest Neighbor Neural Network. (arXiv:2103.08160v2 [cs.CV] UPDATED)
    (2 min) Few-shot learning (FSL) aims to classify images under low-data regimes, where the conventional pooled global representation is likely to lose useful local characteristics. Recent work has achieved promising performances by using deep descriptors. They generally take all deep descriptors from neural networks into consideration while ignoring that some of them are useless in classification due to their limited receptive field, e.g., task-irrelevant descriptors could be misleading and multiple aggregative descriptors from background clutter could even overwhelm the object's presence. In this paper, we argue that a Mutual Nearest Neighbor (MNN) relation should be established to explicitly select the query descriptors that are most relevant to each task and discard less relevant ones from aggregative clutters in FSL. Specifically, we propose Discriminative Mutual Nearest Neighbor Neural Network (DMN4) for FSL. Extensive experiments demonstrate that our method not only qualitatively selects task-relevant descriptors but also quantitatively outperforms the existing state-of-the-arts by a large margin of 1.8~4.9% on fine-grained CUB, a considerable margin of 1.4~2.2% on both supervised and semi-supervised miniImagenet, and ~1.4% on challenging tieredimagenet.
    "What's This?" -- Learning to Segment Unknown Objects from Manipulation Sequences. (arXiv:2011.03279v2 [cs.CV] UPDATED)
    (2 min) We present a novel framework for self-supervised grasped object segmentation with a robotic manipulator. Our method successively learns an agnostic foreground segmentation followed by a distinction between manipulator and object solely by observing the motion between consecutive RGB frames. In contrast to previous approaches, we propose a single, end-to-end trainable architecture which jointly incorporates motion cues and semantic knowledge. Furthermore, while the motion of the manipulator and the object are substantial cues for our algorithm, we present means to robustly deal with distraction objects moving in the background, as well as with completely static scenes. Our method neither depends on any visual registration of a kinematic robot or 3D object models, nor on precise hand-eye calibration or any additional sensor data. By extensive experimental evaluation we demonstrate the superiority of our framework and provide detailed insights on its capability of dealing with the aforementioned extreme cases of motion. We also show that training a semantic segmentation network with the automatically labeled data achieves results on par with manually annotated training data. Code and pretrained model are available at https://github.com/DLR-RM/DistinctNet.
    Autobots: Latent Variable Sequential Set Transformers. (arXiv:2104.00563v2 [cs.RO] UPDATED)
    (2 min) Robust multi-agent trajectory prediction is essential for the safe control of robots and vehicles that interact with humans. Many existing methods treat social and temporal information separately and therefore fall short of modelling the joint future trajectories of all agents in a socially consistent way. To address this, we propose a new class of Latent Variable Sequential Set Transformers which autoregressively model multi-agent trajectories. We refer to these architectures as "AutoBots". AutoBots model the contents of sets (e.g. representing the properties of agents in a scene) over time and employ multi-head self-attention blocks over these sequences of sets to encode the sociotemporal relationships between the different actors of a scene. This produces either the trajectory of one ego-agent or a distribution over the future trajectories for all agents under consideration. Our approach works for general sequences of sets and we provide illustrative experiments modelling the sequential structure of the multiple strokes that make up symbols in the Omniglot data. For the single-agent prediction case, we validate our model on the NuScenes motion prediction task and achieve competitive results on the global leaderboard. In the multi-agent forecasting setting, we validate our model on TrajNet. We find that our method outperforms physical extrapolation and recurrent network baselines and generates scene-consistent trajectories.
    IFCNet: A Benchmark Dataset for IFC Entity Classification. (arXiv:2106.09712v1 [cs.CV])
    (2 min) Enhancing interoperability and information exchange between domain-specific software products for BIM is an important aspect in the Architecture, Engineering, Construction and Operations industry. Recent research started investigating methods from the areas of machine and deep learning for semantic enrichment of BIM models. However, training and evaluation of these machine learning algorithms requires sufficiently large and comprehensive datasets. This work presents IFCNet, a dataset of single-entity IFC files spanning a broad range of IFC classes containing both geometric and semantic information. Using only the geometric information of objects, the experiments show that three different deep learning models are able to achieve good classification performance.
    Learning to Recommend Frame for Interactive Video Object Segmentation in the Wild. (arXiv:2103.10391v2 [cs.CV] UPDATED)
    (2 min) This paper proposes a framework for the interactive video object segmentation (VOS) in the wild where users can choose some frames for annotations iteratively. Then, based on the user annotations, a segmentation algorithm refines the masks. The previous interactive VOS paradigm selects the frame with some worst evaluation metric, and the ground truth is required for calculating the evaluation metric, which is impractical in the testing phase. In contrast, in this paper, we advocate that the frame with the worst evaluation metric may not be exactly the most valuable frame that leads to the most performance improvement across the video. Thus, we formulate the frame selection problem in the interactive VOS as a Markov Decision Process, where an agent is learned to recommend the frame under a deep reinforcement learning framework. The learned agent can automatically determine the most valuable frame, making the interactive setting more practical in the wild. Experimental results on the public datasets show the effectiveness of our learned agent without any changes to the underlying VOS algorithms. Our data, code, and models are available at https://github.com/svip-lab/IVOS-W.
    Just How Toxic is Data Poisoning? A Unified Benchmark for Backdoor and Data Poisoning Attacks. (arXiv:2006.12557v3 [cs.LG] UPDATED)
    (2 min) Data poisoning and backdoor attacks manipulate training data in order to cause models to fail during inference. A recent survey of industry practitioners found that data poisoning is the number one concern among threats ranging from model stealing to adversarial attacks. However, it remains unclear exactly how dangerous poisoning methods are and which ones are more effective considering that these methods, even ones with identical objectives, have not been tested in consistent or realistic settings. We observe that data poisoning and backdoor attacks are highly sensitive to variations in the testing setup. Moreover, we find that existing methods may not generalize to realistic settings. While these existing works serve as valuable prototypes for data poisoning, we apply rigorous tests to determine the extent to which we should fear them. In order to promote fair comparison in future work, we develop standardized benchmarks for data poisoning and backdoor attacks.
    Semantic-guided Automatic Natural Image Matting with Light-weight Non-local Attention. (arXiv:2103.17020v2 [cs.CV] UPDATED)
    (2 min) Natural image matting aims to precisely separate foreground objects from background using alpha matte. Fully automatic natural image matting without external annotation is quite challenging. Well-performed matting methods usually require accurate labor-intensive handcrafted trimap as extra input, while the performance of automatic trimap generation method of dilating foreground segmentation fluctuates with segmentation quality. Therefore, we argue that how to handle trade-off of additional information input is a major issue in automatic matting. This paper presents a universal semantic-guided automatic natural image matting pipeline with light-weight non-local attention without trimap and background image as input. Specifically, guided by semantic information of coarse foreground segmentation, Trimap Generation Network estimates accurate trimap. With estimated trimap and RGB image as input, our light-weight Non-local Matting Network with Refinement produces final alpha matte, whose trimap-guided global aggregation attention block is equipped with stride downsampling convolution, reducing computation complexity and promoting performance. Experimental results show that our matting algorithm has competitive performance with current state-of-the-art methods in both trimap-free and trimap-needed aspects.
    Normalization of breast MRIs using Cycle-Consistent Generative Adversarial Networks. (arXiv:1912.08061v2 [eess.IV] UPDATED)
    (3 min) Dynamic Contrast Enhanced-Magnetic Resonance Imaging (DCE-MRI) is widely used to complement ultrasound examinations and x-ray mammography during the early detection and diagnosis of breast cancer. However, images generated by various MRI scanners (e.g. GE Healthcare vs Siemens) differ both in intensity and noise distribution, preventing algorithms trained on MRIs from one scanner to generalize to data from other scanners successfully. We propose a method for image normalization to solve this problem. MRI normalization is challenging because it requires both normalizing intensity values and mapping between the noise distributions of different scanners. We utilize a cycle-consistent generative adversarial network to learn a bidirectional mapping between MRIs produced by GE Healthcare and Siemens scanners. This allows us learning the mapping between two different scanner types without matched data, which is not commonly available. To ensure the preservation of breast shape and structures within the breast, we propose two technical innovations. First, we incorporate a mutual information loss with the CycleGAN architecture to ensure that the structure of the breast is maintained. Second, we propose a modified discriminator architecture which utilizes a smaller field-of-view to ensure the preservation of finer details in the breast tissue. Quantitative and qualitative evaluations show that the second proposed method was able to consistently preserve a high level of detail in the breast structure while also performing the proper intensity normalization and noise mapping. Our results demonstrate that the proposed model can successfully learn a bidirectional mapping between MRIs produced by different vendors, potentially enabling improved accuracy of downstream computational algorithms for diagnosis and detection of breast cancer. All the data used in this study are publicly available.
    Seesaw Loss for Long-Tailed Instance Segmentation. (arXiv:2008.10032v4 [cs.CV] UPDATED)
    (2 min) Instance segmentation has witnessed a remarkable progress on class-balanced benchmarks. However, they fail to perform as accurately in real-world scenarios, where the category distribution of objects naturally comes with a long tail. Instances of head classes dominate a long-tailed dataset and they serve as negative samples of tail categories. The overwhelming gradients of negative samples on tail classes lead to a biased learning process for classifiers. Consequently, objects of tail categories are more likely to be misclassified as backgrounds or head categories. To tackle this problem, we propose Seesaw Loss to dynamically re-balance gradients of positive and negative samples for each category, with two complementary factors, i.e., mitigation factor and compensation factor. The mitigation factor reduces punishments to tail categories w.r.t. the ratio of cumulative training instances between different categories. Meanwhile, the compensation factor increases the penalty of misclassified instances to avoid false positives of tail categories. We conduct extensive experiments on Seesaw Loss with mainstream frameworks and different data sampling strategies. With a simple end-to-end training pipeline, Seesaw Loss obtains significant gains over Cross-Entropy Loss, and achieves state-of-the-art performance on LVIS dataset without bells and whistles. Code is available at https://github.com/open-mmlab/mmdetection.
    XCiT: Cross-Covariance Image Transformers. (arXiv:2106.09681v1 [cs.CV])
    (2 min) Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens ,i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
    How can we learn (more) from challenges? A statistical approach to driving future algorithm development. (arXiv:2106.09302v1 [cs.CV])
    (2 min) Challenges have become the state-of-the-art approach to benchmark image analysis algorithms in a comparative manner. While the validation on identical data sets was a great step forward, results analysis is often restricted to pure ranking tables, leaving relevant questions unanswered. Specifically, little effort has been put into the systematic investigation on what characterizes images in which state-of-the-art algorithms fail. To address this gap in the literature, we (1) present a statistical framework for learning from challenges and (2) instantiate it for the specific task of instrument instance segmentation in laparoscopic videos. Our framework relies on the semantic meta data annotation of images, which serves as foundation for a General Linear Mixed Models (GLMM) analysis. Based on 51,542 meta data annotations performed on 2,728 images, we applied our approach to the results of the Robust Medical Instrument Segmentation Challenge (ROBUST-MIS) challenge 2019 and revealed underexposure, motion and occlusion of instruments as well as the presence of smoke or other objects in the background as major sources of algorithm failure. Our subsequent method development, tailored to the specific remaining issues, yielded a deep learning model with state-of-the-art overall performance and specific strengths in the processing of images in which previous methods tended to fail. Due to the objectivity and generic applicability of our approach, it could become a valuable tool for validation in the field of medical image analysis and beyond. and segmentation of small, crossing, moving and transparent instrument(s) (parts).
    Improving On-Screen Sound Separation for Open Domain Videos with Audio-Visual Self-attention. (arXiv:2106.09669v1 [cs.SD])
    (2 min) We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous work on audiovisual on-screen sound separation, including the simplicity and coarse resolution of spatio-temporal attention, and poor convergence of the audio separation model. Our proposed model addresses these issues using cross-modal and self-attention modules that capture audio-visual dependencies at a finer resolution over time, and by unsupervised pre-training of audio separation model. These improvements allow the model to generalize to a much wider set of unseen videos. For evaluation and semi-supervised training, we collected human annotations of on-screen audio from a large database of in-the-wild videos (YFCC100M). Our results show marked improvements in on-screen separation performance, in more general conditions than previous methods.
    Orthogonal-Pad\'e Activation Functions: Trainable Activation functions for smooth and faster convergence in deep networks. (arXiv:2106.09693v1 [cs.NE])
    (2 min) We have proposed orthogonal-Pad\'e activation functions, which are trainable activation functions and show that they have faster learning capability and improves the accuracy in standard deep learning datasets and models. Based on our experiments, we have found two best candidates out of six orthogonal-Pad\'e activations, which we call safe Hermite-Pade (HP) activation functions, namely HP-1 and HP-2. When compared to ReLU, HP-1 and HP-2 has an increment in top-1 accuracy by 5.06% and 4.63% respectively in PreActResNet-34, by 3.02% and 2.75% respectively in MobileNet V2 model on CIFAR100 dataset while on CIFAR10 dataset top-1 accuracy increases by 2.02% and 1.78% respectively in PreActResNet-34, by 2.24% and 2.06% respectively in LeNet, by 2.15% and 2.03% respectively in Efficientnet B0.
    To fit or not to fit: Model-based Face Reconstruction and Occlusion Segmentation from Weak Supervision. (arXiv:2106.09614v1 [cs.CV])
    (2 min) 3D face reconstruction from a single image is challenging due to its ill-posed nature. Model-based face autoencoders address this issue effectively by fitting a face model to the target image in a weakly supervised manner. However, in unconstrained environments occlusions distort the face reconstruction because the model often erroneously tries to adapt to occluded face regions. Supervised occlusion segmentation is a viable solution to avoid the fitting of occluded face regions, but it requires a large amount of annotated training data. In this work, we enable model-based face autoencoders to segment occluders accurately without requiring any additional supervision during training, and this separates regions where the model will be fitted from those where it will not be fitted. To achieve this, we extend face autoencoders with a segmentation network. The segmentation network decides which regions the model should adapt to by reaching balances in a trade-off between including pixels and adapting the model to them, and excluding pixels so that the model fitting is not negatively affected and reaches higher overall reconstruction accuracy on pixels showing the face. This leads to a synergistic effect, in which the occlusion segmentation guides the training of the face autoencoder to constrain the fitting in the non-occluded regions, while the improved fitting enables the segmentation model to better predict the occluded face regions. Qualitative and quantitative experiments on the CelebA-HQ database and the AR database verify the effectiveness of our model in improving 3D face reconstruction under occlusions and in enabling accurate occlusion segmentation from weak supervision only. Code available at https://github.com/unibas-gravis/Occlusion-Robust-MoFA.
    Learning Perceptual Manifold of Fonts. (arXiv:2106.09198v1 [cs.GR])
    (2 min) Along the rapid development of deep learning techniques in generative models, it is becoming an urgent issue to combine machine intelligence with human intelligence to solve the practical applications. Motivated by this methodology, this work aims to adjust the machine generated character fonts with the effort of human workers in the perception study. Although numerous fonts are available online for public usage, it is difficult and challenging to generate and explore a font to meet the preferences for common users. To solve the specific issue, we propose the perceptual manifold of fonts to visualize the perceptual adjustment in the latent space of a generative model of fonts. In our framework, we adopt the variational autoencoder network for the font generation. Then, we conduct a perceptual study on the generated fonts from the multi-dimensional latent space of the generative model. After we obtained the distribution data of specific preferences, we utilize manifold learning approach to visualize the font distribution. In contrast to the conventional user interface in our user study, the proposed font-exploring user interface is efficient and helpful in the designated user preference.
    Multi-Label Learning from Single Positive Labels. (arXiv:2106.09708v1 [cs.CV])
    (2 min) Predicting all applicable labels for a given image is known as multi-label classification. Compared to the standard multi-class case (where each image has only one label), it is considerably more challenging to annotate training data for multi-label classification. When the number of potential labels is large, human annotators find it difficult to mention all applicable labels for each training image. Furthermore, in some settings detection is intrinsically difficult e.g. finding small object instances in high resolution images. As a result, multi-label training data is often plagued by false negatives. We consider the hardest version of this problem, where annotators provide only one relevant label for each image. As a result, training sets will have only one positive label per image and no confirmed negatives. We explore this special case of learning from missing labels across four different multi-label image classification datasets for both linear classifiers and end-to-end fine-tuned deep networks. We extend existing multi-label losses to this setting and propose novel variants that constrain the number of expected positive labels during training. Surprisingly, we show that in some cases it is possible to approach the performance of fully labeled classifiers despite training with significantly fewer confirmed labels.
    Controllable Confidence-Based Image Denoising. (arXiv:2106.09311v1 [eess.IV])
    (2 min) Image denoising is a classic restoration problem. Yet, current deep learning methods are subject to the problems of generalization and interpretability. To mitigate these problems, in this project, we present a framework that is capable of controllable, confidence-based noise removal. The framework is based on the fusion between two different denoised images, both derived from the same noisy input. One of the two is denoised using generic algorithms (e.g. Gaussian), which make few assumptions on the input images, therefore, generalize in all scenarios. The other is denoised using deep learning, performing well on seen datasets. We introduce a set of techniques to fuse the two components smoothly in the frequency domain. Beyond that, we estimate the confidence of a deep learning denoiser to allow users to interpret the output, and provide a fusion strategy that safeguards them against out-of-distribution inputs. Through experiments, we demonstrate the effectiveness of the proposed framework in different use cases.
    On the Dark Side of Calibration for Modern Neural Networks. (arXiv:2106.09385v1 [cs.LG])
    (2 min) Modern neural networks are highly uncalibrated. It poses a significant challenge for safety-critical systems to utilise deep neural networks (DNNs), reliably. Many recently proposed approaches have demonstrated substantial progress in improving DNN calibration. However, they hardly touch upon refinement, which historically has been an essential aspect of calibration. Refinement indicates separability of a network's correct and incorrect predictions. This paper presents a theoretically and empirically supported exposition for reviewing a model's calibration and refinement. Firstly, we show the breakdown of expected calibration error (ECE), into predicted confidence and refinement. Connecting with this result, we highlight that regularisation based calibration only focuses on naively reducing a model's confidence. This logically has a severe downside to a model's refinement. We support our claims through rigorous empirical evaluations of many state of the art calibration approaches on standard datasets. We find that many calibration approaches with the likes of label smoothing, mixup etc. lower the utility of a DNN by degrading its refinement. Even under natural data shift, this calibration-refinement trade-off holds for the majority of calibration methods. These findings call for an urgent retrospective into some popular pathways taken for modern DNN calibration.
    using multiple losses for accurate facial age estimation. (arXiv:2106.09393v1 [cs.CV])
    (2 min) Age estimation is an essential challenge in computer vision. With the advances of convolutional neural networks, the performance of age estimation has been dramatically improved. Existing approaches usually treat age estimation as a classification problem. However, the age labels are ambiguous, thus make the classification task difficult. In this paper, we propose a simple yet effective approach for age estimation, which improves the performance compared to classification-based methods. The method combines four classification losses and one regression loss representing different class granularities together, and we name it as Age-Granularity-Net. We validate the Age-Granularity-Net framework on the CVPR Chalearn 2016 dataset, and extensive experiments show that the proposed approach can reduce the prediction error compared to any individual loss. The source code link is https://github.com/yipersevere/age-estimation.
    Learning Dexterous Grasping with Object-Centric Visual Affordances. (arXiv:2009.01439v2 [cs.RO] UPDATED)
    (2 min) Dexterous robotic hands are appealing for their agility and human-like morphology, yet their high degree of freedom makes learning to manipulate challenging. We introduce an approach for learning dexterous grasping. Our key idea is to embed an object-centric visual affordance model within a deep reinforcement learning loop to learn grasping policies that favor the same object regions favored by people. Unlike traditional approaches that learn from human demonstration trajectories (e.g., hand joint sequences captured with a glove), the proposed prior is object-centric and image-based, allowing the agent to anticipate useful affordance regions for objects unseen during policy learning. We demonstrate our idea with a 30-DoF five-fingered robotic hand simulator on 40 objects from two datasets, where it successfully and efficiently learns policies for stable functional grasps. Our affordance-guided policies are significantly more effective, generalize better to novel objects, train 3 X faster than the baselines, and are more robust to noisy sensor readings and actuation. Our work offers a step towards manipulation agents that learn by watching how people use objects, without requiring state and action information about the human body. Project website: this http URL
    Evaluating the Robustness of Bayesian Neural Networks Against Different Types of Attacks. (arXiv:2106.09223v1 [cs.LG])
    (2 min) To evaluate the robustness gain of Bayesian neural networks on image classification tasks, we perform input perturbations, and adversarial attacks to the state-of-the-art Bayesian neural networks, with a benchmark CNN model as reference. The attacks are selected to simulate signal interference and cyberattacks towards CNN-based machine learning systems. The result shows that a Bayesian neural network achieves significantly higher robustness against adversarial attacks generated against a deterministic neural network model, without adversarial training. The Bayesian posterior can act as the safety precursor of ongoing malicious activities. Furthermore, we show that the stochastic classifier after the deterministic CNN extractor has sufficient robustness enhancement rather than a stochastic feature extractor before the stochastic classifier. This advises on utilizing stochastic layers in building decision-making pipelines within a safety-critical domain.
    Adversarial Visual Robustness by Causal Intervention. (arXiv:2106.09534v1 [cs.CV])
    (2 min) Adversarial training is the de facto most promising defense against adversarial examples. Yet, its passive nature inevitably prevents it from being immune to unknown attackers. To achieve a proactive defense, we need a more fundamental understanding of adversarial examples, beyond the popular bounded threat model. In this paper, we provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning, where attackers are precisely exploiting the confounding effect. Therefore, a fundamental solution for adversarial robustness is causal intervention. As the confounder is unobserved in general, we propose to use the instrumental variable that achieves intervention without the need for confounder observation. We term our robust training method as Causal intervention by instrumental Variable (CiiV). It has a differentiable retinotopic sampling layer and a consistency loss, which is stable and guaranteed not to suffer from gradient obfuscation. Extensive experiments on a wide spectrum of attackers and settings applied in MNIST, CIFAR-10, and mini-ImageNet datasets empirically demonstrate that CiiV is robust to adaptive attacks.
    Learning to Predict Visual Attributes in the Wild. (arXiv:2106.09707v1 [cs.CV])
    (2 min) Visual attributes constitute a large portion of information contained in a scene. Objects can be described using a wide variety of attributes which portray their visual appearance (color, texture), geometry (shape, size, posture), and other intrinsic properties (state, action). Existing work is mostly limited to study of attribute prediction in specific domains. In this paper, we introduce a large-scale in-the-wild visual attribute prediction dataset consisting of over 927K attribute annotations for over 260K object instances. Formally, object attribute prediction is a multi-label classification problem where all attributes that apply to an object must be predicted. Our dataset poses significant challenges to existing methods due to large number of attributes, label sparsity, data imbalance, and object occlusion. To this end, we propose several techniques that systematically tackle these challenges, including a base model that utilizes both low- and high-level CNN features with multi-hop attention, reweighting and resampling techniques, a novel negative label expansion scheme, and a novel supervised attribute-aware contrastive learning algorithm. Using these techniques, we achieve near 3.7 mAP and 5.7 overall F1 points improvement over the current state of the art. Further details about the VAW dataset can be found at this http URL
    Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks. (arXiv:2004.05937v7 [cs.CV] UPDATED)
    (3 min) Deep neural models in recent years have been successful in almost every field, including extremely complex problem statements. However, these models are huge in size, with millions (and even billions) of parameters, thus demanding more heavy computation power and failing to be deployed on edge devices. Besides, the performance boost is highly dependent on redundant labeled data. To achieve faster speeds and to handle the problems caused by the lack of data, knowledge distillation (KD) has been proposed to transfer information learned from one model to another. KD is often characterized by the so-called `Student-Teacher' (S-T) learning framework and has been broadly applied in model compression and knowledge transfer. This paper is about KD and S-T learning, which are being actively studied in recent years. First, we aim to provide explanations of what KD is and how/why it works. Then, we provide a comprehensive survey on the recent progress of KD methods together with S-T frameworks typically for vision tasks. In general, we consider some fundamental questions that have been driving this research area and thoroughly generalize the research progress and technical details. Additionally, we systematically analyze the research status of KD in vision applications. Finally, we discuss the potentials and open challenges of existing methods and prospect the future directions of KD and S-T learning.
    Scaling-up Diverse Orthogonal Convolutional Networks with a Paraunitary Framework. (arXiv:2106.09121v1 [cs.LG])
    (2 min) Enforcing orthogonality in neural networks is an antidote for gradient vanishing/exploding problems, sensitivity by adversarial perturbation, and bounding generalization errors. However, many previous approaches are heuristic, and the orthogonality of convolutional layers is not systematically studied: some of these designs are not exactly orthogonal, while others only consider standard convolutional layers and propose specific classes of their realizations. To address this problem, we propose a theoretical framework for orthogonal convolutional layers, which establishes the equivalence between various orthogonal convolutional layers in the spatial domain and the paraunitary systems in the spectral domain. Since there exists a complete spectral factorization of paraunitary systems, any orthogonal convolution layer can be parameterized as convolutions of spatial filters. Our framework endows high expressive power to various convolutional layers while maintaining their exact orthogonality. Furthermore, our layers are memory and computationally efficient for deep networks compared to previous designs. Our versatile framework, for the first time, enables the study of architecture designs for deep orthogonal networks, such as choices of skip connection, initialization, stride, and dilation. Consequently, we scale up orthogonal networks to deep architectures, including ResNet, WideResNet, and ShuffleNet, substantially increasing the performance over the traditional shallow orthogonal networks.
    Automatic Segmentation of the Prostate on 3D Trans-rectal Ultrasound Images using Statistical Shape Models and Convolutional Neural Networks. (arXiv:2106.09662v1 [eess.IV])
    (2 min) In this work we propose to segment the prostate on a challenging dataset of trans-rectal ultrasound (TRUS) images using convolutional neural networks (CNNs) and statistical shape models (SSMs). TRUS is commonly used for a number of image-guided interventions on the prostate. Fast and accurate segmentation on the organ in these images is crucial to planning and fusion with other modalities such as magnetic resonance images (MRIs) . However, TRUS has limited soft tissue contrast and signal to noise ratio which makes the task of segmenting the prostate challenging and subject to inter-observer and intra-observer variability. This is especially problematic at the base and apex where the gland boundary is hard to define. In this paper, we aim to tackle this problem by taking advantage of shape priors learnt on an MR dataset which has higher soft tissue contrast allowing the prostate to be contoured more accurately. We use this shape prior in combination with a prostate tissue probability map computed by a CNN for segmentation.
    Class Balancing GAN with a Classifier in the Loop. (arXiv:2106.09402v1 [cs.LG])
    (2 min) Generative Adversarial Networks (GANs) have swiftly evolved to imitate increasingly complex image distributions. However, majority of the developments focus on performance of GANs on balanced datasets. We find that the existing GANs and their training regimes which work well on balanced datasets fail to be effective in case of imbalanced (i.e. long-tailed) datasets. In this work we introduce a novel theoretically motivated Class Balancing regularizer for training GANs. Our regularizer makes use of the knowledge from a pre-trained classifier to ensure balanced learning of all the classes in the dataset. This is achieved via modelling the effective class frequency based on the exponential forgetting observed in neural networks and encouraging the GAN to focus on underrepresented classes. We demonstrate the utility of our regularizer in learning representations for long-tailed distributions via achieving better performance than existing approaches over multiple datasets. Specifically, when applied to an unconditional GAN, it improves the FID from $13.03$ to $9.01$ on the long-tailed iNaturalist-$2019$ dataset.
    SIFT Matching by Context Exposed. (arXiv:2106.09584v1 [cs.CV])
    (2 min) This paper investigates how to step up local image descriptor matching by exploiting matching context information. Two main contexts are identified, originated respectively from the descriptor space and from the keypoint space. The former is generally used to design the actual matching strategy while the latter to filter matches according to the local spatial consistency. On this basis, a new matching strategy and a novel local spatial filter, named respectively blob matching and Delaunay Triangulation Matching (DTM) are devised. Blob matching provides a general matching framework by merging together several strategies, including pre-filtering as well as many-to-many and symmetric matching, enabling to achieve a global improvement upon each individual strategy. DTM alternates between Delaunay triangulation contractions and expansions to figure out and adjust keypoint neighborhood consistency. Experimental evaluation shows that DTM is comparable or better than the state-of-the-art in terms of matching accuracy and robustness, especially for non-planar scenes. Evaluation is carried out according to a new benchmark devised for analyzing the matching pipeline in terms of correct correspondences on both planar and non-planar scenes, including state-of-the-art methods as well as the common SIFT matching approach for reference. This evaluation can be of assistance for future research in this field.
    Wavelet-Packet Powered Deepfake Image Detection. (arXiv:2106.09369v1 [cs.CV])
    (2 min) As neural networks become more able to generate realistic artificial images, they have the potential to improve movies, music, video games and make the internet an even more creative and inspiring place. Yet, at the same time, the latest technology potentially enables new digital ways to lie. In response, the need for a diverse and reliable toolbox arises to identify artificial images and other content. Previous work primarily relies on pixel-space CNN or the Fourier transform. To the best of our knowledge, wavelet-based gan analysis and detection methods have been absent thus far. This paper aims to fill this gap and describes a wavelet-based approach to gan-generated image analysis and detection. We evaluate our method on FFHQ, CelebA, and LSUN source identification problems and find improved or competitive performance.
    Scale-Consistent Fusion: from Heterogeneous Local Sampling to Global Immersive Rendering. (arXiv:2106.09548v1 [cs.CV])
    (2 min) Image-based geometric modeling and novel view synthesis based on sparse, large-baseline samplings are challenging but important tasks for emerging multimedia applications such as virtual reality and immersive telepresence. Existing methods fail to produce satisfactory results due to the limitation on inferring reliable depth information over such challenging reference conditions. With the popularization of commercial light field (LF) cameras, capturing LF images (LFIs) is as convenient as taking regular photos, and geometry information can be reliably inferred. This inspires us to use a sparse set of LF captures to render high-quality novel views globally. However, fusion of LF captures from multiple angles is challenging due to the scale inconsistency caused by various capture settings. To overcome this challenge, we propose a novel scale-consistent volume rescaling algorithm that robustly aligns the disparity probability volumes (DPV) among different captures for scale-consistent global geometry fusion. Based on the fused DPV projected to the target camera frustum, novel learning-based modules have been proposed (i.e., the attention-guided multi-scale residual fusion module, and the disparity field guided deep re-regularization module) which comprehensively regularize noisy observations from heterogeneous captures for high-quality rendering of novel LFIs. Both quantitative and qualitative experiments over the Stanford Lytro Multi-view LF dataset show that the proposed method outperforms state-of-the-art methods significantly under different experiment settings for disparity inference and LF synthesis.
    ShuffleBlock: Shuffle to Regularize Deep Convolutional Neural Networks. (arXiv:2106.09358v1 [cs.CV])
    (2 min) Deep neural networks have enormous representational power which leads them to overfit on most datasets. Thus, regularizing them is important in order to reduce overfitting and enhance their generalization capabilities. Recently, channel shuffle operation has been introduced for mixing channels in group convolutions in resource efficient networks in order to reduce memory and computations. This paper studies the operation of channel shuffle as a regularization technique in deep convolutional networks. We show that while random shuffling of channels during training drastically reduce their performance, however, randomly shuffling small patches between channels significantly improves their performance. The patches to be shuffled are picked from the same spatial locations in the feature maps such that a patch, when transferred from one channel to another, acts as structured noise for the later channel. We call this method "ShuffleBlock". The proposed ShuffleBlock module is easy to implement and improves the performance of several baseline networks on the task of image classification on CIFAR and ImageNet datasets. It also achieves comparable and in many cases better performance than many other regularization methods. We provide several ablation studies on selecting various hyperparameters of the ShuffleBlock module and propose a new scheduling method that further enhances its performance.
    The 2021 Image Similarity Dataset and Challenge. (arXiv:2106.09672v1 [cs.CV])
    (2 min) This paper introduces a new benchmark for large-scale image similarity detection. This benchmark is used for the Image Similarity Challenge at NeurIPS'21 (ISC2021). The goal is to determine whether a query image is a modified copy of any image in a reference corpus of size 1~million. The benchmark features a variety of image transformations such as automated transformations, hand-crafted image edits and machine-learning based manipulations. This mimics real-life cases appearing in social media, for example for integrity-related problems dealing with misinformation and objectionable content. The strength of the image manipulations, and therefore the difficulty of the benchmark, is calibrated according to the performance of a set of baseline approaches. Both the query and reference set contain a majority of ``distractor'' images that do not match, which corresponds to a real-life needle-in-haystack setting, and the evaluation metric reflects that. We expect the DISC21 benchmark to promote image copy detection as an important and challenging computer vision task and refresh the state of the art.
    Multi-level Motion Attention for Human Motion Prediction. (arXiv:2106.09300v1 [cs.CV])
    (2 min) Human motion prediction aims to forecast future human poses given a historical motion. Whether based on recurrent or feed-forward neural networks, existing learning based methods fail to model the observation that human motion tends to repeat itself, even for complex sports actions and cooking activities. Here, we introduce an attention based feed-forward network that explicitly leverages this observation. In particular, instead of modeling frame-wise attention via pose similarity, we propose to extract motion attention to capture the similarity between the current motion context and the historical motion sub-sequences. In this context, we study the use of different types of attention, computed at joint, body part, and full pose levels. Aggregating the relevant past motions and processing the result with a graph convolutional network allows us to effectively exploit motion patterns from the long-term history to predict the future poses. Our experiments on Human3.6M, AMASS and 3DPW validate the benefits of our approach for both periodical and non-periodical actions. Thanks to our attention model, it yields state-of-the-art results on all three datasets. Our code is available at https://github.com/wei-mao-2019/HisRepItself.
    Localized Uncertainty Attacks. (arXiv:2106.09222v1 [stat.ML])
    (2 min) The susceptibility of deep learning models to adversarial perturbations has stirred renewed attention in adversarial examples resulting in a number of attacks. However, most of these attacks fail to encompass a large spectrum of adversarial perturbations that are imperceptible to humans. In this paper, we present localized uncertainty attacks, a novel class of threat models against deterministic and stochastic classifiers. Under this threat model, we create adversarial examples by perturbing only regions in the inputs where a classifier is uncertain. To find such regions, we utilize the predictive uncertainty of the classifier when the classifier is stochastic or, we learn a surrogate model to amortize the uncertainty when it is deterministic. Unlike $\ell_p$ ball or functional attacks which perturb inputs indiscriminately, our targeted changes can be less perceptible. When considered under our threat model, these attacks still produce strong adversarial examples; with the examples retaining a greater degree of similarity with the inputs.
    Always Be Dreaming: A New Approach for Data-Free Class-Incremental Learning. (arXiv:2106.09701v1 [cs.CV])
    (2 min) Modern computer vision applications suffer from catastrophic forgetting when incrementally learning new concepts over time. The most successful approaches to alleviate this forgetting require extensive replay of previously seen data, which is problematic when memory constraints or data legality concerns exist. In this work, we consider the high-impact problem of Data-Free Class-Incremental Learning (DFCIL), where an incremental learning agent must learn new concepts over time without storing generators or training data from past tasks. One approach for DFCIL is to replay synthetic images produced by inverting a frozen copy of the learner's classification model, but we show this approach fails for common class-incremental benchmarks when using standard distillation strategies. We diagnose the cause of this failure and propose a novel incremental distillation strategy for DFCIL, contributing a modified cross-entropy training and importance-weighted feature distillation, and show that our method results in up to a 25.1% increase in final task accuracy (absolute difference) compared to SOTA DFCIL methods for common class-incremental benchmarks. Our method even outperforms several standard replay based methods which store a coreset of images.
    Deep HDR Hallucination for Inverse Tone Mapping. (arXiv:2106.09486v1 [cs.CV])
    (2 min) Inverse Tone Mapping (ITM) methods attempt to reconstruct High Dynamic Range (HDR) information from Low Dynamic Range (LDR) image content. The dynamic range of well-exposed areas must be expanded and any missing information due to over/under-exposure must be recovered (hallucinated). The majority of methods focus on the former and are relatively successful, while most attempts on the latter are not of sufficient quality, even ones based on Convolutional Neural Networks (CNNs). A major factor for the reduced inpainting quality in some works is the choice of loss function. Work based on Generative Adversarial Networks (GANs) shows promising results for image synthesis and LDR inpainting, suggesting that GAN losses can improve inverse tone mapping results. This work presents a GAN-based method that hallucinates missing information from badly exposed areas in LDR images and compares its efficacy with alternative variations. The proposed method is quantitatively competitive with state-of-the-art inverse tone mapping methods, providing good dynamic range expansion for well-exposed areas and plausible hallucinations for saturated and under-exposed areas. A density-based normalisation method, targeted for HDR content, is also proposed, as well as an HDR data augmentation method targeted for HDR hallucination.
    Dynamic Knowledge Distillation with A Single Stream Structure for RGB-DSalient Object Detection. (arXiv:2106.09517v1 [cs.CV])
    (2 min) RGB-D salient object detection(SOD) demonstrates its superiority on detecting in complex environments due to the additional depth information introduced in the data. Inevitably, an independent stream is introduced to extract features from depth images, leading to extra computation and parameters. This methodology which sacrifices the model size to improve the detection accuracy may impede the practical application of SOD problems. To tackle this dilemma, we propose a dynamic distillation method along with a lightweight framework, which significantly reduces the parameters. This method considers the factors of both teacher and student performance within the training stage and dynamically assigns the distillation weight instead of applying a fixed weight on the student model. Extensive experiments are conducted on five public datasets to demonstrate that our method can achieve competitive performance compared to 10 prior methods through a 78.2MB lightweight structure.
    The Fishnet Open Images Database: A Dataset for Fish Detection and Fine-Grained Categorization in Fisheries. (arXiv:2106.09178v1 [cs.CV])
    (2 min) Camera-based electronic monitoring (EM) systems are increasingly being deployed onboard commercial fishing vessels to collect essential data for fisheries management and regulation. These systems generate large quantities of video data which must be reviewed on land by human experts. Computer vision can assist this process by automatically detecting and classifying fish species, however the lack of existing public data in this domain has hindered progress. To address this, we present the Fishnet Open Images Database, a large dataset of EM imagery for fish detection and fine-grained categorization onboard commercial fishing vessels. The dataset consists of 86,029 images containing 34 object classes, making it the largest and most diverse public dataset of fisheries EM imagery to-date. It includes many of the characteristic challenges of EM data: visual similarity between species, skewed class distributions, harsh weather conditions, and chaotic crew activity. We evaluate the performance of existing detection and classification algorithms and demonstrate that the dataset can serve as a challenging benchmark for development of computer vision algorithms in fisheries. The dataset is available at https://www.fishnet.ai/.
    Probing Image-Language Transformers for Verb Understanding. (arXiv:2106.09141v1 [cs.CL])
    (2 min) Multimodal image-language transformers have achieved impressive results on a variety of tasks that rely on fine-tuning (e.g., visual question answering and image retrieval). We are interested in shedding light on the quality of their pretrained representations -- in particular, if these models can distinguish different types of verbs or if they rely solely on nouns in a given sentence. To do so, we collect a dataset of image-sentence pairs (in English) consisting of 421 verbs that are either visual or commonly found in the pretraining data (i.e., the Conceptual Captions dataset). We use this dataset to evaluate pretrained image-language transformers and find that they fail more in situations that require verb understanding compared to other parts of speech. We also investigate what category of verbs are particularly challenging.
    Learning to Associate Every Segment for Video Panoptic Segmentation. (arXiv:2106.09453v1 [cs.CV])
    (2 min) Temporal correspondence - linking pixels or objects across frames - is a fundamental supervisory signal for the video models. For the panoptic understanding of dynamic scenes, we further extend this concept to every segment. Specifically, we aim to learn coarse segment-level matching and fine pixel-level matching together. We implement this idea by designing two novel learning objectives. To validate our proposals, we adopt a deep siamese model and train the model to learn the temporal correspondence on two different levels (i.e., segment and pixel) along with the target task. At inference time, the model processes each frame independently without any extra computation and post-processing. We show that our per-frame inference model can achieve new state-of-the-art results on Cityscapes-VPS and VIPER datasets. Moreover, due to its high efficiency, the model runs in a fraction of time (3x) compared to the previous state-of-the-art approach.
    THUNDR: Transformer-based 3D HUmaN Reconstruction with Markers. (arXiv:2106.09336v1 [cs.CV])
    (2 min) We present THUNDR, a transformer-based deep neural network methodology to reconstruct the 3d pose and shape of people, given monocular RGB images. Key to our methodology is an intermediate 3d marker representation, where we aim to combine the predictive power of model-free-output architectures and the regularizing, anthropometrically-preserving properties of a statistical human surface model like GHUM -- a recently introduced, expressive full body statistical 3d human model, trained end-to-end. Our novel transformer-based prediction pipeline can focus on image regions relevant to the task, supports self-supervised regimes, and ensures that solutions are consistent with human anthropometry. We show state-of-the-art results on Human3.6M and 3DPW, for both the fully-supervised and the self-supervised models, for the task of inferring 3d human shape, joint positions, and global translation. Moreover, we observe very solid 3d reconstruction performance for difficult human poses collected in the wild.
    Federated CycleGAN for Privacy-Preserving Image-to-Image Translation. (arXiv:2106.09246v1 [cs.CV])
    (2 min) Unsupervised image-to-image translation methods such as CycleGAN learn to convert images from one domain to another using unpaired training data sets from different domains. Unfortunately, these approaches still require centrally collected unpaired records, potentially violating privacy and security issues. Although the recent federated learning (FL) allows a neural network to be trained without data exchange, the basic assumption of the FL is that all clients have their own training data from a similar domain, which is different from our image-to-image translation scenario in which each client has images from its unique domain and the goal is to learn image translation between different domains without accessing the target domain data. To address this, here we propose a novel federated CycleGAN architecture that can learn image translation in an unsupervised manner while maintaining the data privacy. Specifically, our approach arises from a novel observation that CycleGAN loss can be decomposed into the sum of client specific local objectives that can be evaluated using only their data. This local objective decomposition allows multiple clients to participate in federated CycleGAN training without sacrificing performance. Furthermore, our method employs novel switchable generator and discriminator architecture using Adaptive Instance Normalization (AdaIN) that significantly reduces the band-width requirement of the federated learning. Our experimental results on various unsupervised image translation tasks show that our federated CycleGAN provides comparable performance compared to the non-federated counterpart.
    Regularization of Mixture Models for Robust Principal Graph Learning. (arXiv:2106.09035v1 [cs.LG])
    (2 min) A regularized version of Mixture Models is proposed to learn a principal graph from a distribution of $D$-dimensional data points. In the particular case of manifold learning for ridge detection, we assume that the underlying manifold can be modeled as a graph structure acting like a topological prior for the Gaussian clusters turning the problem into a maximum a posteriori estimation. Parameters of the model are iteratively estimated through an Expectation-Maximization procedure making the learning of the structure computationally efficient with guaranteed convergence for any graph prior in a polynomial time. We also embed in the formalism a natural way to make the algorithm robust to outliers of the pattern and heteroscedasticity of the manifold sampling coherently with the graph structure. The method uses a graph prior given by the minimum spanning tree that we extend using random sub-samplings of the dataset to take into account cycles that can be observed in the spatial distribution.
    Optical Mouse: 3D Mouse Pose From Single-View Video. (arXiv:2106.09251v1 [cs.CV])
    (2 min) We present a method to infer the 3D pose of mice, including the limbs and feet, from monocular videos. Many human clinical conditions and their corresponding animal models result in abnormal motion, and accurately measuring 3D motion at scale offers insights into health. The 3D poses improve classification of health-related attributes over 2D representations. The inferred poses are accurate enough to estimate stride length even when the feet are mostly occluded. This method could be applied as part of a continuous monitoring system to non-invasively measure animal health.
    A Two-stage Multi-modal Affect Analysis Framework for Children with Autism Spectrum Disorder. (arXiv:2106.09199v1 [cs.CV])
    (2 min) Autism spectrum disorder (ASD) is a developmental disorder that influences the communication and social behavior of a person in a way that those in the spectrum have difficulty in perceiving other people's facial expressions, as well as presenting and communicating emotions and affect via their own faces and bodies. Some efforts have been made to predict and improve children with ASD's affect states in play therapy, a common method to improve children's social skills via play and games. However, many previous works only used pre-trained models on benchmark emotion datasets and failed to consider the distinction in emotion between typically developing children and children with autism. In this paper, we present an open-source two-stage multi-modal approach leveraging acoustic and visual cues to predict three main affect states of children with ASD's affect states (positive, negative, and neutral) in real-world play therapy scenarios, and achieved an overall accuracy of 72:40%. This work presents a novel way to combine human expertise and machine intelligence for ASD affect recognition by proposing a two-stage schema.
    Invisible for both Camera and LiDAR: Security of Multi-Sensor Fusion based Perception in Autonomous Driving Under Physical-World Attacks. (arXiv:2106.09249v1 [cs.CR])
    (3 min) In Autonomous Driving (AD) systems, perception is both security and safety critical. Despite various prior studies on its security issues, all of them only consider attacks on camera- or LiDAR-based AD perception alone. However, production AD systems today predominantly adopt a Multi-Sensor Fusion (MSF) based design, which in principle can be more robust against these attacks under the assumption that not all fusion sources are (or can be) attacked at the same time. In this paper, we present the first study of security issues of MSF-based perception in AD systems. We directly challenge the basic MSF design assumption above by exploring the possibility of attacking all fusion sources simultaneously. This allows us for the first time to understand how much security guarantee MSF can fundamentally provide as a general defense strategy for AD perception. We formulate the attack as an optimization problem to generate a physically-realizable, adversarial 3D-printed object that misleads an AD system to fail in detecting it and thus crash into it. We propose a novel attack pipeline that addresses two main design challenges: (1) non-differentiable target camera and LiDAR sensing systems, and (2) non-differentiable cell-level aggregated features popularly used in LiDAR-based AD perception. We evaluate our attack on MSF included in representative open-source industry-grade AD systems in real-world driving scenarios. Our results show that the attack achieves over 90% success rate across different object types and MSF. Our attack is also found stealthy, robust to victim positions, transferable across MSF algorithms, and physical-world realizable after being 3D-printed and captured by LiDAR and camera devices. To concretely assess the end-to-end safety impact, we further perform simulation evaluation and show that it can cause a 100% vehicle collision rate for an industry-grade AD system.
    Deep Contrastive Graph Representation via Adaptive Homotopy Learning. (arXiv:2106.09244v1 [cs.CV])
    (2 min) Homotopy model is an excellent tool exploited by diverse research works in the field of machine learning. However, its flexibility is limited due to lack of adaptiveness, i.e., manual fixing or tuning the appropriate homotopy coefficients. To address the problem above, we propose a novel adaptive homotopy framework (AH) in which the Maclaurin duality is employed, such that the homotopy parameters can be adaptively obtained. Accordingly, the proposed AH can be widely utilized to enhance the homotopy-based algorithm. In particular, in this paper, we apply AH to contrastive learning (AHCL) such that it can be effectively transferred from weak-supervised learning (given label priori) to unsupervised learning, where soft labels of contrastive learning are directly and adaptively learned. Accordingly, AHCL has the adaptive ability to extract deep features without any sort of prior information. Consequently, the affinity matrix formulated by the related adaptive labels can be constructed as the deep Laplacian graph that incorporates the topology of deep representations for the inputs. Eventually, extensive experiments on benchmark datasets validate the superiority of our method.
    Deformation Driven Seq2Seq Longitudinal Tumor and Organs-at-Risk Prediction for Radiotherapy. (arXiv:2106.09076v1 [cs.CV])
    (2 min) Purpose: Radiotherapy presents unique challenges and clinical requirements for longitudinal tumor and organ-at-risk (OAR) prediction during treatment. The challenges include tumor inflammation/edema and radiation-induced changes in organ geometry, whereas the clinical requirements demand flexibility in input/output sequence timepoints to update the predictions on rolling basis and the grounding of all predictions in relationship to the pre-treatment imaging information for response and toxicity assessment in adaptive radiotherapy. Methods: To deal with the aforementioned challenges and to comply with the clinical requirements, we present a novel 3D sequence-to-sequence model based on Convolution Long Short Term Memory (ConvLSTM) that makes use of series of deformation vector fields (DVF) between individual timepoints and reference pre-treatment/planning CTs to predict future anatomical deformations and changes in gross tumor volume as well as critical OARs. High-quality DVF training data is created by employing hyper-parameter optimization on the subset of the training data with DICE coefficient and mutual information metric. We validated our model on two radiotherapy datasets: a publicly available head-and-neck dataset (28 patients with manually contoured pre-, mid-, and post-treatment CTs), and an internal non-small cell lung cancer dataset (63 patients with manually contoured planning CT and 6 weekly CBCTs). Results: The use of DVF representation and skip connections overcomes the blurring issue of ConvLSTM prediction with the traditional image representation. The mean and standard deviation of DICE for predictions of lung GTV at week 4, 5, and 6 were 0.83$\pm$0.09, 0.82$\pm$0.08, and 0.81$\pm$0.10, respectively, and for post-treatment ipsilateral and contralateral parotids, were 0.81$\pm$0.06 and 0.85$\pm$0.02.
    Long-Short Temporal Contrastive Learning of Video Transformers. (arXiv:2106.09212v1 [cs.CV])
    (2 min) Video transformers have recently emerged as a competitive alternative to 3D CNNs for video understanding. However, due to their large number of parameters and reduced inductive biases, these models require supervised pretraining on large-scale image datasets to achieve top performance. In this paper, we empirically demonstrate that self-supervised pretraining of video transformers on video-only datasets can lead to action recognition results that are on par or better than those obtained with supervised pretraining on large-scale image datasets, even massive ones such as ImageNet-21K. Since transformer-based models are effective at capturing dependencies over extended temporal spans, we propose a simple learning procedure that forces the model to match a long-term view to a short-term view of the same video. Our approach, named Long-Short Temporal Contrastive Learning (LSTCL), enables video transformers to learn an effective clip-level representation by predicting temporal context captured from a longer temporal extent. To demonstrate the generality of our findings, we implement and validate our approach under three different self-supervised contrastive learning frameworks (MoCo v3, BYOL, SimSiam) using two distinct video-transformer architectures, including an improved variant of the Swin Transformer augmented with space-time attention. We conduct a thorough ablation study and show that LSTCL achieves competitive performance on multiple video benchmarks and represents a convincing alternative to supervised image-based pretraining.
    An Evaluation of Self-Supervised Pre-Training for Skin-Lesion Analysis. (arXiv:2106.09229v1 [cs.CV])
    (2 min) Self-supervised pre-training appears as an advantageous alternative to supervised pre-trained for transfer learning. By synthesizing annotations on pretext tasks, self-supervision allows to pre-train models on large amounts of pseudo-labels before fine-tuning them on the target task. In this work, we assess self-supervision for the diagnosis of skin lesions, comparing three self-supervised pipelines to a challenging supervised baseline, on five test datasets comprising in- and out-of-distribution samples. Our results show that self-supervision is competitive both in improving accuracies and in reducing the variability of outcomes. Self-supervision proves particularly useful for low training data scenarios ($<1\,500$ and $<150$ samples), where its ability to stabilize the outcomes is essential to provide sound results.
    Insights into Data through Model Behaviour: An Explainability-driven Strategy for Data Auditing for Responsible Computer Vision Applications. (arXiv:2106.09177v1 [cs.CV])
    (2 min) In this study, we take a departure and explore an explainability-driven strategy to data auditing, where actionable insights into the data at hand are discovered through the eyes of quantitative explainability on the behaviour of a dummy model prototype when exposed to data. We demonstrate this strategy by auditing two popular medical benchmark datasets, and discover hidden data quality issues that lead deep learning models to make predictions for the wrong reasons. The actionable insights gained from this explainability driven data auditing strategy is then leveraged to address the discovered issues to enable the creation of high-performing deep learning models with appropriate prediction behaviour. The hope is that such an explainability-driven strategy can be complimentary to data-driven strategies to facilitate for more responsible development of machine learning algorithms for computer vision applications.
    LiRA: Learning Visual Speech Representations from Audio through Self-supervision. (arXiv:2106.09171v1 [cs.LG])
    (2 min) The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences 2 (LRS2) using only a fraction of the total labelled data.
    Trilateral Attention Network for Real-time Medical Image Segmentation. (arXiv:2106.09201v1 [cs.CV])
    (2 min) Accurate segmentation of medical images into anatomically meaningful regions is critical for the extraction of quantitative indices or biomarkers. The common pipeline for segmentation comprises regions of interest detection stage and segmentation stage, which are independent of each other and typically performed using separate deep learning networks. The performance of the segmentation stage highly relies on the extracted set of spatial features and the receptive fields. In this work, we propose an end-to-end network, called Trilateral Attention Network (TaNet), for real-time detection and segmentation in medical images. TaNet has a module for region localization, and three segmentation pathways: 1) handcrafted pathway with hand-designed convolutional kernels, 2) detail pathway with regular convolutional kernels, and 3) a global pathway to enlarge the receptive field. The first two pathways encode rich handcrafted and low-level features extracted by hand-designed and regular kernels while the global pathway encodes high-level context information. By jointly training the network for localization and segmentation using different sets of features, TaNet achieved superior performance, in terms of accuracy and speed, when evaluated on an echocardiography dataset for cardiac segmentation. The code and models will be made publicly available in TaNet Github page.
    Layer Folding: Neural Network Depth Reduction using Activation Linearization. (arXiv:2106.09309v1 [cs.CV])
    (2 min) Despite the increasing prevalence of deep neural networks, their applicability in resource-constrained devices is limited due to their computational load. While modern devices exhibit a high level of parallelism, real-time latency is still highly dependent on networks' depth. Although recent works show that below a certain depth, the width of shallower networks must grow exponentially, we presume that neural networks typically exceed this minimal depth to accelerate convergence and incrementally increase accuracy. This motivates us to transform pre-trained deep networks that already exploit such advantages into shallower forms. We propose a method that learns whether non-linear activations can be removed, allowing to fold consecutive linear layers into one. We apply our method to networks pre-trained on CIFAR-10 and CIFAR-100 and find that they can all be transformed into shallower forms that share a similar depth. Finally, we use our method to provide more efficient alternatives to MobileNetV2 and EfficientNet-Lite architectures on the ImageNet classification task.
    SPeCiaL: Self-Supervised Pretraining for Continual Learning. (arXiv:2106.09065v1 [cs.CV])
    (2 min) This paper presents SPeCiaL: a method for unsupervised pretraining of representations tailored for continual learning. Our approach devises a meta-learning objective that differentiates through a sequential learning process. Specifically, we train a linear model over the representations to match different augmented views of the same image together, each view presented sequentially. The linear model is then evaluated on both its ability to classify images it just saw, and also on images from previous iterations. This gives rise to representations that favor quick knowledge retention with minimal forgetting. We evaluate SPeCiaL in the Continual Few-Shot Learning setting, and show that it can match or outperform other supervised pretraining approaches.
    Automatic Main Character Recognition for Photographic Studies. (arXiv:2106.09064v1 [cs.CV])
    (2 min) Main characters in images are the most important humans that catch the viewer's attention upon first look, and they are emphasized by properties such as size, position, color saturation, and sharpness of focus. Identifying the main character in images plays an important role in traditional photographic studies and media analysis, but the task is performed manually and can be slow and laborious. Furthermore, selection of main characters can be sometimes subjective. In this paper, we analyze the feasibility of solving the main character recognition needed for photographic studies automatically and propose a method for identifying the main characters. The proposed method uses machine learning based human pose estimation along with traditional computer vision approaches for this task. We approach the task as a binary classification problem where each detected human is classified either as a main character or not. To evaluate both the subjectivity of the task and the performance of our method, we collected a dataset of 300 varying images from multiple sources and asked five people, a photographic researcher and four other persons, to annotate the main characters. Our analysis showed a relatively high agreement between different annotators. The proposed method achieved a promising F1 score of 0.83 on the full image set and 0.96 on a subset evaluated as most clear and important cases by the photographic researcher.
  • cs.IR updates on arXiv.org

    Predicting the Popularity of Reddit Posts with AI. (arXiv:2106.07380v2 [cs.LG] UPDATED)
    (2 min) Social media creates crucial mass changes, as popular posts and opinions cast a significant influence on users' decisions and thought processes. For example, the recent Reddit uprising inspired by r/wallstreetbets which had remarkable economic impact was started with a series of posts on the thread. The prediction of posts that may have a notable impact will allow for the preparation of possible following trends. This study aims to develop a machine learning model capable of accurately predicting the popularity of a Reddit post. Specifically, the model is predicting the number of upvotes a post will receive based on its textual content. I experimented with three different models: a baseline linear regression model, a random forest regression model, and a neural network. I collected Reddit post data from an online data set and analyzed the model's performance when trained on a single subreddit and a collection of subreddits. The results showed that the neural network model performed the best when the loss of the models were compared. With the use of a machine learning model to predict social trends through the reaction users have to post, a better picture of the near future can be envisioned.
    Author Clustering and Topic Estimation for Short Texts. (arXiv:2106.09533v1 [cs.IR])
    (2 min) Analysis of short text, such as social media posts, is extremely difficult because it relies on observing many document-level word co-occurrence pairs. Beyond topic distributions, a common downstream task of the modeling is grouping the authors of these documents for subsequent analyses. Traditional models estimate the document groupings and identify user clusters with an independent procedure. We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document, with user-level topic distributions. We also simultaneously cluster users, removing the need for post-hoc cluster estimation and improving topic estimation by shrinking noisy user-level topic distributions towards typical values. Our method performs as well as -- or better -- than traditional approaches to problems arising in short text, and we demonstrate its usefulness on a dataset of tweets from United States Senators, recovering both meaningful topics and clusters that reflect partisan ideology.
    A Self-supervised Method for Entity Alignment. (arXiv:2106.09395v1 [cs.CL])
    (2 min) Entity alignment, aiming to identify equivalent entities across different knowledge graphs (KGs), is a fundamental problem for constructing large-scale KGs. Over the course of its development, supervision has been considered necessary for accurate alignments. Inspired by the recent progress of self-supervised learning, we explore the extent to which we can get rid of supervision for entity alignment. Existing supervised methods for this task focus on pulling each pair of positive (labeled) entities close to each other. However, our analysis suggests that the learning of entity alignment can actually benefit more from pushing sampled (unlabeled) negatives far away than pulling positive aligned pairs close. We present SelfKG by leveraging this discovery to design a contrastive learning strategy across two KGs. Extensive experiments on benchmark datasets demonstrate that SelfKG without supervision can match or achieve comparable results with state-of-the-art supervised baselines. The performance of SelfKG demonstrates self-supervised learning offers great potential for entity alignment in KGs.
    Current Challenges and Future Directions in Podcast Information Access. (arXiv:2106.09227v1 [cs.IR])
    (2 min) Podcasts are spoken documents across a wide-range of genres and styles, with growing listenership across the world, and a rapidly lowering barrier to entry for both listeners and creators. The great strides in search and recommendation in research and industry have yet to see impact in the podcast space, where recommendations are still largely driven by word of mouth. In this perspective paper, we highlight the many differences between podcasts and other media, and discuss our perspective on challenges and future research directions in the domain of podcast information access.
    PEN4Rec: Preference Evolution Networks for Session-based Recommendation. (arXiv:2106.09306v1 [cs.IR])
    (2 min) Session-based recommendation aims to predict user the next action based on historical behaviors in an anonymous session. For better recommendations, it is vital to capture user preferences as well as their dynamics. Besides, user preferences evolve over time dynamically and each preference has its own evolving track. However, most previous works neglect the evolving trend of preferences and can be easily disturbed by the effect of preference drifting. In this paper, we propose a novel Preference Evolution Networks for session-based Recommendation (PEN4Rec) to model preference evolving process by a two-stage retrieval from historical contexts. Specifically, the first-stage process integrates relevant behaviors according to recent items. Then, the second-stage process models the preference evolving trajectory over time dynamically and infer rich preferences. The process can strengthen the effect of relevant sequential behaviors during the preference evolution and weaken the disturbance from preference drifting. Extensive experiments on three public datasets demonstrate the effectiveness and superiority of the proposed model.
    XDM: Improving Sequential Deep Matching with Unclicked User Behaviors for Recommender System. (arXiv:2010.12837v3 [cs.IR] UPDATED)
    (2 min) Deep learning-based sequential recommender systems have recently attracted increasing attention from both academia and industry. Most of industrial Embedding-Based Retrieval (EBR) system for recommendation share the similar ideas with sequential recommenders. Among them, how to comprehensively capture sequential user interest is a fundamental problem. However, most existing sequential recommendation models take as input clicked or purchased behavior sequences from user-item interactions. This leads to incomprehensive user representation and sub-optimal model performance, since they ignore the complete user behavior exposure data, i.e., items impressed yet unclicked by users. In this work, we attempt to incorporate and model those unclicked item sequences using a new learning approach in order to explore better sequential recommendation technique. An efficient triplet metric learning algorithm is proposed to appropriately learn the representation of unclicked items. Our method can be simply integrated with existing sequential recommendation models by a confidence fusion network and further gain better user representation. The offline experimental results based on real-world E-commerce data demonstrate the effectiveness and verify the importance of unclicked items in sequential recommendation. Moreover we deploy our new model (named XDM) into EBR of recommender system at Taobao, outperforming the deployed previous generation SDM.
    Understanding the Effectiveness of Reviews in E-commerce Top-N Recommendation. (arXiv:2106.09665v1 [cs.IR])
    (2 min) Modern E-commerce websites contain heterogeneous sources of information, such as numerical ratings, textual reviews and images. These information can be utilized to assist recommendation. Through textual reviews, a user explicitly express her affinity towards the item. Previous researchers found that by using the information extracted from these reviews, we can better profile the users' explicit preferences as well as the item features, leading to the improvement of recommendation performance. However, most of the previous algorithms were only utilizing the review information for explicit-feedback problem i.e. rating prediction, and when it comes to implicit-feedback ranking problem such as top-N recommendation, the usage of review information has not been fully explored. Seeing this gap, in this work, we investigate the effectiveness of textual review information for top-N recommendation under E-commerce settings. We adapt several SOTA review-based rating prediction models for top-N recommendation tasks and compare them to existing top-N recommendation models from both performance and efficiency. We find that models utilizing only review information can not achieve better performances than vanilla implicit-feedback matrix factorization method. When utilizing review information as a regularizer or auxiliary information, the performance of implicit-feedback matrix factorization method can be further improved. However, the optimal model structure to utilize textual reviews for E-commerce top-N recommendation is yet to be determined.
    Open Data and the Status Quo -- A Fine-Grained Evaluation Framework for Open Data Quality and an Analysis of Open Data portals in Germany. (arXiv:2106.09590v1 [cs.IR])
    (2 min) This paper presents a framework for assessing data and metadata quality within Open Data portals. Although a few benchmark frameworks already exist for this purpose, they are not yet detailed enough in both breadth and depth to make valid statements about the actual discoverability and accessibility of publicly available data collections. To address this research gap, we have designed a quality framework that is able to evaluate data quality in Open Data portals on dedicated and fine-grained dimensions, such as interoperability, findability, uniqueness or completeness. Additionally, we propose quality measures that allow for valid assessments regarding cross-portal findability and uniqueness of dataset descriptions. We have validated our novel quality framework for the German Open Data landscape and found out that metadata often still lacks meaningful descriptions and is not yet extensively connected to the Semantic Web.
    Embedding-based Product Retrieval in Taobao Search. (arXiv:2106.09297v1 [cs.IR])
    (2 min) Nowadays, the product search service of e-commerce platforms has become a vital shopping channel in people's life. The retrieval phase of products determines the search system's quality and gradually attracts researchers' attention. Retrieving the most relevant products from a large-scale corpus while preserving personalized user characteristics remains an open question. Recent approaches in this domain have mainly focused on embedding-based retrieval (EBR) systems. However, after a long period of practice on Taobao, we find that the performance of the EBR system is dramatically degraded due to its: (1) low relevance with a given query and (2) discrepancy between the training and inference phases. Therefore, we propose a novel and practical embedding-based product retrieval model, named Multi-Grained Deep Semantic Product Retrieval (MGDSPR). Specifically, we first identify the inconsistency between the training and inference stages, and then use the softmax cross-entropy loss as the training objective, which achieves better performance and faster convergence. Two efficient methods are further proposed to improve retrieval relevance, including smoothing noisy training data and generating relevance-improving hard negative samples without requiring extra knowledge and training procedures. We evaluate MGDSPR on Taobao Product Search with significant metrics gains observed in offline experiments and online A/B tests. MGDSPR has been successfully deployed to the existing multi-channel retrieval system in Taobao Search. We also introduce the online deployment scheme and share practical lessons of our retrieval system to contribute to the community.
    Recovery under Side Constraints. (arXiv:2106.09375v1 [cs.IR])
    (2 min) This paper addresses sparse signal reconstruction under various types of structural side constraints with applications in multi-antenna systems. Side constraints may result from prior information on the measurement system and the sparse signal structure. They may involve the structure of the sensing matrix, the structure of the non-zero support values, the temporal structure of the sparse representationvector, and the nonlinear measurement structure. First, we demonstrate how a priori information in form of structural side constraints influence recovery guarantees (null space properties) using L1-minimization. Furthermore, for constant modulus signals, signals with row-, block- and rank-sparsity, as well as non-circular signals, we illustrate how structural prior information can be used to devise efficient algorithms with improved recovery performance and reduced computational complexity. Finally, we address the measurement system design for linear and nonlinear measurements of sparse signals. Moreover, we discuss the linear mixing matrix design based on coherence minimization. Then we extend our focus to nonlinear measurement systems where we design parallel optimization algorithms to efficiently compute stationary points in the sparse phase retrieval problem with and without dictionary learning.
  • cs.LG updates on arXiv.org

    Learning to Shape Rewards using a Game of Switching Controls. (arXiv:2103.09159v2 [cs.LG] UPDATED)
    (2 min) Reward shaping (RS) is a powerful method in reinforcement learning (RL) for overcoming the problem of sparse or uninformative rewards. However, RS typically relies on manually engineered shaping-reward functions whose construction is time-consuming and error-prone. It also requires domain knowledge which runs contrary to the goal of autonomous learning. We introduce Reinforcement Learning Optimal Shaping Algorithm (ROSA), an automated RS framework in which the shaping-reward function is constructed in a novel Markov game between two agents. A reward-shaping agent (Shaper) uses switching controls to determine which states to add shaping rewards and their optimal values while the other agent (Controller) learns the optimal policy for the task using these shaped rewards. We prove that ROSA, which easily adopts existing RL algorithms, learns to construct a shaping-reward function that is tailored to the task thus ensuring efficient convergence to high performance policies. We demonstrate ROSA's congenial properties in three carefully designed experiments and show its superior performance against state-of-the-art RS algorithms in challenging sparse reward environments.
    Disentangling Online Chats with DAG-Structured LSTMs. (arXiv:2106.09024v1 [cs.CL])
    (2 min) Many modern messaging systems allow fast and synchronous textual communication among many users. The resulting sequence of messages hides a more complicated structure in which independent sub-conversations are interwoven with one another. This poses a challenge for any task aiming to understand the content of the chat logs or gather information from them. The ability to disentangle these conversations is then tantamount to the success of many downstream tasks such as summarization and question answering. Structured information accompanying the text such as user turn, user mentions, timestamps, is used as a cue by the participants themselves who need to follow the conversation and has been shown to be important for disentanglement. DAG-LSTMs, a generalization of Tree-LSTMs that can handle directed acyclic dependencies, are a natural way to incorporate such information and its non-sequential nature. In this paper, we apply DAG-LSTMs to the conversation disentanglement task. We perform our experiments on the Ubuntu IRC dataset. We show that the novel model we propose achieves state of the art status on the task of recovering reply-to relations and it is competitive on other disentanglement metrics.
    Statistical Learning Guarantees for Compressive Clustering and Compressive Mixture Modeling. (arXiv:2004.08085v2 [cs.LG] UPDATED)
    (2 min) We provide statistical learning guarantees for two unsupervised learning tasks in the context of compressive statistical learning, a general framework for resource-efficient large-scale learning that we introduced in a companion paper. The principle of compressive statistical learning is to compress a training collection, in one pass, into a low-dimensional sketch (a vector of random empirical generalized moments) that captures the information relevant to the considered learning task. We explicit random feature functions which empirical averages preserve the needed information for compressive clustering and compressive Gaussian mixture modeling with fixed known variance, and establish sufficient sketch sizes given the problem dimensions.
    BF++: a language for general-purpose program synthesis. (arXiv:2101.09571v4 [cs.AI] UPDATED)
    (2 min) Most state of the art decision systems based on Reinforcement Learning (RL) are data-driven black-box neural models, where it is often difficult to incorporate expert knowledge into the models or let experts review and validate the learned decision mechanisms. Knowledge-insertion and model review are important requirements in many applications involving human health and safety. One way to bridge the gap between data and knowledge driven systems is program synthesis: replacing a neural network that outputs decisions with a symbolic program generated by a neural network or by means of genetic programming. We propose a new programming language, BF++, designed specifically for automatic programming of agents in a Partially Observable Markov Decision Process (POMDP) setting and apply neural program synthesis to solve standard OpenAI Gym benchmarks.
    On the Validity of Modeling SGD with Stochastic Differential Equations (SDEs). (arXiv:2102.12470v2 [cs.LG] UPDATED)
    (2 min) It is generally recognized that finite learning rate (LR), in contrast to infinitesimal LR, is important for good generalization in real-life deep nets. Most attempted explanations propose approximating finite-LR SGD with Ito Stochastic Differential Equations (SDEs), but formal justification for this approximation (e.g., (Li et al., 2019)) only applies to SGD with tiny LR. Experimental verification of the approximation appears computationally infeasible. The current paper clarifies the picture with the following contributions: (a) An efficient simulation algorithm SVAG that provably converges to the conventionally used Ito SDE approximation. (b) A theoretically motivated testable necessary condition for the SDE approximation and its most famous implication, the linear scaling rule (Goyal et al., 2017), to hold. (c) Experiments using this simulation to demonstrate that the previously proposed SDE approximation can meaningfully capture the training and generalization properties of common deep nets.
    Multi-head or Single-head? An Empirical Comparison for Transformer Training. (arXiv:2106.09650v1 [cs.CL])
    (2 min) Multi-head attention plays a crucial role in the recent success of Transformer models, which leads to consistent performance improvements over conventional attention in various applications. The popular belief is that this effectiveness stems from the ability of jointly attending multiple positions. In this paper, we first demonstrate that jointly attending multiple positions is not a unique feature of multi-head attention, as multi-layer single-head attention also attends multiple positions and is more effective. Then, we suggest the main advantage of the multi-head attention is the training stability, since it has less number of layers than the single-head attention, when attending the same number of positions. For example, 24-layer 16-head Transformer (BERT-large) and 384-layer single-head Transformer has the same total attention head number and roughly the same model size, while the multi-head one is significantly shallower. Meanwhile, we show that, with recent advances in deep learning, we can successfully stabilize the training of the 384-layer Transformer. As the training difficulty is no longer a bottleneck, substantially deeper single-head Transformer achieves consistent performance improvements without tuning hyper-parameters.
    Machine learning for complete intersection Calabi-Yau manifolds: a methodological study. (arXiv:2007.15706v2 [hep-th] UPDATED)
    (2 min) We revisit the question of predicting both Hodge numbers $h^{1,1}$ and $h^{2,1}$ of complete intersection Calabi-Yau (CICY) 3-folds using machine learning (ML), considering both the old and new datasets built respectively by Candelas-Dale-Lutken-Schimmrigk / Green-H\"ubsch-Lutken and by Anderson-Gao-Gray-Lee. In real world applications, implementing a ML system rarely reduces to feed the brute data to the algorithm. Instead, the typical workflow starts with an exploratory data analysis (EDA) which aims at understanding better the input data and finding an optimal representation. It is followed by the design of a validation procedure and a baseline model. Finally, several ML models are compared and combined, often involving neural networks with a topology more complicated than the sequential models typically used in physics. By following this procedure, we improve the accuracy of ML computations for Hodge numbers with respect to the existing literature. First, we obtain 97% (resp. 99%) accuracy for $h^{1,1}$ using a neural network inspired by the Inception model for the old dataset, using only 30% (resp. 70%) of the data for training. For the new one, a simple linear regression leads to almost 100% accuracy with 30% of the data for training. The computation of $h^{2,1}$ is less successful as we manage to reach only 50% accuracy for both datasets, but this is still better than the 16% obtained with a simple neural network (SVM with Gaussian kernel and feature engineering and sequential convolutional network reach at best 36%). This serves as a proof of concept that neural networks can be valuable to study the properties of geometries appearing in string theory.
    Class2Simi: A Noise Reduction Perspective on Learning with Noisy Labels. (arXiv:2006.07831v2 [cs.LG] UPDATED)
    (2 min) Learning with noisy labels has attracted a lot of attention in recent years, where the mainstream approaches are in pointwise manners. Meanwhile, pairwise manners have shown great potential in supervised metric learning and unsupervised contrastive learning. Thus, a natural question is raised: does learning in a pairwise manner mitigate label noise? To give an affirmative answer, in this paper, we propose a framework called Class2Simi: it transforms data points with noisy class labels to data pairs with noisy similarity labels, where a similarity label denotes whether a pair shares the class label or not. Through this transformation, the reduction of the noise rate is theoretically guaranteed, and hence it is in principle easier to handle noisy similarity labels. Amazingly, DNNs that predict the clean class labels can be trained from noisy data pairs if they are first pretrained from noisy data points. Class2Simi is computationally efficient because not only this transformation is on-the-fly in mini-batches, but also it just changes loss computation on top of model prediction into a pairwise manner. Its effectiveness is verified by extensive experiments.
    DMN4: Few-shot Learning via Discriminative Mutual Nearest Neighbor Neural Network. (arXiv:2103.08160v2 [cs.CV] UPDATED)
    (2 min) Few-shot learning (FSL) aims to classify images under low-data regimes, where the conventional pooled global representation is likely to lose useful local characteristics. Recent work has achieved promising performances by using deep descriptors. They generally take all deep descriptors from neural networks into consideration while ignoring that some of them are useless in classification due to their limited receptive field, e.g., task-irrelevant descriptors could be misleading and multiple aggregative descriptors from background clutter could even overwhelm the object's presence. In this paper, we argue that a Mutual Nearest Neighbor (MNN) relation should be established to explicitly select the query descriptors that are most relevant to each task and discard less relevant ones from aggregative clutters in FSL. Specifically, we propose Discriminative Mutual Nearest Neighbor Neural Network (DMN4) for FSL. Extensive experiments demonstrate that our method not only qualitatively selects task-relevant descriptors but also quantitatively outperforms the existing state-of-the-arts by a large margin of 1.8~4.9% on fine-grained CUB, a considerable margin of 1.4~2.2% on both supervised and semi-supervised miniImagenet, and ~1.4% on challenging tieredimagenet.
    SECANT: Self-Expert Cloning for Zero-Shot Generalization of Visual Policies. (arXiv:2106.09678v1 [cs.LG])
    (2 min) Generalization has been a long-standing challenge for reinforcement learning (RL). Visual RL, in particular, can be easily distracted by irrelevant factors in high-dimensional observation space. In this work, we consider robust policy learning which targets zero-shot generalization to unseen visual environments with large distributional shift. We propose SECANT, a novel self-expert cloning technique that leverages image augmentation in two stages to decouple robust representation learning from policy optimization. Specifically, an expert policy is first trained by RL from scratch with weak augmentations. A student network then learns to mimic the expert policy by supervised learning with strong augmentations, making its representation more robust against visual variations compared to the expert. Extensive experiments demonstrate that SECANT significantly advances the state of the art in zero-shot generalization across 4 challenging domains. Our average reward improvements over prior SOTAs are: DeepMind Control (+26.5%), robotic manipulation (+337.8%), vision-based autonomous driving (+47.7%), and indoor object navigation (+15.8%). Code release and video are available at https://linxifan.github.io/secant-site/.
    Safe Reinforcement Learning Using Advantage-Based Intervention. (arXiv:2106.09110v1 [cs.LG])
    (2 min) Many sequential decision problems involve finding a policy that maximizes total reward while obeying safety constraints. Although much recent research has focused on the development of safe reinforcement learning (RL) algorithms that produce a safe policy after training, ensuring safety during training as well remains an open problem. A fundamental challenge is performing exploration while still satisfying constraints in an unknown Markov decision process (MDP). In this work, we address this problem for the chance-constrained setting. We propose a new algorithm, SAILR, that uses an intervention mechanism based on advantage functions to keep the agent safe throughout training and optimizes the agent's policy using off-the-shelf RL algorithms designed for unconstrained MDPs. Our method comes with strong guarantees on safety during both training and deployment (i.e., after training and without the intervention mechanism) and policy performance compared to the optimal safety-constrained policy. In our experiments, we show that SAILR violates constraints far less during training than standard safe RL and constrained MDP approaches and converges to a well-performing policy that can be deployed safely without intervention. Our code is available at https://github.com/nolanwagener/safe_rl.
    Scaling-up Diverse Orthogonal Convolutional Networks with a Paraunitary Framework. (arXiv:2106.09121v1 [cs.LG])
    (2 min) Enforcing orthogonality in neural networks is an antidote for gradient vanishing/exploding problems, sensitivity by adversarial perturbation, and bounding generalization errors. However, many previous approaches are heuristic, and the orthogonality of convolutional layers is not systematically studied: some of these designs are not exactly orthogonal, while others only consider standard convolutional layers and propose specific classes of their realizations. To address this problem, we propose a theoretical framework for orthogonal convolutional layers, which establishes the equivalence between various orthogonal convolutional layers in the spatial domain and the paraunitary systems in the spectral domain. Since there exists a complete spectral factorization of paraunitary systems, any orthogonal convolution layer can be parameterized as convolutions of spatial filters. Our framework endows high expressive power to various convolutional layers while maintaining their exact orthogonality. Furthermore, our layers are memory and computationally efficient for deep networks compared to previous designs. Our versatile framework, for the first time, enables the study of architecture designs for deep orthogonal networks, such as choices of skip connection, initialization, stride, and dilation. Consequently, we scale up orthogonal networks to deep architectures, including ResNet, WideResNet, and ShuffleNet, substantially increasing the performance over the traditional shallow orthogonal networks.
    Behavioral Priors and Dynamics Models: Improving Performance and Domain Transfer in Offline RL. (arXiv:2106.09119v1 [cs.LG])
    (2 min) Offline Reinforcement Learning (RL) aims to extract near-optimal policies from imperfect offline data without additional environment interactions. Extracting policies from diverse offline datasets has the potential to expand the range of applicability of RL by making the training process safer, faster, and more streamlined. We investigate how to improve the performance of offline RL algorithms, its robustness to the quality of offline data, as well as its generalization capabilities. To this end, we introduce Offline Model-based RL with Adaptive Behavioral Priors (MABE). Our algorithm is based on the finding that dynamics models, which support within-domain generalization, and behavioral priors, which support cross-domain generalization, are complementary. When combined together, they substantially improve the performance and generalization of offline RL policies. In the widely studied D4RL offline RL benchmark, we find that MABE achieves higher average performance compared to prior model-free and model-based algorithms. In experiments that require cross-domain generalization, we find that MABE outperforms prior methods. Our website is available at https://sites.google.com/berkeley.edu/mabe .
    Hyperparameter Optimization via Sequential Uniform Designs. (arXiv:2009.03586v2 [cs.LG] UPDATED)
    (2 min) Hyperparameter optimization (HPO) plays a central role in the automated machine learning (AutoML). It is a challenging task as the response surfaces of hyperparameters are generally unknown, hence essentially a global optimization problem. This paper reformulates HPO as a computer experiment and proposes a novel sequential uniform design (SeqUD) strategy with three-fold advantages: a) the hyperparameter space is adaptively explored with evenly spread design points, without the need of expensive meta-modeling and acquisition optimization; b) the batch-by-batch design points are sequentially generated with parallel processing support; c) a new augmented uniform design algorithm is developed for the efficient real-time generation of follow-up design points. Extensive experiments are conducted on both global optimization tasks and HPO applications. The numerical results show that the proposed SeqUD strategy outperforms benchmark HPO methods, and it can be therefore a promising and competitive alternative to existing AutoML tools.
    Trainable Discrete Feature Embeddings for Variational Quantum Classifier. (arXiv:2106.09415v1 [quant-ph])
    (2 min) Quantum classifiers provide sophisticated embeddings of input data in Hilbert space promising quantum advantage. The advantage stems from quantum feature maps encoding the inputs into quantum states with variational quantum circuits. A recent work shows how to map discrete features with fewer quantum bits using Quantum Random Access Coding (QRAC), an important primitive to encode binary strings into quantum states. We propose a new method to embed discrete features with trainable quantum circuits by combining QRAC and a recently proposed strategy for training quantum feature map called quantum metric learning. We show that the proposed trainable embedding requires not only as few qubits as QRAC but also overcomes the limitations of QRAC to classify inputs whose classes are based on hard Boolean functions. We numerically demonstrate its use in variational quantum classifiers to achieve better performances in classifying real-world datasets, and thus its possibility to leverage near-term quantum computers for quantum machine learning.
    Machine Learning for Postprocessing Ensemble Streamflow Forecasts. (arXiv:2106.09547v1 [cs.LG])
    (2 min) Skillful streamflow forecasting informs decisions in various areas of water policy and management. We integrate dynamical modeling with machine learning to demonstrate the enhanced quality of streamflow forecasts at short-to medium-range timescales (1 - 7 days). Dynamical modeling generates ensemble streamflow forecasts by forcing a hydrological model with numerical weather prediction model outputs. We employ a Long Short-Term Memory (LSTM) neural network to correct forecast biases in raw ensemble streamflow forecasts obtained from dynamical modeling. For forecast verification, we use different metrics such as skill score and reliability diagram conditioned upon the lead time, flow threshold, and season. The verification results show that the LSTM can improve streamflow forecasts relative to climatological, temporal persistence, deterministic, and raw ensemble forecasts. The LSTM demonstrates improvement across all lead times, flow thresholds, and seasons. As compared to the raw ensembles, relative gain in forecast skill from LSTM is generally higher at medium-range timescales compared to initial lead time; high flows compared to low-moderate flows; and warm-season compared to the cool ones. Overall, our results highlight the benefits of LSTM for improving both the skill and reliability of streamflow forecasts.
    Machine-learning enhanced dark soliton detection in Bose-Einstein condensates. (arXiv:2101.05404v2 [cond-mat.quant-gas] UPDATED)
    (2 min) Most data in cold-atom experiments comes from images, the analysis of which is limited by our preconceptions of the patterns that could be present in the data. We focus on the well-defined case of detecting dark solitons -- appearing as local density depletions in a Bose-Einstein condensate (BEC) -- using a methodology that is extensible to the general task of pattern recognition in images of cold atoms. Studying soliton dynamics over a wide range of parameters requires the analysis of large datasets, making the existing human-inspection-based methodology a significant bottleneck. Here we describe an automated classification and positioning system for identifying localized excitations in atomic BECs utilizing deep convolutional neural networks to eliminate the need for human image examination. Furthermore, we openly publish our labeled dataset of dark solitons, the first of its kind, for further machine learning research.
    Incremental Without Replacement Sampling in Nonconvex Optimization. (arXiv:2007.07557v3 [cs.LG] UPDATED)
    (2 min) Minibatch decomposition methods for empirical risk minimization are commonly analysed in a stochastic approximation setting, also known as sampling with replacement. On the other hands modern implementations of such techniques are incremental: they rely on sampling without replacement, for which available analysis are much scarcer. We provide convergence guaranties for the latter variant by analysing a versatile incremental gradient scheme. For this scheme, we consider constant, decreasing or adaptive step sizes. In the smooth setting we obtain explicit complexity estimates in terms of epoch counter. In the nonsmooth setting we prove that the sequence is attracted by solutions of optimality conditions of the problem.
    On a Sparse Shortcut Topology of Artificial Neural Networks. (arXiv:1811.09003v4 [cs.LG] UPDATED)
    (2 min) Over recent years, deep learning has become the mainstream data-driven approach to solve many important real-world problems. In the successful network architectures, shortcut connections are well established to take the outputs of earlier layers as additional inputs to later layers, which have produced excellent results. Despite the extraordinary effectiveness of shortcuts, there remain important questions on the underlying mechanism and associated functionalities. For example, why are shortcuts powerful? Why shortcuts generalize well? To address these questions, we investigate the representation and generalization ability of a sparse shortcut topology. Specifically, we first demonstrate that this topology can empower a one-neuron-wide deep network to approximate any univariate continuous function. Then, we present a novel width-bounded universal approximator in contrast to depth-bounded universal approximators, and also extend the approximation result to a family of networks such that in the view of approximation ability, these networks are equally competent. Furthermore, we use the generalization bound theory to show that the investigated shortcut topology enjoys an excellent generalizability. Finally, we corroborate our theoretical analyses with experiments on some well-known benchmarks.
    Improving adversarial robustness of deep neural networks by using semantic information. (arXiv:2008.07838v2 [cs.LG] UPDATED)
    (2 min) The vulnerability of deep neural networks (DNNs) to adversarial attack, which is an attack that can mislead state-of-the-art classifiers into making an incorrect classification with high confidence by deliberately perturbing the original inputs, raises concerns about the robustness of DNNs to such attacks. Adversarial training, which is the main heuristic method for improving adversarial robustness and the first line of defense against adversarial attacks, requires many sample-by-sample calculations to increase training size and is usually insufficiently strong for an entire network. This paper provides a new perspective on the issue of adversarial robustness, one that shifts the focus from the network as a whole to the critical part of the region close to the decision boundary corresponding to a given class. From this perspective, we propose a method to generate a single but image-agnostic adversarial perturbation that carries the semantic information implying the directions to the fragile parts on the decision boundary and causes inputs to be misclassified as a specified target. We call the adversarial training based on such perturbations "region adversarial training" (RAT), which resembles classical adversarial training but is distinguished in that it reinforces the semantic information missing in the relevant regions. Experimental results on the MNIST and CIFAR-10 datasets show that this approach greatly improves adversarial robustness even using a very small dataset from the training data; moreover, it can defend against FGSM adversarial attacks that have a completely different pattern from the model seen during retraining.
    FedV: Privacy-Preserving Federated Learning over Vertically Partitioned Data. (arXiv:2103.03918v2 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) has been proposed to allow collaborative training of machine learning (ML) models among multiple parties where each party can keep its data private. In this paradigm, only model updates, such as model weights or gradients, are shared. Many existing approaches have focused on horizontal FL, where each party has the entire feature set and labels in the training data set. However, many real scenarios follow a vertically-partitioned FL setup, where a complete feature set is formed only when all the datasets from the parties are combined, and the labels are only available to a single party. Privacy-preserving vertical FL is challenging because complete sets of labels and features are not owned by one entity. Existing approaches for vertical FL require multiple peer-to-peer communications among parties, leading to lengthy training times, and are restricted to (approximated) linear models and just two parties. To close this gap, we propose FedV, a framework for secure gradient computation in vertical settings for several widely used ML models such as linear models, logistic regression, and support vector machines. FedV removes the need for peer-to-peer communication among parties by using functional encryption schemes; this allows FedV to achieve faster training times. It also works for larger and changing sets of parties. We empirically demonstrate the applicability for multiple types of ML models and show a reduction of 10%-70% of training time and 80% to 90% in data transfer with respect to the state-of-the-art approaches.
    Finite-Sample Analysis of Stochastic Approximation Using Smooth Convex Envelopes. (arXiv:2002.00874v5 [cs.LG] UPDATED)
    (2 min) Stochastic Approximation (SA) is a popular approach for solving fixed-point equations where the information is corrupted by noise. In this paper, we consider an SA involving a contraction mapping with respect to an arbitrary norm, and show its finite-sample error bounds while using different stepsizes. The idea is to construct a smooth Lyapunov function using the generalized Moreau envelope, and show that the iterates of SA have negative drift with respect to that Lyapunov function. Our result is applicable in Reinforcement Learning (RL). In particular, we use it to establish the first-known convergence rate of the V-trace algorithm for off-policy TD-learning. Moreover, we also use it to study TD-learning in the on-policy setting, and recover the existing state-of-the-art results for $Q$-learning. Importantly, our construction results in only a logarithmic dependence of the convergence bound on the size of the state-space.
    Protecting gender and identity with disentangled speech representations. (arXiv:2104.11051v2 [cs.SD] UPDATED)
    (2 min) Besides its linguistic content, our speech is rich in biometric information that can be inferred by classifiers. Learning privacy-preserving representations for speech signals enables downstream tasks without sharing unnecessary, private information about an individual. In this paper, we show that protecting gender information in speech is more effective than modelling speaker-identity information only when generating a non-sensitive representation of speech. Our method relies on reconstructing speech by decoding linguistic content along with gender information using a variational autoencoder. Specifically, we exploit disentangled representation learning to encode information about different attributes into separate subspaces that can be factorised independently. We present a novel way to encode gender information and disentangle two sensitive biometric identifiers, namely gender and identity, in a privacy-protecting setting. Experiments on the LibriSpeech dataset show that gender recognition and speaker verification can be reduced to a random guess, protecting against classification-based attacks.
    Adaptive Low-Rank Regularization with Damping Sequences to Restrict Lazy Weights in Deep Networks. (arXiv:2106.09677v1 [cs.LG])
    (2 min) Overfitting is one of the critical problems in deep neural networks. Many regularization schemes try to prevent overfitting blindly. However, they decrease the convergence speed of training algorithms. Adaptive regularization schemes can solve overfitting more intelligently. They usually do not affect the entire network weights. This paper detects a subset of the weighting layers that cause overfitting. The overfitting recognizes by matrix and tensor condition numbers. An adaptive regularization scheme entitled Adaptive Low-Rank (ALR) is proposed that converges a subset of the weighting layers to their Low-Rank Factorization (LRF). It happens by minimizing a new Tikhonov-based loss function. ALR also encourages lazy weights to contribute to the regularization when epochs grow up. It uses a damping sequence to increment layer selection likelihood in the last generations. Thus before falling the training accuracy, ALR reduces the lazy weights and regularizes the network substantially. The experimental results show that ALR regularizes the deep networks well with high training speed and low resource usage.
    Hi-Phy: A Benchmark for Hierarchical Physical Reasoning. (arXiv:2106.09692v1 [cs.AI])
    (2 min) Reasoning about the behaviour of physical objects is a key capability of agents operating in physical worlds. Humans are very experienced in physical reasoning while it remains a major challenge for AI. To facilitate research addressing this problem, several benchmarks have been proposed recently. However, these benchmarks do not enable us to measure an agent's granular physical reasoning capabilities when solving a complex reasoning task. In this paper, we propose a new benchmark for physical reasoning that allows us to test individual physical reasoning capabilities. Inspired by how humans acquire these capabilities, we propose a general hierarchy of physical reasoning capabilities with increasing complexity. Our benchmark tests capabilities according to this hierarchy through generated physical reasoning tasks in the video game Angry Birds. This benchmark enables us to conduct a comprehensive agent evaluation by measuring the agent's granular physical reasoning capabilities. We conduct an evaluation with human players, learning agents, and heuristic agents and determine their capabilities. Our evaluation shows that learning agents, with good local generalization ability, still struggle to learn the underlying physical reasoning capabilities and perform worse than current state-of-the-art heuristic agents and humans. We believe that this benchmark will encourage researchers to develop intelligent agents with advanced, human-like physical reasoning capabilities. URL: https://github.com/Cheng-Xue/Hi-Phy
    On the training of sparse and dense deep neural networks: less parameters, same performance. (arXiv:2106.09021v1 [cs.LG])
    (2 min) Deep neural networks can be trained in reciprocal space, by acting on the eigenvalues and eigenvectors of suitable transfer operators in direct space. Adjusting the eigenvalues, while freezing the eigenvectors, yields a substantial compression of the parameter space. This latter scales by definition with the number of computing neurons. The classification scores, as measured by the displayed accuracy, are however inferior to those attained when the learning is carried in direct space, for an identical architecture and by employing the full set of trainable parameters (with a quadratic dependence on the size of neighbor layers). In this Letter, we propose a variant of the spectral learning method as appeared in Giambagli et al {Nat. Comm.} 2021, which leverages on two sets of eigenvalues, for each mapping between adjacent layers. The eigenvalues act as veritable knobs which can be freely tuned so as to (i) enhance, or alternatively silence, the contribution of the input nodes, (ii) modulate the excitability of the receiving nodes with a mechanism which we interpret as the artificial analogue of the homeostatic plasticity. The number of trainable parameters is still a linear function of the network size, but the performances of the trained device gets much closer to those obtained via conventional algorithms, these latter requiring however a considerably heavier computational cost. The residual gap between conventional and spectral trainings can be eventually filled by employing a suitable decomposition for the non trivial block of the eigenvectors matrix. Each spectral parameter reflects back on the whole set of inter-nodes weights, an attribute which we shall effectively exploit to yield sparse networks with stunning classification abilities, as compared to their homologues trained with conventional means.
    Class Balancing GAN with a Classifier in the Loop. (arXiv:2106.09402v1 [cs.LG])
    (2 min) Generative Adversarial Networks (GANs) have swiftly evolved to imitate increasingly complex image distributions. However, majority of the developments focus on performance of GANs on balanced datasets. We find that the existing GANs and their training regimes which work well on balanced datasets fail to be effective in case of imbalanced (i.e. long-tailed) datasets. In this work we introduce a novel theoretically motivated Class Balancing regularizer for training GANs. Our regularizer makes use of the knowledge from a pre-trained classifier to ensure balanced learning of all the classes in the dataset. This is achieved via modelling the effective class frequency based on the exponential forgetting observed in neural networks and encouraging the GAN to focus on underrepresented classes. We demonstrate the utility of our regularizer in learning representations for long-tailed distributions via achieving better performance than existing approaches over multiple datasets. Specifically, when applied to an unconditional GAN, it improves the FID from $13.03$ to $9.01$ on the long-tailed iNaturalist-$2019$ dataset.
    NU-Wave: A Diffusion Probabilistic Model for Neural Audio Upsampling. (arXiv:2104.02321v2 [eess.AS] UPDATED)
    (2 min) In this work, we introduce NU-Wave, the first neural audio upsampling model to produce waveforms of sampling rate 48kHz from coarse 16kHz or 24kHz inputs, while prior works could generate only up to 16kHz. NU-Wave is the first diffusion probabilistic model for audio super-resolution which is engineered based on neural vocoders. NU-Wave generates high-quality audio that achieves high performance in terms of signal-to-noise ratio (SNR), log-spectral distance (LSD), and accuracy of the ABX test. In all cases, NU-Wave outperforms the baseline models despite the substantially smaller model capacity (3.0M parameters) than baselines (5.4-21%). The audio samples of our model are available at https://mindslab-ai.github.io/nuwave, and the code will be made available soon.
    Autobots: Latent Variable Sequential Set Transformers. (arXiv:2104.00563v2 [cs.RO] UPDATED)
    (2 min) Robust multi-agent trajectory prediction is essential for the safe control of robots and vehicles that interact with humans. Many existing methods treat social and temporal information separately and therefore fall short of modelling the joint future trajectories of all agents in a socially consistent way. To address this, we propose a new class of Latent Variable Sequential Set Transformers which autoregressively model multi-agent trajectories. We refer to these architectures as "AutoBots". AutoBots model the contents of sets (e.g. representing the properties of agents in a scene) over time and employ multi-head self-attention blocks over these sequences of sets to encode the sociotemporal relationships between the different actors of a scene. This produces either the trajectory of one ego-agent or a distribution over the future trajectories for all agents under consideration. Our approach works for general sequences of sets and we provide illustrative experiments modelling the sequential structure of the multiple strokes that make up symbols in the Omniglot data. For the single-agent prediction case, we validate our model on the NuScenes motion prediction task and achieve competitive results on the global leaderboard. In the multi-agent forecasting setting, we validate our model on TrajNet. We find that our method outperforms physical extrapolation and recurrent network baselines and generates scene-consistent trajectories.
    Rotation Invariant Graph Neural Networks using Spin Convolutions. (arXiv:2106.09575v1 [cs.LG])
    (2 min) Progress towards the energy breakthroughs needed to combat climate change can be significantly accelerated through the efficient simulation of atomic systems. Simulation techniques based on first principles, such as Density Functional Theory (DFT), are limited in their practical use due to their high computational expense. Machine learning approaches have the potential to approximate DFT in a computationally efficient manner, which could dramatically increase the impact of computational simulations on real-world problems. Approximating DFT poses several challenges. These include accurately modeling the subtle changes in the relative positions and angles between atoms, and enforcing constraints such as rotation invariance or energy conservation. We introduce a novel approach to modeling angular information between sets of neighboring atoms in a graph neural network. Rotation invariance is achieved for the network's edge messages through the use of a per-edge local coordinate frame and a novel spin convolution over the remaining degree of freedom. Two model variants are proposed for the applications of structure relaxation and molecular dynamics. State-of-the-art results are demonstrated on the large-scale Open Catalyst 2020 dataset. Comparisons are also performed on the MD17 and QM9 datasets.
    Latent Correlation-Based Multiview Learning and Self-Supervision: A Unifying Perspective. (arXiv:2106.07115v2 [cs.LG] UPDATED)
    (2 min) Multiple views of data, both naturally acquired (e.g., image and audio) and artificially produced (e.g., via adding different noise to data samples), have proven useful in enhancing representation learning. Natural views are often handled by multiview analysis tools, e.g., (deep) canonical correlation analysis [(D)CCA], while the artificial ones are frequently used in self-supervised learning (SSL) paradigms, e.g., SimCLR and Barlow Twins. Both types of approaches often involve learning neural feature extractors such that the embeddings of data exhibit high cross-view correlations. Although intuitive, the effectiveness of correlation-based neural embedding is only empirically validated. This work puts forth a theory-backed framework for unsupervised multiview learning. Our development starts with proposing a multiview model, where each view is a nonlinear mixture of shared and private components. Consequently, the learning problem boils down to shared/private component identification and disentanglement. Under this model, latent correlation maximization is shown to guarantee the extraction of the shared components across views (up to certain ambiguities). In addition, the private information in each view can be provably disentangled from the shared using proper regularization design. The method is tested on a series of tasks, e.g., downstream clustering, which all show promising performance. Our development also provides a unifying perspective for understanding various DCCA and SSL schemes.
    Visualising Deep Network's Time-Series Representations. (arXiv:2103.07176v2 [cs.LG] UPDATED)
    (2 min) Despite the popularisation of machine learning models, more often than not, they still operate as black boxes with no insight into what is happening inside the model. There exist a few methods that allow to visualise and explain why a model has made a certain prediction. Those methods, however, allow visualisation of the link between the input and output of the model without presenting how the model learns to represent the data used to train the model as whole. In this paper, a method that addresses that issue is proposed, with a focus on visualising multi-dimensional time-series data. Experiments on a high-frequency stock market dataset show that the method provides fast and discernible visualisations. Large datasets can be visualised quickly and on one plot, which makes it easy for a user to compare the learned representations of the data. The developed method successfully combines known techniques to provide an insight into the inner workings of time-series classification models.
    FORMS: Fine-grained Polarized ReRAM-based In-situ Computation for Mixed-signal DNN Accelerator. (arXiv:2106.09144v1 [cs.AR])
    (3 min) Recent works demonstrated the promise of using resistive random access memory (ReRAM) as an emerging technology to perform inherently parallel analog domain in-situ matrix-vector multiplication -- the intensive and key computation in DNNs. With weights stored in the ReRAM crossbar cells as conductance, when the input vector is applied to word lines, the matrix-vector multiplication results can be generated as the current in bit lines. A key problem is that the weight can be either positive or negative, but the in-situ computation assumes all cells on each crossbar column with the same sign. The current architectures either use two ReRAM crossbars for positive and negative weights, or add an offset to weights so that all values become positive. Neither solution is ideal: they either double the cost of crossbars, or incur extra offset circuity. To better solve this problem, this paper proposes FORMS, a fine-grained ReRAM-based DNN accelerator with polarized weights. Instead of trying to represent the positive/negative weights, our key design principle is to enforce exactly what is assumed in the in-situ computation -- ensuring that all weights in the same column of a crossbar have the same sign. It naturally avoids the cost of an additional crossbar. Such weights can be nicely generated using alternating direction method of multipliers (ADMM) regularized optimization, which can exactly enforce certain patterns in DNN weights. To achieve high accuracy, we propose to use fine-grained sub-array columns, which provide a unique opportunity for input zero-skipping, significantly avoiding unnecessary computations. It also makes the hardware much easier to implement. Putting all together, with the same optimized models, FORMS achieves significant throughput improvement and speed up in frame per second over ISAAC with similar area cost.
    Knowledge Distillation and Student-Teacher Learning for Visual Intelligence: A Review and New Outlooks. (arXiv:2004.05937v7 [cs.CV] UPDATED)
    (3 min) Deep neural models in recent years have been successful in almost every field, including extremely complex problem statements. However, these models are huge in size, with millions (and even billions) of parameters, thus demanding more heavy computation power and failing to be deployed on edge devices. Besides, the performance boost is highly dependent on redundant labeled data. To achieve faster speeds and to handle the problems caused by the lack of data, knowledge distillation (KD) has been proposed to transfer information learned from one model to another. KD is often characterized by the so-called `Student-Teacher' (S-T) learning framework and has been broadly applied in model compression and knowledge transfer. This paper is about KD and S-T learning, which are being actively studied in recent years. First, we aim to provide explanations of what KD is and how/why it works. Then, we provide a comprehensive survey on the recent progress of KD methods together with S-T frameworks typically for vision tasks. In general, we consider some fundamental questions that have been driving this research area and thoroughly generalize the research progress and technical details. Additionally, we systematically analyze the research status of KD in vision applications. Finally, we discuss the potentials and open challenges of existing methods and prospect the future directions of KD and S-T learning.
    Learning Proposals for Probabilistic Programs with Inference Combinators. (arXiv:2103.00668v3 [stat.ML] UPDATED)
    (2 min) We develop operators for construction of proposals in probabilistic programs, which we refer to as inference combinators. Inference combinators define a grammar over importance samplers that compose primitive operations such as application of a transition kernel and importance resampling. Proposals in these samplers can be parameterized using neural networks, which in turn can be trained by optimizing variational objectives. The result is a framework for user-programmable variational methods that are correct by construction and can be tailored to specific models. We demonstrate the flexibility of this framework by implementing advanced variational methods based on amortized Gibbs sampling and annealing.
    TacticZero: Learning to Prove Theorems from Scratch with Deep Reinforcement Learning. (arXiv:2102.09756v2 [cs.LG] UPDATED)
    (2 min) We propose a novel approach to interactive theorem-proving (ITP) using deep reinforcement learning. The proposed framework is able to learn proof search strategies as well as tactic and arguments prediction in an end-to-end manner. We formulate the process of ITP as a Markov decision process (MDP) in which each state represents a set of potential derivation paths. This structure allows us to introduce a novel backtracking mechanism which enables the agent to efficiently discard (predicted) dead-end derivations and restart from promising alternatives. We implement the framework in the HOL4 theorem prover. Experimental results show that the framework outperforms existing automated theorem provers (i.e., hammers) available in HOL4 when evaluated on unseen problems. We further elaborate the role of key components of the framework using ablation studies.
    Non-exponentially weighted aggregation: regret bounds for unbounded loss functions. (arXiv:2009.03017v5 [stat.ML] UPDATED)
    (2 min) We tackle the problem of online optimization with a general, possibly unbounded, loss function. It is well known that when the loss is bounded, the exponentially weighted aggregation strategy (EWA) leads to a regret in $\sqrt{T}$ after $T$ steps. In this paper, we study a generalized aggregation strategy, where the weights no longer depend exponentially on the losses. Our strategy is based on Follow The Regularized Leader (FTRL): we minimize the expected losses plus a regularizer, that is here a $\phi$-divergence. When the regularizer is the Kullback-Leibler divergence, we obtain EWA as a special case. Using alternative divergences enables unbounded losses, at the cost of a worst regret bound in some cases.
    Time Series Domain Adaptation via Sparse Associative Structure Alignment. (arXiv:2012.11797v2 [cs.LG] UPDATED)
    (2 min) Domain adaptation on time series data is an important but challenging task. Most of the existing works in this area are based on the learning of the domain-invariant representation of the data with the help of restrictions like MMD. However, such extraction of the domain-invariant representation is a non-trivial task for time series data, due to the complex dependence among the timestamps. In detail, in the fully dependent time series, a small change of the time lags or the offsets may lead to difficulty in the domain invariant extraction. Fortunately, the stability of the causality inspired us to explore the domain invariant structure of the data. To reduce the difficulty in the discovery of causal structure, we relax it to the sparse associative structure and propose a novel sparse associative structure alignment model for domain adaptation. First, we generate the segment set to exclude the obstacle of offsets. Second, the intra-variables and inter-variables sparse attention mechanisms are devised to extract associative structure time-series data with considering time lags. Finally, the associative structure alignment is used to guide the transfer of knowledge from the source domain to the target one. Experimental studies not only verify the good performance of our methods on three real-world datasets but also provide some insightful discoveries on the transferred knowledge.
    Discond-VAE: Disentangling Continuous Factors from the Discrete. (arXiv:2009.08039v2 [cs.LG] UPDATED)
    (2 min) In the real-world data, there are common variations shared by all classes (e.g. category label) and exclusive variations of each class. We propose a variant of VAE capable of disentangling both of these variations. To represent these generative factors of data, we introduce two sets of continuous latent variables, private variable and public variable. Our proposed framework models the private variable as a Mixture of Gaussian and the public variable as a Gaussian, respectively. Each mode of the private variable is responsible for a class of the discrete variable. Most of the previous attempts to integrate the discrete generative factors to disentanglement assume statistical independence between the continuous and discrete variables. Our proposed model, which we call Discond-VAE, DISentangles the class-dependent CONtinuous factors from the Discrete factors by introducing the private variables. The experiments show that Discond-VAE can discover the private and public factors from data. Moreover, even under the dataset with only public factors, Discond-VAE does not fail and adapts the private variables to represent the public factors.
    Dr-Vectors: Decision Residual Networks and an Improved Loss for Speaker Recognition. (arXiv:2104.01989v2 [cs.CL] UPDATED)
    (2 min) Many neural network speaker recognition systems model each speaker using a fixed-dimensional embedding vector. These embeddings are generally compared using either linear or 2nd-order scoring and, until recently, do not handle utterance-specific uncertainty. In this work we propose scoring these representations in a way that can capture uncertainty, enroll/test asymmetry and additional non-linear information. This is achieved by incorporating a 2nd-stage neural network (known as a decision network) as part of an end-to-end training regimen. In particular, we propose the concept of decision residual networks which involves the use of a compact decision network to leverage cosine scores and to model the residual signal that's needed. Additionally, we present a modification to the generalized end-to-end softmax loss function to target the separation of same/different speaker scores. We observed significant performance gains for the two techniques.
    Multi-modal fusion with gating using audio, lexical and disfluency features for Alzheimer's Dementia recognition from spontaneous speech. (arXiv:2106.09668v1 [cs.LG])
    (2 min) This paper is a submission to the Alzheimer's Dementia Recognition through Spontaneous Speech (ADReSS) challenge, which aims to develop methods that can assist in the automated prediction of severity of Alzheimer's Disease from speech data. We focus on acoustic and natural language features for cognitive impairment detection in spontaneous speech in the context of Alzheimer's Disease Diagnosis and the mini-mental state examination (MMSE) score prediction. We proposed a model that obtains unimodal decisions from different LSTMs, one for each modality of text and audio, and then combines them using a gating mechanism for the final prediction. We focused on sequential modelling of text and audio and investigated whether the disfluencies present in individuals' speech relate to the extent of their cognitive impairment. Our results show that the proposed classification and regression schemes obtain very promising results on both development and test sets. This suggests Alzheimer's Disease can be detected successfully with sequence modeling of the speech data of medical sessions.
    WaveGrad 2: Iterative Refinement for Text-to-Speech Synthesis. (arXiv:2106.09660v1 [eess.AS])
    (2 min) This paper introduces WaveGrad 2, a non-autoregressive generative model for text-to-speech synthesis. WaveGrad 2 is trained to estimate the gradient of the log conditional density of the waveform given a phoneme sequence. The model takes an input phoneme sequence, and through an iterative refinement process, generates an audio waveform. This contrasts to the original WaveGrad vocoder which conditions on mel-spectrogram features, generated by a separate model. The iterative refinement process starts from Gaussian noise, and through a series of refinement steps (e.g., 50 steps), progressively recovers the audio sequence. WaveGrad 2 offers a natural way to trade-off between inference speed and sample quality, through adjusting the number of refinement steps. Experiments show that the model can generate high fidelity audio, approaching the performance of a state-of-the-art neural TTS system. We also report various ablation studies over different model configurations. Audio samples are available at https://wavegrad.github.io/v2.
    LoRA: Low-Rank Adaptation of Large Language Models. (arXiv:2106.09685v1 [cs.CL])
    (2 min) The dominant paradigm of natural language processing consists of large-scale pre-training on general domain data and adaptation to particular tasks or domains. As we pre-train larger models, conventional fine-tuning, which retrains all model parameters, becomes less feasible. Using GPT-3 175B as an example, deploying many independent instances of fine-tuned models, each with 175B parameters, is extremely expensive. We propose Low-Rank Adaptation, or LoRA, which freezes the pre-trained model weights and injects trainable rank decomposition matrices into each layer of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks. For GPT-3, LoRA can reduce the number of trainable parameters by 10,000 times and the computation hardware requirement by 3 times compared to full fine-tuning. LoRA performs on-par or better than fine-tuning in model quality on both GPT-3 and GPT-2, despite having fewer trainable parameters, a higher training throughput, and no additional inference latency. We also provide an empirical investigation into rank-deficiency in language model adaptations, which sheds light on the efficacy of LoRA. We release our implementation in GPT-2 at https://github.com/microsoft/LoRA .
    ASR Adaptation for E-commerce Chatbots using Cross-Utterance Context and Multi-Task Language Modeling. (arXiv:2106.09532v1 [eess.AS])
    (2 min) Automatic Speech Recognition (ASR) robustness toward slot entities are critical in e-commerce voice assistants that involve monetary transactions and purchases. Along with effective domain adaptation, it is intuitive that cross utterance contextual cues play an important role in disambiguating domain specific content words from speech. In this paper, we investigate various techniques to improve contextualization, content word robustness and domain adaptation of a Transformer-XL neural language model (NLM) to rescore ASR N-best hypotheses. To improve contextualization, we utilize turn level dialogue acts along with cross utterance context carry over. Additionally, to adapt our domain-general NLM towards e-commerce on-the-fly, we use embeddings derived from a finetuned masked LM on in-domain data. Finally, to improve robustness towards in-domain content words, we propose a multi-task model that can jointly perform content word detection and language modeling tasks. Compared to a non-contextual LSTM LM baseline, our best performing NLM rescorer results in a content WER reduction of 19.2% on e-commerce audio test set and a slot labeling F1 improvement of 6.4%.
    Importance measures derived from random forests: characterisation and extension. (arXiv:2106.09473v1 [stat.ML])
    (2 min) Nowadays new technologies, and especially artificial intelligence, are more and more established in our society. Big data analysis and machine learning, two sub-fields of artificial intelligence, are at the core of many recent breakthroughs in many application fields (e.g., medicine, communication, finance, ...), including some that are strongly related to our day-to-day life (e.g., social networks, computers, smartphones, ...). In machine learning, significant improvements are usually achieved at the price of an increasing computational complexity and thanks to bigger datasets. Currently, cutting-edge models built by the most advanced machine learning algorithms typically became simultaneously very efficient and profitable but also extremely complex. Their complexity is to such an extent that these models are commonly seen as black-boxes providing a prediction or a decision which can not be interpreted or justified. Nevertheless, whether these models are used autonomously or as a simple decision-making support tool, they are already being used in machine learning applications where health and human life are at stake. Therefore, it appears to be an obvious necessity not to blindly believe everything coming out of those models without a detailed understanding of their predictions or decisions. Accordingly, this thesis aims at improving the interpretability of models built by a specific family of machine learning algorithms, the so-called tree-based methods. Several mechanisms have been proposed to interpret these models and we aim along this thesis to improve their understanding, study their properties, and define their limitations.
    Multi-Agent Training beyond Zero-Sum with Correlated Equilibrium Meta-Solvers. (arXiv:2106.09435v1 [cs.MA])
    (2 min) Two-player, constant-sum games are well studied in the literature, but there has been limited progress outside of this setting. We propose Joint Policy-Space Response Oracles (JPSRO), an algorithm for training agents in n-player, general-sum extensive form games, which provably converges to an equilibrium. We further suggest correlated equilibria (CE) as promising meta-solvers, and propose a novel solution concept Maximum Gini Correlated Equilibrium (MGCE), a principled and computationally efficient family of solutions for solving the correlated equilibrium selection problem. We conduct several experiments using CE meta-solvers for JPSRO and demonstrate convergence on n-player, general-sum games.
    Modelling resource allocation in uncertain system environment through deep reinforcement learning. (arXiv:2106.09461v1 [cs.LG])
    (2 min) Reinforcement Learning has applications in field of mechatronics, robotics, and other resource-constrained control system. Problem of resource allocation is primarily solved using traditional predefined techniques and modern deep learning methods. The drawback of predefined and most deep learning methods for resource allocation is failing to meet the requirements in cases of uncertain system environment. We can approach problem of resource allocation in uncertain system environment alongside following certain criteria using deep reinforcement learning. Also, reinforcement learning has ability for adapting to new uncertain environment for prolonged period of time. The paper provides a detailed comparative analysis on various deep reinforcement learning methods by applying different components to modify architecture of reinforcement learning with use of noisy layers, prioritized replay, bagging, duelling networks, and other related combination to obtain improvement in terms of performance and reduction of computational cost. The paper identifies problem of resource allocation in uncertain environment could be effectively solved using Noisy Bagging duelling double deep Q network achieving efficiency of 97.7% by maximizing reward with significant exploration in given simulated environment for resource allocation.
    Do Large Scale Molecular Language Representations Capture Important Structural Information?. (arXiv:2106.09553v1 [cs.LG])
    (2 min) Predicting chemical properties from the structure of a molecule is of great importance in many applications including drug discovery and material design. Machine learning based molecular property prediction holds the promise of enabling accurate predictions at much less complexity, when compared to, for example Density Functional Theory (DFT) calculations. Features extracted from molecular graphs, using graph neural nets in a supervised manner, have emerged as strong baselines for such tasks. However, the vast chemical space together with the limited availability of labels makes supervised learning challenging, calling for learning a general-purpose molecular representation. Recently, pre-trained transformer-based language models (PTLMs) on large unlabeled corpus have produced state-of-the-art results in many downstream natural language processing tasks. Inspired by this development, here we present molecular embeddings obtained by training an efficient transformer encoder model, referred to as MoLFormer. This model was employed with a linear attention mechanism and highly paralleized training on 1D SMILES sequences of 1.1 billion unlabeled molecules from the PubChem and ZINC datasets. Experiments show that the learned molecular representation performs competitively, when compared to existing graph-based and fingerprint-based supervised learning baselines, on the challenging tasks of predicting properties of QM8 and QM9 molecules. Further task-specific fine-tuning of the MoLFormerr representation improves performance on several of those property prediction benchmarks. These results provide encouraging evidence that large-scale molecular language models can capture sufficient structural information to be able to accurately predict quantum chemical properties and beyond.
    A Simple Generative Network. (arXiv:2106.09330v1 [cs.LG])
    (2 min) Generative neural networks are able to mimic intricate probability distributions such as those of handwritten text, natural images, etc. Since their inception several models were proposed. The most successful of these were based on adversarial (GAN), auto-encoding (VAE) and maximum mean discrepancy (MMD) relatively complex architectures and schemes. Surprisingly, a very simple architecture (a single feed-forward neural network) in conjunction with an obvious optimization goal (Kullback_Leibler divergence) was apparently overlooked. This paper demonstrates that such a model (denoted SGN for its simplicity) is able to generate samples visually and quantitatively competitive as compared with the fore-mentioned state of the art methods.
    Privacy-Preserving Eye-tracking Using Deep Learning. (arXiv:2106.09621v1 [cs.CV])
    (2 min) The expanding usage of complex machine learning methods like deep learning has led to an explosion in human activity recognition, particularly applied to health. In particular, as part of a larger body sensor network system, face and full-body analysis is becoming increasingly common for evaluating health status. However, complex models which handle private and sometimes protected data, raise concerns about the potential leak of identifiable data. In this work, we focus on the case of a deep network model trained on images of individual faces. Full-face video recordings taken from 493 individuals undergoing an eye-tracking based evaluation of neurological function were used. Outputs, gradients, intermediate layer outputs, loss, and labels were used as inputs for a deep network with an added support vector machine emission layer to recognize membership in the training data. The inference attack method and associated mathematical analysis indicate that there is a low likelihood of unintended memorization of facial features in the deep learning model. In this study, it is showed that the named model preserves the integrity of training data with reasonable confidence. The same process can be implemented in similar conditions for different models.
    A Short Note of PAGE: Optimal Convergence Rates for Nonconvex Optimization. (arXiv:2106.09663v1 [math.OC])
    (2 min) In this note, we first recall the nonconvex problem setting and introduce the optimal PAGE algorithm (Li et al., ICML'21). Then we provide a simple and clean convergence analysis of PAGE for achieving optimal convergence rates. Moreover, PAGE and its analysis can be easily adopted and generalized to other works. We hope that this note provides the insights and is helpful for future works.
    A Simple Fix to Mahalanobis Distance for Improving Near-OOD Detection. (arXiv:2106.09022v1 [cs.LG])
    (2 min) Mahalanobis distance (MD) is a simple and popular post-processing method for detecting out-of-distribution (OOD) inputs in neural networks. We analyze its failure modes for near-OOD detection and propose a simple fix called relative Mahalanobis distance (RMD) which improves performance and is more robust to hyperparameter choice. On a wide selection of challenging vision, language, and biology OOD benchmarks (CIFAR-100 vs CIFAR-10, CLINC OOD intent detection, Genomics OOD), we show that RMD meaningfully improves upon MD performance (by up to 15% AUROC on genomics OOD).
    Secure Multi-Function Computation with Private Remote Sources. (arXiv:2106.09485v1 [cs.IT])
    (2 min) We consider a distributed function computation problem in which parties observing noisy versions of a remote source facilitate the computation of a function of their observations at a fusion center through public communication. The distributed function computation is subject to constraints, including not only reliability and storage but also privacy and secrecy. Specifically, 1) the remote source should remain private from an eavesdropper and the fusion center, measured in terms of the information leaked about the remote source; 2) the function computed should remain secret from the eavesdropper, measured in terms of the information leaked about the arguments of the function, to ensure secrecy regardless of the exact function used. We derive the exact rate regions for lossless and lossy single-function computation and illustrate the lossy single-function computation rate region for an information bottleneck example, in which the optimal auxiliary random variables are characterized for binary-input symmetric-output channels. We extend the approach to lossless and lossy asynchronous multiple-function computations with joint secrecy and privacy constraints, in which case inner and outer bounds for the rate regions differing only in the Markov chain conditions imposed are characterized.
    The Fishnet Open Images Database: A Dataset for Fish Detection and Fine-Grained Categorization in Fisheries. (arXiv:2106.09178v1 [cs.CV])
    (2 min) Camera-based electronic monitoring (EM) systems are increasingly being deployed onboard commercial fishing vessels to collect essential data for fisheries management and regulation. These systems generate large quantities of video data which must be reviewed on land by human experts. Computer vision can assist this process by automatically detecting and classifying fish species, however the lack of existing public data in this domain has hindered progress. To address this, we present the Fishnet Open Images Database, a large dataset of EM imagery for fish detection and fine-grained categorization onboard commercial fishing vessels. The dataset consists of 86,029 images containing 34 object classes, making it the largest and most diverse public dataset of fisheries EM imagery to-date. It includes many of the characteristic challenges of EM data: visual similarity between species, skewed class distributions, harsh weather conditions, and chaotic crew activity. We evaluate the performance of existing detection and classification algorithms and demonstrate that the dataset can serve as a challenging benchmark for development of computer vision algorithms in fisheries. The dataset is available at https://www.fishnet.ai/.
    Identifiability-Guaranteed Simplex-Structured Post-Nonlinear Mixture Learning via Autoencoder. (arXiv:2106.09070v1 [cs.LG])
    (2 min) This work focuses on the problem of unraveling nonlinearly mixed latent components in an unsupervised manner. The latent components are assumed to reside in the probability simplex, and are transformed by an unknown post-nonlinear mixing system. This problem finds various applications in signal and data analytics, e.g., nonlinear hyperspectral unmixing, image embedding, and nonlinear clustering. Linear mixture learning problems are already ill-posed, as identifiability of the target latent components is hard to establish in general. With unknown nonlinearity involved, the problem is even more challenging. Prior work offered a function equation-based formulation for provable latent component identification. However, the identifiability conditions are somewhat stringent and unrealistic. In addition, the identifiability analysis is based on the infinite sample (i.e., population) case, while the understanding for practical finite sample cases has been elusive. Moreover, the algorithm in the prior work trades model expressiveness with computational convenience, which often hinders the learning performance. Our contribution is threefold. First, new identifiability conditions are derived under largely relaxed assumptions. Second, comprehensive sample complexity results are presented -- which are the first of the kind. Third, a constrained autoencoder-based algorithmic framework is proposed for implementation, which effectively circumvents the challenges in the existing algorithm. Synthetic and real experiments corroborate our theoretical analyses.
    Uniform Convergence of Interpolators: Gaussian Width, Norm Bounds, and Benign Overfitting. (arXiv:2106.09276v1 [stat.ML])
    (2 min) We consider interpolation learning in high-dimensional linear regression with Gaussian data, and prove a generic uniform convergence guarantee on the generalization error of interpolators in an arbitrary hypothesis class in terms of the class's Gaussian width. Applying the generic bound to Euclidean norm balls recovers the consistency result of Bartlett et al. (2020) for minimum-norm interpolators, and confirms a prediction of Zhou et al. (2020) for near-minimal-norm interpolators in the special case of Gaussian data. We demonstrate the generality of the bound by applying it to the simplex, obtaining a novel consistency result for minimum l1-norm interpolators (basis pursuit). Our results show how norm-based generalization bounds can explain and be used to analyze benign overfitting, at least in some settings.
    MHNF: Multi-hop Heterogeneous Neighborhood information Fusion graph representation learning. (arXiv:2106.09289v1 [cs.LG])
    (2 min) Attention mechanism enables the Graph Neural Networks(GNNs) to learn the attention weights between the target node and its one-hop neighbors, the performance is further improved. However, the most existing GNNs are oriented to homogeneous graphs and each layer can only aggregate the information of one-hop neighbors. Stacking multi-layer networks will introduce a lot of noise and easily lead to over smoothing. We propose a Multi-hop Heterogeneous Neighborhood information Fusion graph representation learning method (MHNF). Specifically, we first propose a hybrid metapath autonomous extraction model to efficiently extract multi-hop hybrid neighbors. Then, we propose a hop-level heterogeneous Information aggregation model, which selectively aggregates different-hop neighborhood information within the same hybrid metapath. Finally, a hierarchical semantic attention fusion model (HSAF) is proposed, which can efficiently integrate different-hop and different-path neighborhood information respectively. This paper can solve the problem of aggregating the multi-hop neighborhood information and can learn hybrid metapaths for target task, reducing the limitation of manually specifying metapaths. In addition, HSAF can extract the internal node information of the metapaths and better integrate the semantic information of different levels. Experimental results on real datasets show that MHNF is superior to state-of-the-art methods in node classification and clustering tasks (10.94% - 69.09% and 11.58% - 394.93% relative improvement on average, respectively).
    Efficient reconstruction of depth three circuits with top fan-in two. (arXiv:2103.07445v2 [cs.CC] UPDATED)
    (2 min) We develop efficient randomized algorithms to solve the black-box reconstruction problem for polynomials over finite fields, computable by depth three arithmetic circuits with alternating addition/multiplication gates, such that output gate is an addition gate with in-degree two. These circuits compute polynomials of form $G\times(T_1 + T_2)$, where $G,T_1,T_2$ are product of affine forms, and polynomials $T_1,T_2$ have no common factors. Rank of such a circuit is defined as dimension of vector space spanned by all affine factors of $T_1$ and $T_2$. For any polynomial $f$ computable by such a circuit, $rank(f)$ is defined to be the minimum rank of any such circuit computing it. Our work develops randomized reconstruction algorithms which take as input black-box access to a polynomial $f$ (over finite field $\mathbb{F}$), computable by such a circuit. Here are the results. 1 [Low rank]: When $5\leq rank(f) = O(\log^3 d)$, it runs in time $(nd^{\log^3d}\log |\mathbb{F}|)^{O(1)}$, and, with high probability, outputs a depth three circuit computing $f$, with top addition gate having in-degree $\leq d^{rank(f)}$. 2 [High rank]: When $rank(f) = \Omega(\log^3 d)$, it runs in time $(nd\log |\mathbb{F}|)^{O(1)}$, and, with high probability, outputs a depth three circuit computing $f$, with top addition gate having in-degree two. Ours is the first blackbox reconstruction algorithm for this circuit class, that runs in time polynomial in $\log |\mathbb{F}|$. This problem has been mentioned as an open problem in [GKL12] (STOC 2012)
    Independent Asymmetric Embedding for Cascade Prediction on Social Networks. (arXiv:2105.08291v2 [cs.LG] UPDATED)
    (2 min) The prediction for information diffusion on social networks has great practical significance in marketing and public opinion control. Cascade prediction aims to predict the individuals who will potentially repost the message on the social network. One kind of methods either exploit demographical, structural, and temporal features for prediction, or explicitly rely on particular information diffusion models. The other kind of models are fully data-driven and do not require a global network structure. Thus massive diffusion prediction models based on network embedding are proposed. These models embed the users into the latent space using their cascade information, but are lack of consideration for the intervene among users when embedding. In this paper, we propose an independent asymmetric embedding method to learn social embedding for cascade prediction. Different from existing methods, our method embeds each individual into one latent influence space and multiple latent susceptibility spaces. Furthermore, our method captures the co-occurrence regulation of user combination in cascades to improve the calculating effectiveness. The results of extensive experiments conducted on real-world datasets verify both the predictive accuracy and cost-effectiveness of our approach.
    Backward Gradient Normalization in Deep Neural Networks. (arXiv:2106.09475v1 [cs.LG])
    (2 min) We introduce a new technique for gradient normalization during neural network training. The gradients are rescaled during the backward pass using normalization layers introduced at certain points within the network architecture. These normalization nodes do not affect forward activity propagation, but modify backpropagation equations to permit a well-scaled gradient flow that reaches the deepest network layers without experimenting vanishing or explosion. Results on tests with very deep neural networks show that the new technique can do an effective control of the gradient norm, allowing the update of weights in the deepest layers and improving network accuracy on several experimental conditions.
    Can I Be of Further Assistance? Using Unstructured Knowledge Access to Improve Task-oriented Conversational Modeling. (arXiv:2106.09174v1 [cs.CL])
    (2 min) Most prior work on task-oriented dialogue systems are restricted to limited coverage of domain APIs. However, users oftentimes have requests that are out of the scope of these APIs. This work focuses on responding to these beyond-API-coverage user turns by incorporating external, unstructured knowledge sources. Our approach works in a pipelined manner with knowledge-seeking turn detection, knowledge selection, and response generation in sequence. We introduce novel data augmentation methods for the first two steps and demonstrate that the use of information extracted from dialogue context improves the knowledge selection and end-to-end performances. Through experiments, we achieve state-of-the-art performance for both automatic and human evaluation metrics on the DSTC9 Track 1 benchmark dataset, validating the effectiveness of our contributions.
    Batch Value-function Approximation with Only Realizability. (arXiv:2008.04990v3 [cs.LG] UPDATED)
    (2 min) We make progress in a long-standing problem of batch reinforcement learning (RL): learning $Q^\star$ from an exploratory and polynomial-sized dataset, using a realizable and otherwise arbitrary function class. In fact, all existing algorithms demand function-approximation assumptions stronger than realizability, and the mounting negative evidence has led to a conjecture that sample-efficient learning is impossible in this setting (Chen and Jiang, 2019). Our algorithm, BVFT, breaks the hardness conjecture (albeit under a stronger notion of exploratory data) via a tournament procedure that reduces the learning problem to pairwise comparison, and solves the latter with the help of a state-action partition constructed from the compared functions. We also discuss how BVFT can be applied to model selection among other extensions and open problems.
    A Winning Hand: Compressing Deep Networks Can Improve Out-Of-Distribution Robustness. (arXiv:2106.09129v1 [cs.LG])
    (2 min) Two crucial requirements for a successful adoption of deep learning (DL) in the wild are: (1) robustness to distributional shifts, and (2) model compactness for achieving efficiency. Unfortunately, efforts towards simultaneously achieving Out-of-Distribution (OOD) robustness and extreme model compactness without sacrificing accuracy have mostly been unsuccessful. This raises an important question: "Is the inability to create compact, accurate, and robust deep neural networks (CARDs) fundamental?" To answer this question, we perform a large-scale analysis for a range of popular model compression techniques which uncovers several intriguing patterns. Notably, in contrast to traditional pruning approaches (e.g., fine tuning and gradual magnitude pruning), we find that "lottery ticket-style" pruning approaches can surprisingly be used to create high performing CARDs. Specifically, we are able to create extremely compact CARDs that are dramatically more robust than their significantly larger and full-precision counterparts while matching (or beating) their test accuracy, simply by pruning and/or quantizing. To better understand these differences, we perform sensitivity analysis in the Fourier domain for CARDs trained using different data augmentation methods. Motivated by our analysis, we develop a simple domain-adaptive test-time ensembling approach (CARD-Deck) that uses a gating module to dynamically select an appropriate CARD from the CARD-Deck based on their spectral-similarity with test samples. By leveraging complementary frequency biases of different compressed models, the proposed approach builds a "winning hand" of CARDs that establishes a new state-of-the-art on CIFAR-10-C accuracies (i.e., 96.8% clean and 92.75% robust) with dramatically better memory usage than their non-compressed counterparts. We also present some theoretical evidences supporting our empirical findings.
    Training Graph Neural Networks with 1000 Layers. (arXiv:2106.07476v2 [cs.LG] UPDATED)
    (2 min) Deep graph neural networks (GNNs) have achieved excellent results on various tasks on increasingly large graph datasets with millions of nodes and edges. However, memory complexity has become a major obstacle when training deep GNNs for practical applications due to the immense number of nodes, edges, and intermediate activations. To improve the scalability of GNNs, prior works propose smart graph sampling or partitioning strategies to train GNNs with a smaller set of nodes or sub-graphs. In this work, we study reversible connections, group convolutions, weight tying, and equilibrium models to advance the memory and parameter efficiency of GNNs. We find that reversible connections in combination with deep network architectures enable the training of overparameterized GNNs that significantly outperform existing methods on multiple datasets. Our models RevGNN-Deep (1001 layers with 80 channels each) and RevGNN-Wide (448 layers with 224 channels each) were both trained on a single commodity GPU and achieve an ROC-AUC of $87.74 \pm 0.13$ and $88.24 \pm 0.15$ on the ogbn-proteins dataset. To the best of our knowledge, RevGNN-Deep is the deepest GNN in the literature by one order of magnitude. Please visit our project website https://www.deepgcns.org/arch/gnn1000 for more information.
    Evaluating the Robustness of Bayesian Neural Networks Against Different Types of Attacks. (arXiv:2106.09223v1 [cs.LG])
    (2 min) To evaluate the robustness gain of Bayesian neural networks on image classification tasks, we perform input perturbations, and adversarial attacks to the state-of-the-art Bayesian neural networks, with a benchmark CNN model as reference. The attacks are selected to simulate signal interference and cyberattacks towards CNN-based machine learning systems. The result shows that a Bayesian neural network achieves significantly higher robustness against adversarial attacks generated against a deterministic neural network model, without adversarial training. The Bayesian posterior can act as the safety precursor of ongoing malicious activities. Furthermore, we show that the stochastic classifier after the deterministic CNN extractor has sufficient robustness enhancement rather than a stochastic feature extractor before the stochastic classifier. This advises on utilizing stochastic layers in building decision-making pipelines within a safety-critical domain.
    Algorithmic Bias and Data Bias: Understanding the Relation between Distributionally Robust Optimization and Data Curation. (arXiv:2106.09467v1 [cs.LG])
    (2 min) Machine learning systems based on minimizing average error have been shown to perform inconsistently across notable subsets of the data, which is not exposed by a low average error for the entire dataset. In consequential social and economic applications, where data represent people, this can lead to discrimination of underrepresented gender and ethnic groups. Given the importance of bias mitigation in machine learning, the topic leads to contentious debates on how to ensure fairness in practice (data bias versus algorithmic bias). Distributionally Robust Optimization (DRO) seemingly addresses this problem by minimizing the worst expected risk across subpopulations. We establish theoretical results that clarify the relation between DRO and the optimization of the same loss averaged on an adequately weighted training dataset. The results cover finite and infinite number of training distributions, as well as convex and non-convex loss functions. We show that neither DRO nor curating the training set should be construed as a complete solution for bias mitigation: in the same way that there is no universally robust training set, there is no universal way to setup a DRO problem and ensure a socially acceptable set of results. We then leverage these insights to provide a mininal set of practical recommendations for addressing bias with DRO. Finally, we discuss ramifications of our results in other related applications of DRO, using an example of adversarial robustness. Our results show that there is merit to both the algorithm-focused and the data-focused side of the bias debate, as long as arguments in favor of these positions are precisely qualified and backed by relevant mathematics known today.
    On Anytime Learning at Macroscale. (arXiv:2106.09563v1 [cs.LG])
    (2 min) Classical machine learning frameworks assume access to a possibly large dataset in order to train a predictive model. In many practical applications however, data does not arrive all at once, but in batches over time. This creates a natural trade-off between accuracy of a model and time to obtain such a model. A greedy predictor could produce non-trivial predictions by immediately training on batches as soon as these become available but, it may also make sub-optimal use of future data. On the other hand, a tardy predictor could wait for a long time to aggregate several batches into a larger dataset, but ultimately deliver a much better performance. In this work, we consider such a streaming learning setting, which we dub {\em anytime learning at macroscale} (ALMA). It is an instance of anytime learning applied not at the level of a single chunk of data, but at the level of the entire sequence of large batches. We first formalize this learning setting, we then introduce metrics to assess how well learners perform on the given task for a given memory and compute budget, and finally we test several baseline approaches on standard benchmarks repurposed for anytime learning at macroscale. The general finding is that bigger models always generalize better. In particular, it is important to grow model capacity over time if the initial model is relatively small. Moreover, updating the model at an intermediate rate strikes the best trade off between accuracy and time to obtain a useful predictor.
    Towards a Unified Framework for Fair and Stable Graph Representation Learning. (arXiv:2102.13186v3 [cs.LG] UPDATED)
    (2 min) As the representations output by Graph Neural Networks (GNNs) are increasingly employed in real-world applications, it becomes important to ensure that these representations are fair and stable. In this work, we establish a key connection between counterfactual fairness and stability and leverage it to propose a novel framework, NIFTY (uNIfying Fairness and stabiliTY), which can be used with any GNN to learn fair and stable representations. We introduce a novel objective function that simultaneously accounts for fairness and stability and develop a layer-wise weight normalization using the Lipschitz constant to enhance neural message passing in GNNs. In doing so, we enforce fairness and stability both in the objective function as well as in the GNN architecture. Further, we show theoretically that our layer-wise weight normalization promotes counterfactual fairness and stability in the resulting representations. We introduce three new graph datasets comprising of high-stakes decisions in criminal justice and financial lending domains. Extensive experimentation with the above datasets demonstrates the efficacy of our framework.
    Merging versus Ensembling in Multi-Study Prediction: Theoretical Insight from Random Effects. (arXiv:1905.07382v3 [stat.ML] UPDATED)
    (2 min) A critical decision point when training predictors using multiple studies is whether these studies should be combined or treated separately. We compare two multi-study learning approaches in the presence of potential heterogeneity in predictor-outcome relationships across datasets. We consider 1) merging all of the datasets and training a single learner, and 2) multi-study ensembling, which involves training a separate learner on each dataset and combining the predictions resulting from each learner. In a linear regression setting, we show analytically and confirm via simulation that merging yields lower prediction error than ensembling when the predictor-outcome relationships are relatively homogeneous across studies. However, as cross-study heterogeneity increases, there exists a transition point beyond which ensembling outperforms merging. We provide analytic expressions for the transition point in various scenarios, study asymptotic properties, and illustrate how transition point theory can be used for deciding when studies should be combined with an application from metabolomics.
    PAC-Bayes, MAC-Bayes and Conditional Mutual Information: Fast rate bounds that handle general VC classes. (arXiv:2106.09683v1 [cs.LG])
    (2 min) We give a novel, unified derivation of conditional PAC-Bayesian and mutual information (MI) generalization bounds. We derive conditional MI bounds as an instance, with special choice of prior, of conditional MAC-Bayesian (Mean Approximately Correct) bounds, itself derived from conditional PAC-Bayesian bounds, where `conditional' means that one can use priors conditioned on a joint training and ghost sample. This allows us to get nontrivial PAC-Bayes and MI-style bounds for general VC classes, something recently shown to be impossible with standard PAC-Bayesian/MI bounds. Second, it allows us to get faster rates of order $O \left(({\text{KL}}/n)^{\gamma}\right)$ for $\gamma > 1/2$ if a Bernstein condition holds and for exp-concave losses (with $\gamma=1$), which is impossible with both standard PAC-Bayes generalization and MI bounds. Our work extends the recent work by Steinke and Zakynthinou [2020] who handle MI with VC but neither PAC-Bayes nor fast rates, the recent work of Hellstr\"om and Durisi [2020] who extend the latter to the PAC-Bayes setting via a unifying exponential inequality, and Mhammedi et al. [2019] who initiated fast rate PAC-Bayes generalization error bounds but handle neither MI nor general VC classes.
    Insights into Data through Model Behaviour: An Explainability-driven Strategy for Data Auditing for Responsible Computer Vision Applications. (arXiv:2106.09177v1 [cs.CV])
    (2 min) In this study, we take a departure and explore an explainability-driven strategy to data auditing, where actionable insights into the data at hand are discovered through the eyes of quantitative explainability on the behaviour of a dummy model prototype when exposed to data. We demonstrate this strategy by auditing two popular medical benchmark datasets, and discover hidden data quality issues that lead deep learning models to make predictions for the wrong reasons. The actionable insights gained from this explainability driven data auditing strategy is then leveraged to address the discovered issues to enable the creation of high-performing deep learning models with appropriate prediction behaviour. The hope is that such an explainability-driven strategy can be complimentary to data-driven strategies to facilitate for more responsible development of machine learning algorithms for computer vision applications.
    Learning Languages in the Limit from Positive Information with Finitely Many Memory Changes. (arXiv:2010.04782v3 [cs.FL] UPDATED)
    (2 min) We investigate learning collections of languages from texts by an inductive inference machine with access to the current datum and a bounded memory in form of states. Such a bounded memory states (BMS) learner is considered successful in case it eventually settles on a correct hypothesis while exploiting only finitely many different states. We give the complete map of all pairwise relations for an established collection of criteria of successfull learning. Most prominently, we show that non-U-shapedness is not restrictive, while conservativeness and (strong) monotonicity are. Some results carry over from iterative learning by a general lemma showing that, for a wealth of restrictions (the semantic restrictions), iterative and bounded memory states learning are equivalent. We also give an example of a non-semantic restriction (strongly non-U-shapedness) where the two settings differ.
    An Empowerment-based Solution to Robotic Manipulation Tasks with Sparse Rewards. (arXiv:2010.07986v3 [cs.RO] UPDATED)
    (2 min) In order to provide adaptive and user-friendly solutions to robotic manipulation, it is important that the agent can learn to accomplish tasks even if they are only provided with very sparse instruction signals. To address the issues reinforcement learning algorithms face when task rewards are sparse, this paper proposes an intrinsic motivation approach that can be easily integrated into any standard reinforcement learning algorithm and can allow robotic manipulators to learn useful manipulation skills with only sparse extrinsic rewards. Through integrating and balancing empowerment and curiosity, this approach shows superior performance compared to other state-of-the-art intrinsic exploration approaches during extensive empirical testing. Qualitative analysis also shows that when combined with diversity-driven intrinsic motivations, this approach can help manipulators learn a set of diverse skills which could potentially be applied to other more complicated manipulation tasks and accelerate their learning process.
    Multi-Modal Detection of Alzheimer's Disease from Speech and Text. (arXiv:2012.00096v2 [cs.LG] UPDATED)
    (2 min) Reliable detection of the prodromal stages of Alzheimer's disease (AD) remains difficult even today because, unlike other neurocognitive impairments, there is no definitive diagnosis of AD in vivo. In this context, existing research has shown that patients often develop language impairment even in mild AD conditions. We propose a multimodal deep learning method that utilizes speech and the corresponding transcript simultaneously to detect AD. For audio signals, the proposed audio-based network, a convolutional neural network (CNN) based model, predicts the diagnosis for multiple speech segments, which are combined for the final prediction. Similarly, we use contextual embedding extracted from BERT concatenated with a CNN-generated embedding for classifying the transcript. The individual predictions of the two models are then combined to make the final classification. We also perform experiments to analyze the model performance when Automated Speech Recognition (ASR) system generated transcripts are used instead of manual transcription in the text-based model. The proposed method achieves 85.3% 10-fold cross-validation accuracy when trained and evaluated on the Dementiabank Pitt corpus.
    Always Be Dreaming: A New Approach for Data-Free Class-Incremental Learning. (arXiv:2106.09701v1 [cs.CV])
    (2 min) Modern computer vision applications suffer from catastrophic forgetting when incrementally learning new concepts over time. The most successful approaches to alleviate this forgetting require extensive replay of previously seen data, which is problematic when memory constraints or data legality concerns exist. In this work, we consider the high-impact problem of Data-Free Class-Incremental Learning (DFCIL), where an incremental learning agent must learn new concepts over time without storing generators or training data from past tasks. One approach for DFCIL is to replay synthetic images produced by inverting a frozen copy of the learner's classification model, but we show this approach fails for common class-incremental benchmarks when using standard distillation strategies. We diagnose the cause of this failure and propose a novel incremental distillation strategy for DFCIL, contributing a modified cross-entropy training and importance-weighted feature distillation, and show that our method results in up to a 25.1% increase in final task accuracy (absolute difference) compared to SOTA DFCIL methods for common class-incremental benchmarks. Our method even outperforms several standard replay based methods which store a coreset of images.
    Metrizing Weak Convergence with Maximum Mean Discrepancies. (arXiv:2006.09268v2 [cs.LG] UPDATED)
    (2 min) This paper characterizes the maximum mean discrepancies (MMD) that metrize the weak convergence of probability measures for a wide class of kernels. More precisely, we prove that, on a locally compact, non-compact, Hausdorff space, the MMD of a bounded continuous Borel measurable kernel k, whose reproducing kernel Hilbert space (RKHS) functions vanish at infinity, metrizes the weak convergence of probability measures if and only if k is continuous and integrally strictly positive definite (i.s.p.d.) over all signed, finite, regular Borel measures. We also correct a prior result of Simon-Gabriel & Sch\"olkopf (JMLR, 2018, Thm.12) by showing that there exist both bounded continuous i.s.p.d. kernels that do not metrize weak convergence and bounded continuous non-i.s.p.d. kernels that do metrize it.
    Predicting the Popularity of Reddit Posts with AI. (arXiv:2106.07380v2 [cs.LG] UPDATED)
    (2 min) Social media creates crucial mass changes, as popular posts and opinions cast a significant influence on users' decisions and thought processes. For example, the recent Reddit uprising inspired by r/wallstreetbets which had remarkable economic impact was started with a series of posts on the thread. The prediction of posts that may have a notable impact will allow for the preparation of possible following trends. This study aims to develop a machine learning model capable of accurately predicting the popularity of a Reddit post. Specifically, the model is predicting the number of upvotes a post will receive based on its textual content. I experimented with three different models: a baseline linear regression model, a random forest regression model, and a neural network. I collected Reddit post data from an online data set and analyzed the model's performance when trained on a single subreddit and a collection of subreddits. The results showed that the neural network model performed the best when the loss of the models were compared. With the use of a machine learning model to predict social trends through the reaction users have to post, a better picture of the near future can be envisioned.
    Pruning Randomly Initialized Neural Networks with Iterative Randomization. (arXiv:2106.09269v1 [cs.LG])
    (2 min) Pruning the weights of randomly initialized neural networks plays an important role in the context of lottery ticket hypothesis. Ramanujan et al. (2020) empirically showed that only pruning the weights can achieve remarkable performance instead of optimizing the weight values. However, to achieve the same level of performance as the weight optimization, the pruning approach requires more parameters in the networks before pruning and thus more memory space. To overcome this parameter inefficiency, we introduce a novel framework to prune randomly initialized neural networks with iteratively randomizing weight values (IteRand). Theoretically, we prove an approximation theorem in our framework, which indicates that the randomizing operations are provably effective to reduce the required number of the parameters. We also empirically demonstrate the parameter efficiency in multiple experiments on CIFAR-10 and ImageNet.
    Scalable Approach for Normalizing E-commerce Text Attributes (SANTA). (arXiv:2106.09493v1 [cs.CL])
    (2 min) In this paper, we present SANTA, a scalable framework to automatically normalize E-commerce attribute values (e.g. "Win 10 Pro") to a fixed set of pre-defined canonical values (e.g. "Windows 10"). Earlier works on attribute normalization focused on fuzzy string matching (also referred as syntactic matching in this paper). In this work, we first perform an extensive study of nine syntactic matching algorithms and establish that 'cosine' similarity leads to best results, showing 2.7% improvement over commonly used Jaccard index. Next, we argue that string similarity alone is not sufficient for attribute normalization as many surface forms require going beyond syntactic matching (e.g. "720p" and "HD" are synonyms). While semantic techniques like unsupervised embeddings (e.g. word2vec/fastText) have shown good results in word similarity tasks, we observed that they perform poorly to distinguish between close canonical forms, as these close forms often occur in similar contexts. We propose to learn token embeddings using a twin network with triplet loss. We propose an embedding learning task leveraging raw attribute values and product titles to learn these embeddings in a self-supervised fashion. We show that providing supervision using our proposed task improves over both syntactic and unsupervised embeddings based techniques for attribute normalization. Experiments on a real-world attribute normalization dataset of 50 attributes show that the embeddings trained using our proposed approach obtain 2.3% improvement over best string matching and 19.3% improvement over best unsupervised embeddings.
    Towards a Rigorous Theoretical Analysis and Evaluation of GNN Explanations. (arXiv:2106.09078v1 [cs.LG])
    (2 min) As Graph Neural Networks (GNNs) are increasingly employed in real-world applications, it becomes critical to ensure that the stakeholders understand the rationale behind their predictions. While several GNN explanation methods have been proposed recently, there has been little to no work on theoretically analyzing the behavior of these methods or systematically evaluating their effectiveness. Here, we introduce the first axiomatic framework for theoretically analyzing, evaluating, and comparing state-of-the-art GNN explanation methods. We outline and formalize the key desirable properties that all GNN explanation methods should satisfy in order to generate reliable explanations, namely, faithfulness, stability, and fairness. We leverage these properties to present the first ever theoretical analysis of the effectiveness of state-of-the-art GNN explanation methods. Our analysis establishes upper bounds on all the aforementioned properties for popular GNN explanation methods. We also leverage our framework to empirically evaluate these methods on multiple real-world datasets from diverse domains. Our empirical results demonstrate that some popular GNN explanation methods (e.g., gradient-based methods) perform no better than a random baseline and that methods which leverage the graph structure are more effective than those that solely rely on the node features.
    Distributional Robust Batch Contextual Bandits. (arXiv:2006.05630v2 [cs.LG] UPDATED)
    (2 min) Policy learning using historical observational data is an important problem that has found widespread applications. Examples include selecting offers, prices, advertisements to send to customers, as well as selecting which medication to prescribe to a patient. However, existing literature rests on the crucial assumption that the future environment where the learned policy will be deployed is the same as the past environment that has generated the data--an assumption that is often false or too coarse an approximation. In this paper, we lift this assumption and aim to learn a distributional robust policy with incomplete (bandit) observational data. We propose a novel learning algorithm that is able to learn a robust policy to adversarial perturbations and unknown covariate shifts. We first present a policy evaluation procedure in the ambiguous environment and then give a performance guarantee based on the theory of uniform convergence. Additionally, we also give a heuristic algorithm to solve the distributional robust policy learning problems efficiently. Finally, we demonstrate the robustness of our methods in the synthetic and real-world datasets.
    Recurrent Neural Networks for Stochastic Control Problems with Delay. (arXiv:2101.01385v2 [math.OC] UPDATED)
    (2 min) Stochastic control problems with delay are challenging due to the path-dependent feature of the system and thus its intrinsic high dimensions. In this paper, we propose and systematically study deep neural networks-based algorithms to solve stochastic control problems with delay features. Specifically, we employ neural networks for sequence modeling (\emph{e.g.}, recurrent neural networks such as long short-term memory) to parameterize the policy and optimize the objective function. The proposed algorithms are tested on three benchmark examples: a linear-quadratic problem, optimal consumption with fixed finite delay, and portfolio optimization with complete memory. Particularly, we notice that the architecture of recurrent neural networks naturally captures the path-dependent feature with much flexibility and yields better performance with more efficient and stable training of the network compared to feedforward networks. The superiority is even evident in the case of portfolio optimization with complete memory, which features infinite delay.
    Design and Analysis of Robust Deep Learning Models for Stock Price Prediction. (arXiv:2106.09664v1 [q-fin.ST])
    (2 min) Building predictive models for robust and accurate prediction of stock prices and stock price movement is a challenging research problem to solve. The well-known efficient market hypothesis believes in the impossibility of accurate prediction of future stock prices in an efficient stock market as the stock prices are assumed to be purely stochastic. However, numerous works proposed by researchers have demonstrated that it is possible to predict future stock prices with a high level of precision using sophisticated algorithms, model architectures, and the selection of appropriate variables in the models. This chapter proposes a collection of predictive regression models built on deep learning architecture for robust and precise prediction of the future prices of a stock listed in the diversified sectors in the National Stock Exchange (NSE) of India. The Metastock tool is used to download the historical stock prices over a period of two years (2013- 2014) at 5 minutes intervals. While the records for the first year are used to train the models, the testing is carried out using the remaining records. The design approaches of all the models and their performance results are presented in detail. The models are also compared based on their execution time and accuracy of prediction.
    Learning from Demonstration without Demonstrations. (arXiv:2106.09203v1 [cs.LG])
    (2 min) State-of-the-art reinforcement learning (RL) algorithms suffer from high sample complexity, particularly in the sparse reward case. A popular strategy for mitigating this problem is to learn control policies by imitating a set of expert demonstrations. The drawback of such approaches is that an expert needs to produce demonstrations, which may be costly in practice. To address this shortcoming, we propose Probabilistic Planning for Demonstration Discovery (P2D2), a technique for automatically discovering demonstrations without access to an expert. We formulate discovering demonstrations as a search problem and leverage widely-used planning algorithms such as Rapidly-exploring Random Tree to find demonstration trajectories. These demonstrations are used to initialize a policy, then refined by a generic RL algorithm. We provide theoretical guarantees of P2D2 finding successful trajectories, as well as bounds for its sampling complexity. We experimentally demonstrate the method outperforms classic and intrinsic exploration RL techniques in a range of classic control and robotics tasks, requiring only a fraction of exploration samples and achieving better asymptotic performance.
    How Low Can We Go: Trading Memory for Error in Low-Precision Training. (arXiv:2106.09686v1 [cs.LG])
    (2 min) Low-precision arithmetic trains deep learning models using less energy, less memory and less time. However, we pay a price for the savings: lower precision may yield larger round-off error and hence larger prediction error. As applications proliferate, users must choose which precision to use to train a new model, and chip manufacturers must decide which precisions to manufacture. We view these precision choices as a hyperparameter tuning problem, and borrow ideas from meta-learning to learn the tradeoff between memory and error. In this paper, we introduce Pareto Estimation to Pick the Perfect Precision (PEPPP). We use matrix factorization to find non-dominated configurations (the Pareto frontier) with a limited number of network evaluations. For any given memory budget, the precision that minimizes error is a point on this frontier. Practitioners can use the frontier to trade memory for error and choose the best precision for their goals.
    Author Clustering and Topic Estimation for Short Texts. (arXiv:2106.09533v1 [cs.IR])
    (2 min) Analysis of short text, such as social media posts, is extremely difficult because it relies on observing many document-level word co-occurrence pairs. Beyond topic distributions, a common downstream task of the modeling is grouping the authors of these documents for subsequent analyses. Traditional models estimate the document groupings and identify user clusters with an independent procedure. We propose a novel model that expands on the Latent Dirichlet Allocation by modeling strong dependence among the words in the same document, with user-level topic distributions. We also simultaneously cluster users, removing the need for post-hoc cluster estimation and improving topic estimation by shrinking noisy user-level topic distributions towards typical values. Our method performs as well as -- or better -- than traditional approaches to problems arising in short text, and we demonstrate its usefulness on a dataset of tweets from United States Senators, recovering both meaningful topics and clusters that reflect partisan ideology.
    A Self-supervised Method for Entity Alignment. (arXiv:2106.09395v1 [cs.CL])
    (2 min) Entity alignment, aiming to identify equivalent entities across different knowledge graphs (KGs), is a fundamental problem for constructing large-scale KGs. Over the course of its development, supervision has been considered necessary for accurate alignments. Inspired by the recent progress of self-supervised learning, we explore the extent to which we can get rid of supervision for entity alignment. Existing supervised methods for this task focus on pulling each pair of positive (labeled) entities close to each other. However, our analysis suggests that the learning of entity alignment can actually benefit more from pushing sampled (unlabeled) negatives far away than pulling positive aligned pairs close. We present SelfKG by leveraging this discovery to design a contrastive learning strategy across two KGs. Extensive experiments on benchmark datasets demonstrate that SelfKG without supervision can match or achieve comparable results with state-of-the-art supervised baselines. The performance of SelfKG demonstrates self-supervised learning offers great potential for entity alignment in KGs.
    Just How Toxic is Data Poisoning? A Unified Benchmark for Backdoor and Data Poisoning Attacks. (arXiv:2006.12557v3 [cs.LG] UPDATED)
    (2 min) Data poisoning and backdoor attacks manipulate training data in order to cause models to fail during inference. A recent survey of industry practitioners found that data poisoning is the number one concern among threats ranging from model stealing to adversarial attacks. However, it remains unclear exactly how dangerous poisoning methods are and which ones are more effective considering that these methods, even ones with identical objectives, have not been tested in consistent or realistic settings. We observe that data poisoning and backdoor attacks are highly sensitive to variations in the testing setup. Moreover, we find that existing methods may not generalize to realistic settings. While these existing works serve as valuable prototypes for data poisoning, we apply rigorous tests to determine the extent to which we should fear them. In order to promote fair comparison in future work, we develop standardized benchmarks for data poisoning and backdoor attacks.
    Randomized Value Functions via Posterior State-Abstraction Sampling. (arXiv:2010.02383v2 [cs.LG] UPDATED)
    (2 min) State abstraction has been an essential tool for dramatically improving the sample efficiency of reinforcement-learning algorithms. Indeed, by exposing and accentuating various types of latent structure within the environment, different classes of state abstraction have enabled improved theoretical guarantees and empirical performance. When dealing with state abstractions that capture structure in the value function, however, a standard assumption is that the true abstraction has been supplied or unrealistically computed a priori, leaving open the question of how to efficiently uncover such latent structure while jointly seeking out optimal behavior. Taking inspiration from the bandit literature, we propose that an agent seeking out latent task structure must explicitly represent and maintain its uncertainty over that structure as part of its overall uncertainty about the environment. We introduce a practical algorithm for doing this using two posterior distributions over state abstractions and abstract-state values. In empirically validating our approach, we find that substantial performance gains lie in the multi-task setting where tasks share a common, low-dimensional representation.
    Statistical Query Lower Bounds for List-Decodable Linear Regression. (arXiv:2106.09689v1 [cs.DS])
    (2 min) We study the problem of list-decodable linear regression, where an adversary can corrupt a majority of the examples. Specifically, we are given a set $T$ of labeled examples $(x, y) \in \mathbb{R}^d \times \mathbb{R}$ and a parameter $0< \alpha <1/2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a linear regression model with Gaussian covariates, and the remaining $(1-\alpha)$-fraction of the points are drawn from an arbitrary noise distribution. The goal is to output a small list of hypothesis vectors such that at least one of them is close to the target regression vector. Our main result is a Statistical Query (SQ) lower bound of $d^{\mathrm{poly}(1/\alpha)}$ for this problem. Our SQ lower bound qualitatively matches the performance of previously developed algorithms, providing evidence that current upper bounds for this task are nearly best possible.
    Scrambled Translation Problem: A Problem of Denoising UNMT. (arXiv:1911.01212v2 [cs.CL] UPDATED)
    (2 min) In this paper, we identify an interesting kind of error in the output of Unsupervised Neural Machine Translation (UNMT) systems like \textit{Undreamt}(footnote). We refer to this error type as \textit{Scrambled Translation problem}. We observe that UNMT models which use \textit{word shuffle} noise (as in case of Undreamt) can generate correct words, but fail to stitch them together to form phrases. As a result, words of the translated sentence look \textit{scrambled}, resulting in decreased BLEU. We hypothesise that the reason behind \textit{scrambled translation problem} is 'shuffling noise' which is introduced in every input sentence as a denoising strategy. To test our hypothesis, we experiment by retraining UNMT models with a simple \textit{retraining} strategy. We stop the training of the Denoising UNMT model after a pre-decided number of iterations and resume the training for the remaining iterations -- which number is also pre-decided -- using original sentence as input without adding any noise. Our proposed solution achieves significant performance improvement UNMT models that train conventionally. We demonstrate these performance gains on four language pairs, \textit{viz.}, English-French, English-German, English-Spanish, Hindi-Punjabi. Our qualitative and quantitative analysis shows that the retraining strategy helps achieve better alignment as observed by attention heatmap and better phrasal translation, leading to statistically significant improvement in BLEU scores.
    Towards Heterogeneous Clients with Elastic Federated Learning. (arXiv:2106.09433v1 [cs.LG])
    (2 min) Federated learning involves training machine learning models over devices or data silos, such as edge processors or data warehouses, while keeping the data local. Training in heterogeneous and potentially massive networks introduces bias into the system, which is originated from the non-IID data and the low participation rate in reality. In this paper, we propose Elastic Federated Learning (EFL), an unbiased algorithm to tackle the heterogeneity in the system, which makes the most informative parameters less volatile during training, and utilizes the incomplete local updates. It is an efficient and effective algorithm that compresses both upstream and downstream communications. Theoretically, the algorithm has convergence guarantee when training on the non-IID data at the low participation rate. Empirical experiments corroborate the competitive performance of EFL framework on the robustness and the efficiency.
    Adversarial Visual Robustness by Causal Intervention. (arXiv:2106.09534v1 [cs.CV])
    (2 min) Adversarial training is the de facto most promising defense against adversarial examples. Yet, its passive nature inevitably prevents it from being immune to unknown attackers. To achieve a proactive defense, we need a more fundamental understanding of adversarial examples, beyond the popular bounded threat model. In this paper, we provide a causal viewpoint of adversarial vulnerability: the cause is the confounder ubiquitously existing in learning, where attackers are precisely exploiting the confounding effect. Therefore, a fundamental solution for adversarial robustness is causal intervention. As the confounder is unobserved in general, we propose to use the instrumental variable that achieves intervention without the need for confounder observation. We term our robust training method as Causal intervention by instrumental Variable (CiiV). It has a differentiable retinotopic sampling layer and a consistency loss, which is stable and guaranteed not to suffer from gradient obfuscation. Extensive experiments on a wide spectrum of attackers and settings applied in MNIST, CIFAR-10, and mini-ImageNet datasets empirically demonstrate that CiiV is robust to adaptive attacks.
    Understanding Boolean Function Learnability on Deep Neural Networks. (arXiv:2009.05908v2 [cs.LG] UPDATED)
    (0 min) Computational learning theory states that many classes of boolean formulas are learnable in polynomial time. This paper addresses the understudied subject of how, in practice, such formulas can be learned by deep neural networks. Specifically, we analyse boolean formulas associated with the decision version of combinatorial optimisation problems, model sampling benchmarks, and random 3-CNFs with varying degrees of constrainedness. Our extensive experiments indicate that: (i) regardless of the combinatorial optimisation problem, relatively small and shallow neural networks are very good approximators of the associated formulas; (ii) smaller formulas seem harder to learn, possibly due to the fewer positive (satisfying) examples available; and (iii) interestingly, underconstrained 3-CNF formulas are more challenging to learn than overconstrained ones. Source code and relevant datasets are publicly available (https://github.com/machine-reasoning-ufrgs/mlbf).
    Non-intrusive Nonlinear Model Reduction via Machine Learning Approximations to Low-dimensional Operators. (arXiv:2106.09658v1 [cs.LG])
    (2 min) Although projection-based reduced-order models (ROMs) for parameterized nonlinear dynamical systems have demonstrated exciting results across a range of applications, their broad adoption has been limited by their intrusivity: implementing such a reduced-order model typically requires significant modifications to the underlying simulation code. To address this, we propose a method that enables traditionally intrusive reduced-order models to be accurately approximated in a non-intrusive manner. Specifically, the approach approximates the low-dimensional operators associated with projection-based reduced-order models (ROMs) using modern machine-learning regression techniques. The only requirement of the simulation code is the ability to export the velocity given the state and parameters as this functionality is used to train the approximated low-dimensional operators. In addition to enabling nonintrusivity, we demonstrate that the approach also leads to very low computational complexity, achieving up to $1000\times$ reduction in run time. We demonstrate the effectiveness of the proposed technique on two types of PDEs.
    Disentangling Identifiable Features from Noisy Data with Structured Nonlinear ICA. (arXiv:2106.09620v1 [stat.ML])
    (0 min) We introduce a new general identifiable framework for principled disentanglement referred to as Structured Nonlinear Independent Component Analysis (SNICA). Our contribution is to extend the identifiability theory of deep generative models for a very broad class of structured models. While previous works have shown identifiability for specific classes of time-series models, our theorems extend this to more general temporal structures as well as to models with more complex structures such as spatial dependencies. In particular, we establish the major result that identifiability for this framework holds even in the presence of noise of unknown distribution. The SNICA setting therefore subsumes all the existing nonlinear ICA models for time-series and also allows for new much richer identifiable models. Finally, as an example of our framework's flexibility, we introduce the first nonlinear ICA model for time-series that combines the following very useful properties: it accounts for both nonstationarity and autocorrelation in a fully unsupervised setting; performs dimensionality reduction; models hidden states; and enables principled estimation and inference by variational maximum-likelihood.
    Normalization of breast MRIs using Cycle-Consistent Generative Adversarial Networks. (arXiv:1912.08061v2 [eess.IV] UPDATED)
    (0 min) Dynamic Contrast Enhanced-Magnetic Resonance Imaging (DCE-MRI) is widely used to complement ultrasound examinations and x-ray mammography during the early detection and diagnosis of breast cancer. However, images generated by various MRI scanners (e.g. GE Healthcare vs Siemens) differ both in intensity and noise distribution, preventing algorithms trained on MRIs from one scanner to generalize to data from other scanners successfully. We propose a method for image normalization to solve this problem. MRI normalization is challenging because it requires both normalizing intensity values and mapping between the noise distributions of different scanners. We utilize a cycle-consistent generative adversarial network to learn a bidirectional mapping between MRIs produced by GE Healthcare and Siemens scanners. This allows us learning the mapping between two different scanner types without matched data, which is not commonly available. To ensure the preservation of breast shape and structures within the breast, we propose two technical innovations. First, we incorporate a mutual information loss with the CycleGAN architecture to ensure that the structure of the breast is maintained. Second, we propose a modified discriminator architecture which utilizes a smaller field-of-view to ensure the preservation of finer details in the breast tissue. Quantitative and qualitative evaluations show that the second proposed method was able to consistently preserve a high level of detail in the breast structure while also performing the proper intensity normalization and noise mapping. Our results demonstrate that the proposed model can successfully learn a bidirectional mapping between MRIs produced by different vendors, potentially enabling improved accuracy of downstream computational algorithms for diagnosis and detection of breast cancer. All the data used in this study are publicly available.
    Data-driven control of room temperature and bidirectional EV charging using deep reinforcement learning: simulations and experiments. (arXiv:2103.01886v2 [cs.LG] UPDATED)
    (2 min) This work presents a fully data-driven, black-box pipeline to obtain an optimal control policy for a multi-loop building control problem based on historical building and weather data, thus without the need for complex physics-based modelling. We demonstrate the method for joint control of room temperature and bidirectional EV charging to maximize the occupant thermal comfort and energy savings while leaving enough energy in the EV battery for the next trip. We modelled the room temperature with a recurrent neural network and EV charging with a piece-wise linear function. Using these models as a simulation environment, we applied a deep reinforcement learning (DRL) algorithm to obtain an optimal control policy. The learnt policy achieves on average 17% energy savings over the heating season and 19% better comfort satisfaction than a standard RB room temperature controller. When a bidirectional EV is additionally connected and a two-tariff electricity pricing is applied, the MIMO DRL policy successfully leverages the battery and decreases the overall cost of electricity compared to two standard RB controllers, one controlling the room temperature and another controlling the bidirectional EV (dis-)charging. Finally, we demonstrate a successful transfer of the learnt DRL policy from simulation onto a real building, the DFAB HOUSE at Empa Duebendorf in Switzerland, achieving up to 30% energy savings while maintaining similar comfort levels compared to a conventional RB room temperature controller over three weeks during the heating season.
    Self-Supervised Multimodal Domino: in Search of Biomarkers for Alzheimer's Disease. (arXiv:2012.13623v4 [cs.LG] UPDATED)
    (2 min) Sensory input from multiple sources is crucial for robust and coherent human perception. Different sources contribute complementary explanatory factors. Similarly, research studies often collect multimodal imaging data, each of which can provide shared and unique information. This observation motivated the design of powerful multimodal self-supervised representation-learning algorithms. In this paper, we unify recent work on multimodal self-supervised learning under a single framework. Observing that most self-supervised methods optimize similarity metrics between a set of model components, we propose a taxonomy of all reasonable ways to organize this process. We first evaluate models on toy multimodal MNIST datasets and then apply them to a multimodal neuroimaging dataset with Alzheimer's disease patients. We find that (1) multimodal contrastive learning has significant benefits over its unimodal counterpart, (2) the specific composition of multiple contrastive objectives is critical to performance on a downstream task, (3) maximization of the similarity between representations has a regularizing effect on a neural network, which can sometimes lead to reduced downstream performance but still reveal multimodal relations. Results show that the proposed approach outperforms previous self-supervised encoder-decoder methods based on canonical correlation analysis (CCA) or the mixture-of-experts multimodal variational autoEncoder (MMVAE) on various datasets with a linear evaluation protocol. Importantly, we find a promising solution to uncover connections between modalities through a jointly shared subspace that can help advance work in our search for neuroimaging biomarkers.
    Automatic Analysis of the Emotional Content of Speech in Daylong Child-Centered Recordings from a Neonatal Intensive Care Unit. (arXiv:2106.09539v1 [eess.AS])
    (2 min) Researchers have recently started to study how the emotional speech heard by young infants can affect their developmental outcomes. As a part of this research, hundreds of hours of daylong recordings from preterm infants' audio environments were collected from two hospitals in Finland and Estonia in the context of so-called APPLE study. In order to analyze the emotional content of speech in such a massive dataset, an automatic speech emotion recognition (SER) system is required. However, there are no emotion labels or existing indomain SER systems to be used for this purpose. In this paper, we introduce this initially unannotated large-scale real-world audio dataset and describe the development of a functional SER system for the Finnish subset of the data. We explore the effectiveness of alternative state-of-the-art techniques to deploy a SER system to a new domain, comparing cross-corpus generalization, WGAN-based domain adaptation, and active learning in the task. As a result, we show that the best-performing models are able to achieve a classification performance of 73.4% unweighted average recall (UAR) and 73.2% UAR for a binary classification for valence and arousal, respectively. The results also show that active learning achieves the most consistent performance compared to the two alternatives.
    XCiT: Cross-Covariance Image Transformers. (arXiv:2106.09681v1 [cs.CV])
    (0 min) Following their success in natural language processing, transformers have recently shown much promise for computer vision. The self-attention operation underlying transformers yields global interactions between all tokens ,i.e. words or image patches, and enables flexible modelling of image data beyond the local interactions of convolutions. This flexibility, however, comes with a quadratic complexity in time and memory, hindering application to long sequences and high-resolution images. We propose a "transposed" version of self-attention that operates across feature channels rather than tokens, where the interactions are based on the cross-covariance matrix between keys and queries. The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images. Our cross-covariance image transformer (XCiT) is built upon XCA. It combines the accuracy of conventional transformers with the scalability of convolutional architectures. We validate the effectiveness and generality of XCiT by reporting excellent results on multiple vision benchmarks, including image classification and self-supervised feature learning on ImageNet-1k, object detection and instance segmentation on COCO, and semantic segmentation on ADE20k.
    Deep Learning Through the Lens of Example Difficulty. (arXiv:2106.09647v1 [cs.LG])
    (0 min) Existing work on understanding deep learning often employs measures that compress all data-dependent information into a few numbers. In this work, we adopt a perspective based on the role of individual examples. We introduce a measure of the computational difficulty of making a prediction for a given input: the (effective) prediction depth. Our extensive investigation reveals surprising yet simple relationships between the prediction depth of a given input and the model's uncertainty, confidence, accuracy and speed of learning for that data point. We further categorize difficult examples into three interpretable groups, demonstrate how these groups are processed differently inside deep models and showcase how this understanding allows us to improve prediction accuracy. Insights from our study lead to a coherent view of a number of separately reported phenomena in the literature: early layers generalize while later layers memorize; early layers converge faster and networks learn easy data and simple functions first.
    QuantumFed: A Federated Learning Framework for Collaborative Quantum Training. (arXiv:2106.09109v1 [cs.LG])
    (0 min) With the fast development of quantum computing and deep learning, quantum neural networks have attracted great attention recently. By leveraging the power of quantum computing, deep neural networks can potentially overcome computational power limitations in classic machine learning. However, when multiple quantum machines wish to train a global model using the local data on each machine, it may be very difficult to copy the data into one machine and train the model. Therefore, a collaborative quantum neural network framework is necessary. In this article, we borrow the core idea of federated learning to propose QuantumFed, a quantum federated learning framework to have multiple quantum nodes with local quantum data train a mode together. Our experiments show the feasibility and robustness of our framework.
    A Lyapunov Theory for Finite-Sample Guarantees of Asynchronous Q-Learning and TD-Learning Variants. (arXiv:2102.01567v3 [cs.LG] UPDATED)
    (2 min) This paper develops an unified framework to study finite-sample convergence guarantees of a large class of value-based asynchronous reinforcement learning (RL) algorithms. We do this by first reformulating the RL algorithms as \textit{Markovian Stochastic Approximation} (SA) algorithms to solve fixed-point equations. We then develop a Lyapunov analysis and derive mean-square error bounds on the convergence of the Markovian SA. Based on this result, we establish finite-sample mean-square convergence bounds for asynchronous RL algorithms such as $Q$-learning, $n$-step TD, TD$(\lambda)$, and off-policy TD algorithms including V-trace. As a by-product, by analyzing the convergence bounds of $n$-step TD and TD$(\lambda)$, we provide theoretical insights into the bias-variance trade-off, i.e., efficiency of bootstrapping in RL. This was first posed as an open problem in (Sutton, 1999).
    A Survey on Semi-Supervised Learning for Delayed Partially Labelled Data Streams. (arXiv:2106.09170v1 [cs.LG])
    (2 min) Unlabelled data appear in many domains and are particularly relevant to streaming applications, where even though data is abundant, labelled data is rare. To address the learning problems associated with such data, one can ignore the unlabelled data and focus only on the labelled data (supervised learning); use the labelled data and attempt to leverage the unlabelled data (semi-supervised learning); or assume some labels will be available on request (active learning). The first approach is the simplest, yet the amount of labelled data available will limit the predictive performance. The second relies on finding and exploiting the underlying characteristics of the data distribution. The third depends on an external agent to provide the required labels in a timely fashion. This survey pays special attention to methods that leverage unlabelled data in a semi-supervised setting. We also discuss the delayed labelling issue, which impacts both fully supervised and semi-supervised methods. We propose a unified problem setting, discuss the learning guarantees and existing methods, explain the differences between related problem settings. Finally, we review the current benchmarking practices and propose adaptations to enhance them.
    Prototypical Graph Contrastive Learning. (arXiv:2106.09645v1 [cs.LG])
    (2 min) Graph-level representations are critical in various real-world applications, such as predicting the properties of molecules. But in practice, precise graph annotations are generally very expensive and time-consuming. To address this issue, graph contrastive learning constructs instance discrimination task which pulls together positive pairs (augmentation pairs of the same graph) and pushes away negative pairs (augmentation pairs of different graphs) for unsupervised representation learning. However, since for a query, its negatives are uniformly sampled from all graphs, existing methods suffer from the critical sampling bias issue, i.e., the negatives likely having the same semantic structure with the query, leading to performance degradation. To mitigate this sampling bias issue, in this paper, we propose a Prototypical Graph Contrastive Learning (PGCL) approach. Specifically, PGCL models the underlying semantic structure of the graph data via clustering semantically similar graphs into the same group, and simultaneously encourages the clustering consistency for different augmentations of the same graph. Then given a query, it performs negative sampling via drawing the graphs from those clusters that differ from the cluster of query, which ensures the semantic difference between query and its negative samples. Moreover, for a query, PGCL further reweights its negative samples based on the distance between their prototypes (cluster centroids) and the query prototype such that those negatives having moderate prototype distance enjoy relatively large weights. This reweighting strategy is proved to be more effective than uniform sampling. Experimental results on various graph benchmarks testify the advantages of our PGCL over state-of-the-art methods.
    Stochastic Bias-Reduced Gradient Methods. (arXiv:2106.09481v1 [math.OC])
    (2 min) We develop a new primitive for stochastic optimization: a low-bias, low-cost estimator of the minimizer $x_\star$ of any Lipschitz strongly-convex function. In particular, we use a multilevel Monte-Carlo approach due to Blanchet and Glynn to turn any optimal stochastic gradient method into an estimator of $x_\star$ with bias $\delta$, variance $O(\log(1/\delta))$, and an expected sampling cost of $O(\log(1/\delta))$ stochastic gradient evaluations. As an immediate consequence, we obtain cheap and nearly unbiased gradient estimators for the Moreau-Yoshida envelope of any Lipschitz convex function, allowing us to perform dimension-free randomized smoothing. We demonstrate the potential of our estimator through four applications. First, we develop a method for minimizing the maximum of $N$ functions, improving on recent results and matching a lower bound up logarithmic factors. Second and third, we recover state-of-the-art rates for projection-efficient and gradient-efficient optimization using simple algorithms with a transparent analysis. Finally, we show that an improved version of our estimator would yield a nearly linear-time, optimal-utility, differentially-private non-smooth stochastic optimization method.
    Learning Knowledge Graph-based World Models of Textual Environments. (arXiv:2106.09608v1 [cs.LG])
    (2 min) World models improve a learning agent's ability to efficiently operate in interactive and situated environments. This work focuses on the task of building world models of text-based game environments. Text-based games, or interactive narratives, are reinforcement learning environments in which agents perceive and interact with the world using textual natural language. These environments contain long, multi-step puzzles or quests woven through a world that is filled with hundreds of characters, locations, and objects. Our world model learns to simultaneously: (1) predict changes in the world caused by an agent's actions when representing the world as a knowledge graph; and (2) generate the set of contextually relevant natural language actions required to operate in the world. We frame this task as a Set of Sequences generation problem by exploiting the inherent structure of knowledge graphs and actions and introduce both a transformer-based multi-task architecture and a loss function to train it. A zero-shot ablation study on never-before-seen textual worlds shows that our methodology significantly outperforms existing textual world modeling techniques as well as the importance of each of our contributions.
    Pushing the Limits of Non-Autoregressive Speech Recognition. (arXiv:2104.03416v3 [eess.AS] UPDATED)
    (2 min) We combine recent advancements in end-to-end speech recognition to non-autoregressive automatic speech recognition. We push the limits of non-autoregressive state-of-the-art results for multiple datasets: LibriSpeech, Fisher+Switchboard and Wall Street Journal. Key to our recipe, we leverage CTC on giant Conformer neural network architectures with SpecAugment and wav2vec2 pre-training. We achieve 1.8%/3.6% WER on LibriSpeech test/test-other sets, 5.1%/9.8% WER on Switchboard, and 3.4% on the Wall Street Journal, all without a language model.
    Deep Dimension Reduction for Supervised Representation Learning. (arXiv:2006.05865v2 [cs.LG] UPDATED)
    (2 min) The goal of supervised representation learning is to construct effective data representations for prediction. Among all the characteristics of an ideal nonparametric representation of high-dimensional complex data, sufficiency, low dimensionality and disentanglement are some of the most essential ones. We propose a deep dimension reduction approach to learning representations with these characteristics. The proposed approach is a nonparametric generalization of the sufficient dimension reduction method. We formulate the ideal representation learning task as that of finding a nonparametric representation that minimizes an objective function characterizing conditional independence and promoting disentanglement at the population level. We then estimate the target representation at the sample level nonparametrically using deep neural networks. We show that the estimated deep nonparametric representation is consistent in the sense that its excess risk converges to zero. Our extensive numerical experiments using simulated and real benchmark data demonstrate that the proposed methods have better performance than several existing dimension reduction methods and the standard deep learning models in the context of classification and regression.
    Poisoning and Backdooring Contrastive Learning. (arXiv:2106.09667v1 [cs.LG])
    (2 min) Contrastive learning methods like CLIP train on noisy and uncurated training datasets. This is cheaper than labeling datasets manually, and even improves out-of-distribution robustness. We show that this practice makes backdoor and poisoning attacks a significant threat. By poisoning just 0.005% of a dataset (e.g., just 150 images of the 3 million-example Conceptual Captions dataset), we can cause the model to misclassify test images by overlaying a small patch. Targeted poisoning attacks, whereby the model misclassifies a particular test input with an adversarially-desired label, are even easier requiring control of less than 0.0001% of the dataset (e.g., just two out of the 3 million images). Our attacks call into question whether training on noisy and uncurated Internet scrapes is desirable.
    Implicit Bias of SGD for Diagonal Linear Networks: a Provable Benefit of Stochasticity. (arXiv:2106.09524v1 [cs.LG])
    (2 min) Understanding the implicit bias of training algorithms is of crucial importance in order to explain the success of overparametrised neural networks. In this paper, we study the dynamics of stochastic gradient descent over diagonal linear networks through its continuous time version, namely stochastic gradient flow. We explicitly characterise the solution chosen by the stochastic flow and prove that it always enjoys better generalisation properties than that of gradient flow. Quite surprisingly, we show that the convergence speed of the training loss controls the magnitude of the biasing effect: the slower the convergence, the better the bias. To fully complete our analysis, we provide convergence guarantees for the dynamics. We also give experimental results which support our theoretical claims. Our findings highlight the fact that structured noise can induce better generalisation and they help explain the greater performances observed in practice of stochastic gradient descent over gradient descent.
    Optimum-statistical collaboration towards efficient black-box optimization. (arXiv:2106.09215v1 [stat.ML])
    (2 min) With increasingly more hyperparameters involved in their training, machine learning systems demand a better understanding of hyperparameter tuning automation. This has raised interest in studies of provably black-box optimization, which is made more practical by better exploration mechanism implemented in algorithm design, managing the flux of both optimization and statistical errors. Prior efforts focus on delineating optimization errors, but this is deficient: black-box optimization algorithms can be inefficient without considering heterogeneity among reward samples. In this paper, we make the key delineation on the role of statistical uncertainty in black-box optimization, guiding a more efficient algorithm design. We introduce \textit{optimum-statistical collaboration}, a framework of managing the interaction between optimization error flux and statistical error flux evolving in the optimization process. Inspired by this framework, we propose the \texttt{VHCT} algorithms for objective functions with only local-smoothness assumptions. In theory, we prove our algorithm enjoys rate-optimal regret bounds; in experiments, we show the algorithm outperforms prior efforts in extensive settings.
    An Imprecise SHAP as a Tool for Explaining the Class Probability Distributions under Limited Training Data. (arXiv:2106.09111v1 [cs.LG])
    (2 min) One of the most popular methods of the machine learning prediction explanation is the SHapley Additive exPlanations method (SHAP). An imprecise SHAP as a modification of the original SHAP is proposed for cases when the class probability distributions are imprecise and represented by sets of distributions. The first idea behind the imprecise SHAP is a new approach for computing the marginal contribution of a feature, which fulfils the important efficiency property of Shapley values. The second idea is an attempt to consider a general approach to calculating and reducing interval-valued Shapley values, which is similar to the idea of reachable probability intervals in the imprecise probability theory. A simple special implementation of the general approach in the form of linear optimization problems is proposed, which is based on using the Kolmogorov-Smirnov distance and imprecise contamination models. Numerical examples with synthetic and real data illustrate the imprecise SHAP.
    Optimality and Stability in Federated Learning: A Game-theoretic Approach. (arXiv:2106.09580v1 [cs.GT])
    (2 min) Federated learning is a distributed learning paradigm where multiple agents, each only with access to local data, jointly learn a global model. There has recently been an explosion of research aiming not only to improve the accuracy rates of federated learning, but also provide certain guarantees around social good properties such as total error. One branch of this research has taken a game-theoretic approach, and in particular, prior work has viewed federated learning as a hedonic game, where error-minimizing players arrange themselves into federating coalitions. This past work proves the existence of stable coalition partitions, but leaves open a wide range of questions, including how far from optimal these stable solutions are. In this work, we motivate and define a notion of optimality given by the average error rates among federating agents (players). First, we provide and prove the correctness of an efficient algorithm to calculate an optimal (error minimizing) arrangement of players. Next, we analyze the relationship between the stability and optimality of an arrangement. First, we show that for some regions of parameter space, all stable arrangements are optimal (Price of Anarchy equal to 1). However, we show this is not true for all settings: there exist examples of stable arrangements with higher cost than optimal (Price of Anarchy greater than 1). Finally, we give the first constant-factor bound on the performance gap between stability and optimality, proving that the total error of the worst stable solution can be no higher than 9 times the total error of an optimal solution (Price of Anarchy bound of 9).
    Mungojerrie: Reinforcement Learning of Linear-Time Objectives. (arXiv:2106.09161v1 [cs.LG])
    (2 min) Reinforcement learning synthesizes controllers without prior knowledge of the system. At each timestep, a reward is given. The controllers optimize the discounted sum of these rewards. Applying this class of algorithms requires designing a reward scheme, which is typically done manually. The designer must ensure that their intent is accurately captured. This may not be trivial, and is prone to error. An alternative to this manual programming, akin to programming directly in assembly, is to specify the objective in a formal language and have it "compiled" to a reward scheme. Mungojerrie ($\href{https://plv.colorado.edu/mungojerrie/}{plv.colorado.edu/mungojerrie}$) is a tool for testing reward schemes for $\omega$-regular objectives on finite models. The tool contains reinforcement learning algorithms and a probabilistic model checker. Mungojerrie supports models specified in PRISM and $\omega$-automata specified in HOA.
    Unsupervised Training Data Generation of Handwritten Formulas using Generative Adversarial Networks with Self-Attention. (arXiv:2106.09432v1 [cs.CV])
    (2 min) The recognition of handwritten mathematical expressions in images and video frames is a difficult and unsolved problem yet. Deep convectional neural networks are basically a promising approach, but typically require a large amount of labeled training data. However, such a large training dataset does not exist for the task of handwritten formula recognition. In this paper, we introduce a system that creates a large set of synthesized training examples of mathematical expressions which are derived from LaTeX documents. For this purpose, we propose a novel attention-based generative adversarial network to translate rendered equations to handwritten formulas. The datasets generated by this approach contain hundreds of thousands of formulas, making it ideal for pretraining or the design of more complex models. We evaluate our synthesized dataset and the recognition approach on the CROHME 2014 benchmark dataset. Experimental results demonstrate the feasibility of the approach.
    NPAS: A Compiler-aware Framework of Unified Network Pruning and Architecture Search for Beyond Real-Time Mobile Acceleration. (arXiv:2012.00596v3 [cs.LG] UPDATED)
    (2 min) With the increasing demand to efficiently deploy DNNs on mobile edge devices, it becomes much more important to reduce unnecessary computation and increase the execution speed. Prior methods towards this goal, including model compression and network architecture search (NAS), are largely performed independently and do not fully consider compiler-level optimizations which is a must-do for mobile acceleration. In this work, we first propose (i) a general category of fine-grained structured pruning applicable to various DNN layers, and (ii) a comprehensive, compiler automatic code generation framework supporting different DNNs and different pruning schemes, which bridge the gap of model compression and NAS. We further propose NPAS, a compiler-aware unified network pruning, and architecture search. To deal with large search space, we propose a meta-modeling procedure based on reinforcement learning with fast evaluation and Bayesian optimization, ensuring the total number of training epochs comparable with representative NAS frameworks. Our framework achieves 6.7ms, 5.9ms, 3.9ms ImageNet inference times with 78.2%, 75% (MobileNet-V3 level), and 71% (MobileNet-V2 level) Top-1 accuracy respectively on an off-the-shelf mobile phone, consistently outperforming prior work.
    RHNAS: Realizable Hardware and Neural Architecture Search. (arXiv:2106.09180v1 [cs.LG])
    (2 min) The rapidly evolving field of Artificial Intelligence necessitates automated approaches to co-design neural network architecture and neural accelerators to maximize system efficiency and address productivity challenges. To enable joint optimization of this vast space, there has been growing interest in differentiable NN-HW co-design. Fully differentiable co-design has reduced the resource requirements for discovering optimized NN-HW configurations, but fail to adapt to general hardware accelerator search spaces. This is due to the existence of non-synthesizable (invalid) designs in the search space of many hardware accelerators. To enable efficient and realizable co-design of configurable hardware accelerators with arbitrary neural network search spaces, we introduce RHNAS. RHNAS is a method that combines reinforcement learning for hardware optimization with differentiable neural architecture search. RHNAS discovers realizable NN-HW designs with 1.84x lower latency and 1.86x lower energy-delay product (EDP) on ImageNet and 2.81x lower latency and 3.30x lower EDP on CIFAR-10 over the default hardware accelerator design.
    Invertible Concept-based Explanations for CNN Models with Non-negative Concept Activation Vectors. (arXiv:2006.15417v4 [cs.CV] UPDATED)
    (2 min) Convolutional neural network (CNN) models for computer vision are powerful but lack explainability in their most basic form. This deficiency remains a key challenge when applying CNNs in important domains. Recent work on explanations through feature importance of approximate linear models has moved from input-level features (pixels or segments) to features from mid-layer feature maps in the form of concept activation vectors (CAVs). CAVs contain concept-level information and could be learned via clustering. In this work, we rethink the ACE algorithm of Ghorbani et~al., proposing an alternative invertible concept-based explanation (ICE) framework to overcome its shortcomings. Based on the requirements of fidelity (approximate models to target models) and interpretability (being meaningful to people), we design measurements and evaluate a range of matrix factorization methods with our framework. We find that non-negative concept activation vectors (NCAVs) from non-negative matrix factorization provide superior performance in interpretability and fidelity based on computational and human subject experiments. Our framework provides both local and global concept-level explanations for pre-trained CNN models.
    Smart Contract Vulnerability Detection: From Pure Neural Network to Interpretable Graph Feature and Expert Pattern Fusion. (arXiv:2106.09282v1 [cs.LG])
    (2 min) Smart contracts hold digital coins worth billions of dollars, their security issues have drawn extensive attention in the past years. Towards smart contract vulnerability detection, conventional methods heavily rely on fixed expert rules, leading to low accuracy and poor scalability. Recent deep learning approaches alleviate this issue but fail to encode useful expert knowledge. In this paper, we explore combining deep learning with expert patterns in an explainable fashion. Specifically, we develop automatic tools to extract expert patterns from the source code. We then cast the code into a semantic graph to extract deep graph features. Thereafter, the global graph feature and local expert patterns are fused to cooperate and approach the final prediction, while yielding their interpretable weights. Experiments are conducted on all available smart contracts with source code in two platforms, Ethereum and VNT Chain. Empirically, our system significantly outperforms state-of-the-art methods. Our code is released.
    Distribution Free Uncertainty for the Minimum Norm Solution of Over-parameterized Linear Regression. (arXiv:2102.07181v2 [cs.LG] UPDATED)
    (2 min) A fundamental principle of learning theory is that there is a trade-off between the complexity of a prediction rule and its ability to generalize. Modern machine learning models do not obey this paradigm: They produce an accurate prediction even with a perfect fit to the training set. We investigate over-parameterized linear regression models focusing on the minimum norm solution: This is the solution with the minimal norm that attains a perfect fit to the training set. We utilize the recently proposed predictive normalized maximum likelihood (pNML) learner which is the min-max regret solution for the distribution-free setting. We derive an upper bound of this min-max regret which is associated with the prediction uncertainty. We show that if the test sample lies mostly in a subspace spanned by the eigenvectors associated with the large eigenvalues of the empirical correlation matrix of the training data, the model generalizes despite its over-parameterized nature. We demonstrate the use of the pNML regret as a point-wise learnability measure on synthetic data and successfully observe the double-decent phenomenon of the over-parameterized models on UCI datasets.
    A Deep Reinforcement Learning Approach towards Pendulum Swing-up Problem based on TF-Agents. (arXiv:2106.09556v1 [stat.ML])
    (2 min) Adapting the idea of training CartPole with Deep Q-learning agent, we are able to find a promising result that prevent the pole from falling down. The capacity of reinforcement learning (RL) to learn from the interaction between the environment and agent provides an optimal control strategy. In this paper, we aim to solve the classic pendulum swing-up problem that making the learned pendulum to be in upright position and balanced. Deep Deterministic Policy Gradient algorithm is introduced to operate over continuous action domain in this problem. Salient results of optimal pendulum are proved with increasing average return, decreasing loss, and live video in the code part.
    Towards Explainable Student Group Collaboration Assessment Models Using Temporal Representations of Individual Student Roles. (arXiv:2106.09623v1 [cs.LG])
    (2 min) Collaboration is identified as a required and necessary skill for students to be successful in the fields of Science, Technology, Engineering and Mathematics (STEM). However, due to growing student population and limited teaching staff it is difficult for teachers to provide constructive feedback and instill collaborative skills using instructional methods. Development of simple and easily explainable machine-learning-based automated systems can help address this problem. Improving upon our previous work, in this paper we propose using simple temporal-CNN deep-learning models to assess student group collaboration that take in temporal representations of individual student roles as input. We check the applicability of dynamically changing feature representations for student group collaboration assessment and how they impact the overall performance. We also use Grad-CAM visualizations to better understand and interpret the important temporal indices that led to the deep-learning model's decision.
    Machine Learning for Variance Reduction in Online Experiments. (arXiv:2106.07263v2 [stat.ML] UPDATED)
    (2 min) We consider the problem of variance reduction in randomized controlled trials, through the use of covariates correlated with the outcome but independent of the treatment. We propose a machine learning regression-adjusted treatment effect estimator, which we call MLRATE. MLRATE uses machine learning predictors of the outcome to reduce estimator variance. It employs cross-fitting to avoid overfitting biases, and we prove consistency and asymptotic normality under general conditions. MLRATE is robust to poor predictions from the machine learning step: if the predictions are uncorrelated with the outcomes, the estimator performs asymptotically no worse than the standard difference-in-means estimator, while if predictions are highly correlated with outcomes, the efficiency gains are large. In A/A tests, for a set of 48 outcome metrics commonly monitored in Facebook experiments the estimator has over 70% lower variance than the simple difference-in-means estimator, and about 19% lower variance than the common univariate procedure which adjusts only for pre-experiment values of the outcome.
    Scientific Language Models for Biomedical Knowledge Base Completion: An Empirical Study. (arXiv:2106.09700v1 [cs.CL])
    (2 min) Biomedical knowledge graphs (KGs) hold rich information on entities such as diseases, drugs, and genes. Predicting missing links in these graphs can boost many important applications, such as drug design and repurposing. Recent work has shown that general-domain language models (LMs) can serve as "soft" KGs, and that they can be fine-tuned for the task of KG completion. In this work, we study scientific LMs for KG completion, exploring whether we can tap into their latent knowledge to enhance biomedical link prediction. We evaluate several domain-specific LMs, fine-tuning them on datasets centered on drugs and diseases that we represent as KGs and enrich with textual entity descriptions. We integrate the LM-based models with KG embedding models, using a router method that learns to assign each input example to either type of model and provides a substantial boost in performance. Finally, we demonstrate the advantage of LM models in the inductive setting with novel scientific entities. Our datasets and code are made publicly available.
    Orthogonal-Pad\'e Activation Functions: Trainable Activation functions for smooth and faster convergence in deep networks. (arXiv:2106.09693v1 [cs.NE])
    (2 min) We have proposed orthogonal-Pad\'e activation functions, which are trainable activation functions and show that they have faster learning capability and improves the accuracy in standard deep learning datasets and models. Based on our experiments, we have found two best candidates out of six orthogonal-Pad\'e activations, which we call safe Hermite-Pade (HP) activation functions, namely HP-1 and HP-2. When compared to ReLU, HP-1 and HP-2 has an increment in top-1 accuracy by 5.06% and 4.63% respectively in PreActResNet-34, by 3.02% and 2.75% respectively in MobileNet V2 model on CIFAR100 dataset while on CIFAR10 dataset top-1 accuracy increases by 2.02% and 1.78% respectively in PreActResNet-34, by 2.24% and 2.06% respectively in LeNet, by 2.15% and 2.03% respectively in Efficientnet B0.
    Square Root Principal Component Pursuit: Tuning-Free Noisy Robust Matrix Recovery. (arXiv:2106.09211v1 [cs.LG])
    (2 min) We propose a new framework -- Square Root Principal Component Pursuit -- for low-rank matrix recovery from observations corrupted with noise and outliers. Inspired by the square root Lasso, this new formulation does not require prior knowledge of the noise level. We show that a single, universal choice of the regularization parameter suffices to achieve reconstruction error proportional to the (a priori unknown) noise level. In comparison, previous formulations such as stable PCP rely on noise-dependent parameters to achieve similar performance, and are therefore challenging to deploy in applications where the noise level is unknown. We validate the effectiveness of our new method through experiments on simulated and real datasets. Our simulations corroborate the claim that a universal choice of the regularization parameter yields near optimal performance across a range of noise levels, indicating that the proposed method outperforms the (somewhat loose) bound proved here.
    BinaryCoP: Binary Neural Network-based COVID-19 Face-Mask Wear and Positioning Predictor on Edge Devices. (arXiv:2102.03456v2 [cs.CV] UPDATED)
    (3 min) Face masks have long been used in many areas of everyday life to protect against the inhalation of hazardous fumes and particles. They also offer an effective solution in healthcare for bi-directional protection against air-borne diseases. Wearing and positioning the mask correctly is essential for its function. Convolutional neural networks (CNNs) offer an excellent solution for face recognition and classification of correct mask wearing and positioning. In the context of the ongoing COVID-19 pandemic, such algorithms can be used at entrances to corporate buildings, airports, shopping areas, and other indoor locations, to mitigate the spread of the virus. These application scenarios impose major challenges to the underlying compute platform. The inference hardware must be cheap, small and energy efficient, while providing sufficient memory and compute power to execute accurate CNNs at a reasonably low latency. To maintain data privacy of the public, all processing must remain on the edge-device, without any communication with cloud servers. To address these challenges, we present a low-power binary neural network classifier for correct facial-mask wear and positioning. The classification task is implemented on an embedded FPGA, performing high-throughput binary operations. Classification can take place at up to ~6400 frames-per-second, easily enabling multi-camera, speed-gate settings or statistics collection in crowd settings. When deployed on a single entrance or gate, the idle power consumption is reduced to 1.6W, improving the battery-life of the device. We achieve an accuracy of up to 98% for four wearing positions of the MaskedFace-Net dataset. To maintain equivalent classification accuracy for all face structures, skin-tones, hair types, and mask types, the algorithms are tested for their ability to generalize the relevant features over all subjects using the Grad-CAM approach.
    Meta-Calibration: Meta-Learning of Model Calibration Using Differentiable Expected Calibration Error. (arXiv:2106.09613v1 [cs.LG])
    (2 min) Calibration of neural networks is a topical problem that is becoming increasingly important for real-world use of neural networks. The problem is especially noticeable when using modern neural networks, for which there is significant difference between the model confidence and the confidence it should have. Various strategies have been successfully proposed, yet there is more space for improvements. We propose a novel approach that introduces a differentiable metric for expected calibration error and successfully uses it as an objective for meta-learning, achieving competitive results with state-of-the-art approaches. Our approach presents a new direction of using meta-learning to directly optimize model calibration, which we believe will inspire further work in this promising and new direction.
    Spectral goodness-of-fit tests for complete and partial network data. (arXiv:2106.09702v1 [stat.ME])
    (2 min) Networks describe the, often complex, relationships between individual actors. In this work, we address the question of how to determine whether a parametric model, such as a stochastic block model or latent space model, fits a dataset well and will extrapolate to similar data. We use recent results in random matrix theory to derive a general goodness-of-fit test for dyadic data. We show that our method, when applied to a specific model of interest, provides an straightforward, computationally fast way of selecting parameters in a number of commonly used network models. For example, we show how to select the dimension of the latent space in latent space models. Unlike other network goodness-of-fit methods, our general approach does not require simulating from a candidate parametric model, which can be cumbersome with large graphs, and eliminates the need to choose a particular set of statistics on the graph for comparison. It also allows us to perform goodness-of-fit tests on partial network data, such as Aggregated Relational Data. We show with simulations that our method performs well in many situations of interest. We analyze several empirically relevant networks and show that our method leads to improved community detection algorithms. R code to implement our method is available on Github.
    Time Series is a Special Sequence: Forecasting with Sample Convolution and Interaction. (arXiv:2106.09305v1 [cs.LG])
    (2 min) Time series is a special type of sequence data, a set of observations collected at even intervals of time and ordered chronologically. Existing deep learning techniques use generic sequence models (e.g., recurrent neural network, Transformer model, or temporal convolutional network) for time series analysis, which ignore some of its unique properties. For example, the downsampling of time series data often preserves most of the information in the data, while this is not true for general sequence data such as text sequence and DNA sequence. Motivated by the above, in this paper, we propose a novel neural network architecture and apply it for the time series forecasting problem, wherein we conduct sample convolution and interaction at multiple resolutions for temporal modeling. The proposed architecture, namelySCINet, facilitates extracting features with enhanced predictability. Experimental results show that SCINet achieves significant prediction accuracy improvement over existing solutions across various real-world time series forecasting datasets. In particular, it can achieve high fore-casting accuracy for those temporal-spatial datasets without using sophisticated spatial modeling techniques. Our codes and data are presented in the supplemental material.
    Gone Fishing: Neural Active Learning with Fisher Embeddings. (arXiv:2106.09675v1 [cs.LG])
    (2 min) There is an increasing need for effective active learning algorithms that are compatible with deep neural networks. While there are many classic, well-studied sample selection methods, the non-convexity and varying internal representation of neural models make it unclear how to extend these approaches. This article introduces BAIT, a practical, tractable, and high-performing active learning algorithm for neural networks that addresses these concerns. BAIT draws inspiration from the theoretical analysis of maximum likelihood estimators (MLE) for parametric models. It selects batches of samples by optimizing a bound on the MLE error in terms of the Fisher information, which we show can be implemented efficiently at scale by exploiting linear-algebraic structure especially amenable to execution on modern hardware. Our experiments show that BAIT outperforms the previous state of the art on both classification and regression problems, and is flexible enough to be used with a variety of model architectures.
    On the Dark Side of Calibration for Modern Neural Networks. (arXiv:2106.09385v1 [cs.LG])
    (2 min) Modern neural networks are highly uncalibrated. It poses a significant challenge for safety-critical systems to utilise deep neural networks (DNNs), reliably. Many recently proposed approaches have demonstrated substantial progress in improving DNN calibration. However, they hardly touch upon refinement, which historically has been an essential aspect of calibration. Refinement indicates separability of a network's correct and incorrect predictions. This paper presents a theoretically and empirically supported exposition for reviewing a model's calibration and refinement. Firstly, we show the breakdown of expected calibration error (ECE), into predicted confidence and refinement. Connecting with this result, we highlight that regularisation based calibration only focuses on naively reducing a model's confidence. This logically has a severe downside to a model's refinement. We support our claims through rigorous empirical evaluations of many state of the art calibration approaches on standard datasets. We find that many calibration approaches with the likes of label smoothing, mixup etc. lower the utility of a DNN by degrading its refinement. Even under natural data shift, this calibration-refinement trade-off holds for the majority of calibration methods. These findings call for an urgent retrospective into some popular pathways taken for modern DNN calibration.
    Transductive Few-Shot Learning: Clustering is All You Need?. (arXiv:2106.09516v1 [cs.LG])
    (2 min) We investigate a general formulation for clustering and transductive few-shot learning, which integrates prototype-based objectives, Laplacian regularization and supervision constraints from a few labeled data points. We propose a concave-convex relaxation of the problem, and derive a computationally efficient block-coordinate bound optimizer, with convergence guarantee. At each iteration,our optimizer computes independent (parallel) updates for each point-to-cluster assignment. Therefore, it could be trivially distributed for large-scale clustering and few-shot tasks. Furthermore, we provides a thorough convergence analysis based on point-to-set maps. Were port comprehensive clustering and few-shot learning experiments over various data sets, showing that our method yields competitive performances, in term of accuracy and optimization quality, while scaling up to large problems. Using standard training on the base classes, without resorting to complex meta-learning and episodic-training strategies, our approach outperforms state-of-the-art few-shot methods by significant margins, across various models, settings and data sets. Surprisingly, we found that even standard clustering procedures (e.g., K-means), which correspond to particular, non-regularized cases of our general model, already achieve competitive performances in comparison to the state-of-the-art in few-shot learning. These surprising results point to the limitations of the current few-shot benchmarks, and question the viability of a large body of convoluted few-shot learning techniques in the recent literature.
    Multi-Modal Prototype Learning for Interpretable Multivariable Time Series Classification. (arXiv:2106.09636v1 [cs.LG])
    (2 min) Multivariable time series classification problems are increasing in prevalence and complexity in a variety of domains, such as biology and finance. While deep learning methods are an effective tool for these problems, they often lack interpretability. In this work, we propose a novel modular prototype learning framework for multivariable time series classification. In the first stage of our framework, encoders extract features from each variable independently. Prototype layers identify single-variable prototypes in the resulting feature spaces. The next stage of our framework represents the multivariable time series sample points in terms of their similarity to these single-variable prototypes. This results in an inherently interpretable representation of multivariable patterns, on which prototype learning is applied to extract representative examples i.e. multivariable prototypes. Our framework is thus able to explicitly identify both informative patterns in the individual variables, as well as the relationships between the variables. We validate our framework on a simulated dataset with embedded patterns, as well as a real human activity recognition problem. Our framework attains comparable or superior classification performance to existing time series classification methods on these tasks. On the simulated dataset, we find that our model returns interpretations consistent with the embedded patterns. Moreover, the interpretations learned on the activity recognition dataset align with domain knowledge.
    Improving On-Screen Sound Separation for Open Domain Videos with Audio-Visual Self-attention. (arXiv:2106.09669v1 [cs.SD])
    (2 min) We introduce a state-of-the-art audio-visual on-screen sound separation system which is capable of learning to separate sounds and associate them with on-screen objects by looking at in-the-wild videos. We identify limitations of previous work on audiovisual on-screen sound separation, including the simplicity and coarse resolution of spatio-temporal attention, and poor convergence of the audio separation model. Our proposed model addresses these issues using cross-modal and self-attention modules that capture audio-visual dependencies at a finer resolution over time, and by unsupervised pre-training of audio separation model. These improvements allow the model to generalize to a much wider set of unseen videos. For evaluation and semi-supervised training, we collected human annotations of on-screen audio from a large database of in-the-wild videos (YFCC100M). Our results show marked improvements in on-screen separation performance, in more general conditions than previous methods.
    Regularization of Mixture Models for Robust Principal Graph Learning. (arXiv:2106.09035v1 [cs.LG])
    (2 min) A regularized version of Mixture Models is proposed to learn a principal graph from a distribution of $D$-dimensional data points. In the particular case of manifold learning for ridge detection, we assume that the underlying manifold can be modeled as a graph structure acting like a topological prior for the Gaussian clusters turning the problem into a maximum a posteriori estimation. Parameters of the model are iteratively estimated through an Expectation-Maximization procedure making the learning of the structure computationally efficient with guaranteed convergence for any graph prior in a polynomial time. We also embed in the formalism a natural way to make the algorithm robust to outliers of the pattern and heteroscedasticity of the manifold sampling coherently with the graph structure. The method uses a graph prior given by the minimum spanning tree that we extend using random sub-samplings of the dataset to take into account cycles that can be observed in the spatial distribution.
    Distance Metric Learning for Graph Structured Data. (arXiv:2002.00727v2 [stat.ML] UPDATED)
    (2 min) Graphs are versatile tools for representing structured data. As a result, a variety of machine learning methods have been studied for graph data analysis. Although many such learning methods depend on the measurement of differences between input graphs, defining an appropriate distance metric for graphs remains a controversial issue. Hence, we propose a supervised distance metric learning method for the graph classification problem. Our method, named interpretable graph metric learning (IGML), learns discriminative metrics in a subgraph-based feature space, which has a strong graph representation capability. By introducing a sparsity-inducing penalty on the weight of each subgraph, IGML can identify a small number of important subgraphs that can provide insight into the given classification task. Because our formulation has a large number of optimization variables, an efficient algorithm that uses pruning techniques based on safe screening and working set selection methods is also proposed. An important property of IGML is that solution optimality is guaranteed because the problem is formulated as a convex problem and our pruning strategies only discard unnecessary subgraphs. Furthermore, we show that IGML is also applicable to other structured data such as itemset and sequence data, and that it can incorporate vertex-label similarity by using a transportation-based subgraph feature. We empirically evaluate the computational efficiency and classification performance of IGML on several benchmark datasets and provide some illustrative examples of how IGML identifies important subgraphs from a given graph dataset.
    Machine learning methods for postprocessing ensemble forecasts of wind gusts: A systematic comparison. (arXiv:2106.09512v1 [stat.ML])
    (2 min) Postprocessing ensemble weather predictions to correct systematic errors has become a standard practice in research and operations. However, only few recent studies have focused on ensemble postprocessing of wind gust forecasts, despite its importance for severe weather warnings. Here, we provide a comprehensive review and systematic comparison of eight statistical and machine learning methods for probabilistic wind gust forecasting via ensemble postprocessing, that can be divided in three groups: State of the art postprocessing techniques from statistics (ensemble model output statistics (EMOS), member-by-member postprocessing, isotonic distributional regression), established machine learning methods (gradient-boosting extended EMOS, quantile regression forests) and neural network-based approaches (distributional regression network, Bernstein quantile network, histogram estimation network). The methods are systematically compared using six years of data from a high-resolution, convection-permitting ensemble prediction system that was run operationally at the German weather service, and hourly observations at 175 surface weather stations in Germany. While all postprocessing methods yield calibrated forecasts and are able to correct the systematic errors of the raw ensemble predictions, incorporating information from additional meteorological predictor variables beyond wind gusts leads to significant improvements in forecast skill. In particular, we propose a flexible framework of locally adaptive neural networks with different probabilistic forecast types as output, which not only significantly outperform all benchmark postprocessing methods but also learn physically consistent relations associated with the diurnal cycle, especially the evening transition of the planetary boundary layer.
    Why Do Pretrained Language Models Help in Downstream Tasks? An Analysis of Head and Prompt Tuning. (arXiv:2106.09226v1 [cs.LG])
    (2 min) Pretrained language models have achieved state-of-the-art performance when adapted to a downstream NLP task. However, theoretical analysis of these models is scarce and challenging since the pretraining and downstream tasks can be very different. We propose an analysis framework that links the pretraining and downstream tasks with an underlying latent variable generative model of text -- the downstream classifier must recover a function of the posterior distribution over the latent variables. We analyze head tuning (learning a classifier on top of the frozen pretrained model) and prompt tuning in this setting. The generative model in our analysis is either a Hidden Markov Model (HMM) or an HMM augmented with a latent memory component, motivated by long-term dependencies in natural language. We show that 1) under certain non-degeneracy conditions on the HMM, simple classification heads can solve the downstream task, 2) prompt tuning obtains downstream guarantees with weaker non-degeneracy conditions, and 3) our recovery guarantees for the memory-augmented HMM are stronger than for the vanilla HMM because task-relevant information is easier to recover from the long-term memory. Experiments on synthetically generated data from HMMs back our theoretical findings.
    Optimizing Data Usage via Differentiable Rewards. (arXiv:1911.10088v3 [cs.LG] UPDATED)
    (2 min) To acquire a new skill, humans learn better and faster if a tutor, based on their current knowledge level, informs them of how much attention they should pay to particular content or practice problems. Similarly, a machine learning model could potentially be trained better with a scorer that "adapts" to its current learning state and estimates the importance of each training data instance. Training such an adaptive scorer efficiently is a challenging problem; in order to precisely quantify the effect of a data instance at a given time during the training, it is typically necessary to first complete the entire training process. To efficiently optimize data usage, we propose a reinforcement learning approach called Differentiable Data Selection (DDS). In DDS, we formulate a scorer network as a learnable function of the training data, which can be efficiently updated along with the main model being trained. Specifically, DDS updates the scorer with an intuitive reward signal: it should up-weigh the data that has a similar gradient with a dev set upon which we would finally like to perform well. Without significant computing overhead, DDS delivers strong and consistent improvements over several strong baselines on two very different tasks of machine translation and image classification.
    Exploring the Properties and Evolution of Neural Network Eigenspaces during Training. (arXiv:2106.09526v1 [cs.LG])
    (2 min) In this work we explore the information processing inside neural networks using logistic regression probes \cite{probes} and the saturation metric \cite{featurespace_saturation}. We show that problem difficulty and neural network capacity affect the predictive performance in an antagonistic manner, opening the possibility of detecting over- and under-parameterization of neural networks for a given task. We further show that the observed effects are independent from previously reported pathological patterns like the ``tail pattern'' described in \cite{featurespace_saturation}. Finally we are able to show that saturation patterns converge early during training, allowing for a quicker cycle time during analysis
    Accuracy, Interpretability, and Differential Privacy via Explainable Boosting. (arXiv:2106.09680v1 [cs.LG])
    (2 min) We show that adding differential privacy to Explainable Boosting Machines (EBMs), a recent method for training interpretable ML models, yields state-of-the-art accuracy while protecting privacy. Our experiments on multiple classification and regression datasets show that DP-EBM models suffer surprisingly little accuracy loss even with strong differential privacy guarantees. In addition to high accuracy, two other benefits of applying DP to EBMs are: a) trained models provide exact global and local interpretability, which is often important in settings where differential privacy is needed; and b) the models can be edited after training without loss of privacy to correct errors which DP noise may have introduced.
    Sparse bottleneck neural networks for exploratory non-linear visualization of Patch-seq data. (arXiv:2006.10411v2 [cs.LG] UPDATED)
    (2 min) Patch-seq, a recently developed experimental technique, allows neuroscientists to obtain transcriptomic and electrophysiological information from the same neurons. Efficiently analyzing and visualizing such paired multivariate data in order to extract biologically meaningful interpretations has, however, remained a challenge. Here, we use sparse deep neural networks with a two-dimensional bottleneck and group lasso penalty to predict electrophysiological features from the transcriptomic ones, yielding concise and biologically interpretable two-dimensional visualizations. In two large example data sets, this visualization reveals known neural classes and their marker genes without biological prior knowledge.
    Automatic Main Character Recognition for Photographic Studies. (arXiv:2106.09064v1 [cs.CV])
    (2 min) Main characters in images are the most important humans that catch the viewer's attention upon first look, and they are emphasized by properties such as size, position, color saturation, and sharpness of focus. Identifying the main character in images plays an important role in traditional photographic studies and media analysis, but the task is performed manually and can be slow and laborious. Furthermore, selection of main characters can be sometimes subjective. In this paper, we analyze the feasibility of solving the main character recognition needed for photographic studies automatically and propose a method for identifying the main characters. The proposed method uses machine learning based human pose estimation along with traditional computer vision approaches for this task. We approach the task as a binary classification problem where each detected human is classified either as a main character or not. To evaluate both the subjectivity of the task and the performance of our method, we collected a dataset of 300 varying images from multiple sources and asked five people, a photographic researcher and four other persons, to annotate the main characters. Our analysis showed a relatively high agreement between different annotators. The proposed method achieved a promising F1 score of 0.83 on the full image set and 0.96 on a subset evaluated as most clear and important cases by the photographic researcher.
    Predicting cognitive scores with graph neural networks through sample selection learning. (arXiv:2106.09408v1 [cs.LG])
    (2 min) Analyzing the relation between intelligence and neural activity is of the utmost importance in understanding the working principles of the human brain in health and disease. In existing literature, functional brain connectomes have been used successfully to predict cognitive measures such as intelligence quotient (IQ) scores in both healthy and disordered cohorts using machine learning models. However, existing methods resort to flattening the brain connectome (i.e., graph) through vectorization which overlooks its topological properties. To address this limitation and inspired from the emerging graph neural networks (GNNs), we design a novel regression GNN model (namely RegGNN) for predicting IQ scores from brain connectivity. On top of that, we introduce a novel, fully modular sample selection method to select the best samples to learn from for our target prediction task. However, since such deep learning architectures are computationally expensive to train, we further propose a \emph{learning-based sample selection} method that learns how to choose the training samples with the highest expected predictive power on unseen samples. For this, we capitalize on the fact that connectomes (i.e., their adjacency matrices) lie in the symmetric positive definite (SPD) matrix cone. Our results on full-scale and verbal IQ prediction outperforms comparison methods in autism spectrum disorder cohorts and achieves a competitive performance for neurotypical subjects using 3-fold cross-validation. Furthermore, we show that our sample selection approach generalizes to other learning-based methods, which shows its usefulness beyond our GNN architecture.
    Physics-informed CoKriging model of a redox flow battery. (arXiv:2106.09188v1 [physics.chem-ph])
    (2 min) Redox flow batteries (RFBs) offer the capability to store large amounts of energy cheaply and efficiently, however, there is a need for fast and accurate models of the charge-discharge curve of a RFB to potentially improve the battery capacity and performance. We develop a multifidelity model for predicting the charge-discharge curve of a RFB. In the multifidelity model, we use the Physics-informed CoKriging (CoPhIK) machine learning method that is trained on experimental data and constrained by the so-called "zero-dimensional" physics-based model. Here we demonstrate that the model shows good agreement with experimental results and significant improvements over existing zero-dimensional models. We show that the proposed model is robust as it is not sensitive to the input parameters in the zero-dimensional model. We also show that only a small amount of high-fidelity experimental datasets are needed for accurate predictions for the range of considered input parameters, which include current density, flow rate, and initial concentrations.
    BABEL: Bodies, Action and Behavior with English Labels. (arXiv:2106.09696v1 [cs.CV])
    (2 min) Understanding the semantics of human movement -- the what, how and why of the movement -- is an important problem that requires datasets of human actions with semantic labels. Existing datasets take one of two approaches. Large-scale video datasets contain many action labels but do not contain ground-truth 3D human motion. Alternatively, motion-capture (mocap) datasets have precise body motions but are limited to a small number of actions. To address this, we present BABEL, a large dataset with language labels describing the actions being performed in mocap sequences. BABEL consists of action labels for about 43 hours of mocap sequences from AMASS. Action labels are at two levels of abstraction -- sequence labels describe the overall action in the sequence, and frame labels describe all actions in every frame of the sequence. Each frame label is precisely aligned with the duration of the corresponding action in the mocap sequence, and multiple actions can overlap. There are over 28k sequence labels, and 63k frame labels in BABEL, which belong to over 250 unique action categories. Labels from BABEL can be leveraged for tasks like action recognition, temporal action localization, motion synthesis, etc. To demonstrate the value of BABEL as a benchmark, we evaluate the performance of models on 3D action recognition. We demonstrate that BABEL poses interesting learning challenges that are applicable to real-world scenarios, and can serve as a useful benchmark of progress in 3D action recognition. The dataset, baseline method, and evaluation code is made available, and supported for academic research purposes at https://babel.is.tue.mpg.de/.
    On the Power of Preconditioning in Sparse Linear Regression. (arXiv:2106.09207v1 [cs.LG])
    (2 min) Sparse linear regression is a fundamental problem in high-dimensional statistics, but strikingly little is known about how to efficiently solve it without restrictive conditions on the design matrix. We consider the (correlated) random design setting, where the covariates are independently drawn from a multivariate Gaussian $N(0,\Sigma)$ with $\Sigma : n \times n$, and seek estimators $\hat{w}$ minimizing $(\hat{w}-w^*)^T\Sigma(\hat{w}-w^*)$, where $w^*$ is the $k$-sparse ground truth. Information theoretically, one can achieve strong error bounds with $O(k \log n)$ samples for arbitrary $\Sigma$ and $w^*$; however, no efficient algorithms are known to match these guarantees even with $o(n)$ samples, without further assumptions on $\Sigma$ or $w^*$. As far as hardness, computational lower bounds are only known with worst-case design matrices. Random-design instances are known which are hard for the Lasso, but these instances can generally be solved by Lasso after a simple change-of-basis (i.e. preconditioning). In this work, we give upper and lower bounds clarifying the power of preconditioning in sparse linear regression. First, we show that the preconditioned Lasso can solve a large class of sparse linear regression problems nearly optimally: it succeeds whenever the dependency structure of the covariates, in the sense of the Markov property, has low treewidth -- even if $\Sigma$ is highly ill-conditioned. Second, we construct (for the first time) random-design instances which are provably hard for an optimally preconditioned Lasso. In fact, we complete our treewidth classification by proving that for any treewidth-$t$ graph, there exists a Gaussian Markov Random Field on this graph such that the preconditioned Lasso, with any choice of preconditioner, requires $\Omega(t^{1/20})$ samples to recover $O(\log n)$-sparse signals when covariates are drawn from this model.
    Towards Understanding Deep Learning from Noisy Labels with Small-Loss Criterion. (arXiv:2106.09291v1 [cs.LG])
    (2 min) Deep neural networks need large amounts of labeled data to achieve good performance. In real-world applications, labels are usually collected from non-experts such as crowdsourcing to save cost and thus are noisy. In the past few years, deep learning methods for dealing with noisy labels have been developed, many of which are based on the small-loss criterion. However, there are few theoretical analyses to explain why these methods could learn well from noisy labels. In this paper, we theoretically explain why the widely-used small-loss criterion works. Based on the explanation, we reformalize the vanilla small-loss criterion to better tackle noisy labels. The experimental results verify our theoretical explanation and also demonstrate the effectiveness of the reformalization.
    Amortized Auto-Tuning: Cost-Efficient Transfer Optimization for Hyperparameter Recommendation. (arXiv:2106.09179v1 [cs.LG])
    (2 min) With the surge in the number of hyperparameters and training times of modern machine learning models, hyperparameter tuning is becoming increasingly expensive. Although methods have been proposed to speed up tuning via knowledge transfer, they typically require the final performance of hyperparameters and do not focus on low-fidelity information. Nevertheless, this common practice is suboptimal and can incur an unnecessary use of resources. It is more cost-efficient to instead leverage the low-fidelity tuning observations to measure inter-task similarity and transfer knowledge from existing to new tasks accordingly. However, performing multi-fidelity tuning comes with its own challenges in the transfer setting: the noise in the additional observations and the need for performance forecasting. Therefore, we conduct a thorough analysis of the multi-task multi-fidelity Bayesian optimization framework, which leads to the best instantiation--amortized auto-tuning (AT2). We further present an offline-computed 27-task hyperparameter recommendation (HyperRec) database to serve the community. Extensive experiments on HyperRec and other real-world databases illustrate the effectiveness of our AT2 method.
    Deep generative modeling for probabilistic forecasting in power systems. (arXiv:2106.09370v1 [cs.LG])
    (2 min) Greater direct electrification of end-use sectors with a higher share of renewables is one of the pillars to power a carbon-neutral society by 2050. This study uses a recent deep learning technique, the normalizing flows, to produce accurate probabilistic forecasts that are crucial for decision-makers to face the new challenges in power systems applications. Through comprehensive empirical evaluations using the open data of the Global Energy Forecasting Competition 2014, we demonstrate that our methodology is competitive with other state-of-the-art deep learning generative models: generative adversarial networks and variational autoencoders. The models producing weather-based wind, solar power, and load scenarios are properly compared both in terms of forecast value, by considering the case study of an energy retailer, and quality using several complementary metrics.
    Contrastive Reinforcement Learning of Symbolic Reasoning Domains. (arXiv:2106.09146v1 [cs.AI])
    (2 min) Abstract symbolic reasoning, as required in domains such as mathematics and logic, is a key component of human intelligence. Solvers for these domains have important applications, especially to computer-assisted education. But learning to solve symbolic problems is challenging for machine learning algorithms. Existing models either learn from human solutions or use hand-engineered features, making them expensive to apply in new domains. In this paper, we instead consider symbolic domains as simple environments where states and actions are given as unstructured text, and binary rewards indicate whether a problem is solved. This flexible setup makes it easy to specify new domains, but search and planning become challenging. We introduce four environments inspired by the Mathematics Common Core Curriculum, and observe that existing Reinforcement Learning baselines perform poorly. We then present a novel learning algorithm, Contrastive Policy Learning (ConPoLe) that explicitly optimizes the InfoNCE loss, which lower bounds the mutual information between the current state and next states that continue on a path to the solution. ConPoLe successfully solves all four domains. Moreover, problem representations learned by ConPoLe enable accurate prediction of the categories of problems in a real mathematics curriculum. Our results suggest new directions for reinforcement learning in symbolic domains, as well as applications to mathematics education.
    Localized Uncertainty Attacks. (arXiv:2106.09222v1 [stat.ML])
    (2 min) The susceptibility of deep learning models to adversarial perturbations has stirred renewed attention in adversarial examples resulting in a number of attacks. However, most of these attacks fail to encompass a large spectrum of adversarial perturbations that are imperceptible to humans. In this paper, we present localized uncertainty attacks, a novel class of threat models against deterministic and stochastic classifiers. Under this threat model, we create adversarial examples by perturbing only regions in the inputs where a classifier is uncertain. To find such regions, we utilize the predictive uncertainty of the classifier when the classifier is stochastic or, we learn a surrogate model to amortize the uncertainty when it is deterministic. Unlike $\ell_p$ ball or functional attacks which perturb inputs indiscriminately, our targeted changes can be less perceptible. When considered under our threat model, these attacks still produce strong adversarial examples; with the examples retaining a greater degree of similarity with the inputs.
    Quantized Federated Learning under Transmission Delay and Outage Constraints. (arXiv:2106.09397v1 [cs.IT])
    (2 min) Federated learning (FL) has been recognized as a viable distributed learning paradigm which trains a machine learning model collaboratively with massive mobile devices in the wireless edge while protecting user privacy. Although various communication schemes have been proposed to expedite the FL process, most of them have assumed ideal wireless channels which provide reliable and lossless communication links between the server and mobile clients. Unfortunately, in practical systems with limited radio resources such as constraint on the training latency and constraints on the transmission power and bandwidth, transmission of a large number of model parameters inevitably suffers from quantization errors (QE) and transmission outage (TO). In this paper, we consider such non-ideal wireless channels, and carry out the first analysis showing that the FL convergence can be severely jeopardized by TO and QE, but intriguingly can be alleviated if the clients have uniform outage probabilities. These insightful results motivate us to propose a robust FL scheme, named FedTOE, which performs joint allocation of wireless resources and quantization bits across the clients to minimize the QE while making the clients have the same TO probability. Extensive experimental results are presented to show the superior performance of FedTOE for a deep learning-based classification task with transmission latency constraints.
    Scaling Laws for Acoustic Models. (arXiv:2106.09488v1 [eess.AS])
    (2 min) There is a recent trend in machine learning to increase model quality by growing models to sizes previously thought to be unreasonable. Recent work has shown that autoregressive generative models with cross-entropy objective functions exhibit smooth power-law relationships, or scaling laws, that predict model quality from model size, training set size, and the available compute budget. These scaling laws allow one to choose nearly optimal hyper-parameters given constraints on available training data, model parameter count, or training computation budget. In this paper, we demonstrate that acoustic models trained with an auto-predictive coding loss behave as if they are subject to similar scaling laws. We extend previous work to jointly predict loss due to model size, to training set size, and to the inherent "irreducible loss" of the task. We find that the scaling laws accurately match model performance over two orders of magnitude in both model size and training set size, and make predictions about the limits of model performance.
    Biomedical Interpretable Entity Representations. (arXiv:2106.09502v1 [cs.CL])
    (2 min) Pre-trained language models induce dense entity representations that offer strong performance on entity-centric NLP tasks, but such representations are not immediately interpretable. This can be a barrier to model uptake in important domains such as biomedicine. There has been recent work on general interpretable representation learning (Onoe and Durrett, 2020), but these domain-agnostic representations do not readily transfer to the important domain of biomedicine. In this paper, we create a new entity type system and training set from a large corpus of biomedical texts by mapping entities to concepts in a medical ontology, and from these to Wikipedia pages whose categories are our types. From this mapping we derive Biomedical Interpretable Entity Representations(BIERs), in which dimensions correspond to fine-grained entity types, and values are predicted probabilities that a given entity is of the corresponding type. We propose a novel method that exploits BIER's final sparse and intermediate dense representations to facilitate model and entity type debugging. We show that BIERs achieve strong performance in biomedical tasks including named entity disambiguation and entity label classification, and we provide error analysis to highlight the utility of their interpretability, particularly in low-supervision settings. Finally, we provide our induced 68K biomedical type system, the corresponding 37 million triples of derived data used to train BIER models and our best performing model.
    Coded Federated Learning Framework for AI-Based Mobile Application Services with Privacy-Awareness. (arXiv:2106.09261v1 [cs.NI])
    (2 min) By encoding computing tasks, coded computing can not only mitigate straggling problems in federated learning (FL), but also preserve privacy of sensitive data uploaded/contributed by participating mobile users (MUs) to the centralized server, owned by a mobile application provider (MAP). However, these advantages come with extra coding cost/complexity and communication overhead (referred to as \emph{privacy cost}) that must be considered given the limited computing/communications resources at MUs/MAP, the rationality and incentive competition among MUs in contributing data to the MAP. This article proposes a novel coded FL-based framework for a privacy-aware mobile application service to address these challenges. In particular, the MAP first determines a set of the best MUs for the FL process based on MUs' provided information/features. Then, each selected MU can propose a contract to the MAP according to its expected trainable local data and privacy-protected coded data. To find the optimal contracts that can maximize utilities of the MAP and all the participating MUs while maintaining high learning quality of the whole system, we first develop a multi-principal one-agent contract-based problem leveraging coded FL-based multiple utility functions under the MUs' privacy cost, the MAP's limited computing resource, and asymmetric information between the MAP and MUs. Then, we transform the problem into an equivalent low-complexity problem and develop an iterative algorithm to solve it. Experiments with a real-world dataset show that our framework can speed up training time up to 49% and improve prediction accuracy up to 4.6 times while enhancing network's social welfare, i.e., total utility of all participating entities, up to 114% under the privacy cost consideration compared with those of baseline methods.
    EEG-GNN: Graph Neural Networks for Classification of Electroencephalogram (EEG) Signals. (arXiv:2106.09135v1 [cs.LG])
    (2 min) Convolutional neural networks (CNN) have been frequently used to extract subject-invariant features from electroencephalogram (EEG) for classification tasks. This approach holds the underlying assumption that electrodes are equidistant analogous to pixels of an image and hence fails to explore/exploit the complex functional neural connectivity between different electrode sites. We overcome this limitation by tailoring the concepts of convolution and pooling applied to 2D grid-like inputs for the functional network of electrode sites. Furthermore, we develop various graph neural network (GNN) models that project electrodes onto the nodes of a graph, where the node features are represented as EEG channel samples collected over a trial, and nodes can be connected by weighted/unweighted edges according to a flexible policy formulated by a neuroscientist. The empirical evaluations show that our proposed GNN-based framework outperforms standard CNN classifiers across ErrP, and RSVP datasets, as well as allowing neuroscientific interpretability and explainability to deep learning methods tailored to EEG related classification problems. Another practical advantage of our GNN-based framework is that it can be used in EEG channel selection, which is critical for reducing computational cost, and designing portable EEG headsets.
    Federated CycleGAN for Privacy-Preserving Image-to-Image Translation. (arXiv:2106.09246v1 [cs.CV])
    (2 min) Unsupervised image-to-image translation methods such as CycleGAN learn to convert images from one domain to another using unpaired training data sets from different domains. Unfortunately, these approaches still require centrally collected unpaired records, potentially violating privacy and security issues. Although the recent federated learning (FL) allows a neural network to be trained without data exchange, the basic assumption of the FL is that all clients have their own training data from a similar domain, which is different from our image-to-image translation scenario in which each client has images from its unique domain and the goal is to learn image translation between different domains without accessing the target domain data. To address this, here we propose a novel federated CycleGAN architecture that can learn image translation in an unsupervised manner while maintaining the data privacy. Specifically, our approach arises from a novel observation that CycleGAN loss can be decomposed into the sum of client specific local objectives that can be evaluated using only their data. This local objective decomposition allows multiple clients to participate in federated CycleGAN training without sacrificing performance. Furthermore, our method employs novel switchable generator and discriminator architecture using Adaptive Instance Normalization (AdaIN) that significantly reduces the band-width requirement of the federated learning. Our experimental results on various unsupervised image translation tasks show that our federated CycleGAN provides comparable performance compared to the non-federated counterpart.
    Large Scale Private Learning via Low-rank Reparametrization. (arXiv:2106.09352v1 [cs.LG])
    (2 min) We propose a reparametrization scheme to address the challenges of applying differentially private SGD on large neural networks, which are 1) the huge memory cost of storing individual gradients, 2) the added noise suffering notorious dimensional dependence. Specifically, we reparametrize each weight matrix with two \emph{gradient-carrier} matrices of small dimension and a \emph{residual weight} matrix. We argue that such reparametrization keeps the forward/backward process unchanged while enabling us to compute the projected gradient without computing the gradient itself. To learn with differential privacy, we design \emph{reparametrized gradient perturbation (RGP)} that perturbs the gradients on gradient-carrier matrices and reconstructs an update for the original weight from the noisy gradients. Importantly, we use historical updates to find the gradient-carrier matrices, whose optimality is rigorously justified under linear regression and empirically verified with deep learning tasks. RGP significantly reduces the memory cost and improves the utility. For example, we are the first able to apply differential privacy on the BERT model and achieve an average accuracy of $83.9\%$ on four downstream tasks with $\epsilon=8$, which is within $5\%$ loss compared to the non-private baseline but enjoys much lower privacy leakage risk.
    SPeCiaL: Self-Supervised Pretraining for Continual Learning. (arXiv:2106.09065v1 [cs.CV])
    (2 min) This paper presents SPeCiaL: a method for unsupervised pretraining of representations tailored for continual learning. Our approach devises a meta-learning objective that differentiates through a sequential learning process. Specifically, we train a linear model over the representations to match different augmented views of the same image together, each view presented sequentially. The linear model is then evaluated on both its ability to classify images it just saw, and also on images from previous iterations. This gives rise to representations that favor quick knowledge retention with minimal forgetting. We evaluate SPeCiaL in the Continual Few-Shot Learning setting, and show that it can match or outperform other supervised pretraining approaches.
    CROP: Certifying Robust Policies for Reinforcement Learning through Functional Smoothing. (arXiv:2106.09292v1 [cs.LG])
    (2 min) We present the first framework of Certifying Robust Policies for reinforcement learning (CROP) against adversarial state perturbations. We propose two particular types of robustness certification criteria: robustness of per-state actions and lower bound of cumulative rewards. Specifically, we develop a local smoothing algorithm which uses a policy derived from Q-functions smoothed with Gaussian noise over each encountered state to guarantee the robustness of actions taken along this trajectory. Next, we develop a global smoothing algorithm for certifying the robustness of a finite-horizon cumulative reward under adversarial state perturbations. Finally, we propose a local smoothing approach which makes use of adaptive search in order to obtain tight certification bounds for reward. We use the proposed RL robustness certification framework to evaluate six methods that have previously been shown to yield empirically robust RL, including adversarial training and several forms of regularization, on two representative Atari games. We show that RegPGD, RegCVX, and RadialRL achieve high certified robustness among these. Furthermore, we demonstrate that our certifications are often tight by evaluating these algorithms against adversarial attacks.
    Unsupervised Path Representation Learning with Curriculum Negative Sampling. (arXiv:2106.09373v1 [cs.LG])
    (2 min) Path representations are critical in a variety of transportation applications, such as estimating path ranking in path recommendation systems and estimating path travel time in navigation systems. Existing studies often learn task-specific path representations in a supervised manner, which require a large amount of labeled training data and generalize poorly to other tasks. We propose an unsupervised learning framework Path InfoMax (PIM) to learn generic path representations that work for different downstream tasks. We first propose a curriculum negative sampling method, for each input path, to generate a small amount of negative paths, by following the principles of curriculum learning. Next, \emph{PIM} employs mutual information maximization to learn path representations from both a global and a local view. In the global view, PIM distinguishes the representations of the input paths from those of the negative paths. In the local view, \emph{PIM} distinguishes the input path representations from the representations of the nodes that appear only in the negative paths. This enables the learned path representations to encode both global and local information at different scales. Extensive experiments on two downstream tasks, ranking score estimation and travel time estimation, using two road network datasets suggest that PIM significantly outperforms other unsupervised methods and is also able to be used as a pre-training method to enhance supervised path representation learning.
    Multi-Label Learning from Single Positive Labels. (arXiv:2106.09708v1 [cs.CV])
    (2 min) Predicting all applicable labels for a given image is known as multi-label classification. Compared to the standard multi-class case (where each image has only one label), it is considerably more challenging to annotate training data for multi-label classification. When the number of potential labels is large, human annotators find it difficult to mention all applicable labels for each training image. Furthermore, in some settings detection is intrinsically difficult e.g. finding small object instances in high resolution images. As a result, multi-label training data is often plagued by false negatives. We consider the hardest version of this problem, where annotators provide only one relevant label for each image. As a result, training sets will have only one positive label per image and no confirmed negatives. We explore this special case of learning from missing labels across four different multi-label image classification datasets for both linear classifiers and end-to-end fine-tuned deep networks. We extend existing multi-label losses to this setting and propose novel variants that constrain the number of expected positive labels during training. Surprisingly, we show that in some cases it is possible to approach the performance of fully labeled classifiers despite training with significantly fewer confirmed labels.
    Interpretable Machine Learning Classifiers for Brain Tumour Survival Prediction. (arXiv:2106.09424v1 [cs.LG])
    (2 min) Prediction of survival in patients diagnosed with a brain tumour is challenging because of heterogeneous tumour behaviours and responses to treatment. Better estimations of prognosis would support treatment planning and patient support. Advances in machine learning have informed development of clinical predictive models, but their integration into clinical practice is almost non-existent. One reasons for this is the lack of interpretability of models. In this paper, we use a novel brain tumour dataset to compare two interpretable rule list models against popular machine learning approaches for brain tumour survival prediction. All models are quantitatively evaluated using standard performance metrics. The rule lists are also qualitatively assessed for their interpretability and clinical utility. The interpretability of the black box machine learning models is evaluated using two post-hoc explanation techniques, LIME and SHAP. Our results show that the rule lists were only slightly outperformed by the black box models. We demonstrate that rule list algorithms produced simple decision lists that align with clinical expertise. By comparison, post-hoc interpretability methods applied to black box models may produce unreliable explanations of local model predictions. Model interpretability is essential for understanding differences in predictive performance and for integration into clinical practice.
    Exponential Error Convergence in Data Classification with Optimized Random Features: Acceleration by Quantum Machine Learning. (arXiv:2106.09028v1 [quant-ph])
    (2 min) Random features are a central technique for scalable learning algorithms based on kernel methods. A recent work has shown that an algorithm for machine learning by quantum computer, quantum machine learning (QML), can exponentially speed up sampling of optimized random features, even without imposing restrictive assumptions on sparsity and low-rankness of matrices that had limited applicability of conventional QML algorithms; this QML algorithm makes it possible to significantly reduce and provably minimize the required number of features for regression tasks. However, a major interest in the field of QML is how widely the advantages of quantum computation can be exploited, not only in the regression tasks. We here construct a QML algorithm for a classification task accelerated by the optimized random features. We prove that the QML algorithm for sampling optimized random features, combined with stochastic gradient descent (SGD), can achieve state-of-the-art exponential convergence speed of reducing classification error in a classification task under a low-noise condition; at the same time, our algorithm with optimized random features can take advantage of the significant reduction of the required number of features so as to accelerate each iteration in the SGD and evaluation of the classifier obtained from our algorithm. These results discover a promising application of QML to significant acceleration of the leading classification algorithm based on kernel methods, without ruining its applicability to a practical class of data sets and the exponential error-convergence speed.
    Voice2Series: Reprogramming Acoustic Models for Time Series Classification. (arXiv:2106.09296v1 [cs.LG])
    (2 min) Learning to classify time series with limited data is a practical yet challenging problem. Current methods are primarily based on hand-designed feature extraction rules or domain-specific data augmentation. Motivated by the advances in deep speech processing models and the fact that voice data are univariate temporal signals, in this paper, we propose Voice2Series (V2S), a novel end-to-end approach that reprograms acoustic models for time series classification, through input transformation learning and output label mapping. Leveraging the representation learning power of a large-scale pre-trained speech processing model, on 30 different time series tasks we show that V2S either outperforms or is tied with state-of-the-art methods on 20 tasks, and improves their average accuracy by 1.84%. We further provide a theoretical justification of V2S by proving its population risk is upper bounded by the source risk and a Wasserstein distance accounting for feature alignment via reprogramming. Our results offer new and effective means to time series classification.
    Invisible for both Camera and LiDAR: Security of Multi-Sensor Fusion based Perception in Autonomous Driving Under Physical-World Attacks. (arXiv:2106.09249v1 [cs.CR])
    (3 min) In Autonomous Driving (AD) systems, perception is both security and safety critical. Despite various prior studies on its security issues, all of them only consider attacks on camera- or LiDAR-based AD perception alone. However, production AD systems today predominantly adopt a Multi-Sensor Fusion (MSF) based design, which in principle can be more robust against these attacks under the assumption that not all fusion sources are (or can be) attacked at the same time. In this paper, we present the first study of security issues of MSF-based perception in AD systems. We directly challenge the basic MSF design assumption above by exploring the possibility of attacking all fusion sources simultaneously. This allows us for the first time to understand how much security guarantee MSF can fundamentally provide as a general defense strategy for AD perception. We formulate the attack as an optimization problem to generate a physically-realizable, adversarial 3D-printed object that misleads an AD system to fail in detecting it and thus crash into it. We propose a novel attack pipeline that addresses two main design challenges: (1) non-differentiable target camera and LiDAR sensing systems, and (2) non-differentiable cell-level aggregated features popularly used in LiDAR-based AD perception. We evaluate our attack on MSF included in representative open-source industry-grade AD systems in real-world driving scenarios. Our results show that the attack achieves over 90% success rate across different object types and MSF. Our attack is also found stealthy, robust to victim positions, transferable across MSF algorithms, and physical-world realizable after being 3D-printed and captured by LiDAR and camera devices. To concretely assess the end-to-end safety impact, we further perform simulation evaluation and show that it can cause a 100% vehicle collision rate for an industry-grade AD system.
    Federated Learning for Intrusion Detection System: Concepts, Challenges and Future Directions. (arXiv:2106.09527v1 [cs.CR])
    (2 min) The rapid development of the Internet and smart devices trigger surge in network traffic making its infrastructure more complex and heterogeneous. The predominated usage of mobile phones, wearable devices and autonomous vehicles are examples of distributed networks which generate huge amount of data each and every day. The computational power of these devices have also seen steady progression which has created the need to transmit information, store data locally and drive network computations towards edge devices. Intrusion detection systems play a significant role in ensuring security and privacy of such devices. Machine Learning and Deep Learning with Intrusion Detection Systems have gained great momentum due to their achievement of high classification accuracy. However the privacy and security aspects potentially gets jeopardised due to the need of storing and communicating data to centralized server. On the contrary, federated learning (FL) fits in appropriately as a privacy-preserving decentralized learning technique that does not transfer data but trains models locally and transfers the parameters to the centralized server. The present paper aims to present an extensive and exhaustive review on the use of FL in intrusion detection system. In order to establish the need for FL, various types of IDS, relevant ML approaches and its associated issues are discussed. The paper presents detailed overview of the implementation of FL in various aspects of anomaly detection. The allied challenges of FL implementations are also identified which provides idea on the scope of future direction of research. The paper finally presents the plausible solutions associated with the identified challenges in FL based intrusion detection system implementation acting as a baseline for prospective research.
    RAR-U-Net: a Residual Encoder to Attention Decoder by Residual Connections Framework for Spine Segmentation under Noisy Labels. (arXiv:2009.12873v4 [eess.IV] UPDATED)
    (2 min) Segmentation algorithms for medical images are widely studied for various clinical and research purposes. In this paper, we propose a new and efficient method for medical image segmentation under noisy labels. The method operates under a deep learning paradigm, incorporating four novel contributions. Firstly, a residual interconnection is explored in different scale encoders to transfer gradient information efficiently. Secondly, four copy-and-crop connections are replaced by residual-block-based concatenation to alleviate the disparity between encoders and decoders. Thirdly, convolutional attention modules for feature refinement are studied on all scale decoders. Finally, an adaptive denoising learning strategy (ADL) is introduced into the training process to avoid too much influence from the noisy labels. Experimental results are illustrated on a publicly available benchmark database of spine CTs. Our proposed method achieves competitive performance against other state-of-the-art methods over a variety of different evaluation measures.
    DeepSplit: Scalable Verification of Deep Neural Networks via Operator Splitting. (arXiv:2106.09117v1 [cs.LG])
    (2 min) Analyzing the worst-case performance of deep neural networks against input perturbations amounts to solving a large-scale non-convex optimization problem, for which several past works have proposed convex relaxations as a promising alternative. However, even for reasonably-sized neural networks, these relaxations are not tractable, and so must be replaced by even weaker relaxations in practice. In this work, we propose a novel operator splitting method that can directly solve a convex relaxation of the problem to high accuracy, by splitting it into smaller sub-problems that often have analytical solutions. The method is modular and scales to problem instances that were previously impossible to solve exactly due to their size. Furthermore, the solver operations are amenable to fast parallelization with GPU acceleration. We demonstrate our method in obtaining tighter bounds on the worst-case performance of large convolutional networks in image classification and reinforcement learning settings.
    A General Framework For Detecting Anomalous Inputs to DNN Classifiers. (arXiv:2007.15147v3 [cs.LG] UPDATED)
    (2 min) Detecting anomalous inputs, such as adversarial and out-of-distribution (OOD) inputs, is critical for classifiers (including deep neural networks or DNNs) deployed in real-world applications. While prior works have proposed various methods to detect such anomalous samples using information from the internal layer representations of a DNN, there is a lack of consensus on a principled approach for the different components of such a detection method. As a result, often heuristic and one-off methods are applied for different aspects of this problem. We propose an unsupervised anomaly detection framework based on the internal DNN layer representations in the form of a meta-algorithm with configurable components. We proceed to propose specific instantiations for each component of the meta-algorithm based on ideas grounded in statistical testing and anomaly detection. We evaluate the proposed methods on well-known image classification datasets with strong adversarial attacks and OOD inputs, including an adaptive attack that uses the internal layer representations of the DNN (often not considered in prior work). Comparisons with five recently-proposed competing detection methods demonstrates the effectiveness of our method in detecting adversarial and OOD inputs.
    Joining datasets via data augmentation in the label space for neural networks. (arXiv:2106.09260v1 [cs.LG])
    (2 min) Most, if not all, modern deep learning systems restrict themselves to a single dataset for neural network training and inference. In this article, we are interested in systematic ways to join datasets that are made of similar purposes. Unlike previous published works that ubiquitously conduct the dataset joining in the uninterpretable latent vectorial space, the core to our method is an augmentation procedure in the label space. The primary challenge to address the label space for dataset joining is the discrepancy between labels: non-overlapping label annotation sets, different labeling granularity or hierarchy and etc. Notably we propose a new technique leveraging artificially created knowledge graph, recurrent neural networks and policy gradient that successfully achieve the dataset joining in the label space. Empirical results on both image and text classification justify the validity of our approach.
    Automatic Construction of Evaluation Suites for Natural Language Generation Datasets. (arXiv:2106.09069v1 [cs.CL])
    (2 min) Machine learning approaches applied to NLP are often evaluated by summarizing their performance in a single number, for example accuracy. Since most test sets are constructed as an i.i.d. sample from the overall data, this approach overly simplifies the complexity of language and encourages overfitting to the head of the data distribution. As such, rare language phenomena or text about underrepresented groups are not equally included in the evaluation. To encourage more in-depth model analyses, researchers have proposed the use of multiple test sets, also called challenge sets, that assess specific capabilities of a model. In this paper, we develop a framework based on this idea which is able to generate controlled perturbations and identify subsets in text-to-scalar, text-to-text, or data-to-text settings. By applying this framework to the GEM generation benchmark, we propose an evaluation suite made of 80 challenge sets, demonstrate the kinds of analyses that it enables and shed light onto the limits of current generation models.
    CoANE: Modeling Context Co-occurrence for Attributed Network Embedding. (arXiv:2106.09241v1 [cs.SI])
    (2 min) Attributed network embedding (ANE) is to learn low-dimensional vectors so that not only the network structure but also node attributes can be preserved in the embedding space. Existing ANE models do not consider the specific combination between graph structure and attributes. While each node has its structural characteristics, such as highly-interconnected neighbors along with their certain patterns of attribute distribution, each node's neighborhood should be not only depicted by multi-hop nodes, but consider certain clusters or social circles. To model such information, in this paper, we propose a novel ANE model, Context Co-occurrence-aware Attributed Network Embedding (CoANE). The basic idea of CoANE is to model the context attributes that each node's involved diverse patterns, and apply the convolutional mechanism to encode positional information by treating each attribute as a channel. The learning of context co-occurrence can capture the latent social circles of each node. To better encode structural and semantic knowledge of nodes, we devise a three-way objective function, consisting of positive graph likelihood, contextual negative sampling, and attribute reconstruction. We conduct experiments on five real datasets in the tasks of link prediction, node label classification, and node clustering. The results exhibit that CoANE can significantly outperform state-of-the-art ANE models.
    Automatic Curricula via Expert Demonstrations. (arXiv:2106.09159v1 [cs.LG])
    (2 min) We propose Automatic Curricula via Expert Demonstrations (ACED), a reinforcement learning (RL) approach that combines the ideas of imitation learning and curriculum learning in order to solve challenging robotic manipulation tasks with sparse reward functions. Curriculum learning solves complicated RL tasks by introducing a sequence of auxiliary tasks with increasing difficulty, yet how to automatically design effective and generalizable curricula remains a challenging research problem. ACED extracts curricula from a small amount of expert demonstration trajectories by dividing demonstrations into sections and initializing training episodes to states sampled from different sections of demonstrations. Through moving the reset states from the end to the beginning of demonstrations as the learning agent improves its performance, ACED not only learns challenging manipulation tasks with unseen initializations and goals, but also discovers novel solutions that are distinct from the demonstrations. In addition, ACED can be naturally combined with other imitation learning methods to utilize expert demonstrations in a more efficient manner, and we show that a combination of ACED with behavior cloning allows pick-and-place tasks to be learned with as few as 1 demonstration and block stacking tasks to be learned with 20 demonstrations.
    Seeing Differently, Acting Similarly: Imitation Learning with Heterogeneous Observations. (arXiv:2106.09256v1 [cs.LG])
    (2 min) In many real-world imitation learning tasks, the demonstrator and the learner have to act in different but full observation spaces. This situation generates significant obstacles for existing imitation learning approaches to work, even when they are combined with traditional space adaptation techniques. The main challenge lies in bridging expert's occupancy measures to learner's dynamically changing occupancy measures under the different observation spaces. In this work, we model the above learning problem as Heterogeneous Observations Imitation Learning (HOIL). We propose the Importance Weighting with REjection (IWRE) algorithm based on the techniques of importance-weighting, learning with rejection, and active querying to solve the key challenge of occupancy measure matching. Experimental results show that IWRE can successfully solve HOIL tasks, including the challenging task of transforming the vision-based demonstrations to random access memory (RAM)-based policies under the Atari domain.
    Work in Progress: Mobile or FPGA? A Comprehensive Evaluation on Energy Efficiency and a Unified Optimization Framework. (arXiv:2106.09166v1 [cs.LG])
    (2 min) Efficient deployment of Deep Neural Networks (DNNs) on edge devices (i.e., FPGAs and mobile platforms) is very challenging, especially under a recent witness of the increasing DNN model size and complexity. Although various optimization approaches have been proven to be effective in many DNNs on edge devices, most state-of-the-art work focuses on ad-hoc optimizations, and there lacks a thorough study to comprehensively reveal the potentials and constraints of different edge devices when considering different optimizations. In this paper, we qualitatively and quantitatively compare the energy-efficiency of FPGA-based and mobile-based DNN executions, and provide detailed analysis.
    mPyPl: Python Monadic Pipeline Library for Complex Functional Data Processing. (arXiv:2106.09164v1 [cs.PL])
    (2 min) In this paper, we present a new Python library called mPyPl, which is intended to simplify complex data processing tasks using functional approach. This library defines operations on lazy data streams of named dictionaries represented as generators (so-called multi-field datastreams), and allows enriching those data streams with more 'fields' in the process of data preparation and feature extraction. Thus, most data preparation tasks can be expressed in the form of neat linear 'pipeline', similar in syntax to UNIX pipes, or |> functional composition operator in F#. We define basic operations on multi-field data streams, which resemble classical monadic operations, and show similarity of the proposed approach to monads in functional programming. We also show how the library was used in complex deep learning tasks of event detection in video, and discuss different evaluation strategies that allow for different compromises in terms of memory and performance.
    Frustratingly Easy Transferability Estimation. (arXiv:2106.09362v1 [cs.LG])
    (2 min) Transferability estimation has been an essential tool in selecting a pre-trained model and the layers of it to transfer, so as to maximize the performance on a target task and prevent negative transfer. Existing estimation algorithms either require intensive training on target tasks or have difficulties in evaluating the transferability between layers. We propose a simple, efficient, and effective transferability measure named TransRate. With single pass through the target data, TransRate measures the transferability as the mutual information between the features of target examples extracted by a pre-trained model and labels of them. We overcome the challenge of efficient mutual information estimation by resorting to coding rate that serves as an effective alternative to entropy. TransRate is theoretically analyzed to be closely related to the performance after transfer learning. Despite its extraordinary simplicity in 10 lines of codes, TransRate performs remarkably well in extensive evaluations on 22 pre-trained models and 16 downstream tasks.
    Unsupervised Video Prediction from a Single Frame by Estimating 3D Dynamic Scene Structure. (arXiv:2106.09051v1 [cs.CV])
    (2 min) Our goal in this work is to generate realistic videos given just one initial frame as input. Existing unsupervised approaches to this task do not consider the fact that a video typically shows a 3D environment, and that this should remain coherent from frame to frame even as the camera and objects move. We address this by developing a model that first estimates the latent 3D structure of the scene, including the segmentation of any moving objects. It then predicts future frames by simulating the object and camera dynamics, and rendering the resulting views. Importantly, it is trained end-to-end using only the unsupervised objective of predicting future frames, without any 3D information nor segmentation annotations. Experiments on two challenging datasets of natural videos show that our model can estimate 3D structure and motion segmentation from a single frame, and hence generate plausible and varied predictions.
    Zeroth-Order Methods for Convex-Concave Minmax Problems: Applications to Decision-Dependent Risk Minimization. (arXiv:2106.09082v1 [math.OC])
    (2 min) Min-max optimization is emerging as a key framework for analyzing problems of robustness to strategically and adversarially generated data. We propose a random reshuffling-based gradient free Optimistic Gradient Descent-Ascent algorithm for solving convex-concave min-max problems with finite sum structure. We prove that the algorithm enjoys the same convergence rate as that of zeroth-order algorithms for convex minimization problems. We further specialize the algorithm to solve distributionally robust, decision-dependent learning problems, where gradient information is not readily available. Through illustrative simulations, we observe that our proposed approach learns models that are simultaneously robust against adversarial distribution shifts and strategic decisions from the data sources, and outperforms existing methods from the strategic classification literature.
    LiRA: Learning Visual Speech Representations from Audio through Self-supervision. (arXiv:2106.09171v1 [cs.LG])
    (2 min) The large amount of audiovisual content being shared online today has drawn substantial attention to the prospect of audiovisual self-supervised learning. Recent works have focused on each of these modalities separately, while others have attempted to model both simultaneously in a cross-modal fashion. However, comparatively little attention has been given to leveraging one modality as a training objective to learn from the other. In this work, we propose Learning visual speech Representations from Audio via self-supervision (LiRA). Specifically, we train a ResNet+Conformer model to predict acoustic features from unlabelled visual speech. We find that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments. We show that our approach significantly outperforms other self-supervised methods on the Lip Reading in the Wild (LRW) dataset and achieves state-of-the-art performance on Lip Reading Sentences 2 (LRS2) using only a fraction of the total labelled data.

2021-06-17

  • cs.CL updates on arXiv.org

    Topic Classification on Spoken Documents Using Deep Acoustic and Linguistic Features. (arXiv:2106.08637v1 [cs.CL])
    (2 min) Topic classification systems on spoken documents usually consist of two modules: an automatic speech recognition (ASR) module to convert speech into text and a text topic classification (TTC) module to predict the topic class from the decoded text. In this paper, instead of using the ASR transcripts, the fusion of deep acoustic and linguistic features is used for topic classification on spoken documents. More specifically, a conventional CTC-based acoustic model (AM) using phonemes as output units is first trained, and the outputs of the layer before the linear phoneme classifier in the trained AM are used as the deep acoustic features of spoken documents. Furthermore, these deep acoustic features are fed to a phoneme-to-word (P2W) module to obtain deep linguistic features. Finally, a local multi-head attention module is proposed to fuse these two types of deep features for topic classification. Experiments conducted on a subset selected from Switchboard corpus show that our proposed framework outperforms the conventional ASR+TTC systems and achieves a 3.13% improvement in ACC.
    Augmenting Part-of-speech Tagging with Syntactic Information for Vietnamese and Chinese. (arXiv:2102.12136v2 [cs.CL] UPDATED)
    (2 min) Word segmentation and part-of-speech tagging are two critical preliminary steps for downstream tasks in Vietnamese natural language processing. In reality, people tend to consider also the phrase boundary when performing word segmentation and part of speech tagging rather than solely process word by word from left to right. In this paper, we implement this idea to improve word segmentation and part of speech tagging the Vietnamese language by employing a simplified constituency parser. Our neural model for joint word segmentation and part-of-speech tagging has the architecture of the syllable-based CRF constituency parser. To reduce the complexity of parsing, we replace all constituent labels with a single label indicating for phrases. This model can be augmented with predicted word boundary and part-of-speech tags by other tools. Because Vietnamese and Chinese have some similar linguistic phenomena, we evaluated the proposed model and its augmented versions on three Vietnamese benchmark datasets and six Chinese benchmark datasets. Our experimental results show that the proposed model achieves higher performances than previous works for both languages.
    On the long-term learning ability of LSTM LMs. (arXiv:2106.08927v1 [cs.CL])
    (2 min) We inspect the long-term learning ability of Long Short-Term Memory language models (LSTM LMs) by evaluating a contextual extension based on the Continuous Bag-of-Words (CBOW) model for both sentence- and discourse-level LSTM LMs and by analyzing its performance. We evaluate on text and speech. Sentence-level models using the long-term contextual module perform comparably to vanilla discourse-level LSTM LMs. On the other hand, the extension does not provide gains for discourse-level models. These findings indicate that discourse-level LSTM LMs already rely on contextual information to perform long-term learning.
    Semantic sentence similarity: size does not always matter. (arXiv:2106.08648v1 [cs.CL])
    (2 min) This study addresses the question whether visually grounded speech recognition (VGS) models learn to capture sentence semantics without access to any prior linguistic knowledge. We produce synthetic and natural spoken versions of a well known semantic textual similarity database and show that our VGS model produces embeddings that correlate well with human semantic similarity judgements. Our results show that a model trained on a small image-caption database outperforms two models trained on much larger databases, indicating that database size is not all that matters. We also investigate the importance of having multiple captions per image and find that this is indeed helpful even if the total number of images is lower, suggesting that paraphrasing is a valuable learning signal. While the general trend in the field is to create ever larger datasets to train models on, our findings indicate other characteristics of the database can just as important important.
    A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods. (arXiv:2106.08829v1 [cs.SI])
    (2 min) Opinion and sentiment analysis is a vital task to characterize subjective information in social media posts. In this paper, we present a comprehensive experimental evaluation and comparison with six state-of-the-art methods, from which we have re-implemented one of them. In addition, we investigate different textual and visual feature embeddings that cover different aspects of the content, as well as the recently introduced multimodal CLIP embeddings. Experimental results are presented for two different publicly available benchmark datasets of tweets and corresponding images. In contrast to the evaluation methodology of previous work, we introduce a reproducible and fair evaluation scheme to make results comparable. Finally, we conduct an error analysis to outline the limitations of the methods and possibilities for the future work.
    PRASEMap: A Probabilistic Reasoning and Semantic Embedding based Knowledge Graph Alignment System. (arXiv:2106.08801v1 [cs.CL])
    (2 min) Knowledge Graph (KG) alignment aims at finding equivalent entities and relations (i.e., mappings) between two KGs. The existing approaches utilize either reasoning-based or semantic embedding-based techniques, but few studies explore their combination. In this demonstration, we present PRASEMap, an unsupervised KG alignment system that iteratively computes the Mappings with both Probabilistic Reasoning (PR) And Semantic Embedding (SE) techniques. PRASEMap can support various embedding-based KG alignment approaches as the SE module, and enables easy human computer interaction that additionally provides an option for users to feed the mapping annotations back to the system for better results. The demonstration showcases these features via a stand-alone Web application with user friendly interfaces.
    QuatDE: Dynamic Quaternion Embedding for Knowledge Graph Completion. (arXiv:2105.09002v2 [cs.CL] UPDATED)
    (2 min) Knowledge graph embedding has been an active research topic for knowledge base completion (KGC), with progressive improvement from the initial TransE, TransH, RotatE et al to the current state-of-the-art QuatE. However, QuatE ignores the multi-faceted nature of the entity and the complexity of the relation, only using rigorous operation on quaternion space to capture the interaction between entitiy pair and relation, leaving opportunities for better knowledge representation which will finally help KGC. In this paper, we propose a novel model, QuatDE, with a dynamic mapping strategy to explicitly capture the variety of relational patterns and separate different semantic information of the entity, using transition vectors to adjust the point position of the entity embedding vectors in the quaternion space via Hamilton product, enhancing the feature interaction capability between elements of the triplet. Experiment results show QuatDE achieves state-of-the-art performance on three well-established knowledge graph completion benchmarks. In particular, the MR evaluation has relatively increased by 26% on WN18 and 15% on WN18RR, which proves the generalization of QuatDE.
    Grounding Spatio-Temporal Language with Transformers. (arXiv:2106.08858v1 [cs.AI])
    (2 min) Language is an interface to the outside world. In order for embodied agents to use it, language must be grounded in other, sensorimotor modalities. While there is an extended literature studying how machines can learn grounded language, the topic of how to learn spatio-temporal linguistic concepts is still largely uncharted. To make progress in this direction, we here introduce a novel spatio-temporal language grounding task where the goal is to learn the meaning of spatio-temporal descriptions of behavioral traces of an embodied agent. This is achieved by training a truth function that predicts if a description matches a given history of observations. The descriptions involve time-extended predicates in past and present tense as well as spatio-temporal references to objects in the scene. To study the role of architectural biases in this task, we train several models including multimodal Transformer architectures; the latter implement different attention computations between words and objects across space and time. We test models on two classes of generalization: 1) generalization to randomly held-out sentences; 2) generalization to grammar primitives. We observe that maintaining object identity in the attention computation of our Transformers is instrumental to achieving good performance on generalization overall, and that summarizing object traces in a single token has little influence on performance. We then discuss how this opens new perspectives for language-guided autonomous embodied agents. We also release our code under open-source license as well as pretrained models and datasets to encourage the wider community to build upon and extend our work in the future.
    Do Acoustic Word Embeddings Capture Phonological Similarity? An Empirical Study. (arXiv:2106.08686v1 [cs.CL])
    (2 min) Several variants of deep neural networks have been successfully employed for building parametric models that project variable-duration spoken word segments onto fixed-size vector representations, or acoustic word embeddings (AWEs). However, it remains unclear to what degree we can rely on the distance in the emerging AWE space as an estimate of word-form similarity. In this paper, we ask: does the distance in the acoustic embedding space correlate with phonological dissimilarity? To answer this question, we empirically investigate the performance of supervised approaches for AWEs with different neural architectures and learning objectives. We train AWE models in controlled settings for two languages (German and Czech) and evaluate the embeddings on two tasks: word discrimination and phonological similarity. Our experiments show that (1) the distance in the embedding space in the best cases only moderately correlates with phonological distance, and (2) improving the performance on the word discrimination task does not necessarily yield models that better reflect word phonological similarity. Our findings highlight the necessity to rethink the current intrinsic evaluations for AWEs.
    BANG: Bridging Autoregressive and Non-autoregressive Generation with Large Scale Pretraining. (arXiv:2012.15525v3 [cs.CL] UPDATED)
    (2 min) In this paper, we propose BANG, a new pretraining model to Bridge the gap between Autoregressive (AR) and Non-autoregressive (NAR) Generation. AR and NAR generation can be uniformly regarded as to what extent previous tokens can be attended, and BANG bridges AR and NAR generation by designing a novel model structure for large-scale pretraining. The pretrained BANG model can simultaneously support AR, NAR and semi-NAR generation to meet different requirements. Experiments on question generation (SQuAD 1.1), summarization (XSum) and dialogue generation (PersonaChat) show that BANG improves NAR and semi-NAR performance significantly as well as attaining comparable performance with strong AR pretrained models. Compared with the semi-NAR strong baselines, BANG achieves absolute improvements of 14.01 and 5.24 in the overall scores of SQuAD 1.1 and XSum, respectively. In addition, BANG achieves absolute improvements of 10.73, 6.39 and 5.90 in the overall scores of SQuAD, XSUM and PersonaChat respectively compared with the strong NAR baselines.
    Reflective Decoding: Beyond Unidirectional Generation with Off-the-Shelf Language Models. (arXiv:2010.08566v3 [cs.CL] UPDATED)
    (2 min) Publicly available, large pretrained LanguageModels (LMs) generate text with remarkable quality, but only sequentially from left to right. As a result, they are not immediately applicable to generation tasks that break the unidirectional assumption, such as paraphrasing or text-infilling, necessitating task-specific supervision. In this paper, we present Reflective Decoding, a novel unsupervised algorithm that allows for direct application of unidirectional LMs to non-sequential tasks. Our 2-step approach requires no supervision or even parallel corpora, only two off-the-shelf pretrained LMs in opposite directions: forward and backward. First, in the contextualization step, we use LMs to generate ensembles of past and future contexts which collectively capture the input (e.g. the source sentence for paraphrasing). Second, in the reflection step, we condition on these "context ensembles", generating outputs that are compatible with them. Comprehensive empirical results demonstrate that Reflective Decoding outperforms strong unsupervised baselines on both paraphrasing and abductive text infilling, significantly narrowing the gap between unsupervised and supervised methods. Reflective Decoding surpasses multiple supervised baselines on various metrics including human evaluation.
    End-to-End Spoken Language Understanding for Generalized Voice Assistants. (arXiv:2106.09009v1 [cs.CL])
    (2 min) End-to-end (E2E) spoken language understanding (SLU) systems predict utterance semantics directly from speech using a single model. Previous work in this area has focused on targeted tasks in fixed domains, where the output semantic structure is assumed a priori and the input speech is of limited complexity. In this work we present our approach to developing an E2E model for generalized SLU in commercial voice assistants (VAs). We propose a fully differentiable, transformer-based, hierarchical system that can be pretrained at both the ASR and NLU levels. This is then fine-tuned on both transcription and semantic classification losses to handle a diverse set of intent and argument combinations. This leads to an SLU system that achieves significant improvements over baselines on a complex internal generalized VA dataset with a 43% improvement in accuracy, while still meeting the 99% accuracy benchmark on the popular Fluent Speech Commands dataset. We further evaluate our model on a hard test set, exclusively containing slot arguments unseen in training, and demonstrate a nearly 20% improvement, showing the efficacy of our approach in truly demanding VA scenarios.
    Algorithm to Compilation Codesign: An Integrated View of Neural Network Sparsity. (arXiv:2106.08846v1 [cs.LG])
    (2 min) Reducing computation cost, inference latency, and memory footprint of neural networks are frequently cited as research motivations for pruning and sparsity. However, operationalizing those benefits and understanding the end-to-end effect of algorithm design and regularization on the runtime execution is not often examined in depth. Here we apply structured and unstructured pruning to attention weights of transformer blocks of the BERT language model, while also expanding block sparse representation (BSR) operations in the TVM compiler. Integration of BSR operations enables the TVM runtime execution to leverage structured pattern sparsity induced by model regularization. This integrated view of pruning algorithms enables us to study relationships between modeling decisions and their direct impact on sparsity-enhanced execution. Our main findings are: 1) we validate that performance benefits of structured sparsity block regularization must be enabled by the BSR augmentations to TVM, with 4x speedup relative to vanilla PyTorch and 2.2x speedup relative to standard TVM compilation (without expanded BSR support). 2) for BERT attention weights, the end-to-end optimal block sparsity shape in this CPU inference context is not a square block (as in \cite{gray2017gpu}) but rather a linear 32x1 block 3) the relationship between performance and block size / shape is is suggestive of how model regularization parameters interact with task scheduler optimizations resulting in the observed end-to-end performance.
    KazakhTTS: An Open-Source Kazakh Text-to-Speech Synthesis Dataset. (arXiv:2104.08459v3 [eess.AS] UPDATED)
    (2 min) This paper introduces a high-quality open-source speech synthesis dataset for Kazakh, a low-resource language spoken by over 13 million people worldwide. The dataset consists of about 93 hours of transcribed audio recordings spoken by two professional speakers (female and male). It is the first publicly available large-scale dataset developed to promote Kazakh text-to-speech (TTS) applications in both academia and industry. In this paper, we share our experience by describing the dataset development procedures and faced challenges, and discuss important future directions. To demonstrate the reliability of our dataset, we built baseline end-to-end TTS models and evaluated them using the subjective mean opinion score (MOS) measure. Evaluation results show that the best TTS models trained on our dataset achieve MOS above 4 for both speakers, which makes them applicable for practical use. The dataset, training recipe, and pretrained TTS models are freely available.
    $C^3$: Compositional Counterfactual Constrastive Learning for Video-grounded Dialogues. (arXiv:2106.08914v1 [cs.LG])
    (2 min) Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context. Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available. However, the results are partly accomplished by exploiting biases in the datasets rather than developing multimodal reasoning, resulting in limited generalization. In this paper, we propose a novel approach of Compositional Counterfactual Contrastive Learning ($C^3$) to develop contrastive training between factual and counterfactual samples in video-grounded dialogues. Specifically, we design factual/counterfactual sampling based on the temporal steps in videos and tokens in dialogues and propose contrastive loss functions that exploit object-level or action-level variance. Different from prior approaches, we focus on contrastive hidden state representations among compositional output tokens to optimize the representation space in a generation setting. We achieved promising performance gains on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark and showed the benefits of our approach in grounding video and dialogue context.
    A Topic Coverage Approach to Evaluation of Topic Models. (arXiv:2012.06274v2 [cs.IR] UPDATED)
    (2 min) Topic models are widely used unsupervised models of text capable of learning topics - weighted lists of words and documents - from large collections of text documents. When topic models are used for discovery of topics in text collections, a question that arises naturally is how well the model-induced topics correspond to topics of interest to the analyst. In this paper we revisit and extend a so far neglected approach to topic model evaluation based on measuring topic coverage - computationally matching model topics with a set of reference topics that models are expected to uncover. The approach is well suited for analyzing models' performance in topic discovery and for large-scale analysis of both topic models and measures of model quality. We propose new measures of coverage and evaluate, in a series of experiments, different types of topic models on two distinct text domains for which interest for topic discovery exists. The experiments include evaluation of model quality, analysis of coverage of distinct topic categories, and the analysis of the relationship between coverage and other methods of topic model evaluation. The contributions of the paper include new measures of coverage, insights into both topic models and other methods of model evaluation, and the datasets and code for facilitating future research of both topic coverage and other approaches to topic model evaluation.
    CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions. (arXiv:2012.04293v2 [cs.AI] UPDATED)
    (2 min) Humans are able to perceive, understand and reason about physical events. Developing models with similar physical understanding capabilities is a long-standing goal of artificial intelligence. As a step towards this goal, in this work, we introduce CRAFT, a new visual question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video and question pairs that are generated from 10K videos from 20 different virtual environments, containing various objects in motion that interact with each other and the scene. Two question categories from CRAFT include previously studied descriptive and counterfactual questions. Besides, inspired by the theories of force dynamics in cognitive linguistics, we introduce new question categories that involve understanding the interactions of objects through the notions of cause, enable, and prevent. Our results demonstrate that even though these tasks seem to be simple and intuitive for humans, the evaluated baseline models, including existing state-of-the-art methods, do not yet deal with the challenges posed in our benchmark dataset.
    On the proper role of linguistically-oriented deep net analysis in linguistic theorizing. (arXiv:2106.08694v1 [cs.CL])
    (2 min) A lively research field has recently emerged that uses experimental methods to probe the linguistic behavior of modern deep networks. While work in this tradition often reports intriguing results about the grammatical skills of deep nets, it is not clear what their implications for linguistic theorizing should be. As a consequence, linguistically-oriented deep net analysis has had very little impact on linguistics at large. In this chapter, I suggest that deep networks should be treated as theories making explicit predictions about the acceptability of linguistic utterances. I argue that, if we overcome some obstacles standing in the way of seriously pursuing this idea, we will gain a powerful new theoretical tool, complementary to mainstream algebraic approaches.
    From Discourse to Narrative: Knowledge Projection for Event Relation Extraction. (arXiv:2106.08629v1 [cs.CL])
    (2 min) Current event-centric knowledge graphs highly rely on explicit connectives to mine relations between events. Unfortunately, due to the sparsity of connectives, these methods severely undermine the coverage of EventKGs. The lack of high-quality labelled corpora further exacerbates that problem. In this paper, we propose a knowledge projection paradigm for event relation extraction: projecting discourse knowledge to narratives by exploiting the commonalities between them. Specifically, we propose Multi-tier Knowledge Projection Network (MKPNet), which can leverage multi-tier discourse knowledge effectively for event relation extraction. In this way, the labelled data requirement is significantly reduced, and implicit event relations can be effectively extracted. Intrinsic experimental results show that MKPNet achieves the new state-of-the-art performance, and extrinsic experimental results verify the value of the extracted event relations.
    Evaluating Gender Bias in Hindi-English Machine Translation. (arXiv:2106.08680v1 [cs.CL])
    (2 min) With language models being deployed increasingly in the real world, it is essential to address the issue of the fairness of their outputs. The word embedding representations of these language models often implicitly draw unwanted associations that form a social bias within the model. The nature of gendered languages like Hindi, poses an additional problem to the quantification and mitigation of bias, owing to the change in the form of the words in the sentence, based on the gender of the subject. Additionally, there is sparse work done in the realm of measuring and debiasing systems for Indic languages. In our work, we attempt to evaluate and quantify the gender bias within a Hindi-English machine translation system. We implement a modified version of the existing TGBI metric based on the grammatical considerations for Hindi. We also compare and contrast the resulting bias measurements across multiple metrics for pre-trained embeddings and the ones learned by our machine translation model.
    Refining Language Models with Compositional Explanations. (arXiv:2103.10415v2 [cs.CL] UPDATED)
    (2 min) Pre-trained language models have been successful on text classification tasks, but are prone to learning spurious correlations from biased datasets, and are thus vulnerable when making inferences in a new domain. Prior works reveal such spurious patterns via post-hoc explanation algorithms which compute the importance of input features. Further, the model is regularized to align the importance scores with human knowledge, so that the unintended model behaviors are eliminated. However, such a regularization technique lacks flexibility and coverage, since only importance scores towards a pre-defined list of features are adjusted, while more complex human knowledge such as feature interaction and pattern generalization can hardly be incorporated. In this work, we propose to refine a learned language model for a target domain by collecting human-provided compositional explanations regarding observed biases. By parsing these explanations into executable logic rules, the human-specified refinement advice from a small set of explanations can be generalized to more training examples. We additionally introduce a regularization term allowing adjustments for both importance and interaction of features to better rectify model behavior. We demonstrate the effectiveness of the proposed approach on two text classification tasks by showing improved performance in target domain as well as improved model fairness after refinement.
    The Impact of ASR on the Automatic Analysis of Linguistic Complexity and Sophistication in Spontaneous L2 Speech. (arXiv:2104.08529v2 [cs.CL] UPDATED)
    (2 min) In recent years, automated approaches to assessing linguistic complexity in second language (L2) writing have made significant progress in gauging learner performance, predicting human ratings of the quality of learner productions, and benchmarking L2 development. In contrast, there is comparatively little work in the area of speaking, particularly with respect to fully automated approaches to assessing L2 spontaneous speech. While the importance of a well-performing ASR system is widely recognized, little research has been conducted to investigate the impact of its performance on subsequent automatic text analysis. In this paper, we focus on this issue and examine the impact of using a state-of-the-art ASR system for subsequent automatic analysis of linguistic complexity in spontaneously produced L2 speech. A set of 30 selected measures were considered, falling into four categories: syntactic, lexical, n-gram frequency, and information-theoretic measures. The agreement between the scores for these measures obtained on the basis of ASR-generated vs. manual transcriptions was determined through correlation analysis. A more differential effect of ASR performance on specific types of complexity measures when controlling for task type effects is also presented.
    An Information Divergence Measure Between Neural Text and Human Text. (arXiv:2102.01454v2 [cs.CL] UPDATED)
    (2 min) As major progress is made in open-ended text generation, measuring how close machine-generated text is to human language remains a critical open problem. We propose Mauve, a comparison measure for open-ended text generation, which directly compares a generation model's distribution to that of human-written text. Mauve measures the mean area under a divergence curve for the two distributions, exploring the trade-off between two types of errors: those arising from parts of the human distribution that the model distribution approximates well, and those it does not. Mauve extends a family of information divergence metrics, introducing a tractable approximation based on computing the KL divergence in a quantized embedding space. This yields an efficient implementation that scales up to modern text generation models. Through an extensive empirical study on three open-ended generation tasks, we find that Mauve identifies known properties of generated text, scales naturally with model size, and correlates with human judgments, with fewer restrictions than existing distributional evaluation metrics.
    Evidence-based Factual Error Correction. (arXiv:2012.15788v2 [cs.CL] UPDATED)
    (2 min) This paper introduces the task of factual error correction: performing edits to a claim so that the generated rewrite is better supported by evidence. This extends the well-studied task of fact verification by providing a mechanism to correct written texts that are refuted or only partially supported by evidence. We demonstrate that it is feasible to train factual error correction systems from existing fact checking datasets which only contain labeled claims accompanied by evidence, but not the correction. We achieve this by employing a two-stage distant supervision approach that incorporates evidence into masked claims when generating corrections. Our approach, based on the T5 transformer and using retrieved evidence, achieved better results than existing work which used a pointer copy network and gold evidence, producing accurate factual error corrections for 5x more instances in human evaluation and a .125 increase in SARI score. The evaluation is conducted on a dataset of 65,000 instances based on a recent fact verification shared task and we release it to enable further work on the task.
    Collaborative Training of Acoustic Encoders for Speech Recognition. (arXiv:2106.08960v1 [cs.CL])
    (2 min) On-device speech recognition requires training models of different sizes for deploying on devices with various computational budgets. When building such different models, we can benefit from training them jointly to take advantage of the knowledge shared between them. Joint training is also efficient since it reduces the redundancy in the training procedure's data handling operations. We propose a method for collaboratively training acoustic encoders of different sizes for speech recognition. We use a sequence transducer setup where different acoustic encoders share a common predictor and joiner modules. The acoustic encoders are also trained using co-distillation through an auxiliary task for frame level chenone prediction, along with the transducer loss. We perform experiments using the LibriSpeech corpus and demonstrate that the collaboratively trained acoustic encoders can provide up to a 11% relative improvement in the word error rate on both the test partitions.
    Eider: Evidence-enhanced Document-level Relation Extraction. (arXiv:2106.08657v1 [cs.CL])
    (2 min) Document-level relation extraction (DocRE) aims at extracting the semantic relations among entity pairs in a document. In DocRE, a subset of the sentences in a document, called the evidence sentences, might be sufficient for predicting the relation between a specific entity pair. To make better use of the evidence sentences, in this paper, we propose a three-stage evidence-enhanced DocRE framework consisting of joint relation and evidence extraction, evidence-centered relation extraction (RE), and fusion of extraction results. We first jointly train an RE model with a simple and memory-efficient evidence extraction model. Then, we construct pseudo documents based on the extracted evidence sentences and run the RE model again. Finally, we fuse the extraction results of the first two stages using a blending layer and make a final prediction. Extensive experiments show that our proposed framework achieves state-of-the-art performance on the DocRED dataset, outperforming the second-best method by 0.76/0.82 Ign F1/F1. In particular, our method significantly improves the performance on inter-sentence relations by 1.23 Inter F1.
    Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data. (arXiv:2106.08977v1 [cs.CL])
    (2 min) Weak supervision has shown promising results in many natural language processing tasks, such as Named Entity Recognition (NER). Existing work mainly focuses on learning deep NER models only with weak supervision, i.e., without any human annotation, and shows that by merely using weakly labeled data, one can achieve good performance, though still underperforms fully supervised NER with manually/strongly labeled data. In this paper, we consider a more practical scenario, where we have both a small amount of strongly labeled data and a large amount of weakly labeled data. Unfortunately, we observe that weakly labeled data does not necessarily improve, or even deteriorate the model performance (due to the extensive noise in the weak labels) when we train deep NER models over a simple or weighted combination of the strongly labeled and weakly labeled data. To address this issue, we propose a new multi-stage computational framework -- NEEDLE with three essential ingredients: (1) weak label completion, (2) noise-aware loss function, and (3) final fine-tuning over the strongly labeled data. Through experiments on E-commerce query NER and Biomedical NER, we demonstrate that NEEDLE can effectively suppress the noise of the weak labels and outperforms existing methods. In particular, we achieve new SOTA F1-scores on 3 Biomedical NER datasets: BC5CDR-chem 93.74, BC5CDR-disease 90.69, NCBI-disease 92.28.
    Domain-independent User Simulation with Transformers for Task-oriented Dialogue Systems. (arXiv:2106.08838v1 [cs.CL])
    (2 min) Dialogue policy optimisation via reinforcement learning requires a large number of training interactions, which makes learning with real users time consuming and expensive. Many set-ups therefore rely on a user simulator instead of humans. These user simulators have their own problems. While hand-coded, rule-based user simulators have been shown to be sufficient in small, simple domains, for complex domains the number of rules quickly becomes intractable. State-of-the-art data-driven user simulators, on the other hand, are still domain-dependent. This means that adaptation to each new domain requires redesigning and retraining. In this work, we propose a domain-independent transformer-based user simulator (TUS). The structure of our TUS is not tied to a specific domain, enabling domain generalisation and learning of cross-domain user behaviour from data. We compare TUS with the state of the art using automatic as well as human evaluations. TUS can compete with rule-based user simulators on pre-defined domains and is able to generalise to unseen domains in a zero-shot fashion.
    Phoneme-BERT: Joint Language Modelling of Phoneme Sequence and ASR Transcript. (arXiv:2102.00804v2 [eess.AS] UPDATED)
    (2 min) Recent years have witnessed significant improvement in ASR systems to recognize spoken utterances. However, it is still a challenging task for noisy and out-of-domain data, where substitution and deletion errors are prevalent in the transcribed text. These errors significantly degrade the performance of downstream tasks. In this work, we propose a BERT-style language model, referred to as PhonemeBERT, that learns a joint language model with phoneme sequence and ASR transcript to learn phonetic-aware representations that are robust to ASR errors. We show that PhonemeBERT can be used on downstream tasks using phoneme sequences as additional features, and also in low-resource setup where we only have ASR-transcripts for the downstream tasks with no phoneme information available. We evaluate our approach extensively by generating noisy data for three benchmark datasets - Stanford Sentiment Treebank, TREC and ATIS for sentiment, question and intent classification tasks respectively. The results of the proposed approach beats the state-of-the-art baselines comprehensively on each dataset.
    Improving the expressiveness of neural vocoding with non-affine Normalizing Flows. (arXiv:2106.08649v1 [eess.AS])
    (2 min) This paper proposes a general enhancement to the Normalizing Flows (NF) used in neural vocoding. As a case study, we improve expressive speech vocoding with a revamped Parallel Wavenet (PW). Specifically, we propose to extend the affine transformation of PW to the more expressive invertible non-affine function. The greater expressiveness of the improved PW leads to better-perceived signal quality and naturalness in the waveform reconstruction and text-to-speech (TTS) tasks. We evaluate the model across different speaking styles on a multi-speaker, multi-lingual dataset. In the waveform reconstruction task, the proposed model closes the naturalness and signal quality gap from the original PW to recordings by $10\%$, and from other state-of-the-art neural vocoding systems by more than $60\%$. We also demonstrate improvements in objective metrics on the evaluation test set with L2 Spectral Distance and Cross-Entropy reduced by $3\%$ and $6\unicode{x2030}$ comparing to the affine PW. Furthermore, we extend the probability density distillation procedure proposed by the original PW paper, so that it works with any non-affine invertible and differentiable function.
    Revisiting the Weaknesses of Reinforcement Learning for Neural Machine Translation. (arXiv:2106.08942v1 [cs.CL])
    (2 min) Policy gradient algorithms have found wide adoption in NLP, but have recently become subject to criticism, doubting their suitability for NMT. Choshen et al. (2020) identify multiple weaknesses and suspect that their success is determined by the shape of output distributions rather than the reward. In this paper, we revisit these claims and study them under a wider range of configurations. Our experiments on in-domain and cross-domain adaptation reveal the importance of exploration and reward scaling, and provide empirical counter-evidence to these claims.
    RefBERT: Compressing BERT by Referencing to Pre-computed Representations. (arXiv:2106.08898v1 [cs.CL])
    (2 min) Recently developed large pre-trained language models, e.g., BERT, have achieved remarkable performance in many downstream natural language processing applications. These pre-trained language models often contain hundreds of millions of parameters and suffer from high computation and latency in real-world applications. It is desirable to reduce the computation overhead of the models for fast training and inference while keeping the model performance in downstream applications. Several lines of work utilize knowledge distillation to compress the teacher model to a smaller student model. However, they usually discard the teacher's knowledge when in inference. Differently, in this paper, we propose RefBERT to leverage the knowledge learned from the teacher, i.e., facilitating the pre-computed BERT representation on the reference sample and compressing BERT into a smaller student model. To guarantee our proposal, we provide theoretical justification on the loss function and the usage of reference samples. Significantly, the theoretical result shows that including the pre-computed teacher's representations on the reference samples indeed increases the mutual information in learning the student model. Finally, we conduct the empirical evaluation and show that our RefBERT can beat the vanilla TinyBERT over 8.1\% and achieves more than 94\% of the performance of $\BERTBASE$ on the GLUE benchmark. Meanwhile, RefBERT is 7.4x smaller and 9.5x faster on inference than BERT$_{\rm BASE}$.
    Attention-Based Keyword Localisation in Speech using Visual Grounding. (arXiv:2106.08859v1 [cs.CL])
    (2 min) Visually grounded speech models learn from images paired with spoken captions. By tagging images with soft text labels using a trained visual classifier with a fixed vocabulary, previous work has shown that it is possible to train a model that can detect whether a particular text keyword occurs in speech utterances or not. Here we investigate whether visually grounded speech models can also do keyword localisation: predicting where, within an utterance, a given textual keyword occurs without any explicit text-based or alignment supervision. We specifically consider whether incorporating attention into a convolutional model is beneficial for localisation. Although absolute localisation performance with visually supervised models is still modest (compared to using unordered bag-of-word text labels for supervision), we show that attention provides a large gain in performance over previous visually grounded models. As in many other speech-image studies, we find that many of the incorrect localisations are due to semantic confusions, e.g. locating the word 'backstroke' for the query keyword 'swimming'.
    Subword Sampling for Low Resource Word Alignment. (arXiv:2012.11657v2 [cs.CL] UPDATED)
    (2 min) Annotation projection is an important area in NLP that can greatly contribute to creating language resources for low-resource languages. Word alignment plays a key role in this setting. However, most of the existing word alignment methods are designed for a high resource setting in machine translation where millions of parallel sentences are available. This amount reduces to a few thousands of sentences when dealing with low-resource languages failing the existing established IBM models. In this paper, we propose subword sampling-based alignment of text units. This method's hypothesis is that the aggregation of different granularities of text for certain language pairs can help word-level alignment. For certain languages for which gold-standard alignments exist, we propose an iterative Bayesian optimization framework to optimize selecting possible subwords from the space of possible subword representations of the source and target sentences. We show that the subword sampling method consistently outperforms word-level alignment on six language pairs: English-German, English-French, English-Romanian, English-Persian, English-Hindi, and English-Inuktitut. In addition, we show that the hyperparameters learned for certain language pairs can be applied to other languages at no supervision and consistently improve the alignment results. We observe that using $5K$ parallel sentences together with our proposed subword sampling approach, we obtain similar F1 scores to the use of $100K$'s of parallel sentences in existing word-level fast-align/eflomal alignment methods.
    Towards Automated Website Classification by Deep Learning. (arXiv:1910.09991v2 [cs.LG] UPDATED)
    (2 min) In recent years, the interest in Big Data sources has been steadily growing within the Official Statistic community. The Italian National Institute of Statistics (Istat) is currently carrying out several Big Data pilot studies. One of these studies, the ICT Big Data pilot, aims at exploiting massive amounts of textual data automatically scraped from the websites of Italian enterprises in order to predict a set of target variables (e.g. e-commerce) that are routinely observed by the traditional ICT Survey. In this paper, we show that Deep Learning techniques can successfully address this problem. Essentially, we tackle a text classification task: an algorithm must learn to infer whether an Italian enterprise performs e-commerce from the textual content of its website. To reach this goal, we developed a sophisticated processing pipeline and evaluated its performance through extensive experiments. Our pipeline uses Convolutional Neural Networks and relies on Word Embeddings to encode raw texts into grayscale images (i.e. normalized numeric matrices). Web-scraped texts are huge and have very low signal to noise ratio: to overcome these issues, we adopted a framework known as False Positive Reduction, which has seldom (if ever) been applied before to text classification tasks. Several original contributions enable our processing pipeline to reach good classification results. Empirical evidence shows that our proposal outperforms all the alternative Machine Learning solutions already tested in Istat for the same task.
    Alzheimer's Disease Detection from Spontaneous Speech through Combining Linguistic Complexity and (Dis)Fluency Features with Pretrained Language Models. (arXiv:2106.08689v1 [cs.CL])
    (2 min) In this paper, we combined linguistic complexity and (dis)fluency features with pretrained language models for the task of Alzheimer's disease detection of the 2021 ADReSSo (Alzheimer's Dementia Recognition through Spontaneous Speech) challenge. An accuracy of 83.1% was achieved on the test set, which amounts to an improvement of 4.23% over the baseline model. Our best-performing model that integrated component models using a stacking ensemble technique performed equally well on cross-validation and test data, indicating that it is robust against overfitting.
    Earnings-21: A Practical Benchmark for ASR in the Wild. (arXiv:2104.11348v3 [cs.CL] UPDATED)
    (2 min) Commonly used speech corpora inadequately challenge academic and commercial ASR systems. In particular, speech corpora lack metadata needed for detailed analysis and WER measurement. In response, we present Earnings-21, a 39-hour corpus of earnings calls containing entity-dense speech from nine different financial sectors. This corpus is intended to benchmark ASR systems in the wild with special attention towards named entity recognition. We benchmark four commercial ASR models, two internal models built with open-source tools, and an open-source LibriSpeech model and discuss their differences in performance on Earnings-21. Using our recently released fstalign tool, we provide a candid analysis of each model's recognition capabilities under different partitions. Our analysis finds that ASR accuracy for certain NER categories is poor, presenting a significant impediment to transcript comprehension and usage. Earnings-21 bridges academic and commercial ASR system evaluation and enables further research on entity modeling and WER on real world audio.
    SEOVER: Sentence-level Emotion Orientation Vector based Conversation Emotion Recognition Model. (arXiv:2106.08785v1 [cs.CL])
    (2 min) For the task of conversation emotion recognition, recent works focus on speaker relationship modeling but ignore the role of utterance's emotional tendency.In this paper, we propose a new expression paradigm of sentence-level emotion orientation vector to model the potential correlation of emotions between sentence vectors. Based on it, we design an emotion recognition model, which extracts the sentence-level emotion orientation vectors from the language model and jointly learns from the dialogue sentiment analysis model and extracted sentence-level emotion orientation vectors to identify the speaker's emotional orientation during the conversation. We conduct experiments on two benchmark datasets and compare them with the five baseline models.The experimental results show that our model has better performance on all data sets.
    Pathological voice adaptation with autoencoder-based voice conversion. (arXiv:2106.08427v1 [cs.SD])
    (2 min) In this paper, we propose a new approach to pathological speech synthesis. Instead of using healthy speech as a source, we customise an existing pathological speech sample to a new speaker's voice characteristics. This approach alleviates the evaluation problem one normally has when converting typical speech to pathological speech, as in our approach, the voice conversion (VC) model does not need to be optimised for speech degradation but only for the speaker change. This change in the optimisation ensures that any degradation found in naturalness is due to the conversion process and not due to the model exaggerating characteristics of a speech pathology. To show a proof of concept of this method, we convert dysarthric speech using the UASpeech database and an autoencoder-based VC technique. Subjective evaluation results show reasonable naturalness for high intelligibility dysarthric speakers, though lower intelligibility seems to introduce a marginal degradation in naturalness scores for mid and low intelligibility speakers compared to ground truth. Conversion of speaker characteristics for low and high intelligibility speakers is successful, but not for mid. Whether the differences in the results for the different intelligibility levels is due to the intelligibility levels or due to the speakers needs to be further investigated.
    Emotion Dynamics in Movie Dialogues. (arXiv:2103.01345v3 [cs.CL] UPDATED)
    (2 min) Emotion dynamics is a framework for measuring how an individual's emotions change over time. It is a powerful tool for understanding how we behave and interact with the world. In this paper, we introduce a framework to track emotion dynamics through one's utterances. Specifically we introduce a number of utterance emotion dynamics (UED) metrics inspired by work in Psychology. We use this approach to trace emotional arcs of movie characters. We analyze thousands of such character arcs to test hypotheses that inform our broader understanding of stories. Notably, we show that there is a tendency for characters to use increasingly more negative words and become increasingly emotionally discordant with each other until about 90 percent of the narrative length. UED also has applications in behavior studies, social sciences, and public health.
    Out-of-Scope Intent Detection with Self-Supervision and Discriminative Training. (arXiv:2106.08616v1 [cs.CL])
    (2 min) Out-of-scope intent detection is of practical importance in task-oriented dialogue systems. Since the distribution of outlier utterances is arbitrary and unknown in the training stage, existing methods commonly rely on strong assumptions on data distribution such as mixture of Gaussians to make inference, resulting in either complex multi-step training procedures or hand-crafted rules such as confidence threshold selection for outlier detection. In this paper, we propose a simple yet effective method to train an out-of-scope intent classifier in a fully end-to-end manner by simulating the test scenario in training, which requires no assumption on data distribution and no additional post-processing or threshold setting. Specifically, we construct a set of pseudo outliers in the training stage, by generating synthetic outliers using inliner features via self-supervision and sampling out-of-scope sentences from easily available open-domain datasets. The pseudo outliers are used to train a discriminative classifier that can be directly applied to and generalize well on the test task. We evaluate our method extensively on four benchmark dialogue datasets and observe significant improvements over state-of-the-art approaches. Our code has been released at https://github.com/liam0949/DCLOOS.
    Coreference Augmentation for Multi-Domain Task-Oriented Dialogue State Tracking. (arXiv:2106.08723v1 [cs.CL])
    (2 min) Dialogue State Tracking (DST), which is the process of inferring user goals by estimating belief states given the dialogue history, plays a critical role in task-oriented dialogue systems. A coreference phenomenon observed in multi-turn conversations is not addressed by existing DST models, leading to sub-optimal performances. In this paper, we propose Coreference Dialogue State Tracker (CDST) that explicitly models the coreference feature. In particular, at each turn, the proposed model jointly predicts the coreferred domain-slot pair and extracts the coreference values from the dialogue context. Experimental results on MultiWOZ 2.1 dataset show that the proposed model achieves the state-of-the-art joint goal accuracy of 56.47%.
    Improving Entity Linking through Semantic Reinforced Entity Embeddings. (arXiv:2106.08495v1 [cs.CL])
    (2 min) Entity embeddings, which represent different aspects of each entity with a single vector like word embeddings, are a key component of neural entity linking models. Existing entity embeddings are learned from canonical Wikipedia articles and local contexts surrounding target entities. Such entity embeddings are effective, but too distinctive for linking models to learn contextual commonality. We propose a simple yet effective method, FGS2EE, to inject fine-grained semantic information into entity embeddings to reduce the distinctiveness and facilitate the learning of contextual commonality. FGS2EE first uses the embeddings of semantic type words to generate semantic embeddings, and then combines them with existing entity embeddings through linear aggregation. Extensive experiments show the effectiveness of such embeddings. Based on our entity embeddings, we achieved new sate-of-the-art performance on entity linking.
    Alternated Training with Synthetic and Authentic Data for Neural Machine Translation. (arXiv:2106.08582v1 [cs.CL])
    (2 min) While synthetic bilingual corpora have demonstrated their effectiveness in low-resource neural machine translation (NMT), adding more synthetic data often deteriorates translation performance. In this work, we propose alternated training with synthetic and authentic data for NMT. The basic idea is to alternate synthetic and authentic corpora iteratively during training. Compared with previous work, we introduce authentic data as guidance to prevent the training of NMT models from being disturbed by noisy synthetic data. Experiments on Chinese-English and German-English translation tasks show that our approach improves the performance over several strong baselines. We visualize the BLEU landscape to further investigate the role of authentic and synthetic data during alternated training. From the visualization, we find that authentic data helps to direct the NMT model parameters towards points with higher BLEU scores and leads to consistent translation performance improvement.
    Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors. (arXiv:2106.08415v1 [cs.SE])
    (2 min) Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to "translate" code snippets into relevant natural language descriptions. Most evaluations of such models are conducted using automatic reference-based metrics. However, given the relatively large semantic gap between programming languages and natural language, we argue that this line of research would benefit from a qualitative investigation into the various error modes of current state-of-the-art models. Therefore, in this work, we perform both a quantitative and qualitative comparison of three recently proposed source code summarization models. In our quantitative evaluation, we compare the models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics, and in our qualitative evaluation, we perform a manual open-coding of the most common errors committed by the models when compared to ground truth captions. Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an empirically derived error taxonomy that can be used to drive future research efforts
    Generative Conversational Networks. (arXiv:2106.08484v1 [cs.CL])
    (2 min) Inspired by recent work in meta-learning and generative teaching networks, we propose a framework called Generative Conversational Networks, in which conversational agents learn to generate their own labelled training data (given some seed data) and then train themselves from that data to perform a given task. We use reinforcement learning to optimize the data generation process where the reward signal is the agent's performance on the task. The task can be any language-related task, from intent detection to full task-oriented conversations. In this work, we show that our approach is able to generalise from seed data and performs well in limited data and limited computation settings, with significant gains for intent detection and slot tagging across multiple datasets: ATIS, TOD, SNIPS, and Restaurants8k. We show an average improvement of 35% in intent detection and 21% in slot tagging over a baseline model trained from the seed data. We also conduct an analysis of the novelty of the generated data and provide generated examples for intent detection, slot tagging, and non-goal oriented conversations.
    RyanSpeech: A Corpus for Conversational Text-to-Speech Synthesis. (arXiv:2106.08468v1 [cs.CL])
    (2 min) This paper introduces RyanSpeech, a new speech corpus for research on automated text-to-speech (TTS) systems. Publicly available TTS corpora are often noisy, recorded with multiple speakers, or lack quality male speech data. In order to meet the need for a high quality, publicly available male speech corpus within the field of speech recognition, we have designed and created RyanSpeech which contains textual materials from real-world conversational settings. These materials contain over 10 hours of a professional male voice actor's speech recorded at 44.1 kHz. This corpus's design and pipeline make RyanSpeech ideal for developing TTS systems in real-world applications. To provide a baseline for future research, protocols, and benchmarks, we trained 4 state-of-the-art speech models and a vocoder on RyanSpeech. The results show 3.36 in mean opinion scores (MOS) in our best model. We have made both the corpus and trained models for public use.
    Discrete Auto-regressive Variational Attention Models for Text Modeling. (arXiv:2106.08571v1 [cs.LG])
    (2 min) Variational autoencoders (VAEs) have been widely applied for text modeling. In practice, however, they are troubled by two challenges: information underrepresentation and posterior collapse. The former arises as only the last hidden state of LSTM encoder is transformed into the latent space, which is generally insufficient to summarize the data. The latter is a long-standing problem during the training of VAEs as the optimization is trapped to a disastrous local optimum. In this paper, we propose Discrete Auto-regressive Variational Attention Model (DAVAM) to address the challenges. Specifically, we introduce an auto-regressive variational attention approach to enrich the latent space by effectively capturing the semantic dependency from the input. We further design discrete latent space for the variational attention and mathematically show that our model is free from posterior collapse. Extensive experiments on language modeling tasks demonstrate the superiority of DAVAM against several VAE counterparts.
    Coreference-Aware Dialogue Summarization. (arXiv:2106.08556v1 [cs.CL])
    (2 min) Summarizing conversations via neural approaches has been gaining research traction lately, yet it is still challenging to obtain practical solutions. Examples of such challenges include unstructured information exchange in dialogues, informal interactions between speakers, and dynamic role changes of speakers as the dialogue evolves. Many of such challenges result in complex coreference links. Therefore, in this work, we investigate different approaches to explicitly incorporate coreference information in neural abstractive dialogue summarization models to tackle the aforementioned challenges. Experimental results show that the proposed approaches achieve state-of-the-art performance, implying it is useful to utilize coreference information in dialogue summarization. Evaluation results on factual correctness suggest such coreference-aware models are better at tracing the information flow among interlocutors and associating accurate status/actions with the corresponding interlocutors and person mentions.
    Unsupervised Enrichment of Persona-grounded Dialog with Background Stories. (arXiv:2106.08364v1 [cs.CL])
    (2 min) Humans often refer to personal narratives, life experiences, and events to make a conversation more engaging and rich. While persona-grounded dialog models are able to generate responses that follow a given persona, they often miss out on stating detailed experiences or events related to a persona, often leaving conversations shallow and dull. In this work, we equip dialog models with 'background stories' related to a persona by leveraging fictional narratives from existing story datasets (e.g. ROCStories). Since current dialog datasets do not contain such narratives as responses, we perform an unsupervised adaptation of a retrieved story for generating a dialog response using a gradient-based rewriting technique. Our proposed method encourages the generated response to be fluent (i.e., highly likely) with the dialog history, minimally different from the retrieved story to preserve event ordering and consistent with the original persona. We demonstrate that our method can generate responses that are more diverse, and are rated more engaging and human-like by human evaluators, compared to outputs from existing dialog models.
    What Context Features Can Transformer Language Models Use?. (arXiv:2106.08367v1 [cs.CL])
    (2 min) Transformer-based language models benefit from conditioning on contexts of hundreds to thousands of previous tokens. What aspects of these contexts contribute to accurate model prediction? We describe a series of experiments that measure usable information by selectively ablating lexical and structural information in transformer language models trained on English Wikipedia. In both mid- and long-range contexts, we find that several extremely destructive context manipulations -- including shuffling word order within sentences and deleting all words other than nouns -- remove less than 15% of the usable information. Our results suggest that long contexts, but not their detailed syntactic and propositional content, are important for the low perplexity of current transformer language models.
  • cs.CV updates on arXiv.org

    Self-supervised GANs with Label Augmentation. (arXiv:2106.08601v1 [cs.LG])
    (2 min) Recently, transformation-based self-supervised learning has been applied to generative adversarial networks (GANs) to mitigate the catastrophic forgetting problem of discriminator by learning stable representations. However, the separate self-supervised tasks in existing self-supervised GANs cause an inconsistent goal with generative modeling due to the learning of the generator from their generator distribution-agnostic classifiers. To address this issue, we propose a novel self-supervised GANs framework with label augmentation, i.e., augmenting the GAN labels (real or fake) with the self-supervised pseudo-labels. In particular, the discriminator and the self-supervised classifier are unified to learn a single task that predicts the augmented label such that the discriminator/classifier is aware of the generator distribution, while the generator tries to confuse the discriminator/classifier by optimizing the discrepancy between the transformed real and generated distributions. Theoretically, we prove that the generator, at the equilibrium point, converges to replicate the data distribution. Empirically, we demonstrate that the proposed method significantly outperforms competitive baselines on both generative modeling and representation learning across benchmark datasets.
    A Spiking Neural Network for Image Segmentation. (arXiv:2106.08921v1 [cs.NE])
    (2 min) We seek to investigate the scalability of neuromorphic computing for computer vision, with the objective of replicating non-neuromorphic performance on computer vision tasks while reducing power consumption. We convert the deep Artificial Neural Network (ANN) architecture U-Net to a Spiking Neural Network (SNN) architecture using the Nengo framework. Both rate-based and spike-based models are trained and optimized for benchmarking performance and power, using a modified version of the ISBI 2D EM Segmentation dataset consisting of microscope images of cells. We propose a partitioning method to optimize inter-chip communication to improve speed and energy efficiency when deploying multi-chip networks on the Loihi neuromorphic chip. We explore the advantages of regularizing firing rates of Loihi neurons for converting ANN to SNN with minimum accuracy loss and optimized energy consumption. We propose a percentile based regularization loss function to limit the spiking rate of the neuron between a desired range. The SNN is converted directly from the corresponding ANN, and demonstrates similar semantic segmentation as the ANN using the same number of neurons and weights. However, the neuromorphic implementation on the Intel Loihi neuromorphic chip is over 2x more energy-efficient than conventional hardware (CPU, GPU) when running online (one image at a time). These power improvements are achieved without sacrificing the task performance accuracy of the network, and when all weights (Loihi, CPU, and GPU networks) are quantized to 8 bits.
    Automating Augmentation Through Random Unidimensional Search. (arXiv:2106.08756v1 [cs.LG])
    (2 min) It is no secret amongst deep learning researchers that finding the right data augmentation strategy during training can mean the difference between a state-of-the-art result and a run-of-the-mill ranking. To that end, the community has seen many efforts to automate the process of finding the perfect augmentation procedure for any task at hand. Unfortunately, even recent cutting-edge methods bring massive computational overhead, requiring as many as 100 full model trainings to settle on an ideal configuration. We show how to achieve even better performance in just 7: with Random Unidimensional Augmentation. Source code is available at https://github.com/fastestimator/RUA
    Morphset:Augmenting categorical emotion datasets with dimensional affect labels using face morphing. (arXiv:2103.02854v2 [cs.CV] UPDATED)
    (2 min) Emotion recognition and understanding is a vital component in human-machine interaction. Dimensional models of affect such as those using valence and arousal have advantages over traditional categorical ones due to the complexity of emotional states in humans. However, dimensional emotion annotations are difficult and expensive to collect, therefore they are not as prevalent in the affective computing community. To address these issues, we propose a method to generate synthetic images from existing categorical emotion datasets using face morphing as well as dimensional labels in the circumplex space with full control over the resulting sample distribution, while achieving augmentation factors of at least 20x or more.
    LaneAF: Robust Multi-Lane Detection with Affinity Fields. (arXiv:2103.12040v3 [cs.CV] UPDATED)
    (2 min) This study presents an approach to lane detection involving the prediction of binary segmentation masks and per-pixel affinity fields. These affinity fields, along with the binary masks, can then be used to cluster lane pixels horizontally and vertically into corresponding lane instances in a post-processing step. This clustering is achieved through a simple row-by-row decoding process with little overhead; such an approach allows LaneAF to detect a variable number of lanes without assuming a fixed or maximum number of lanes. Moreover, this form of clustering is more interpretable in comparison to previous visual clustering approaches, and can be analyzed to identify and correct sources of error. Qualitative and quantitative results obtained on popular lane detection datasets demonstrate the model's ability to detect and cluster lanes effectively and robustly. Our proposed approach sets a new state-of-the-art on the challenging CULane dataset and the recently introduced Unsupervised LLAMAS dataset.
    Training Generative Adversarial Networks in One Stage. (arXiv:2103.00430v3 [cs.CV] UPDATED)
    (2 min) Generative Adversarial Networks (GANs) have demonstrated unprecedented success in various image generation tasks. The encouraging results, however, come at the price of a cumbersome training process, during which the generator and discriminator are alternately updated in two stages. In this paper, we investigate a general training scheme that enables training GANs efficiently in only one stage. Based on the adversarial losses of the generator and discriminator, we categorize GANs into two classes, Symmetric GANs and Asymmetric GANs, and introduce a novel gradient decomposition method to unify the two, allowing us to train both classes in one stage and hence alleviate the training effort. We also computationally analyze the efficiency of the proposed method, and empirically demonstrate that, the proposed method yields a solid $1.5\times$ acceleration across various datasets and network architectures. Furthermore, we show that the proposed method is readily applicable to other adversarial-training scenarios, such as data-free knowledge distillation. The code is available at https://github.com/zju-vipa/OSGAN.
    Gaze Preserving CycleGANs for Eyeglass Removal & Persistent Gaze Estimation. (arXiv:2002.02077v6 [cs.CV] UPDATED)
    (2 min) A driver's gaze is critical for determining their attention, state, situational awareness, and readiness to take over control from partially automated vehicles. Estimating the gaze direction is the most obvious way to gauge a driver's state under ideal conditions when limited to using non-intrusive imaging sensors. Unfortunately, the vehicular environment introduces a variety of challenges that are usually unaccounted for - harsh illumination, nighttime conditions, and reflective eyeglasses. Relying on head pose alone under such conditions can prove to be unreliable and erroneous. In this study, we offer solutions to address these problems encountered in the real world. To solve issues with lighting, we demonstrate that using an infrared camera with suitable equalization and normalization suffices. To handle eyeglasses and their corresponding artifacts, we adopt image-to-image translation using generative adversarial networks to pre-process images prior to gaze estimation. Our proposed Gaze Preserving CycleGAN (GPCycleGAN) is trained to preserve the driver's gaze while removing potential eyeglasses from face images. GPCycleGAN is based on the well-known CycleGAN approach - with the addition of a gaze classifier and a gaze consistency loss for additional supervision. Our approach exhibits improved performance, interpretability, robustness and superior qualitative results on challenging real-world datasets.
    Bridging Multi-Task Learning and Meta-Learning: Towards Efficient Training and Effective Adaptation. (arXiv:2106.09017v1 [cs.LG])
    (2 min) Multi-task learning (MTL) aims to improve the generalization of several related tasks by learning them jointly. As a comparison, in addition to the joint training scheme, modern meta-learning allows unseen tasks with limited labels during the test phase, in the hope of fast adaptation over them. Despite the subtle difference between MTL and meta-learning in the problem formulation, both learning paradigms share the same insight that the shared structure between existing training tasks could lead to better generalization and adaptation. In this paper, we take one important step further to understand the close connection between these two learning paradigms, through both theoretical analysis and empirical investigation. Theoretically, we first demonstrate that MTL shares the same optimization formulation with a class of gradient-based meta-learning (GBML) algorithms. We then prove that for over-parameterized neural networks with sufficient depth, the learned predictive functions of MTL and GBML are close. In particular, this result implies that the predictions given by these two models are similar over the same unseen task. Empirically, we corroborate our theoretical findings by showing that, with proper implementation, MTL is competitive against state-of-the-art GBML algorithms on a set of few-shot image classification benchmarks. Since existing GBML algorithms often involve costly second-order bi-level optimization, our first-order MTL method is an order of magnitude faster on large-scale datasets such as mini-ImageNet. We believe this work could help bridge the gap between these two learning paradigms, and provide a computationally efficient alternative to GBML that also supports fast task adaptation.
    MixMix: All You Need for Data-Free Compression Are Feature and Data Mixing. (arXiv:2011.09899v2 [cs.LG] UPDATED)
    (2 min) User data confidentiality protection is becoming a rising challenge in the present deep learning research. Without access to data, conventional data-driven model compression faces a higher risk of performance degradation. Recently, some works propose to generate images from a specific pretrained model to serve as training data. However, the inversion process only utilizes biased feature statistics stored in one model and is from low-dimension to high-dimension. As a consequence, it inevitably encounters the difficulties of generalizability and inexact inversion, which leads to unsatisfactory performance. To address these problems, we propose MixMix based on two simple yet effective techniques: (1) Feature Mixing: utilizes various models to construct a universal feature space for generalized inversion; (2) Data Mixing: mixes the synthesized images and labels to generate exact label information. We prove the effectiveness of MixMix from both theoretical and empirical perspectives. Extensive experiments show that MixMix outperforms existing methods on the mainstream compression tasks, including quantization, knowledge distillation, and pruning. Specifically, MixMix achieves up to 4% and 20% accuracy uplift on quantization and pruning, respectively, compared to existing data-free compression work.
    Evolving Image Compositions for Feature Representation Learning. (arXiv:2106.09011v1 [cs.CV])
    (2 min) Convolutional neural networks for visual recognition require large amounts of training samples and usually benefit from data augmentation. This paper proposes PatchMix, a data augmentation method that creates new samples by composing patches from pairs of images in a grid-like pattern. These new samples' ground truth labels are set as proportional to the number of patches from each image. We then add a set of additional losses at the patch-level to regularize and to encourage good representations at both the patch and image levels. A ResNet-50 model trained on ImageNet using PatchMix exhibits superior transfer learning capabilities across a wide array of benchmarks. Although PatchMix can rely on random pairings and random grid-like patterns for mixing, we explore evolutionary search as a guiding strategy to discover optimal grid-like patterns and image pairing jointly. For this purpose, we conceive a fitness function that bypasses the need to re-train a model to evaluate each choice. In this way, PatchMix outperforms a base model on CIFAR-10 (+1.91), CIFAR-100 (+5.31), Tiny Imagenet (+3.52), and ImageNet (+1.16) by significant margins, also outperforming previous state-of-the-art pairwise augmentation strategies.
    Explainers in the Wild: Making Surrogate Explainers Robust to Distortions through Perception. (arXiv:2102.10951v2 [cs.CV] UPDATED)
    (2 min) Explaining the decisions of models is becoming pervasive in the image processing domain, whether it is by using post-hoc methods or by creating inherently interpretable models. While the widespread use of surrogate explainers is a welcome addition to inspect and understand black-box models, assessing the robustness and reliability of the explanations is key for their success. Additionally, whilst existing work in the explainability field proposes various strategies to address this problem, the challenges of working with data in the wild is often overlooked. For instance, in image classification, distortions to images can not only affect the predictions assigned by the model, but also the explanation. Given a clean and a distorted version of an image, even if the prediction probabilities are similar, the explanation may still be different. In this paper we propose a methodology to evaluate the effect of distortions in explanations by embedding perceptual distances that tailor the neighbourhoods used to training surrogate explainers. We also show that by operating in this way, we can make the explanations more robust to distortions. We generate explanations for images in the Imagenet-C dataset and demonstrate how using a perceptual distances in the surrogate explainer creates more coherent explanations for the distorted and reference images.
    Predictive coding feedback results in perceived illusory contours in a recurrent neural network. (arXiv:2102.01955v2 [cs.CV] UPDATED)
    (2 min) Modern feedforward convolutional neural networks (CNNs) can now solve some computer vision tasks at super-human levels. However, these networks only roughly mimic human visual perception. One difference from human vision is that they do not appear to perceive illusory contours (e.g. Kanizsa squares) in the same way humans do. Physiological evidence from visual cortex suggests that the perception of illusory contours could involve feedback connections. Would recurrent feedback neural networks perceive illusory contours like humans? In this work we equip a deep feedforward convolutional network with brain-inspired recurrent dynamics. The network was first pretrained with an unsupervised reconstruction objective on a natural image dataset, to expose it to natural object contour statistics. Then, a classification decision layer was added and the model was finetuned on a form discrimination task: squares vs. randomly oriented inducer shapes (no illusory contour). Finally, the model was tested with the unfamiliar ''illusory contour'' configuration: inducer shapes oriented to form an illusory square. Compared with feedforward baselines, the iterative ''predictive coding'' feedback resulted in more illusory contours being classified as physical squares. The perception of the illusory contour was measurable in the luminance profile of the image reconstructions produced by the model, demonstrating that the model really ''sees'' the illusion. Ablation studies revealed that natural image pretraining and feedback error correction are both critical to the perception of the illusion. Finally we validated our conclusions in a deeper network (VGG): adding the same predictive coding feedback dynamics again leads to the perception of illusory contours.
    Invertible Attention. (arXiv:2106.09003v1 [cs.CV])
    (2 min) Attention has been proved to be an efficient mechanism to capture long-range dependencies. However, so far it has not been deployed in invertible networks. This is due to the fact that in order to make a network invertible, every component within the network needs to be a bijective transformation, but a normal attention block is not. In this paper, we propose invertible attention that can be plugged into existing invertible models. We mathematically and experimentally prove that the invertibility of an attention model can be achieved by carefully constraining its Lipschitz constant. We validate the invertibility of our invertible attention on image reconstruction task with 3 popular datasets: CIFAR-10, SVHN, and CelebA. We also show that our invertible attention achieves similar performance in comparison with normal non-invertible attention on dense prediction tasks.
    Automatic Social Distance Estimation From Images: Performance Evaluation, Test Benchmark, and Algorithm. (arXiv:2103.06759v2 [cs.CV] UPDATED)
    (3 min) The COVID-19 virus has caused a global pandemic since March 2020. The World Health Organization (WHO) has provided guidelines on how to reduce the spread of the virus and one of the most important measures is social distancing. Maintaining a minimum of one meter distance from other people is strongly suggested to reduce the risk of infection. This has created a strong interest in monitoring the social distances either as a safety measure or to study how the measures have affected human behavior and country-wise differences in this. The need for automatic social distance estimation algorithms is evident, but there is no suitable test benchmark for such algorithms. Collecting images with measured ground-truth pair-wise distances between all the people using different camera settings is cumbersome. Furthermore, performance evaluation for social distance estimation algorithms is not straightforward and there is no widely accepted evaluation protocol. In this paper, we provide a dataset of varying images with measured pair-wise social distances under different camera positionings and focal length values. We suggest a performance evaluation protocol and provide a benchmark to easily evaluate social distance estimation algorithms. We also propose a method for automatic social distance estimation. Our method takes advantage of object detection and human pose estimation. It can be applied on any single image as long as focal length and sensor size information are known. The results on our benchmark are encouraging with 92% human detection rate and only 28.9% average error in distance estimation among the detected people.
    Joint detection and matching of feature points in multimodal images. (arXiv:1810.12941v3 [cs.CV] UPDATED)
    (2 min) In this work, we propose a novel Convolutional Neural Network (CNN) architecture for the joint detection and matching of feature points in images acquired by different sensors using a single forward pass. The resulting feature detector is tightly coupled with the feature descriptor, in contrast to classical approaches (SIFT, etc.), where the detection phase precedes and differs from computing the descriptor. Our approach utilizes two CNN subnetworks, the first being a Siamese CNN and the second, consisting of dual non-weight-sharing CNNs. This allows simultaneous processing and fusion of the joint and disjoint cues in the multimodal image patches. The proposed approach is experimentally shown to outperform contemporary state-of-the-art schemes when applied to multiple datasets of multimodal images. It is also shown to provide repeatable feature points detections across multisensor images, outperforming state-of-the-art detectors. To the best of our knowledge, it is the first unified approach for the detection and matching of such images.
    Smoothing the Disentangled Latent Style Space for Unsupervised Image-to-Image Translation. (arXiv:2106.09016v1 [cs.CV])
    (2 min) Image-to-Image (I2I) multi-domain translation models are usually evaluated also using the quality of their semantic interpolation results. However, state-of-the-art models frequently show abrupt changes in the image appearance during interpolation, and usually perform poorly in interpolations across domains. In this paper, we propose a new training protocol based on three specific losses which help a translation network to learn a smooth and disentangled latent style space in which: 1) Both intra- and inter-domain interpolations correspond to gradual changes in the generated images and 2) The content of the source image is better preserved during the translation. Moreover, we propose a novel evaluation metric to properly measure the smoothness of latent style space of I2I translation models. The proposed method can be plugged into existing translation approaches, and our extensive experiments on different datasets show that it can significantly boost the quality of the generated images and the graduality of the interpolations.
    Local plasticity rules can learn deep representations using self-supervised contrastive predictions. (arXiv:2010.08262v4 [cs.LG] UPDATED)
    (2 min) Learning in the brain is poorly understood and learning rules that respect biological constraints, yet yield deep hierarchical representations, are still unknown. Here, we propose a learning rule that takes inspiration from neuroscience and recent advances in self-supervised deep learning. Learning minimizes a simple layer-specific loss function and does not need to back-propagate error signals within or between layers. Instead, weight updates follow a local, Hebbian, learning rule that only depends on pre- and post-synaptic neuronal activity, predictive dendritic input and widely broadcasted modulation factors which are identical for large groups of neurons. The learning rule applies contrastive predictive learning to a causal, biological setting using saccades (i.e. rapid shifts in gaze direction). We find that networks trained with this self-supervised and local rule build deep hierarchical representations of images, speech and video.
    Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch. (arXiv:2106.08970v1 [cs.LG])
    (2 min) As the curation of data for machine learning becomes increasingly automated, dataset tampering is a mounting threat. Backdoor attackers tamper with training data to embed a vulnerability in models that are trained on that data. This vulnerability is then activated at inference time by placing a "trigger" into the model's input. Typical backdoor attacks insert the trigger directly into the training data, although the presence of such an attack may be visible upon inspection. In contrast, the Hidden Trigger Backdoor Attack achieves poisoning without placing a trigger into the training data at all. However, this hidden trigger attack is ineffective at poisoning neural networks trained from scratch. We develop a new hidden trigger attack, Sleeper Agent, which employs gradient matching, data selection, and target model re-training during the crafting process. Sleeper Agent is the first hidden trigger backdoor attack to be effective against neural networks trained from scratch. We demonstrate its effectiveness on ImageNet and in black-box settings. Our implementation code can be found at https://github.com/hsouri/Sleeper-Agent.
    The Oxford Road Boundaries Dataset. (arXiv:2106.08983v1 [cs.CV])
    (2 min) In this paper we present the Oxford Road Boundaries Dataset, designed for training and testing machine-learning-based road-boundary detection and inference approaches. We have hand-annotated two of the 10 km-long forays from the Oxford Robotcar Dataset and generated from other forays several thousand further examples with semi-annotated road-boundary masks. To boost the number of training samples in this way, we used a vision-based localiser to project labels from the annotated datasets to other traversals at different times and weather conditions. As a result, we release 62605 labelled samples, of which 47639 samples are curated. Each of these samples contains both raw and classified masks for left and right lenses. Our data contains images from a diverse set of scenarios such as straight roads, parked cars, junctions, etc. Files for download and tools for manipulating the labelled data are available at: oxford-robotics-institute.github.io/road-boundaries-dataset
    LARNet: Lie Algebra Residual Network for Face Recognition. (arXiv:2103.08147v2 [cs.CV] UPDATED)
    (2 min) Face recognition is an important yet challenging problem in computer vision. A major challenge in practical face recognition applications lies in significant variations between profile and frontal faces. Traditional techniques address this challenge either by synthesizing frontal faces or by pose invariant learning. In this paper, we propose a novel method with Lie algebra theory to explore how face rotation in the 3D space affects the deep feature generation process of convolutional neural networks (CNNs). We prove that face rotation in the image space is equivalent to an additive residual component in the feature space of CNNs, which is determined solely by the rotation. Based on this theoretical finding, we further design a Lie Algebraic Residual Network (LARNet) for tackling pose robust face recognition. Our LARNet consists of a residual subnet for decoding rotation information from input face images, and a gating subnet to learn rotation magnitude for controlling the strength of the residual component contributing to the feature learning process. Comprehensive experimental evaluations on both frontal-profile face datasets and general face recognition datasets convincingly demonstrate that our method consistently outperforms the state-of-the-art ones.
    Cascading Modular Network (CAM-Net) for Multimodal Image Synthesis. (arXiv:2106.09015v1 [cs.CV])
    (2 min) Deep generative models such as GANs have driven impressive advances in conditional image synthesis in recent years. A persistent challenge has been to generate diverse versions of output images from the same input image, due to the problem of mode collapse: because only one ground truth output image is given per input image, only one mode of the conditional distribution is modelled. In this paper, we focus on this problem of multimodal conditional image synthesis and build on the recently proposed technique of Implicit Maximum Likelihood Estimation (IMLE). Prior IMLE-based methods required different architectures for different tasks, which limit their applicability, and were lacking in fine details in the generated images. We propose CAM-Net, a unified architecture that can be applied to a broad range of tasks. Additionally, it is capable of generating convincing high frequency details, achieving a reduction of the Frechet Inception Distance (FID) by up to 45.3% compared to the baseline.
    Black-Box Dissector: Towards Erasing-based Hard-Label Model Stealing Attack. (arXiv:2105.00623v2 [cs.CV] UPDATED)
    (2 min) Previous studies have verified that the functionality of black-box models can be stolen with full probability outputs. However, under the more practical hard-label setting, we observe that existing methods suffer from catastrophic performance degradation. We argue this is due to the lack of rich information in the probability prediction and the overfitting caused by hard labels. To this end, we propose a novel hard-label model stealing method termed \emph{black-box dissector}, which consists of two erasing-based modules. One is a CAM-driven erasing strategy that is designed to increase the information capacity hidden in hard labels from the victim model. The other is a random-erasing-based self-knowledge distillation module that utilizes soft labels from the substitute model to mitigate overfitting. Extensive experiments on four widely-used datasets consistently demonstrate that our method outperforms state-of-the-art methods, with an improvement of at most $8.27\%$. We also validate the effectiveness and practical potential of our method on real-world APIs and defense methods. Furthermore, our method promotes other downstream tasks, \emph{i.e.}, transfer adversarial attacks.
    LCDNet: Deep Loop Closure Detection and Point Cloud Registration for LiDAR SLAM. (arXiv:2103.05056v2 [cs.RO] UPDATED)
    (2 min) Loop closure detection is an essential component of Simultaneous Localization and Mapping (SLAM) systems, which reduces the drift accumulated over time. Over the years, several deep learning approaches have been proposed to address this task, however their performance has been subpar compared to handcrafted techniques, especially while dealing with reverse loops. In this paper, we introduce the novel LCDNet that effectively detects loop closures in LiDAR point clouds by simultaneously identifying previously visited places and estimating the 6-DoF relative transformation between the current scan and the map. LCDNet is composed of a shared encoder, a place recognition head that extracts global descriptors, and a relative pose head that estimates the transformation between two point clouds. We introduce a novel relative pose head based on the unbalanced optimal transport theory that we implement in a differentiable manner to allow for end-to-end training. Extensive evaluations of LCDNet on multiple real-world autonomous driving datasets show that our approach outperforms state-of-the-art loop closure detection and point cloud registration techniques by a large margin, especially while dealing with reverse loops. Moreover, we integrate our proposed loop closure detection approach into a LiDAR SLAM library to provide a complete mapping system and demonstrate the generalization ability using different sensor setup in an unseen city.
    Improved CNN-based Learning of Interpolation Filters for Low-Complexity Inter Prediction in Video Coding. (arXiv:2106.08936v1 [eess.IV])
    (2 min) The versatility of recent machine learning approaches makes them ideal for improvement of next generation video compression solutions. Unfortunately, these approaches typically bring significant increases in computational complexity and are difficult to interpret into explainable models, affecting their potential for implementation within practical video coding applications. This paper introduces a novel explainable neural network-based inter-prediction scheme, to improve the interpolation of reference samples needed for fractional precision motion compensation. The approach requires a single neural network to be trained from which a full quarter-pixel interpolation filter set is derived, as the network is easily interpretable due to its linear structure. A novel training framework enables each network branch to resemble a specific fractional shift. This practical solution makes it very efficient to use alongside conventional video coding schemes. When implemented in the context of the state-of-the-art Versatile Video Coding (VVC) test model, 0.77%, 1.27% and 2.25% BD-rate savings can be achieved on average for lower resolution sequences under the random access, low-delay B and low-delay P configurations, respectively, while the complexity of the learned interpolation schemes is significantly reduced compared to the interpolation with full CNNs.
    PGMAN: An Unsupervised Generative Multi-adversarial Network for Pan-sharpening. (arXiv:2012.09054v2 [eess.IV] UPDATED)
    (2 min) Pan-sharpening aims at fusing a low-resolution (LR) multi-spectral (MS) image and a high-resolution (HR) panchromatic (PAN) image acquired by a satellite to generate an HR MS image. Many deep learning based methods have been developed in the past few years. However, since there are no intended HR MS images as references for learning, almost all of the existing methods down-sample the MS and PAN images and regard the original MS images as targets to form a supervised setting for training. These methods may perform well on the down-scaled images, however, they generalize poorly to the full-resolution images. To conquer this problem, we design an unsupervised framework that is able to learn directly from the full-resolution images without any preprocessing. The model is built based on a novel generative multi-adversarial network. We use a two-stream generator to extract the modality-specific features from the PAN and MS images, respectively, and develop a dual-discriminator to preserve the spectral and spatial information of the inputs when performing fusion. Furthermore, a novel loss function is introduced to facilitate training under the unsupervised setting. Experiments and comparisons with other state-of-the-art methods on GaoFen-2 and QuickBird images demonstrate that the proposed method can obtain much better fusion results on the full-resolution images.
    Selection of Source Images Heavily Influences the Effectiveness of Adversarial Attacks. (arXiv:2106.07141v2 [cs.CV] UPDATED)
    (2 min) Although the adoption rate of deep neural networks (DNNs) has tremendously increased in recent years, a solution for their vulnerability against adversarial examples has not yet been found. As a result, substantial research efforts are dedicated to fix this weakness, with many studies typically using a subset of source images to generate adversarial examples, treating every image in this subset as equal. We demonstrate that, in fact, not every source image is equally suited for this kind of assessment. To do so, we devise a large-scale model-to-model transferability scenario for which we meticulously analyze the properties of adversarial examples, generated from every suitable source image in ImageNet by making use of two of the most frequently deployed attacks. In this transferability scenario, which involves seven distinct DNN models, including the recently proposed vision transformers, we reveal that it is possible to have a difference of up to $12.5\%$ in model-to-model transferability success, $1.01$ in average $L_2$ perturbation, and $0.03$ ($8/225$) in average $L_{\infty}$ perturbation when $1,000$ source images are sampled randomly among all suitable candidates. We then take one of the first steps in evaluating the robustness of images used to create adversarial examples, proposing a number of simple but effective methods to identify unsuitable source images, thus making it possible to mitigate extreme cases in experimentation and support high-quality benchmarking.
    Optimality of short-term synaptic plasticity in modelling certain dynamic environments. (arXiv:2009.06808v2 [cs.NE] UPDATED)
    (2 min) Biological neurons and their in-silico emulations for neuromorphic artificial intelligence (AI) use extraordinarily energy-efficient mechanisms, such as spike-based communication and local synaptic plasticity. It remains unclear whether these neuronal mechanisms only offer efficiency or also underlie the superiority of biological intelligence. Here, we prove rigorously that, indeed, the Bayes-optimal prediction and inference of randomly but continuously transforming environments, a common natural setting, relies on short-term spike-timing-dependent plasticity, a hallmark of biological synapses. Further, this dynamic Bayesian inference through plasticity enables circuits of the cerebral cortex in simulations to recognize previously unseen, highly distorted dynamic stimuli. Strikingly, this also introduces a biologically-modelled AI, the first to overcome multiple limitations of deep learning and outperform artificial neural networks in a visual task. The cortical-like network is spiking and event-based, trained only with unsupervised and local plasticity, on a small, narrow, and static training dataset, but achieves recognition of unseen, transformed, and dynamic data better than deep neural networks with continuous activations, trained with supervised backpropagation on the transforming data. These results link short-term plasticity to high-level cortical function, suggest optimality of natural intelligence for natural environments, and repurpose neuromorphic AI from mere efficiency to computational supremacy altogether.
    Towards Evaluating and Training Verifiably Robust Neural Networks. (arXiv:2104.00447v3 [cs.CV] UPDATED)
    (2 min) Recent works have shown that interval bound propagation (IBP) can be used to train verifiably robust neural networks. Reseachers observe an intriguing phenomenon on these IBP trained networks: CROWN, a bounding method based on tight linear relaxation, often gives very loose bounds on these networks. We also observe that most neurons become dead during the IBP training process, which could hurt the representation capability of the network. In this paper, we study the relationship between IBP and CROWN, and prove that CROWN is always tighter than IBP when choosing appropriate bounding lines. We further propose a relaxed version of CROWN, linear bound propagation (LBP), that can be used to verify large networks to obtain lower verified errors than IBP. We also design a new activation function, parameterized ramp function (ParamRamp), which has more diversity of neuron status than ReLU. We conduct extensive experiments on MNIST, CIFAR-10 and Tiny-ImageNet with ParamRamp activation and achieve state-of-the-art verified robustness. Code and the appendix are available at https://github.com/ZhaoyangLyu/VerifiablyRobustNN.
    CRAFT: A Benchmark for Causal Reasoning About Forces and inTeractions. (arXiv:2012.04293v2 [cs.AI] UPDATED)
    (2 min) Humans are able to perceive, understand and reason about physical events. Developing models with similar physical understanding capabilities is a long-standing goal of artificial intelligence. As a step towards this goal, in this work, we introduce CRAFT, a new visual question answering dataset that requires causal reasoning about physical forces and object interactions. It contains 58K video and question pairs that are generated from 10K videos from 20 different virtual environments, containing various objects in motion that interact with each other and the scene. Two question categories from CRAFT include previously studied descriptive and counterfactual questions. Besides, inspired by the theories of force dynamics in cognitive linguistics, we introduce new question categories that involve understanding the interactions of objects through the notions of cause, enable, and prevent. Our results demonstrate that even though these tasks seem to be simple and intuitive for humans, the evaluated baseline models, including existing state-of-the-art methods, do not yet deal with the challenges posed in our benchmark dataset.
    Keep the Gradients Flowing: Using Gradient Flow to Study Sparse Network Optimization. (arXiv:2102.01670v2 [cs.LG] UPDATED)
    (2 min) Training sparse networks to converge to the same performance as dense neural architectures has proven to be elusive. Recent work suggests that initialization is the key. However, while this direction of research has had some success, focusing on initialization alone appears to be inadequate. In this paper, we take a broader view of training sparse networks and consider the role of regularization, optimization, and architecture choices on sparse models. We propose a simple experimental framework, Same Capacity Sparse vs Dense Comparison (SC-SDC), that allows for a fair comparison of sparse and dense networks. Furthermore, we propose a new measure of gradient flow, Effective Gradient Flow (EGF), that better correlates to performance in sparse networks. Using top-line metrics, SC-SDC and EGF, we show that default choices of optimizers, activation functions and regularizers used for dense networks can disadvantage sparse networks. Based upon these findings, we show that gradient flow in sparse networks can be improved by reconsidering aspects of the architecture design and the training regime. Our work suggests that initialization is only one piece of the puzzle and taking a wider view of tailoring optimization to sparse networks yields promising results.
    Multitask 3D CBCT-to-CT Translation and Organs-at-Risk Segmentation Using Physics-Based Data Augmentation. (arXiv:2103.05690v2 [cs.CV] UPDATED)
    (2 min) Purpose: In current clinical practice, noisy and artifact-ridden weekly cone-beam computed tomography (CBCT) images are only used for patient setup during radiotherapy. Treatment planning is done once at the beginning of the treatment using high-quality planning CT (pCT) images and manual contours for organs-at-risk (OARs) structures. If the quality of the weekly CBCT images can be improved while simultaneously segmenting OAR structures, this can provide critical information for adapting radiotherapy mid-treatment as well as for deriving biomarkers for treatment response. Methods: Using a novel physics-based data augmentation strategy, we synthesize a large dataset of perfectly/inherently registered planning CT and synthetic-CBCT pairs for locally advanced lung cancer patient cohort, which are then used in a multitask 3D deep learning framework to simultaneously segment and translate real weekly CBCT images to high-quality planning CT-like images. Results: We compared the synthetic CT and OAR segmentations generated by the model to real planning CT and manual OAR segmentations and showed promising results. The real week 1 (baseline) CBCT images which had an average MAE of 162.77 HU compared to pCT images are translated to synthetic CT images that exhibit a drastically improved average MAE of 29.31 HU and average structural similarity of 92% with the pCT images. The average DICE scores of the 3D organs-at-risk segmentations are: lungs 0.96, heart 0.88, spinal cord 0.83 and esophagus 0.66. Conclusions: We demonstrate an approach to translate artifact-ridden CBCT images to high quality synthetic CT images while simultaneously generating good quality segmentation masks for different organs-at-risk. This approach could allow clinicians to adjust treatment plans using only the routine low-quality CBCT images, potentially improving patient outcomes.
    Differentiable Diffusion for Dense Depth Estimation from Multi-view Images. (arXiv:2106.08917v1 [cs.CV])
    (2 min) We present a method to estimate dense depth by optimizing a sparse set of points such that their diffusion into a depth map minimizes a multi-view reprojection error from RGB supervision. We optimize point positions, depths, and weights with respect to the loss by differential splatting that models points as Gaussians with analytic transmittance. Further, we develop an efficient optimization routine that can simultaneously optimize the 50k+ points required for complex scene reconstruction. We validate our routine using ground truth data and show high reconstruction quality. Then, we apply this to light field and wider baseline images via self supervision, and show improvements in both average and outlier error for depth maps diffused from inaccurate sparse points. Finally, we compare qualitative and quantitative results to image processing and deep learning methods.
    Learning Category- and Instance-Aware Pixel Embedding for Fast Panoptic Segmentation. (arXiv:2009.13342v2 [cs.CV] UPDATED)
    (2 min) Panoptic segmentation (PS) is a complex scene understanding task that requires providing high-quality segmentation for both thing objects and stuff regions. Previous methods handle these two classes with semantic and instance segmentation modules separately, following with heuristic fusion or additional modules to resolve the conflicts between the two outputs. This work simplifies this pipeline of PS by consistently modeling the two classes with a novel PS framework, which extends a detection model with an extra module to predict category- and instance-aware pixel embedding (CIAE). CIAE is a novel pixel-wise embedding feature that encodes both semantic-classification and instance-distinction information. At the inference process, PS results are simply derived by assigning each pixel to a detected instance or a stuff class according to the learned embedding. Our method not only demonstrates fast inference speed but also the first one-stage method to achieve comparable performance to two-stage methods on the challenging COCO benchmark.
    CLAWS: Clustering Assisted Weakly Supervised Learning with Normalcy Suppression for Anomalous Event Detection. (arXiv:2011.12077v3 [cs.CV] UPDATED)
    (2 min) Learning to detect real-world anomalous events through video-level labels is a challenging task due to the rare occurrence of anomalies as well as noise in the labels. In this work, we propose a weakly supervised anomaly detection method which has manifold contributions including1) a random batch based training procedure to reduce inter-batch correlation, 2) a normalcy suppression mechanism to minimize anomaly scores of the normal regions of a video by taking into account the overall information available in one training batch, and 3) a clustering distance based loss to contribute towards mitigating the label noise and to produce better anomaly representations by encouraging our model to generate distinct normal and anomalous clusters. The proposed method obtains83.03% and 89.67% frame-level AUC performance on the UCF Crime and ShanghaiTech datasets respectively, demonstrating its superiority over the existing state-of-the-art algorithms.
    CloudCast: A Satellite-Based Dataset and Baseline for Forecasting Clouds. (arXiv:2007.07978v2 [cs.CV] UPDATED)
    (2 min) Forecasting the formation and development of clouds is a central element of modern weather forecasting systems. Incorrect clouds forecasts can lead to major uncertainty in the overall accuracy of weather forecasts due to their intrinsic role in the Earth's climate system. Few studies have tackled this challenging problem from a machine learning point-of-view due to a shortage of high-resolution datasets with many historical observations globally. In this paper, we present a novel satellite-based dataset called ``CloudCast''. It consists of 70,080 images with 10 different cloud types for multiple layers of the atmosphere annotated on a pixel level. The spatial resolution of the dataset is 928 x 1530 pixels (3x3 km per pixel) with 15-min intervals between frames for the period 2017-01-01 to 2018-12-31. All frames are centered and projected over Europe. To supplement the dataset, we conduct an evaluation study with current state-of-the-art video prediction methods such as convolutional long short-term memory networks, generative adversarial networks, and optical flow-based extrapolation methods. As the evaluation of video prediction is difficult in practice, we aim for a thorough evaluation in the spatial and temporal domain. Our benchmark models show promising results but with ample room for improvement. This is the first publicly available global-scale dataset with high-resolution cloud types on a high temporal granularity to the authors' best knowledge.
    Point and Ask: Incorporating Pointing into Visual Question Answering. (arXiv:2011.13681v2 [cs.CV] UPDATED)
    (2 min) Visual Question Answering (VQA) has become one of the key benchmarks of visual recognition progress. Multiple VQA extensions have been explored to better simulate real-world settings: different question formulations, changing training and test distributions, conversational consistency in dialogues, and explanation-based answering. In this work, we further expand this space by considering visual questions that include a spatial point of reference. Pointing is a nearly universal gesture among humans, and real-world VQA is likely to involve a gesture towards the target region. Concretely, we (1) introduce and motivate point-input questions as an extension of VQA, (2) define three novel classes of questions within this space, and (3) for each class, introduce both a benchmark dataset and a series of baseline models to handle its unique challenges. There are two key distinctions from prior work. First, we explicitly design the benchmarks to require the point input, i.e., we ensure that the visual question cannot be answered accurately without the spatial reference. Second, we explicitly explore the more realistic point spatial input rather than the standard but unnatural bounding box input. Through our exploration we uncover and address several visual recognition challenges, including the ability to infer human intent, reason both locally and globally about the image, and effectively combine visual, language and spatial inputs. Code is available at: https://github.com/princetonvisualai/pointingqa .
    Improving filling level classification with adversarial training. (arXiv:2102.04057v2 [cs.CV] UPDATED)
    (2 min) We investigate the problem of classifying - from a single image - the level of content in a cup or a drinking glass. This problem is made challenging by several ambiguities caused by transparencies, shape variations and partial occlusions, and by the availability of only small training datasets. In this paper, we tackle this problem with an appropriate strategy for transfer learning. Specifically, we use adversarial training in a generic source dataset and then refine the training with a task-specific dataset. We also discuss and experimentally evaluate several training strategies and their combination on a range of container types of the CORSMAL Containers Manipulation dataset. We show that transfer learning with adversarial training in the source domain consistently improves the classification accuracy on the test set and limits the overfitting of the classifier to specific features of the training data.
    DSRN: an Efficient Deep Network for Image Relighting. (arXiv:2102.09242v2 [cs.CV] UPDATED)
    (2 min) Custom and natural lighting conditions can be emulated in images of the scene during post-editing. Extraordinary capabilities of the deep learning framework can be utilized for such purpose. Deep image relighting allows automatic photo enhancement by illumination-specific retouching. Most of the state-of-the-art methods for relighting are run-time intensive and memory inefficient. In this paper, we propose an efficient, real-time framework Deep Stacked Relighting Network (DSRN) for image relighting by utilizing the aggregated features from input image at different scales. Our model is very lightweight with total size of about 42 MB and has an average inference time of about 0.0116s for image of resolution $1024 \times 1024$ which is faster as compared to other multi-scale models. Our solution is quite robust for translating image color temperature from input image to target image and also performs moderately for light gradient generation with respect to the target image. Additionally, we show that if images illuminated from opposite directions are used as input, the qualitative results improve over using a single input image.
    Guided interactive image segmentation using machine learning and color based data set clustering. (arXiv:2005.07662v2 [cs.CV] UPDATED)
    (2 min) We present a novel approach that combines machine learning based interactive image segmentation with a two-stage clustering method to identify similarly colored images for efficient batch image segmentation by guided reuse of classifiers. The segmentation task is formulated as a supervised machine learning problem working on homogeneous groups of voxels termed supervoxels. Classifiers are interactively trained from sparse annotations in an iterative process of annotation refinement. Resulting models can be used for batch processing of previously unseen images. By clustering images into subsets of similar colorization, we identify a minimal set of prototype images and demonstrate that using only classifiers trained on these prototype images for their color-cluster significantly improves the average segmentation performance of batch processing. The presented methods are applicable for almost any image type and therefore represent a useful tool for image analysis tasks in general.
    Learning to Disentangle GAN Fingerprint for Fake Image Attribution. (arXiv:2106.08749v1 [cs.CV])
    (2 min) Rapid pace of generative models has brought about new threats to visual forensics such as malicious personation and digital copyright infringement, which promotes works on fake image attribution. Existing works on fake image attribution mainly rely on a direct classification framework. Without additional supervision, the extracted features could include many content-relevant components and generalize poorly. Meanwhile, how to obtain an interpretable GAN fingerprint to explain the decision remains an open question. Adopting a multi-task framework, we propose a GAN Fingerprint Disentangling Network (GFD-Net) to simultaneously disentangle the fingerprint from GAN-generated images and produce a content-irrelevant representation for fake image attribution. A series of constraints are provided to guarantee the stability and discriminability of the fingerprint, which in turn helps content-irrelevant feature extraction. Further, we perform comprehensive analysis on GAN fingerprint, providing some clues about the properties of GAN fingerprint and which factors dominate the fingerprint in GAN architecture. Experiments show that our GFD-Net achieves superior fake image attribution performance in both closed-world and open-world testing. We also apply our method in binary fake image detection and exhibit a significant generalization ability on unseen generators.
    The shape and simplicity biases of adversarially robust ImageNet-trained CNNs. (arXiv:2006.09373v4 [cs.CV] UPDATED)
    (2 min) Adversarial training has been the topic of dozens of studies and a leading method for defending against adversarial attacks. Yet, it remains largely unknown (a) how adversarially-robust ImageNet classifiers (R classifiers) generalize to out-of-distribution examples; and (b) how their generalization capability relates to their hidden representations. In this paper, we perform a thorough, systematic study to answer these two questions across AlexNet, GoogLeNet, and ResNet-50 architectures. We found that while standard ImageNet classifiers have a strong texture bias, their R counterparts rely heavily on shapes. Remarkably, adversarial training induces three simplicity biases into hidden neurons in the process of 'robustifying' the network. That is, each convolutional neuron in R networks often changes to detecting (1) pixel-wise smoother patterns i.e. a mechanism that blocks high-frequency noise from passing through the network; (2) more lower-level features i.e. textures and colors (instead of objects); and (3) fewer types of inputs. Our findings reveal the interesting mechanisms that made networks more adversarially robust and also explain some recent findings. Our findings reveal the interesting mechanisms that made networks more adversarially robust and also explain some recent findings e.g. why R networks benefit from much larger capacity (Xie and Yuille, 2020) and can act as a strong image prior in image synthesis (Santurkar et al., 2019).
    An unifying point of view on expressive power of GNNs. (arXiv:2106.08992v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) are a wide class of connectionist models for graph processing. They perform an iterative message passing operation on each node and its neighbors, to solve classification/ clustering tasks --- on some nodes or on the whole graph --- collecting all such messages, regardless of their order. Despite the differences among the various models belonging to this class, most of them adopt the same computation scheme, based on a local aggregation mechanism and, intuitively, the local computation framework is mainly responsible for the expressive power of GNNs. In this paper, we prove that the Weisfeiler--Lehman test induces an equivalence relationship on the graph nodes that exactly corresponds to the unfolding equivalence, defined on the original GNN model. Therefore, the results on the expressive power of the original GNNs can be extended to general GNNs which, under mild conditions, can be proved capable of approximating, in probability and up to any precision, any function on graphs that respects the unfolding equivalence.
    Multi-Resolution Continuous Normalizing Flows. (arXiv:2106.08462v1 [cs.CV])
    (2 min) Recent work has shown that Neural Ordinary Differential Equations (ODEs) can serve as generative models of images using the perspective of Continuous Normalizing Flows (CNFs). Such models offer exact likelihood calculation, and invertible generation/density estimation. In this work we introduce a Multi-Resolution variant of such models (MRCNF), by characterizing the conditional distribution over the additional information required to generate a fine image that is consistent with the coarse image. We introduce a transformation between resolutions that allows for no change in the log likelihood. We show that this approach yields comparable likelihood values for various image datasets, with improved performance at higher resolutions, with fewer parameters, using only 1 GPU.
    Polynomial Trajectory Predictions for Improved Learning Performance. (arXiv:2101.12616v2 [cs.CV] UPDATED)
    (2 min) The rising demand for Active Safety systems in automotive applications stresses the need for a reliable short to mid-term trajectory prediction. Anticipating the unfolding path of road users, one can act to increase the overall safety. In this work, we propose to train artificial neural networks for movement understanding by predicting trajectories in their natural form, as a function of time. Predicting polynomial coefficients allows us to increased accuracy and improve generalisation.
    Quantifying the Preferential Direction of the Model Gradient in Adversarial Training With Projected Gradient Descent. (arXiv:2009.04709v3 [stat.ML] UPDATED)
    (2 min) Adversarial training, especially projected gradient descent (PGD), has been a successful approach for improving robustness against adversarial attacks. After adversarial training, gradients of models with respect to their inputs have a preferential direction. However, the direction of alignment is not mathematically well established, making it difficult to evaluate quantitatively. We propose a novel definition of this direction as the direction of the vector pointing toward the closest point of the support of the closest inaccurate class in decision space. To evaluate the alignment with this direction after adversarial training, we apply a metric that uses generative adversarial networks to produce the smallest residual needed to change the class present in the image. We show that PGD-trained models have a higher alignment than the baseline according to our definition, that our metric presents higher alignment values than a competing metric formulation, and that enforcing this alignment increases the robustness of models.
    Machine learning-based analysis of hyperspectral images for automated sepsis diagnosis. (arXiv:2106.08445v1 [cs.LG])
    (3 min) Sepsis is a leading cause of mortality and critical illness worldwide. While robust biomarkers for early diagnosis are still missing, recent work indicates that hyperspectral imaging (HSI) has the potential to overcome this bottleneck by monitoring microcirculatory alterations. Automated machine learning-based diagnosis of sepsis based on HSI data, however, has not been explored to date. Given this gap in the literature, we leveraged an existing data set to (1) investigate whether HSI-based automated diagnosis of sepsis is possible and (2) put forth a list of possible confounders relevant for HSI-based tissue classification. While we were able to classify sepsis with an accuracy of over $98\,\%$ using the existing data, our research also revealed several subject-, therapy- and imaging-related confounders that may lead to an overestimation of algorithm performance when not balanced across the patient groups. We conclude that further prospective studies, carefully designed with respect to these confounders, are necessary to confirm the preliminary results obtained in this study.
    End-to-End Semi-Supervised Object Detection with Soft Teacher. (arXiv:2106.09018v1 [cs.CV])
    (2 min) This paper presents an end-to-end semi-supervised object detection approach, in contrast to previous more complex multi-stage methods. The end-to-end training gradually improves pseudo label qualities during the curriculum, and the more and more accurate pseudo labels in turn benefit object detection training. We also propose two simple yet effective techniques within this framework: a soft teacher mechanism where the classification loss of each unlabeled bounding box is weighed by the classification score produced by the teacher network; a box jittering approach to select reliable pseudo boxes for the learning of box regression. On COCO benchmark, the proposed approach outperforms previous methods by a large margin under various labeling ratios, i.e. 1\%, 5\% and 10\%. Moreover, our approach proves to perform also well when the amount of labeled data is relatively large. For example, it can improve a 40.9 mAP baseline detector trained using the full COCO training set by +3.6 mAP, reaching 44.5 mAP, by leveraging the 123K unlabeled images of COCO. On the state-of-the-art Swin Transformer-based object detector (58.9 mAP on test-dev), it can still significantly improve the detection accuracy by +1.5 mAP, reaching 60.4 mAP, and improve the instance segmentation accuracy by +1.2 mAP, reaching 52.4 mAP, pushing the new state-of-the-art.
    Adaptive Feature Alignment for Adversarial Training. (arXiv:2105.15157v2 [cs.CV] UPDATED)
    (2 min) Recent studies reveal that Convolutional Neural Networks (CNNs) are typically vulnerable to adversarial attacks, which pose a threat to security-sensitive applications. Many adversarial defense methods improve robustness at the cost of accuracy, raising the contradiction between standard and adversarial accuracies. In this paper, we observe an interesting phenomenon that feature statistics change monotonically and smoothly w.r.t the rising of attacking strength. Based on this observation, we propose the adaptive feature alignment (AFA) to generate features of arbitrary attacking strengths. Our method is trained to automatically align features of arbitrary attacking strength. This is done by predicting a fusing weight in a dual-BN architecture. Unlike previous works that need to either retrain the model or manually tune a hyper-parameters for different attacking strengths, our method can deal with arbitrary attacking strengths with a single model without introducing any hyper-parameter. Importantly, our method improves the model robustness against adversarial samples without incurring much loss in standard accuracy. Experiments on CIFAR-10, SVHN, and tiny-ImageNet datasets demonstrate that our method outperforms the state-of-the-art under a wide range of attacking strengths.
    Explaining decision of model from its prediction. (arXiv:2106.08366v1 [cs.CV])
    (2 min) This document summarizes different visual explanations methods such as CAM, Grad-CAM, Localization using Multiple Instance Learning - Saliency-based methods, Saliency-driven Class-Impressions, Muting pixels in input image - Adversarial methods and Activation visualization, Convolution filter visualization - Feature-based methods. We have also shown the results produced by different methods and a comparison between CAM, GradCAM, and Guided Backpropagation.
    Split and Expand: An inference-time improvement for Weakly Supervised Cell Instance Segmentation. (arXiv:2007.10817v2 [cs.CV] UPDATED)
    (2 min) We consider the problem of segmenting cell nuclei instances from Hematoxylin and Eosin (H&E) stains with dot annotations only. While most recent works focus on improving the segmentation quality, this is usually insufficient for instance segmentation of cell instances clustered together or with a small size. In this work, we propose a simple two-step post-processing procedure, Split and Expand, that directly improves the conversion of segmentation maps to instances. In the splitting step, we generate fine-grained cell instances from the segmentation map with the guidance of cell-center predictions. For the expansion step, we utilize Layer-wise Relevance Propagation (LRP) explanation results to add small cells that are not captured in the segmentation map. Although we additionally train an output head to predict cell-centers, the post-processing procedure itself is not explicitly trained and is executed at inference-time only. A feature re-weighting loss based on LRP is proposed to improve our method even further. We test our procedure on the MoNuSeg and TNBC datasets and show quantitatively and qualitatively that our proposed method improves object-level metrics substantially.
    Imperfect ImaGANation: Implications of GANs Exacerbating Biases on Facial Data Augmentation and Snapchat Selfie Lenses. (arXiv:2001.09528v3 [cs.LG] UPDATED)
    (2 min) In this paper, we show that popular Generative Adversarial Networks (GANs) exacerbate biases along the axes of gender and skin tone when given a skewed distribution of face-shots. While practitioners celebrate synthetic data generation using GANs as an economical way to augment data for training data-hungry machine learning models, it is unclear whether they recognize the perils of such techniques when applied to real world datasets biased along latent dimensions. Specifically, we show that (1) traditional GANs further skew the distribution of a dataset consisting of engineering faculty headshots, generating minority modes less often and of worse quality and (2) image-to-image translation (conditional) GANs also exacerbate biases by lightening skin color of non-white faces and transforming female facial features to be masculine when generating faces of engineering professors. Thus, our study is meant to serve as a cautionary tale.
    Temporal Convolution Networks with Positional Encoding for Evoked Expression Estimation. (arXiv:2106.08596v1 [cs.CV])
    (2 min) This paper presents an approach for Evoked Expressions from Videos (EEV) challenge, which aims to predict evoked facial expressions from video. We take advantage of pre-trained models on large-scale datasets in computer vision and audio signals to extract the deep representation of timestamps in the video. A temporal convolution network, rather than an RNN like architecture, is used to explore temporal relationships due to its advantage in memory consumption and parallelism. Furthermore, to address the missing annotations of some timestamps, positional encoding is employed to ensure continuity of input data when discarding these timestamps during training. We achieved state-of-the-art results on the EEV challenge with a Pearson correlation coefficient of 0.05477, the first ranked performance in the EEV 2021 challenge.
    A Naturalness Evaluation Database for Video Prediction Models. (arXiv:2005.00356v3 [eess.IV] UPDATED)
    (2 min) The study of video prediction models is believed to be a fundamental approach to representation learning for videos. While a plethora of generative models for predicting the future frame pixel values given the past few frames exist, the quantitative evaluation of the predicted frames has been found to be extremely challenging. In this context, we introduce the problem of naturalness evaluation, which refers to how natural or realistic a predicted video looks. We create the Indian Institute of Science VIdeo Naturalness Evaluation (IISc VINE) Database consisting of 300 videos, obtained by applying different prediction models on different datasets, and accompanying human opinion scores. We collected subjective ratings of naturalness from 50 human participants for these videos. Our subjective study reveals that human observers were highly consistent in their judgments of naturalness. We benchmark several popularly used measures for evaluating video prediction and show that they do not adequately correlate with these subjective scores. We introduce two new features to effectively capture naturalness, motion-compensated cosine similarities of deep features of predicted frames with past frames, and deep features extracted from rescaled frame differences. We show that our feature design leads to state of the art naturalness prediction in accordance with human judgments on our IISc VINE Database. The database and code are publicly available on our project website: https://nagabhushansn95.github.io/publications/2020/vine
    Metamorphic image registration using a semi-Lagrangian scheme. (arXiv:2106.08817v1 [cs.CV])
    (2 min) In this paper, we propose an implementation of both Large Deformation Diffeomorphic Metric Mapping (LDDMM) and Metamorphosis image registration using a semi-Lagrangian scheme for geodesic shooting. We propose to solve both problems as an inexact matching providing a single and unifying cost function. We demonstrate that for image registration the use of a semi-Lagrangian scheme is more stable than a standard Eulerian scheme. Our GPU implementation is based on PyTorch, which greatly simplifies and accelerates the computations thanks to its powerful automatic differentiation engine. It will be freely available at https://github.com/antonfrancois/Demeter_metamorphosis.
    Compound Frechet Inception Distance for Quality Assessment of GAN Created Images. (arXiv:2106.08575v1 [cs.CV])
    (2 min) Generative adversarial networks or GANs are a type of generative modeling framework. GANs involve a pair of neural networks engaged in a competition in iteratively creating fake data, indistinguishable from the real data. One notable application of GANs is developing fake human faces, also known as "deep fakes," due to the deep learning algorithms at the core of the GAN framework. Measuring the quality of the generated images is inherently subjective but attempts to objectify quality using standardized metrics have been made. One example of objective metrics is the Frechet Inception Distance (FID), which measures the difference between distributions of feature vectors for two separate datasets of images. There are situations that images with low perceptual qualities are not assigned appropriate FID scores. We propose to improve the robustness of the evaluation process by integrating lower-level features to cover a wider array of visual defects. Our proposed method integrates three levels of feature abstractions to evaluate the quality of generated images. Experimental evaluations show better performance of the proposed method for distorted images.
    Structure First Detail Next: Image Inpainting with Pyramid Generator. (arXiv:2106.08905v1 [cs.CV])
    (2 min) Recent deep generative models have achieved promising performance in image inpainting. However, it is still very challenging for a neural network to generate realistic image details and textures, due to its inherent spectral bias. By our understanding of how artists work, we suggest to adopt a `structure first detail next' workflow for image inpainting. To this end, we propose to build a Pyramid Generator by stacking several sub-generators, where lower-layer sub-generators focus on restoring image structures while the higher-layer sub-generators emphasize image details. Given an input image, it will be gradually restored by going through the entire pyramid in a bottom-up fashion. Particularly, our approach has a learning scheme of progressively increasing hole size, which allows it to restore large-hole images. In addition, our method could fully exploit the benefits of learning with high-resolution images, and hence is suitable for high-resolution image inpainting. Extensive experimental results on benchmark datasets have validated the effectiveness of our approach compared with state-of-the-art methods.
    Multi-scale Neural ODEs for 3D Medical Image Registration. (arXiv:2106.08493v1 [cs.CV])
    (2 min) Image registration plays an important role in medical image analysis. Conventional optimization based methods provide an accurate estimation due to the iterative process at the cost of expensive computation. Deep learning methods such as learn-to-map are much faster but either iterative or coarse-to-fine approach is required to improve accuracy for handling large motions. In this work, we proposed to learn a registration optimizer via a multi-scale neural ODE model. The inference consists of iterative gradient updates similar to a conventional gradient descent optimizer but in a much faster way, because the neural ODE learns from the training data to adapt the gradient efficiently at each iteration. Furthermore, we proposed to learn a modal-independent similarity metric to address image appearance variations across different image contrasts. We performed evaluations through extensive experiments in the context of multi-contrast 3D MR images from both public and private data sources and demonstrate the superior performance of our proposed methods.
    Achieving Domain Robustness in Stereo Matching Networks by Removing Shortcut Learning. (arXiv:2106.08486v1 [cs.CV])
    (2 min) Learning-based stereo matching and depth estimation networks currently excel on public benchmarks with impressive results. However, state-of-the-art networks often fail to generalize from synthetic imagery to more challenging real data domains. This paper is an attempt to uncover hidden secrets of achieving domain robustness and in particular, discovering the important ingredients of generalization success of stereo matching networks by analyzing the effect of synthetic image learning on real data performance. We provide evidence that demonstrates that learning of features in the synthetic domain by a stereo matching network is heavily influenced by two "shortcuts" presented in the synthetic data: (1) identical local statistics (RGB colour features) between matching pixels in the synthetic stereo images and (2) lack of realism in synthetic textures on 3D objects simulated in game engines. We will show that by removing such shortcuts, we can achieve domain robustness in the state-of-the-art stereo matching frameworks and produce a remarkable performance on multiple realistic datasets, despite the fact that the networks were trained on synthetic data, only. Our experimental results point to the fact that eliminating shortcuts from the synthetic data is key to achieve domain-invariant generalization between synthetic and real data domains.
    Anomaly Detection in Video Sequences: A Benchmark and Computational Model. (arXiv:2106.08570v1 [cs.CV])
    (2 min) Anomaly detection has attracted considerable search attention. However, existing anomaly detection databases encounter two major problems. Firstly, they are limited in scale. Secondly, training sets contain only video-level labels indicating the existence of an abnormal event during the full video while lacking annotations of precise time durations. To tackle these problems, we contribute a new Large-scale Anomaly Detection (LAD) database as the benchmark for anomaly detection in video sequences, which is featured in two aspects. 1) It contains 2000 video sequences including normal and abnormal video clips with 14 anomaly categories including crash, fire, violence, etc. with large scene varieties, making it the largest anomaly analysis database to date. 2) It provides the annotation data, including video-level labels (abnormal/normal video, anomaly type) and frame-level labels (abnormal/normal video frame) to facilitate anomaly detection. Leveraging the above benefits from the LAD database, we further formulate anomaly detection as a fully-supervised learning problem and propose a multi-task deep neural network to solve it. We first obtain the local spatiotemporal contextual feature by using an Inflated 3D convolutional (I3D) network. Then we construct a recurrent convolutional neural network fed the local spatiotemporal contextual feature to extract the spatiotemporal contextual feature. With the global spatiotemporal contextual feature, the anomaly type and score can be computed simultaneously by a multi-task neural network. Experimental results show that the proposed method outperforms the state-of-the-art anomaly detection methods on our database and other public databases of anomaly detection. Codes are available at https://github.com/wanboyang/anomaly_detection_LAD2000.
    Watching Too Much Television is Good: Self-Supervised Audio-Visual Representation Learning from Movies and TV Shows. (arXiv:2106.08513v1 [cs.CV])
    (2 min) The abundance and ease of utilizing sound, along with the fact that auditory clues reveal so much about what happens in the scene, make the audio-visual space a perfectly intuitive choice for self-supervised representation learning. However, the current literature suggests that training on \textit{uncurated} data yields considerably poorer representations compared to the \textit{curated} alternatives collected in supervised manner, and the gap only narrows when the volume of data significantly increases. Furthermore, the quality of learned representations is known to be heavily influenced by the size and taxonomy of the curated datasets used for self-supervised training. This begs the question of whether we are celebrating too early on catching up with supervised learning when our self-supervised efforts still rely almost exclusively on curated data. In this paper, we study the efficacy of learning from Movies and TV Shows as forms of uncurated data for audio-visual self-supervised learning. We demonstrate that a simple model based on contrastive learning, trained on a collection of movies and TV shows, not only dramatically outperforms more complex methods which are trained on orders of magnitude larger uncurated datasets, but also performs very competitively with the state-of-the-art that learns from large-scale curated data. We identify that audiovisual patterns like the appearance of the main character or prominent scenes and mise-en-sc\`ene which frequently occur through the whole duration of a movie, lead to an overabundance of easy negative instances in the contrastive learning formulation. Capitalizing on such observation, we propose a hierarchical sampling policy, which despite its simplicity, effectively improves the performance, particularly when learning from TV shows which naturally face less semantic diversity.
    Unsupervised Domain Adaptation with Variational Approximation for Cardiac Segmentation. (arXiv:2106.08752v1 [cs.CV])
    (2 min) Unsupervised domain adaptation is useful in medical image segmentation. Particularly, when ground truths of the target images are not available, domain adaptation can train a target-specific model by utilizing the existing labeled images from other modalities. Most of the reported works mapped images of both the source and target domains into a common latent feature space, and then reduced their discrepancy either implicitly with adversarial training or explicitly by directly minimizing a discrepancy metric. In this work, we propose a new framework, where the latent features of both domains are driven towards a common and parameterized variational form, whose conditional distribution given the image is Gaussian. This is achieved by two networks based on variational auto-encoders (VAEs) and a regularization for this variational approximation. Both of the VAEs, each for one domain, contain a segmentation module, where the source segmentation is trained in a supervised manner, while the target one is trained unsupervisedly. We validated the proposed domain adaptation method using two cardiac segmentation tasks, i.e., the cross-modality (CT and MR) whole heart segmentation and the cross-sequence cardiac MR segmentation. Results show that the proposed method achieved better accuracies compared to two state-of-the-art approaches and demonstrated good potential for cardiac segmentation. Furthermore, the proposed explicit regularization was shown to be effective and efficient in narrowing down the distribution gap between domains, which is useful for unsupervised domain adaptation. Our code and data has been released via https://zmiclab.github.io/projects.html.
    Tackling the Challenges in Scene Graph Generation with Local-to-Global Interactions. (arXiv:2106.08543v1 [cs.CV])
    (2 min) In this work, we seek new insights into the underlying challenges of the Scene Graph Generation (SGG) task. Quantitative and qualitative analysis of the Visual Genome dataset implies -- 1) Ambiguity: even if inter-object relationship contains the same object (or predicate), they may not be visually or semantically similar, 2) Asymmetry: despite the nature of the relationship that embodied the direction, it was not well addressed in previous studies, and 3) Higher-order contexts: leveraging the identities of certain graph elements can help to generate accurate scene graphs. Motivated by the analysis, we design a novel SGG framework, Local-to-Global Interaction Networks (LOGIN). Locally, interactions extract the essence between three instances - subject, object, and background - while baking direction awareness into the network by constraining the input order. Globally, interactions encode the contexts between every graph components -- nodes and edges. Also we introduce Attract & Repel loss which finely adjusts predicate embeddings. Our framework enables predicting the scene graph in a local-to-global manner by design, leveraging the possible complementariness. To quantify how much LOGIN is aware of relational direction, we propose a new diagnostic task called Bidirectional Relationship Classification (BRC). We see that LOGIN can successfully distinguish relational direction than existing methods (in BRC task) while showing state-of-the-art results on the Visual Genome benchmark (in SGG task).
    $C^3$: Compositional Counterfactual Constrastive Learning for Video-grounded Dialogues. (arXiv:2106.08914v1 [cs.LG])
    (2 min) Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context. Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available. However, the results are partly accomplished by exploiting biases in the datasets rather than developing multimodal reasoning, resulting in limited generalization. In this paper, we propose a novel approach of Compositional Counterfactual Contrastive Learning ($C^3$) to develop contrastive training between factual and counterfactual samples in video-grounded dialogues. Specifically, we design factual/counterfactual sampling based on the temporal steps in videos and tokens in dialogues and propose contrastive loss functions that exploit object-level or action-level variance. Different from prior approaches, we focus on contrastive hidden state representations among compositional output tokens to optimize the representation space in a generation setting. We achieved promising performance gains on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark and showed the benefits of our approach in grounding video and dialogue context.
    2nd Place Solution for Waymo Open Dataset Challenge - Real-time 2D Object Detection. (arXiv:2106.08713v1 [cs.CV])
    (2 min) In an autonomous driving system, it is essential to recognize vehicles, pedestrians and cyclists from images. Besides the high accuracy of the prediction, the requirement of real-time running brings new challenges for convolutional network models. In this report, we introduce a real-time method to detect the 2D objects from images. We aggregate several popular one-stage object detectors and train the models of variety input strategies independently, to yield better performance for accurate multi-scale detection of each category, especially for small objects. For model acceleration, we leverage TensorRT to optimize the inference time of our detection pipeline. As shown in the leaderboard, our proposed detection framework ranks the 2nd place with 75.00% L1 mAP and 69.72% L2 mAP in the real-time 2D detection track of the Waymo Open Dataset Challenges, while our framework achieves the latency of 45.8ms/frame on an Nvidia Tesla V100 GPU.
    A Fair and Comprehensive Comparison of Multimodal Tweet Sentiment Analysis Methods. (arXiv:2106.08829v1 [cs.SI])
    (2 min) Opinion and sentiment analysis is a vital task to characterize subjective information in social media posts. In this paper, we present a comprehensive experimental evaluation and comparison with six state-of-the-art methods, from which we have re-implemented one of them. In addition, we investigate different textual and visual feature embeddings that cover different aspects of the content, as well as the recently introduced multimodal CLIP embeddings. Experimental results are presented for two different publicly available benchmark datasets of tweets and corresponding images. In contrast to the evaluation methodology of previous work, we introduce a reproducible and fair evaluation scheme to make results comparable. Finally, we conduct an error analysis to outline the limitations of the methods and possibilities for the future work.
    JRDB-Act: A Large-scale Multi-modal Dataset for Spatio-temporal Action, Social Group and Activity Detection. (arXiv:2106.08827v1 [cs.CV])
    (2 min) The availability of large-scale video action understanding datasets has facilitated advances in the interpretation of visual scenes containing people. However, learning to recognize human activities in an unconstrained real-world environment, with potentially highly unbalanced and long-tailed distributed data remains a significant challenge, not least owing to the lack of a reflective large-scale dataset. Most existing large-scale datasets are either collected from a specific or constrained environment, e.g. kitchens or rooms, or video sharing platforms such as YouTube. In this paper, we introduce JRDB-Act, a multi-modal dataset, as an extension of the existing JRDB, which is captured by asocial mobile manipulator and reflects a real distribution of human daily life actions in a university campus environment. JRDB-Act has been densely annotated with atomic actions, comprises over 2.8M action labels, constituting a large-scale spatio-temporal action detection dataset. Each human bounding box is labelled with one pose-based action label and multiple (optional) interaction-based action labels. Moreover JRDB-Act comes with social group identification annotations conducive to the task of grouping individuals based on their interactions in the scene to infer their social activities (common activities in each social group).
    Toward Affective XAI: Facial Affect Analysis for Understanding Explainable Human-AI Interactions. (arXiv:2106.08761v1 [cs.CV])
    (2 min) As machine learning approaches are increasingly used to augment human decision-making, eXplainable Artificial Intelligence (XAI) research has explored methods for communicating system behavior to humans. However, these approaches often fail to account for the emotional responses of humans as they interact with explanations. Facial affect analysis, which examines human facial expressions of emotions, is one promising lens for understanding how users engage with explanations. Therefore, in this work, we aim to (1) identify which facial affect features are pronounced when people interact with XAI interfaces, and (2) develop a multitask feature embedding for linking facial affect signals with participants' use of explanations. Our analyses and results show that the occurrence and values of facial AU1 and AU4, and Arousal are heightened when participants fail to use explanations effectively. This suggests that facial affect analysis should be incorporated into XAI to personalize explanations to individuals' interaction styles and to adapt explanations based on the difficulty of the task performed.
    Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in Real-Time MRI. (arXiv:2106.08706v1 [eess.IV])
    (2 min) Speech sounds of spoken language are obtained by varying configuration of the articulators surrounding the vocal tract. They contain abundant information that can be utilized to better understand the underlying mechanism of human speech production. We propose a novel deep neural network-based learning framework that understands acoustic information in the variable-length sequence of vocal tract shaping during speech production, captured by real-time magnetic resonance imaging (rtMRI), and translate it into text. The proposed framework comprises of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. On the USC-TIMIT corpus, the model achieved a 40.6% PER at sentence-level, much better compared to the existing models. To the best of our knowledge, this is the first study that demonstrates the recognition of entire spoken sentence based on an individual's articulatory motions captured by rtMRI video. We also performed an analysis of variations in the geometry of articulation in each sub-regions of the vocal tract (i.e., pharyngeal, velar and dorsal, hard palate, labial constriction region) with respect to different emotions and genders. Results suggest that each sub-regions distortion is affected by both emotion and gender.
    Federated Semi-supervised Medical Image Classification via Inter-client Relation Matching. (arXiv:2106.08600v1 [cs.CV])
    (2 min) Federated learning (FL) has emerged with increasing popularity to collaborate distributed medical institutions for training deep networks. However, despite existing FL algorithms only allow the supervised training setting, most hospitals in realistic usually cannot afford the intricate data labeling due to absence of budget or expertise. This paper studies a practical yet challenging FL problem, named \textit{Federated Semi-supervised Learning} (FSSL), which aims to learn a federated model by jointly utilizing the data from both labeled and unlabeled clients (i.e., hospitals). We present a novel approach for this problem, which improves over traditional consistency regularization mechanism with a new inter-client relation matching scheme. The proposed learning scheme explicitly connects the learning across labeled and unlabeled clients by aligning their extracted disease relationships, thereby mitigating the deficiency of task knowledge at unlabeled clients and promoting discriminative information from unlabeled samples. We validate our method on two large-scale medical image classification datasets. The effectiveness of our method has been demonstrated with the clear improvements over state-of-the-arts as well as the thorough ablation analysis on both tasks\footnote{Code will be made available at \url{https://github.com/liuquande/FedIRM}}.
    Mobile Augmented Reality: User Interfaces, Frameworks, and Intelligence. (arXiv:2106.08710v1 [cs.HC])
    (2 min) Mobile Augmented Reality (MAR) integrates computer-generated virtual objects with physical environments for mobile devices. MAR systems enable users to interact with MAR devices, such as smartphones and head-worn wearables, and performs seamless transitions from the physical world to a mixed world with digital entities. These MAR systems support user experiences by using MAR devices to provide universal accessibility to digital contents. Over the past 20 years, a number of MAR systems have been developed, however, the studies and design of MAR frameworks have not yet been systematically reviewed from the perspective of user-centric design. This article presents the first effort of surveying existing MAR frameworks (count: 37) and further discusses the latest studies on MAR through a top-down approach: 1) MAR applications; 2) MAR visualisation techniques adaptive to user mobility and contexts; 3) systematic evaluation of MAR frameworks including supported platforms and corresponding features such as tracking, feature extraction plus sensing capabilities; and 4) underlying machine learning approaches supporting intelligent operations within MAR systems. Finally, we summarise the development of emerging research fields, current state-of-the-art, and discuss the important open challenges and possible theoretical and technical directions. This survey aims to benefit both researchers and MAR system developers alike.
    TextStyleBrush: Transfer of Text Aesthetics from a Single Example. (arXiv:2106.08385v1 [cs.CV])
    (2 min) We present a novel approach for disentangling the content of a text image from all aspects of its appearance. The appearance representation we derive can then be applied to new content, for one-shot transfer of the source style to new content. We learn this disentanglement in a self-supervised manner. Our method processes entire word boxes, without requiring segmentation of text from background, per-character processing, or making assumptions on string lengths. We show results in different text domains which were previously handled by specialized methods, e.g., scene text, handwritten text. To these ends, we make a number of technical contributions: (1) We disentangle the style and content of a textual image into a non-parametric, fixed-dimensional vector. (2) We propose a novel approach inspired by StyleGAN but conditioned over the example style at different resolution and content. (3) We present novel self-supervised training criteria which preserve both source style and target content using a pre-trained font classifier and text recognizer. Finally, (4) we also introduce Imgur5K, a new challenging dataset for handwritten word images. We offer numerous qualitative photo-realistic results of our method. We further show that our method surpasses previous work in quantitative tests on scene text and handwriting datasets, as well as in a user study.
    Shuffle Transformer with Feature Alignment for Video Face Parsing. (arXiv:2106.08650v1 [cs.CV])
    (2 min) This is a short technical report introducing the solution of the Team TCParser for Short-video Face Parsing Track of The 3rd Person in Context (PIC) Workshop and Challenge at CVPR 2021. In this paper, we introduce a strong backbone which is cross-window based Shuffle Transformer for presenting accurate face parsing representation. To further obtain the finer segmentation results, especially on the edges, we introduce a Feature Alignment Aggregation (FAA) module. It can effectively relieve the feature misalignment issue caused by multi-resolution feature aggregation. Benefiting from the stronger backbone and better feature aggregation, the proposed method achieves 86.9519% score in the Short-video Face Parsing track of the 3rd Person in Context (PIC) Workshop and Challenge, ranked the first place.
    EdgeConv with Attention Module for Monocular Depth Estimation. (arXiv:2106.08615v1 [cs.CV])
    (2 min) Monocular depth estimation is an especially important task in robotics and autonomous driving, where 3D structural information is essential. However, extreme lighting conditions and complex surface objects make it difficult to predict depth in a single image. Therefore, to generate accurate depth maps, it is important for the model to learn structural information about the scene. We propose a novel Patch-Wise EdgeConv Module (PEM) and EdgeConv Attention Module (EAM) to solve the difficulty of monocular depth estimation. The proposed modules extract structural information by learning the relationship between image patches close to each other in space using edge convolution. Our method is evaluated on two popular datasets, the NYU Depth V2 and the KITTI Eigen split, achieving state-of-the-art performance. We prove that the proposed model predicts depth robustly in challenging scenes through various comparative experiments.
    Learning Implicit Glyph Shape Representation. (arXiv:2106.08573v1 [cs.CV])
    (2 min) In this paper, we present a novel implicit glyph shape representation, which models glyphs as shape primitives enclosed by quadratic curves, and naturally enables generating glyph images at arbitrary high resolutions. Experiments on font reconstruction and interpolation tasks verified that this structured implicit representation is suitable for describing both structure and style features of glyphs. Furthermore, based on the proposed representation, we design a simple yet effective disentangled network for the challenging one-shot font style transfer problem, and achieve the best results comparing to state-of-the-art alternatives in both quantitative and qualitative comparisons. Benefit from this representation, our generated glyphs have the potential to be converted to vector fonts through post-processing, reducing the gap between rasterized images and vector graphics. We hope this work can provide a powerful tool for 2D shape analysis and synthesis, and inspire further exploitation in implicit representations for 2D shape modeling.
    ECKPN: Explicit Class Knowledge Propagation Network for Transductive Few-shot Learning. (arXiv:2106.08523v1 [cs.CV])
    (2 min) Recently, the transductive graph-based methods have achieved great success in the few-shot classification task. However, most existing methods ignore exploring the class-level knowledge that can be easily learned by humans from just a handful of samples. In this paper, we propose an Explicit Class Knowledge Propagation Network (ECKPN), which is composed of the comparison, squeeze and calibration modules, to address this problem. Specifically, we first employ the comparison module to explore the pairwise sample relations to learn rich sample representations in the instance-level graph. Then, we squeeze the instance-level graph to generate the class-level graph, which can help obtain the class-level visual knowledge and facilitate modeling the relations of different classes. Next, the calibration module is adopted to characterize the relations of the classes explicitly to obtain the more discriminative class-level knowledge representations. Finally, we combine the class-level knowledge with the instance-level sample representations to guide the inference of the query samples. We conduct extensive experiments on four few-shot classification benchmarks, and the experimental results show that the proposed ECKPN significantly outperforms the state-of-the-art methods.
    Structured DropConnect for Uncertainty Inference in Image Classification. (arXiv:2106.08624v1 [cs.CV])
    (2 min) With the complexity of the network structure, uncertainty inference has become an important task to improve the classification accuracy for artificial intelligence systems. For image classification tasks, we propose a structured DropConnect (SDC) framework to model the output of a deep neural network by a Dirichlet distribution. We introduce a DropConnect strategy on weights in the fully connected layers during training. In test, we split the network into several sub-networks, and then model the Dirichlet distribution by match its moments with the mean and variance of the outputs of these sub-networks. The entropy of the estimated Dirichlet distribution is finally utilized for uncertainty inference. In this paper, this framework is implemented on LeNet$5$ and VGG$16$ models for misclassification detection and out-of-distribution detection on MNIST and CIFAR-$10$ datasets. Experimental results show that the performance of the proposed SDC can be comparable to other uncertainty inference methods. Furthermore, the SDC is adapted well to different network structures with certain generalization capabilities and research prospects.
    Unsupervised Person Re-identification via Multi-Label Prediction and Classification based on Graph-Structural Insight. (arXiv:2106.08798v1 [cs.CV])
    (2 min) This paper addresses unsupervised person re-identification (Re-ID) using multi-label prediction and classification based on graph-structural insight. Our method extracts features from person images and produces a graph that consists of the features and a pairwise similarity of them as nodes and edges, respectively. Based on the graph, the proposed graph structure based multi-label prediction (GSMLP) method predicts multi-labels by considering the pairwise similarity and the adjacency node distribution of each node. The multi-labels created by GSMLP are applied to the proposed selective multi-label classification (SMLC) loss. SMLC integrates a hard-sample mining scheme and a multi-label classification. The proposed GSMLP and SMLC boost the performance of unsupervised person Re-ID without any pre-labelled dataset. Experimental results justify the superiority of the proposed method in unsupervised person Re-ID by producing state-of-the-art performance. The source code for this paper is publicly available on 'https://github.com/uknownpioneer/GSMLP-SMLC.git'.
    Revisit Visual Representation in Analytics Taxonomy: A Compression Perspective. (arXiv:2106.08512v1 [cs.CV])
    (2 min) Visual analytics have played an increasingly critical role in the Internet of Things, where massive visual signals have to be compressed and fed into machines. But facing such big data and constrained bandwidth capacity, existing image/video compression methods lead to very low-quality representations, while existing feature compression techniques fail to support diversified visual analytics applications/tasks with low-bit-rate representations. In this paper, we raise and study the novel problem of supporting multiple machine vision analytics tasks with the compressed visual representation, namely, the information compression problem in analytics taxonomy. By utilizing the intrinsic transferability among different tasks, our framework successfully constructs compact and expressive representations at low bit-rates to support a diversified set of machine vision tasks, including both high-level semantic-related tasks and mid-level geometry analytic tasks. In order to impose compactness in the representations, we propose a codebook-based hyperprior, which helps map the representation into a low-dimensional manifold. As it well fits the signal structure of the deep visual feature, it facilitates more accurate entropy estimation, and results in higher compression efficiency. With the proposed framework and the codebook-based hyperprior, we further investigate the relationship of different task features owning different levels of abstraction granularity. Experimental results demonstrate that with the proposed scheme, a set of diversified tasks can be supported at a significantly lower bit-rate, compared with existing compression schemes.
    Robustness of Object Detectors in Degrading Weather Conditions. (arXiv:2106.08795v1 [cs.CV])
    (2 min) State-of-the-art object detection systems for autonomous driving achieve promising results in clear weather conditions. However, such autonomous safety critical systems also need to work in degrading weather conditions, such as rain, fog and snow. Unfortunately, most approaches evaluate only on the KITTI dataset, which consists only of clear weather scenes. In this paper we address this issue and perform one of the most detailed evaluation on single and dual modality architectures on data captured in real weather conditions. We analyse the performance degradation of these architectures in degrading weather conditions. We demonstrate that an object detection architecture performing good in clear weather might not be able to handle degrading weather conditions. We also perform ablation studies on the dual modality architectures and show their limitations.
    CMF: Cascaded Multi-model Fusion for Referring Image Segmentation. (arXiv:2106.08617v1 [cs.CV])
    (2 min) In this work, we address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression. Most existing methods focus on establishing unidirectional or directional relationships between visual and linguistic features to associate two modalities together, while the multi-scale context is ignored or insufficiently modeled. Multi-scale context is crucial to localize and segment those objects that have large scale variations during the multi-modal fusion process. To solve this problem, we propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel and further introduces a cascaded branch to fuse visual and linguistic features. The cascaded branch can progressively integrate multi-scale contextual information and facilitate the alignment of two modalities during the multi-modal fusion process. Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods. Code is available at https://github.com/jianhua2022/CMF-Refseg.
    FastAno: Fast Anomaly Detection via Spatio-temporal Patch Transformation. (arXiv:2106.08613v1 [cs.CV])
    (2 min) Video anomaly detection has gained significant attention due to the increasing requirements of automatic monitoring for surveillance videos. Especially, the prediction based approach is one of the most studied methods to detect anomalies by predicting frames that include abnormal events in the test set after learning with the normal frames of the training set. However, a lot of prediction networks are computationally expensive owing to the use of pre-trained optical flow networks, or fail to detect abnormal situations because of their strong generative ability to predict even the anomalies. To address these shortcomings, we propose spatial rotation transformation (SRT) and temporal mixing transformation (TMT) to generate irregular patch cuboids within normal frame cuboids in order to enhance the learning of normal features. Additionally, the proposed patch transformation is used only during the training phase, allowing our model to detect abnormal frames at fast speed during inference. Our model is evaluated on three anomaly detection benchmarks, achieving competitive accuracy and surpassing all the previous works in terms of speed.
    ICDAR 2021 Competition on Components Segmentation Task of Document Photos. (arXiv:2106.08499v1 [cs.CV])
    (2 min) This paper describes the short-term competition on Components Segmentation Task of Document Photos that was prepared in the context of the 16th International Conference on Document Analysis and Recognition (ICDAR 2021). This competition aims to bring together researchers working on the filed of identification document image processing and provides them a suitable benchmark to compare their techniques on the component segmentation task of document images. Three challenge tasks were proposed entailing different segmentation assignments to be performed on a provided dataset. The collected data are from several types of Brazilian ID documents, whose personal information was conveniently replaced. There were 16 participants whose results obtained for some or all the three tasks show different rates for the adopted metrics, like Dice Similarity Coefficient ranging from 0.06 to 0.99. Different Deep Learning models were applied by the entrants with diverse strategies to achieve the best results in each of the tasks. Obtained results show that the current applied methods for solving one of the proposed tasks (document boundary detection) are already well stablished. However, for the other two challenge tasks (text zone and handwritten sign detection) research and development of more robust approaches are still required to achieve acceptable results.
    Shape from Blur: Recovering Textured 3D Shape and Motion of Fast Moving Objects. (arXiv:2106.08762v1 [cs.CV])
    (2 min) We address the novel task of jointly reconstructing the 3D shape, texture, and motion of an object from a single motion-blurred image. While previous approaches address the deblurring problem only in the 2D image domain, our proposed rigorous modeling of all object properties in the 3D domain enables the correct description of arbitrary object motion. This leads to significantly better image decomposition and sharper deblurring results. We model the observed appearance of a motion-blurred object as a combination of the background and a 3D object with constant translation and rotation. Our method minimizes a loss on reconstructing the input image via differentiable rendering with suitable regularizers. This enables estimating the textured 3D mesh of the blurred object with high fidelity. Our method substantially outperforms competing approaches on several benchmarks for fast moving objects deblurring. Qualitative results show that the reconstructed 3D mesh generates high-quality temporal super-resolution and novel views of the deblurred object.
    AtrialGeneral: Domain Generalization for Left Atrial Segmentation of Multi-Center LGE MRIs. (arXiv:2106.08727v1 [cs.CV])
    (2 min) Left atrial (LA) segmentation from late gadolinium enhanced magnetic resonance imaging (LGE MRI) is a crucial step needed for planning the treatment of atrial fibrillation. However, automatic LA segmentation from LGE MRI is still challenging, due to the poor image quality, high variability in LA shapes, and unclear LA boundary. Though deep learning-based methods can provide promising LA segmentation results, they often generalize poorly to unseen domains, such as data from different scanners and/or sites. In this work, we collect 210 LGE MRIs from different centers with different levels of image quality. To evaluate the domain generalization ability of models on the LA segmentation task, we employ four commonly used semantic segmentation networks for the LA segmentation from multi-center LGE MRIs. Besides, we investigate three domain generalization strategies, i.e., histogram matching, mutual information based disentangled representation, and random style transfer, where a simple histogram matching is proved to be most effective.
    Over-and-Under Complete Convolutional RNN for MRI Reconstruction. (arXiv:2106.08886v1 [cs.CV])
    (2 min) Reconstructing magnetic resonance (MR) images from undersampled data is a challenging problem due to various artifacts introduced by the under-sampling operation. Recent deep learning-based methods for MR image reconstruction usually leverage a generic auto-encoder architecture which captures low-level features at the initial layers and high?level features at the deeper layers. Such networks focus much on global features which may not be optimal to reconstruct the fully-sampled image. In this paper, we propose an Over-and-Under Complete Convolu?tional Recurrent Neural Network (OUCR), which consists of an overcomplete and an undercomplete Convolutional Recurrent Neural Network(CRNN). The overcomplete branch gives special attention in learning local structures by restraining the receptive field of the network. Combining it with the undercomplete branch leads to a network which focuses more on low-level features without losing out on the global structures. Extensive experiments on two datasets demonstrate that the proposed method achieves significant improvements over the compressed sensing and popular deep learning-based methods with less number of trainable parameters. Our code is available at https://github.com/guopengf/OUCR.
    Contrastive Learning with Continuous Proxy Meta-Data for 3D MRI Classification. (arXiv:2106.08808v1 [cs.CV])
    (2 min) Traditional supervised learning with deep neural networks requires a tremendous amount of labelled data to converge to a good solution. For 3D medical images, it is often impractical to build a large homogeneous annotated dataset for a specific pathology. Self-supervised methods offer a new way to learn a representation of the images in an unsupervised manner with a neural network. In particular, contrastive learning has shown great promises by (almost) matching the performance of fully-supervised CNN on vision tasks. Nonetheless, this method does not take advantage of available meta-data, such as participant's age, viewed as prior knowledge. Here, we propose to leverage continuous proxy metadata, in the contrastive learning framework, by introducing a new loss called y-Aware InfoNCE loss. Specifically, we improve the positive sampling during pre-training by adding more positive examples with similar proxy meta-data with the anchor, assuming they share similar discriminative semantic features.With our method, a 3D CNN model pre-trained on $10^4$ multi-site healthy brain MRI scans can extract relevant features for three classification tasks: schizophrenia, bipolar diagnosis and Alzheimer's detection. When fine-tuned, it also outperforms 3D CNN trained from scratch on these tasks, as well as state-of-the-art self-supervised methods. Our code is made publicly available here.
    GelSight Wedge: Measuring High-Resolution 3D Contact Geometry with a Compact Robot Finger. (arXiv:2106.08851v1 [cs.RO])
    (2 min) Vision-based tactile sensors have the potential to provide important contact geometry to localize the objective with visual occlusion. However, it is challenging to measure high-resolution 3D contact geometry for a compact robot finger, to simultaneously meet optical and mechanical constraints. In this work, we present the GelSight Wedge sensor, which is optimized to have a compact shape for robot fingers, while achieving high-resolution 3D reconstruction. We evaluate the 3D reconstruction under different lighting configurations, and extend the method from 3 lights to 1 or 2 lights. We demonstrate the flexibility of the design by shrinking the sensor to the size of a human finger for fine manipulation tasks. We also show the effectiveness and potential of the reconstructed 3D geometry for pose tracking in the 3D space.
    Unsupervised-learning-based method for chest MRI-CT transformation using structure constrained unsupervised generative attention networks. (arXiv:2106.08557v1 [cs.CV])
    (2 min) The integrated positron emission tomography/magnetic resonance imaging (PET/MRI) scanner facilitates the simultaneous acquisition of metabolic information via PET and morphological information with high soft-tissue contrast using MRI. Although PET/MRI facilitates the capture of high-accuracy fusion images, its major drawback can be attributed to the difficulty encountered when performing attenuation correction, which is necessary for quantitative PET evaluation. The combined PET/MRI scanning requires the generation of attenuation-correction maps from MRI owing to no direct relationship between the gamma-ray attenuation information and MRIs. While MRI-based bone-tissue segmentation can be readily performed for the head and pelvis regions, the realization of accurate bone segmentation via chest CT generation remains a challenging task. This can be attributed to the respiratory and cardiac motions occurring in the chest as well as its anatomically complicated structure and relatively thin bone cortex. This paper presents a means to minimise the anatomical structural changes without human annotation by adding structural constraints using a modality-independent neighbourhood descriptor (MIND) to a generative adversarial network (GAN) that can transform unpaired images. The results obtained in this study revealed the proposed U-GAT-IT + MIND approach to outperform all other competing approaches. The findings of this study hint towards possibility of synthesising clinically acceptable CT images from chest MRI without human annotation, thereby minimising the changes in the anatomical structure.
    Domain Consistency Regularization for Unsupervised Multi-source Domain Adaptive Classification. (arXiv:2106.08590v1 [cs.CV])
    (2 min) Deep learning-based multi-source unsupervised domain adaptation (MUDA) has been actively studied in recent years. Compared with single-source unsupervised domain adaptation (SUDA), domain shift in MUDA exists not only between the source and target domains but also among multiple source domains. Most existing MUDA algorithms focus on extracting domain-invariant representations among all domains whereas the task-specific decision boundaries among classes are largely neglected. In this paper, we propose an end-to-end trainable network that exploits domain Consistency Regularization for unsupervised Multi-source domain Adaptive classification (CRMA). CRMA aligns not only the distributions of each pair of source and target domains but also that of all domains. For each pair of source and target domains, we employ an intra-domain consistency to regularize a pair of domain-specific classifiers to achieve intra-domain alignment. In addition, we design an inter-domain consistency that targets joint inter-domain alignment among all domains. To address different similarities between multiple source domains and the target domain, we design an authorization strategy that assigns different authorities to domain-specific classifiers adaptively for optimal pseudo label prediction and self-training. Extensive experiments show that CRMA tackles unsupervised domain adaptation effectively under a multi-source setup and achieves superior adaptation consistently across multiple MUDA datasets.
    GKNet: grasp keypoint network for grasp candidates detection. (arXiv:2106.08497v1 [cs.RO])
    (2 min) Contemporary grasp detection approaches employ deep learning to achieve robustness to sensor and object model uncertainty. The two dominant approaches design either grasp-quality scoring or anchor-based grasp recognition networks. This paper presents a different approach to grasp detection by treating it as keypoint detection. The deep network detects each grasp candidate as a pair of keypoints, convertible to the grasp representation g = {x, y, w, {\theta}}^T, rather than a triplet or quartet of corner points. Decreasing the detection difficulty by grouping keypoints into pairs boosts performance. To further promote dependencies between keypoints, the general non-local module is incorporated into the proposed learning framework. A final filtering strategy based on discrete and continuous orientation prediction removes false correspondences and further improves grasp detection performance. GKNet, the approach presented here, achieves the best balance of accuracy and speed on the Cornell and the abridged Jacquard dataset (96.9% and 98.39% at 41.67 and 23.26 fps). Follow-up experiments on a manipulator evaluate GKNet using 4 types of grasping experiments reflecting different nuisance sources: static grasping, dynamic grasping, grasping at varied camera angles, and bin picking. GKNet outperforms reference baselines in static and dynamic grasping experiments while showing robustness to varied camera viewpoints and bin picking experiments. The results confirm the hypothesis that grasp keypoints are an effective output representation for deep grasp networks that provide robustness to expected nuisance factors.
    X-MAN: Explaining multiple sources of anomalies in video. (arXiv:2106.08856v1 [cs.CV])
    (2 min) Our objective is to detect anomalies in video while also automatically explaining the reason behind the detector's response. In a practical sense, explainability is crucial for this task as the required response to an anomaly depends on its nature and severity. However, most leading methods (based on deep neural networks) are not interpretable and hide the decision making process in uninterpretable feature representations. In an effort to tackle this problem we make the following contributions: (1) we show how to build interpretable feature representations suitable for detecting anomalies with state of the art performance, (2) we propose an interpretable probabilistic anomaly detector which can describe the reason behind it's response using high level concepts, (3) we are the first to directly consider object interactions for anomaly detection and (4) we propose a new task of explaining anomalies and release a large dataset for evaluating methods on this task. Our method competes well with the state of the art on public datasets while also providing anomaly explanation based on objects and their interactions.
    ParticleAugment: Sampling-Based Data Augmentation. (arXiv:2106.08693v1 [cs.LG])
    (2 min) We present an automated data augmentation approach for image classification. We formulate the problem as Monte Carlo sampling where our goal is to approximate the optimal augmentation policies. We propose a particle filtering formulation to find optimal augmentation policies and their schedules during model training. Our performance measurement procedure relies on a validation subset of our training set, while the policy transition model depends on a Gaussian prior and an optional augmentation velocity parameter. In our experiments, we show that our formulation for automated augmentation reaches promising results on CIFAR-10, CIFAR-100, and ImageNet datasets using the standard network architectures for this problem. By comparing with the related work, we also show that our method reaches a balance between the computational cost of policy search and the model performance.
    SiamAPN++: Siamese Attentional Aggregation Network for Real-Time UAV Tracking. (arXiv:2106.08816v1 [cs.CV])
    (2 min) Recently, the Siamese-based method has stood out from multitudinous tracking methods owing to its state-of-the-art (SOTA) performance. Nevertheless, due to various special challenges in UAV tracking, \textit{e.g.}, severe occlusion, and fast motion, most existing Siamese-based trackers hardly combine superior performance with high efficiency. To this concern, in this paper, a novel attentional Siamese tracker (SiamAPN++) is proposed for real-time UAV tracking. By virtue of the attention mechanism, the attentional aggregation network (AAN) is conducted with self-AAN and cross-AAN, raising the expression ability of features eventually. The former AAN aggregates and models the self-semantic interdependencies of the single feature map via spatial and channel dimensions. The latter aims to aggregate the cross-interdependencies of different semantic features including the location information of anchors. In addition, the dual features version of the anchor proposal network is proposed to raise the robustness of proposing anchors, increasing the perception ability to objects with various scales. Experiments on two well-known authoritative benchmarks are conducted, where SiamAPN++ outperforms its baseline SiamAPN and other SOTA trackers. Besides, real-world tests onboard a typical embedded platform demonstrate that SiamAPN++ achieves promising tracking results with real-time speed.
    Detection of Morphed Face Images Using Discriminative Wavelet Sub-bands. (arXiv:2106.08565v1 [cs.CV])
    (2 min) This work investigates the well-known problem of morphing attacks, which has drawn considerable attention in the biometrics community. Morphed images have exposed face recognition systems' susceptibility to false acceptance, resulting in dire consequences, especially for national security applications. To detect morphing attacks, we propose a method which is based on a discriminative 2D Discrete Wavelet Transform (2D-DWT). A discriminative wavelet sub-band can highlight inconsistencies between a real and a morphed image. We observe that there is a salient discrepancy between the entropy of a given sub-band in a bona fide image, and the same sub-band's entropy in a morphed sample. Considering this dissimilarity between these two entropy values, we find the Kullback-Leibler divergence between the two distributions, namely the entropy of the bona fide and the corresponding morphed images. The most discriminative wavelet sub-bands are those with the highest corresponding KL-divergence values. Accordingly, 22 sub-bands are selected as the most discriminative ones in terms of morph detection. We show that a Deep Neural Network (DNN) trained on the 22 discriminative sub-bands can detect morphed samples precisely. Most importantly, the effectiveness of our algorithm is validated through experiments on three datasets: VISAPP17, LMA, and MorGAN. We also performed an ablation study on the sub-band selection.
    Toward Robotic Weed Control: Detection of Nutsedge Weed in Bermudagrass Turf Using Inaccurate and Insufficient Training Data. (arXiv:2106.08897v1 [cs.CV])
    (2 min) To enable robotic weed control, we develop algorithms to detect nutsedge weed from bermudagrass turf. Due to the similarity between the weed and the background turf, manual data labeling is expensive and error-prone. Consequently, directly applying deep learning methods for object detection cannot generate satisfactory results. Building on an instance detection approach (i.e. Mask R-CNN), we combine synthetic data with raw data to train the network. We propose an algorithm to generate high fidelity synthetic data, adopting different levels of annotations to reduce labeling cost. Moreover, we construct a nutsedge skeleton-based probabilistic map (NSPM) as the neural network input to reduce the reliance on pixel-wise precise labeling. We also modify loss function from cross entropy to Kullback-Leibler divergence which accommodates uncertainty in the labeling process. We implement the proposed algorithm and compare it with both Faster R-CNN and Mask R-CNN. The results show that our design can effectively overcome the impact of imprecise and insufficient training sample issues and significantly outperform the Faster R-CNN counterpart with a false negative rate of only 0.4%. In particular, our approach also reduces labeling time by 95% while achieving better performance if comparing with the original Mask R-CNN approach.
    Disentangling Semantic-to-visual Confusion for Zero-shot Learning. (arXiv:2106.08605v1 [cs.CV])
    (2 min) Using generative models to synthesize visual features from semantic distribution is one of the most popular solutions to ZSL image classification in recent years. The triplet loss (TL) is popularly used to generate realistic visual distributions from semantics by automatically searching discriminative representations. However, the traditional TL cannot search reliable unseen disentangled representations due to the unavailability of unseen classes in ZSL. To alleviate this drawback, we propose in this work a multi-modal triplet loss (MMTL) which utilizes multimodal information to search a disentangled representation space. As such, all classes can interplay which can benefit learning disentangled class representations in the searched space. Furthermore, we develop a novel model called Disentangling Class Representation Generative Adversarial Network (DCR-GAN) focusing on exploiting the disentangled representations in training, feature synthesis, and final recognition stages. Benefiting from the disentangled representations, DCR-GAN could fit a more realistic distribution over both seen and unseen features. Extensive experiments show that our proposed model can lead to superior performance to the state-of-the-arts on four benchmark datasets. Our code is available at https://github.com/FouriYe/DCRGAN-TMM.
    PatchNet: Unsupervised Object Discovery based on Patch Embedding. (arXiv:2106.08599v1 [cs.CV])
    (2 min) We demonstrate that frequently appearing objects can be discovered by training randomly sampled patches from a small number of images (100 to 200) by self-supervision. Key to this approach is the pattern space, a latent space of patterns that represents all possible sub-images of the given image data. The distance structure in the pattern space captures the co-occurrence of patterns due to the frequent objects. The pattern space embedding is learned by minimizing the contrastive loss between randomly generated adjacent patches. To prevent the embedding from learning the background, we modulate the contrastive loss by color-based object saliency and background dissimilarity. The learned distance structure serves as object memory, and the frequent objects are simply discovered by clustering the pattern vectors from the random patches sampled for inference. Our image representation based on image patches naturally handles the position and scale invariance property that is crucial to multi-object discovery. The method has been proven surprisingly effective, and successfully applied to finding multiple human faces and bodies from natural images.
    DMSANet: Dual Multi Scale Attention Network. (arXiv:2106.08382v1 [cs.CV])
    (2 min) Attention mechanism of late has been quite popular in the computer vision community. A lot of work has been done to improve the performance of the network, although almost always it results in increased computational complexity. In this paper, we propose a new attention module that not only achieves the best performance but also has lesser parameters compared to most existing models. Our attention module can easily be integrated with other convolutional neural networks because of its lightweight nature. The proposed network named Dual Multi Scale Attention Network (DMSANet) is comprised of two parts: the first part is used to extract features at various scales and aggregate them, the second part uses spatial and channel attention modules in parallel to adaptively integrate local features with their global dependencies. We benchmark our network performance for Image Classification on ImageNet dataset, Object Detection and Instance Segmentation both on MS COCO dataset.
    Seeing Through Clouds in Satellite Images. (arXiv:2106.08408v1 [cs.CV])
    (2 min) This paper presents a neural-network-based solution to recover pixels occluded by clouds in satellite images. We leverage radio frequency (RF) signals in the ultra/super-high frequency band that penetrate clouds to help reconstruct the occluded regions in multispectral images. We introduce the first multi-modal multi-temporal cloud removal model. Our model uses publicly available satellite observations and produces daily cloud-free images. Experimental results show that our system significantly outperforms baselines by 8dB in PSNR. We also demonstrate use cases of our system in digital agriculture, flood monitoring, and wildfire detection. We will release the processed dataset to facilitate future research.
    Dynamically Grown Generative Adversarial Networks. (arXiv:2106.08505v1 [cs.CV])
    (2 min) Recent work introduced progressive network growing as a promising way to ease the training for large GANs, but the model design and architecture-growing strategy still remain under-explored and needs manual design for different image data. In this paper, we propose a method to dynamically grow a GAN during training, optimizing the network architecture and its parameters together with automation. The method embeds architecture search techniques as an interleaving step with gradient-based training to periodically seek the optimal architecture-growing strategy for the generator and discriminator. It enjoys the benefits of both eased training because of progressive growing and improved performance because of broader architecture design space. Experimental results demonstrate new state-of-the-art of image generation. Observations in the search procedure also provide constructive insights into the GAN model design such as generator-discriminator balance and convolutional layer choices.
    Understanding and Evaluating Racial Biases in Image Captioning. (arXiv:2106.08503v1 [cs.CV])
    (2 min) Image captioning is an important task for benchmarking visual reasoning and for enabling accessibility for people with vision impairments. However, as in many machine learning settings, social biases can influence image captioning in undesirable ways. In this work, we study bias propagation pathways within image captioning, focusing specifically on the COCO dataset. Prior work has analyzed gender bias in captions using automatically-derived gender labels; here we examine racial and intersectional biases using manual annotations. Our first contribution is in annotating the perceived gender and skin color of 28,315 of the depicted people after obtaining IRB approval. Using these annotations, we compare racial biases present in both manual and automatically-generated image captions. We demonstrate differences in caption performance, sentiment, and word choice between images of lighter versus darker-skinned people. Further, we find the magnitude of these differences to be greater in modern captioning systems compared to older ones, thus leading to concerns that without proper consideration and mitigation these differences will only become increasingly prevalent. Code and data is available at https://princetonvisualai.github.io/imagecaptioning-bias .
    Scene Transformer: A unified multi-task model for behavior prediction and planning. (arXiv:2106.08417v1 [cs.CV])
    (2 min) Predicting the future motion of multiple agents is necessary for planning in dynamic environments. This task is challenging for autonomous driving since agents (e.g., vehicles and pedestrians) and their associated behaviors may be diverse and influence each other. Most prior work has focused on first predicting independent futures for each agent based on all past motion, and then planning against these independent predictions. However, planning against fixed predictions can suffer from the inability to represent the future interaction possibilities between different agents, leading to sub-optimal planning. In this work, we formulate a model for predicting the behavior of all agents jointly in real-world driving environments in a unified manner. Inspired by recent language modeling approaches, we use a masking strategy as the query to our model, enabling one to invoke a single model to predict agent behavior in many ways, such as potentially conditioned on the goal or full future trajectory of the autonomous vehicle or the behavior of other agents in the environment. Our model architecture fuses heterogeneous world state in a unified Transformer architecture by employing attention across road elements, agent interactions and time steps. We evaluate our approach on autonomous driving datasets for behavior prediction, and achieve state-of-the-art performance. Our work demonstrates that formulating the problem of behavior prediction in a unified architecture with a masking strategy may allow us to have a single model that can perform multiple motion prediction and planning related tasks effectively.
    A Multi-Layered Approach for Measuring the Simulation-to-Reality Gap of Radar Perception for Autonomous Driving. (arXiv:2106.08372v1 [cs.RO])
    (2 min) With the increasing safety validation requirements for the release of a self-driving car, alternative approaches, such as simulation-based testing, are emerging in addition to conventional real-world testing. In order to rely on virtual tests the employed sensor models have to be validated. For this reason, it is necessary to quantify the discrepancy between simulation and reality in order to determine whether a certain fidelity is sufficient for a desired intended use. There exists no sound method to measure this simulation-to-reality gap of radar perception for autonomous driving. We address this problem by introducing a multi-layered evaluation approach, which consists of a combination of an explicit and an implicit sensor model evaluation. The former directly evaluates the realism of the synthetically generated sensor data, while the latter refers to an evaluation of a downstream target application. In order to demonstrate the method, we evaluated the fidelity of three typical radar model types (ideal, data-driven, ray tracing-based) and their applicability for virtually testing radar-based multi-object tracking. We have shown the effectiveness of the proposed approach in terms of providing an in-depth sensor model assessment that renders existing disparities visible and enables a realistic estimation of the overall model fidelity across different scenarios.
  • cs.IR updates on arXiv.org

    TUTA: Tree-based Transformers for Generally Structured Table Pre-training. (arXiv:2010.12537v3 [cs.IR] UPDATED)
    (2 min) Tables are widely used with various structures to organize and present data. Recent attempts on table understanding mainly focus on relational tables, yet overlook to other common table structures. In this paper, we propose TUTA, a unified pre-training architecture for understanding generally structured tables. Noticing that understanding a table requires spatial, hierarchical, and semantic information, we enhance transformers with three novel structure-aware mechanisms. First, we devise a unified tree-based structure, called a bi-dimensional coordinate tree, to describe both the spatial and hierarchical information of generally structured tables. Upon this, we propose tree-based attention and position embedding to better capture the spatial and hierarchical information. Moreover, we devise three progressive pre-training objectives to enable representations at the token, cell, and table levels. We pre-train TUTA on a wide range of unlabeled web and spreadsheet tables and fine-tune it on two critical tasks in the field of table structure understanding: cell type classification and table type classification. Experiments show that TUTA is highly effective, achieving state-of-the-art on five widely-studied datasets.
    Personalized News Recommendation: A Survey. (arXiv:2106.08934v1 [cs.IR])
    (2 min) Personalized news recommendation is an important technique to help users find their interested news information and alleviate their information overload. It has been extensively studied over decades and has achieved notable success in improving users' news reading experience. However, there are still many unsolved problems and challenges that need to be further studied. To help researchers master the advances in personalized news recommendation over the past years, in this paper we present a comprehensive overview of personalized news recommendation. Instead of following the conventional taxonomy of news recommendation methods, in this paper we propose a novel perspective to understand personalized news recommendation based on its core problems and the associated techniques and challenges. We first review the techniques for tackling each core problem in a personalized news recommender system and the challenges they face. Next, we introduce the public datasets and evaluation metrics used for personalized news recommendation. We then discuss the key points on improving the responsibility of personalized news recommender systems. Finally, we raise several research directions that are worth investigating in future. This paper can provide up-to-date and comprehensive views to help readers understand the personalized news recommendation field. We hope this paper can facilitate research on personalized news recommendation and as well as related fields in natural language processing and data mining.
    Universal and specific features of Ukrainian economic research: publication analysis based on Crossref data. (arXiv:2106.08701v1 [cs.DL])
    (2 min) Our study is one of the first examples of multidimensional and longitudinal disciplinary analysis at the national level based on Crossref data. We present a large-scale quantitative analysis of Ukrainian economics. This study is not yet another example of research aimed at ranking of local journals, authors or institutions, but rather exploring general tendencies that can be compared to other countries or regions. We study different aspects of Ukrainian economics output. In particular, the collaborative nature, geographic landscape and some peculiarities of citation statistics are investigated. We have found that Ukrainian economics is characterized by a comparably small share of co-authored publications, however, it demonstrates the tendency towards more collaborative output. Based on our analysis, we discuss specific and universal features of Ukrainian economic research. The importance of supporting various initiatives aimed at enriching open scholarly metadata is considered. A comprehensive and high-quality meta description of publications is probably the shortest path to a better understanding of national trends, especially for non-English speaking countries. The results of our analysis can be used to better understand Ukrainian economic research and support research policy decisions.
    FAIR: Fairness-Aware Information Retrieval Evaluation. (arXiv:2106.08527v1 [cs.IR])
    (2 min) With the emerging needs of creating fairness-aware solutions for search and recommendation systems, a daunting challenge exists of evaluating such solutions. While many of the traditional information retrieval (IR) metrics can capture the relevance, diversity and novelty for the utility with respect to users, they are not suitable for inferring whether the presented results are fair from the perspective of responsible information exposure. On the other hand, various fairness metrics have been proposed but they do not account for the user utility or do not measure it adequately. To address this problem, we propose a new metric called Fairness-Aware IR (FAIR). By unifying standard IR metrics and fairness measures into an integrated metric, this metric offers a new perspective for evaluating fairness-aware ranking results. Based on this metric, we developed an effective ranking algorithm that jointly optimized user utility and fairness. The experimental results showed that our FAIR metric could highlight results with good user utility and fair information exposure. We showed how FAIR related to existing metrics and demonstrated the effectiveness of our FAIR-based algorithm. We believe our work opens up a new direction of pursuing a computationally feasible metric for evaluating and implementing the fairness-aware IR systems.
    A Neural Model for Joint Document and Snippet Ranking in Question Answering for Large Document Collections. (arXiv:2106.08908v1 [cs.IR])
    (2 min) Question answering (QA) systems for large document collections typically use pipelines that (i) retrieve possibly relevant documents, (ii) re-rank them, (iii) rank paragraphs or other snippets of the top-ranked documents, and (iv) select spans of the top-ranked snippets as exact answers. Pipelines are conceptually simple, but errors propagate from one component to the next, without later components being able to revise earlier decisions. We present an architecture for joint document and snippet ranking, the two middle stages, which leverages the intuition that relevant documents have good snippets and good snippets come from relevant documents. The architecture is general and can be used with any neural text relevance ranker. We experiment with two main instantiations of the architecture, based on POSIT-DRMM (PDRMM) and a BERT-based ranker. Experiments on biomedical data from BIOASQ show that our joint models vastly outperform the pipelines in snippet retrieval, the main goal for QA, with fewer trainable parameters, also remaining competitive in document retrieval. Furthermore, our joint PDRMM-based model is competitive with BERT-based models, despite using orders of magnitude fewer parameters. These claims are also supported by human evaluation on two test batches of BIOASQ. To test our key findings on another dataset, we modified the Natural Questions dataset so that it can also be used for document and snippet retrieval. Our joint PDRMM-based model again outperforms the corresponding pipeline in snippet retrieval on the modified Natural Questions dataset, even though it performs worse than the pipeline in document retrieval. We make our code and the modified Natural Questions dataset publicly available.
    TSSuBERT: Tweet Stream Summarization Using BERT. (arXiv:2106.08770v1 [cs.IR])
    (2 min) The development of deep neural networks and the emergence of pre-trained language models such as BERT allow to increase performance on many NLP tasks. However, these models do not meet the same popularity for tweet summarization, which can probably be explained by the lack of existing collections for training and evaluation. Our contribution in this paper is twofold : (1) we introduce a large dataset for Twitter event summarization, and (2) we propose a neural model to automatically summarize huge tweet streams. This extractive model combines in an original way pre-trained language models and vocabulary frequency-based representations to predict tweet salience. An additional advantage of the model is that it automatically adapts the size of the output summary according to the input tweet stream. We conducted experiments using two different Twitter collections, and promising results are observed in comparison with state-of-the-art baselines.
    A Topic Coverage Approach to Evaluation of Topic Models. (arXiv:2012.06274v2 [cs.IR] UPDATED)
    (2 min) Topic models are widely used unsupervised models of text capable of learning topics - weighted lists of words and documents - from large collections of text documents. When topic models are used for discovery of topics in text collections, a question that arises naturally is how well the model-induced topics correspond to topics of interest to the analyst. In this paper we revisit and extend a so far neglected approach to topic model evaluation based on measuring topic coverage - computationally matching model topics with a set of reference topics that models are expected to uncover. The approach is well suited for analyzing models' performance in topic discovery and for large-scale analysis of both topic models and measures of model quality. We propose new measures of coverage and evaluate, in a series of experiments, different types of topic models on two distinct text domains for which interest for topic discovery exists. The experiments include evaluation of model quality, analysis of coverage of distinct topic categories, and the analysis of the relationship between coverage and other methods of topic model evaluation. The contributions of the paper include new measures of coverage, insights into both topic models and other methods of model evaluation, and the datasets and code for facilitating future research of both topic coverage and other approaches to topic model evaluation.
    Topology Distillation for Recommender System. (arXiv:2106.08700v1 [cs.LG])
    (2 min) Recommender Systems (RS) have employed knowledge distillation which is a model compression technique training a compact student model with the knowledge transferred from a pre-trained large teacher model. Recent work has shown that transferring knowledge from the teacher's intermediate layer significantly improves the recommendation quality of the student. However, they transfer the knowledge of individual representation point-wise and thus have a limitation in that primary information of RS lies in the relations in the representation space. This paper proposes a new topology distillation approach that guides the student by transferring the topological structure built upon the relations in the teacher space. We first observe that simply making the student learn the whole topological structure is not always effective and even degrades the student's performance. We demonstrate that because the capacity of the student is highly limited compared to that of the teacher, learning the whole topological structure is daunting for the student. To address this issue, we propose a novel method named Hierarchical Topology Distillation (HTD) which distills the topology hierarchically to cope with the large capacity gap. Our extensive experiments on real-world datasets show that the proposed method significantly outperforms the state-of-the-art competitors. We also provide in-depth analyses to ascertain the benefit of distilling the topology for RS.
    Analysing Dense Passage Retrieval for Multi-hop Question Answering. (arXiv:2106.08433v1 [cs.IR])
    (2 min) We analyse the performance of passage retrieval models in the presence of complex (multi-hop) questions to provide a better understanding of how retrieval systems behave when multiple hops of reasoning are needed. In simple open-domain question answering (QA), dense passage retrieval has become one of the standard approaches for retrieving the relevant passages to infer an answer. Recently, dense passage retrieval also achieved state-of-the-art results in multi-hop QA, where aggregating information from multiple documents and reasoning over them is required. However, so far, the dense retrieval models are not evaluated properly concerning the multi-hop nature of the problem: models are typically evaluated by the end result of the retrieval pipeline, which leaves unclear where their success lies. In this work, we provide an in-depth evaluation of such models not only unveiling the reasons behind their success but also their limitations. Moreover, we introduce a hybrid (lexical and dense) retrieval approach that is highly competitive with the state-of-the-art dense retrieval model, while requiring substantially less computational resources. Furthermore, we also perform qualitative analysis to better understand the challenges behind passage retrieval for multi-hop QA.
  • cs.LG updates on arXiv.org

    Towards Optimally Weighted Physics-Informed Neural Networks in Ocean Modelling. (arXiv:2106.08747v1 [physics.ao-ph])
    (2 min) The carbon pump of the world's ocean plays a vital role in the biosphere and climate of the earth, urging improved understanding of the functions and influences of the ocean for climate change analyses. State-of-the-art techniques are required to develop models that can capture the complexity of ocean currents and temperature flows. This work explores the benefits of using physics-informed neural networks (PINNs) for solving partial differential equations related to ocean modeling; such as the Burgers, wave, and advection-diffusion equations. We explore the trade-offs of using data vs. physical models in PINNs for solving partial differential equations. PINNs account for the deviation from physical laws in order to improve learning and generalization. We observed how the relative weight between the data and physical model in the loss function influence training results, where small data sets benefit more from the added physics information.
    A Predictive Coding Account for Chaotic Itinerancy. (arXiv:2106.08937v1 [cs.NE])
    (2 min) As a phenomenon in dynamical systems allowing autonomous switching between stable behaviors, chaotic itinerancy has gained interest in neurorobotics research. In this study, we draw a connection between this phenomenon and the predictive coding theory by showing how a recurrent neural network implementing predictive coding can generate neural trajectories similar to chaotic itinerancy in the presence of input noise. We propose two scenarios generating random and past-independent attractor switching trajectories using our model.
    On the Objective Evaluation of Post Hoc Explainers. (arXiv:2106.08376v1 [cs.LG])
    (2 min) Many applications of data-driven models demand transparency of decisions, especially in health care, criminal justice, and other high-stakes environments. Modern trends in machine learning research have led to algorithms that are increasingly intricate to the degree that they are considered to be black boxes. In an effort to reduce the opacity of decisions, methods have been proposed to construe the inner workings of such models in a human-comprehensible manner. These post hoc techniques are described as being universal explainers - capable of faithfully augmenting decisions with algorithmic insight. Unfortunately, there is little agreement about what constitutes a "good" explanation. Moreover, current methods of explanation evaluation are derived from either subjective or proxy means. In this work, we propose a framework for the evaluation of post hoc explainers on ground truth that is directly derived from the additive structure of a model. We demonstrate the efficacy of the framework in understanding explainers by evaluating popular explainers on thousands of synthetic and several real-world tasks. The framework unveils that explanations may be accurate but misattribute the importance of individual features.
    Linear Classifiers in Product Space Forms. (arXiv:2102.10204v2 [cs.LG] UPDATED)
    (2 min) Embedding methods for product spaces are powerful techniques for low-distortion and low-dimensional representation of complex data structures. Nevertheless, little is known regarding downstream learning and optimization problems in such spaces. Here, we address the problem of linear classification in a product space form -- a mix of Euclidean, spherical, and hyperbolic spaces. First, we describe new formulations for linear classifiers on a Riemannian manifold using geodesics and Riemannian metrics which generalize straight lines and inner products in vector spaces, respectively. Second, we prove that linear classifiers in $d$-dimensional space forms of any curvature have the same expressive power, i.e., they can shatter exactly $d+1$ points. Third, we formalize linear classifiers in product space forms, describe the first corresponding perceptron and SVM classification algorithms, and establish rigorous convergence results for the former. We support our theoretical findings with simulation results on several datasets, including synthetic data, CIFAR-100, MNIST, Omniglot, and single-cell RNA sequencing data. The results show that learning methods applied to small-dimensional embeddings in product space forms outperform their algorithmic counterparts in each space form.
    Bridging Multi-Task Learning and Meta-Learning: Towards Efficient Training and Effective Adaptation. (arXiv:2106.09017v1 [cs.LG])
    (2 min) Multi-task learning (MTL) aims to improve the generalization of several related tasks by learning them jointly. As a comparison, in addition to the joint training scheme, modern meta-learning allows unseen tasks with limited labels during the test phase, in the hope of fast adaptation over them. Despite the subtle difference between MTL and meta-learning in the problem formulation, both learning paradigms share the same insight that the shared structure between existing training tasks could lead to better generalization and adaptation. In this paper, we take one important step further to understand the close connection between these two learning paradigms, through both theoretical analysis and empirical investigation. Theoretically, we first demonstrate that MTL shares the same optimization formulation with a class of gradient-based meta-learning (GBML) algorithms. We then prove that for over-parameterized neural networks with sufficient depth, the learned predictive functions of MTL and GBML are close. In particular, this result implies that the predictions given by these two models are similar over the same unseen task. Empirically, we corroborate our theoretical findings by showing that, with proper implementation, MTL is competitive against state-of-the-art GBML algorithms on a set of few-shot image classification benchmarks. Since existing GBML algorithms often involve costly second-order bi-level optimization, our first-order MTL method is an order of magnitude faster on large-scale datasets such as mini-ImageNet. We believe this work could help bridge the gap between these two learning paradigms, and provide a computationally efficient alternative to GBML that also supports fast task adaptation.
    A Topic Coverage Approach to Evaluation of Topic Models. (arXiv:2012.06274v2 [cs.IR] UPDATED)
    (2 min) Topic models are widely used unsupervised models of text capable of learning topics - weighted lists of words and documents - from large collections of text documents. When topic models are used for discovery of topics in text collections, a question that arises naturally is how well the model-induced topics correspond to topics of interest to the analyst. In this paper we revisit and extend a so far neglected approach to topic model evaluation based on measuring topic coverage - computationally matching model topics with a set of reference topics that models are expected to uncover. The approach is well suited for analyzing models' performance in topic discovery and for large-scale analysis of both topic models and measures of model quality. We propose new measures of coverage and evaluate, in a series of experiments, different types of topic models on two distinct text domains for which interest for topic discovery exists. The experiments include evaluation of model quality, analysis of coverage of distinct topic categories, and the analysis of the relationship between coverage and other methods of topic model evaluation. The contributions of the paper include new measures of coverage, insights into both topic models and other methods of model evaluation, and the datasets and code for facilitating future research of both topic coverage and other approaches to topic model evaluation.
    Assessing the Impact: Does an Improvement to a Revenue Management System Lead to an Improved Revenue?. (arXiv:2101.10249v2 [cs.LG] UPDATED)
    (2 min) Airlines and other industries have been making use of sophisticated Revenue Management Systems to maximize revenue for decades. While improving the different components of these systems has been the focus of numerous studies, estimating the impact of such improvements on the revenue has been overlooked in the literature despite its practical importance. Indeed, quantifying the benefit of a change in a system serves as support for investment decisions. This is a challenging problem as it corresponds to the difference between the generated value and the value that would have been generated keeping the system as before. The latter is not observable. Moreover, the expected impact can be small in relative value. In this paper, we cast the problem as counterfactual prediction of unobserved revenue. The impact on revenue is then the difference between the observed and the estimated revenue. The originality of this work lies in the innovative application of econometric methods proposed for macroeconomic applications to a new problem setting. Broadly applicable, the approach benefits from only requiring revenue data observed for origin-destination pairs in the network of the airline at each day, before and after a change in the system is applied. We report results using real large-scale data from Air Canada. We compare a deep neural network counterfactual predictions model with econometric models. They achieve respectively 1% and 1.1% of error on the counterfactual revenue predictions, and allow to accurately estimate small impacts (in the order of 2%).
    Variational System Identification for Nonlinear State-Space Models. (arXiv:2012.05072v2 [stat.ML] UPDATED)
    (2 min) This paper considers parameter estimation for nonlinear state-space models, which is an important but challenging problem. We address this challenge by employing a variational inference (VI) approach, which is a principled method that has deep connections to maximum likelihood estimation. This VI approach ultimately provides estimates of the model as solutions to an optimisation problem, which is deterministic, tractable and can be solved using standard optimisation tools. A specialisation of this approach for systems with additive Gaussian noise is also detailed. The proposed method is examined numerically on a range of simulated and real examples focusing on the robustness to parameter initialisation; additionally, favourable comparisons are performed against state-of-the-art alternatives.
    SpeakerStew: Scaling to Many Languages with a Triaged Multilingual Text-Dependent and Text-Independent Speaker Verification System. (arXiv:2104.02125v3 [eess.AS] UPDATED)
    (2 min) In this paper, we describe SpeakerStew - a hybrid system to perform speaker verification on 46 languages. Two core ideas were explored in this system: (1) Pooling training data of different languages together for multilingual generalization and reducing development cycles; (2) A novel triage mechanism between text-dependent and text-independent models to reduce runtime cost and expected latency. To the best of our knowledge, this is the first study of speaker verification systems at the scale of 46 languages. The problem is framed from the perspective of using a smart speaker device with interactions consisting of a wake-up keyword (text-dependent) followed by a speech query (text-independent). Experimental evidence suggests that training on multiple languages can generalize to unseen varieties while maintaining performance on seen varieties. We also found that it can reduce computational requirements for training models by an order of magnitude. Furthermore, during model inference on English data, we observe that leveraging a triage framework can reduce the number of calls to the more computationally expensive text-independent system by 73% (and reduce latency by 59%) while maintaining an EER no worse than the text-independent setup.
    Imperfect ImaGANation: Implications of GANs Exacerbating Biases on Facial Data Augmentation and Snapchat Selfie Lenses. (arXiv:2001.09528v3 [cs.LG] UPDATED)
    (2 min) In this paper, we show that popular Generative Adversarial Networks (GANs) exacerbate biases along the axes of gender and skin tone when given a skewed distribution of face-shots. While practitioners celebrate synthetic data generation using GANs as an economical way to augment data for training data-hungry machine learning models, it is unclear whether they recognize the perils of such techniques when applied to real world datasets biased along latent dimensions. Specifically, we show that (1) traditional GANs further skew the distribution of a dataset consisting of engineering faculty headshots, generating minority modes less often and of worse quality and (2) image-to-image translation (conditional) GANs also exacerbate biases by lightening skin color of non-white faces and transforming female facial features to be masculine when generating faces of engineering professors. Thus, our study is meant to serve as a cautionary tale.
    MixMix: All You Need for Data-Free Compression Are Feature and Data Mixing. (arXiv:2011.09899v2 [cs.LG] UPDATED)
    (2 min) User data confidentiality protection is becoming a rising challenge in the present deep learning research. Without access to data, conventional data-driven model compression faces a higher risk of performance degradation. Recently, some works propose to generate images from a specific pretrained model to serve as training data. However, the inversion process only utilizes biased feature statistics stored in one model and is from low-dimension to high-dimension. As a consequence, it inevitably encounters the difficulties of generalizability and inexact inversion, which leads to unsatisfactory performance. To address these problems, we propose MixMix based on two simple yet effective techniques: (1) Feature Mixing: utilizes various models to construct a universal feature space for generalized inversion; (2) Data Mixing: mixes the synthesized images and labels to generate exact label information. We prove the effectiveness of MixMix from both theoretical and empirical perspectives. Extensive experiments show that MixMix outperforms existing methods on the mainstream compression tasks, including quantization, knowledge distillation, and pruning. Specifically, MixMix achieves up to 4% and 20% accuracy uplift on quantization and pruning, respectively, compared to existing data-free compression work.
    GemNet: Universal Directional Graph Neural Networks for Molecules. (arXiv:2106.08903v1 [physics.comp-ph])
    (2 min) Effectively predicting molecular interactions has the potential to accelerate molecular dynamics by multiple orders of magnitude and thus revolutionize chemical simulations. Graph neural networks (GNNs) have recently shown great successes for this task, overtaking classical methods based on fixed molecular kernels. However, they still appear very limited from a theoretical perspective, since regular GNNs cannot distinguish certain types of graphs. In this work we close this gap between theory and practice. We show that GNNs with directed edge embeddings and two-hop message passing are indeed universal approximators for predictions that are invariant to global rotation and translation, and equivariant to permutation. We then leverage these insights and multiple structural improvements to propose the geometric message passing neural network (GemNet). We demonstrate the benefits of the proposed changes in multiple ablation studies. GemNet outperforms previous models on the COLL and MD17 molecular dynamics datasets by 36%, performing especially well on the most challenging molecules.
    Reset-Free Lifelong Learning with Skill-Space Planning. (arXiv:2012.03548v3 [cs.LG] UPDATED)
    (2 min) The objective of lifelong reinforcement learning (RL) is to optimize agents which can continuously adapt and interact in changing environments. However, current RL approaches fail drastically when environments are non-stationary and interactions are non-episodic. We propose Lifelong Skill Planning (LiSP), an algorithmic framework for non-episodic lifelong RL based on planning in an abstract space of higher-order skills. We learn the skills in an unsupervised manner using intrinsic rewards and plan over the learned skills using a learned dynamics model. Moreover, our framework permits skill discovery even from offline data, thereby reducing the need for excessive real-world interactions. We demonstrate empirically that LiSP successfully enables long-horizon planning and learns agents that can avoid catastrophic failures even in challenging non-stationary and non-episodic environments derived from gridworld and MuJoCo benchmarks.
    Low-memory stochastic backpropagation with multi-channel randomized trace estimation. (arXiv:2106.06998v2 [cs.LG] UPDATED)
    (2 min) Thanks to the combination of state-of-the-art accelerators and highly optimized open software frameworks, there has been tremendous progress in the performance of deep neural networks. While these developments have been responsible for many breakthroughs, progress towards solving large-scale problems, such as video encoding and semantic segmentation in 3D, is hampered because access to on-premise memory is often limited. Instead of relying on (optimal) checkpointing or invertibility of the network layers -- to recover the activations during backpropagation -- we propose to approximate the gradient of convolutional layers in neural networks with a multi-channel randomized trace estimation technique. Compared to other methods, this approach is simple, amenable to analyses, and leads to a greatly reduced memory footprint. Even though the randomized trace estimation introduces stochasticity during training, we argue that this is of little consequence as long as the induced errors are of the same order as errors in the gradient due to the use of stochastic gradient descent. We discuss the performance of networks trained with stochastic backpropagation and how the error can be controlled while maximizing memory usage and minimizing computational overhead.
    Causal Inference in medicine and in health policy, a summary. (arXiv:2105.04655v3 [cs.LG] UPDATED)
    (2 min) A data science task can be deemed as making sense of the data or testing a hypothesis about it. The conclusions inferred from data can greatly guide us to make informative decisions. Big data has enabled us to carry out countless prediction tasks in conjunction with machine learning, such as identifying high risk patients suffering from a certain disease and taking preventable measures. However, healthcare practitioners are not content with mere predictions - they are also interested in the cause-effect relation between input features and clinical outcomes. Understanding such relations will help doctors treat patients and reduce the risk effectively. Causality is typically identified by randomized controlled trials. Often such trials are not feasible when scientists and researchers turn to observational studies and attempt to draw inferences. However, observational studies may also be affected by selection and/or confounding biases that can result in wrong causal conclusions. In this chapter, we will try to highlight some of the drawbacks that may arise in traditional machine learning and statistical approaches to analyze the observational data, particularly in the healthcare data analytics domain. We will discuss causal inference and ways to discover the cause-effect from observational studies in healthcare domain. Moreover, we will demonstrate the applications of causal inference in tackling some common machine learning issues such as missing data and model transportability. Finally, we will discuss the possibility of integrating reinforcement learning with causality as a way to counter confounding bias.
    Simultaneous Training of Partially Masked Neural Networks. (arXiv:2106.08895v1 [cs.LG])
    (2 min) For deploying deep learning models to lower end devices, it is necessary to train less resource-demanding variants of state-of-the-art architectures. This does not eliminate the need for more expensive models as they have a higher performance. In order to avoid training two separate models, we show that it is possible to train neural networks in such a way that a predefined 'core' subnetwork can be split-off from the trained full network with remarkable good performance. We extend on prior methods that focused only on core networks of smaller width, while we focus on supporting arbitrary core network architectures. Our proposed training scheme switches consecutively between optimizing only the core part of the network and the full one. The accuracy of the full model remains comparable, while the core network achieves better performance than when it is trained in isolation. In particular, we show that training a Transformer with a low-rank core gives a low-rank model with superior performance than when training the low-rank model alone. We analyze our training scheme theoretically, and show its convergence under assumptions that are either standard or practically justified. Moreover, we show that the developed theoretical framework allows analyzing many other partial training schemes for neural networks.
    Off-Belief Learning. (arXiv:2103.04000v2 [cs.AI] UPDATED)
    (2 min) The standard problem setting in Dec-POMDPs is self-play, where the goal is to find a set of policies that play optimally together. Policies learned through self-play may adopt arbitrary conventions and implicitly rely on multi-step reasoning based on fragile assumptions about other agents' actions and thus fail when paired with humans or independently trained agents at test time. To address this, we present off-belief learning (OBL). At each timestep OBL agents follow a policy $\pi_1$ that is optimized assuming past actions were taken by a given, fixed policy ($\pi_0$), but assuming that future actions will be taken by $\pi_1$. When $\pi_0$ is uniform random, OBL converges to an optimal policy that does not rely on inferences based on other agents' behavior (an optimal grounded policy). OBL can be iterated in a hierarchy, where the optimal policy from one level becomes the input to the next, thereby introducing multi-level cognitive reasoning in a controlled manner. Unlike existing approaches, which may converge to any equilibrium policy, OBL converges to a unique policy, making it suitable for zero-shot coordination (ZSC). OBL can be scaled to high-dimensional settings with a fictitious transition mechanism and shows strong performance in both a toy-setting and the benchmark human-AI & ZSC problem Hanabi.
    Development of Quantized DNN Library for Exact Hardware Emulation. (arXiv:2106.08892v1 [cs.LG])
    (2 min) Quantization is used to speed up execution time and save power when runnning Deep neural networks (DNNs) on edge devices like AI chips. To investigate the effect of quantization, we need performing inference after quantizing the weights of DNN with 32-bit floating-point precision by a some bit width, and then quantizing them back to 32-bit floating-point precision. This is because the DNN library can only handle floating-point numbers. However, the accuracy of the emulation does not provide accurate precision. We need accurate precision to detect overflow in MAC operations or to verify the operation on edge de vices. We have developed PyParch, a DNN library that executes quantized DNNs (QNNs) with exactly the same be havior as hardware. In this paper, we describe a new proposal and implementation of PyParch. As a result of the evaluation, the accuracy of QNNs with arbitrary bit widths can be estimated for la rge and complex DNNs such as YOLOv5, and the overflow can be detected. We evaluated the overhead of the emulation time and found that it was 5.6 times slower for QNN and 42 times slower for QNN with overflow detection compared to the normal DNN execution time.
    Improving filling level classification with adversarial training. (arXiv:2102.04057v2 [cs.CV] UPDATED)
    (2 min) We investigate the problem of classifying - from a single image - the level of content in a cup or a drinking glass. This problem is made challenging by several ambiguities caused by transparencies, shape variations and partial occlusions, and by the availability of only small training datasets. In this paper, we tackle this problem with an appropriate strategy for transfer learning. Specifically, we use adversarial training in a generic source dataset and then refine the training with a task-specific dataset. We also discuss and experimentally evaluate several training strategies and their combination on a range of container types of the CORSMAL Containers Manipulation dataset. We show that transfer learning with adversarial training in the source domain consistently improves the classification accuracy on the test set and limits the overfitting of the classifier to specific features of the training data.
    Scalable Quasi-Bayesian Inference for Instrumental Variable Regression. (arXiv:2106.08750v1 [stat.ML])
    (2 min) Recent years have witnessed an upsurge of interest in employing flexible machine learning models for instrumental variable (IV) regression, but the development of uncertainty quantification methodology is still lacking. In this work we present a scalable quasi-Bayesian procedure for IV regression, building upon the recently developed kernelized IV models. Contrary to Bayesian modeling for IV, our approach does not require additional assumptions on the data generating process, and leads to a scalable approximate inference algorithm with time cost comparable to the corresponding point estimation methods. Our algorithm can be further extended to work with neural network models. We analyze the theoretical properties of the proposed quasi-posterior, and demonstrate through empirical evaluation the competitive performance of our method.
    Voicy: Zero-Shot Non-Parallel Voice Conversion in Noisy Reverberant Environments. (arXiv:2106.08873v1 [cs.SD])
    (2 min) Voice Conversion (VC) is a technique that aims to transform the non-linguistic information of a source utterance to change the perceived identity of the speaker. While there is a rich literature on VC, most proposed methods are trained and evaluated on clean speech recordings. However, many acoustic environments are noisy and reverberant, severely restricting the applicability of popular VC methods to such scenarios. To address this limitation, we propose Voicy, a new VC framework particularly tailored for noisy speech. Our method, which is inspired by the de-noising auto-encoders framework, is comprised of four encoders (speaker, content, phonetic and acoustic-ASR) and one decoder. Importantly, Voicy is capable of performing non-parallel zero-shot VC, an important requirement for any VC system that needs to work on speakers not seen during training. We have validated our approach using a noisy reverberant version of the LibriSpeech dataset. Experimental results show that Voicy outperforms other tested VC techniques in terms of naturalness and target speaker similarity in noisy reverberant environments.
    Economic Nowcasting with Long Short-Term Memory Artificial Neural Networks (LSTM). (arXiv:2106.08901v1 [econ.EM])
    (2 min) Artificial neural networks (ANNs) have been the catalyst to numerous advances in a variety of fields and disciplines in recent years. Their impact on economics, however, has been comparatively muted. One type of ANN, the long short-term memory network (LSTM), is particularly wellsuited to deal with economic time-series. Here, the architecture's performance and characteristics are evaluated in comparison with the dynamic factor model (DFM), currently a popular choice in the field of economic nowcasting. LSTMs are found to produce superior results to DFMs in the nowcasting of three separate variables; global merchandise export values and volumes, and global services exports. Further advantages include their ability to handle large numbers of input features in a variety of time frequencies. A disadvantage is the inability to ascribe contributions of input features to model outputs, common to all ANNs. In order to facilitate continued applied research of the methodology by avoiding the need for any knowledge of deep-learning libraries, an accompanying Python library was developed using PyTorch, https://pypi.org/project/nowcast-lstm/.
    Selection of Source Images Heavily Influences the Effectiveness of Adversarial Attacks. (arXiv:2106.07141v2 [cs.CV] UPDATED)
    (2 min) Although the adoption rate of deep neural networks (DNNs) has tremendously increased in recent years, a solution for their vulnerability against adversarial examples has not yet been found. As a result, substantial research efforts are dedicated to fix this weakness, with many studies typically using a subset of source images to generate adversarial examples, treating every image in this subset as equal. We demonstrate that, in fact, not every source image is equally suited for this kind of assessment. To do so, we devise a large-scale model-to-model transferability scenario for which we meticulously analyze the properties of adversarial examples, generated from every suitable source image in ImageNet by making use of two of the most frequently deployed attacks. In this transferability scenario, which involves seven distinct DNN models, including the recently proposed vision transformers, we reveal that it is possible to have a difference of up to $12.5\%$ in model-to-model transferability success, $1.01$ in average $L_2$ perturbation, and $0.03$ ($8/225$) in average $L_{\infty}$ perturbation when $1,000$ source images are sampled randomly among all suitable candidates. We then take one of the first steps in evaluating the robustness of images used to create adversarial examples, proposing a number of simple but effective methods to identify unsuitable source images, thus making it possible to mitigate extreme cases in experimentation and support high-quality benchmarking.
    Efficient Deep Learning: A Survey on Making Deep Learning Models Smaller, Faster, and Better. (arXiv:2106.08962v1 [cs.LG])
    (2 min) Deep Learning has revolutionized the fields of computer vision, natural language understanding, speech recognition, information retrieval and more. However, with the progressive improvements in deep learning models, their number of parameters, latency, resources required to train, etc. have all have increased significantly. Consequently, it has become important to pay attention to these footprint metrics of a model as well, not just its quality. We present and motivate the problem of efficiency in deep learning, followed by a thorough survey of the five core areas of model efficiency (spanning modeling techniques, infrastructure, and hardware) and the seminal work there. We also present an experiment-based guide along with code, for practitioners to optimize their model training and deployment. We believe this is the first comprehensive survey in the efficient deep learning space that covers the landscape of model efficiency from modeling techniques to hardware support. Our hope is that this survey would provide the reader with the mental model and the necessary understanding of the field to apply generic efficiency techniques to immediately get significant improvements, and also equip them with ideas for further research and experimentation to achieve additional gains.
    Automatic Social Distance Estimation From Images: Performance Evaluation, Test Benchmark, and Algorithm. (arXiv:2103.06759v2 [cs.CV] UPDATED)
    (3 min) The COVID-19 virus has caused a global pandemic since March 2020. The World Health Organization (WHO) has provided guidelines on how to reduce the spread of the virus and one of the most important measures is social distancing. Maintaining a minimum of one meter distance from other people is strongly suggested to reduce the risk of infection. This has created a strong interest in monitoring the social distances either as a safety measure or to study how the measures have affected human behavior and country-wise differences in this. The need for automatic social distance estimation algorithms is evident, but there is no suitable test benchmark for such algorithms. Collecting images with measured ground-truth pair-wise distances between all the people using different camera settings is cumbersome. Furthermore, performance evaluation for social distance estimation algorithms is not straightforward and there is no widely accepted evaluation protocol. In this paper, we provide a dataset of varying images with measured pair-wise social distances under different camera positionings and focal length values. We suggest a performance evaluation protocol and provide a benchmark to easily evaluate social distance estimation algorithms. We also propose a method for automatic social distance estimation. Our method takes advantage of object detection and human pose estimation. It can be applied on any single image as long as focal length and sensor size information are known. The results on our benchmark are encouraging with 92% human detection rate and only 28.9% average error in distance estimation among the detected people.
    The Partial Response Network: a neural network nomogram. (arXiv:1908.05978v3 [cs.LG] UPDATED)
    (3 min) Among interpretable machine learning methods, the class of Generalised Additive Neural Networks (GANNs) is referred to as Self-Explaining Neural Networks (SENN) because of the linear dependence on explicit functions of the inputs. In binary classification this shows the precise weight that each input contributes towards the logit. The nomogram is a graphical representation of these weights. We show that functions of individual and pairs of variables can be derived from a functional Analysis of Variance (ANOVA) representation, enabling an efficient feature selection to be carried by application of the logistic Lasso. This process infers the structure of GANNs which otherwise needs to be predefined. As this method is particularly suited for tabular data, it starts by fitting a generic flexible model, in this case a Multi-layer Perceptron (MLP) to which the ANOVA decomposition is applied. This has the further advantage that the resulting GANN can be replicated as a SENN, enabling further refinement of the univariate and bivariate component functions to take place. The component functions are partial responses hence the SENN is a partial response network. The Partial Response Network (PRN) is equally as transparent as a traditional logistic regression model, but capable of non-linear classification with comparable or superior performance to the original MLP. In other words, the PRN is a fully interpretable representation of the MLP, at the level of univariate and bivariate effects. The performance of the PRN is shown to be competitive for benchmark data, against state-of-the-art machine learning methods including GBM, SVM and Random Forests. It is also compared with spline-based Sparse Additive Models (SAM) showing that a semi-parametric representation of the GAM as a neural network can be as effective as the SAM though less constrained by the need to set spline nodes.
    Refining Language Models with Compositional Explanations. (arXiv:2103.10415v2 [cs.CL] UPDATED)
    (2 min) Pre-trained language models have been successful on text classification tasks, but are prone to learning spurious correlations from biased datasets, and are thus vulnerable when making inferences in a new domain. Prior works reveal such spurious patterns via post-hoc explanation algorithms which compute the importance of input features. Further, the model is regularized to align the importance scores with human knowledge, so that the unintended model behaviors are eliminated. However, such a regularization technique lacks flexibility and coverage, since only importance scores towards a pre-defined list of features are adjusted, while more complex human knowledge such as feature interaction and pattern generalization can hardly be incorporated. In this work, we propose to refine a learned language model for a target domain by collecting human-provided compositional explanations regarding observed biases. By parsing these explanations into executable logic rules, the human-specified refinement advice from a small set of explanations can be generalized to more training examples. We additionally introduce a regularization term allowing adjustments for both importance and interaction of features to better rectify model behavior. We demonstrate the effectiveness of the proposed approach on two text classification tasks by showing improved performance in target domain as well as improved model fairness after refinement.
    Model-Based Counterfactual Synthesizer for Interpretation. (arXiv:2106.08971v1 [cs.LG])
    (2 min) Counterfactuals, serving as one of the emerging type of model interpretations, have recently received attention from both researchers and practitioners. Counterfactual explanations formalize the exploration of ``what-if'' scenarios, and are an instance of example-based reasoning using a set of hypothetical data samples. Counterfactuals essentially show how the model decision alters with input perturbations. Existing methods for generating counterfactuals are mainly algorithm-based, which are time-inefficient and assume the same counterfactual universe for different queries. To address these limitations, we propose a Model-based Counterfactual Synthesizer (MCS) framework for interpreting machine learning models. We first analyze the model-based counterfactual process and construct a base synthesizer using a conditional generative adversarial net (CGAN). To better approximate the counterfactual universe for those rare queries, we novelly employ the umbrella sampling technique to conduct the MCS framework training. Besides, we also enhance the MCS framework by incorporating the causal dependence among attributes with model inductive bias, and validate its design correctness from the causality identification perspective. Experimental results on several datasets demonstrate the effectiveness as well as efficiency of our proposed MCS framework, and verify the advantages compared with other alternatives.
    How memory architecture affects performance and learning in simple POMDPs. (arXiv:2106.08849v1 [cs.LG])
    (2 min) Reinforcement learning is made much more complex when the agent's observation is partial or noisy. This case corresponds to a partially observable Markov decision process (POMDP). One strategy to seek good performance in POMDPs is to endow the agent with a finite memory, whose update is governed by the policy. However, policy optimization is non-convex in that case and can lead to poor training performance for random initialization. The performance can be empirically improved by constraining the memory architecture, then sacrificing optimality to facilitate training. Here we study this trade-off in the two-arm bandit problem, and compare two extreme cases: (i) the random access memory where any transitions between $M$ memory states are allowed and (ii) a fixed memory where the agent can access its last $m$ actions and rewards. For (i), the probability $q$ to play the worst arm is known to be exponentially small in $M$ for the optimal policy. Our main result is to show that similar performance can be reached for (ii) as well, despite the simplicity of the memory architecture: using a conjecture on Gray-ordered binary necklaces, we find policies for which $q$ is exponentially small in $2^m$ i.e. $q\sim\alpha^{2^m}$ for some $\alpha < 1$. Interestingly, we observe empirically that training from random initialization leads to very poor results for (i), and significantly better results for (ii).
    Localization, Convexity, and Star Aggregation. (arXiv:2105.08866v2 [stat.ML] UPDATED)
    (2 min) Offset Rademacher complexities have been shown to imply sharp, data-dependent upper bounds for the square loss in a broad class of problems including improper statistical learning and online learning. We show that in the statistical setting, the offset complexity upper bound can be generalized to any loss satisfying a certain uniform convexity condition. Amazingly, this condition is shown to also capture exponential concavity and self-concordance, uniting several apparently disparate results. By a unified geometric argument, these bounds translate directly to improper learning in a non-convex class using Audibert's "star algorithm." As applications, we recover the optimal rates for proper and improper learning with the $p$-loss, $1 < p < \infty$ and show that improper variants of empirical risk minimization can attain fast rates for logistic regression and other generalized linear models.
    Beyond Tikhonov: Faster Learning with Self-Concordant Losses via Iterative Regularization. (arXiv:2106.08855v1 [cs.LG])
    (2 min) The theory of spectral filtering is a remarkable tool to understand the statistical properties of learning with kernels. For least squares, it allows to derive various regularization schemes that yield faster convergence rates of the excess risk than with Tikhonov regularization. This is typically achieved by leveraging classical assumptions called source and capacity conditions, which characterize the difficulty of the learning task. In order to understand estimators derived from other loss functions, Marteau-Ferey et al. have extended the theory of Tikhonov regularization to generalized self concordant loss functions (GSC), which contain, e.g., the logistic loss. In this paper, we go a step further and show that fast and optimal rates can be achieved for GSC by using the iterated Tikhonov regularization scheme, which is intrinsically related to the proximal point method in optimization, and overcomes the limitation of the classical Tikhonov regularization.
    LCDNet: Deep Loop Closure Detection and Point Cloud Registration for LiDAR SLAM. (arXiv:2103.05056v2 [cs.RO] UPDATED)
    (2 min) Loop closure detection is an essential component of Simultaneous Localization and Mapping (SLAM) systems, which reduces the drift accumulated over time. Over the years, several deep learning approaches have been proposed to address this task, however their performance has been subpar compared to handcrafted techniques, especially while dealing with reverse loops. In this paper, we introduce the novel LCDNet that effectively detects loop closures in LiDAR point clouds by simultaneously identifying previously visited places and estimating the 6-DoF relative transformation between the current scan and the map. LCDNet is composed of a shared encoder, a place recognition head that extracts global descriptors, and a relative pose head that estimates the transformation between two point clouds. We introduce a novel relative pose head based on the unbalanced optimal transport theory that we implement in a differentiable manner to allow for end-to-end training. Extensive evaluations of LCDNet on multiple real-world autonomous driving datasets show that our approach outperforms state-of-the-art loop closure detection and point cloud registration techniques by a large margin, especially while dealing with reverse loops. Moreover, we integrate our proposed loop closure detection approach into a LiDAR SLAM library to provide a complete mapping system and demonstrate the generalization ability using different sensor setup in an unseen city.
    Dense for the Price of Sparse: Improved Performance of Sparsely Initialized Networks via a Subspace Offset. (arXiv:2102.07655v2 [cs.LG] UPDATED)
    (2 min) That neural networks may be pruned to high sparsities and retain high accuracy is well established. Recent research efforts focus on pruning immediately after initialization so as to allow the computational savings afforded by sparsity to extend to the training process. In this work, we introduce a new `DCT plus Sparse' layer architecture, which maintains information propagation and trainability even with as little as 0.01% trainable kernel parameters remaining. We show that standard training of networks built with these layers, and pruned at initialization, achieves state-of-the-art accuracy for extreme sparsities on a variety of benchmark network architectures and datasets. Moreover, these results are achieved using only simple heuristics to determine the locations of the trainable parameters in the network, and thus without having to initially store or compute with the full, unpruned network, as is required by competing prune-at-initialization algorithms. Switching from standard sparse layers to DCT plus Sparse layers does not increase the storage footprint of a network and incurs only a small additional computational overhead.
    Explicitly Encouraging Low Fractional Dimensional Trajectories Via Reinforcement Learning. (arXiv:2012.11662v2 [cs.LG] UPDATED)
    (2 min) A key limitation in using various modern methods of machine learning in developing feedback control policies is the lack of appropriate methodologies to analyze their long-term dynamics, in terms of making any sort of guarantees (even statistically) about robustness. The central reasons for this are largely due to the so-called curse of dimensionality, combined with the black-box nature of the resulting control policies themselves. This paper aims at the first of these issues. Although the full state space of a system may be quite large in dimensionality, it is a common feature of most model-based control methods that the resulting closed-loop systems demonstrate dominant dynamics that are rapidly driven to some lower-dimensional sub-space within. In this work we argue that the dimensionality of this subspace is captured by tools from fractal geometry, namely various notions of a fractional dimension. We then show that the dimensionality of trajectories induced by model free reinforcement learning agents can be influenced adding a post processing function to the agents reward signal. We verify that the dimensionality reduction is robust to noise being added to the system and show that that the modified agents are more actually more robust to noise and push disturbances in general for the systems we examined.
    Training Generative Adversarial Networks in One Stage. (arXiv:2103.00430v3 [cs.CV] UPDATED)
    (2 min) Generative Adversarial Networks (GANs) have demonstrated unprecedented success in various image generation tasks. The encouraging results, however, come at the price of a cumbersome training process, during which the generator and discriminator are alternately updated in two stages. In this paper, we investigate a general training scheme that enables training GANs efficiently in only one stage. Based on the adversarial losses of the generator and discriminator, we categorize GANs into two classes, Symmetric GANs and Asymmetric GANs, and introduce a novel gradient decomposition method to unify the two, allowing us to train both classes in one stage and hence alleviate the training effort. We also computationally analyze the efficiency of the proposed method, and empirically demonstrate that, the proposed method yields a solid $1.5\times$ acceleration across various datasets and network architectures. Furthermore, we show that the proposed method is readily applicable to other adversarial-training scenarios, such as data-free knowledge distillation. The code is available at https://github.com/zju-vipa/OSGAN.
    Learning the exchange-correlation functional from nature with fully differentiable density functional theory. (arXiv:2102.04229v4 [physics.chem-ph] UPDATED)
    (2 min) Improving the predictive capability of molecular properties in ab initio simulations is essential for advanced material discovery. Despite recent progress making use of machine learning, utilizing deep neural networks to improve quantum chemistry modelling remains severely limited by the scarcity and heterogeneity of appropriate experimental data. Here we show how training a neural network to replace the exchange-correlation functional within a fully-differentiable three-dimensional Kohn-Sham density functional theory (DFT) framework can greatly improve simulation accuracy. Using only eight experimental data points on diatomic molecules, our trained exchange-correlation networks enable improved prediction accuracy of atomization energies across a collection of 104 molecules containing new bonds and atoms that are not present in the training dataset.
    Parameter-free Locally Accelerated Conditional Gradients. (arXiv:2102.06806v2 [math.OC] UPDATED)
    (2 min) Projection-free conditional gradient (CG) methods are the algorithms of choice for constrained optimization setups in which projections are often computationally prohibitive but linear optimization over the constraint set remains computationally feasible. Unlike in projection-based methods, globally accelerated convergence rates are in general unattainable for CG. However, a very recent work on Locally accelerated CG (LaCG) has demonstrated that local acceleration for CG is possible for many settings of interest. The main downside of LaCG is that it requires knowledge of the smoothness and strong convexity parameters of the objective function. We remove this limitation by introducing a novel, Parameter-Free Locally accelerated CG (PF-LaCG) algorithm, for which we provide rigorous convergence guarantees. Our theoretical results are complemented by numerical experiments, which demonstrate local acceleration and showcase the practical improvements of PF-LaCG over non-accelerated algorithms, both in terms of iteration count and wall-clock time.
    Optimized ensemble deep learning framework for scalable forecasting of dynamics containing extreme events. (arXiv:2106.08968v1 [cs.LG])
    (2 min) The remarkable flexibility and adaptability of both deep learning models and ensemble methods have led to the proliferation for their application in understanding many physical phenomena. Traditionally, these two techniques have largely been treated as independent methodologies in practical applications. This study develops an optimized ensemble deep learning (OEDL) framework wherein these two machine learning techniques are jointly used to achieve synergistic improvements in model accuracy, stability, scalability, and reproducibility prompting a new wave of applications in the forecasting of dynamics. Unpredictability is considered as one of the key features of chaotic dynamics, so forecasting such dynamics of nonlinear systems is a relevant issue in the scientific community. It becomes more challenging when the prediction of extreme events is the focus issue for us. In this circumstance, the proposed OEDL model based on a best convex combination of feed-forward neural networks, reservoir computing, and long short-term memory can play a key role in advancing predictions of dynamics consisting of extreme events. The combined framework can generate the best out-of-sample performance than the individual deep learners and standard ensemble framework for both numerically simulated and real world data sets. We exhibit the outstanding performance of the OEDL framework for forecasting extreme events generated from Lienard-type system, prediction of COVID-19 cases in Brazil, dengue cases in San Juan, and sea surface temperature in Nino 3.4 region.
    Nonequilibrium thermodynamics of self-supervised learning. (arXiv:2106.08981v1 [cond-mat.stat-mech])
    (2 min) Self-supervised learning (SSL) of energy based models has an intuitive relation to equilibrium thermodynamics because the softmax layer, mapping energies to probabilities, is a Gibbs distribution. However, in what way SSL is a thermodynamic process? We show that some SSL paradigms behave as a thermodynamic composite system formed by representations and self-labels in contact with a nonequilibrium reservoir. Moreover, this system is subjected to usual thermodynamic cycles, such as adiabatic expansion and isochoric heating, resulting in a generalized Gibbs ensemble (GGE). In this picture, we show that learning is seen as a demon that operates in cycles using feedback measurements to extract negative work from the system. As applications, we examine some SSL algorithms using this idea.
    Two-sample Test using Projected Wasserstein Distance: Breaking the Curse of Dimensionality. (arXiv:2010.11970v3 [stat.ML] UPDATED)
    (2 min) We develop a projected Wasserstein distance for the two-sample test, a fundamental problem in statistics and machine learning: given two sets of samples, to determine whether they are from the same distribution. In particular, we aim to circumvent the curse of dimensionality in Wasserstein distance: when the dimension is high, it has diminishing testing power, which is inherently due to the slow concentration property of Wasserstein metrics in the high dimension space. A key contribution is to couple optimal projection to find the low dimensional linear mapping to maximize the Wasserstein distance between projected probability distributions. We characterize the theoretical property of the finite-sample convergence rate on IPMs and present practical algorithms for computing this metric. Numerical examples validate our theoretical results.
    Improved Sample Complexities for Deep Networks and Robust Classification via an All-Layer Margin. (arXiv:1910.04284v5 [cs.LG] UPDATED)
    (2 min) For linear classifiers, the relationship between (normalized) output margin and generalization is captured in a clear and simple bound -- a large output margin implies good generalization. Unfortunately, for deep models, this relationship is less clear: existing analyses of the output margin give complicated bounds which sometimes depend exponentially on depth. In this work, we propose to instead analyze a new notion of margin, which we call the "all-layer margin." Our analysis reveals that the all-layer margin has a clear and direct relationship with generalization for deep models. This enables the following concrete applications of the all-layer margin: 1) by analyzing the all-layer margin, we obtain tighter generalization bounds for neural nets which depend on Jacobian and hidden layer norms and remove the exponential dependency on depth 2) our neural net results easily translate to the adversarially robust setting, giving the first direct analysis of robust test error for deep networks, and 3) we present a theoretically inspired training algorithm for increasing the all-layer margin. Our algorithm improves both clean and adversarially robust test performance over strong baselines in practice.
    Smoothing the Disentangled Latent Style Space for Unsupervised Image-to-Image Translation. (arXiv:2106.09016v1 [cs.CV])
    (0 min) Image-to-Image (I2I) multi-domain translation models are usually evaluated also using the quality of their semantic interpolation results. However, state-of-the-art models frequently show abrupt changes in the image appearance during interpolation, and usually perform poorly in interpolations across domains. In this paper, we propose a new training protocol based on three specific losses which help a translation network to learn a smooth and disentangled latent style space in which: 1) Both intra- and inter-domain interpolations correspond to gradual changes in the generated images and 2) The content of the source image is better preserved during the translation. Moreover, we propose a novel evaluation metric to properly measure the smoothness of latent style space of I2I translation models. The proposed method can be plugged into existing translation approaches, and our extensive experiments on different datasets show that it can significantly boost the quality of the generated images and the graduality of the interpolations.
    Towards Automatic Actor-Critic Solutions to Continuous Control. (arXiv:2106.08918v1 [cs.LG])
    (0 min) Model-free off-policy actor-critic methods are an efficient solution to complex continuous control tasks. However, these algorithms rely on a number of design tricks and many hyperparameters, making their applications to new domains difficult and computationally expensive. This paper creates an evolutionary approach that automatically tunes these design decisions and eliminates the RL-specific hyperparameters from the Soft Actor-Critic algorithm. Our design is sample efficient and provides practical advantages over baseline approaches, including improved exploration, generalization over multiple control frequencies, and a robust ensemble of high-performance policies. Empirically, we show that our agent outperforms well-tuned hyperparameter settings in popular benchmarks from the DeepMind Control Suite. We then apply it to new control tasks to find high-performance solutions with minimal compute and research effort.
    Thompson Sampling with Information Relaxation Penalties. (arXiv:1902.04251v2 [cs.LG] UPDATED)
    (0 min) We consider a finite-horizon multi-armed bandit (MAB) problem in a Bayesian setting, for which we propose an information relaxation sampling framework. With this framework, we define an intuitive family of control policies that include Thompson sampling (TS) and the Bayesian optimal policy as endpoints. Analogous to TS, which, at each decision epoch pulls an arm that is best with respect to the randomly sampled parameters, our algorithms sample entire future reward realizations and take the corresponding best action. However, this is done in the presence of "penalties" that seek to compensate for the availability of future information. We develop several novel policies and performance bounds for MAB problems that vary in terms of improving performance and increasing computational complexity between the two endpoints. Our policies can be viewed as natural generalizations of TS that simultaneously incorporate knowledge of the time horizon and explicitly consider the exploration-exploitation trade-off. We prove associated structural results on performance bounds and suboptimality gaps. Numerical experiments suggest that this new class of policies perform well, in particular in settings where the finite time horizon introduces significant exploration-exploitation tension into the problem. Finally, inspired by the finite-horizon Gittins index, we propose an index policy that builds on our framework that particularly outperforms the state-of-the-art algorithms in our numerical experiments.
    Polynomial Trajectory Predictions for Improved Learning Performance. (arXiv:2101.12616v2 [cs.CV] UPDATED)
    (2 min) The rising demand for Active Safety systems in automotive applications stresses the need for a reliable short to mid-term trajectory prediction. Anticipating the unfolding path of road users, one can act to increase the overall safety. In this work, we propose to train artificial neural networks for movement understanding by predicting trajectories in their natural form, as a function of time. Predicting polynomial coefficients allows us to increased accuracy and improve generalisation.
    Bandit Modeling of Map Selection in Counter-Strike: Global Offensive. (arXiv:2106.08888v1 [cs.LG])
    (2 min) Many esports use a pick and ban process to define the parameters of a match before it starts. In Counter-Strike: Global Offensive (CSGO) matches, two teams first pick and ban maps, or virtual worlds, to play. Teams typically ban and pick maps based on a variety of factors, such as banning maps which they do not practice, or choosing maps based on the team's recent performance. We introduce a contextual bandit framework to tackle the problem of map selection in CSGO and to investigate teams' pick and ban decision-making. Using a data set of over 3,500 CSGO matches and over 25,000 map selection decisions, we consider different framings for the problem, different contexts, and different reward metrics. We find that teams have suboptimal map choice policies with respect to both picking and banning. We also define an approach for rewarding bans, which has not been explored in the bandit setting, and find that incorporating ban rewards improves model performance. Finally, we determine that usage of our model could improve teams' predicted map win probability by up to 11% and raise overall match win probabilities by 19.8% for evenly-matched teams.
    TabularNet: A Neural Network Architecture for Understanding Semantic Structures of Tabular Data. (arXiv:2106.03096v2 [cs.LG] UPDATED)
    (2 min) Tabular data are ubiquitous for the widespread applications of tables and hence have attracted the attention of researchers to extract underlying information. One of the critical problems in mining tabular data is how to understand their inherent semantic structures automatically. Existing studies typically adopt Convolutional Neural Network (CNN) to model the spatial information of tabular structures yet ignore more diverse relational information between cells, such as the hierarchical and paratactic relationships. To simultaneously extract spatial and relational information from tables, we propose a novel neural network architecture, TabularNet. The spatial encoder of TabularNet utilizes the row/column-level Pooling and the Bidirectional Gated Recurrent Unit (Bi-GRU) to capture statistical information and local positional correlation, respectively. For relational information, we design a new graph construction method based on the WordNet tree and adopt a Graph Convolutional Network (GCN) based encoder that focuses on the hierarchical and paratactic relationships between cells. Our neural network architecture can be a unified neural backbone for different understanding tasks and utilized in a multitask scenario. We conduct extensive experiments on three classification tasks with two real-world spreadsheet data sets, and the results demonstrate the effectiveness of our proposed TabularNet over state-of-the-art baselines.
    Split and Expand: An inference-time improvement for Weakly Supervised Cell Instance Segmentation. (arXiv:2007.10817v2 [cs.CV] UPDATED)
    (2 min) We consider the problem of segmenting cell nuclei instances from Hematoxylin and Eosin (H&E) stains with dot annotations only. While most recent works focus on improving the segmentation quality, this is usually insufficient for instance segmentation of cell instances clustered together or with a small size. In this work, we propose a simple two-step post-processing procedure, Split and Expand, that directly improves the conversion of segmentation maps to instances. In the splitting step, we generate fine-grained cell instances from the segmentation map with the guidance of cell-center predictions. For the expansion step, we utilize Layer-wise Relevance Propagation (LRP) explanation results to add small cells that are not captured in the segmentation map. Although we additionally train an output head to predict cell-centers, the post-processing procedure itself is not explicitly trained and is executed at inference-time only. A feature re-weighting loss based on LRP is proposed to improve our method even further. We test our procedure on the MoNuSeg and TNBC datasets and show quantitatively and qualitatively that our proposed method improves object-level metrics substantially.
    Towards Automated Website Classification by Deep Learning. (arXiv:1910.09991v2 [cs.LG] UPDATED)
    (2 min) In recent years, the interest in Big Data sources has been steadily growing within the Official Statistic community. The Italian National Institute of Statistics (Istat) is currently carrying out several Big Data pilot studies. One of these studies, the ICT Big Data pilot, aims at exploiting massive amounts of textual data automatically scraped from the websites of Italian enterprises in order to predict a set of target variables (e.g. e-commerce) that are routinely observed by the traditional ICT Survey. In this paper, we show that Deep Learning techniques can successfully address this problem. Essentially, we tackle a text classification task: an algorithm must learn to infer whether an Italian enterprise performs e-commerce from the textual content of its website. To reach this goal, we developed a sophisticated processing pipeline and evaluated its performance through extensive experiments. Our pipeline uses Convolutional Neural Networks and relies on Word Embeddings to encode raw texts into grayscale images (i.e. normalized numeric matrices). Web-scraped texts are huge and have very low signal to noise ratio: to overcome these issues, we adopted a framework known as False Positive Reduction, which has seldom (if ever) been applied before to text classification tasks. Several original contributions enable our processing pipeline to reach good classification results. Empirical evidence shows that our proposal outperforms all the alternative Machine Learning solutions already tested in Istat for the same task.
    LassoNet: A Neural Network with Feature Sparsity. (arXiv:1907.12207v10 [stat.ML] UPDATED)
    (2 min) Much work has been done recently to make neural networks more interpretable, and one obvious approach is to arrange for the network to use only a subset of the available features. In linear models, Lasso (or $\ell_1$-regularized) regression assigns zero weights to the most irrelevant or redundant features, and is widely used in data science. However the Lasso only applies to linear models. Here we introduce LassoNet, a neural network framework with global feature selection. Our approach enforces a hierarchy: specifically a feature can participate in a hidden unit only if its linear representative is active. Unlike other approaches to feature selection for neural nets, our method uses a modified objective function with constraints, and so integrates feature selection with the parameter learning directly. As a result, it delivers an entire regularization path of solutions with a range of feature sparsity. On systematic experiments, LassoNet significantly outperforms state-of-the-art methods for feature selection and regression. The LassoNet method uses projected proximal gradient descent, and generalizes directly to deep networks. It can be implemented by adding just a few lines of code to a standard neural network.
    Topology Distillation for Recommender System. (arXiv:2106.08700v1 [cs.LG])
    (2 min) Recommender Systems (RS) have employed knowledge distillation which is a model compression technique training a compact student model with the knowledge transferred from a pre-trained large teacher model. Recent work has shown that transferring knowledge from the teacher's intermediate layer significantly improves the recommendation quality of the student. However, they transfer the knowledge of individual representation point-wise and thus have a limitation in that primary information of RS lies in the relations in the representation space. This paper proposes a new topology distillation approach that guides the student by transferring the topological structure built upon the relations in the teacher space. We first observe that simply making the student learn the whole topological structure is not always effective and even degrades the student's performance. We demonstrate that because the capacity of the student is highly limited compared to that of the teacher, learning the whole topological structure is daunting for the student. To address this issue, we propose a novel method named Hierarchical Topology Distillation (HTD) which distills the topology hierarchically to cope with the large capacity gap. Our extensive experiments on real-world datasets show that the proposed method significantly outperforms the state-of-the-art competitors. We also provide in-depth analyses to ascertain the benefit of distilling the topology for RS.
    Interval-censored Hawkes processes. (arXiv:2104.07932v2 [cs.LG] UPDATED)
    (2 min) This work builds a novel point process and tools to use the Hawkes process with interval-censored data. Such data records the aggregated counts of events solely during specific time intervals -- such as the number of patients admitted to the hospital or the volume of vehicles passing traffic loop detectors -- and not the exact occurrence time of the events. First, we establish the Mean Behavior Poisson (MBP) process, a novel Poisson process with a direct parameter correspondence to the popular self-exciting Hawkes process. The event intensity function of the MBP is the expected intensity over all possible Hawkes realizations with the same parameter set. We fit MBP in the interval-censored setting using an interval-censored Poisson log-likelihood (IC-LL). We use the parameter equivalence to uncover the parameters of the associated Hawkes process. Second, we introduce two novel exogenous functions to distinguish the exogenous from the endogenous events. We propose the multi-impulse exogenous function when the exogenous events are observed as event time and the latent homogeneous Poisson process exogenous function when the exogenous events are presented as interval-censored volumes. Third, we provide several approximation methods to estimate the intensity and compensator function of MBP when no analytical solution exists. Fourth and finally, we connect the interval-censored loss of MBP to a broader class of Bregman divergence-based functions. Using the connection, we show that the current state of the art in popularity estimation (Hawkes Intensity Process (HIP) (Rizoiu et al.,2017b)) is a particular case of the MBP process. We verify our models through empirical testing on synthetic data and real-world data. We find that on real-world datasets that ourMBP process outperforms HIP for the task of popularity prediction.
    Regularized Orthogonal Machine Learning for Nonlinear Semiparametric Models. (arXiv:1806.04823v7 [math.ST] UPDATED)
    (2 min) This paper proposes a Lasso-type estimator for a high-dimensional sparse parameter identified by a single index conditional moment restriction (CMR). In addition to this parameter, the moment function can also depend on a nuisance function, such as the propensity score or the conditional choice probability, which we estimate by modern machine learning tools. We first adjust the moment function so that the gradient of the future loss function is insensitive (formally, Neyman-orthogonal) with respect to the first-stage regularization bias, preserving the single index property. We then take the loss function to be an indefinite integral of the adjusted moment function with respect to the single index. The proposed Lasso estimator converges at the oracle rate, where the oracle knows the nuisance function and solves only the parametric problem. We demonstrate our method by estimating the short-term heterogeneous impact of Connecticut's Jobs First welfare reform experiment on women's welfare participation decision.
    Nonparametric Empirical Bayes Estimation and Testing for Sparse and Heteroscedastic Signals. (arXiv:2106.08881v1 [cs.LG])
    (0 min) Large-scale modern data often involves estimation and testing for high-dimensional unknown parameters. It is desirable to identify the sparse signals, ``the needles in the haystack'', with accuracy and false discovery control. However, the unprecedented complexity and heterogeneity in modern data structure require new machine learning tools to effectively exploit commonalities and to robustly adjust for both sparsity and heterogeneity. In addition, estimates for high-dimensional parameters often lack uncertainty quantification. In this paper, we propose a novel Spike-and-Nonparametric mixture prior (SNP) -- a spike to promote the sparsity and a nonparametric structure to capture signals. In contrast to the state-of-the-art methods, the proposed methods solve the estimation and testing problem at once with several merits: 1) an accurate sparsity estimation; 2) point estimates with shrinkage/soft-thresholding property; 3) credible intervals for uncertainty quantification; 4) an optimal multiple testing procedure that controls false discovery rate. Our method exhibits promising empirical performance on both simulated data and a gene expression case study.
    Using Machine Learning to Select High-Quality Measurements. (arXiv:2106.08891v1 [physics.data-an])
    (2 min) We describe the use of machine learning algorithms to select high-quality measurements for the Mu2e experiment. This technique is important for experiments with backgrounds that arise due to measurement errors. The algorithms use multiple pieces of ancillary information that are sensitive to measurement quality to separate high-quality and low-quality measurements.
    Communication-Efficient Agnostic Federated Averaging. (arXiv:2104.02748v2 [cs.LG] UPDATED)
    (0 min) In distributed learning settings such as federated learning, the training algorithm can be potentially biased towards different clients. Mohri et al. (2019) proposed a domain-agnostic learning algorithm, where the model is optimized for any target distribution formed by a mixture of the client distributions in order to overcome this bias. They further proposed an algorithm for the cross-silo federated learning setting, where the number of clients is small. We consider this problem in the cross-device setting, where the number of clients is much larger. We propose a communication-efficient distributed algorithm called Agnostic Federated Averaging (or AgnosticFedAvg) to minimize the domain-agnostic objective proposed in Mohri et al. (2019), which is amenable to other private mechanisms such as secure aggregation. We highlight two types of naturally occurring domains in federated learning and argue that AgnosticFedAvg performs well on both. To demonstrate the practical effectiveness of AgnosticFedAvg, we report positive results for large-scale language modeling tasks in both simulation and live experiments, where the latter involves training language models for Spanish virtual keyboard for millions of user devices.
    Outside the Echo Chamber: Optimizing the Performative Risk. (arXiv:2102.08570v2 [cs.LG] UPDATED)
    (2 min) In performative prediction, predictions guide decision-making and hence can influence the distribution of future data. To date, work on performative prediction has focused on finding performatively stable models, which are the fixed points of repeated retraining. However, stable solutions can be far from optimal when evaluated in terms of the performative risk, the loss experienced by the decision maker when deploying a model. In this paper, we shift attention beyond performative stability and focus on optimizing the performative risk directly. We identify a natural set of properties of the loss function and model-induced distribution shift under which the performative risk is convex, a property which does not follow from convexity of the loss alone. Furthermore, we develop algorithms that leverage our structural assumptions to optimize the performative risk with better sample efficiency than generic methods for derivative-free convex optimization.
    Collaborative Learning and Personalization in Multi-Agent Stochastic Linear Bandits. (arXiv:2106.08902v1 [stat.ML])
    (2 min) We consider the problem of minimizing regret in an $N$ agent heterogeneous stochastic linear bandits framework, where the agents (users) are similar but not all identical. We model user heterogeneity using two popularly used ideas in practice; (i) A clustering framework where users are partitioned into groups with users in the same group being identical to each other, but different across groups, and (ii) a personalization framework where no two users are necessarily identical, but a user's parameters are close to that of the population average. In the clustered users' setup, we propose a novel algorithm, based on successive refinement of cluster identities and regret minimization. We show that, for any agent, the regret scales as $\mathcal{O}(\sqrt{T/N})$, if the agent is in a `well separated' cluster, or scales as $\mathcal{O}(T^{\frac{1}{2} + \varepsilon}/(N)^{\frac{1}{2} -\varepsilon})$ if its cluster is not well separated, where $\varepsilon$ is positive and arbitrarily close to $0$. Our algorithm is adaptive to the cluster separation, and is parameter free -- it does not need to know the number of clusters, separation and cluster size, yet the regret guarantee adapts to the inherent complexity. In the personalization framework, we introduce a natural algorithm where, the personal bandit instances are initialized with the estimates of the global average model. We show that, an agent $i$ whose parameter deviates from the population average by $\epsilon_i$, attains a regret scaling of $\widetilde{O}(\epsilon_i\sqrt{T})$. This demonstrates that if the user representations are close (small $\epsilon_i)$, the resulting regret is low, and vice-versa. The results are empirically validated and we observe superior performance of our adaptive algorithms over non-adaptive baselines.
    LemgoRL: An open-source Benchmark Tool to Train Reinforcement Learning Agents for Traffic Signal Control in a real-world simulation scenario. (arXiv:2103.16223v2 [cs.LG] UPDATED)
    (2 min) Sub-optimal control policies in intersection traffic signal controllers (TSC) contribute to congestion and lead to negative effects on human health and the environment. Reinforcement learning (RL) for traffic signal control is a promising approach to design better control policies and has attracted considerable research interest in recent years. However, most work done in this area used simplified simulation environments of traffic scenarios to train RL-based TSC. To deploy RL in real-world traffic systems, the gap between simplified simulation environments and real-world applications has to be closed. Therefore, we propose LemgoRL, a benchmark tool to train RL agents as TSC in a realistic simulation environment of Lemgo, a medium-sized town in Germany. In addition to the realistic simulation model, LemgoRL encompasses a traffic signal logic unit that ensures compliance with all regulatory and safety requirements. LemgoRL offers the same interface as the well-known OpenAI gym toolkit to enable easy deployment in existing research work. Our benchmark tool drives the development of RL algorithms towards real-world applications. We provide LemgoRL as an open-source tool at https://github.com/rl-ina/lemgorl.
    Few-shot Neural Architecture Search. (arXiv:2006.06863v8 [cs.LG] UPDATED)
    (2 min) Efficient evaluation of a network architecture drawn from a large search space remains a key challenge in Neural Architecture Search (NAS). Vanilla NAS evaluates each architecture by training from scratch, which gives the true performance but is extremely time-consuming. Recently, one-shot NAS substantially reduces the computation cost by training only one supernetwork, a.k.a. supernet, to approximate the performance of every architecture in the search space via weight-sharing. However, the performance estimation can be very inaccurate due to the co-adaption among operations. In this paper, we propose few-shot NAS that uses multiple supernetworks, called sub-supernet, each covering different regions of the search space to alleviate the undesired co-adaption. Compared to one-shot NAS, few-shot NAS improves the accuracy of architecture evaluation with a small increase of evaluation cost. With only up to 7 sub-supernets, few-shot NAS establishes new SoTAs: on ImageNet, it finds models that reach 80.5% top-1 accuracy at 600 MB FLOPS and 77.5% top-1 accuracy at 238 MFLOPS; on CIFAR10, it reaches 98.72% top-1 accuracy without using extra data or transfer learning. In Auto-GAN, few-shot NAS outperforms the previously published results by up to 20%. Extensive experiments show that few-shot NAS significantly improves various one-shot methods, including 4 gradient-based and 6 search-based methods on 3 different tasks in NasBench-201 and NasBench1-shot-1.
    Ideal formulations for constrained convex optimization problems with indicator variables. (arXiv:2007.00107v2 [math.OC] UPDATED)
    (2 min) Motivated by modern regression applications, in this paper, we study the convexification of a class of convex optimization problems with indicator variables and combinatorial constraints on the indicators. Unlike most of the previous work on convexification of sparse regression problems, we simultaneously consider the nonlinear non-separable objective, indicator variables, and combinatorial constraints. Specifically, we give the convex hull description of the epigraph of the composition of a one-dimensional convex function and an affine function under arbitrary combinatorial constraints. As special cases of this result, we derive ideal convexifications for problems with hierarchy, multi-collinearity, and sparsity constraints. Moreover, we also give a short proof that for a separable objective function, the perspective reformulation is ideal independent from the constraints of the problem. Our computational experiments with regression problems under hierarchy constraints on real datasets demonstrate the potential of the proposed approach in improving the relaxation quality without significant computational overhead.
    Communication-Efficient Federated Learning with Compensated Overlap-FedAvg. (arXiv:2012.06706v2 [cs.LG] UPDATED)
    (2 min) Petabytes of data are generated each day by emerging Internet of Things (IoT), but only few of them can be finally collected and used for Machine Learning (ML) purposes due to the apprehension of data & privacy leakage, which seriously retarding ML's growth. To alleviate this problem, Federated learning is proposed to perform model training by multiple clients' combined data without the dataset sharing within the cluster. Nevertheless, federated learning introduces massive communication overhead as the synchronized data in each epoch is of the same size as the model, and thereby leading to a low communication efficiency. Consequently, variant methods mainly focusing on the communication rounds reduction and data compression are proposed to reduce the communication overhead of federated learning. In this paper, we propose Overlap-FedAvg, a framework that parallels the model training phase with model uploading & downloading phase, so that the latter phase can be totally covered by the former phase. Compared to vanilla FedAvg, Overlap-FedAvg is further developed with a hierarchical computing strategy, a data compensation mechanism and a nesterov accelerated gradients~(NAG) algorithm. Besides, Overlap-FedAvg is orthogonal to many other compression methods so that they can be applied together to maximize the utilization of the cluster. Furthermore, the theoretical analysis is provided to prove the convergence of the proposed Overlap-FedAvg framework. Extensive experiments on both conventional and recurrent tasks with multiple models and datasets also demonstrate that the proposed Overlap-FedAvg framework substantially boosts the federated learning process.
    Cardiovascular Disease Prediction using Recursive Feature Elimination and Gradient Boosting Classification Techniques. (arXiv:2106.08889v1 [cs.LG])
    (0 min) Cardiovascular diseases (CVDs) are one of the most common chronic illnesses that affect peoples health. Early detection of CVDs can reduce mortality rates by preventing or reducing the severity of the disease. Machine learning algorithms are a promising method for identifying risk factors. This paper proposes a proposed recursive feature elimination-based gradient boosting (RFE-GB) algorithm in order to obtain accurate heart disease prediction. The patients health record with important CVD features has been analyzed for the evaluation of the results. Several other machine learning methods were also used to build the prediction model, and the results were compared with the proposed model. The results of this proposed model infer that the combined recursive feature elimination and gradient boosting algorithm achieves the highest accuracy (89.7 %). Further, with an area under the curve of 0.84, the proposed RFE-GB algorithm was found superior and had obtained a substantial gain over other techniques. Thus, the proposed RFE-GB algorithm will serve as a prominent model for CVD estimation and treatment.
    Learning effective stochastic differential equations from microscopic simulations: combining stochastic numerics and deep learning. (arXiv:2106.09004v1 [physics.comp-ph])
    (2 min) We identify effective stochastic differential equations (SDE) for coarse observables of fine-grained particle- or agent-based simulations; these SDE then provide coarse surrogate models of the fine scale dynamics. We approximate the drift and diffusivity functions in these effective SDE through neural networks, which can be thought of as effective stochastic ResNets. The loss function is inspired by, and embodies, the structure of established stochastic numerical integrators (here, Euler-Maruyama and Milstein); our approximations can thus benefit from error analysis of these underlying numerical schemes. They also lend themselves naturally to "physics-informed" gray-box identification when approximate coarse models, such as mean field equations, are available. Our approach does not require long trajectories, works on scattered snapshot data, and is designed to naturally handle different time steps per snapshot. We consider both the case where the coarse collective observables are known in advance, as well as the case where they must be found in a data-driven manner.
    High-Dimensional Bayesian Optimisation with Variational Autoencoders and Deep Metric Learning. (arXiv:2106.03609v2 [cs.LG] UPDATED)
    (2 min) We introduce a method based on deep metric learning to perform Bayesian optimisation over high-dimensional, structured input spaces using variational autoencoders (VAEs). By extending ideas from supervised deep metric learning, we address a longstanding problem in high-dimensional VAE Bayesian optimisation, namely how to enforce a discriminative latent space as an inductive bias. Importantly, we achieve such an inductive bias using just 1% of the available labelled data relative to previous work, highlighting the sample efficiency of our approach. As a theoretical contribution, we present a proof of vanishing regret for our method. As an empirical contribution, we present state-of-the-art results on real-world high-dimensional black-box optimisation problems including property-guided molecule generation. It is the hope that the results presented in this paper can act as a guiding principle for realising effective high-dimensional Bayesian optimisation.
    Edge Sparse Basis Network: A Deep Learning Framework for EEG Source Localization. (arXiv:2102.09188v3 [cs.LG] UPDATED)
    (2 min) EEG source localization is an important technical issue in EEG analysis. Despite many numerical methods existed for EEG source localization, they all rely on strong priors and the deep sources are intractable. Here we propose a deep learning framework using spatial basis function decomposition for EEG source localization. This framework combines the edge sparsity prior and Gaussian source basis, called Edge Sparse Basis Network (ESBN). The performance of ESBN is validated by both synthetic data and real EEG data during motor tasks. The results suggest that the supervised ESBN outperforms the traditional numerical methods in synthetic data and the unsupervised fine-tuning provides more focal and accurate localizations in real data. Our proposed deep learning framework can be extended to account for other source priors, and the real-time property of ESBN can facilitate the applications of EEG in brain-computer interfaces and clinics.
    Learning-Based Vulnerability Analysis of Cyber-Physical Systems. (arXiv:2103.06271v2 [cs.CR] UPDATED)
    (2 min) This work focuses on the use of deep learning for vulnerability analysis of cyber-physical systems (CPS). Specifically, we consider a control architecture widely used in CPS (e.g., robotics), where the low-level control is based on e.g., the extended Kalman filter (EKF) and an anomaly detector. To facilitate analyzing the impact potential sensing attacks could have, our objective is to develop learning-enabled attack generators capable of designing stealthy attacks that maximally degrade system operation. We show how such problem can be cast within a learning-based grey-box framework where parts of the runtime information are known to the attacker, and introduce two models based on feed-forward neural networks (FNN); both models are trained offline, using a cost function that combines the attack effects on the estimation error and the residual signal used for anomaly detection, so that the trained models are capable of recursively generating such effective sensor attacks in real-time. The effectiveness of the proposed methods is illustrated on several case studies.
    Using Voice and Biofeedback to Predict User Engagement during Requirements Interviews. (arXiv:2104.02410v2 [cs.SE] UPDATED)
    (3 min) Capturing users engagement is crucial for gathering feedback about the features of a software product. In a market-driven context, current approaches to collect and analyze users feedback are based on techniques leveraging information extracted from product reviews and social media. These approaches are hardly applicable in bespoke software development, or in contexts in which one needs to gather information from specific users. In such cases, companies need to resort to face-to-face interviews to get feedback on their products. In this paper, we propose to utilize biometric data, in terms of physiological and voice features, to complement interviews with information about the engagement of the user on the discussed product-relevant topics. We evaluate our approach by interviewing users while gathering their physiological data (i.e., biofeedback) using an Empatica E4 wristband, and capturing their voice through the default audio-recorder of a common laptop. Our results show that we can predict users' engagement by training supervised machine learning algorithms on biometric data, and that voice features alone can be sufficiently effective. The performance of the prediction algorithms is maximised when pre-processing the training data with the synthetic minority oversampling technique (SMOTE). The results of our work suggest that biofeedback and voice analysis can be used to facilitate prioritization of requirements oriented to product improvement, and to steer the interview based on users' engagement. Furthermore, the usage of voice features can be particularly helpful for emotion-aware requirements elicitation in remote communication, either performed by human analysts or voice-based chatbots.
    Gradient-trained Weights in Wide Neural Networks Align Layerwise to Error-scaled Input Correlations. (arXiv:2106.08453v1 [cs.LG])
    (2 min) Recent works have examined how deep neural networks, which can solve a variety of difficult problems, incorporate the statistics of training data to achieve their success. However, existing results have been established only in limited settings. In this work, we derive the layerwise weight dynamics of infinite-width neural networks with nonlinear activations trained by gradient descent. We show theoretically that weight updates are aligned with input correlations from intermediate layers weighted by error, and demonstrate empirically that the result also holds in finite-width wide networks. The alignment result allows us to formulate backpropagation-free learning rules, named Align-zero and Align-ada, that theoretically achieve the same alignment as backpropagation. Finally, we test these learning rules on benchmark problems in feedforward and recurrent neural networks and demonstrate, in wide networks, comparable performance to backpropagation.
    Evolving Image Compositions for Feature Representation Learning. (arXiv:2106.09011v1 [cs.CV])
    (2 min) Convolutional neural networks for visual recognition require large amounts of training samples and usually benefit from data augmentation. This paper proposes PatchMix, a data augmentation method that creates new samples by composing patches from pairs of images in a grid-like pattern. These new samples' ground truth labels are set as proportional to the number of patches from each image. We then add a set of additional losses at the patch-level to regularize and to encourage good representations at both the patch and image levels. A ResNet-50 model trained on ImageNet using PatchMix exhibits superior transfer learning capabilities across a wide array of benchmarks. Although PatchMix can rely on random pairings and random grid-like patterns for mixing, we explore evolutionary search as a guiding strategy to discover optimal grid-like patterns and image pairing jointly. For this purpose, we conceive a fitness function that bypasses the need to re-train a model to evaluate each choice. In this way, PatchMix outperforms a base model on CIFAR-10 (+1.91), CIFAR-100 (+5.31), Tiny Imagenet (+3.52), and ImageNet (+1.16) by significant margins, also outperforming previous state-of-the-art pairwise augmentation strategies.
    Improved CNN-based Learning of Interpolation Filters for Low-Complexity Inter Prediction in Video Coding. (arXiv:2106.08936v1 [eess.IV])
    (2 min) The versatility of recent machine learning approaches makes them ideal for improvement of next generation video compression solutions. Unfortunately, these approaches typically bring significant increases in computational complexity and are difficult to interpret into explainable models, affecting their potential for implementation within practical video coding applications. This paper introduces a novel explainable neural network-based inter-prediction scheme, to improve the interpolation of reference samples needed for fractional precision motion compensation. The approach requires a single neural network to be trained from which a full quarter-pixel interpolation filter set is derived, as the network is easily interpretable due to its linear structure. A novel training framework enables each network branch to resemble a specific fractional shift. This practical solution makes it very efficient to use alongside conventional video coding schemes. When implemented in the context of the state-of-the-art Versatile Video Coding (VVC) test model, 0.77%, 1.27% and 2.25% BD-rate savings can be achieved on average for lower resolution sequences under the random access, low-delay B and low-delay P configurations, respectively, while the complexity of the learned interpolation schemes is significantly reduced compared to the interpolation with full CNNs.
    Towards Evaluating and Training Verifiably Robust Neural Networks. (arXiv:2104.00447v3 [cs.CV] UPDATED)
    (2 min) Recent works have shown that interval bound propagation (IBP) can be used to train verifiably robust neural networks. Reseachers observe an intriguing phenomenon on these IBP trained networks: CROWN, a bounding method based on tight linear relaxation, often gives very loose bounds on these networks. We also observe that most neurons become dead during the IBP training process, which could hurt the representation capability of the network. In this paper, we study the relationship between IBP and CROWN, and prove that CROWN is always tighter than IBP when choosing appropriate bounding lines. We further propose a relaxed version of CROWN, linear bound propagation (LBP), that can be used to verify large networks to obtain lower verified errors than IBP. We also design a new activation function, parameterized ramp function (ParamRamp), which has more diversity of neuron status than ReLU. We conduct extensive experiments on MNIST, CIFAR-10 and Tiny-ImageNet with ParamRamp activation and achieve state-of-the-art verified robustness. Code and the appendix are available at https://github.com/ZhaoyangLyu/VerifiablyRobustNN.
    WaveNet-Based Deep Neural Networks for the Characterization of Anomalous Diffusion (WADNet). (arXiv:2106.08887v1 [cs.LG])
    (2 min) Anomalous diffusion, which shows a deviation of transport dynamics from the framework of standard Brownian motion, is involved in the evolution of various physical, chemical, biological, and economic systems. The study of such random processes is of fundamental importance in unveiling the physical properties of random walkers and complex systems. However, classical methods to characterize anomalous diffusion are often disqualified for individual short trajectories, leading to the launch of the Anomalous Diffusion (AnDi) Challenge. This challenge aims at objectively assessing and comparing new approaches for single trajectory characterization, with respect to three different aspects: the inference of the anomalous diffusion exponent; the classification of the diffusion model; and the segmentation of trajectories. In this article, to address the inference and classification tasks in the challenge, we develop a WaveNet-based deep neural network (WADNet) by combining a modified WaveNet encoder with long short-term memory networks, without any prior knowledge of anomalous diffusion. As the performance of our model has surpassed the current 1st places in the challenge leaderboard on both two tasks for all dimensions (6 subtasks), WADNet could be the part of state-of-the-art techniques to decode the AnDi database. Our method presents a benchmark for future research, and could accelerate the development of a versatile tool for the characterization of anomalous diffusion.
    Automated scoring of pre-REM sleep in mice with deep learning. (arXiv:2105.01933v2 [q-bio.QM] UPDATED)
    (2 min) Reliable automation of the labor-intensive manual task of scoring animal sleep can facilitate the analysis of long-term sleep studies. In recent years, deep-learning-based systems, which learn optimal features from the data, increased scoring accuracies for the classical sleep stages of Wake, REM, and Non-REM. Meanwhile, it has been recognized that the statistics of transitional stages such as pre-REM, found between Non-REM and REM, may hold additional insight into the physiology of sleep and are now under vivid investigation. We propose a classification system based on a simple neural network architecture that scores the classical stages as well as pre-REM sleep in mice. When restricted to the classical stages, the optimized network showed state-of-the-art classification performance with an out-of-sample F1 score of 0.95 in male C57BL/6J mice. When unrestricted, the network showed lower F1 scores on pre-REM (0.5) compared to the classical stages. The result is comparable to previous attempts to score transitional stages in other species such as transition sleep in rats or N1 sleep in humans. Nevertheless, we observed that the sequence of predictions including pre-REM typically transitioned from Non-REM to REM reflecting sleep dynamics observed by human scorers. Our findings provide further evidence for the difficulty of scoring transitional sleep stages, likely because such stages of sleep are under-represented in typical data sets or show large inter-scorer variability. We further provide our source code and an online platform to run predictions with our trained network.
    Fundamental Limits of Reinforcement Learning in Environment with Endogeneous and Exogeneous Uncertainty. (arXiv:2106.08477v1 [cs.LG])
    (2 min) Online reinforcement learning (RL) has been widely applied in information processing scenarios, which usually exhibit much uncertainty due to the intrinsic randomness of channels and service demands. In this paper, we consider an un-discounted RL in general Markov decision processes (MDPs) with both endogeneous and exogeneous uncertainty, where both the rewards and state transition probability are unknown to the RL agent and evolve with the time as long as their respective variations do not exceed certain dynamic budget (i.e., upper bound). We first develop a variation-aware Bernstein-based upper confidence reinforcement learning (VB-UCRL), which we allow to restart according to a schedule dependent on the variations. We successfully overcome the challenges due to the exogeneous uncertainty and establish a regret bound of saving at most $\sqrt{S}$ or $S^{\frac{1}{6}}T^{\frac{1}{12}}$ compared with the latest results in the literature, where $S$ denotes the state size of the MDP and $T$ indicates the iteration index of learning steps.
    Adversarial Attacks on Deep Models for Financial Transaction Records. (arXiv:2106.08361v1 [cs.LG])
    (2 min) Machine learning models using transaction records as inputs are popular among financial institutions. The most efficient models use deep-learning architectures similar to those in the NLP community, posing a challenge due to their tremendous number of parameters and limited robustness. In particular, deep-learning models are vulnerable to adversarial attacks: a little change in the input harms the model's output. In this work, we examine adversarial attacks on transaction records data and defences from these attacks. The transaction records data have a different structure than the canonical NLP or time series data, as neighbouring records are less connected than words in sentences, and each record consists of both discrete merchant code and continuous transaction amount. We consider a black-box attack scenario, where the attack doesn't know the true decision model, and pay special attention to adding transaction tokens to the end of a sequence. These limitations provide more realistic scenario, previously unexplored in NLP world. The proposed adversarial attacks and the respective defences demonstrate remarkable performance using relevant datasets from the financial industry. Our results show that a couple of generated transactions are sufficient to fool a deep-learning model. Further, we improve model robustness via adversarial training or separate adversarial examples detection. This work shows that embedding protection from adversarial attacks improves model robustness, allowing a wider adoption of deep models for transaction records in banking and finance.
    Momentum Pseudo-Labeling for Semi-Supervised Speech Recognition. (arXiv:2106.08922v1 [eess.AS])
    (2 min) Pseudo-labeling (PL) has been shown to be effective in semi-supervised automatic speech recognition (ASR), where a base model is self-trained with pseudo-labels generated from unlabeled data. While PL can be further improved by iteratively updating pseudo-labels as the model evolves, most of the previous approaches involve inefficient retraining of the model or intricate control of the label update. We present momentum pseudo-labeling (MPL), a simple yet effective strategy for semi-supervised ASR. MPL consists of a pair of online and offline models that interact and learn from each other, inspired by the mean teacher method. The online model is trained to predict pseudo-labels generated on the fly by the offline model. The offline model maintains a momentum-based moving average of the online model. MPL is performed in a single training process and the interaction between the two models effectively helps them reinforce each other to improve the ASR performance. We apply MPL to an end-to-end ASR model based on the connectionist temporal classification. The experimental results demonstrate that MPL effectively improves over the base model and is scalable to different semi-supervised scenarios with varying amounts of data or domain mismatch.
    On the Sample Complexity and Metastability of Heavy-tailed Policy Search in Continuous Control. (arXiv:2106.08414v1 [cs.LG])
    (2 min) Reinforcement learning is a framework for interactive decision-making with incentives sequentially revealed across time without a system dynamics model. Due to its scaling to continuous spaces, we focus on policy search where one iteratively improves a parameterized policy with stochastic policy gradient (PG) updates. In tabular Markov Decision Problems (MDPs), under persistent exploration and suitable parameterization, global optimality may be obtained. By contrast, in continuous space, the non-convexity poses a pathological challenge as evidenced by existing convergence results being mostly limited to stationarity or arbitrary local extrema. To close this gap, we step towards persistent exploration in continuous space through policy parameterizations defined by distributions of heavier tails defined by tail-index parameter alpha, which increases the likelihood of jumping in state space. Doing so invalidates smoothness conditions of the score function common to PG. Thus, we establish how the convergence rate to stationarity depends on the policy's tail index alpha, a Holder continuity parameter, integrability conditions, and an exploration tolerance parameter introduced here for the first time. Further, we characterize the dependence of the set of local maxima on the tail index through an exit and transition time analysis of a suitably defined Markov chain, identifying that policies associated with Levy Processes of a heavier tail converge to wider peaks. This phenomenon yields improved stability to perturbations in supervised learning, which we corroborate also manifests in improved performance of policy search, especially when myopic and farsighted incentives are misaligned.
    Scene Transformer: A unified multi-task model for behavior prediction and planning. (arXiv:2106.08417v1 [cs.CV])
    (2 min) Predicting the future motion of multiple agents is necessary for planning in dynamic environments. This task is challenging for autonomous driving since agents (e.g., vehicles and pedestrians) and their associated behaviors may be diverse and influence each other. Most prior work has focused on first predicting independent futures for each agent based on all past motion, and then planning against these independent predictions. However, planning against fixed predictions can suffer from the inability to represent the future interaction possibilities between different agents, leading to sub-optimal planning. In this work, we formulate a model for predicting the behavior of all agents jointly in real-world driving environments in a unified manner. Inspired by recent language modeling approaches, we use a masking strategy as the query to our model, enabling one to invoke a single model to predict agent behavior in many ways, such as potentially conditioned on the goal or full future trajectory of the autonomous vehicle or the behavior of other agents in the environment. Our model architecture fuses heterogeneous world state in a unified Transformer architecture by employing attention across road elements, agent interactions and time steps. We evaluate our approach on autonomous driving datasets for behavior prediction, and achieve state-of-the-art performance. Our work demonstrates that formulating the problem of behavior prediction in a unified architecture with a masking strategy may allow us to have a single model that can perform multiple motion prediction and planning related tasks effectively.
    Real-time Attacks Against Deep Reinforcement Learning Policies. (arXiv:2106.08746v1 [cs.LG])
    (2 min) Recent work has discovered that deep reinforcement learning (DRL) policies are vulnerable to adversarial examples. These attacks mislead the policy of DRL agents by perturbing the state of the environment observed by agents. They are feasible in principle but too slow to fool DRL policies in real time. We propose a new attack to fool DRL policies that is both effective and efficient enough to be mounted in real time. We utilize the Universal Adversarial Perturbation (UAP) method to compute effective perturbations independent of the individual inputs to which they are applied. Via an extensive evaluation using Atari 2600 games, we show that our technique is effective, as it fully degrades the performance of both deterministic and stochastic policies (up to 100%, even when the $l_\infty$ bound on the perturbation is as small as 0.005). We also show that our attack is efficient, incurring an online computational cost of 0.027ms on average. It is faster compared to the response time (0.6ms on average) of agents with different DRL policies, and considerably faster than prior attacks (2.7ms on average). Furthermore, we demonstrate that known defenses are ineffective against universal perturbations. We propose an effective detection technique which can form the basis for robust defenses against attacks based on universal perturbations.
    Counterfactual Graphs for Explainable Classification of Brain Networks. (arXiv:2106.08640v1 [cs.SI])
    (2 min) Training graph classifiers able to distinguish between healthy brains and dysfunctional ones, can help identifying substructures associated to specific cognitive phenotypes. However, the mere predictive power of the graph classifier is of limited interest to the neuroscientists, which have plenty of tools for the diagnosis of specific mental disorders. What matters is the interpretation of the model, as it can provide novel insights and new hypotheses. In this paper we propose \emph{counterfactual graphs} as a way to produce local post-hoc explanations of any black-box graph classifier. Given a graph and a black-box, a counterfactual is a graph which, while having high structural similarity with the original graph, is classified by the black-box in a different class. We propose and empirically compare several strategies for counterfactual graph search. Our experiments against a white-box classifier with known optimal counterfactual, show that our methods, although heuristic, can produce counterfactuals very close to the optimal one. Finally, we show how to use counterfactual graphs to build global explanations correctly capturing the behaviour of different black-box classifiers and providing interesting insights for the neuroscientists.
    Learning Fair Policies in Decentralized Cooperative Multi-Agent Reinforcement Learning. (arXiv:2012.09421v3 [cs.LG] UPDATED)
    (2 min) We consider the problem of learning fair policies in (deep) cooperative multi-agent reinforcement learning (MARL). We formalize it in a principled way as the problem of optimizing a welfare function that explicitly encodes two important aspects of fairness: efficiency and equity. As a solution method, we propose a novel neural network architecture, which is composed of two sub-networks specifically designed for taking into account the two aspects of fairness. In experiments, we demonstrate the importance of the two sub-networks for fair optimization. Our overall approach is general as it can accommodate any (sub)differentiable welfare function. Therefore, it is compatible with various notions of fairness that have been proposed in the literature (e.g., lexicographic maximin, generalized Gini social welfare function, proportional fairness). Our solution method is generic and can be implemented in various MARL settings: centralized training and decentralized execution, or fully decentralized. Finally, we experimentally validate our approach in various domains and show that it can perform much better than previous methods.
    Predicting crop yields with little ground truth: A simple statistical model for in-season forecasting. (arXiv:2106.08720v1 [cs.LG])
    (2 min) We present a fully automated model for in-season crop yield prediction, designed to work where there is a dearth of sub-national "ground truth" information. Our approach relies primarily on satellite data and is characterized by careful feature engineering combined with a simple regression model. As such, it can work almost anywhere in the world. Applying it to 10 different crop-country pairs (5 cereals -- corn, wheat, sorghum, barley and millet, in 2 countries -- Ethiopia and Kenya), we achieve RMSEs of 5\%-10\% for predictions 9 months into the year, and 7\%-14\% for predictions 3 months into the year. The model outputs daily forecasts for the final yield of the current year. It is trained using approximately 4 million data points for each crop-country pair. These consist of: historical country-level annual yields, crop calendars, crop cover, NDVI, temperature, rainfall, and evapotransporation.
    Banker Online Mirror Descent. (arXiv:2106.08943v1 [cs.LG])
    (2 min) We propose Banker-OMD, a novel framework generalizing the classical Online Mirror Descent (OMD) technique in online learning algorithm design. Banker-OMD allows algorithms to robustly handle delayed feedback, and offers a general methodology for achieving $\tilde{O}(\sqrt{T} + \sqrt{D})$-style regret bounds in various delayed-feedback online learning tasks, where $T$ is the time horizon length and $D$ is the total feedback delay. We demonstrate the power of Banker-OMD with applications to three important bandit scenarios with delayed feedback, including delayed adversarial Multi-armed bandits (MAB), delayed adversarial linear bandits, and a novel delayed best-of-both-worlds MAB setting. Banker-OMD achieves nearly-optimal performance in all the three settings. In particular, it leads to the first delayed adversarial linear bandit algorithm achieving $\tilde{O}(\text{poly}(n)(\sqrt{T} + \sqrt{D}))$ regret.
    Breaking The Dimension Dependence in Sparse Distribution Estimation under Communication Constraints. (arXiv:2106.08597v1 [stat.ML])
    (2 min) We consider the problem of estimating a $d$-dimensional $s$-sparse discrete distribution from its samples observed under a $b$-bit communication constraint. The best-known previous result on $\ell_2$ estimation error for this problem is $O\left( \frac{s\log\left( {d}/{s}\right)}{n2^b}\right)$. Surprisingly, we show that when sample size $n$ exceeds a minimum threshold $n^*(s, d, b)$, we can achieve an $\ell_2$ estimation error of $O\left( \frac{s}{n2^b}\right)$. This implies that when $n>n^*(s, d, b)$ the convergence rate does not depend on the ambient dimension $d$ and is the same as knowing the support of the distribution beforehand. We next ask the question: ``what is the minimum $n^*(s, d, b)$ that allows dimension-free convergence?''. To upper bound $n^*(s, d, b)$, we develop novel localization schemes to accurately and efficiently localize the unknown support. For the non-interactive setting, we show that $n^*(s, d, b) = O\left( \min \left( {d^2\log^2 d}/{2^b}, {s^4\log^2 d}/{2^b}\right) \right)$. Moreover, we connect the problem with non-adaptive group testing and obtain a polynomial-time estimation scheme when $n = \tilde{\Omega}\left({s^4\log^4 d}/{2^b}\right)$. This group testing based scheme is adaptive to the sparsity parameter $s$, and hence can be applied without knowing it. For the interactive setting, we propose a novel tree-based estimation scheme and show that the minimum sample-size needed to achieve dimension-free convergence can be further reduced to $n^*(s, d, b) = \tilde{O}\left( {s^2\log^2 d}/{2^b} \right)$.
    The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication. (arXiv:1909.05350v2 [cs.LG] UPDATED)
    (2 min) We analyze (stochastic) gradient descent (SGD) with delayed updates on smooth quasi-convex and non-convex functions and derive concise, non-asymptotic, convergence rates. We show that the rate of convergence in all cases consists of two terms: (i) a stochastic term which is not affected by the delay, and (ii) a higher order deterministic term which is only linearly slowed down by the delay. Thus, in the presence of noise, the effects of the delay become negligible after a few iterations and the algorithm converges at the same optimal rate as standard SGD. This result extends a line of research that showed similar results in the asymptotic regime or for strongly-convex quadratic functions only. We further show similar results for SGD with more intricate form of delayed gradients---compressed gradients under error compensation and for local~SGD where multiple workers perform local steps before communicating with each other. In all of these settings, we improve upon the best known rates. These results show that SGD is robust to compressed and/or delayed stochastic gradient updates. This is in particular important for distributed parallel implementations, where asynchronous and communication efficient methods are the key to achieve linear speedups for optimization with multiple devices.
    Optimal Accounting of Differential Privacy via Characteristic Function. (arXiv:2106.08567v1 [cs.LG])
    (2 min) Characterizing the privacy degradation over compositions, i.e., privacy accounting, is a fundamental topic in differential privacy (DP) with many applications to differentially private machine learning and federated learning. We propose a unification of recent advances (Renyi DP, privacy profiles, $f$-DP and the PLD formalism) via the characteristic function ($\phi$-function) of a certain ``worst-case'' privacy loss random variable. We show that our approach allows natural adaptive composition like Renyi DP, provides exactly tight privacy accounting like PLD, and can be (often losslessly) converted to privacy profile and $f$-DP, thus providing $(\epsilon,\delta)$-DP guarantees and interpretable tradeoff functions. Algorithmically, we propose an analytical Fourier accountant that represents the complex logarithm of $\phi$-functions symbolically and uses Gaussian quadrature for numerical computation. On several popular DP mechanisms and their subsampled counterparts, we demonstrate the flexibility and tightness of our approach in theory and experiments.
    Dynamically Grown Generative Adversarial Networks. (arXiv:2106.08505v1 [cs.CV])
    (2 min) Recent work introduced progressive network growing as a promising way to ease the training for large GANs, but the model design and architecture-growing strategy still remain under-explored and needs manual design for different image data. In this paper, we propose a method to dynamically grow a GAN during training, optimizing the network architecture and its parameters together with automation. The method embeds architecture search techniques as an interleaving step with gradient-based training to periodically seek the optimal architecture-growing strategy for the generator and discriminator. It enjoys the benefits of both eased training because of progressive growing and improved performance because of broader architecture design space. Experimental results demonstrate new state-of-the-art of image generation. Observations in the search procedure also provide constructive insights into the GAN model design such as generator-discriminator balance and convolutional layer choices.
    Towards Adversarial Robustness via Transductive Learning. (arXiv:2106.08387v1 [cs.LG])
    (2 min) There has been emerging interest to use transductive learning for adversarial robustness (Goldwasser et al., NeurIPS 2020; Wu et al., ICML 2020). Compared to traditional "test-time" defenses, these defense mechanisms "dynamically retrain" the model based on test time input via transductive learning; and theoretically, attacking these defenses boils down to bilevel optimization, which seems to raise the difficulty for adaptive attacks. In this paper, we first formalize and analyze modeling aspects of transductive robustness. Then, we propose the principle of attacking model space for solving bilevel attack objectives, and present an instantiation of the principle which breaks previous transductive defenses. These attacks thus point to significant difficulties in the use of transductive learning to improve adversarial robustness. To this end, we present new theoretical and empirical evidence in support of the utility of transductive learning.
    Data Augmentation for Graph Convolutional Network on Semi-Supervised Classification. (arXiv:2106.08848v1 [cs.LG])
    (2 min) Data augmentation aims to generate new and synthetic features from the original data, which can identify a better representation of data and improve the performance and generalizability of downstream tasks. However, data augmentation for graph-based models remains a challenging problem, as graph data is more complex than traditional data, which consists of two features with different properties: graph topology and node attributes. In this paper, we study the problem of graph data augmentation for Graph Convolutional Network (GCN) in the context of improving the node embeddings for semi-supervised node classification. Specifically, we conduct cosine similarity based cross operation on the original features to create new graph features, including new node attributes and new graph topologies, and we combine them as new pairwise inputs for specific GCNs. Then, we propose an attentional integrating model to weighted sum the hidden node embeddings encoded by these GCNs into the final node embeddings. We also conduct a disparity constraint on these hidden node embeddings when training to ensure that non-redundant information is captured from different features. Experimental results on five real-world datasets show that our method improves the classification accuracy with a clear margin (+2.5% - +84.2%) than the original GCN model.
    Clustering Mixture Models in Almost-Linear Time via List-Decodable Mean Estimation. (arXiv:2106.08537v1 [cs.DS])
    (2 min) We study the problem of list-decodable mean estimation, where an adversary can corrupt a majority of the dataset. Specifically, we are given a set $T$ of $n$ points in $\mathbb{R}^d$ and a parameter $0< \alpha <\frac 1 2$ such that an $\alpha$-fraction of the points in $T$ are i.i.d. samples from a well-behaved distribution $\mathcal{D}$ and the remaining $(1-\alpha)$-fraction of the points are arbitrary. The goal is to output a small list of vectors at least one of which is close to the mean of $\mathcal{D}$. As our main contribution, we develop new algorithms for list-decodable mean estimation, achieving nearly-optimal statistical guarantees, with running time $n^{1 + o(1)} d$. All prior algorithms for this problem had additional polynomial factors in $\frac 1 \alpha$. As a corollary, we obtain the first almost-linear time algorithms for clustering mixtures of $k$ separated well-behaved distributions, nearly-matching the statistical guarantees of spectral methods. Prior clustering algorithms inherently relied on an application of $k$-PCA, thereby incurring runtimes of $\Omega(n d k)$. This marks the first runtime improvement for this basic statistical problem in nearly two decades. The starting point of our approach is a novel and simpler near-linear time robust mean estimation algorithm in the $\alpha \to 1$ regime, based on a one-shot matrix multiplicative weights-inspired potential decrease. We crucially leverage this new algorithmic framework in the context of the iterative multi-filtering technique of Diakonikolas et. al. '18, '20, providing a method to simultaneously cluster and downsample points using one-dimensional projections --- thus, bypassing the $k$-PCA subroutines required by prior algorithms.
    Generating Tertiary Protein Structures via an Interpretative Variational Autoencoder. (arXiv:2004.07119v2 [q-bio.BM] UPDATED)
    (2 min) Much scientific enquiry across disciplines is founded upon a mechanistic treatment of dynamic systems that ties form to function. A highly visible instance of this is in molecular biology, where an important goal is to determine functionally-relevant forms/structures that a protein molecule employs to interact with molecular partners in the living cell. This goal is typically pursued under the umbrella of stochastic optimization with algorithms that optimize a scoring function. Research repeatedly shows that current scoring function, though steadily improving, correlate weakly with molecular activity. Inspired by recent momentum in generative deep learning, this paper proposes and evaluates an alternative approach to generating functionally-relevant three-dimensional structures of a protein. Though typically deep generative models struggle with highly-structured data, the work presented here circumvents this challenge via graph-generative models. A comprehensive evaluation of several deep architectures shows the promise of generative models in directly revealing the latent space for sampling novel tertiary structures, as well as in highlighting axes/factors that carry structural meaning and open the black box often associated with deep models. The work presented here is a first step towards interpretative, deep generative models becoming viable and informative complementary approaches to protein structure prediction.
    Self-supervised GANs with Label Augmentation. (arXiv:2106.08601v1 [cs.LG])
    (2 min) Recently, transformation-based self-supervised learning has been applied to generative adversarial networks (GANs) to mitigate the catastrophic forgetting problem of discriminator by learning stable representations. However, the separate self-supervised tasks in existing self-supervised GANs cause an inconsistent goal with generative modeling due to the learning of the generator from their generator distribution-agnostic classifiers. To address this issue, we propose a novel self-supervised GANs framework with label augmentation, i.e., augmenting the GAN labels (real or fake) with the self-supervised pseudo-labels. In particular, the discriminator and the self-supervised classifier are unified to learn a single task that predicts the augmented label such that the discriminator/classifier is aware of the generator distribution, while the generator tries to confuse the discriminator/classifier by optimizing the discrepancy between the transformed real and generated distributions. Theoretically, we prove that the generator, at the equilibrium point, converges to replicate the data distribution. Empirically, we demonstrate that the proposed method significantly outperforms competitive baselines on both generative modeling and representation learning across benchmark datasets.
    Deriving Autism Spectrum Disorder Functional Networks from RS-FMRI Data using Group ICA and Dictionary Learning. (arXiv:2106.09000v1 [q-bio.NC])
    (2 min) The objective of this study is to derive functional networks for the autism spectrum disorder (ASD) population using the group ICA and dictionary learning model together and to classify ASD and typically developing (TD) participants using the functional connectivity calculated from the derived functional networks. In our experiments, the ASD functional networks were derived from resting-state functional magnetic resonance imaging (rs-fMRI) data. We downloaded a total of 120 training samples, including 58 ASD and 62 TD participants, which were obtained from the public repository: Autism Brain Imaging Data Exchange I (ABIDE I). Our methodology and results have five main parts. First, we utilize a group ICA model to extract functional networks from the ASD group and rank the top 20 regions of interest (ROIs). Second, we utilize a dictionary learning model to extract functional networks from the ASD group and rank the top 20 ROIs. Third, we merged the 40 selected ROIs from the two models together as the ASD functional networks. Fourth, we generate three corresponding masks based on the 20 selected ROIs from group ICA, the 20 ROIs selected from dictionary learning, and the 40 combined ROIs selected from both. Finally, we extract ROIs for all training samples using the above three masks, and the calculated functional connectivity was used as features for ASD and TD classification. The classification results showed that the functional networks derived from ICA and dictionary learning together outperform those derived from a single ICA model or a single dictionary learning model.
    Mobile Augmented Reality: User Interfaces, Frameworks, and Intelligence. (arXiv:2106.08710v1 [cs.HC])
    (2 min) Mobile Augmented Reality (MAR) integrates computer-generated virtual objects with physical environments for mobile devices. MAR systems enable users to interact with MAR devices, such as smartphones and head-worn wearables, and performs seamless transitions from the physical world to a mixed world with digital entities. These MAR systems support user experiences by using MAR devices to provide universal accessibility to digital contents. Over the past 20 years, a number of MAR systems have been developed, however, the studies and design of MAR frameworks have not yet been systematically reviewed from the perspective of user-centric design. This article presents the first effort of surveying existing MAR frameworks (count: 37) and further discusses the latest studies on MAR through a top-down approach: 1) MAR applications; 2) MAR visualisation techniques adaptive to user mobility and contexts; 3) systematic evaluation of MAR frameworks including supported platforms and corresponding features such as tracking, feature extraction plus sensing capabilities; and 4) underlying machine learning approaches supporting intelligent operations within MAR systems. Finally, we summarise the development of emerging research fields, current state-of-the-art, and discuss the important open challenges and possible theoretical and technical directions. This survey aims to benefit both researchers and MAR system developers alike.
    Maxmin-Fair Ranking: Individual Fairness under Group-Fairness Constraints. (arXiv:2106.08652v1 [cs.LG])
    (2 min) We study a novel problem of fairness in ranking aimed at minimizing the amount of individual unfairness introduced when enforcing group-fairness constraints. Our proposal is rooted in the distributional maxmin fairness theory, which uses randomization to maximize the expected satisfaction of the worst-off individuals. We devise an exact polynomial-time algorithm to find maxmin-fair distributions of general search problems (including, but not limited to, ranking), and show that our algorithm can produce rankings which, while satisfying the given group-fairness constraints, ensure that the maximum possible value is brought to individuals.
    COVID-19 Vaccines: Characterizing Misinformation Campaigns and Vaccine Hesitancy on Twitter. (arXiv:2106.08423v1 [cs.SI])
    (2 min) Vaccine hesitancy and misinformation on social media has increased concerns about COVID-19 vaccine uptake required to achieve herd immunity and overcome the pandemic. However anti-science and political misinformation and conspiracies have been rampant throughout the pandemic. For COVID-19 vaccines, we investigate misinformation and conspiracy campaigns and their characteristic behaviours. We identify whether coordinated efforts are used to promote misinformation in vaccine related discussions, and find accounts coordinately promoting a `Great Reset' conspiracy group promoting vaccine related misinformation and strong anti-vaccine and anti-social messages such as boycott vaccine passports, no lock-downs and masks. We characterize other misinformation communities from the information diffusion structure, and study the large anti-vaccine misinformation community and smaller anti-vaccine communities, including a far-right anti-vaccine conspiracy group. In comparison with the mainstream and health news, left-leaning group, which are more pro-vaccine, the right-leaning group is influenced more by the anti-vaccine and far-right misinformation/conspiracy communities. The misinformation communities are more vocal either specific to the vaccine discussion or political discussion, and we find other differences in the characteristic behaviours of different communities. Lastly, we investigate misinformation narratives and tactics of information distortion that can increase vaccine hesitancy, using topic modeling and comparison with reported vaccine side-effects (VAERS) finding rarer side-effects are more frequently discussed on social media.
    Silhouettes and quasi residual plots for neural nets and tree-based classifiers. (arXiv:2106.08814v1 [stat.ML])
    (2 min) Classification by neural nets and by tree-based methods are powerful tools of machine learning. There exist interesting visualizations of the inner workings of these and other classifiers. Here we pursue a different goal, which is to visualize the cases being classified, either in training data or in test data. An important aspect is whether a case has been classified to its given class (label) or whether the classifier wants to assign it to different class. This is reflected in the (conditional and posterior) probability of the alternative class (PAC). A high PAC indicates label bias, i.e. the possibility that the case was mislabeled. The PAC is used to construct a silhouette plot which is similar in spirit to the silhouette plot for cluster analysis (Rousseeuw, 1987). The average silhouette width can be used to compare different classifications of the same dataset. We will also draw quasi residual plots of the PAC versus a data feature, which may lead to more insight in the data. One of these data features is how far each case lies from its given class. The graphical displays are illustrated and interpreted on benchmark data sets containing images, mixed features, and tweets.
    Analysis and Optimisation of Bellman Residual Errors with Neural Function Approximation. (arXiv:2106.08774v1 [cs.LG])
    (2 min) Recent development of Deep Reinforcement Learning has demonstrated superior performance of neural networks in solving challenging problems with large or even continuous state spaces. One specific approach is to deploy neural networks to approximate value functions by minimising the Mean Squared Bellman Error function. Despite great successes of Deep Reinforcement Learning, development of reliable and efficient numerical algorithms to minimise the Bellman Error is still of great scientific interest and practical demand. Such a challenge is partially due to the underlying optimisation problem being highly non-convex or using incorrect gradient information as done in Semi-Gradient algorithms. In this work, we analyse the Mean Squared Bellman Error from a smooth optimisation perspective combined with a Residual Gradient formulation. Our contribution is two-fold. First, we analyse critical points of the error function and provide technical insights on the optimisation procure and design choices for neural networks. When the existence of global minima is assumed and the objective fulfils certain conditions we can eliminate suboptimal local minima when using over-parametrised neural networks. We can construct an efficient Approximate Newton's algorithm based on our analysis and confirm theoretical properties of this algorithm such as being locally quadratically convergent to a global minimum numerically. Second, we demonstrate feasibility and generalisation capabilities of the proposed algorithm empirically using continuous control problems and provide a numerical verification of our critical point analysis. We outline the short coming of Semi-Gradients. To benefit from an approximate Newton's algorithm complete derivatives of the Mean Squared Bellman error must be considered during training.
    Bridge Networks. (arXiv:2106.08446v1 [cs.LG])
    (2 min) Despite rapid progress, current deep learning methods face a number of critical challenges. These include high energy consumption, catastrophic forgetting, dependance on global losses, and an inability to reason symbolically. By combining concepts from information bottleneck theory and vector-symbolic architectures, we propose and implement a novel information processing architecture, the 'Bridge network.' We show this architecture provides unique advantages which can address the problem of global losses and catastrophic forgetting. Furthermore, we argue that it provides a further basis for increasing energy efficiency of execution and the ability to reason symbolically.
    Exploring the Loss Landscape in Neural Architecture Search. (arXiv:2005.02960v3 [cs.LG] UPDATED)
    (2 min) Neural architecture search (NAS) has seen a steep rise in interest over the last few years. Many algorithms for NAS consist of searching through a space of architectures by iteratively choosing an architecture, evaluating its performance by training it, and using all prior evaluations to come up with the next choice. The evaluation step is noisy - the final accuracy varies based on the random initialization of the weights. Prior work has focused on devising new search algorithms to handle this noise, rather than quantifying or understanding the level of noise in architecture evaluations. In this work, we show that (1) the simplest hill-climbing algorithm is a powerful baseline for NAS, and (2), when the noise in popular NAS benchmark datasets is reduced to a minimum, hill-climbing to outperforms many popular state-of-the-art algorithms. We further back up this observation by showing that the number of local minima is substantially reduced as the noise decreases, and by giving a theoretical characterization of the performance of local search in NAS. Based on our findings, for NAS research we suggest (1) using local search as a baseline, and (2) denoising the training pipeline when possible.
    Mining Interpretable Spatio-temporal Logic Properties for Spatially Distributed Systems. (arXiv:2106.08548v1 [cs.LG])
    (2 min) The Internet-of-Things, complex sensor networks, multi-agent cyber-physical systems are all examples of spatially distributed systems that continuously evolve in time. Such systems generate huge amounts of spatio-temporal data, and system designers are often interested in analyzing and discovering structure within the data. There has been considerable interest in learning causal and logical properties of temporal data using logics such as Signal Temporal Logic (STL); however, there is limited work on discovering such relations on spatio-temporal data. We propose the first set of algorithms for unsupervised learning for spatio-temporal data. Our method does automatic feature extraction from the spatio-temporal data by projecting it onto the parameter space of a parametric spatio-temporal reach and escape logic (PSTREL). We propose an agglomerative hierarchical clustering technique that guarantees that each cluster satisfies a distinct STREL formula. We show that our method generates STREL formulas of bounded description complexity using a novel decision-tree approach which generalizes previous unsupervised learning techniques for Signal Temporal Logic. We demonstrate the effectiveness of our approach on case studies from diverse domains such as urban transportation, epidemiology, green infrastructure, and air quality monitoring.
    Directed Graph Embeddings in Pseudo-Riemannian Manifolds. (arXiv:2106.08678v1 [stat.ML])
    (2 min) The inductive biases of graph representation learning algorithms are often encoded in the background geometry of their embedding space. In this paper, we show that general directed graphs can be effectively represented by an embedding model that combines three components: a pseudo-Riemannian metric structure, a non-trivial global topology, and a unique likelihood function that explicitly incorporates a preferred direction in embedding space. We demonstrate the representational capabilities of this method by applying it to the task of link prediction on a series of synthetic and real directed graphs from natural language applications and biology. In particular, we show that low-dimensional cylindrical Minkowski and anti-de Sitter spacetimes can produce equal or better graph representations than curved Riemannian manifolds of higher dimensions.
    Leveraging Probabilistic Circuits for Nonparametric Multi-Output Regression. (arXiv:2106.08687v1 [cs.LG])
    (2 min) Inspired by recent advances in the field of expert-based approximations of Gaussian processes (GPs), we present an expert-based approach to large-scale multi-output regression using single-output GP experts. Employing a deeply structured mixture of single-output GPs encoded via a probabilistic circuit allows us to capture correlations between multiple output dimensions accurately. By recursively partitioning the covariate space and the output space, posterior inference in our model reduces to inference on single-output GP experts, which only need to be conditioned on a small subset of the observations. We show that inference can be performed exactly and efficiently in our model, that it can capture correlations between output dimensions and, hence, often outperforms approaches that do not incorporate inter-output correlations, as demonstrated on several data sets in terms of the negative log predictive density.
    A Multi-Layered Approach for Measuring the Simulation-to-Reality Gap of Radar Perception for Autonomous Driving. (arXiv:2106.08372v1 [cs.RO])
    (2 min) With the increasing safety validation requirements for the release of a self-driving car, alternative approaches, such as simulation-based testing, are emerging in addition to conventional real-world testing. In order to rely on virtual tests the employed sensor models have to be validated. For this reason, it is necessary to quantify the discrepancy between simulation and reality in order to determine whether a certain fidelity is sufficient for a desired intended use. There exists no sound method to measure this simulation-to-reality gap of radar perception for autonomous driving. We address this problem by introducing a multi-layered evaluation approach, which consists of a combination of an explicit and an implicit sensor model evaluation. The former directly evaluates the realism of the synthetically generated sensor data, while the latter refers to an evaluation of a downstream target application. In order to demonstrate the method, we evaluated the fidelity of three typical radar model types (ideal, data-driven, ray tracing-based) and their applicability for virtually testing radar-based multi-object tracking. We have shown the effectiveness of the proposed approach in terms of providing an in-depth sensor model assessment that renders existing disparities visible and enables a realistic estimation of the overall model fidelity across different scenarios.
    Correlation Clustering in Constant Many Parallel Rounds. (arXiv:2106.08448v1 [cs.DS])
    (2 min) Correlation clustering is a central topic in unsupervised learning, with many applications in ML and data mining. In correlation clustering, one receives as input a signed graph and the goal is to partition it to minimize the number of disagreements. In this work we propose a massively parallel computation (MPC) algorithm for this problem that is considerably faster than prior work. In particular, our algorithm uses machines with memory sublinear in the number of nodes in the graph and returns a constant approximation while running only for a constant number of rounds. To the best of our knowledge, our algorithm is the first that can provably approximate a clustering problem on graphs using only a constant number of MPC rounds in the sublinear memory regime. We complement our analysis with an experimental analysis of our techniques.
    Source Separation-based Data Augmentation for Improved Joint Beat and Downbeat Tracking. (arXiv:2106.08703v1 [cs.SD])
    (2 min) Due to advances in deep learning, the performance of automatic beat and downbeat tracking in musical audio signals has seen great improvement in recent years. In training such deep learning based models, data augmentation has been found an important technique. However, existing data augmentation methods for this task mainly target at balancing the distribution of the training data with respect to their tempo. In this paper, we investigate another approach for data augmentation, to account for the composition of the training data in terms of the percussive and non-percussive sound sources. Specifically, we propose to employ a blind drum separation model to segregate the drum and non-drum sounds from each training audio signal, filtering out training signals that are drumless, and then use the obtained drum and non-drum stems to augment the training data. We report experiments on four completely unseen test sets, validating the effectiveness of the proposed method, and accordingly the importance of drum sound composition in the training data for beat and downbeat tracking.
    Costs and Benefits of Wasserstein Fair Regression. (arXiv:2106.08812v1 [cs.LG])
    (2 min) Real-world applications of machine learning tools in high-stakes domains are often regulated to be fair, in the sense that the predicted target should satisfy some quantitative notion of parity with respect to a protected attribute. However, the exact tradeoff between fairness and accuracy with a real-valued target is not clear. In this paper, we characterize the inherent tradeoff between statistical parity and accuracy in the regression setting by providing a lower bound on the error of any fair regressor. Our lower bound is sharp, algorithm-independent, and admits a simple interpretation: when the moments of the target differ between groups, any fair algorithm has to make a large error on at least one of the groups. We further extend this result to give a lower bound on the joint error of any (approximately) fair algorithm, using the Wasserstein distance to measure the quality of the approximation. On the upside, we establish the first connection between individual fairness, accuracy parity, and the Wasserstein distance by showing that if a regressor is individually fair, it also approximately verifies the accuracy parity, where the gap is given by the Wasserstein distance between the two groups. Inspired by our theoretical results, we develop a practical algorithm for fair regression through the lens of representation learning, and conduct experiments on a real-world dataset to corroborate our findings.
    Non-PSD Matrix Sketching with Applications to Regression and Optimization. (arXiv:2106.08544v1 [cs.LG])
    (2 min) A variety of dimensionality reduction techniques have been applied for computations involving large matrices. The underlying matrix is randomly compressed into a smaller one, while approximately retaining many of its original properties. As a result, much of the expensive computation can be performed on the small matrix. The sketching of positive semidefinite (PSD) matrices is well understood, but there are many applications where the related matrices are not PSD, including Hessian matrices in non-convex optimization and covariance matrices in regression applications involving complex numbers. In this paper, we present novel dimensionality reduction methods for non-PSD matrices, as well as their ``square-roots", which involve matrices with complex entries. We show how these techniques can be used for multiple downstream tasks. In particular, we show how to use the proposed matrix sketching techniques for both convex and non-convex optimization, $\ell_p$-regression for every $1 \leq p \leq \infty$, and vector-matrix-vector queries.
    Circa: Stochastic ReLUs for Private Deep Learning. (arXiv:2106.08475v1 [cs.LG])
    (2 min) The simultaneous rise of machine learning as a service and concerns over user privacy have increasingly motivated the need for private inference (PI). While recent work demonstrates PI is possible using cryptographic primitives, the computational overheads render it impractical. The community is largely unprepared to address these overheads, as the source of slowdown in PI stems from the ReLU operator whereas optimizations for plaintext inference focus on optimizing FLOPs. In this paper we re-think the ReLU computation and propose optimizations for PI tailored to properties of neural networks. Specifically, we reformulate ReLU as an approximate sign test and introduce a novel truncation method for the sign test that significantly reduces the cost per ReLU. These optimizations result in a specific type of stochastic ReLU. The key observation is that the stochastic fault behavior is well suited for the fault-tolerant properties of neural network inference. Thus, we provide significant savings without impacting accuracy. We collectively call the optimizations Circa and demonstrate improvements of up to 4.7x storage and 3x runtime over baseline implementations; we further show that Circa can be used on top of recent PI optimizations to obtain 1.8x additional speedup.
    Spoofing Generalization: When Can't You Trust Proprietary Models?. (arXiv:2106.08393v1 [cs.LG])
    (2 min) In this work, we study the computational complexity of determining whether a machine learning model that perfectly fits the training data will generalizes to unseen data. In particular, we study the power of a malicious agent whose goal is to construct a model g that fits its training data and nothing else, but is indistinguishable from an accurate model f. We say that g strongly spoofs f if no polynomial-time algorithm can tell them apart. If instead we restrict to algorithms that run in $n^c$ time for some fixed $c$, we say that g c-weakly spoofs f. Our main results are 1. Under cryptographic assumptions, strong spoofing is possible and 2. For any c> 0, c-weak spoofing is possible unconditionally While the assumption of a malicious agent is an extreme scenario (hopefully companies training large models are not malicious), we believe that it sheds light on the inherent difficulties of blindly trusting large proprietary models or data.
    CODA: Constructivism Learning for Instance-Dependent Dropout Architecture Construction. (arXiv:2106.08444v1 [cs.LG])
    (2 min) Dropout is attracting intensive research interest in deep learning as an efficient approach to prevent overfitting. Recently incorporating structural information when deciding which units to drop out produced promising results comparing to methods that ignore the structural information. However, a major issue of the existing work is that it failed to differentiate among instances when constructing the dropout architecture. This can be a significant deficiency for many applications. To solve this issue, we propose Constructivism learning for instance-dependent Dropout Architecture (CODA), which is inspired from a philosophical theory, constructivism learning. Specially, based on the theory we have designed a better drop out technique, Uniform Process Mixture Models, using a Bayesian nonparametric method Uniform process. We have evaluated our proposed method on 5 real-world datasets and compared the performance with other state-of-the-art dropout techniques. The experimental results demonstrated the effectiveness of CODA.
    To Raise or Not To Raise: The Autonomous Learning Rate Question. (arXiv:2106.08767v1 [cs.LG])
    (2 min) There is a parameter ubiquitous throughout the deep learning world: learning rate. There is likewise a ubiquitous question: what should that learning rate be? The true answer to this question is often tedious and time consuming to obtain, and a great deal of arcane knowledge has accumulated in recent years over how to pick and modify learning rates to achieve optimal training performance. Moreover, the long hours spent carefully crafting the perfect learning rate can come to nothing the moment your network architecture, optimizer, dataset, or initial conditions change ever so slightly. But it need not be this way. We propose a new answer to the great learning rate question: the Autonomous Learning Rate Controller. Find it at https://github.com/fastestimator/ARC
    Lorenz System State Stability Identification using Neural Networks. (arXiv:2106.08489v1 [math.DS])
    (2 min) Nonlinear dynamical systems such as Lorenz63 equations are known to be chaotic in nature and sensitive to initial conditions. As a result, a small perturbation in the initial conditions results in deviation in state trajectory after a few time steps. The algorithms and computational resources needed to accurately identify the system states vary depending on whether the solution is in transition region or not. We refer to the transition and non-transition regions as unstable and stable regions respectively. We label a system state to be stable if it's immediate past and future states reside in the same regime. However, at a given time step we don't have the prior knowledge about whether system is in stable or unstable region. In this paper, we develop and train a feed forward (multi-layer perceptron) Neural Network to classify the system states of a Lorenz system as stable and unstable. We pose this task as a supervised learning problem where we train the neural network on Lorenz system which have states labeled as stable or unstable. We then test the ability of the neural network models to identify the stable and unstable states on a different Lorenz system that is generated using different initial conditions. We also evaluate the classification performance in the mismatched case i.e., when the initial conditions for training and validation data are sampled from different intervals. We show that certain normalization schemes can greatly improve the performance of neural networks in especially these mismatched scenarios. The classification framework developed in the paper can be a preprocessor for a larger context of sequential decision making framework where the decision making is performed based on observed stable or unstable states.
    Distilling Self-Knowledge From Contrastive Links to Classify Graph Nodes Without Passing Messages. (arXiv:2106.08541v1 [cs.LG])
    (2 min) Nowadays, Graph Neural Networks (GNNs) following the Message Passing paradigm become the dominant way to learn on graphic data. Models in this paradigm have to spend extra space to look up adjacent nodes with adjacency matrices and extra time to aggregate multiple messages from adjacent nodes. To address this issue, we develop a method called LinkDist that distils self-knowledge from connected node pairs into a Multi-Layer Perceptron (MLP) without the need to aggregate messages. Experiment with 8 real-world datasets shows the MLP derived from LinkDist can predict the label of a node without knowing its adjacencies but achieve comparable accuracy against GNNs in the contexts of semi- and full-supervised node classification. Moreover, LinkDist benefits from its Non-Message Passing paradigm that we can also distil self-knowledge from arbitrarily sampled node pairs in a contrastive way to further boost the performance of LinkDist.
    Explaining the Behavior of Black-Box Prediction Algorithms with Causal Learning. (arXiv:2006.02482v3 [cs.LG] UPDATED)
    (2 min) We propose to explain the behavior of black-box prediction methods (e.g., deep neural networks trained on image pixel data) using causal graphical models. Specifically, we explore learning the structure of a causal graph where the nodes represent prediction outcomes along with a set of macro-level "interpretable" features, while allowing for arbitrary unmeasured confounding among these variables. The resulting graph may indicate which of the interpretable features, if any, are possible causes of the prediction outcome and which may be merely associated with prediction outcomes due to confounding. The approach is motivated by a counterfactual theory of causal explanation wherein good explanations point to factors that are "difference-makers" in an interventionist sense. The resulting analysis may be useful in algorithm auditing and evaluation, by identifying features which make a causal difference to the algorithm's output.
    Cascading Modular Network (CAM-Net) for Multimodal Image Synthesis. (arXiv:2106.09015v1 [cs.CV])
    (2 min) Deep generative models such as GANs have driven impressive advances in conditional image synthesis in recent years. A persistent challenge has been to generate diverse versions of output images from the same input image, due to the problem of mode collapse: because only one ground truth output image is given per input image, only one mode of the conditional distribution is modelled. In this paper, we focus on this problem of multimodal conditional image synthesis and build on the recently proposed technique of Implicit Maximum Likelihood Estimation (IMLE). Prior IMLE-based methods required different architectures for different tasks, which limit their applicability, and were lacking in fine details in the generated images. We propose CAM-Net, a unified architecture that can be applied to a broad range of tasks. Additionally, it is capable of generating convincing high frequency details, achieving a reduction of the Frechet Inception Distance (FID) by up to 45.3% compared to the baseline.
    Practical and Private (Deep) Learning without Sampling or Shuffling. (arXiv:2103.00039v2 [cs.CR] UPDATED)
    (2 min) We consider training models with differential privacy (DP) using mini-batch gradients. The existing state-of-the-art, Differentially Private Stochastic Gradient Descent (DP-SGD), requires privacy amplification by sampling or shuffling to obtain the best privacy/accuracy/computation trade-offs. Unfortunately, the precise requirements on exact sampling and shuffling can be hard to obtain in important practical scenarios, particularly federated learning (FL). We design and analyze a DP variant of Follow-The-Regularized-Leader (DP-FTRL) that compares favorably (both theoretically and empirically) to amplified DP-SGD, while allowing for much more flexible data access patterns. DP-FTRL does not use any form of privacy amplification. The code is available at https://github.com/google-research/federated/tree/master/dp_ftrl and https://github.com/google-research/DP-FTRL .
    Gaze Preserving CycleGANs for Eyeglass Removal & Persistent Gaze Estimation. (arXiv:2002.02077v6 [cs.CV] UPDATED)
    (2 min) A driver's gaze is critical for determining their attention, state, situational awareness, and readiness to take over control from partially automated vehicles. Estimating the gaze direction is the most obvious way to gauge a driver's state under ideal conditions when limited to using non-intrusive imaging sensors. Unfortunately, the vehicular environment introduces a variety of challenges that are usually unaccounted for - harsh illumination, nighttime conditions, and reflective eyeglasses. Relying on head pose alone under such conditions can prove to be unreliable and erroneous. In this study, we offer solutions to address these problems encountered in the real world. To solve issues with lighting, we demonstrate that using an infrared camera with suitable equalization and normalization suffices. To handle eyeglasses and their corresponding artifacts, we adopt image-to-image translation using generative adversarial networks to pre-process images prior to gaze estimation. Our proposed Gaze Preserving CycleGAN (GPCycleGAN) is trained to preserve the driver's gaze while removing potential eyeglasses from face images. GPCycleGAN is based on the well-known CycleGAN approach - with the addition of a gaze classifier and a gaze consistency loss for additional supervision. Our approach exhibits improved performance, interpretability, robustness and superior qualitative results on challenging real-world datasets.
    Predictive Modeling of Hospital Readmission: Challenges and Solutions. (arXiv:2106.08488v1 [cs.LG])
    (2 min) Hospital readmission prediction is a study to learn models from historical medical data to predict probability of a patient returning to hospital in a certain period, 30 or 90 days, after the discharge. The motivation is to help health providers deliver better treatment and post-discharge strategies, lower the hospital readmission rate, and eventually reduce the medical costs. Due to inherent complexity of diseases and healthcare ecosystems, modeling hospital readmission is facing many challenges. By now, a variety of methods have been developed, but existing literature fails to deliver a complete picture to answer some fundamental questions, such as what are the main challenges and solutions in modeling hospital readmission; what are typical features/models used for readmission prediction; how to achieve meaningful and transparent predictions for decision making; and what are possible conflicts when deploying predictive approaches for real-world usages. In this paper, we systematically review computational models for hospital readmission prediction, and propose a taxonomy of challenges featuring four main categories: (1) data variety and complexity; (2) data imbalance, locality and privacy; (3) model interpretability; and (4) model implementation. The review summarizes methods in each category, and highlights technical solutions proposed to address the challenges. In addition, a review of datasets and resources available for hospital readmission modeling also provides firsthand materials to support researchers and practitioners to design new approaches for effective and efficient hospital readmission prediction.
    KALE Flow: A Relaxed KL Gradient Flow for Probabilities with Disjoint Support. (arXiv:2106.08929v1 [stat.ML])
    (2 min) We study the gradient flow for a relaxed approximation to the Kullback-Leibler (KL) divergence between a moving source and a fixed target distribution. This approximation, termed the KALE (KL approximate lower-bound estimator), solves a regularized version of the Fenchel dual problem defining the KL over a restricted class of functions. When using a Reproducing Kernel Hilbert Space (RKHS) to define the function class, we show that the KALE continuously interpolates between the KL and the Maximum Mean Discrepancy (MMD). Like the MMD and other Integral Probability Metrics, the KALE remains well defined for mutually singular distributions. Nonetheless, the KALE inherits from the limiting KL a greater sensitivity to mismatch in the support of the distributions, compared with the MMD. These two properties make the KALE gradient flow particularly well suited when the target distribution is supported on a low-dimensional manifold. Under an assumption of sufficient smoothness of the trajectories, we show the global convergence of the KALE flow. We propose a particle implementation of the flow given initial samples from the source and the target distribution, which we use to empirically confirm the KALE's properties.
    Super-k: A Piecewise Linear Classifier Based on Voronoi Tessellations. (arXiv:2012.15492v3 [cs.LG] UPDATED)
    (2 min) Voronoi tessellations are used to partition the Euclidean space into polyhedral regions, which are called Voronoi cells. Labeling the Voronoi cells with the class information, we can map any classification problem into a Voronoi tessellation. In this way, the classification problem changes into a query of just finding the enclosing Voronoi cell. In order to accomplish this task, we have developed a new algorithm which generates a labeled Voronoi tessellation that partitions the training data into polyhedral regions and obtains interclass boundaries as an indirect result. It is called Supervised k-Voxels or in short Super-k. We are introducing Super-k as a foundational new algorithm and opening the possibility of a new family of algorithms. In this paper, it is shown via comparisons on certain datasets that the Super-k algorithm has the potential of providing comparable performance of the well-known SVM family of algorithms with less complexity.
    Offline Contextual Bandits with Overparameterized Models. (arXiv:2006.15368v4 [cs.LG] UPDATED)
    (2 min) Recent results in supervised learning suggest that while overparameterized models have the capacity to overfit, they in fact generalize quite well. We ask whether the same phenomenon occurs for offline contextual bandits. Our results are mixed. Value-based algorithms benefit from the same generalization behavior as overparameterized supervised learning, but policy-based algorithms do not. We show that this discrepancy is due to the \emph{action-stability} of their objectives. An objective is action-stable if there exists a prediction (action-value vector or action distribution) which is optimal no matter which action is observed. While value-based objectives are action-stable, policy-based objectives are unstable. We formally prove upper bounds on the regret of overparameterized value-based learning and lower bounds on the regret for policy-based algorithms. In our experiments with large neural networks, this gap between action-stable value-based objectives and unstable policy-based objectives leads to significant performance differences.
    Multilinear Dirichlet Processes. (arXiv:2106.08852v1 [cs.LG])
    (2 min) Dependent Dirichlet processes (DDP) have been widely applied to model data from distributions over collections of measures which are correlated in some way. On the other hand, in recent years, increasing research efforts in machine learning and data mining have been dedicated to dealing with data involving interactions from two or more factors. However, few researchers have addressed the heterogeneous relationship in data brought by modulation of multiple factors using techniques of DDP. In this paper, we propose a novel technique, MultiLinear Dirichlet Processes (MLDP), to constructing DDPs by combining DP with a state-of-the-art factor analysis technique, multilinear factor analyzers (MLFA). We have evaluated MLDP on real-word data sets for different applications and have achieved state-of-the-art performance.
    LieTransformer: Equivariant self-attention for Lie Groups. (arXiv:2012.10885v4 [cs.LG] UPDATED)
    (2 min) Group equivariant neural networks are used as building blocks of group invariant neural networks, which have been shown to improve generalisation performance and data efficiency through principled parameter sharing. Such works have mostly focused on group equivariant convolutions, building on the result that group equivariant linear maps are necessarily convolutions. In this work, we extend the scope of the literature to self-attention, that is emerging as a prominent building block of deep learning models. We propose the LieTransformer, an architecture composed of LieSelfAttention layers that are equivariant to arbitrary Lie groups and their discrete subgroups. We demonstrate the generality of our approach by showing experimental results that are competitive to baseline methods on a wide range of tasks: shape counting on point clouds, molecular property regression and modelling particle trajectories under Hamiltonian dynamics.
    Multi-Class Classification from Single-Class Data with Confidences. (arXiv:2106.08864v1 [cs.LG])
    (2 min) Can we learn a multi-class classifier from only data of a single class? We show that without any assumptions on the loss functions, models, and optimizers, we can successfully learn a multi-class classifier from only data of a single class with a rigorous consistency guarantee when confidences (i.e., the class-posterior probabilities for all the classes) are available. Specifically, we propose an empirical risk minimization framework that is loss-/model-/optimizer-independent. Instead of constructing a boundary between the given class and other classes, our method can conduct discriminative classification between all the classes even if no data from the other classes are provided. We further theoretically and experimentally show that our method can be Bayes-consistent with a simple modification even if the provided confidences are highly noisy. Then, we provide an extension of our method for the case where data from a subset of all the classes are available. Experimental results demonstrate the effectiveness of our methods.
    Sleeper Agent: Scalable Hidden Trigger Backdoors for Neural Networks Trained from Scratch. (arXiv:2106.08970v1 [cs.LG])
    (2 min) As the curation of data for machine learning becomes increasingly automated, dataset tampering is a mounting threat. Backdoor attackers tamper with training data to embed a vulnerability in models that are trained on that data. This vulnerability is then activated at inference time by placing a "trigger" into the model's input. Typical backdoor attacks insert the trigger directly into the training data, although the presence of such an attack may be visible upon inspection. In contrast, the Hidden Trigger Backdoor Attack achieves poisoning without placing a trigger into the training data at all. However, this hidden trigger attack is ineffective at poisoning neural networks trained from scratch. We develop a new hidden trigger attack, Sleeper Agent, which employs gradient matching, data selection, and target model re-training during the crafting process. Sleeper Agent is the first hidden trigger backdoor attack to be effective against neural networks trained from scratch. We demonstrate its effectiveness on ImageNet and in black-box settings. Our implementation code can be found at https://github.com/hsouri/Sleeper-Agent.
    Algorithm to Compilation Codesign: An Integrated View of Neural Network Sparsity. (arXiv:2106.08846v1 [cs.LG])
    (2 min) Reducing computation cost, inference latency, and memory footprint of neural networks are frequently cited as research motivations for pruning and sparsity. However, operationalizing those benefits and understanding the end-to-end effect of algorithm design and regularization on the runtime execution is not often examined in depth. Here we apply structured and unstructured pruning to attention weights of transformer blocks of the BERT language model, while also expanding block sparse representation (BSR) operations in the TVM compiler. Integration of BSR operations enables the TVM runtime execution to leverage structured pattern sparsity induced by model regularization. This integrated view of pruning algorithms enables us to study relationships between modeling decisions and their direct impact on sparsity-enhanced execution. Our main findings are: 1) we validate that performance benefits of structured sparsity block regularization must be enabled by the BSR augmentations to TVM, with 4x speedup relative to vanilla PyTorch and 2.2x speedup relative to standard TVM compilation (without expanded BSR support). 2) for BERT attention weights, the end-to-end optimal block sparsity shape in this CPU inference context is not a square block (as in \cite{gray2017gpu}) but rather a linear 32x1 block 3) the relationship between performance and block size / shape is is suggestive of how model regularization parameters interact with task scheduler optimizations resulting in the observed end-to-end performance.
    FGLP: A Federated Fine-Grained Location Prediction System for Mobile Users. (arXiv:2106.08946v1 [cs.LG])
    (2 min) Fine-grained location prediction on smart phones can be used to improve app/system performance. Application scenarios include video quality adaptation as a function of the 5G network quality at predicted user locations, and augmented reality apps that speed up content rendering based on predicted user locations. Such use cases require prediction error in the same range as the GPS error, and no existing works on location prediction can achieve this level of accuracy. We present a system for fine-grained location prediction (FGLP) of mobile users, based on GPS traces collected on the phones. FGLP has two components: a federated learning framework and a prediction model. The framework runs on the phones of the users and also on a server that coordinates learning from all users in the system. FGLP represents the user location data as relative points in an abstract 2D space, which enables learning across different physical spaces. The model merges Bidirectional Long Short-Term Memory (BiLSTM) and Convolutional Neural Networks (CNN), where BiLSTM learns the speed and direction of the mobile users, and CNN learns information such as user movement preferences. FGLP uses federated learning to protect user privacy and reduce bandwidth consumption. Our experimental results, using a dataset with over 600,000 users, demonstrate that FGLP outperforms baseline models in terms of prediction accuracy. We also demonstrate that FGLP works well in conjunction with transfer learning, which enables model reusability. Finally, benchmark results on several types of Android phones demonstrate FGLP's feasibility in real life.
    CloudCast: A Satellite-Based Dataset and Baseline for Forecasting Clouds. (arXiv:2007.07978v2 [cs.CV] UPDATED)
    (2 min) Forecasting the formation and development of clouds is a central element of modern weather forecasting systems. Incorrect clouds forecasts can lead to major uncertainty in the overall accuracy of weather forecasts due to their intrinsic role in the Earth's climate system. Few studies have tackled this challenging problem from a machine learning point-of-view due to a shortage of high-resolution datasets with many historical observations globally. In this paper, we present a novel satellite-based dataset called ``CloudCast''. It consists of 70,080 images with 10 different cloud types for multiple layers of the atmosphere annotated on a pixel level. The spatial resolution of the dataset is 928 x 1530 pixels (3x3 km per pixel) with 15-min intervals between frames for the period 2017-01-01 to 2018-12-31. All frames are centered and projected over Europe. To supplement the dataset, we conduct an evaluation study with current state-of-the-art video prediction methods such as convolutional long short-term memory networks, generative adversarial networks, and optical flow-based extrapolation methods. As the evaluation of video prediction is difficult in practice, we aim for a thorough evaluation in the spatial and temporal domain. Our benchmark models show promising results but with ample room for improvement. This is the first publicly available global-scale dataset with high-resolution cloud types on a high temporal granularity to the authors' best knowledge.
    Dissecting Hessian: Understanding Common Structure of Hessian in Neural Networks. (arXiv:2010.04261v5 [cs.LG] UPDATED)
    (2 min) Hessian captures important properties of the deep neural network loss landscape. Previous works have observed low rank structure in the Hessians of neural networks. We make several new observations about the top eigenspace of layer-wise Hessian: top eigenspaces for different models have surprisingly high overlap, and top eigenvectors form low rank matrices when they are reshaped into the same shape as the corresponding weight matrix. Towards formally explaining such structures of the Hessian, we show that the new eigenspace structure can be explained by approximating the Hessian using Kronecker factorization; we also prove the low rank structure for random data at random initialization for over-parametrized two-layer neural nets. Our new understanding can explain why some of these structures become weaker when the network is trained with batch normalization. The Kronecker factorization also leads to better explicit generalization bounds.
    An unifying point of view on expressive power of GNNs. (arXiv:2106.08992v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) are a wide class of connectionist models for graph processing. They perform an iterative message passing operation on each node and its neighbors, to solve classification/ clustering tasks --- on some nodes or on the whole graph --- collecting all such messages, regardless of their order. Despite the differences among the various models belonging to this class, most of them adopt the same computation scheme, based on a local aggregation mechanism and, intuitively, the local computation framework is mainly responsible for the expressive power of GNNs. In this paper, we prove that the Weisfeiler--Lehman test induces an equivalence relationship on the graph nodes that exactly corresponds to the unfolding equivalence, defined on the original GNN model. Therefore, the results on the expressive power of the original GNNs can be extended to general GNNs which, under mild conditions, can be proved capable of approximating, in probability and up to any precision, any function on graphs that respects the unfolding equivalence.
    The shape and simplicity biases of adversarially robust ImageNet-trained CNNs. (arXiv:2006.09373v4 [cs.CV] UPDATED)
    (2 min) Adversarial training has been the topic of dozens of studies and a leading method for defending against adversarial attacks. Yet, it remains largely unknown (a) how adversarially-robust ImageNet classifiers (R classifiers) generalize to out-of-distribution examples; and (b) how their generalization capability relates to their hidden representations. In this paper, we perform a thorough, systematic study to answer these two questions across AlexNet, GoogLeNet, and ResNet-50 architectures. We found that while standard ImageNet classifiers have a strong texture bias, their R counterparts rely heavily on shapes. Remarkably, adversarial training induces three simplicity biases into hidden neurons in the process of 'robustifying' the network. That is, each convolutional neuron in R networks often changes to detecting (1) pixel-wise smoother patterns i.e. a mechanism that blocks high-frequency noise from passing through the network; (2) more lower-level features i.e. textures and colors (instead of objects); and (3) fewer types of inputs. Our findings reveal the interesting mechanisms that made networks more adversarially robust and also explain some recent findings. Our findings reveal the interesting mechanisms that made networks more adversarially robust and also explain some recent findings e.g. why R networks benefit from much larger capacity (Xie and Yuille, 2020) and can act as a strong image prior in image synthesis (Santurkar et al., 2019).
    Amortized Synthesis of Constrained Configurations Using a Differentiable Surrogate. (arXiv:2106.09019v1 [cs.LG])
    (2 min) In design, fabrication, and control problems, we are often faced with the task of synthesis, in which we must generate an object or configuration that satisfies a set of constraints while maximizing one or more objective functions. The synthesis problem is typically characterized by a physical process in which many different realizations may achieve the goal. This many-to-one map presents challenges to the supervised learning of feed-forward synthesis, as the set of viable designs may have a complex structure. In addition, the non-differentiable nature of many physical simulations prevents direct optimization. We address both of these problems with a two-stage neural network architecture that we may consider to be an autoencoder. We first learn the decoder: a differentiable surrogate that approximates the many-to-one physical realization process. We then learn the encoder, which maps from goal to design, while using the fixed decoder to evaluate the quality of the realization. We evaluate the approach on two case studies: extruder path planning in additive manufacturing and constrained soft robot inverse kinematics. We compare our approach to direct optimization of design using the learned surrogate, and to supervised learning of the synthesis problem. We find that our approach produces higher quality solutions than supervised learning, while being competitive in quality with direct optimization, at a greatly reduced computational cost.
    PettingZoo: Gym for Multi-Agent Reinforcement Learning. (arXiv:2009.14471v6 [cs.LG] UPDATED)
    (2 min) This paper introduces the PettingZoo library and the accompanying Agent Environment Cycle ("AEC") games model. PettingZoo is a library of diverse sets of multi-agent environments with a universal, elegant Python API. PettingZoo was developed with the goal of accelerating research in Multi-Agent Reinforcement Learning ("MARL"), by making work more interchangeable, accessible and reproducible akin to what OpenAI's Gym library did for single-agent reinforcement learning. PettingZoo's API, while inheriting many features of Gym, is unique amongst MARL APIs in that it's based around the novel AEC games model. We argue, in part through case studies on major problems in popular MARL environments, that the popular game models are poor conceptual models of the games commonly used with MARL, that they promote severe bugs that are hard to detect, and that the AEC games model addresses these problems.
    Random feature neural networks learn Black-Scholes type PDEs without curse of dimensionality. (arXiv:2106.08900v1 [cs.LG])
    (2 min) This article investigates the use of random feature neural networks for learning Kolmogorov partial (integro-)differential equations associated to Black-Scholes and more general exponential L\'evy models. Random feature neural networks are single-hidden-layer feedforward neural networks in which only the output weights are trainable. This makes training particularly simple, but (a priori) reduces expressivity. Interestingly, this is not the case for Black-Scholes type PDEs, as we show here. We derive bounds for the prediction error of random neural networks for learning sufficiently non-degenerate Black-Scholes type models. A full error analysis is provided and it is shown that the derived bounds do not suffer from the curse of dimensionality. We also investigate an application of these results to basket options and validate the bounds numerically. These results prove that neural networks are able to \textit{learn} solutions to Black-Scholes type PDEs without the curse of dimensionality. In addition, this provides an example of a relevant learning problem in which random feature neural networks are provably efficient.
    Dataset Dynamics via Gradient Flows in Probability Space. (arXiv:2010.12760v2 [cs.LG] UPDATED)
    (2 min) Various machine learning tasks, from generative modeling to domain adaptation, revolve around the concept of dataset transformation and manipulation. While various methods exist for transforming unlabeled datasets, principled methods to do so for labeled (e.g., classification) datasets are missing. In this work, we propose a novel framework for dataset transformation, which we cast as optimization over data-generating joint probability distributions. We approach this class of problems through Wasserstein gradient flows in probability space, and derive practical and efficient particle-based methods for a flexible but well-behaved class of objective functions. Through various experiments, we show that this framework can be used to impose constraints on classification datasets, adapt them for transfer learning, or to re-purpose fixed or black-box models to classify ---with high accuracy--- previously unseen datasets.
    Estimating the Robustness of Public Transport Systems Using Machine Learning. (arXiv:2106.08967v1 [cs.LG])
    (2 min) The planning of attractive and cost efficient public transport systems is a highly complex optimization process involving many steps. Integrating robustness from a passenger's point of view makes the task even more challenging. With numerous different definitions of robustness in literature, a real-world acceptable evaluation of the robustness of a public transport system is to simulate its performance under a large number of possible scenarios. Unfortunately, this is computationally very expensive. In this paper, we therefore explore a new way of such a scenario-based robustness approximation by using methods from machine learning. We achieve a fast approach with a very high accuracy by gathering a subset of key features of a public transport system and its passenger demand and training an artificial neural network to learn the outcome of a given set of robustness tests. The network is then able to predict the robustness of untrained instances with high accuracy using only its key features, allowing for a robustness oracle for transport planners that approximates the robustness in constant time. Such an oracle can be used as black box to increase the robustness within a local search framework for integrated public transportation planning. In computational experiments with different benchmark instances we demonstrate an excellent quality of our predictions.
    Knowledge-Adaptation Priors. (arXiv:2106.08769v1 [cs.LG])
    (2 min) Humans and animals have a natural ability to quickly adapt to their surroundings, but machine-learning models, when subjected to changes, often require a complete retraining from scratch. We present Knowledge-adaptation priors (K-priors) to reduce the cost of retraining by enabling quick and accurate adaptation for a wide-variety of tasks and models. This is made possible by a combination of weight and function-space priors to reconstruct the gradients of the past, which recovers and generalizes many existing, but seemingly-unrelated, adaptation strategies. Training with simple first-order gradient methods can often recover the exact retrained model to an arbitrary accuracy by choosing a sufficiently large memory of the past data. Empirical results confirm that the adaptation can be cheap and accurate, and a promising alternative to retraining.
    Detecting chaos in lineage-trees: A deep learning approach. (arXiv:2106.08956v1 [cs.LG])
    (2 min) Many complex phenomena, from weather systems to heartbeat rhythm patterns, are effectively modeled as low-dimensional dynamical systems. Such systems may behave chaotically under certain conditions, and so the ability to detect chaos based on empirical measurement is an important step in characterizing and predicting these processes. Classifying a system as chaotic usually requires estimating its largest Lyapunov exponent, which quantifies the average rate of convergence or divergence of initially close trajectories in state space, and for which a positive value is generally accepted as an operational definition of chaos. Estimating the largest Lyapunov exponent from observations of a process is especially challenging in systems affected by dynamical noise, which is the case for many models of real-world processes, in particular models of biological systems. We describe a novel method for estimating the largest Lyapunov exponent from data, based on training Deep Learning models on synthetically generated trajectories, and demonstrate that this method yields accurate and noise-robust predictions given relatively short inputs and across a range of different dynamical systems. Our method is unique in that it can analyze tree-shaped data, a ubiquitous topology in biological settings, and specifically in dynamics over lineages of cells or organisms. We also characterize the types of input information extracted by our models for their predictions, allowing for a deeper understanding into the different ways by which chaos can be analyzed in different topologies.
    Deep-learning based Tools for Automated Protocol Definition of Advanced Diagnostic Imaging Exams. (arXiv:2106.08963v1 [cs.LG])
    (2 min) Purpose: This study evaluates the effectiveness and impact of automated order-based protocol assignment for magnetic resonance imaging (MRI) exams using natural language processing (NLP) and deep learning (DL). Methods: NLP tools were applied to retrospectively process orders from over 116,000 MRI exams with 200 unique sub-specialized protocols ("Local" protocol class). Separate DL models were trained on 70\% of the processed data for "Local" protocols as well as 93 American College of Radiology ("ACR") protocols and 48 "General" protocols. The DL Models were assessed in an "auto-protocoling (AP)" inference mode which returns the top recommendation and in a "clinical decision support (CDS)" inference mode which returns up to 10 protocols for radiologist review. The accuracy of each protocol recommendation was computed and analyzed based on the difference between the normalized output score of the corresponding neural net for the top two recommendations. Results: The top predicted protocol in AP mode was correct for 82.8%, 73.8%, and 69.3% of the test cases for "General", "ACR", and "Local" protocol classes, respectively. Higher levels of accuracy over 96% were obtained for all protocol classes in CDS mode. However, at current validation performance levels, the proposed models offer modest, positive, financial impact on large-scale imaging networks. Conclusions: DL-based protocol automation is feasible and can be tuned to route substantial fractions of exams for auto-protocoling, with higher accuracy with more general protocols. Economic analyses of the tested algorithms indicate that improved algorithm performance is required to yield a practical exam auto-protocoling tool for sub-specialized imaging exams.
    Recursive Construction of Stable Assemblies of Recurrent Neural Networks. (arXiv:2106.08928v1 [cs.LG])
    (2 min) Advanced applications of modern machine learning will likely involve combinations of trained networks, as are already used in spectacular systems such as DeepMind's AlphaGo. Recursively building such combinations in an effective and stable fashion while also allowing for continual refinement of the individual networks - as nature does for biological networks - will require new analysis tools. This paper takes a step in this direction by establishing contraction properties of broad classes of nonlinear recurrent networks and neural ODEs, and showing how these quantified properties allow in turn to recursively construct stable networks of networks in a systematic fashion. The results can also be used to stably combine recurrent networks and physical systems with quantified contraction properties. Similarly, they may be applied to modular computational models of cognition.
    $C^3$: Compositional Counterfactual Constrastive Learning for Video-grounded Dialogues. (arXiv:2106.08914v1 [cs.LG])
    (2 min) Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context. Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available. However, the results are partly accomplished by exploiting biases in the datasets rather than developing multimodal reasoning, resulting in limited generalization. In this paper, we propose a novel approach of Compositional Counterfactual Contrastive Learning ($C^3$) to develop contrastive training between factual and counterfactual samples in video-grounded dialogues. Specifically, we design factual/counterfactual sampling based on the temporal steps in videos and tokens in dialogues and propose contrastive loss functions that exploit object-level or action-level variance. Different from prior approaches, we focus on contrastive hidden state representations among compositional output tokens to optimize the representation space in a generation setting. We achieved promising performance gains on the Audio-Visual Scene-Aware Dialogues (AVSD) benchmark and showed the benefits of our approach in grounding video and dialogue context.
    Intelligent Tire-Based Slip Ratio Estimation Using Different Machine Learning Algorithms. (arXiv:2106.08961v1 [cs.LG])
    (2 min) Estimation of the longitudinal slip ratio of tires is important in boosting the control performance of the vehicle under driving and braking conditions. In this paper, the slip ratio is estimated using four machine learning algorithms (Neural Network, Gradient Boosting Machine, Random Forest and Support Vector Machine) based on the acceleration signals from the tri-axial MEMS accelerometers utilized in the intelligent tire system. The experimental data are collected through the MTS experimental platform. The corresponding acceleration signals within the tire contact patch are extracted after filtering to be used for the training the aforesaid machine learning algorithms. A comparison is provided between the implemented ML algorithms using a 10-fold CV. NRMS errors in the CV results indicate that NN has the highest accuracy in comparison with other techniques. The NRSM errors of NN, GBM, RF, and SVM are 2.59\%, 3.30\%, 4.21\%, and 5.34\%, respectively. Among these techniques, GBM has a more stable results as it has the smallest output variance. The present study with the fusion of intelligent tire system and machine learning algorithms paves the way for the accurate estimation of tire slip ratio, which is critical for the development of reliable vehicle control algorithms.
    Grounding Spatio-Temporal Language with Transformers. (arXiv:2106.08858v1 [cs.AI])
    (2 min) Language is an interface to the outside world. In order for embodied agents to use it, language must be grounded in other, sensorimotor modalities. While there is an extended literature studying how machines can learn grounded language, the topic of how to learn spatio-temporal linguistic concepts is still largely uncharted. To make progress in this direction, we here introduce a novel spatio-temporal language grounding task where the goal is to learn the meaning of spatio-temporal descriptions of behavioral traces of an embodied agent. This is achieved by training a truth function that predicts if a description matches a given history of observations. The descriptions involve time-extended predicates in past and present tense as well as spatio-temporal references to objects in the scene. To study the role of architectural biases in this task, we train several models including multimodal Transformer architectures; the latter implement different attention computations between words and objects across space and time. We test models on two classes of generalization: 1) generalization to randomly held-out sentences; 2) generalization to grammar primitives. We observe that maintaining object identity in the attention computation of our Transformers is instrumental to achieving good performance on generalization overall, and that summarizing object traces in a single token has little influence on performance. We then discuss how this opens new perspectives for language-guided autonomous embodied agents. We also release our code under open-source license as well as pretrained models and datasets to encourage the wider community to build upon and extend our work in the future.
    Evaluating Gender Bias in Hindi-English Machine Translation. (arXiv:2106.08680v1 [cs.CL])
    (2 min) With language models being deployed increasingly in the real world, it is essential to address the issue of the fairness of their outputs. The word embedding representations of these language models often implicitly draw unwanted associations that form a social bias within the model. The nature of gendered languages like Hindi, poses an additional problem to the quantification and mitigation of bias, owing to the change in the form of the words in the sentence, based on the gender of the subject. Additionally, there is sparse work done in the realm of measuring and debiasing systems for Indic languages. In our work, we attempt to evaluate and quantify the gender bias within a Hindi-English machine translation system. We implement a modified version of the existing TGBI metric based on the grammatical considerations for Hindi. We also compare and contrast the resulting bias measurements across multiple metrics for pre-trained embeddings and the ones learned by our machine translation model.
    mSHAP: SHAP Values for Two-Part Models. (arXiv:2106.08990v1 [stat.ML])
    (2 min) Two-part models are important to and used throughout insurance and actuarial science. Since insurance is required for registering a car, obtaining a mortgage, and participating in certain businesses, it is especially important that the models which price insurance policies are fair and non-discriminatory. Black box models can make it very difficult to know which covariates are influencing the results. SHAP values enable interpretation of various black box models, but little progress has been made in two-part models. In this paper, we propose mSHAP (or multiplicative SHAP), a method for computing SHAP values of two-part models using the SHAP values of the individual models. This method will allow for the predictions of two-part models to be explained at an individual observation level. After developing mSHAP, we perform an in-depth simulation study. Although the kernelSHAP algorithm is also capable of computing approximate SHAP values for a two-part model, a comparison with our method demonstrates that mSHAP is exponentially faster. Ultimately, we apply mSHAP to a two-part ratemaking model for personal auto property damage insurance coverage. Additionally, an R package (mshap) is available to easily implement the method in a wide variety of applications.
    Optimality of short-term synaptic plasticity in modelling certain dynamic environments. (arXiv:2009.06808v2 [cs.NE] UPDATED)
    (2 min) Biological neurons and their in-silico emulations for neuromorphic artificial intelligence (AI) use extraordinarily energy-efficient mechanisms, such as spike-based communication and local synaptic plasticity. It remains unclear whether these neuronal mechanisms only offer efficiency or also underlie the superiority of biological intelligence. Here, we prove rigorously that, indeed, the Bayes-optimal prediction and inference of randomly but continuously transforming environments, a common natural setting, relies on short-term spike-timing-dependent plasticity, a hallmark of biological synapses. Further, this dynamic Bayesian inference through plasticity enables circuits of the cerebral cortex in simulations to recognize previously unseen, highly distorted dynamic stimuli. Strikingly, this also introduces a biologically-modelled AI, the first to overcome multiple limitations of deep learning and outperform artificial neural networks in a visual task. The cortical-like network is spiking and event-based, trained only with unsupervised and local plasticity, on a small, narrow, and static training dataset, but achieves recognition of unseen, transformed, and dynamic data better than deep neural networks with continuous activations, trained with supervised backpropagation on the transforming data. These results link short-term plasticity to high-level cortical function, suggest optimality of natural intelligence for natural environments, and repurpose neuromorphic AI from mere efficiency to computational supremacy altogether.
    Named Entity Recognition with Small Strongly Labeled and Large Weakly Labeled Data. (arXiv:2106.08977v1 [cs.CL])
    (2 min) Weak supervision has shown promising results in many natural language processing tasks, such as Named Entity Recognition (NER). Existing work mainly focuses on learning deep NER models only with weak supervision, i.e., without any human annotation, and shows that by merely using weakly labeled data, one can achieve good performance, though still underperforms fully supervised NER with manually/strongly labeled data. In this paper, we consider a more practical scenario, where we have both a small amount of strongly labeled data and a large amount of weakly labeled data. Unfortunately, we observe that weakly labeled data does not necessarily improve, or even deteriorate the model performance (due to the extensive noise in the weak labels) when we train deep NER models over a simple or weighted combination of the strongly labeled and weakly labeled data. To address this issue, we propose a new multi-stage computational framework -- NEEDLE with three essential ingredients: (1) weak label completion, (2) noise-aware loss function, and (3) final fine-tuning over the strongly labeled data. Through experiments on E-commerce query NER and Biomedical NER, we demonstrate that NEEDLE can effectively suppress the noise of the weak labels and outperforms existing methods. In particular, we achieve new SOTA F1-scores on 3 Biomedical NER datasets: BC5CDR-chem 93.74, BC5CDR-disease 90.69, NCBI-disease 92.28.
    Learning Causal Semantic Representation for Out-of-Distribution Prediction. (arXiv:2011.01681v4 [stat.ML] UPDATED)
    (2 min) Conventional supervised learning methods, especially deep ones, are found to be sensitive to out-of-distribution (OOD) examples, largely because the learned representation mixes the semantic factor with the variation factor due to their domain-specific correlation, while only the semantic factor causes the output. To address the problem, we propose a Causal Semantic Generative model (CSG) based on a causal reasoning so that the two factors are modeled separately, and develop methods for OOD prediction from a single training domain, which is common and challenging. The methods are based on the causal invariance principle, with a novel design for both efficient learning and easy prediction. Theoretically, we prove that under certain conditions, CSG can identify the semantic factor by fitting training data, and this semantic-identification guarantees the boundedness of OOD generalization error and the success of adaptation. Empirical study shows improved OOD performance over prevailing baselines.
    Early fault detection with multi-target neural networks. (arXiv:2106.08957v1 [cs.LG])
    (2 min) Wind power is seeing a strong growth around the world. At the same time, shrinking profit margins in the energy markets let wind farm managers explore options for cost reductions in the turbine operation and maintenance. Sensor-based condition monitoring facilitates remote diagnostics of turbine subsystems, enabling faster responses when unforeseen maintenance is required. Condition monitoring with data from the turbines' supervisory control and data acquisition (SCADA) systems was proposed and SCADA-based fault detection and diagnosis approaches introduced based on single-task normal operation models of turbine state variables. As the number of SCADA channels has grown strongly, thousands of independent single-target models are in place today for monitoring a single turbine. Multi-target learning was recently proposed to limit the number of models. This study applied multi-target neural networks to the task of early fault detection in drive-train components. The accuracy and delay of detecting gear bearing faults were compared to state-of-the-art single-target approaches. We found that multi-target multi-layer perceptrons (MLPs) detected faults at least as early and in many cases earlier than single-target MLPs. The multi-target MLPs could detect faults up to several days earlier than the single-target models. This can deliver a significant advantage in the planning and performance of maintenance work. At the same time, the multi-target MLPs achieved the same level of prediction stability.
    Unbiased Methods for Multi-Goal Reinforcement Learning. (arXiv:2106.08863v1 [cs.LG])
    (2 min) In multi-goal reinforcement learning (RL) settings, the reward for each goal is sparse, and located in a small neighborhood of the goal. In large dimension, the probability of reaching a reward vanishes and the agent receives little learning signal. Methods such as Hindsight Experience Replay (HER) tackle this issue by also learning from realized but unplanned-for goals. But HER is known to introduce bias, and can converge to low-return policies by overestimating chancy outcomes. First, we vindicate HER by proving that it is actually unbiased in deterministic environments, such as many optimal control settings. Next, for stochastic environments in continuous spaces, we tackle sparse rewards by directly taking the infinitely sparse reward limit. We fully formalize the problem of multi-goal RL with infinitely sparse Dirac rewards at each goal. We introduce unbiased deep Q-learning and actor-critic algorithms that can handle such infinitely sparse rewards, and test them in toy environments.
    A Spiking Neural Network for Image Segmentation. (arXiv:2106.08921v1 [cs.NE])
    (2 min) We seek to investigate the scalability of neuromorphic computing for computer vision, with the objective of replicating non-neuromorphic performance on computer vision tasks while reducing power consumption. We convert the deep Artificial Neural Network (ANN) architecture U-Net to a Spiking Neural Network (SNN) architecture using the Nengo framework. Both rate-based and spike-based models are trained and optimized for benchmarking performance and power, using a modified version of the ISBI 2D EM Segmentation dataset consisting of microscope images of cells. We propose a partitioning method to optimize inter-chip communication to improve speed and energy efficiency when deploying multi-chip networks on the Loihi neuromorphic chip. We explore the advantages of regularizing firing rates of Loihi neurons for converting ANN to SNN with minimum accuracy loss and optimized energy consumption. We propose a percentile based regularization loss function to limit the spiking rate of the neuron between a desired range. The SNN is converted directly from the corresponding ANN, and demonstrates similar semantic segmentation as the ANN using the same number of neurons and weights. However, the neuromorphic implementation on the Intel Loihi neuromorphic chip is over 2x more energy-efficient than conventional hardware (CPU, GPU) when running online (one image at a time). These power improvements are achieved without sacrificing the task performance accuracy of the network, and when all weights (Loihi, CPU, and GPU networks) are quantized to 8 bits.
    Real-Time Anomaly Detection in Edge Streams. (arXiv:2009.08452v2 [cs.LG] UPDATED)
    (2 min) Given a stream of graph edges from a dynamic graph, how can we assign anomaly scores to edges in an online manner, for the purpose of detecting unusual behavior, using constant time and memory? Existing approaches aim to detect individually surprising edges. In this work, we propose MIDAS, which focuses on detecting microcluster anomalies, or suddenly arriving groups of suspiciously similar edges, such as lockstep behavior, including denial of service attacks in network traffic data. We further propose MIDAS-F, to solve the problem by which anomalies are incorporated into the algorithm's internal states, creating a `poisoning' effect that can allow future anomalies to slip through undetected. MIDAS-F introduces two modifications: 1) We modify the anomaly scoring function, aiming to reduce the `poisoning' effect of newly arriving edges; 2) We introduce a conditional merge step, which updates the algorithm's data structures after each time tick, but only if the anomaly score is below a threshold value, also to reduce the `poisoning' effect. Experiments show that MIDAS-F has significantly higher accuracy than MIDAS. MIDAS has the following properties: (a) it detects microcluster anomalies while providing theoretical guarantees about its false positive probability; (b) it is online, thus processing each edge in constant time and constant memory, and also processes the data orders-of-magnitude faster than state-of-the-art approaches; (c) it provides up to 62% higher ROC-AUC than state-of-the-art approaches.
    Adaptive Visibility Graph Neural Network and It's Application in Modulation Classification. (arXiv:2106.08564v1 [cs.LG])
    (2 min) Our digital world is full of time series and graphs which capture the various aspects of many complex systems. Traditionally, there are respective methods in processing these two different types of data, e.g., Recurrent Neural Network (RNN) and Graph Neural Network (GNN), while in recent years, time series could be mapped to graphs by using the techniques such as Visibility Graph (VG), so that researchers can use graph algorithms to mine the knowledge in time series. Such mapping methods establish a bridge between time series and graphs, and have high potential to facilitate the analysis of various real-world time series. However, the VG method and its variants are just based on fixed rules and thus lack of flexibility, largely limiting their application in reality. In this paper, we propose an Adaptive Visibility Graph (AVG) algorithm that can adaptively map time series into graphs, based on which we further establish an end-to-end classification framework AVGNet, by utilizing GNN model DiffPool as the classifier. We then adopt AVGNet for radio signal modulation classification which is an important task in the field of wireless communication. The simulations validate that AVGNet outperforms a series of advanced deep learning methods, achieving the state-of-the-art performance in this task.
    Theoretical Analysis of Self-Training with Deep Networks on Unlabeled Data. (arXiv:2010.03622v4 [cs.LG] UPDATED)
    (2 min) Self-training algorithms, which train a model to fit pseudolabels predicted by another previously-learned model, have been very successful for learning with unlabeled data using neural networks. However, the current theoretical understanding of self-training only applies to linear models. This work provides a unified theoretical analysis of self-training with deep networks for semi-supervised learning, unsupervised domain adaptation, and unsupervised learning. At the core of our analysis is a simple but realistic "expansion" assumption, which states that a low probability subset of the data must expand to a neighborhood with large probability relative to the subset. We also assume that neighborhoods of examples in different classes have minimal overlap. We prove that under these assumptions, the minimizers of population objectives based on self-training and input-consistency regularization will achieve high accuracy with respect to ground-truth labels. By using off-the-shelf generalization bounds, we immediately convert this result to sample complexity guarantees for neural nets that are polynomial in the margin and Lipschitzness. Our results help explain the empirical successes of recently proposed self-training algorithms which use input consistency regularization.
    Discrete Auto-regressive Variational Attention Models for Text Modeling. (arXiv:2106.08571v1 [cs.LG])
    (2 min) Variational autoencoders (VAEs) have been widely applied for text modeling. In practice, however, they are troubled by two challenges: information underrepresentation and posterior collapse. The former arises as only the last hidden state of LSTM encoder is transformed into the latent space, which is generally insufficient to summarize the data. The latter is a long-standing problem during the training of VAEs as the optimization is trapped to a disastrous local optimum. In this paper, we propose Discrete Auto-regressive Variational Attention Model (DAVAM) to address the challenges. Specifically, we introduce an auto-regressive variational attention approach to enrich the latent space by effectively capturing the semantic dependency from the input. We further design discrete latent space for the variational attention and mathematically show that our model is free from posterior collapse. Extensive experiments on language modeling tasks demonstrate the superiority of DAVAM against several VAE counterparts.
    Ditto: Fair and Robust Federated Learning Through Personalization. (arXiv:2012.04221v3 [cs.LG] UPDATED)
    (2 min) Fairness and robustness are two important concerns for federated learning systems. In this work, we identify that robustness to data and model poisoning attacks and fairness, measured as the uniformity of performance across devices, are competing constraints in statistically heterogeneous networks. To address these constraints, we propose employing a simple, general framework for personalized federated learning, Ditto, that can inherently provide fairness and robustness benefits, and develop a scalable solver for it. Theoretically, we analyze the ability of Ditto to achieve fairness and robustness simultaneously on a class of linear problems. Empirically, across a suite of federated datasets, we show that Ditto not only achieves competitive performance relative to recent personalization methods, but also enables more accurate, robust, and fair models relative to state-of-the-art fair or robust baselines.
    Geometry of Similarity Comparisons. (arXiv:2006.09858v3 [cs.LG] UPDATED)
    (2 min) Many data analysis problems can be cast as distance geometry problems in \emph{space forms} -- Euclidean, spherical, or hyperbolic spaces. Often, absolute distance measurements are often unreliable or simply unavailable and only proxies to absolute distances in the form of similarities are available. Hence we ask the following: Given only \emph{comparisons} of similarities amongst a set of entities, what can be said about the geometry of the underlying space form? To study this question, we introduce the notions of the \textit{ordinal capacity} of a target space form and \emph{ordinal spread} of the similarity measurements. The latter is an indicator of complex patterns in the measurements, while the former quantifies the capacity of a space form to accommodate a set of measurements with a specific ordinal spread profile. We prove that the ordinal capacity of a space form is related to its dimension and the sign of its curvature. This leads to a lower bound on the Euclidean and spherical embedding dimension of what we term similarity graphs. More importantly, we show that the statistical behavior of the ordinal spread random variables defined on a similarity graph can be used to identify its underlying space form. We support our theoretical claims with experiments on weighted trees, single-cell RNA expression data and spherical cartographic measurements.
    A Neural Model for Joint Document and Snippet Ranking in Question Answering for Large Document Collections. (arXiv:2106.08908v1 [cs.IR])
    (2 min) Question answering (QA) systems for large document collections typically use pipelines that (i) retrieve possibly relevant documents, (ii) re-rank them, (iii) rank paragraphs or other snippets of the top-ranked documents, and (iv) select spans of the top-ranked snippets as exact answers. Pipelines are conceptually simple, but errors propagate from one component to the next, without later components being able to revise earlier decisions. We present an architecture for joint document and snippet ranking, the two middle stages, which leverages the intuition that relevant documents have good snippets and good snippets come from relevant documents. The architecture is general and can be used with any neural text relevance ranker. We experiment with two main instantiations of the architecture, based on POSIT-DRMM (PDRMM) and a BERT-based ranker. Experiments on biomedical data from BIOASQ show that our joint models vastly outperform the pipelines in snippet retrieval, the main goal for QA, with fewer trainable parameters, also remaining competitive in document retrieval. Furthermore, our joint PDRMM-based model is competitive with BERT-based models, despite using orders of magnitude fewer parameters. These claims are also supported by human evaluation on two test batches of BIOASQ. To test our key findings on another dataset, we modified the Natural Questions dataset so that it can also be used for document and snippet retrieval. Our joint PDRMM-based model again outperforms the corresponding pipeline in snippet retrieval on the modified Natural Questions dataset, even though it performs worse than the pipeline in document retrieval. We make our code and the modified Natural Questions dataset publicly available.
    On the long-term learning ability of LSTM LMs. (arXiv:2106.08927v1 [cs.CL])
    (2 min) We inspect the long-term learning ability of Long Short-Term Memory language models (LSTM LMs) by evaluating a contextual extension based on the Continuous Bag-of-Words (CBOW) model for both sentence- and discourse-level LSTM LMs and by analyzing its performance. We evaluate on text and speech. Sentence-level models using the long-term contextual module perform comparably to vanilla discourse-level LSTM LMs. On the other hand, the extension does not provide gains for discourse-level models. These findings indicate that discourse-level LSTM LMs already rely on contextual information to perform long-term learning.
    RefBERT: Compressing BERT by Referencing to Pre-computed Representations. (arXiv:2106.08898v1 [cs.CL])
    (2 min) Recently developed large pre-trained language models, e.g., BERT, have achieved remarkable performance in many downstream natural language processing applications. These pre-trained language models often contain hundreds of millions of parameters and suffer from high computation and latency in real-world applications. It is desirable to reduce the computation overhead of the models for fast training and inference while keeping the model performance in downstream applications. Several lines of work utilize knowledge distillation to compress the teacher model to a smaller student model. However, they usually discard the teacher's knowledge when in inference. Differently, in this paper, we propose RefBERT to leverage the knowledge learned from the teacher, i.e., facilitating the pre-computed BERT representation on the reference sample and compressing BERT into a smaller student model. To guarantee our proposal, we provide theoretical justification on the loss function and the usage of reference samples. Significantly, the theoretical result shows that including the pre-computed teacher's representations on the reference samples indeed increases the mutual information in learning the student model. Finally, we conduct the empirical evaluation and show that our RefBERT can beat the vanilla TinyBERT over 8.1\% and achieves more than 94\% of the performance of $\BERTBASE$ on the GLUE benchmark. Meanwhile, RefBERT is 7.4x smaller and 9.5x faster on inference than BERT$_{\rm BASE}$.
    Local plasticity rules can learn deep representations using self-supervised contrastive predictions. (arXiv:2010.08262v4 [cs.LG] UPDATED)
    (2 min) Learning in the brain is poorly understood and learning rules that respect biological constraints, yet yield deep hierarchical representations, are still unknown. Here, we propose a learning rule that takes inspiration from neuroscience and recent advances in self-supervised deep learning. Learning minimizes a simple layer-specific loss function and does not need to back-propagate error signals within or between layers. Instead, weight updates follow a local, Hebbian, learning rule that only depends on pre- and post-synaptic neuronal activity, predictive dendritic input and widely broadcasted modulation factors which are identical for large groups of neurons. The learning rule applies contrastive predictive learning to a causal, biological setting using saccades (i.e. rapid shifts in gaze direction). We find that networks trained with this self-supervised and local rule build deep hierarchical representations of images, speech and video.
    Quantifying the Preferential Direction of the Model Gradient in Adversarial Training With Projected Gradient Descent. (arXiv:2009.04709v3 [stat.ML] UPDATED)
    (2 min) Adversarial training, especially projected gradient descent (PGD), has been a successful approach for improving robustness against adversarial attacks. After adversarial training, gradients of models with respect to their inputs have a preferential direction. However, the direction of alignment is not mathematically well established, making it difficult to evaluate quantitatively. We propose a novel definition of this direction as the direction of the vector pointing toward the closest point of the support of the closest inaccurate class in decision space. To evaluate the alignment with this direction after adversarial training, we apply a metric that uses generative adversarial networks to produce the smallest residual needed to change the class present in the image. We show that PGD-trained models have a higher alignment than the baseline according to our definition, that our metric presents higher alignment values than a competing metric formulation, and that enforcing this alignment increases the robustness of models.
    Offline RL Without Off-Policy Evaluation. (arXiv:2106.08909v1 [cs.LG])
    (2 min) Most prior approaches to offline reinforcement learning (RL) have taken an iterative actor-critic approach involving off-policy evaluation. In this paper we show that simply doing one step of constrained/regularized policy improvement using an on-policy Q estimate of the behavior policy performs surprisingly well. This one-step algorithm beats the previously reported results of iterative algorithms on a large portion of the D4RL benchmark. The simple one-step baseline achieves this strong performance without many of the tricks used by previously proposed iterative algorithms and is more robust to hyperparameters. We argue that the relatively poor performance of iterative approaches is a result of the high variance inherent in doing off-policy evaluation and magnified by the repeated optimization of policies against those high-variance estimates. In addition, we hypothesize that the strong performance of the one-step algorithm is due to a combination of favorable structure in the environment and behavior policy.
    Covariance-based smoothed particle hydrodynamics. A machine-learning application to simulating disc fragmentation. (arXiv:2106.08870v1 [physics.comp-ph])
    (2 min) A PCA-based, machine learning version of the SPH method is proposed. In the present scheme, the smoothing tensor is computed to have their eigenvalues proportional to the covariance's principal components, using a modified octree data structure, which allows the fast estimation of the anisotropic self-regulating kNN. Each SPH particle is the center of such an optimal kNN cluster, i.e., the one whose covariance tensor allows the find of the kNN cluster itself according to the Mahalanobis metric. Such machine learning constitutes a fixed point problem. The definitive (self-regulating) kNN cluster defines the smoothing volume, or properly saying, the smoothing ellipsoid, required to perform the anisotropic interpolation. Thus, the smoothing kernel has an ellipsoidal profile, which changes how the kernel gradients are computed. As an application, it was performed the simulation of collapse and fragmentation of a non-magnetic, rotating gaseous sphere. An interesting outcome was the formation of protostars in the disc fragmentation, shown to be much more persistent and much more abundant in the anisotropic simulation than in the isotropic case.
    Robust Training in High Dimensions via Block Coordinate Geometric Median Descent. (arXiv:2106.08882v1 [cs.LG])
    (2 min) Geometric median (\textsc{Gm}) is a classical method in statistics for achieving a robust estimation of the uncorrupted data; under gross corruption, it achieves the optimal breakdown point of 0.5. However, its computational complexity makes it infeasible for robustifying stochastic gradient descent (SGD) for high-dimensional optimization problems. In this paper, we show that by applying \textsc{Gm} to only a judiciously chosen block of coordinates at a time and using a memory mechanism, one can retain the breakdown point of 0.5 for smooth non-convex problems, with non-asymptotic convergence rates comparable to the SGD with \textsc{Gm}.
    Variational Disentanglement for Rare Event Modeling. (arXiv:2009.08541v5 [stat.ML] UPDATED)
    (2 min) Combining the increasing availability and abundance of healthcare data and the current advances in machine learning methods have created renewed opportunities to improve clinical decision support systems. However, in healthcare risk prediction applications, the proportion of cases with the condition (label) of interest is often very low relative to the available sample size. Though very prevalent in healthcare, such imbalanced classification settings are also common and challenging in many other scenarios. So motivated, we propose a variational disentanglement approach to semi-parametrically learn from rare events in heavily imbalanced classification problems. Specifically, we leverage the imposed extreme-distribution behavior on a latent space to extract information from low-prevalence events, and develop a robust prediction arm that joins the merits of the generalized additive model and isotonic neural nets. Results on synthetic studies and diverse real-world datasets, including mortality prediction on a COVID-19 cohort, demonstrate that the proposed approach outperforms existing alternatives.
    ModelDiff: Testing-Based DNN Similarity Comparison for Model Reuse Detection. (arXiv:2106.08890v1 [cs.LG])
    (2 min) The knowledge of a deep learning model may be transferred to a student model, leading to intellectual property infringement or vulnerability propagation. Detecting such knowledge reuse is nontrivial because the suspect models may not be white-box accessible and/or may serve different tasks. In this paper, we propose ModelDiff, a testing-based approach to deep learning model similarity comparison. Instead of directly comparing the weights, activations, or outputs of two models, we compare their behavioral patterns on the same set of test inputs. Specifically, the behavioral pattern of a model is represented as a decision distance vector (DDV), in which each element is the distance between the model's reactions to a pair of inputs. The knowledge similarity between two models is measured with the cosine similarity between their DDVs. To evaluate ModelDiff, we created a benchmark that contains 144 pairs of models that cover most popular model reuse methods, including transfer learning, model compression, and model stealing. Our method achieved 91.7% correctness on the benchmark, which demonstrates the effectiveness of using ModelDiff for model reuse detection. A study on mobile deep learning apps has shown the feasibility of ModelDiff on real-world models.
    Complexity aspects of local minima and related notions. (arXiv:2008.06148v2 [math.OC] UPDATED)
    (2 min) We consider the notions of (i) critical points, (ii) second-order points, (iii) local minima, and (iv) strict local minima for multivariate polynomials. For each type of point, and as a function of the degree of the polynomial, we study the complexity of deciding (1) if a given point is of that type, and (2) if a polynomial has a point of that type. Our results characterize the complexity of these two questions for all degrees left open by prior literature. Our main contributions reveal that many of these questions turn out to be tractable for cubic polynomials. In particular, we present an efficiently-checkable necessary and sufficient condition for local minimality of a point for a cubic polynomial. We also show that a local minimum of a cubic polynomial can be efficiently found by solving semidefinite programs of size linear in the number of variables. By contrast, we show that it is strongly NP-hard to decide if a cubic polynomial has a critical point. We also prove that the set of second-order points of any cubic polynomial is a spectrahedron, and conversely that any spectrahedron is the projection of the set of second-order points of a cubic polynomial. In our final section, we briefly present a potential application of finding local minima of cubic polynomials to the design of a third-order Newton method.
    ParticleAugment: Sampling-Based Data Augmentation. (arXiv:2106.08693v1 [cs.LG])
    (2 min) We present an automated data augmentation approach for image classification. We formulate the problem as Monte Carlo sampling where our goal is to approximate the optimal augmentation policies. We propose a particle filtering formulation to find optimal augmentation policies and their schedules during model training. Our performance measurement procedure relies on a validation subset of our training set, while the policy transition model depends on a Gaussian prior and an optional augmentation velocity parameter. In our experiments, we show that our formulation for automated augmentation reaches promising results on CIFAR-10, CIFAR-100, and ImageNet datasets using the standard network architectures for this problem. By comparing with the related work, we also show that our method reaches a balance between the computational cost of policy search and the model performance.
    Online Learning with Uncertain Feedback Graphs. (arXiv:2106.08441v1 [cs.LG])
    (2 min) Online learning with expert advice is widely used in various machine learning tasks. It considers the problem where a learner chooses one from a set of experts to take advice and make a decision. In many learning problems, experts may be related, henceforth the learner can observe the losses associated with a subset of experts that are related to the chosen one. In this context, the relationship among experts can be captured by a feedback graph, which can be used to assist the learner's decision making. However, in practice, the nominal feedback graph often entails uncertainties, which renders it impossible to reveal the actual relationship among experts. To cope with this challenge, the present work studies various cases of potential uncertainties, and develops novel online learning algorithms to deal with uncertainties while making use of the uncertain feedback graph. The proposed algorithms are proved to enjoy sublinear regret under mild conditions. Experiments on real datasets are presented to demonstrate the effectiveness of the novel algorithms.
    A Wasserstein Minimax Framework for Mixed Linear Regression. (arXiv:2106.07537v2 [stat.ML] UPDATED)
    (2 min) Multi-modal distributions are commonly used to model clustered data in statistical learning tasks. In this paper, we consider the Mixed Linear Regression (MLR) problem. We propose an optimal transport-based framework for MLR problems, Wasserstein Mixed Linear Regression (WMLR), which minimizes the Wasserstein distance between the learned and target mixture regression models. Through a model-based duality analysis, WMLR reduces the underlying MLR task to a nonconvex-concave minimax optimization problem, which can be provably solved to find a minimax stationary point by the Gradient Descent Ascent (GDA) algorithm. In the special case of mixtures of two linear regression models, we show that WMLR enjoys global convergence and generalization guarantees. We prove that WMLR's sample complexity grows linearly with the dimension of data. Finally, we discuss the application of WMLR to the federated learning task where the training samples are collected by multiple agents in a network. Unlike the Expectation Maximization algorithm, WMLR directly extends to the distributed, federated learning setting. We support our theoretical results through several numerical experiments, which highlight our framework's ability to handle the federated learning setting with mixture models.
    Best of both worlds: local and global explanations with human-understandable concepts. (arXiv:2106.08641v1 [cs.LG])
    (2 min) Interpretability techniques aim to provide the rationale behind a model's decision, typically by explaining either an individual prediction (local explanation, e.g. `why is this patient diagnosed with this condition') or a class of predictions (global explanation, e.g. `why are patients diagnosed with this condition in general'). While there are many methods focused on either one, few frameworks can provide both local and global explanations in a consistent manner. In this work, we combine two powerful existing techniques, one local (Integrated Gradients, IG) and one global (Testing with Concept Activation Vectors), to provide local, and global concept-based explanations. We first validate our idea using two synthetic datasets with a known ground truth, and further demonstrate with a benchmark natural image dataset. We test our method with various concepts, target classes, model architectures and IG baselines. We show that our method improves global explanations over TCAV when compared to ground truth, and provides useful insights. We hope our work provides a step towards building bridges between many existing local and global methods to get the best of both worlds.
    Memorization and Generalization in Neural Code Intelligence Models. (arXiv:2106.08704v1 [cs.SE])
    (2 min) Deep Neural Networks (DNN) are increasingly commonly used in software engineering and code intelligence tasks. These are powerful tools that are capable of learning highly generalizable patterns from large datasets through millions of parameters. At the same time, training DNNs means walking a knife's edges, because their large capacity also renders them prone to memorizing data points. While traditionally thought of as an aspect of over-training, recent work suggests that the memorization risk manifests especially strongly when the training datasets are noisy and memorization is the only recourse. Unfortunately, most code intelligence tasks rely on rather noise-prone and repetitive data sources, such as GitHub, which, due to their sheer size, cannot be manually inspected and evaluated. We evaluate the memorization and generalization tendencies in neural code intelligence models through a case study across several benchmarks and model families by leveraging established approaches from other fields that use DNNs, such as introducing targeted noise into the training dataset. In addition to reinforcing prior general findings about the extent of memorization in DNNs, our results shed light on the impact of noisy dataset in training.
    Comparison of Outlier Detection Techniques for Structured Data. (arXiv:2106.08779v1 [cs.LG])
    (2 min) An outlier is an observation or a data point that is far from rest of the data points in a given dataset or we can be said that an outlier is away from the center of mass of observations. Presence of outliers can skew statistical measures and data distributions which can lead to misleading representation of the underlying data and relationships. It is seen that the removal of outliers from the training dataset before modeling can give better predictions. With the advancement of machine learning, the outlier detection models are also advancing at a good pace. The goal of this work is to highlight and compare some of the existing outlier detection techniques for the data scientists to use that information for outlier algorithm selection while building a machine learning model.
    Fast Quantum Property Prediction via Deeper 2D and 3D Graph Networks. (arXiv:2106.08551v1 [cs.LG])
    (2 min) Molecular property prediction is gaining increasing attention due to its diverse applications. One task of particular interests and importance is to predict quantum chemical properties without 3D equilibrium structures. This is practically favorable since obtaining 3D equilibrium structures requires extremely expensive calculations. In this work, we design a deep graph neural network to predict quantum properties by directly learning from 2D molecular graphs. In addition, we propose a 3D graph neural network to learn from low-cost conformer sets, which can be obtained with open-source tools using an affordable budget. We employ our methods to participate in the 2021 KDD Cup on OGB Large-Scale Challenge (OGB-LSC), which aims to predict the HOMO-LUMO energy gap of molecules. Final evaluation results reveal that we are one of the winners with a mean absolute error of 0.1235 on the holdout test set. Our implementation is available as part of the MoleculeX package (https://github.com/divelab/MoleculeX).
    SEEN: Sharpening Explanations for Graph Neural Networks using Explanations from Neighborhoods. (arXiv:2106.08532v1 [cs.LG])
    (2 min) Explaining the foundations for predictions obtained from graph neural networks (GNNs) is critical for credible use of GNN models for real-world problems. Owing to the rapid growth of GNN applications, recent progress in explaining predictions from GNNs, such as sensitivity analysis, perturbation methods, and attribution methods, showed great opportunities and possibilities for explaining GNN predictions. In this study, we propose a method to improve the explanation quality of node classification tasks that can be applied in a post hoc manner through aggregation of auxiliary explanations from important neighboring nodes, named SEEN. Applying SEEN does not require modification of a graph and can be used with diverse explainability techniques due to its independent mechanism. Experiments on matching motif-participating nodes from a given graph show great improvement in explanation accuracy of up to 12.71% and demonstrate the correlation between the auxiliary explanations and the enhanced explanation accuracy through leveraging their contributions. SEEN provides a simple but effective method to enhance the explanation quality of GNN model outputs, and this method is applicable in combination with most explainability techniques.
    Probabilistic DAG Search. (arXiv:2106.08717v1 [cs.LG])
    (2 min) Exciting contemporary machine learning problems have recently been phrased in the classic formalism of tree search -- most famously, the game of Go. Interestingly, the state-space underlying these sequential decision-making problems often posses a more general latent structure than can be captured by a tree. In this work, we develop a probabilistic framework to exploit a search space's latent structure and thereby share information across the search tree. The method is based on a combination of approximate inference in jointly Gaussian models for the explored part of the problem, and an abstraction for the unexplored part that imposes a reduction of complexity ad hoc. We empirically find our algorithm to compare favorably to existing non-probabilistic alternatives in Tic-Tac-Toe and a feature selection application.
    Averaging on the Bures-Wasserstein manifold: dimension-free convergence of gradient descent. (arXiv:2106.08502v1 [math.OC])
    (2 min) We study first-order optimization algorithms for computing the barycenter of Gaussian distributions with respect to the optimal transport metric. Although the objective is geodesically non-convex, Riemannian GD empirically converges rapidly, in fact faster than off-the-shelf methods such as Euclidean GD and SDP solvers. This stands in stark contrast to the best-known theoretical results for Riemannian GD, which depend exponentially on the dimension. In this work, we prove new geodesic convexity results which provide stronger control of the iterates, yielding a dimension-free convergence rate. Our techniques also enable the analysis of two related notions of averaging, the entropically-regularized barycenter and the geometric median, providing the first convergence guarantees for Riemannian GD for these problems.
    Robust Reinforcement Learning Under Minimax Regret for Green Security. (arXiv:2106.08413v1 [cs.LG])
    (2 min) Green security domains feature defenders who plan patrols in the face of uncertainty about the adversarial behavior of poachers, illegal loggers, and illegal fishers. Importantly, the deterrence effect of patrols on adversaries' future behavior makes patrol planning a sequential decision-making problem. Therefore, we focus on robust sequential patrol planning for green security following the minimax regret criterion, which has not been considered in the literature. We formulate the problem as a game between the defender and nature who controls the parameter values of the adversarial behavior and design an algorithm MIRROR to find a robust policy. MIRROR uses two reinforcement learning-based oracles and solves a restricted game considering limited defender strategies and parameter values. We evaluate MIRROR on real-world poaching data.
    Reinforcement Learning for Markovian Bandits: Is Posterior Sampling more Scalable than Optimism?. (arXiv:2106.08771v1 [cs.LG])
    (2 min) We study learning algorithms for the classical Markovian bandit problem with discount. We explain how to adapt PSRL [24] and UCRL2 [2] to exploit the problem structure. These variants are called MB-PSRL and MB-UCRL2. While the regret bound and runtime of vanilla implementations of PSRL and UCRL2 are exponential in the number of bandits, we show that the episodic regret of MB-PSRL and MB-UCRL2 is�(S $\sqrt$ nK) where K is the number of episodes, n is the number of bandits and S is the number of states of each bandit (the exact bound in S, n and K is given in the paper). Up to a factor $\sqrt$ S, this matches the lower bound of $\Omega$($\sqrt$ SnK) that we also derive in the paper. MB-PSRL is also computationally efficient: its runtime is linear in the number of bandits. We further show that this linear runtime cannot be achieved by adapting classical non-Bayesian algorithms such as UCRL2 or UCBVI to Markovian bandit problems. Finally, we perform numerical experiments that confirm that MB-PSRL outperforms other existing algorithms in practice, both in terms of regret and of computation time.
    Solving Continuous Control with Episodic Memory. (arXiv:2106.08832v1 [cs.LG])
    (2 min) Episodic memory lets reinforcement learning algorithms remember and exploit promising experience from the past to improve agent performance. Previous works on memory mechanisms show benefits of using episodic-based data structures for discrete action problems in terms of sample-efficiency. The application of episodic memory for continuous control with a large action space is not trivial. Our study aims to answer the question: can episodic memory be used to improve agent's performance in continuous control? Our proposed algorithm combines episodic memory with Actor-Critic architecture by modifying critic's objective. We further improve performance by introducing episodic-based replay buffer prioritization. We evaluate our algorithm on OpenAI gym domains and show greater sample-efficiency compared with the state-of-the art model-free off-policy algorithms.
    Input Invex Neural Network. (arXiv:2106.08748v1 [cs.LG])
    (2 min) In this paper, we present a novel method to constrain invexity on Neural Networks (NN). Invex functions ensure every stationary point is global minima. Hence, gradient descent commenced from any point will lead to the global minima. Another advantage of invexity on NN is to divide data space locally into two connected sets with a highly non-linear decision boundary by simply thresholding the output. To this end, we formulate a universal invex function approximator and employ it to enforce invexity in NN. We call it Input Invex Neural Networks (II-NN). We first fit data with a known invex function, followed by modification with a NN, compare the direction of the gradient and penalize the direction of gradient on NN if it contradicts with the direction of reference invex function. In order to penalize the direction of the gradient we perform Gradient Clipped Gradient Penalty (GC-GP). We applied our method to the existing NNs for both image classification and regression tasks. From the extensive empirical and qualitative experiments, we observe that our method gives the performance similar to ordinary NN yet having invexity. Our method outperforms linear NN and Input Convex Neural Network (ICNN) with a large margin. We publish our code and implementation details at github.
    TSO: Curriculum Generation using continuous optimization. (arXiv:2106.08569v1 [cs.LG])
    (2 min) The training of deep learning models poses vast challenges of including parameter tuning and ordering of training data. Significant research has been done in Curriculum learning for optimizing the sequence of training data. Recent works have focused on using complex reinforcement learning techniques to find the optimal data ordering strategy to maximize learning for a given network. In this paper, we present a simple and efficient technique based on continuous optimization. We call this new approach Training Sequence Optimization (TSO). There are three critical components in our proposed approach: (a) An encoder network maps/embeds training sequence into continuous space. (b) A predictor network uses the continuous representation of a strategy as input and predicts the accuracy for fixed network architecture. (c) A decoder further maps a continuous representation of a strategy to the ordered training dataset. The performance predictor and encoder enable us to perform gradient-based optimization in the continuous space to find the embedding of optimal training data ordering with potentially better accuracy. Experiments show that we can gain 2AP with our generated optimal curriculum strategy over the random strategy using the CIFAR-100 dataset and have better boosts than the state of the art CL algorithms. We do an ablation study varying the architecture, dataset and sample sizes showcasing our approach's robustness.
    Locality defeats the curse of dimensionality in convolutional teacher-student scenarios. (arXiv:2106.08619v1 [stat.ML])
    (2 min) Convolutional neural networks perform a local and translationally-invariant treatment of the data: quantifying which of these two aspects is central to their success remains a challenge. We study this problem within a teacher-student framework for kernel regression, using `convolutional' kernels inspired by the neural tangent kernel of simple convolutional architectures of given filter size. Using heuristic methods from physics, we find in the ridgeless case that locality is key in determining the learning curve exponent $\beta$ (that relates the test error $\epsilon_t\sim P^{-\beta}$ to the size of the training set $P$), whereas translational invariance is not. In particular, if the filter size of the teacher $t$ is smaller than that of the student $s$, $\beta$ is a function of $s$ only and does not depend on the input dimension. We confirm our predictions on $\beta$ empirically. Theoretically, in some cases (including when teacher and student are equal) it can be shown that this prediction is an upper bound on performance. We conclude by proving, using a natural universality assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.
    A Framework for Discovering Optimal Solutions in Photonic Inverse Design. (arXiv:2106.08419v1 [physics.optics])
    (2 min) Photonic inverse design has emerged as an indispensable engineering tool for complex optical systems. In many instances it is important to optimize for both material and geometry configurations, which results in complex non-smooth search spaces with multiple local minima. Finding solutions approaching global optimum may present a computationally intractable task. Here, we develop a framework that allows expediting the search of solutions close to global optimum on complex optimization spaces. We study the way representative black box optimization algorithms work, including genetic algorithm (GA), particle swarm optimization (PSO), simulated annealing (SA), and mesh adaptive direct search (NOMAD). We then propose and utilize a two-step approach that identifies best performance algorithms on arbitrarily complex search spaces. We reveal a connection between the search space complexity and algorithm performance and find that PSO and NOMAD consistently deliver better performance for mixed integer problems encountered in photonic inverse design, particularly with the account of material combinations. Our results differ from a commonly anticipated advantage of GA. Our findings will foster more efficient design of photonic systems with optimal performance.
    Learning-based Support Estimation in Sublinear Time. (arXiv:2106.08396v1 [cs.LG])
    (2 min) We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to $ \pm \varepsilon n$ from a sample of size $O(\log^2(1/\varepsilon) \cdot n/\log n)$, where $n$ is the data set size. Unfortunately, this bound is known to be tight, limiting further improvements to the complexity of this problem. In this paper we consider estimation algorithms augmented with a machine-learning-based predictor that, given any element, returns an estimation of its frequency. We show that if the predictor is correct up to a constant approximation factor, then the sample complexity can be reduced significantly, to \[ \ \log (1/\varepsilon) \cdot n^{1-\Theta(1/\log(1/\varepsilon))}. \] We evaluate the proposed algorithms on a collection of data sets, using the neural-network based estimators from {Hsu et al, ICLR'19} as predictors. Our experiments demonstrate substantial (up to 3x) improvements in the estimation accuracy compared to the state of the art algorithm.
    Silent Speech and Emotion Recognition from Vocal Tract Shape Dynamics in Real-Time MRI. (arXiv:2106.08706v1 [eess.IV])
    (2 min) Speech sounds of spoken language are obtained by varying configuration of the articulators surrounding the vocal tract. They contain abundant information that can be utilized to better understand the underlying mechanism of human speech production. We propose a novel deep neural network-based learning framework that understands acoustic information in the variable-length sequence of vocal tract shaping during speech production, captured by real-time magnetic resonance imaging (rtMRI), and translate it into text. The proposed framework comprises of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. On the USC-TIMIT corpus, the model achieved a 40.6% PER at sentence-level, much better compared to the existing models. To the best of our knowledge, this is the first study that demonstrates the recognition of entire spoken sentence based on an individual's articulatory motions captured by rtMRI video. We also performed an analysis of variations in the geometry of articulation in each sub-regions of the vocal tract (i.e., pharyngeal, velar and dorsal, hard palate, labial constriction region) with respect to different emotions and genders. Results suggest that each sub-regions distortion is affected by both emotion and gender.
    Achieving Domain Robustness in Stereo Matching Networks by Removing Shortcut Learning. (arXiv:2106.08486v1 [cs.CV])
    (2 min) Learning-based stereo matching and depth estimation networks currently excel on public benchmarks with impressive results. However, state-of-the-art networks often fail to generalize from synthetic imagery to more challenging real data domains. This paper is an attempt to uncover hidden secrets of achieving domain robustness and in particular, discovering the important ingredients of generalization success of stereo matching networks by analyzing the effect of synthetic image learning on real data performance. We provide evidence that demonstrates that learning of features in the synthetic domain by a stereo matching network is heavily influenced by two "shortcuts" presented in the synthetic data: (1) identical local statistics (RGB colour features) between matching pixels in the synthetic stereo images and (2) lack of realism in synthetic textures on 3D objects simulated in game engines. We will show that by removing such shortcuts, we can achieve domain robustness in the state-of-the-art stereo matching frameworks and produce a remarkable performance on multiple realistic datasets, despite the fact that the networks were trained on synthetic data, only. Our experimental results point to the fact that eliminating shortcuts from the synthetic data is key to achieve domain-invariant generalization between synthetic and real data domains.
    Global Rhythm Style Transfer Without Text Transcriptions. (arXiv:2106.08519v1 [eess.AS])
    (2 min) Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information. Two major components of prosody are pitch and rhythm. Disentangling the prosody information, particularly the rhythm component, from the speech is challenging because it involves breaking the synchrony between the input speech and the disentangled speech representation. As a result, most existing prosody style transfer algorithms would need to rely on some form of text transcriptions to identify the content information, which confines their application to high-resource languages only. Recently, SpeechSplit has made sizeable progress towards unsupervised prosody style transfer, but it is unable to extract high-level global prosody style in an unsupervised manner. In this paper, we propose AutoPST, which can disentangle global prosody style from speech without relying on any text transcriptions. AutoPST is an Autoencoder-based Prosody Style Transfer framework with a thorough rhythm removal module guided by the self-expressive representation learning. Experiments on different style transfer tasks show that AutoPST can effectively convert prosody that correctly reflects the styles of the target domains.
    Eigen Analysis of Self-Attention and its Reconstruction from Partial Computation. (arXiv:2106.08823v1 [cs.LG])
    (2 min) State-of-the-art transformer models use pairwise dot-product based self-attention, which comes at a computational cost quadratic in the input sequence length. In this paper, we investigate the global structure of attention scores computed using this dot product mechanism on a typical distribution of inputs, and study the principal components of their variation. Through eigen analysis of full attention score matrices, as well as of their individual rows, we find that most of the variation among attention scores lie in a low-dimensional eigenspace. Moreover, we find significant overlap between these eigenspaces for different layers and even different transformer models. Based on this, we propose to compute scores only for a partial subset of token pairs, and use them to estimate scores for the remaining pairs. Beyond investigating the accuracy of reconstructing attention scores themselves, we investigate training transformer models that employ these approximations, and analyze the effect on overall accuracy. Our analysis and the proposed method provide insights into how to balance the benefits of exact pair-wise attention and its significant computational expense.
    Ada-BKB: Scalable Gaussian Process Optimization on Continuous Domain by Adaptive Discretization. (arXiv:2106.08598v1 [cs.LG])
    (2 min) Gaussian process optimization is a successful class of algorithms (e.g. GP-UCB) to optimize a black-box function through sequential evaluations. However, when the domain of the function is continuous, Gaussian process optimization has to either rely on a fixed discretization of the space, or solve a non-convex optimization subproblem at each evaluation. The first approach can negatively affect performance, while the second one puts a heavy computational burden on the algorithm. A third option, that only recently has been theoretically studied, is to adaptively discretize the function domain. Even though this approach avoids the extra non-convex optimization costs, the overall computational complexity is still prohibitive. An algorithm such as GP-UCB has a runtime of $O(T^4)$, where $T$ is the number of iterations. In this paper, we introduce Ada-BKB (Adaptive Budgeted Kernelized Bandit), a no-regret Gaussian process optimization algorithm for functions on continuous domains, that provably runs in $O(T^2 d_\text{eff}^2)$, where $d_\text{eff}$ is the effective dimension of the explored space, and which is typically much smaller than $T$. We corroborate our findings with experiments on synthetic non-convex functions and on the real-world problem of hyper-parameter optimization.
    PRASEMap: A Probabilistic Reasoning and Semantic Embedding based Knowledge Graph Alignment System. (arXiv:2106.08801v1 [cs.CL])
    (2 min) Knowledge Graph (KG) alignment aims at finding equivalent entities and relations (i.e., mappings) between two KGs. The existing approaches utilize either reasoning-based or semantic embedding-based techniques, but few studies explore their combination. In this demonstration, we present PRASEMap, an unsupervised KG alignment system that iteratively computes the Mappings with both Probabilistic Reasoning (PR) And Semantic Embedding (SE) techniques. PRASEMap can support various embedding-based KG alignment approaches as the SE module, and enables easy human computer interaction that additionally provides an option for users to feed the mapping annotations back to the system for better results. The demonstration showcases these features via a stand-alone Web application with user friendly interfaces.
    Automating Augmentation Through Random Unidimensional Search. (arXiv:2106.08756v1 [cs.LG])
    (2 min) It is no secret amongst deep learning researchers that finding the right data augmentation strategy during training can mean the difference between a state-of-the-art result and a run-of-the-mill ranking. To that end, the community has seen many efforts to automate the process of finding the perfect augmentation procedure for any task at hand. Unfortunately, even recent cutting-edge methods bring massive computational overhead, requiring as many as 100 full model trainings to settle on an ideal configuration. We show how to achieve even better performance in just 7: with Random Unidimensional Augmentation. Source code is available at https://github.com/fastestimator/RUA
    Drum-Aware Ensemble Architecture for Improved Joint Musical Beat and Downbeat Tracking. (arXiv:2106.08685v1 [cs.SD])
    (2 min) This paper presents a novel system architecture that integrates blind source separation with joint beat and downbeat tracking in musical audio signals. The source separation module segregates the percussive and non-percussive components of the input signal, over which beat and downbeat tracking are performed separately and then the results are aggregated with a learnable fusion mechanism. This way, the system can adaptively determine how much the tracking result for an input signal should depend on the input's percussive or non-percussive components. Evaluation on four testing sets that feature different levels of presence of drum sounds shows that the new architecture consistently outperforms the widely-adopted baseline architecture that does not employ source separation.
    Predicting Unreliable Predictions by Shattering a Neural Network. (arXiv:2106.08365v1 [cs.LG])
    (2 min) Piecewise linear neural networks can be split into subfunctions, each with its own activation pattern, domain, and empirical error. Empirical error for the full network can be written as an expectation over empirical error of subfunctions. Constructing a generalization bound on subfunction empirical error indicates that the more densely a subfunction is surrounded by training samples in representation space, the more reliable its predictions are. Further, it suggests that models with fewer activation regions generalize better, and models that abstract knowledge to a greater degree generalize better, all else equal. We propose not only a theoretical framework to reason about subfunction error bounds but also a pragmatic way of approximately evaluating it, which we apply to predicting which samples the network will not successfully generalize to. We test our method on detection of misclassification and out-of-distribution samples, finding that it performs competitively in both cases. In short, some network activation patterns are associated with higher reliability than others, and these can be identified using subfunction error bounds.
    Code to Comment Translation: A Comparative Study on Model Effectiveness & Errors. (arXiv:2106.08415v1 [cs.SE])
    (2 min) Automated source code summarization is a popular software engineering research topic wherein machine translation models are employed to "translate" code snippets into relevant natural language descriptions. Most evaluations of such models are conducted using automatic reference-based metrics. However, given the relatively large semantic gap between programming languages and natural language, we argue that this line of research would benefit from a qualitative investigation into the various error modes of current state-of-the-art models. Therefore, in this work, we perform both a quantitative and qualitative comparison of three recently proposed source code summarization models. In our quantitative evaluation, we compare the models based on the smoothed BLEU-4, METEOR, and ROUGE-L machine translation metrics, and in our qualitative evaluation, we perform a manual open-coding of the most common errors committed by the models when compared to ground truth captions. Our investigation reveals new insights into the relationship between metric-based performance and model prediction errors grounded in an empirically derived error taxonomy that can be used to drive future research efforts
    Ctrl-P: Temporal Control of Prosodic Variation for Speech Synthesis. (arXiv:2106.08352v1 [eess.AS])
    (2 min) Text does not fully specify the spoken form, so text-to-speech models must be able to learn from speech data that vary in ways not explained by the corresponding text. One way to reduce the amount of unexplained variation in training data is to provide acoustic information as an additional learning signal. When generating speech, modifying this acoustic information enables multiple distinct renditions of a text to be produced. Since much of the unexplained variation is in the prosody, we propose a model that generates speech explicitly conditioned on the three primary acoustic correlates of prosody: $F_{0}$, energy and duration. The model is flexible about how the values of these features are specified: they can be externally provided, or predicted from text, or predicted then subsequently modified. Compared to a model that employs a variational auto-encoder to learn unsupervised latent features, our model provides more interpretable, temporally-precise, and disentangled control. When automatically predicting the acoustic features from text, it generates speech that is more natural than that from a Tacotron 2 model with reference encoder. Subsequent human-in-the-loop modification of the predicted acoustic features can significantly further increase naturalness.
    Comparison of Automated Machine Learning Tools for SMS Spam Message Filtering. (arXiv:2106.08671v1 [cs.LG])
    (2 min) Short Message Service (SMS) is a very popular service used for communication by mobile users. However, this popular service can be abused by executing illegal activities and influencing security risks. Nowadays, many automatic machine learning (AutoML) tools exist which can help domain experts and lay users to build high-quality ML models with little or no machine learning knowledge. In this work, a classification performance comparison was conducted between three automatic ML tools for SMS spam message filtering. These tools are mljar-supervised AutoML, H2O AutoML, and Tree-based Pipeline Optimization Tool (TPOT) AutoML. Experimental results showed that ensemble models achieved the best classification performance. The Stacked Ensemble model, which was built using H2O AutoML, achieved the best performance in terms of Log Loss (0.8370), true positive (1088/1116), and true negative (281/287) metrics. There is a 19.05\% improvement in Log Loss with respect to TPOT AutoML and 10.53\% improvement with respect to mljar-supervised AutoML. The satisfactory filtering performance achieved with AutoML tools provides a potential application for AutoML tools to automatically determine the best ML model that can perform best for SMS spam message filtering.
    HELP: Hardware-Adaptive Efficient Latency Predictor for NAS via Meta-Learning. (arXiv:2106.08630v1 [cs.LG])
    (2 min) For deployment, neural architecture search should be hardware-aware, in order to satisfy the device-specific constraints (e.g., memory usage, latency and energy consumption) and enhance the model efficiency. Existing methods on hardware-aware NAS collect a large number of samples (e.g., accuracy and latency) from a target device, either builds a lookup table or a latency estimator. However, such approach is impractical in real-world scenarios as there exist numerous devices with different hardware specifications, and collecting samples from such a large number of devices will require prohibitive computational and monetary cost. To overcome such limitations, we propose Hardware-adaptive Efficient Latency Predictor (HELP), which formulates the device-specific latency estimation problem as a meta-learning problem, such that we can estimate the latency of a model's performance for a given task on an unseen device with a few samples. To this end, we introduce novel hardware embeddings to embed any devices considering them as black-box functions that output latencies, and meta-learn the hardware-adaptive latency predictor in a device-dependent manner, using the hardware embeddings. We validate the proposed HELP for its latency estimation performance on unseen platforms, on which it achieves high estimation performance with as few as 10 measurement samples, outperforming all relevant baselines. We also validate end-to-end NAS frameworks using HELP against ones without it, and show that it largely reduces the total time cost of the base NAS method, in latency-constrained settings.
    Developing a Fidelity Evaluation Approach for Interpretable Machine Learning. (arXiv:2106.08492v1 [cs.LG])
    (2 min) Although modern machine learning and deep learning methods allow for complex and in-depth data analytics, the predictive models generated by these methods are often highly complex, and lack transparency. Explainable AI (XAI) methods are used to improve the interpretability of these complex models, and in doing so improve transparency. However, the inherent fitness of these explainable methods can be hard to evaluate. In particular, methods to evaluate the fidelity of the explanation to the underlying black box require further development, especially for tabular data. In this paper, we (a) propose a three phase approach to developing an evaluation method; (b) adapt an existing evaluation method primarily for image and text data to evaluate models trained on tabular data; and (c) evaluate two popular explainable methods using this evaluation method. Our evaluations suggest that the internal mechanism of the underlying predictive model, the internal mechanism of the explainable method used and model and data complexity all affect explanation fidelity. Given that explanation fidelity is so sensitive to context and tools and data used, we could not clearly identify any specific explainable method as being superior to another.
    Keep the Gradients Flowing: Using Gradient Flow to Study Sparse Network Optimization. (arXiv:2102.01670v2 [cs.LG] UPDATED)
    (2 min) Training sparse networks to converge to the same performance as dense neural architectures has proven to be elusive. Recent work suggests that initialization is the key. However, while this direction of research has had some success, focusing on initialization alone appears to be inadequate. In this paper, we take a broader view of training sparse networks and consider the role of regularization, optimization, and architecture choices on sparse models. We propose a simple experimental framework, Same Capacity Sparse vs Dense Comparison (SC-SDC), that allows for a fair comparison of sparse and dense networks. Furthermore, we propose a new measure of gradient flow, Effective Gradient Flow (EGF), that better correlates to performance in sparse networks. Using top-line metrics, SC-SDC and EGF, we show that default choices of optimizers, activation functions and regularizers used for dense networks can disadvantage sparse networks. Based upon these findings, we show that gradient flow in sparse networks can be improved by reconsidering aspects of the architecture design and the training regime. Our work suggests that initialization is only one piece of the puzzle and taking a wider view of tailoring optimization to sparse networks yields promising results.
    Multi-Resolution Continuous Normalizing Flows. (arXiv:2106.08462v1 [cs.CV])
    (2 min) Recent work has shown that Neural Ordinary Differential Equations (ODEs) can serve as generative models of images using the perspective of Continuous Normalizing Flows (CNFs). Such models offer exact likelihood calculation, and invertible generation/density estimation. In this work we introduce a Multi-Resolution variant of such models (MRCNF), by characterizing the conditional distribution over the additional information required to generate a fine image that is consistent with the coarse image. We introduce a transformation between resolutions that allows for no change in the log likelihood. We show that this approach yields comparable likelihood values for various image datasets, with improved performance at higher resolutions, with fewer parameters, using only 1 GPU.
    Model Predictive Control with and without Terminal Weight: Stability and Algorithms. (arXiv:2011.14193v2 [eess.SY] CROSS LISTED)
    (2 min) This paper presents stability analysis tools for model predictive control (MPC) with and without terminal weight. Stability analysis of MPC with a limited horizon but without terminal weight is a long-standing open problem. By using a modified value function as an Lyapunov function candidate and the principle of optimality, this paper establishes stability conditions for this type of widely spread MPC algorithms. A new stability guaranteed MPC algorithm without terminal weight (MPCS) is presented. With the help of designing a new sublevel set defined by the value function of one-step ahead stage cost, conditions for checking its recursive feasibility and stability of the proposed MPC algorithm are presented. The new stability condition and the derived MPCS overcome the difficulties arising in the existing terminal weight based MPC framework, including the need of searching a suitable terminal weight and possible poor performance caused by an inappropriate terminal weight. This work is further extended to MPC with a terminal weight for the completeness. Numerical examples are presented to demonstrate the effectiveness of the proposed tool, whereas the existing stability analysis tools are either not applicable or lead to quite conservative results. It shows that the proposed tools offer a number of mechanisms to achieve stability: adjusting state and/or control weights, extending the length of horizon, and adding a simple extra constraint on the first or second state in the optimisation.
    Out-of-Scope Intent Detection with Self-Supervision and Discriminative Training. (arXiv:2106.08616v1 [cs.CL])
    (2 min) Out-of-scope intent detection is of practical importance in task-oriented dialogue systems. Since the distribution of outlier utterances is arbitrary and unknown in the training stage, existing methods commonly rely on strong assumptions on data distribution such as mixture of Gaussians to make inference, resulting in either complex multi-step training procedures or hand-crafted rules such as confidence threshold selection for outlier detection. In this paper, we propose a simple yet effective method to train an out-of-scope intent classifier in a fully end-to-end manner by simulating the test scenario in training, which requires no assumption on data distribution and no additional post-processing or threshold setting. Specifically, we construct a set of pseudo outliers in the training stage, by generating synthetic outliers using inliner features via self-supervision and sampling out-of-scope sentences from easily available open-domain datasets. The pseudo outliers are used to train a discriminative classifier that can be directly applied to and generalize well on the test task. We evaluate our method extensively on four benchmark dialogue datasets and observe significant improvements over state-of-the-art approaches. Our code has been released at https://github.com/liam0949/DCLOOS.
    A Dataset-Level Geometric Framework for Ensemble Classifiers. (arXiv:2106.08658v1 [cs.LG])
    (2 min) Ensemble classifiers have been investigated by many in the artificial intelligence and machine learning community. Majority voting and weighted majority voting are two commonly used combination schemes in ensemble learning. However, understanding of them is incomplete at best, with some properties even misunderstood. In this paper, we present a group of properties of these two schemes formally under a dataset-level geometric framework. Two key factors, every component base classifier's performance and dissimilarity between each pair of component classifiers are evaluated by the same metric - the Euclidean distance. Consequently, ensembling becomes a deterministic problem and the performance of an ensemble can be calculated directly by a formula. We prove several theorems of interest and explain their implications for ensembles. In particular, we compare and contrast the effect of the number of component classifiers on these two types of ensemble schemes. Empirical investigation is also conducted to verify the theoretical results when other metrics such as accuracy are used. We believe that the results from this paper are very useful for us to understand the fundamental properties of these two combination schemes and the principles of ensemble classifiers in general. The results are also helpful for us to investigate some issues in ensemble classifiers, such as ensemble performance prediction, selecting a small number of base classifiers to obtain efficient and effective ensembles.
    DMSANet: Dual Multi Scale Attention Network. (arXiv:2106.08382v1 [cs.CV])
    (2 min) Attention mechanism of late has been quite popular in the computer vision community. A lot of work has been done to improve the performance of the network, although almost always it results in increased computational complexity. In this paper, we propose a new attention module that not only achieves the best performance but also has lesser parameters compared to most existing models. Our attention module can easily be integrated with other convolutional neural networks because of its lightweight nature. The proposed network named Dual Multi Scale Attention Network (DMSANet) is comprised of two parts: the first part is used to extract features at various scales and aggregate them, the second part uses spatial and channel attention modules in parallel to adaptively integrate local features with their global dependencies. We benchmark our network performance for Image Classification on ImageNet dataset, Object Detection and Instance Segmentation both on MS COCO dataset.
    Quantum-inspired event reconstruction with Tensor Networks: Matrix Product States. (arXiv:2106.08334v1 [hep-ph])
    (2 min) Tensor Networks are non-trivial representations of high-dimensional tensors, originally designed to describe quantum many-body systems. We show that Tensor Networks are ideal vehicles to connect quantum mechanical concepts to machine learning techniques, thereby facilitating an improved interpretability of neural networks. This study presents the discrimination of top quark signal over QCD background processes using a Matrix Product State classifier. We show that entanglement entropy can be used to interpret what a network learns, which can be used to reduce the complexity of the network and feature space without loss of generality or performance. For the optimisation of the network, we compare the Density Matrix Renormalization Group (DMRG) algorithm to stochastic gradient descent (SGD) and propose a joined training algorithm to harness the explainability of DMRG with the efficiency of SGD.
    Machine learning-based analysis of hyperspectral images for automated sepsis diagnosis. (arXiv:2106.08445v1 [cs.LG])
    (3 min) Sepsis is a leading cause of mortality and critical illness worldwide. While robust biomarkers for early diagnosis are still missing, recent work indicates that hyperspectral imaging (HSI) has the potential to overcome this bottleneck by monitoring microcirculatory alterations. Automated machine learning-based diagnosis of sepsis based on HSI data, however, has not been explored to date. Given this gap in the literature, we leveraged an existing data set to (1) investigate whether HSI-based automated diagnosis of sepsis is possible and (2) put forth a list of possible confounders relevant for HSI-based tissue classification. While we were able to classify sepsis with an accuracy of over $98\,\%$ using the existing data, our research also revealed several subject-, therapy- and imaging-related confounders that may lead to an overestimation of algorithm performance when not balanced across the patient groups. We conclude that further prospective studies, carefully designed with respect to these confounders, are necessary to confirm the preliminary results obtained in this study.
    Implicit Finite-Horizon Approximation and Efficient Optimal Algorithms for Stochastic Shortest Path. (arXiv:2106.08377v1 [cs.LG])
    (2 min) We introduce a generic template for developing regret minimization algorithms in the Stochastic Shortest Path (SSP) model, which achieves minimax optimal regret as long as certain properties are ensured. The key of our analysis is a new technique called implicit finite-horizon approximation, which approximates the SSP model by a finite-horizon counterpart only in the analysis without explicit implementation. Using this template, we develop two new algorithms: the first one is model-free (the first in the literature to our knowledge) and minimax optimal under strictly positive costs; the second one is model-based and minimax optimal even with zero-cost state-action pairs, matching the best existing result from [Tarbouriech et al., 2021b]. Importantly, both algorithms admit highly sparse updates, making them computationally more efficient than all existing algorithms. Moreover, both can be made completely parameter-free.
    Reproducing Kernel Hilbert Space, Mercer's Theorem, Eigenfunctions, Nystr\"om Method, and Use of Kernels in Machine Learning: Tutorial and Survey. (arXiv:2106.08443v1 [stat.ML])
    (2 min) This is a tutorial and survey paper on kernels, kernel methods, and related fields. We start with reviewing the history of kernels in functional analysis and machine learning. Then, Mercer kernel, Hilbert and Banach spaces, Reproducing Kernel Hilbert Space (RKHS), Mercer's theorem and its proof, frequently used kernels, kernel construction from distance metric, important classes of kernels (including bounded, integrally positive definite, universal, stationary, and characteristic kernels), kernel centering and normalization, and eigenfunctions are explained in detail. Then, we introduce types of use of kernels in machine learning including kernel methods (such as kernel support vector machines), kernel learning by semi-definite programming, Hilbert-Schmidt independence criterion, maximum mean discrepancy, kernel mean embedding, and kernel dimensionality reduction. We also cover rank and factorization of kernel matrix as well as the approximation of eigenfunctions and kernels using the Nystr{\"o}m method. This paper can be useful for various fields of science including machine learning, dimensionality reduction, functional analysis in mathematics, and mathematical physics in quantum mechanics.

2021-06-16

  • cs.CL updates on arXiv.org

    Semantics Altering Modifications for Evaluating Comprehension in Machine Reading. (arXiv:2012.04056v2 [cs.CL] UPDATED)
    (2 min) Advances in NLP have yielded impressive results for the task of machine reading comprehension (MRC), with approaches having been reported to achieve performance comparable to that of humans. In this paper, we investigate whether state-of-the-art MRC models are able to correctly process Semantics Altering Modifications (SAM): linguistically-motivated phenomena that alter the semantics of a sentence while preserving most of its lexical surface form. We present a method to automatically generate and align challenge sets featuring original and altered examples. We further propose a novel evaluation methodology to correctly assess the capability of MRC systems to process these examples independent of the data they were optimised on, by discounting for effects introduced by domain shift. In a large-scale empirical study, we apply the methodology in order to evaluate extractive MRC models with regard to their capability to correctly process SAM-enriched data. We comprehensively cover 12 different state-of-the-art neural architecture configurations and four training datasets and find that -- despite their well-known remarkable performance -- optimised models consistently struggle to correctly process semantically altered data.
    Bilateral Personalized Dialogue Generation with Dynamic Persona-Aware Fusion. (arXiv:2106.07857v1 [cs.CL])
    (2 min) Generating personalized responses is one of the major challenges in natural human-robot interaction. Current researches in this field mainly focus on generating responses consistent with the robot's pre-assigned persona, while ignoring the user's persona. Such responses may be inappropriate or even offensive, which may lead to the bad user experience. Therefore, we propose a bilateral personalized dialogue generation (BPDG) method with dynamic persona-aware fusion via multi-task transfer learning to generate responses consistent with both personas. The proposed method aims to accomplish three learning tasks: 1) an encoder is trained with dialogue utterances added with corresponded personalized attributes and relative position (language model task), 2) a dynamic persona-aware fusion module predicts the persona presence to adaptively fuse the contextual and bilateral personas encodings (persona prediction task) and 3) a decoder generates natural, fluent and personalized responses (dialogue generation task). To make the generated responses more personalized and bilateral persona-consistent, the Conditional Mutual Information Maximum (CMIM) criterion is adopted to select the final response from the generated candidates. The experimental results show that the proposed method outperforms several state-of-the-art methods in terms of both automatic and manual evaluations.
    Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval. (arXiv:2104.01894v3 [cs.CL] UPDATED)
    (2 min) Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice -- both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. In this work, we extensively study and expand choices of encoder architectures, training methodology (including unimodal and multimodal pretraining), and other factors. Our experiments cover different types of speech in three datasets: Flickr Audio, Places Audio, and Localized Narratives. Our best model configuration achieves large gains over state of the art, e.g., pushing recall-at-one from 21.8% to 33.2% for Flickr Audio and 27.6% to 53.4% for Places Audio. We also show our best speech-based models can match or exceed cascaded ASR-to-text encoding when speech is spontaneous, accented, or otherwise hard to automatically transcribe.
    Interpretable Self-supervised Multi-task Learning for COVID-19 Information Retrieval and Extraction. (arXiv:2106.08252v1 [cs.IR])
    (2 min) The rapidly evolving literature of COVID-19 related articles makes it challenging for NLP models to be effectively trained for information retrieval and extraction with the corresponding labeled data that follows the current distribution of the pandemic. On the other hand, due to the uncertainty of the situation, human experts' supervision would always be required to double check the decision making of these models highlighting the importance of interpretability. In the light of these challenges, this study proposes an interpretable self-supervised multi-task learning model to jointly and effectively tackle the tasks of information retrieval (IR) and extraction (IE) during the current emergency health crisis situation. Our results show that our model effectively leverage the multi-task and self-supervised learning to improve generalization, data efficiency and robustness to the ongoing dataset shift problem. Our model outperforms baselines in IE and IR tasks, respectively by micro-f score of 0.08 (LCA-F score of 0.05), and MAP of 0.05 on average. In IE the zero- and few-shot learning performances are on average 0.32 and 0.19 micro-f score higher than those of the baselines.
    Topics to Avoid: Demoting Latent Confounds in Text Classification. (arXiv:1909.00453v2 [cs.LG] UPDATED)
    (2 min) Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topical features which are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author's native language is Swedish). We propose a method that represents the latent topical confounds and a model which "unlearns" confounding features by predicting both the label of the input text and the confound; but we train the two predictors adversarially in an alternating fashion to learn a text representation that predicts the correct label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.
    Question Answering Infused Pre-training of General-Purpose Contextualized Representations. (arXiv:2106.08190v1 [cs.CL])
    (2 min) This paper proposes a pre-training objective based on question answering (QA) for learning general-purpose contextual representations, motivated by the intuition that the representation of a phrase in a passage should encode all questions that the phrase can answer in context. We accomplish this goal by training a bi-encoder QA model, which independently encodes passages and questions, to match the predictions of a more accurate cross-encoder model on 80 million synthesized QA pairs. By encoding QA-relevant information, the bi-encoder's token-level representations are useful for non-QA downstream tasks without extensive (or in some cases, any) fine-tuning. We show large improvements over both RoBERTa-large and previous state-of-the-art results on zero-shot and few-shot paraphrase detection on four datasets, few-shot named entity recognition on two datasets, and zero-shot sentiment analysis on three datasets.
    Learning to Generate Task-Specific Adapters from Task Description. (arXiv:2101.00420v2 [cs.CL] UPDATED)
    (2 min) Pre-trained text-to-text transformers such as BART have achieved impressive performance across a range of NLP tasks. Recent study further shows that they can learn to generalize to novel tasks, by including task descriptions as part of the source sequence and training the model with (source, target) examples. At test time, these fine-tuned models can make inferences on new tasks using the new task descriptions as part of the input. However, this approach has potential limitations, as the model learns to solve individual (source, target) examples (i.e., at the instance level), instead of learning to solve tasks by taking all examples within a task as a whole (i.e., at the task level). To this end, we introduce Hypter, a framework that improves text-to-text transformer's generalization ability to unseen tasks by training a hypernetwork to generate task-specific, light-weight adapters from task descriptions. Experiments on ZEST dataset and a synthetic SQuAD dataset demonstrate that Hypter improves upon fine-tuning baselines. Notably, when using BART-Large as the main network, Hypter brings 11.3% comparative improvement on ZEST dataset.
    Kaizen: Continuously improving teacher using Exponential Moving Average for semi-supervised speech recognition. (arXiv:2106.07759v1 [eess.AS])
    (2 min) In this paper, we introduce the Kaizen framework that uses a continuously improving teacher to generate pseudo-labels for semi-supervised training. The proposed approach uses a teacher model which is updated as the exponential moving average of the student model parameters. This can be seen as a continuous version of the iterative pseudo-labeling approach for semi-supervised training. It is applicable for different training criteria, and in this paper we demonstrate it for frame-level hybrid hidden Markov model - deep neural network (HMM-DNN) models and sequence-level connectionist temporal classification (CTC) based models. The proposed approach shows more than 10% word error rate (WER) reduction over standard teacher-student training and more than 50\% relative WER reduction over 10 hour supervised baseline when using large scale realistic unsupervised public videos in UK English and Italian languages.
    Learning Stable Classifiers by Transferring Unstable Features. (arXiv:2106.07847v1 [cs.LG])
    (2 min) We study transfer learning in the presence of spurious correlations. We experimentally demonstrate that directly transferring the stable feature extractor learned on the source task may not eliminate these biases for the target task. However, we hypothesize that the unstable features in the source task and those in the target task are directly related. By explicitly informing the target classifier of the source task's unstable features, we can regularize the biases in the target task. Specifically, we derive a representation that encodes the unstable features by contrasting different data environments in the source task. On the target task, we cluster data from this representation, and achieve robustness by minimizing the worst-case risk across all clusters. We evaluate our method on both text and image classifications. Empirical results demonstrate that our algorithm is able to maintain robustness on the target task, outperforming the best baseline by 22.9% in absolute accuracy across 12 transfer settings. Our code is available at https://github.com/YujiaBao/Tofu.
    Three-part diachronic semantic change dataset for Russian. (arXiv:2106.08294v1 [cs.CL])
    (2 min) We present a manually annotated lexical semantic change dataset for Russian: RuShiftEval. Its novelty is ensured by a single set of target words annotated for their diachronic semantic shifts across three time periods, while the previous work either used only two time periods, or different sets of target words. The paper describes the composition and annotation procedure for the dataset. In addition, it is shown how the ternary nature of RuShiftEval allows to trace specific diachronic trajectories: `changed at a particular time period and stable afterwards' or `was changing throughout all time periods'. Based on the analysis of the submissions to the recent shared task on semantic change detection for Russian, we argue that correctly identifying such trajectories can be an interesting sub-task itself.
    Teacher-Student MixIT for Unsupervised and Semi-supervised Speech Separation. (arXiv:2106.07843v1 [cs.SD])
    (2 min) In this paper, we introduce a novel semi-supervised learning framework for end-to-end speech separation. The proposed method first uses mixtures of unseparated sources and the mixture invariant training (MixIT) criterion to train a teacher model. The teacher model then estimates separated sources that are used to train a student model with standard permutation invariant training (PIT). The student model can be fine-tuned with supervised data, i.e., paired artificial mixtures and clean speech sources, and further improved via model distillation. Experiments with single and multi channel mixtures show that the teacher-student training resolves the over-separation problem observed in the original MixIT method. Further, the semisupervised performance is comparable to a fully-supervised separation system trained using ten times the amount of supervised data.
    Acoustic Data-Driven Subword Modeling for End-to-End Speech Recognition. (arXiv:2104.09106v2 [cs.CL] UPDATED)
    (2 min) Subword units are commonly used for end-to-end automatic speech recognition (ASR), while a fully acoustic-oriented subword modeling approach is somewhat missing. We propose an acoustic data-driven subword modeling (ADSM) approach that adapts the advantages of several text-based and acoustic-based subword methods into one pipeline. With a fully acoustic-oriented label design and learning process, ADSM produces acoustic-structured subword units and acoustic-matched target sequence for further ASR training. The obtained ADSM labels are evaluated with different end-to-end ASR approaches including CTC, RNN-Transducer and attention models. Experiments on the LibriSpeech corpus show that ADSM clearly outperforms both byte pair encoding (BPE) and pronunciation-assisted subword modeling (PASM) in all cases. Detailed analysis shows that ADSM achieves acoustically more logical word segmentation and more balanced sequence length, and thus, is suitable for both time-synchronous and label-synchronous models. We also briefly describe how to apply acoustic-based subword regularization and unseen text segmentation using ADSM.
    OCHADAI-KYOTO at SemEval-2021 Task 1: Enhancing Model Generalization and Robustness for Lexical Complexity Prediction. (arXiv:2105.05535v3 [cs.CL] UPDATED)
    (2 min) We propose an ensemble model for predicting the lexical complexity of words and multiword expressions (MWEs). The model receives as input a sentence with a target word or MWEand outputs its complexity score. Given that a key challenge with this task is the limited size of annotated data, our model relies on pretrained contextual representations from different state-of-the-art transformer-based language models (i.e., BERT and RoBERTa), and on a variety of training methods for further enhancing model generalization and robustness:multi-step fine-tuning and multi-task learning, and adversarial training. Additionally, we propose to enrich contextual representations by adding hand-crafted features during training. Our model achieved competitive results and ranked among the top-10 systems in both sub-tasks.
    Pre-Trained Models: Past, Present and Future. (arXiv:2106.07139v2 [cs.AI] UPDATED)
    (2 min) Large-scale pre-trained models (PTMs) such as BERT and GPT have recently achieved great success and become a milestone in the field of artificial intelligence (AI). Owing to sophisticated pre-training objectives and huge model parameters, large-scale PTMs can effectively capture knowledge from massive labeled and unlabeled data. By storing knowledge into huge parameters and fine-tuning on specific tasks, the rich knowledge implicitly encoded in huge parameters can benefit a variety of downstream tasks, which has been extensively demonstrated via experimental verification and empirical analysis. It is now the consensus of the AI community to adopt PTMs as backbone for downstream tasks rather than learning models from scratch. In this paper, we take a deep look into the history of pre-training, especially its special relation with transfer learning and self-supervised learning, to reveal the crucial position of PTMs in the AI development spectrum. Further, we comprehensively review the latest breakthroughs of PTMs. These breakthroughs are driven by the surge of computational power and the increasing availability of data, towards four important directions: designing effective architectures, utilizing rich contexts, improving computational efficiency, and conducting interpretation and theoretical analysis. Finally, we discuss a series of open problems and research directions of PTMs, and hope our view can inspire and advance the future study of PTMs.
    The Multilingual TEDx Corpus for Speech Recognition and Translation. (arXiv:2102.01757v2 [cs.CL] UPDATED)
    (2 min) We present the Multilingual TEDx corpus, built to support speech recognition (ASR) and speech translation (ST) research across many non-English source languages. The corpus is a collection of audio recordings from TEDx talks in 8 source languages. We segment transcripts into sentences and align them to the source-language audio and target-language translations. The corpus is released along with open-sourced code enabling extension to new talks and languages as they become available. Our corpus creation methodology can be applied to more languages than previous work, and creates multi-way parallel evaluation sets. We provide baselines in multiple ASR and ST settings, including multilingual models to improve translation performance for low-resource language pairs.
    Consistency Regularization for Cross-Lingual Fine-Tuning. (arXiv:2106.08226v1 [cs.CL])
    (2 min) Fine-tuning pre-trained cross-lingual language models can transfer task-specific supervision from one language to the others. In this work, we propose to improve cross-lingual fine-tuning with consistency regularization. Specifically, we use example consistency regularization to penalize the prediction sensitivity to four types of data augmentations, i.e., subword sampling, Gaussian noise, code-switch substitution, and machine translation. In addition, we employ model consistency to regularize the models trained with two augmented versions of the same training set. Experimental results on the XTREME benchmark show that our method significantly improves cross-lingual fine-tuning across various tasks, including text classification, question answering, and sequence labeling.
    Can BERT Dig It? -- Named Entity Recognition for Information Retrieval in the Archaeology Domain. (arXiv:2106.07742v1 [cs.IR])
    (2 min) The amount of archaeological literature is growing rapidly. Until recently, these data were only accessible through metadata search. We implemented a text retrieval engine for a large archaeological text collection ($\sim 658$ Million words). In archaeological IR, domain-specific entities such as locations, time periods, and artefacts, play a central role. This motivated the development of a named entity recognition (NER) model to annotate the full collection with archaeological named entities. In this paper, we present ArcheoBERTje, a BERT model pre-trained on Dutch archaeological texts. We compare the model's quality and output on a Named Entity Recognition task to a generic multilingual model and a generic Dutch model. We also investigate ensemble methods for combining multiple BERT models, and combining the best BERT model with a domain thesaurus using Conditional Random Fields (CRF). We find that ArcheoBERTje outperforms both the multilingual and Dutch model significantly with a smaller standard deviation between runs, reaching an average F1 score of 0.735. The model also outperforms ensemble methods combining the three models. Combining ArcheoBERTje predictions and explicit domain knowledge from the thesaurus did not increase the F1 score. We quantitatively and qualitatively analyse the differences between the vocabulary and output of the BERT models on the full collection and provide some valuable insights in the effect of fine-tuning for specific domains. Our results indicate that for a highly specific text domain such as archaeology, further pre-training on domain-specific data increases the model's quality on NER by a much larger margin than shown for other domains in the literature, and that domain-specific pre-training makes the addition of domain knowledge from a thesaurus unnecessary.
    Dialectal Speech Recognition and Translation of Swiss German Speech to Standard German Text: Microsoft's Submission to SwissText 2021. (arXiv:2106.08126v1 [eess.AS])
    (2 min) This paper describes the winning approach in the public SwissText 2021 competition on dialect recognition and translation of Swiss German speech to standard German text. Swiss German refers to the multitude of Alemannic dialects spoken in the German-speaking parts of Switzerland. Swiss German differs significantly from standard German in pronunciation, word inventory and grammar. It is mostly incomprehensible to native German speakers. Moreover, it lacks a standardized written script. To solve the challenging task, we propose a hybrid automatic speech recognition system with a lexicon that incorporates translations, a 1st pass language model that deals with Swiss German particularities, a transfer-learned acoustic model and a strong neural language model for 2nd pass rescoring. Our submission reaches 46.04% BLEU on a blind conversational test set and outperforms the second best competitor by a 12% relative margin.
    Natural Language Adversarial Defense through Synonym Encoding. (arXiv:1909.06723v4 [cs.CL] UPDATED)
    (2 min) In the area of natural language processing, deep learning models are recently known to be vulnerable to various types of adversarial perturbations, but relatively few works are done on the defense side. Especially, there exists few effective defense method against the successful synonym substitution based attacks that preserve the syntactic structure and semantic information of the original text while fooling the deep learning models. We contribute in this direction and propose a novel adversarial defense method called Synonym Encoding Method (SEM). Specifically, SEM inserts an encoder before the input layer of the target model to map each cluster of synonyms to a unique encoding and trains the model to eliminate possible adversarial perturbations without modifying the network architecture or adding extra data. Extensive experiments demonstrate that SEM can effectively defend the current synonym substitution based attacks and block the transferability of adversarial examples. SEM is also easy and efficient to scale to large models and big datasets.
    Deriving Word Vectors from Contextualized Language Models using Topic-Aware Mention Selection. (arXiv:2106.07947v1 [cs.CL])
    (2 min) One of the long-standing challenges in lexical semantics consists in learning representations of words which reflect their semantic properties. The remarkable success of word embeddings for this purpose suggests that high-quality representations can be obtained by summarizing the sentence contexts of word mentions. In this paper, we propose a method for learning word representations that follows this basic strategy, but differs from standard word embeddings in two important ways. First, we take advantage of contextualized language models (CLMs) rather than bags of word vectors to encode contexts. Second, rather than learning a word vector directly, we use a topic model to partition the contexts in which words appear, and then learn different topic-specific vectors for each word. Finally, we use a task-specific supervision signal to make a soft selection of the resulting vectors. We show that this simple strategy leads to high-quality word vectors, which are more predictive of semantic properties than word embeddings and existing CLM-based strategies.
    Extracting Training Data from Large Language Models. (arXiv:2012.07805v2 [cs.CR] UPDATED)
    (2 min) It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data. We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.
    Language Tags Matter for Zero-Shot Neural Machine Translation. (arXiv:2106.07930v1 [cs.CL])
    (2 min) Multilingual Neural Machine Translation (MNMT) has aroused widespread interest due to its efficiency. An exciting advantage of MNMT models is that they could also translate between unsupervised (zero-shot) language directions. Language tag (LT) strategies are often adopted to indicate the translation directions in MNMT. In this paper, we demonstrate that the LTs are not only indicators for translation directions but also crucial to zero-shot translation qualities. Unfortunately, previous work tends to ignore the importance of LT strategies. We demonstrate that a proper LT strategy could enhance the consistency of semantic representations and alleviate the off-target issue in zero-shot directions. Experimental results show that by ignoring the source language tag (SLT) and adding the target language tag (TLT) to the encoder, the zero-shot translations could achieve a +8 BLEU score difference over other LT strategies in IWSLT17, Europarl, TED talks translation tasks.
    Unsupervised Abstractive Opinion Summarization by Generating Sentences with Tree-Structured Topic Guidance. (arXiv:2106.08007v1 [cs.CL])
    (2 min) This paper presents a novel unsupervised abstractive summarization method for opinionated texts. While the basic variational autoencoder-based models assume a unimodal Gaussian prior for the latent code of sentences, we alternate it with a recursive Gaussian mixture, where each mixture component corresponds to the latent code of a topic sentence and is mixed by a tree-structured topic distribution. By decoding each Gaussian component, we generate sentences with tree-structured topic guidance, where the root sentence conveys generic content, and the leaf sentences describe specific topics. Experimental results demonstrate that the generated topic sentences are appropriate as a summary of opinionated texts, which are more informative and cover more input contents than those generated by the recent unsupervised summarization model (Bra\v{z}inskas et al., 2020). Furthermore, we demonstrate that the variance of latent Gaussians represents the granularity of sentences, analogous to Gaussian word embedding (Vilnis and McCallum, 2015).
    StockBabble: A Conversational Financial Agent to support Stock Market Investors. (arXiv:2106.08298v1 [cs.HC])
    (2 min) We introduce StockBabble, a conversational agent designed to support understanding and engagement with the stock market. StockBabble's value and novelty is in its ability to empower retail investors -- many of which may be new to investing -- and supplement their informational needs using a user-friendly agent. Users have the ability to query information on companies to retrieve a general and financial overview of a stock, including accessing the latest news and trading recommendations. They can also request charts which contain live prices and technical investment indicators, and add shares to a personal portfolio to allow performance monitoring over time. To evaluate our agent's potential, we conducted a user study with 15 participants. In total, 73% (11/15) of respondents said that they felt more confident in investing after using StockBabble, and all 15 would consider recommending it to others. These results are encouraging and suggest a wider appeal for such agents. Moreover, we believe this research can help to inform the design and development of future intelligent, financial personal assistants.
    Graph-based Label Propagation for Semi-Supervised Speaker Identification. (arXiv:2106.08207v1 [cs.SD])
    (2 min) Speaker identification in the household scenario (e.g., for smart speakers) is typically based on only a few enrollment utterances but a much larger set of unlabeled data, suggesting semisupervised learning to improve speaker profiles. We propose a graph-based semi-supervised learning approach for speaker identification in the household scenario, to leverage the unlabeled speech samples. In contrast to most of the works in speaker recognition that focus on speaker-discriminative embeddings, this work focuses on speaker label inference (scoring). Given a pre-trained embedding extractor, graph-based learning allows us to integrate information about both labeled and unlabeled utterances. Considering each utterance as a graph node, we represent pairwise utterance similarity scores as edge weights. Graphs are constructed per household, and speaker identities are propagated to unlabeled nodes to optimize a global consistency criterion. We show in experiments on the VoxCeleb dataset that this approach makes effective use of unlabeled data and improves speaker identification accuracy compared to two state-of-the-art scoring methods as well as their semi-supervised variants based on pseudo-labels.
    Incorporating Word Sense Disambiguation in Neural Language Models. (arXiv:2106.07967v1 [cs.CL])
    (2 min) We present two supervised (pre-)training methods to incorporate gloss definitions from lexical resources into neural language models (LMs). The training improves our models' performance for Word Sense Disambiguation (WSD) but also benefits general language understanding tasks while adding almost no parameters. We evaluate our techniques with seven different neural LMs and find that XLNet is more suitable for WSD than BERT. Our best-performing methods exceeds state-of-the-art WSD techniques on the SemCor 3.0 dataset by 0.5% F1 and increase BERT's performance on the GLUE benchmark by 1.1% on average.
    Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup. (arXiv:2101.06983v2 [cs.LG] CROSS LISTED)
    (2 min) Contrastive learning has been applied successfully to learn vector representations of text. Previous research demonstrated that learning high-quality representations benefits from batch-wise contrastive loss with a large number of negatives. In practice, the technique of in-batch negative is used, where for each example in a batch, other batch examples' positives will be taken as its negatives, avoiding encoding extra negatives. This, however, still conditions each example's loss on all batch examples and requires fitting the entire large batch into GPU memory. This paper introduces a gradient caching technique that decouples backpropagation between contrastive loss and the encoder, removing encoder backward pass data dependency along the batch dimension. As a result, gradients can be computed for one subset of the batch at a time, leading to almost constant memory usage.
    Using heterogeneity in semi-supervised transcription hypotheses to improve code-switched speech recognition. (arXiv:2106.07699v1 [cs.CL])
    (2 min) Modeling code-switched speech is an important problem in automatic speech recognition (ASR). Labeled code-switched data are rare, so monolingual data are often used to model code-switched speech. These monolingual data may be more closely matched to one of the languages in the code-switch pair. We show that such asymmetry can bias prediction toward the better-matched language and degrade overall model performance. To address this issue, we propose a semi-supervised approach for code-switched ASR. We consider the case of English-Mandarin code-switching, and the problem of using monolingual data to build bilingual "transcription models'' for annotation of unlabeled code-switched data. We first build multiple transcription models so that their individual predictions are variously biased toward either English or Mandarin. We then combine these biased transcriptions using confidence-based selection. This strategy generates a superior transcript for semi-supervised training, and obtains a 19% relative improvement compared to a semi-supervised system that relies on a transcription model built with only the best-matched monolingual data.
    Reasoning Over Virtual Knowledge Bases With Open Predicate Relations. (arXiv:2102.07043v2 [cs.AI] UPDATED)
    (2 min) We present the Open Predicate Query Language (OPQL); a method for constructing a virtual KB (VKB) trained entirely from text. Large Knowledge Bases (KBs) are indispensable for a wide-range of industry applications such as question answering and recommendation. Typically, KBs encode world knowledge in a structured, readily accessible form derived from laborious human annotation efforts. Unfortunately, while they are extremely high precision, KBs are inevitably highly incomplete and automated methods for enriching them are far too inaccurate. Instead, OPQL constructs a VKB by encoding and indexing a set of relation mentions in a way that naturally enables reasoning and can be trained without any structured supervision. We demonstrate that OPQL outperforms prior VKB methods on two different KB reasoning tasks and, additionally, can be used as an external memory integrated into a language model (OPQL-LM) leading to improvements on two open-domain question answering tasks.
    CoDERT: Distilling Encoder Representations with Co-learning for Transducer-based Speech Recognition. (arXiv:2106.07734v1 [cs.CL])
    (2 min) We propose a simple yet effective method to compress an RNN-Transducer (RNN-T) through the well-known knowledge distillation paradigm. We show that the transducer's encoder outputs naturally have a high entropy and contain rich information about acoustically similar word-piece confusions. This rich information is suppressed when combined with the lower entropy decoder outputs to produce the joint network logits. Consequently, we introduce an auxiliary loss to distill the encoder logits from a teacher transducer's encoder, and explore training strategies where this encoder distillation works effectively. We find that tandem training of teacher and student encoders with an inplace encoder distillation outperforms the use of a pre-trained and static teacher transducer. We also report an interesting phenomenon we refer to as implicit distillation, that occurs when the teacher and student encoders share the same decoder. Our experiments show 5.37-8.4% relative word error rate reductions (WERR) on in-house test sets, and 5.05-6.18% relative WERRs on LibriSpeech test sets.
    Interactive Learning from Activity Description. (arXiv:2102.07024v2 [cs.CL] UPDATED)
    (2 min) We present a novel interactive learning protocol that enables training request-fulfilling agents by verbally describing their activities. Unlike imitation learning (IL), our protocol allows the teaching agent to provide feedback in a language that is most appropriate for them. Compared with reward in reinforcement learning (RL), the description feedback is richer and allows for improved sample complexity. We develop a probabilistic framework and an algorithm that practically implements our protocol. Empirical results in two challenging request-fulfilling problems demonstrate the strengths of our approach: compared with RL baselines, it is more sample-efficient; compared with IL baselines, it achieves competitive success rates without requiring the teaching agent to be able to demonstrate the desired behavior using the learning agent's actions. Apart from empirical evaluation, we also provide theoretical guarantees for our algorithm under certain assumptions about the teacher and the environment.
    Constraining Linear-chain CRFs to Regular Languages. (arXiv:2106.07306v2 [cs.LG] UPDATED)
    (2 min) In structured prediction, a major challenge for models is to represent the interdependencies within their output structures. For the common case where outputs are structured as a sequence, linear-chain conditional random fields (CRFs) are a widely used model class which can learn local dependencies in output sequences. However, the CRF's Markov assumption makes it impossible for these models to capture nonlocal dependencies, and standard CRFs are unable to respect nonlocal constraints of the data (such as global arity constraints on output labels). We present a generalization of CRFs that can enforce a broad class of constraints, including nonlocal ones, by specifying the space of possible output structures as a regular language $\mathcal{L}$. The resulting regular-constrained CRF (RegCCRF) has the same formal properties as a standard CRF, but assigns zero probability to all label sequences not in $\mathcal{L}$. Notably, RegCCRFs can incorporate their constraints during training, while related models only enforce constraints during decoding. We prove that constrained training is never worse than constrained decoding, and show using synthetic data that it can be substantially better in practice. Additionally, we demonstrate a practical benefit on downstream tasks by incorporating a RegCCRF into a deep neural model for semantic role labeling, exceeding state-of-the-art results on a standard dataset.
    Maximum Spanning Trees Are Invariant to Temperature Scaling in Graph-based Dependency Parsing. (arXiv:2106.08159v1 [cs.CL])
    (2 min) Modern graph-based syntactic dependency parsers operate by predicting, for each token within a sentence, a probability distribution over its possible syntactic heads (i.e., all other tokens) and then extracting a maximum spanning tree from the resulting log-probabilities. Nowadays, virtually all such parsers utilize deep neural networks and may thus be susceptible to miscalibration (in particular, overconfident predictions). In this paper, we prove that temperature scaling, a popular technique for post-hoc calibration of neural networks, cannot change the output of the aforementioned procedure. We conclude that other techniques are needed to tackle miscalibration in graph-based dependency parsers in a way that improves parsing accuracy.
    Direction is what you need: Improving Word Embedding Compression in Large Language Models. (arXiv:2106.08181v1 [cs.CL])
    (2 min) The adoption of Transformer-based models in natural language processing (NLP) has led to great success using a massive number of parameters. However, due to deployment constraints in edge devices, there has been a rising interest in the compression of these models to improve their inference time and memory footprint. This paper presents a novel loss objective to compress token embeddings in the Transformer-based models by leveraging an AutoEncoder architecture. More specifically, we emphasize the importance of the direction of compressed embeddings with respect to original uncompressed embeddings. The proposed method is task-agnostic and does not require further language modeling pre-training. Our method significantly outperforms the commonly used SVD-based matrix-factorization approach in terms of initial language model Perplexity. Moreover, we evaluate our proposed approach over SQuAD v1.1 dataset and several downstream tasks from the GLUE benchmark, where we also outperform the baseline in most scenarios. Our code is public.
    Determinantal Beam Search. (arXiv:2106.07400v2 [cs.CL] UPDATED)
    (2 min) Beam search is a go-to strategy for decoding neural sequence models. The algorithm can naturally be viewed as a subset optimization problem, albeit one where the corresponding set function does not reflect interactions between candidates. Empirically, this leads to sets often exhibiting high overlap, e.g., strings may differ by only a single word. Yet in use-cases that call for multiple solutions, a diverse or representative set is often desired. To address this issue, we propose a reformulation of beam search, which we call determinantal beam search. Determinantal beam search has a natural relationship to determinantal point processes (DPPs), models over sets that inherently encode intra-set interactions. By posing iterations in beam search as a series of subdeterminant maximization problems, we can turn the algorithm into a diverse subset selection process. In a case study, we use the string subsequence kernel to explicitly encourage n-gram coverage in text generated from a sequence model. We observe that our algorithm offers competitive performance against other diverse set generation strategies in the context of language generation, while providing a more general approach to optimizing for diversity.
    Adaptive Margin Circle Loss for Speaker Verification. (arXiv:2106.08004v1 [cs.SD])
    (2 min) Deep-Neural-Network (DNN) based speaker verification sys-tems use the angular softmax loss with margin penalties toenhance the intra-class compactness of speaker embeddings,which achieved remarkable performance. In this paper, we pro-pose a novel angular loss function called adaptive margin cir-cle loss for speaker verification. The stage-based margin andchunk-based margin are applied to improve the angular discrim-ination of circle loss on the training set. The analysis on gradi-ents shows that, compared with the previous angular loss likeAdditive Margin Softmax(Am-Softmax), circle loss has flexi-ble optimization and definite convergence status. Experimentsare carried out on the Voxceleb and SITW. By applying adap-tive margin circle loss, our best system achieves 1.31%EER onVoxceleb1 and 2.13% on SITW core-core.
    Alternate Endings: Improving Prosody for Incremental Neural TTS with Predicted Future Text Input. (arXiv:2102.09914v2 [cs.CL] UPDATED)
    (2 min) The prosody of a spoken word is determined by its surrounding context. In incremental text-to-speech synthesis, where the synthesizer produces an output before it has access to the complete input, the full context is often unknown which can result in a loss of naturalness in the synthesized speech. In this paper, we investigate whether the use of predicted future text can attenuate this loss. We compare several test conditions of next future word: (a) unknown (zero-word), (b) language model predicted, (c) randomly predicted and (d) ground-truth. We measure the prosodic features (pitch, energy and duration) and find that predicted text provides significant improvements over a zero-word lookahead, but only slight gains over random-word lookahead. We confirm these results with a perceptive test.
    Unbiased Sentence Encoder For Large-Scale Multi-lingual Search Engines. (arXiv:2106.07719v1 [cs.CL])
    (2 min) In this paper, we present a multi-lingual sentence encoder that can be used in search engines as a query and document encoder. This embedding enables a semantic similarity score between queries and documents that can be an important feature in document ranking and relevancy. To train such a customized sentence encoder, it is beneficial to leverage users search data in the form of query-document clicked pairs however, we must avoid relying too much on search click data as it is biased and does not cover many unseen cases. The search data is heavily skewed towards short queries and for long queries is small and often noisy. The goal is to design a universal multi-lingual encoder that works for all cases and covers both short and long queries. We select a number of public NLI datasets in different languages and translation data and together with user search data we train a language model using a multi-task approach. A challenge is that these datasets are not homogeneous in terms of content, size and the balance ratio. While the public NLI datasets are usually two-sentence based with the same portion of positive and negative pairs, the user search data can contain multi-sentence documents and only positive pairs. We show how multi-task training enables us to leverage all these datasets and exploit knowledge sharing across these tasks.
    The Possible, the Plausible, and the Desirable: Event-Based Modality Detection for Language Processing. (arXiv:2106.08037v1 [cs.CL])
    (2 min) Modality is the linguistic ability to describe events with added information such as how desirable, plausible, or feasible they are. Modality is important for many NLP downstream tasks such as the detection of hedging, uncertainty, speculation, and more. Previous studies that address modality detection in NLP often restrict modal expressions to a closed syntactic class, and the modal sense labels are vastly different across different studies, lacking an accepted standard. Furthermore, these senses are often analyzed independently of the events that they modify. This work builds on the theoretical foundations of the Georgetown Gradable Modal Expressions (GME) work by Rubinstein et al. (2013) to propose an event-based modality detection task where modal expressions can be words of any syntactic class and sense labels are drawn from a comprehensive taxonomy which harmonizes the modal concepts contributed by the different studies. We present experiments on the GME corpus aiming to detect and classify fine-grained modal concepts and associate them with their modified events. We show that detecting and classifying modal expressions is not only feasible, but also improves the detection of modal events in their own right.
    PairConnect: A Compute-Efficient MLP Alternative to Attention. (arXiv:2106.08235v1 [cs.LG])
    (2 min) Transformer models have demonstrated superior performance in natural language processing. The dot product self-attention in Transformer allows us to model interactions between words. However, this modeling comes with significant computational overhead. In this work, we revisit the memory-compute trade-off associated with Transformer, particularly multi-head attention, and show a memory-heavy but significantly more compute-efficient alternative to Transformer. Our proposal, denoted as PairConnect, a multilayer perceptron (MLP), models the pairwise interaction between words by explicit pairwise word embeddings. As a result, PairConnect substitutes self dot product with a simple embedding lookup. We show mathematically that despite being an MLP, our compute-efficient PairConnect is strictly more expressive than Transformer. Our experiment on language modeling tasks suggests that PairConnect could achieve comparable results with Transformer while reducing the computational cost associated with inference significantly.
    Semantic Representation and Inference for NLP. (arXiv:2106.08117v1 [cs.CL])
    (2 min) Semantic representation and inference is essential for Natural Language Processing (NLP). The state of the art for semantic representation and inference is deep learning, and particularly Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and transformer Self-Attention models. This thesis investigates the use of deep learning for novel semantic representation and inference, and makes contributions in the following three areas: creating training data, improving semantic representations and extending inference learning. In terms of creating training data, we contribute the largest publicly available dataset of real-life factual claims for the purpose of automatic claim verification (MultiFC), and we present a novel inference model composed of multi-scale CNNs with different kernel sizes that learn from external sources to infer fact checking labels. In terms of improving semantic representations, we contribute a novel model that captures non-compositional semantic indicators. By definition, the meaning of a non-compositional phrase cannot be inferred from the individual meanings of its composing words (e.g., hot dog). Motivated by this, we operationalize the compositionality of a phrase contextually by enriching the phrase representation with external word embeddings and knowledge graphs. Finally, in terms of inference learning, we propose a series of novel deep learning architectures that improve inference by using syntactic dependencies, by ensembling role guided attention heads, incorporating gating layers, and concatenating multiple heads in novel and effective ways. This thesis consists of seven publications (five published and two under review).
    Biomedical Entity Linking with Contrastive Context Matching. (arXiv:2106.07583v2 [cs.CL] UPDATED)
    (2 min) We introduce BioCoM, a contrastive learning framework for biomedical entity linking that uses only two resources: a small-sized dictionary and a large number of raw biomedical articles. Specifically, we build the training instances from raw PubMed articles by dictionary matching and use them to train a context-aware entity linking model with contrastive learning. We predict the normalized biomedical entity at inference time through a nearest-neighbor search. Results found that BioCoM substantially outperforms state-of-the-art models, especially in low-resource settings, by effectively using the context of the entities.
    Overcoming Domain Mismatch in Low Resource Sequence-to-Sequence ASR Models using Hybrid Generated Pseudotranscripts. (arXiv:2106.07716v1 [cs.CL])
    (2 min) Sequence-to-sequence (seq2seq) models are competitive with hybrid models for automatic speech recognition (ASR) tasks when large amounts of training data are available. However, data sparsity and domain adaptation are more problematic for seq2seq models than their hybrid counterparts. We examine corpora of five languages from the IARPA MATERIAL program where the transcribed data is conversational telephone speech (CTS) and evaluation data is broadcast news (BN). We show that there is a sizable initial gap in such a data condition between hybrid and seq2seq models, and the hybrid model is able to further improve through the use of additional language model (LM) data. We use an additional set of untranscribed data primarily in the BN domain for semisupervised training. In semisupervised training, a seed model trained on transcribed data generates hypothesized transcripts for unlabeled domain-matched data for further training. By using a hybrid model with an expanded language model for pseudotranscription, we are able to improve our seq2seq model from an average word error rate (WER) of 66.7% across all five languages to 29.0% WER. While this puts the seq2seq model at a competitive operating point, hybrid models are still able to use additional LM data to maintain an advantage.
    ARTA: Collection and Classification of Ambiguous Requests and Thoughtful Actions. (arXiv:2106.07999v1 [cs.CL])
    (2 min) Human-assisting systems such as dialogue systems must take thoughtful, appropriate actions not only for clear and unambiguous user requests, but also for ambiguous user requests, even if the users themselves are not aware of their potential requirements. To construct such a dialogue agent, we collected a corpus and developed a model that classifies ambiguous user requests into corresponding system actions. In order to collect a high-quality corpus, we asked workers to input antecedent user requests whose pre-defined actions could be regarded as thoughtful. Although multiple actions could be identified as thoughtful for a single user request, annotating all combinations of user requests and system actions is impractical. For this reason, we fully annotated only the test data and left the annotation of the training data incomplete. In order to train the classification model on such training data, we applied the positive/unlabeled (PU) learning method, which assumes that only a part of the data is labeled with positive examples. The experimental results show that the PU learning method achieved better performance than the general positive/negative (PN) learning method to classify thoughtful actions given an ambiguous user request.
    Keyword Transformer: A Self-Attention Model for Keyword Spotting. (arXiv:2104.00769v3 [eess.AS] UPDATED)
    (2 min) The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12 and 35-command tasks respectively.
    CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark. (arXiv:2106.08087v1 [cs.CL])
    (2 min) Artificial Intelligence (AI), along with the recent progress in biomedical language understanding, is gradually changing medical practice. With the development of biomedical language understanding benchmarks, AI applications are widely used in the medical field. However, most benchmarks are limited to English, which makes it challenging to replicate many of the successes in English for other languages. To facilitate research in this direction, we collect real-world biomedical data and present the first Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark: a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification, and an associated online platform for model evaluation, comparison, and analysis. To establish evaluation on these tasks, we report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform by far worse than the human ceiling. Our benchmark is released at \url{https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414&lang=en-us}.
    Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering. (arXiv:2010.11066v3 [cs.CL] UPDATED)
    (2 min) Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow given the speech utterances and text corpora. Different from traditional text question answering (QA) tasks, SCQA involves audio signal processing, passage comprehension, and contextual understanding. However, ASR systems introduce unexpected noisy signals to the transcriptions, which result in performance degradation on SCQA. To overcome the problem, we propose CADNet, a novel contextualized attention-based distillation approach, which applies both cross-attention and self-attention to obtain ASR-robust contextualized embedding representations of the passage and dialogue history for performance improvements. We also introduce the spoken conventional knowledge distillation framework to distill the ASR-robust knowledge from the estimated probabilities of the teacher model to the student. We conduct extensive experiments on the Spoken-CoQA dataset and demonstrate that our approach achieves remarkable performance in this task.
    CausalNLP: A Practical Toolkit for Causal Inference with Text. (arXiv:2106.08043v1 [cs.CL])
    (2 min) The vast majority of existing methods and systems for causal inference assume that all variables under consideration are categorical or numerical (e.g., gender, price, blood pressure, enrollment). In this paper, we present CausalNLP, a toolkit for inferring causality from observational data that includes text in addition to traditional numerical and categorical variables. CausalNLP employs the use of meta-learners for treatment effect estimation and supports using raw text and its linguistic properties as both a treatment and a "controlled-for" variable (e.g., confounder). The library is open-source and available at: https://github.com/amaiya/causalnlp.
    Targeted Data Acquisition for Evolving Negotiation Agents. (arXiv:2106.07728v1 [cs.AI])
    (2 min) Successful negotiators must learn how to balance optimizing for self-interest and cooperation. Yet current artificial negotiation agents often heavily depend on the quality of the static datasets they were trained on, limiting their capacity to fashion an adaptive response balancing self-interest and cooperation. For this reason, we find that these agents can achieve either high utility or cooperation, but not both. To address this, we introduce a targeted data acquisition framework where we guide the exploration of a reinforcement learning agent using annotations from an expert oracle. The guided exploration incentivizes the learning agent to go beyond its static dataset and develop new negotiation strategies. We show that this enables our agents to obtain higher-reward and more Pareto-optimal solutions when negotiating with both simulated and human partners compared to standard supervised learning and reinforcement learning methods. This trend additionally holds when comparing agents using our targeted data acquisition framework to variants of agents trained with a mix of supervised learning and reinforcement learning, or to agents using tailored reward functions that explicitly optimize for utility and Pareto-optimality.
    Dissecting User-Perceived Latency of On-Device E2E Speech Recognition. (arXiv:2104.02207v2 [cs.SD] UPDATED)
    (2 min) As speech-enabled devices such as smartphones and smart speakers become increasingly ubiquitous, there is growing interest in building automatic speech recognition (ASR) systems that can run directly on-device; end-to-end (E2E) speech recognition models such as recurrent neural network transducers and their variants have recently emerged as prime candidates for this task. Apart from being accurate and compact, such systems need to decode speech with low user-perceived latency (UPL), producing words as soon as they are spoken. This work examines the impact of various techniques - model architectures, training criteria, decoding hyperparameters, and endpointer parameters - on UPL. Our analyses suggest that measures of model size (parameters, input chunk sizes), or measures of computation (e.g., FLOPS, RTF) that reflect the model's ability to process input frames are not always strongly correlated with observed UPL. Thus, conventional algorithmic latency measurements might be inadequate in accurately capturing latency observed when models are deployed on embedded devices. Instead, we find that factors affecting token emission latency, and endpointing behavior have a larger impact on UPL. We achieve the best trade-off between latency and word error rate when performing ASR jointly with endpointing, while utilizing the recently proposed alignment regularization mechanism.
    Equivalence of Segmental and Neural Transducer Modeling: A Proof of Concept. (arXiv:2104.06104v2 [cs.CL] UPDATED)
    (2 min) With the advent of direct models in automatic speech recognition (ASR), the formerly prevalent frame-wise acoustic modeling based on hidden Markov models (HMM) diversified into a number of modeling architectures like encoder-decoder attention models, transducer models and segmental models (direct HMM). While transducer models stay with a frame-level model definition, segmental models are defined on the level of label segments directly. While (soft-)attention-based models avoid explicit alignment, transducer and segmental approach internally do model alignment, either by segment hypotheses or, more implicitly, by emitting so-called blank symbols. In this work, we prove that the widely used class of RNN-Transducer models and segmental models (direct HMM) are equivalent and therefore show equal modeling power. It is shown that blank probabilities translate into segment length probabilities and vice versa. In addition, we provide initial experiments investigating decoding and beam-pruning, comparing time-synchronous and label-/segment-synchronous search strategies and their properties using the same underlying model.
    An Automated Quality Evaluation Framework of Psychotherapy Conversations with Local Quality Estimates. (arXiv:2106.07922v1 [cs.CL])
    (2 min) Computational approaches for assessing the quality of conversation-based psychotherapy, such as Cognitive Behavioral Therapy (CBT) and Motivational Interviewing (MI), have been developed recently to support quality assurance and clinical training. However, due to the long session lengths and limited modeling resources, computational methods largely rely on frequency-based lexical features or distribution of dialogue acts. In this work, we propose a hierarchical framework to automatically evaluate the quality of a CBT interaction. We divide each psychotherapy session into conversation segments and input those into a BERT-based model to produce segment embeddings. We first fine-tune BERT for predicting segment-level (local) quality scores and then use segment embeddings as lower-level input to a Bidirectional LSTM-based neural network to predict session-level (global) quality estimates. In particular, the segment-level quality scores are initialized with the session-level scores and we model the global quality as a function of the local quality scores to achieve the accurate segment-level quality estimates. These estimated segment-level scores benefit theBERT fine-tuning and in learning better segment embeddings. We evaluate the proposed framework on data drawn from real-world CBT clinical session recordings to predict multiple session-level behavior codes. The results indicate that our approach leads to improved evaluation accuracy for most codes in both regression and classification tasks.
    Modeling morphology with Linear Discriminative Learning: considerations and design choices. (arXiv:2106.07936v1 [cs.CL])
    (2 min) This study addresses a series of methodological questions that arise when modeling inflectional morphology with Linear Discriminative Learning. Taking the semi-productive German noun system as example, we illustrate how decisions made about the representation of form and meaning influence model performance. We clarify that for modeling frequency effects in learning, it is essential to make use of incremental learning rather than the endstate of learning. We also discuss how the model can be set up to approximate the learning of inflected words in context. In addition, we illustrate how in this approach the wug task can be modeled in considerable detail. In general, the model provides an excellent memory for known words, but appropriately shows more limited performance for unseen data, in line with the semi-productivity of German noun inflection and generalization performance of native German speakers.
    Sequence-Level Training for Non-Autoregressive Neural Machine Translation. (arXiv:2106.08122v1 [cs.CL])
    (2 min) In recent years, Neural Machine Translation (NMT) has achieved notable results in various translation tasks. However, the word-by-word generation manner determined by the autoregressive mechanism leads to high translation latency of the NMT and restricts its low-latency applications. Non-Autoregressive Neural Machine Translation (NAT) removes the autoregressive mechanism and achieves significant decoding speedup through generating target words independently and simultaneously. Nevertheless, NAT still takes the word-level cross-entropy loss as the training objective, which is not optimal because the output of NAT cannot be properly evaluated due to the multimodality problem. In this paper, we propose using sequence-level training objectives to train NAT models, which evaluate the NAT outputs as a whole and correlates well with the real translation quality. Firstly, we propose training NAT models to optimize sequence-level evaluation metrics (e.g., BLEU) based on several novel reinforcement algorithms customized for NAT, which outperforms the conventional method by reducing the variance of gradient estimation. Secondly, we introduce a novel training objective for NAT models, which aims to minimize the Bag-of-Ngrams (BoN) difference between the model output and the reference sentence. The BoN training objective is differentiable and can be calculated efficiently without doing any approximations. Finally, we apply a three-stage training strategy to combine these two methods to train the NAT model. We validate our approach on four translation tasks (WMT14 En$\leftrightarrow$De, WMT16 En$\leftrightarrow$Ro), which shows that our approach largely outperforms NAT baselines and achieves remarkable performance on all translation tasks.
    Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python. (arXiv:2106.07799v1 [cs.CL])
    (2 min) Despite impressive success of machine learning algorithms in clinical natural language processing (cNLP), rule-based approaches still have a prominent role. In this paper, we introduce medspaCy, an extensible, open-source cNLP library based on spaCy framework that allows flexible integration of rule-based and machine learning-based algorithms adapted to clinical text. MedspaCy includes a variety of components that meet common cNLP needs such as context analysis and mapping to standard terminologies. By utilizing spaCy's clear and easy-to-use conventions, medspaCy enables development of custom pipelines that integrate easily with other spaCy-based modules. Our toolkit includes several core components and facilitates rapid development of pipelines for clinical text.
    An enriched category theory of language: from syntax to semantics. (arXiv:2106.07890v1 [math.CT])
    (2 min) Given a piece of text, the ability to generate a coherent extension of it implies some sophistication, including a knowledge of grammar and semantics. In this paper, we propose a mathematical framework for passing from probability distributions on extensions of given texts to an enriched category containing semantic information. Roughly speaking, we model probability distributions on texts as a category enriched over the unit interval. Objects of this category are expressions in language and hom objects are conditional probabilities that one expression is an extension of another. This category is syntactical: it describes what goes with what. We then pass to the enriched category of unit interval-valued copresheaves on this syntactical category to find semantic information.
    Improving Paraphrase Detection with the Adversarial Paraphrasing Task. (arXiv:2106.07691v1 [cs.CL])
    (2 min) If two sentences have the same meaning, it should follow that they are equivalent in their inferential properties, i.e., each sentence should textually entail the other. However, many paraphrase datasets currently in widespread use rely on a sense of paraphrase based on word overlap and syntax. Can we teach them instead to identify paraphrases in a way that draws on the inferential properties of the sentences, and is not over-reliant on lexical and syntactic similarities of a sentence pair? We apply the adversarial paradigm to this question, and introduce a new adversarial method of dataset creation for paraphrase identification: the Adversarial Paraphrasing Task (APT), which asks participants to generate semantically equivalent (in the sense of mutually implicative) but lexically and syntactically disparate paraphrases. These sentence pairs can then be used both to test paraphrase identification models (which get barely random accuracy) and then improve their performance. To accelerate dataset generation, we explore automation of APT using T5, and show that the resulting dataset also improves accuracy. We discuss implications for paraphrase detection and release our dataset in the hope of making paraphrase detection models better able to detect sentence-level meaning equivalence.
    SSMix: Saliency-Based Span Mixup for Text Classification. (arXiv:2106.08062v1 [cs.CL])
    (2 min) Data augmentation with mixup has shown to be effective on various computer vision tasks. Despite its great success, there has been a hurdle to apply mixup to NLP tasks since text consists of discrete tokens with variable length. In this work, we propose SSMix, a novel mixup method where the operation is performed on input text rather than on hidden vectors like previous approaches. SSMix synthesizes a sentence while preserving the locality of two original texts by span-based mixing and keeping more tokens related to the prediction relying on saliency information. With extensive experiments, we empirically validate that our method outperforms hidden-level mixup methods on a wide range of text classification benchmarks, including textual entailment, sentiment classification, and question-type classification. Our code is available at https://github.com/clovaai/ssmix.
    Disentangling Syntax and Semantics in the Brain with Deep Networks. (arXiv:2103.01620v2 [cs.CL] UPDATED)
    (2 min) The activations of language transformers like GPT-2 have been shown to linearly map onto brain activity during speech comprehension. However, the nature of these activations remains largely unknown and presumably conflate distinct linguistic classes. Here, we propose a taxonomy to factorize the high-dimensional activations of language models into four combinatorial classes: lexical, compositional, syntactic, and semantic representations. We then introduce a statistical method to decompose, through the lens of GPT-2's activations, the brain activity of 345 subjects recorded with functional magnetic resonance imaging (fMRI) during the listening of ~4.6 hours of narrated text. The results highlight two findings. First, compositional representations recruit a more widespread cortical network than lexical ones, and encompass the bilateral temporal, parietal and prefrontal cortices. Second, contrary to previous claims, syntax and semantics are not associated with separated modules, but, instead, appear to share a common and distributed neural substrate. Overall, this study introduces a versatile framework to isolate, in the brain activity, the distributed representations of linguistic constructs.
    Challenges and Considerations with Code-Mixed NLP for Multilingual Societies. (arXiv:2106.07823v1 [cs.CL])
    (2 min) Multilingualism refers to the high degree of proficiency in two or more languages in the written and oral communication modes. It often results in language mixing, a.k.a. code-mixing, when a multilingual speaker switches between multiple languages in a single utterance of a text or speech. This paper discusses the current state of the NLP research, limitations, and foreseeable pitfalls in addressing five real-world applications for social good crisis management, healthcare, political campaigning, fake news, and hate speech for multilingual societies. We also propose futuristic datasets, models, and tools that can significantly advance the current research in multilingual NLP applications for the societal good. As a representative example, we consider English-Hindi code-mixing but draw similar inferences for other language pairs
    Assessing the Use of Prosody in Constituency Parsing of Imperfect Transcripts. (arXiv:2106.07794v1 [cs.CL])
    (2 min) This work explores constituency parsing on automatically recognized transcripts of conversational speech. The neural parser is based on a sentence encoder that leverages word vectors contextualized with prosodic features, jointly learning prosodic feature extraction with parsing. We assess the utility of the prosody in parsing on imperfect transcripts, i.e. transcripts with automatic speech recognition (ASR) errors, by applying the parser in an N-best reranking framework. In experiments on Switchboard, we obtain 13-15% of the oracle N-best gain relative to parsing the 1-best ASR output, with insignificant impact on word recognition error rate. Prosody provides a significant part of the gain, and analyses suggest that it leads to more grammatical utterances via recovering function words.
    Stacked Acoustic-and-Textual Encoding: Integrating the Pre-trained Models into Speech Translation Encoders. (arXiv:2105.05752v2 [cs.CL] UPDATED)
    (2 min) Encoder pre-training is promising in end-to-end Speech Translation (ST), given the fact that speech-to-translation data is scarce. But ST encoders are not simple instances of Automatic Speech Recognition (ASR) or Machine Translation (MT) encoders. For example, we find that ASR encoders lack the global context representation, which is necessary for translation, whereas MT encoders are not designed to deal with long but locally attentive acoustic sequences. In this work, we propose a Stacked Acoustic-and-Textual Encoding (SATE) method for speech translation. Our encoder begins with processing the acoustic sequence as usual, but later behaves more like an MT encoder for a global representation of the input sequence. In this way, it is straightforward to incorporate the pre-trained models into the system. Also, we develop an adaptor module to alleviate the representation inconsistency between the pre-trained ASR encoder and MT encoder, and develop a multi-teacher knowledge distillation method to preserve the pre-training knowledge. Experimental results on the LibriSpeech En-Fr and MuST-C En-De ST tasks show that our method achieves state-of-the-art BLEU scores of 18.3 and 25.2. To our knowledge, we are the first to develop an end-to-end ST system that achieves comparable or even better BLEU performance than the cascaded ST counterpart when large-scale ASR and MT data is available.
    Knowledge-Rich BERT Embeddings for Readability Assessment. (arXiv:2106.07935v1 [cs.CL])
    (2 min) Automatic readability assessment (ARA) is the task of evaluating the level of ease or difficulty of text documents for a target audience. For researchers, one of the many open problems in the field is to make such models trained for the task show efficacy even for low-resource languages. In this study, we propose an alternative way of utilizing the information-rich embeddings of BERT models through a joint-learning method combined with handcrafted linguistic features for readability assessment. Results show that the proposed method outperforms classical approaches in readability assessment using English and Filipino datasets, and obtaining as high as 12.4% increase in F1 performance. We also show that the knowledge encoded in BERT embeddings can be used as a substitute feature set for low-resource languages like Filipino with limited semantic and syntactic NLP tools to explicitly extract feature values for the task.
    Text Generation with Efficient (Soft) Q-Learning. (arXiv:2106.07704v1 [cs.CL])
    (2 min) Maximum likelihood estimation (MLE) is the predominant algorithm for training text generation models. This paradigm relies on direct supervision examples, which is not applicable to many applications, such as generating adversarial attacks or generating prompts to control language models. Reinforcement learning (RL) on the other hand offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward. Yet previous RL algorithms for text generation, such as policy gradient (on-policy RL) and Q-learning (off-policy RL), are often notoriously inefficient or unstable to train due to the large sequence space and the sparse reward received only at the end of sequences. In this paper, we introduce a new RL formulation for text generation from the soft Q-learning perspective. It further enables us to draw from the latest RL advances, such as path consistency learning, to combine the best of on-/off-policy updates, and learn effectively from sparse reward. We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation. Experiments show our approach consistently outperforms both task-specialized algorithms and the previous RL methods. On standard supervised tasks where MLE prevails, our approach also achieves competitive performance and stability by training text generation from scratch.
    EPICURE Ensemble Pretrained Models for Extracting Cancer Mutations from Literature. (arXiv:2106.07722v1 [cs.CL])
    (2 min) To interpret the genetic profile present in a patient sample, it is necessary to know which mutations have important roles in the development of the corresponding cancer type. Named entity recognition is a core step in the text mining pipeline which facilitates mining valuable cancer information from the scientific literature. However, due to the scarcity of related datasets, previous NER attempts in this domain either suffer from low performance when deep learning based models are deployed, or they apply feature based machine learning models or rule based models to tackle this problem, which requires intensive efforts from domain experts, and limit the model generalization capability. In this paper, we propose EPICURE, an ensemble pre trained model equipped with a conditional random field pattern layer and a span prediction pattern layer to extract cancer mutations from text. We also adopt a data augmentation strategy to expand our training set from multiple datasets. Experimental results on three benchmark datasets show competitive results compared to the baseline models.
  • cs.CV updates on arXiv.org

    AGENT: A Benchmark for Core Psychological Reasoning. (arXiv:2102.12321v3 [cs.AI] UPDATED)
    (2 min) For machine agents to successfully interact with humans in real-world settings, they will need to develop an understanding of human mental life. Intuitive psychology, the ability to reason about hidden mental variables that drive observable actions, comes naturally to people: even pre-verbal infants can tell agents from objects, expecting agents to act efficiently to achieve goals given constraints. Despite recent interest in machine agents that reason about other agents, it is not clear if such agents learn or hold the core psychology principles that drive human reasoning. Inspired by cognitive development studies on intuitive psychology, we present a benchmark consisting of a large dataset of procedurally generated 3D animations, AGENT (Action, Goal, Efficiency, coNstraint, uTility), structured around four scenarios (goal preferences, action efficiency, unobserved constraints, and cost-reward trade-offs) that probe key concepts of core intuitive psychology. We validate AGENT with human-ratings, propose an evaluation protocol emphasizing generalization, and compare two strong baselines built on Bayesian inverse planning and a Theory of Mind neural network. Our results suggest that to pass the designed tests of core intuitive psychology at human levels, a model must acquire or have built-in representations of how agents plan, combining utility computations and core knowledge of objects and physics.
    BEiT: BERT Pre-Training of Image Transformers. (arXiv:2106.08254v1 [cs.CV])
    (2 min) We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.
    Gradient Forward-Propagation for Large-Scale Temporal Video Modelling. (arXiv:2106.08318v1 [cs.CV])
    (2 min) How can neural networks be trained on large-volume temporal data efficiently? To compute the gradients required to update parameters, backpropagation blocks computations until the forward and backward passes are completed. For temporal signals, this introduces high latency and hinders real-time learning. It also creates a coupling between consecutive layers, which limits model parallelism and increases memory consumption. In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time, and we propose mechanisms for temporal integration of information based on different variants of skip connections. We also show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training. The proposed Skip-Sideways achieves low latency training, model parallelism, and, importantly, is capable of extracting temporal features, leading to more stable training and improved performance on real-world action recognition video datasets such as HMDB51, UCF101, and the large-scale Kinetics-600. Finally, we also show that models trained with Skip-Sideways generate better future frames than Sideways models, and hence they can better utilize motion cues.
    Wavelength-based Attributed Deep Neural Network for Underwater Image Restoration. (arXiv:2106.07910v1 [eess.IV])
    (2 min) Underwater images, in general, suffer from low contrast and high color distortions due to the non-uniform attenuation of the light as it propagates through the water. In addition, the degree of attenuation varies with the wavelength resulting in the asymmetric traversing of colors. Despite the prolific works for underwater image restoration (UIR) using deep learning, the above asymmetricity has not been addressed in the respective network engineering. As the first novelty, this paper shows that attributing the right receptive field size (context) based on the traversing range of the color channel may lead to a substantial performance gain for the task of UIR. Further, it is important to suppress the irrelevant multi-contextual features and increase the representational power of the model. Therefore, as a second novelty, we have incorporated an attentive skip mechanism to adaptively refine the learned multi-contextual features. The proposed framework, called Deep WaveNet, is optimized using the traditional pixel-wise and feature-based cost functions. An extensive set of experiments have been carried out to show the efficacy of the proposed scheme over existing best-published literature on benchmark datasets. More importantly, we have demonstrated a comprehensive validation of enhanced images across various high-level vision tasks, e.g., underwater image semantic segmentation, and diver's 2D pose estimation. A sample video to exhibit our real-world performance is available at \url{https://www.youtube.com/watch?v=8qtuegBdfac}.
    Mutation Sensitive Correlation Filter for Real-Time UAV Tracking with Adaptive Hybrid Label. (arXiv:2106.08073v1 [cs.CV])
    (2 min) Unmanned aerial vehicle (UAV) based visual tracking has been confronted with numerous challenges, e.g., object motion and occlusion. These challenges generally introduce unexpected mutations of target appearance and result in tracking failure. However, prevalent discriminative correlation filter (DCF) based trackers are insensitive to target mutations due to a predefined label, which concentrates on merely the centre of the training region. Meanwhile, appearance mutations caused by occlusion or similar objects usually lead to the inevitable learning of wrong information. To cope with appearance mutations, this paper proposes a novel DCF-based method to enhance the sensitivity and resistance to mutations with an adaptive hybrid label, i.e., MSCF. The ideal label is optimized jointly with the correlation filter and remains temporal consistency. Besides, a novel measurement of mutations called mutation threat factor (MTF) is applied to correct the label dynamically. Considerable experiments are conducted on widely used UAV benchmarks. The results indicate that the performance of MSCF tracker surpasses other 26 state-of-the-art DCF-based and deep-based trackers. With a real-time speed of _38 frames/s, the proposed approach is sufficient for UAV tracking commissions.
    Physion: Evaluating Physical Prediction from Vision in Humans and Machines. (arXiv:2106.08261v1 [cs.AI])
    (0 min) While machine learning algorithms excel at many challenging visual tasks, it is unclear that they can make predictions about commonplace real world physical events. Here, we present a visual and physical prediction benchmark that precisely measures this capability. In realistically simulating a wide variety of physical phenomena -- rigid and soft-body collisions, stable multi-object configurations, rolling and sliding, projectile motion -- our dataset presents a more comprehensive challenge than existing benchmarks. Moreover, we have collected human responses for our stimuli so that model predictions can be directly compared to human judgments. We compare an array of algorithms -- varying in their architecture, learning objective, input-output structure, and training data -- on their ability to make diverse physical predictions. We find that graph neural networks with access to the physical state best capture human behavior, whereas among models that receive only visual input, those with object-centric representations or pretraining do best but fall far short of human accuracy. This suggests that extracting physically meaningful representations of scenes is the main bottleneck to achieving human-like visual prediction. We thus demonstrate how our benchmark can identify areas for improvement and measure progress on this key aspect of physical understanding.
    DISCO: Dynamic and Invariant Sensitive Channel Obfuscation for deep neural networks. (arXiv:2012.11025v2 [cs.CV] UPDATED)
    (2 min) Recent deep learning models have shown remarkable performance in image classification. While these deep learning systems are getting closer to practical deployment, the common assumption made about data is that it does not carry any sensitive information. This assumption may not hold for many practical cases, especially in the domain where an individual's personal information is involved, like healthcare and facial recognition systems. We posit that selectively removing features in this latent space can protect the sensitive information and provide a better privacy-utility trade-off. Consequently, we propose DISCO which learns a dynamic and data driven pruning filter to selectively obfuscate sensitive information in the feature space. We propose diverse attack schemes for sensitive inputs \& attributes and demonstrate the effectiveness of DISCO against state-of-the-art methods through quantitative and qualitative evaluation. Finally, we also release an evaluation benchmark dataset of 1 million sensitive representations to encourage rigorous exploration of novel attack schemes.
    Cross-Domain Facial Expression Recognition: A Unified Evaluation Benchmark and Adversarial Graph Learning. (arXiv:2008.00923v7 [cs.CV] UPDATED)
    (3 min) To address the problem of data inconsistencies among different facial expression recognition (FER) datasets, many cross-domain FER methods (CD-FERs) have been extensively devised in recent years. Although each declares to achieve superior performance, fair comparisons are lacking due to the inconsistent choices of the source/target datasets and feature extractors. In this work, we first analyze the performance effect caused by these inconsistent choices, and then re-implement some well-performing CD-FER and recently published domain adaptation algorithms. We ensure that all these algorithms adopt the same source datasets and feature extractors for fair CD-FER evaluations. We find that most of the current leading algorithms use adversarial learning to learn holistic domain-invariant features to mitigate domain shifts. However, these algorithms ignore local features, which are more transferable across different datasets and carry more detailed content for fine-grained adaptation. To address these issues, we integrate graph representation propagation with adversarial learning for cross-domain holistic-local feature co-adaptation by developing a novel adversarial graph representation adaptation (AGRA) framework. Specifically, it first builds two graphs to correlate holistic and local regions within each domain and across different domains, respectively. Then, it extracts holistic-local features from the input image and uses learnable per-class statistical distributions to initialize the corresponding graph nodes. Finally, two stacked graph convolution networks (GCNs) are adopted to propagate holistic-local features within each domain to explore their interaction and across different domains for holistic-local feature co-adaptation. We conduct extensive and fair evaluations on several popular benchmarks and show that the proposed AGRA framework outperforms previous state-of-the-art methods.
    Weakly-supervised High-resolution Segmentation of Mammography Images for Breast Cancer Diagnosis. (arXiv:2106.07049v2 [cs.CV] UPDATED)
    (2 min) In the last few years, deep learning classifiers have shown promising results in image-based medical diagnosis. However, interpreting the outputs of these models remains a challenge. In cancer diagnosis, interpretability can be achieved by localizing the region of the input image responsible for the output, i.e. the location of a lesion. Alternatively, segmentation or detection models can be trained with pixel-wise annotations indicating the locations of malignant lesions. Unfortunately, acquiring such labels is labor-intensive and requires medical expertise. To overcome this difficulty, weakly-supervised localization can be utilized. These methods allow neural network classifiers to output saliency maps highlighting the regions of the input most relevant to the classification task (e.g. malignant lesions in mammograms) using only image-level labels (e.g. whether the patient has cancer or not) during training. When applied to high-resolution images, existing methods produce low-resolution saliency maps. This is problematic in applications in which suspicious lesions are small in relation to the image size. In this work, we introduce a novel neural network architecture to perform weakly-supervised segmentation of high-resolution images. The proposed model selects regions of interest via coarse-level localization, and then performs fine-grained segmentation of those regions. We apply this model to breast cancer diagnosis with screening mammography, and validate it on a large clinically-realistic dataset. Measured by Dice similarity score, our approach outperforms existing methods by a large margin in terms of localization performance of benign and malignant lesions, relatively improving the performance by 39.6% and 20.0%, respectively. Code and the weights of some of the models are available at https://github.com/nyukat/GLAM
    A baseline for semi-supervised learning of efficient semantic segmentation models. (arXiv:2106.07075v2 [cs.CV] UPDATED)
    (2 min) Semi-supervised learning is especially interesting in the dense prediction context due to high cost of pixel-level ground truth. Unfortunately, most such approaches are evaluated on outdated architectures which hamper research due to very slow training and high requirements on GPU RAM. We address this concern by presenting a simple and effective baseline which works very well both on standard and efficient architectures. Our baseline is based on one-way consistency and non-linear geometric and photometric perturbations. We show advantage of perturbing only the student branch and present a plausible explanation of such behaviour. Experiments on Cityscapes and CIFAR-10 demonstrate competitive performance with respect to prior work.
    Mean Embeddings with Test-Time Data Augmentation for Ensembling of Representations. (arXiv:2106.08038v1 [cs.LG])
    (2 min) Averaging predictions over a set of models -- an ensemble -- is widely used to improve predictive performance and uncertainty estimation of deep learning models. At the same time, many machine learning systems, such as search, matching, and recommendation systems, heavily rely on embeddings. Unfortunately, due to misalignment of features of independently trained models, embeddings, cannot be improved with a naive deep ensemble like approach. In this work, we look at the ensembling of representations and propose mean embeddings with test-time augmentation (MeTTA) simple yet well-performing recipe for ensembling representations. Empirically we demonstrate that MeTTA significantly boosts the quality of linear evaluation on ImageNet for both supervised and self-supervised models. Even more exciting, we draw connections between MeTTA, image retrieval, and transformation invariant models. We believe that spreading the success of ensembles to inference higher-quality representations is the important step that will open many new applications of ensembling.
    Compact and adaptive multiplane images for view synthesis. (arXiv:2102.10086v2 [cs.CV] UPDATED)
    (2 min) Recently, learning methods have been designed to create Multiplane Images (MPIs) for view synthesis. While MPIs are extremely powerful and facilitate high quality renderings, a great amount of memory is required, making them impractical for many applications. In this paper, we propose a learning method that optimizes the available memory to render compact and adaptive MPIs. Our MPIs avoid redundant information and take into account the scene geometry to determine the depth sampling.
    Image Feature Information Extraction for Interest Point Detection: A Comprehensive Review. (arXiv:2106.07929v1 [cs.CV])
    (0 min) Interest point detection is one of the most fundamental and critical problems in computer vision and image processing. In this paper, we carry out a comprehensive review on image feature information (IFI) extraction techniques for interest point detection. To systematically introduce how the existing interest point detection methods extract IFI from an input image, we propose a taxonomy of the IFI extraction techniques for interest point detection. According to this taxonomy, we discuss different types of IFI extraction techniques for interest point detection. Furthermore, we identify the main unresolved issues related to the existing IFI extraction techniques for interest point detection and any interest point detection methods that have not been discussed before. The existing popular datasets and evaluation standards are provided and the performances for eighteen state-of-the-art approaches are evaluated and discussed. Moreover, future research directions on IFI extraction techniques for interest point detection are elaborated.
    Traffic Scenario Clustering by Iterative Optimisation of Self-Supervised Networks Using a Random Forest Activation Pattern Similarity. (arXiv:2105.07639v2 [cs.CV] UPDATED)
    (2 min) Traffic scenario categorisation is an essential component of automated driving, for e.\,g., in motion planning algorithms and their validation. Finding new relevant scenarios without handcrafted steps reduce the required resources for the development of autonomous driving dramatically. In this work, a method is proposed to address this challenge by introducing a clustering technique based on a novel data-adaptive similarity measure, called Random Forest Activation Pattern (RFAP) similarity. The RFAP similarity is generated using a tree encoding scheme in a Random Forest algorithm. The clustering method proposed in this work takes into account that there are labelled scenarios available and the information from the labelled scenarios can help to guide the clustering of unlabelled scenarios. It consists of three steps. First, a self-supervised Convolutional Neural Network~(CNN) is trained on all available traffic scenarios using a defined self-supervised objective. Second, the CNN is fine-tuned for classification of the labelled scenarios. Third, using the labelled and unlabelled scenarios an iterative optimisation procedure is performed for clustering. In the third step at each epoch of the iterative optimisation, the CNN is used as a feature generator for an unsupervised Random Forest. The trained forest, in turn, provides the RFAP similarity to adapt iteratively the feature generation process implemented by the CNN. Extensive experiments and ablation studies have been done on the highD dataset. The proposed method shows superior performance compared to baseline clustering techniques.
    Self-Supervised Learning with Kernel Dependence Maximization. (arXiv:2106.08320v1 [stat.ML])
    (2 min) We approach self-supervised learning of image representations from a statistical dependence perspective, proposing Self-Supervised Learning with the Hilbert-Schmidt Independence Criterion (SSL-HSIC). SSL-HSIC maximizes dependence between representations of transformed versions of an image and the image identity, while minimizing the kernelized variance of those features. This self-supervised learning framework yields a new understanding of InfoNCE, a variational lower bound on the mutual information (MI) between different transformations. While the MI itself is known to have pathologies which can result in meaningless representations being learned, its bound is much better behaved: we show that it implicitly approximates SSL-HSIC (with a slightly different regularizer). Our approach also gives us insight into BYOL, since SSL-HSIC similarly learns local neighborhoods of samples. SSL-HSIC allows us to directly optimize statistical dependence in time linear in the batch size, without restrictive data assumptions or indirect mutual information estimators. Trained with or without a target network, SSL-HSIC matches the current state-of-the-art for standard linear evaluation on ImageNet, semi-supervised learning and transfer to other classification and vision tasks such as semantic segmentation, depth estimation and object recognition.
    Overcomplete Representations Against Adversarial Videos. (arXiv:2012.04262v2 [cs.CV] UPDATED)
    (2 min) Adversarial robustness of deep neural networks is an extensively studied problem in the literature and various methods have been proposed to defend against adversarial images. However, only a handful of defense methods have been developed for defending against attacked videos. In this paper, we propose a novel Over-and-Under complete restoration network for Defending against adversarial videos (OUDefend). Most restoration networks adopt an encoder-decoder architecture that first shrinks spatial dimension then expands it back. This approach learns undercomplete representations, which have large receptive fields to collect global information but overlooks local details. On the other hand, overcomplete representations have opposite properties. Hence, OUDefend is designed to balance local and global features by learning those two representations. We attach OUDefend to target video recognition models as a feature restoration block and train the entire network end-to-end. Experimental results show that the defenses focusing on images may be ineffective to videos, while OUDefend enhances robustness against different types of adversarial videos, ranging from additive attacks, multiplicative attacks to physically realizable attacks. Code: https://github.com/shaoyuanlo/OUDefend
    Motion Vector Extrapolation for Video Object Detection. (arXiv:2104.08918v2 [cs.CV] UPDATED)
    (2 min) Despite the continued successes of computationally efficient deep neural network architectures for video object detection, performance continually arrives at the great trilemma of speed versus accuracy versus computational resources (pick two). Current attempts to exploit temporal information in video data to overcome this trilemma are bottlenecked by the state-of-the-art in object detection models. We present, a technique which performs video object detection through the use of off-the-shelf object detectors alongside existing optical flow based motion estimation techniques in parallel. Through a set of experiments on the benchmark MOT20 dataset, we demonstrate that our approach significantly reduces the baseline latency of any given object detector without sacrificing any accuracy. Further latency reduction, up to 25x lower than the original latency, can be achieved with minimal accuracy loss. MOVEX enables low latency video object detection on common CPU based systems, thus allowing for high performance video object detection beyond the domain of GPU computing. The code is available at https://github.com/juliantrue/movex.
    Error Diffusion Halftoning Against Adversarial Examples. (arXiv:2101.09451v2 [cs.CV] UPDATED)
    (2 min) Adversarial examples contain carefully crafted perturbations that can fool deep neural networks (DNNs) into making wrong predictions. Enhancing the adversarial robustness of DNNs has gained considerable interest in recent years. Although image transformation-based defenses were widely considered at an earlier time, most of them have been defeated by adaptive attacks. In this paper, we propose a new image transformation defense based on error diffusion halftoning, and combine it with adversarial training to defend against adversarial examples. Error diffusion halftoning projects an image into a 1-bit space and diffuses quantization error to neighboring pixels. This process can remove adversarial perturbations from a given image while maintaining acceptable image quality in the meantime in favor of recognition. Experimental results demonstrate that the proposed method is able to improve adversarial robustness even under advanced adaptive attacks, while most of the other image transformation-based defenses do not. We show that a proper image transformation can still be an effective defense approach. Code: https://github.com/shaoyuanlo/Halftoning-Defense
    Cascading Convolutional Temporal Colour Constancy. (arXiv:2106.07955v1 [cs.CV])
    (2 min) Computational Colour Constancy (CCC) consists of estimating the colour of one or more illuminants in a scene and using them to remove unwanted chromatic distortions. Much research has focused on illuminant estimation for CCC on single images, with few attempts of leveraging the temporal information intrinsic in sequences of correlated images (e.g., the frames in a video), a task known as Temporal Colour Constancy (TCC). The state-of-the-art for TCC is TCCNet, a deep-learning architecture that uses a ConvLSTM for aggregating the encodings produced by CNN submodules for each image in a sequence. We extend this architecture with different models obtained by (i) substituting the TCCNet submodules with C4, the state-of-the-art method for CCC targeting images; (ii) adding a cascading strategy to perform an iterative improvement of the estimate of the illuminant. We tested our models on the recently released TCC benchmark and achieved results that surpass the state-of-the-art. Analyzing the impact of the number of frames involved in illuminant estimation on performance, we show that it is possible to reduce inference time by training the models on few selected frames from the sequences while retaining comparable accuracy.
    Technical Report: Temporal Aggregate Representations. (arXiv:2106.03152v2 [cs.CV] UPDATED)
    (0 min) This technical report extends our work presented in [9] with more experiments. In [9], we tackle long-term video understanding, which requires reasoning from current and past or future observations and raises several fundamental questions. How should temporal or sequential relationships be modelled? What temporal extent of information and context needs to be processed? At what temporal scale should they be derived? [9] addresses these questions with a flexible multi-granular temporal aggregation framework. In this report, we conduct further experiments with this framework on different tasks and a new dataset, EPIC-KITCHENS-100.
    Zero-sample surface defect detection and classification based on semantic feedback neural network. (arXiv:2106.07959v1 [cs.CV])
    (0 min) Defect detection and classification technology has changed from traditional artificial visual inspection to current intelligent automated inspection, but most of the current defect detection methods are training related detection models based on a data-driven approach, taking into account the difficulty of collecting some sample data in the industrial field. We apply zero-shot learning technology to the industrial field. Aiming at the problem of the existing "Latent Feature Guide Attribute Attention" (LFGAA) zero-shot image classification network, the output latent attributes and artificially defined attributes are different in the semantic space, which leads to the problem of model performance degradation, proposed an LGFAA network based on semantic feedback, and improved model performance by constructing semantic embedded modules and feedback mechanisms. At the same time, for the common domain shift problem in zero-shot learning, based on the idea of co-training algorithm using the difference information between different views of data to learn from each other, we propose an Ensemble Co-training algorithm, which adaptively reduces the prediction error in image tag embedding from multiple angles. Various experiments conducted on the zero-shot dataset and the cylinder liner dataset in the industrial field provide competitive results.
    Non-Gradient Manifold Neural Network. (arXiv:2106.07905v1 [cs.LG])
    (2 min) Deep neural network (DNN) generally takes thousands of iterations to optimize via gradient descent and thus has a slow convergence. In addition, softmax, as a decision layer, may ignore the distribution information of the data during classification. Aiming to tackle the referred problems, we propose a novel manifold neural network based on non-gradient optimization, i.e., the closed-form solutions. Considering that the activation function is generally invertible, we reconstruct the network via forward ridge regression and low rank backward approximation, which achieve the rapid convergence. Moreover, by unifying the flexible Stiefel manifold and adaptive support vector machine, we devise the novel decision layer which efficiently fits the manifold structure of the data and label information. Consequently, a jointly non-gradient optimization method is designed to generate the network with closed-form results. Eventually, extensive experiments validate the superior performance of the model.
    Machine Learning for Nondestructive Wear Assessment in Large Internal Combustion Engines. (arXiv:2103.08482v2 [cs.CV] UPDATED)
    (2 min) Digitalization offers a large number of promising tools for large internal combustion engines such as condition monitoring or condition-based maintenance. This includes the status evaluation of key engine components such as cylinder liners, whose inner surfaces are subject to constant wear due to their movement relative to the pistons. Existing state-of-the-art methods for quantifying wear require disassembly and cutting of the examined liner followed by a high-resolution microscopic surface depth measurement that quantitatively evaluates wear based on bearing load curves (also known as Abbott-Firestone curves). Such reference methods are destructive, time-consuming and costly. The goal of the research presented here is to develop nondestructive yet reliable methods for quantifying the surface condition. A deep-learning framework is proposed that allows computation of the bearing load curves from reflection RGB images of the liner surface that can be collected with a wide variety of simple imaging devices, without the need to remove and destroy the investigated liner. For this purpose, a convolutional neural network is trained to predict the bearing load curve of the corresponding depth profile from the collected RGB images, which in turn can be used for further wear evaluation. Training of the network is performed using a custom-built database containing depth profiles and reflection images of liner surfaces of large gas engines. The results of the proposed method are visually examined and quantified considering several probabilistic distance metrics and comparison of roughness indicators between ground truth and model predictions. The observed success of the proposed method suggests its great potential for quantitative wear assessment on engines during service directly on site.
    A Spacecraft Dataset for Detection, Segmentation and Parts Recognition. (arXiv:2106.08186v1 [cs.CV])
    (2 min) Virtually all aspects of modern life depend on space technology. Thanks to the great advancement of computer vision in general and deep learning-based techniques in particular, over the decades, the world witnessed the growing use of deep learning in solving problems for space applications, such as self-driving robot, tracers, insect-like robot on cosmos and health monitoring of spacecraft. These are just some prominent examples that has advanced space industry with the help of deep learning. However, the success of deep learning models requires a lot of training data in order to have decent performance, while on the other hand, there are very limited amount of publicly available space datasets for the training of deep learning models. Currently, there is no public datasets for space-based object detection or instance segmentation, partly because manually annotating object segmentation masks is very time consuming as they require pixel-level labelling, not to mention the challenge of obtaining images from space. In this paper, we aim to fill this gap by releasing a dataset for spacecraft detection, instance segmentation and part recognition. The main contribution of this work is the development of the dataset using images of space stations and satellites, with rich annotations including bounding boxes of spacecrafts and masks to the level of object parts, which are obtained with a mixture of automatic processes and manual efforts. We also provide evaluations with state-of-the-art methods in object detection and instance segmentation as a benchmark for the dataset. The link for downloading the proposed dataset can be found on https://github.com/Yurushia1998/SatelliteDataset.
    Mixed Model OCR Training on Historical Latin Script for Out-of-the-Box Recognition and Finetuning. (arXiv:2106.07881v1 [cs.CV])
    (2 min) In order to apply Optical Character Recognition (OCR) to historical printings of Latin script fully automatically, we report on our efforts to construct a widely-applicable polyfont recognition model yielding text with a Character Error Rate (CER) around 2% when applied out-of-the-box. Moreover, we show how this model can be further finetuned to specific classes of printings with little manual and computational effort. The mixed or polyfont model is trained on a wide variety of materials, in terms of age (from the 15th to the 19th century), typography (various types of Fraktur and Antiqua), and languages (among others, German, Latin, and French). To optimize the results we combined established techniques of OCR training like pretraining, data augmentation, and voting. In addition, we used various preprocessing methods to enrich the training data and obtain more robust models. We also implemented a two-stage approach which first trains on all available, considerably unbalanced data and then refines the output by training on a selected more balanced subset. Evaluations on 29 previously unseen books resulted in a CER of 1.73%, outperforming a widely used standard model with a CER of 2.84% by almost 40%. Training a more specialized model for some unseen Early Modern Latin books starting from our mixed model led to a CER of 1.47%, an improvement of up to 50% compared to training from scratch and up to 30% compared to training from the aforementioned standard model. Our new mixed model is made openly available to the community.
    Towards Total Recall in Industrial Anomaly Detection. (arXiv:2106.08265v1 [cs.CV])
    (2 min) Being able to spot defective parts is a critical component in large-scale industrial manufacturing. A particular challenge that we address in this work is the cold-start problem: fit a model using nominal (non-defective) example images only. While handcrafted solutions per class are possible, the goal is to build systems that work well simultaneously on many different tasks automatically. The best peforming approaches combine embeddings from ImageNet models with an outlier detection model. In this paper, we extend on this line of work and propose PatchCore, which uses a maximally representative memory bank of nominal patch-features. PatchCore offers competitive inference times while achieving state-of-the-art performance for both detection and localization. On the standard dataset MVTec AD, PatchCore achieves an image-level anomaly detection AUROC score of $99.1\%$, more than halving the error compared to the next best competitor. We further report competitive results on two additional datasets and also find competitive results in the few samples regime.
    Contextualizing Multiple Tasks via Learning to Decompose. (arXiv:2106.08112v1 [cs.LG])
    (2 min) One single instance could possess multiple portraits and reveal diverse relationships with others according to different contexts. Those ambiguities increase the difficulty of learning a generalizable model when there exists one concept or mixed concepts in a task. We propose a general approach Learning to Decompose Network (LeadNet) for both two cases, which contextualizes a model through meta-learning multiple maps for concepts discovery -- the representations of instances are decomposed and adapted conditioned on the contexts. Through taking a holistic view over multiple latent components over instances in a sampled pseudo task, LeadNet learns to automatically select the right concept via incorporating those rich semantics inside and between objects. LeadNet demonstrates its superiority in various applications, including exploring multiple views of confusing tasks, out-of-distribution recognition, and few-shot image classification.
    Computer-aided Interpretable Features for Leaf Image Classification. (arXiv:2106.08077v1 [cs.CV])
    (2 min) Plant species identification is time consuming, costly, and requires lots of efforts, and expertise knowledge. In recent, many researchers use deep learning methods to classify plants directly using plant images. While deep learning models have achieved a great success, the lack of interpretability limit their widespread application. To overcome this, we explore the use of interpretable, measurable and computer-aided features extracted from plant leaf images. Image processing is one of the most challenging, and crucial steps in feature-extraction. The purpose of image processing is to improve the leaf image by removing undesired distortion. The main image processing steps of our algorithm involves: i) Convert original image to RGB (Red-Green-Blue) image, ii) Gray scaling, iii) Gaussian smoothing, iv) Binary thresholding, v) Remove stalk, vi) Closing holes, and vii) Resize image. The next step after image processing is to extract features from plant leaf images. We introduced 52 computationally efficient features to classify plant species. These features are mainly classified into four groups as: i) shape-based features, ii) color-based features, iii) texture-based features, and iv) scagnostic features. Length, width, area, texture correlation, monotonicity and scagnostics are to name few of them. We explore the ability of features to discriminate the classes of interest under supervised learning and unsupervised learning settings. For that, supervised dimensionality reduction technique, Linear Discriminant Analysis (LDA), and unsupervised dimensionality reduction technique, Principal Component Analysis (PCA) are used to convert and visualize the images from digital-image space to feature space. The results show that the features are sufficient to discriminate the classes of interest under both supervised and unsupervised learning settings.
    DeepKoCo: Efficient latent planning with a robust Koopman representation. (arXiv:2011.12690v2 [cs.LG] UPDATED)
    (2 min) This paper presents DeepKoCo, a novel model-based agent that learns a latent Koopman representation from images. This representation allows DeepKoCo to plan efficiently using linear control methods, such as linear model predictive control. Compared to traditional agents, DeepKoCo is robust to task-irrelevant dynamics, thanks to the use of a tailored lossy autoencoder network that allows DeepKoCo to learn latent dynamics that reconstruct and predict only observed costs, rather than all observed dynamics. As our results show, DeepKoCo achieves a similar final performance as traditional model-free methods on complex control tasks, while being considerably more robust to distractor dynamics, making the proposed agent more amenable for real-life applications.
    End-to-End Human Pose and Mesh Reconstruction with Transformers. (arXiv:2012.09760v3 [cs.CV] UPDATED)
    (2 min) We present a new method, called MEsh TRansfOrmer (METRO), to reconstruct 3D human pose and mesh vertices from a single image. Our method uses a transformer encoder to jointly model vertex-vertex and vertex-joint interactions, and outputs 3D joint coordinates and mesh vertices simultaneously. Compared to existing techniques that regress pose and shape parameters, METRO does not rely on any parametric mesh models like SMPL, thus it can be easily extended to other objects such as hands. We further relax the mesh topology and allow the transformer self-attention mechanism to freely attend between any two vertices, making it possible to learn non-local relationships among mesh vertices and joints. With the proposed masked vertex modeling, our method is more robust and effective in handling challenging situations like partial occlusions. METRO generates new state-of-the-art results for human mesh reconstruction on the public Human3.6M and 3DPW datasets. Moreover, we demonstrate the generalizability of METRO to 3D hand reconstruction in the wild, outperforming existing state-of-the-art methods on FreiHAND dataset. Code and pre-trained models are available at https://github.com/microsoft/MeshTransformer.
    Top-Related Meta-Learning Method for Few-Shot Object Detection. (arXiv:2007.06837v6 [cs.CV] UPDATED)
    (3 min) Many meta-learning methods are proposed for few-shot detection. However, previous most methods have two main problems, poor detection APs, and strong bias because of imbalance and insufficient datasets. Previous works mainly alleviate these issues by additional datasets, multi-relation attention mechanisms and sub-modules. However, they require more cost. In this work, for meta-learning, we find that the main challenges focus on related or irrelevant semantic features between categories. Therefore, based on semantic features, we propose a Top-C classification loss (i.e., TCL-C) for classification task and a category-based grouping mechanism for category-based meta-features obtained by the meta-model. The TCL-C exploits the true-label prediction and the most likely C-1 false classification predictions to improve detection performance on few-shot classes. According to similar appearance (i.e., visual appearance, shape, and limbs etc.) and environment in which objects often appear, the category-based grouping mechanism splits categories into disjoint groups to make similar semantic features more compact between categories within a group and obtain more significant difference between groups, alleviating the strong bias problem and further improving detection APs. The whole training consists of the base model and the fine-tuning phases. According to grouping mechanism, we group the meta-features vectors obtained by meta-model, so that the distribution difference between groups is obvious, and the one within each group is less. Extensive experiments on Pascal VOC dataset demonstrate that ours which combines the TCL-C with category-based grouping significantly outperforms previous state-of-the-art methods for few-shot detection. Compared with previous competitive baseline, ours improves detection APs by almost 4% for few-shot detection.
    EuroCrops: A Pan-European Dataset for Time Series Crop Type Classification. (arXiv:2106.08151v1 [eess.IV])
    (2 min) We present EuroCrops, a dataset based on self-declared field annotations for training and evaluating methods for crop type classification and mapping, together with its process of acquisition and harmonisation. By this, we aim to enrich the research efforts and discussion for data-driven land cover classification via Earth observation and remote sensing. Additionally, through inclusion of self-declarations gathered in the scope of subsidy control from all countries of the European Union (EU), this dataset highlights the difficulties and pitfalls one comes across when operating on a transnational level. We, therefore, also introduce a new taxonomy scheme, HCAT-ID, that aspires to capture all the aspects of reference data originating from administrative and agency databases. To address researchers from both the remote sensing and the computer vision and machine learning communities, we publish the dataset in different formats and processing levels.
    Spot the Difference: Topological Anomaly Detection via Geometric Alignment. (arXiv:2106.08233v1 [cs.CV])
    (2 min) Geometric alignment appears in a variety of applications, ranging from domain adaptation, optimal transport, and normalizing flows in machine learning; optical flow and learned augmentation in computer vision and deformable registration within biomedical imaging. A recurring challenge is the alignment of domains whose topology is not the same; a problem that is routinely ignored, potentially introducing bias in downstream analysis. As a first step towards solving such alignment problems, we propose an unsupervised topological difference detection algorithm. The model is based on a conditional variational auto-encoder and detects topological anomalies with regards to a reference alongside the registration step. We consider both a) topological changes in the image under spatial variation and b) unexpected transformations. Our approach is validated on a proxy task of unsupervised anomaly detection in images.
    Accelerate CNNs from Three Dimensions: A Comprehensive Pruning Framework. (arXiv:2010.04879v3 [cs.CV] UPDATED)
    (2 min) Most neural network pruning methods, such as filter-level and layer-level prunings, prune the network model along one dimension (depth, width, or resolution) solely to meet a computational budget. However, such a pruning policy often leads to excessive reduction of that dimension, thus inducing a huge accuracy loss. To alleviate this issue, we argue that pruning should be conducted along three dimensions comprehensively. For this purpose, our pruning framework formulates pruning as an optimization problem. Specifically, it first casts the relationships between a certain model's accuracy and depth/width/resolution into a polynomial regression and then maximizes the polynomial to acquire the optimal values for the three dimensions. Finally, the model is pruned along the three optimal dimensions accordingly. In this framework, since collecting too much data for training the regression is very time-costly, we propose two approaches to lower the cost: 1) specializing the polynomial to ensure an accurate regression even with less training data; 2) employing iterative pruning and fine-tuning to collect the data faster. Extensive experiments show that our proposed algorithm surpasses state-of-the-art pruning algorithms and even neural architecture search-based algorithms.
    Deep Transfer Learning for Brain Magnetic Resonance Image Multi-class Classification. (arXiv:2106.07333v2 [cs.CV] UPDATED)
    (2 min) Magnetic Resonance Imaging (MRI) is a principal diagnostic approach used in the field of radiology to create images of the anatomical and physiological structure of patients. MRI is the prevalent medical imaging practice to find abnormalities in soft tissues. Traditionally they are analyzed by a radiologist to detect abnormalities in soft tissues, especially the brain. The process of interpreting a massive volume of patient's MRI is laborious. Hence, the use of Machine Learning methodologies can aid in detecting abnormalities in soft tissues with considerable accuracy. In this research, we have curated a novel dataset and developed a framework that uses Deep Transfer Learning to perform a multi-classification of tumors in the brain MRI images. In this paper, we adopted the Deep Residual Convolutional Neural Network (ResNet50) architecture for the experiments along with discriminative learning techniques to train the model. Using the novel dataset and two publicly available MRI brain datasets, this proposed approach attained a classification accuracy of 86.40% on the curated dataset, 93.80% on the Harvard Whole Brain Atlas dataset, and 97.05% accuracy on the School of Biomedical Engineering dataset. Results of our experiments significantly demonstrate our proposed framework for transfer learning is a potential and effective method for brain tumor multi-classification tasks.
    A White Paper on Neural Network Quantization. (arXiv:2106.08295v1 [cs.LG])
    (2 min) While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings but the additional noise it induces can lead to accuracy degradation. In this white paper, we introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network's performance while maintaining low-bit weights and activations. We start with a hardware motivated introduction to quantization and then consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware-Training (QAT). PTQ requires no re-training or labelled data and is thus a lightweight push-button approach to quantization. In most cases, PTQ is sufficient for achieving 8-bit quantization with close to floating-point accuracy. QAT requires fine-tuning and access to labeled training data but enables lower bit quantization with competitive results. For both solutions, we provide tested pipelines based on existing literature and extensive experimentation that lead to state-of-the-art performance for common deep learning models and tasks.
    Coping with Label Shift via Distributionally Robust Optimisation. (arXiv:2010.12230v2 [cs.LG] UPDATED)
    (2 min) The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an \emph{unlabelled} test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. While approaches using this idea have proven effective, their scope is limited as it is not always feasible to access the target domain; further, they require repeated retraining if the model is to be deployed in \emph{multiple} test environments. Can one instead learn a \emph{single} classifier that is robust to arbitrary label shifts from a broad family? In this paper, we answer this question by proposing a model that minimises an objective based on distributionally robust optimisation (DRO). We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective. %, and establish its convergence. Finally, through experiments on CIFAR-100 and ImageNet, we show that our technique can significantly improve performance over a number of baselines in settings where label shift is present.
    Generating Data Augmentation samples for Semantic Segmentation of Salt Bodies in a Synthetic Seismic Image Dataset. (arXiv:2106.08269v1 [cs.CV])
    (2 min) Nowadays, subsurface salt body localization and delineation, also called semantic segmentation of salt bodies, are among the most challenging geophysicist tasks. Thus, identifying large salt bodies is notoriously tricky and is crucial for identifying hydrocarbon reservoirs and drill path planning. This work proposes a Data Augmentation method based on training two generative models to augment the number of samples in a seismic image dataset for the semantic segmentation of salt bodies. Our method uses deep learning models to generate pairs of seismic image patches and their respective salt masks for the Data Augmentation. The first model is a Variational Autoencoder and is responsible for generating patches of salt body masks. The second is a Conditional Normalizing Flow model, which receives the generated masks as inputs and generates the associated seismic image patches. We evaluate the proposed method by comparing the performance of ten distinct state-of-the-art models for semantic segmentation, trained with and without the generated augmentations, in a dataset from two synthetic seismic images. The proposed methodology yields an average improvement of 8.57% in the IoU metric across all compared models. The best result is achieved by a DeeplabV3+ model variant, which presents an IoU score of 95.17% when trained with our augmentations. Additionally, our proposal outperformed six selected data augmentation methods, and the most significant improvement in the comparison, of 9.77%, is achieved by composing our DA with augmentations from an elastic transformation. At last, we show that the proposed method is adaptable for a larger context size by achieving results comparable to the obtained on the smaller context size.
    Diverse Video Captioning Through Latent Variable Expansion. (arXiv:1910.12019v6 [cs.CV] UPDATED)
    (2 min) Automatically describing video content with text description is challenging but important task, which has been attracting a lot of attention in computer vision community. Previous works mainly strive for the accuracy of the generated sentences, while ignoring the sentences diversity, which is inconsistent with human behavior. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework. Concretely, for a given video, the intermediate latent variables of conventional encode-decode process are utilized as input to the conditional generative adversarial network (CGAN) with the purpose of generating diverse sentences. We adopt different Convolutional Neural Networks (CNNs) as our generator that produces descriptions conditioned on latent variables and discriminator that assesses the quality of generated sentences. Simultaneously, a novel DCE metric is designed to assess the diverse captions. We evaluate our method on the benchmark datasets, where it demonstrates its ability to generate diverse descriptions and achieves superior results against other state-of-the-art methods.
    S2R-DepthNet: Learning a Generalizable Depth-specific Structural Representation. (arXiv:2104.00877v2 [cs.CV] UPDATED)
    (2 min) Human can infer the 3D geometry of a scene from a sketch instead of a realistic image, which indicates that the spatial structure plays a fundamental role in understanding the depth of scenes. We are the first to explore the learning of a depth-specific structural representation, which captures the essential feature for depth estimation and ignores irrelevant style information. Our S2R-DepthNet (Synthetic to Real DepthNet) can be well generalized to unseen real-world data directly even though it is only trained on synthetic data. S2R-DepthNet consists of: a) a Structure Extraction (STE) module which extracts a domaininvariant structural representation from an image by disentangling the image into domain-invariant structure and domain-specific style components, b) a Depth-specific Attention (DSA) module, which learns task-specific knowledge to suppress depth-irrelevant structures for better depth estimation and generalization, and c) a depth prediction module (DP) to predict depth from the depth-specific representation. Without access of any real-world images, our method even outperforms the state-of-the-art unsupervised domain adaptation methods which use real-world images of the target domain for training. In addition, when using a small amount of labeled real-world data, we achieve the state-ofthe-art performance under the semi-supervised setting. The code and trained models are available at https://github.com/microsoft/S2R-DepthNet.
    Combining Semantic Guidance and Deep Reinforcement Learning For Generating Human Level Paintings. (arXiv:2011.12589v2 [cs.CV] UPDATED)
    (2 min) Generation of stroke-based non-photorealistic imagery, is an important problem in the computer vision community. As an endeavor in this direction, substantial recent research efforts have been focused on teaching machines "how to paint", in a manner similar to a human painter. However, the applicability of previous methods has been limited to datasets with little variation in position, scale and saliency of the foreground object. As a consequence, we find that these methods struggle to cover the granularity and diversity possessed by real world images. To this end, we propose a Semantic Guidance pipeline with 1) a bi-level painting procedure for learning the distinction between foreground and background brush strokes at training time. 2) We also introduce invariance to the position and scale of the foreground object through a neural alignment model, which combines object localization and spatial transformer networks in an end to end manner, to zoom into a particular semantic instance. 3) The distinguishing features of the in-focus object are then amplified by maximizing a novel guided backpropagation based focus reward. The proposed agent does not require any supervision on human stroke-data and successfully handles variations in foreground object attributes, thus, producing much higher quality canvases for the CUB-200 Birds and Stanford Cars-196 datasets. Finally, we demonstrate the further efficacy of our method on complex datasets with multiple foreground object instances by evaluating an extension of our method on the challenging Virtual-KITTI dataset. Source code and models are available at https://github.com/1jsingh/semantic-guidance.
    Multi-StyleGAN: Towards Image-Based Simulation of Time-Lapse Live-Cell Microscopy. (arXiv:2106.08285v1 [cs.CV])
    (2 min) Time-lapse fluorescent microscopy (TLFM) combined with predictive mathematical modelling is a powerful tool to study the inherently dynamic processes of life on the single-cell level. Such experiments are costly, complex and labour intensive. A complimentary approach and a step towards completely in silico experiments, is to synthesise the imagery itself. Here, we propose Multi-StyleGAN as a descriptive approach to simulate time-lapse fluorescence microscopy imagery of living cells, based on a past experiment. This novel generative adversarial network synthesises a multi-domain sequence of consecutive timesteps. We showcase Multi-StyleGAN on imagery of multiple live yeast cells in microstructured environments and train on a dataset recorded in our laboratory. The simulation captures underlying biophysical factors and time dependencies, such as cell morphology, growth, physical interactions, as well as the intensity of a fluorescent reporter protein. An immediate application is to generate additional training and validation data for feature extraction algorithms or to aid and expedite development of advanced experimental techniques such as online monitoring or control of cells. Code and dataset is available at https://git.rwth-aachen.de/bcs/projects/tp/multi-stylegan.
    Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training. (arXiv:2102.02887v3 [cs.LG] UPDATED)
    (2 min) In this paper, we introduce a new perspective on training deep neural networks capable of state-of-the-art performance without the need for the expensive over-parameterization by proposing the concept of In-Time Over-Parameterization (ITOP) in sparse training. By starting from a random sparse network and continuously exploring sparse connectivities during training, we can perform an Over-Parameterization in the space-time manifold, closing the gap in the expressibility between sparse training and dense training. We further use ITOP to understand the underlying mechanism of Dynamic Sparse Training (DST) and indicate that the benefits of DST come from its ability to consider across time all possible parameters when searching for the optimal sparse connectivity. As long as there are sufficient parameters that have been reliably explored during training, DST can outperform the dense neural network by a large margin. We present a series of experiments to support our conjecture and achieve the state-of-the-art sparse training performance with ResNet-50 on ImageNet. More impressively, our method achieves dominant performance over the overparameterization-based sparse methods at extreme sparsity levels. When trained on CIFAR-100, our method can match the performance of the dense model even at an extreme sparsity (98%). Code can be found https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization.
    Multi-script Handwritten Digit Recognition Using Multi-task Learning. (arXiv:2106.08267v1 [cs.CV])
    (2 min) Handwritten digit recognition is one of the extensively studied area in machine learning. Apart from the wider research on handwritten digit recognition on MNIST dataset, there are many other research works on various script recognition. However, it is not very common for multi-script digit recognition which encourage the development of robust and multipurpose systems. Additionally working on multi-script digit recognition enables multi-task learning, considering the script classification as a related task for instance. It is evident that multi-task learning improves model performance through inductive transfer using the information contained in related tasks. Therefore, in this study multi-script handwritten digit recognition using multi-task learning will be investigated. As a specific case of demonstrating the solution to the problem, Amharic handwritten character recognition will also be experimented. The handwritten digits of three scripts including Latin, Arabic and Kannada are studied to show that multi-task models with reformulation of the individual tasks have shown promising results. In this study a novel way of using the individual tasks predictions was proposed to help classification performance and regularize the different loss for the purpose of the main task. This finding has outperformed the baseline and the conventional multi-task learning models. More importantly, it avoided the need for weighting the different losses of the tasks, which is one of the challenges in multi-task learning.
    Automatic linear measurements of the fetal brain on MRI with deep neural networks. (arXiv:2106.08174v1 [eess.IV])
    (3 min) Timely, accurate and reliable assessment of fetal brain development is essential to reduce short and long-term risks to fetus and mother. Fetal MRI is increasingly used for fetal brain assessment. Three key biometric linear measurements important for fetal brain evaluation are Cerebral Biparietal Diameter (CBD), Bone Biparietal Diameter (BBD), and Trans-Cerebellum Diameter (TCD), obtained manually by expert radiologists on reference slices, which is time consuming and prone to human error. The aim of this study was to develop a fully automatic method computing the CBD, BBD and TCD measurements from fetal brain MRI. The input is fetal brain MRI volumes which may include the fetal body and the mother's abdomen. The outputs are the measurement values and reference slices on which the measurements were computed. The method, which follows the manual measurements principle, consists of five stages: 1) computation of a Region Of Interest that includes the fetal brain with an anisotropic 3D U-Net classifier; 2) reference slice selection with a Convolutional Neural Network; 3) slice-wise fetal brain structures segmentation with a multiclass U-Net classifier; 4) computation of the fetal brain midsagittal line and fetal brain orientation, and; 5) computation of the measurements. Experimental results on 214 volumes for CBD, BBD and TCD measurements yielded a mean $L_1$ difference of 1.55mm, 1.45mm and 1.23mm respectively, and a Bland-Altman 95% confidence interval ($CI_{95}$) of 3.92mm, 3.98mm and 2.25mm respectively. These results are similar to the manual inter-observer variability. The proposed automatic method for computing biometric linear measurements of the fetal brain from MR imaging achieves human level performance. It has the potential of being a useful method for the assessment of fetal brain biometry in normal and pathological cases, and of improving routine clinical practice.
    Going Beyond Classification Accuracy Metrics in Model Compression. (arXiv:2012.01604v2 [cs.CV] UPDATED)
    (2 min) With the rise in edge-computing devices, there has been an increasing demand to deploy energy and resource-efficient models. A large body of research has been devoted to developing methods that can reduce the size of the model considerably without affecting the standard metrics such as top-1 accuracy. However, these pruning approaches tend to result in a significant mismatch in other metrics such as fairness across classes and explainability. To combat such misalignment, we propose a novel multi-part loss function inspired by the knowledge-distillation literature. Through extensive experiments, we demonstrate the effectiveness of our approach across different compression algorithms, architectures, tasks as well as datasets. In particular, we obtain up to $4.1\times$ reduction in the number of prediction mismatches between the compressed and reference models, and up to $5.7\times$ in cases where the reference model makes the correct prediction; all while making no changes to the compression algorithm, and minor modifications to the loss function. Furthermore, we demonstrate how inducing simple alignment between the predictions of the models naturally improves the alignment on other metrics including fairness and attributions. Our framework can thus serve as a simple plug-and-play component for compression algorithms in the future.
    Real-time Pose and Shape Reconstruction of Two Interacting Hands With a Single Depth Camera. (arXiv:2106.08059v1 [cs.CV])
    (2 min) We present a novel method for real-time pose and shape reconstruction of two strongly interacting hands. Our approach is the first two-hand tracking solution that combines an extensive list of favorable properties, namely it is marker-less, uses a single consumer-level depth camera, runs in real time, handles inter- and intra-hand collisions, and automatically adjusts to the user's hand shape. In order to achieve this, we embed a recent parametric hand pose and shape model and a dense correspondence predictor based on a deep neural network into a suitable energy minimization framework. For training the correspondence prediction network, we synthesize a two-hand dataset based on physical simulations that includes both hand pose and shape annotations while at the same time avoiding inter-hand penetrations. To achieve real-time rates, we phrase the model fitting in terms of a nonlinear least-squares problem so that the energy can be optimized based on a highly efficient GPU-based Gauss-Newton optimizer. We show state-of-the-art results in scenes that exceed the complexity level demonstrated by previous work, including tight two-hand grasps, significant inter-hand occlusions, and gesture interaction.
    Automated triaging of head MRI examinations using convolutional neural networks. (arXiv:2106.08176v1 [eess.IV])
    (2 min) The growing demand for head magnetic resonance imaging (MRI) examinations, along with a global shortage of radiologists, has led to an increase in the time taken to report head MRI scans around the world. For many neurological conditions, this delay can result in increased morbidity and mortality. An automated triaging tool could reduce reporting times for abnormal examinations by identifying abnormalities at the time of imaging and prioritizing the reporting of these scans. In this work, we present a convolutional neural network for detecting clinically-relevant abnormalities in $\text{T}_2$-weighted head MRI scans. Using a validated neuroradiology report classifier, we generated a labelled dataset of 43,754 scans from two large UK hospitals for model training, and demonstrate accurate classification (area under the receiver operating curve (AUC) = 0.943) on a test set of 800 scans labelled by a team of neuroradiologists. Importantly, when trained on scans from only a single hospital the model generalized to scans from the other hospital ($\Delta$AUC $\leq$ 0.02). A simulation study demonstrated that our model would reduce the mean reporting time for abnormal examinations from 28 days to 14 days and from 9 days to 5 days at the two hospitals, demonstrating feasibility for use in a clinical triage environment.
    Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval. (arXiv:2104.01894v3 [cs.CL] UPDATED)
    (2 min) Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice -- both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. In this work, we extensively study and expand choices of encoder architectures, training methodology (including unimodal and multimodal pretraining), and other factors. Our experiments cover different types of speech in three datasets: Flickr Audio, Places Audio, and Localized Narratives. Our best model configuration achieves large gains over state of the art, e.g., pushing recall-at-one from 21.8% to 33.2% for Flickr Audio and 27.6% to 53.4% for Places Audio. We also show our best speech-based models can match or exceed cascaded ASR-to-text encoding when speech is spontaneous, accented, or otherwise hard to automatically transcribe.
    Dynamic Head: Unifying Object Detection Heads with Attentions. (arXiv:2106.08322v1 [cs.CV])
    (2 min) The complex nature of combining localization and classification in object detection has resulted in the flourished development of methods. Previous works tried to improve the performance in various object detection heads but failed to present a unified view. In this paper, we present a novel dynamic head framework to unify object detection heads with attentions. By coherently combining multiple self-attention mechanisms between feature levels for scale-awareness, among spatial locations for spatial-awareness, and within output channels for task-awareness, the proposed approach significantly improves the representation ability of object detection heads without any computational overhead. Further experiments demonstrate that the effectiveness and efficiency of the proposed dynamic head on the COCO benchmark. With a standard ResNeXt-101-DCN backbone, we largely improve the performance over popular object detectors and achieve a new state-of-the-art at 54.0 AP. Furthermore, with latest transformer backbone and extra data, we can push current best COCO result to a new record at 60.6 AP. The code will be released at https://github.com/microsoft/DynamicHead.
    Perceptually-inspired super-resolution of compressed videos. (arXiv:2106.08147v1 [eess.IV])
    (2 min) Spatial resolution adaptation is a technique which has often been employed in video compression to enhance coding efficiency. This approach encodes a lower resolution version of the input video and reconstructs the original resolution during decoding. Instead of using conventional up-sampling filters, recent work has employed advanced super-resolution methods based on convolutional neural networks (CNNs) to further improve reconstruction quality. These approaches are usually trained to minimise pixel-based losses such as Mean-Squared Error (MSE), despite the fact that this type of loss metric does not correlate well with subjective opinions. In this paper, a perceptually-inspired super-resolution approach (M-SRGAN) is proposed for spatial up-sampling of compressed video using a modified CNN model, which has been trained using a generative adversarial network (GAN) on compressed content with perceptual loss functions. The proposed method was integrated with HEVC HM 16.20, and has been evaluated on the JVET Common Test Conditions (UHD test sequences) using the Random Access configuration. The results show evident perceptual quality improvement over the original HM 16.20, with an average bitrate saving of 35.6% (Bj{\o}ntegaard Delta measurement) based on a perceptual quality metric, VMAF.
    Flow Guided Transformable Bottleneck Networks for Motion Retargeting. (arXiv:2106.07771v1 [cs.CV])
    (2 min) Human motion retargeting aims to transfer the motion of one person in a "driving" video or set of images to another person. Existing efforts leverage a long training video from each target person to train a subject-specific motion transfer model. However, the scalability of such methods is limited, as each model can only generate videos for the given target subject, and such training videos are labor-intensive to acquire and process. Few-shot motion transfer techniques, which only require one or a few images from a target, have recently drawn considerable attention. Methods addressing this task generally use either 2D or explicit 3D representations to transfer motion, and in doing so, sacrifice either accurate geometric modeling or the flexibility of an end-to-end learned representation. Inspired by the Transformable Bottleneck Network, which renders novel views and manipulations of rigid objects, we propose an approach based on an implicit volumetric representation of the image content, which can then be spatially manipulated using volumetric flow fields. We address the challenging question of how to aggregate information across different body poses, learning flow fields that allow for combining content from the appropriate regions of input images of highly non-rigid human subjects performing complex motions into a single implicit volumetric representation. This allows us to learn our 3D representation solely from videos of moving people. Armed with both 3D object understanding and end-to-end learned rendering, this categorically novel representation delivers state-of-the-art image generation quality, as shown by our quantitative and qualitative evaluations.
    Pruning and Quantization for Deep Neural Network Acceleration: A Survey. (arXiv:2101.09671v3 [cs.CV] UPDATED)
    (2 min) Deep neural networks have been applied in many applications exhibiting extraordinary abilities in the field of computer vision. However, complex network architectures challenge efficient real-time deployment and require significant computation resources and energy costs. These challenges can be overcome through optimizations such as network compression. Network compression can often be realized with little loss of accuracy. In some cases accuracy may even improve. This paper provides a survey on two types of network compression: pruning and quantization. Pruning can be categorized as static if it is performed offline or dynamic if it is performed at run-time. We compare pruning techniques and describe criteria used to remove redundant computations. We discuss trade-offs in element-wise, channel-wise, shape-wise, filter-wise, layer-wise and even network-wise pruning. Quantization reduces computations by reducing the precision of the datatype. Weights, biases, and activations may be quantized typically to 8-bit integers although lower bit width implementations are also discussed including binary neural networks. Both pruning and quantization can be used independently or combined. We compare current techniques, analyze their strengths and weaknesses, present compressed network accuracy results on a number of frameworks, and provide practical guidance for compressing networks.
    Is this Harmful? Learning to Predict Harmfulness Ratings from Video. (arXiv:2106.08323v1 [cs.CV])
    (2 min) Automatically identifying harmful content in video is an important task with a wide range of applications. However, due to the difficulty of collecting high-quality labels as well as demanding computational requirements, the task has not had a satisfying general approach. Typically, only small subsets of the problem are considered, such as identifying violent content. In cases where the general problem is tackled, rough approximations and simplifications are made to deal with the lack of labels and computational complexity. In this work, we identify and tackle the two main obstacles. First, we create a dataset of approximately 4000 video clips, annotated by professionals in the field. Secondly, we demonstrate that advances in video recognition enable training models on our dataset that consider the full context of the scene. We conduct an in-depth study on our modeling choices and find that we greatly benefit from combining the visual and audio modality and that pretraining on large-scale video recognition datasets and class balanced sampling further improves performance. We additionally perform a qualitative study that reveals the heavily multi-modal nature of our dataset. Our dataset will be made available upon publication.
    A Lightweight ReLU-Based Feature Fusion for Aerial Scene Classification. (arXiv:2106.07879v1 [eess.IV])
    (2 min) In this paper, we propose a transfer-learning based model construction technique for the aerial scene classification problem. The core of our technique is a layer selection strategy, named ReLU-Based Feature Fusion (RBFF), that extracts feature maps from a pretrained CNN-based single-object image classification model, namely MobileNetV2, and constructs a model for the aerial scene classification task. RBFF stacks features extracted from the batch normalization layer of a few selected blocks of MobileNetV2, where the candidate blocks are selected based on the characteristics of the ReLU activation layers present in those blocks. The feature vector is then compressed into a low-dimensional feature space using dimension reduction algorithms on which we train a low-cost SVM classifier for the classification of the aerial images. We validate our choice of selected features based on the significance of the extracted features with respect to our classification pipeline. RBFF remarkably does not involve any training of the base CNN model except for a few parameters for the classifier, which makes the technique very cost-effective for practical deployments. The constructed model despite being lightweight outperforms several recently proposed models in terms of accuracy for a number of aerial scene datasets.
    Generating Thermal Human Faces for Physiological Assessment Using Thermal Sensor Auxiliary Labels. (arXiv:2106.08091v1 [cs.CV])
    (2 min) Thermal images reveal medically important physiological information about human stress, signs of inflammation, and emotional mood that cannot be seen on visible images. Providing a method to generate thermal faces from visible images would be highly valuable for the telemedicine community in order to show this medical information. To the best of our knowledge, there are limited works on visible-to-thermal (VT) face translation, and many current works go the opposite direction to generate visible faces from thermal surveillance images (TV) for law enforcement applications. As a result, we introduce favtGAN, a VT GAN which uses the pix2pix image translation model with an auxiliary sensor label prediction network for generating thermal faces from visible images. Since most TV methods are trained on only one data source drawn from one thermal sensor, we combine datasets from faces and cityscapes. These combined data are captured from similar sensors in order to bootstrap the training and transfer learning task, especially valuable because visible-thermal face datasets are limited. Experiments on these combined datasets show that favtGAN demonstrates an increase in SSIM and PSNR scores of generated thermal faces, compared to training on a single face dataset alone.
    DFM: A Performance Baseline for Deep Feature Matching. (arXiv:2106.07791v1 [cs.CV])
    (2 min) A novel image matching method is proposed that utilizes learned features extracted by an off-the-shelf deep neural network to obtain a promising performance. The proposed method uses pre-trained VGG architecture as a feature extractor and does not require any additional training specific to improve matching. Inspired by well-established concepts in the psychology area, such as the Mental Rotation paradigm, an initial warping is performed as a result of a preliminary geometric transformation estimate. These estimates are simply based on dense matching of nearest neighbors at the terminal layer of VGG network outputs of the images to be matched. After this initial alignment, the same approach is repeated again between reference and aligned images in a hierarchical manner to reach a good localization and matching performance. Our algorithm achieves 0.57 and 0.80 overall scores in terms of Mean Matching Accuracy (MMA) for 1 pixel and 2 pixels thresholds respectively on Hpatches dataset, which indicates a better performance than the state-of-the-art.
    End-to-End Learning of Keypoint Representations for Continuous Control from Images. (arXiv:2106.07995v1 [cs.LG])
    (2 min) In many control problems that include vision, optimal controls can be inferred from the location of the objects in the scene. This information can be represented using keypoints, which is a list of spatial locations in the input image. Previous works show that keypoint representations learned during unsupervised pre-training using encoder-decoder architectures can provide good features for control tasks. In this paper, we show that it is possible to learn efficient keypoint representations end-to-end, without the need for unsupervised pre-training, decoders, or additional losses. Our proposed architecture consists of a differentiable keypoint extractor that feeds the coordinates of the estimated keypoints directly to a soft actor-critic agent. The proposed algorithm yields performance competitive to the state-of-the art on DeepMind Control Suite tasks.
    Relation Modeling in Spatio-Temporal Action Localization. (arXiv:2106.08061v1 [cs.CV])
    (2 min) This paper presents our solution to the AVA-Kinetics Crossover Challenge of ActivityNet workshop at CVPR 2021. Our solution utilizes multiple types of relation modeling methods for spatio-temporal action detection and adopts a training strategy to integrate multiple relation modeling in end-to-end training over the two large-scale video datasets. Learning with memory bank and finetuning for long-tailed distribution are also investigated to further improve the performance. In this paper, we detail the implementations of our solution and provide experiments results and corresponding discussions. We finally achieve 40.67 mAP on the test set of AVA-Kinetics.
    How Modular Should Neural Module Networks Be for Systematic Generalization?. (arXiv:2106.08170v1 [cs.LG])
    (2 min) Neural Module Networks (NMNs) aim at Visual Question Answering (VQA) via composition of modules that tackle a sub-task. NMNs are a promising strategy to achieve systematic generalization, i.e. overcoming biasing factors in the training distribution. However, the aspects of NMNs that facilitate systematic generalization are not fully understood. In this paper, we demonstrate that the stage and the degree at which modularity is defined has large influence on systematic generalization. In a series of experiments on three VQA datasets (MNIST with multiple attributes, SQOOP, and CLEVR-CoGenT), our results reveal that tuning the degree of modularity in the network, especially at the image encoder stage, reaches substantially higher systematic generalization. These findings lead to new NMN architectures that outperform previous ones in terms of systematic generalization.
    Hotel Recognition via Latent Image Embedding. (arXiv:2106.08042v1 [cs.CV])
    (2 min) We approach the problem of hotel recognition with deep metric learning. We overview the existing approaches and propose a modification to Contrastive loss called Contrastive-Triplet loss. We construct a robust pipeline for benchmarking metric learning models and perform experiments on Hotels-50K and CUB200 datasets. Contrastive-Triplet loss is shown to achieve better retrieval on Hotels-50k. We open-source our code.
    Cine-MRI detection of abdominal adhesions with spatio-temporal deep learning. (arXiv:2106.08094v1 [eess.IV])
    (2 min) Adhesions are an important cause of chronic pain following abdominal surgery. Recent developments in abdominal cine-MRI have enabled the non-invasive diagnosis of adhesions. Adhesions are identified on cine-MRI by the absence of sliding motion during movement. Diagnosis and mapping of adhesions improves the management of patients with pain. Detection of abdominal adhesions on cine-MRI is challenging from both a radiological and deep learning perspective. We focus on classifying presence or absence of adhesions in sagittal abdominal cine-MRI series. We experimented with spatio-temporal deep learning architectures centered around a ConvGRU architecture. A hybrid architecture comprising a ResNet followed by a ConvGRU model allows to classify a whole time-series. Compared to a stand-alone ResNet with a two time-point (inspiration/expiration) input, we show an increase in classification performance (AUROC) from 0.74 to 0.83 ($p<0.05$). Our full temporal classification approach adds only a small amount (5%) of parameters to the entire architecture, which may be useful for other medical imaging problems with a temporal dimension.
    SAR Image Classification Based on Spiking Neural Network through Spike-Time Dependent Plasticity and Gradient Descent. (arXiv:2106.08005v1 [cs.CV])
    (2 min) At present, the Synthetic Aperture Radar (SAR) image classification method based on convolution neural network (CNN) has faced some problems such as poor noise resistance and generalization ability. Spiking neural network (SNN) is one of the core components of brain-like intelligence and has good application prospects. This article constructs a complete SAR image classifier based on unsupervised and supervised learning of SNN by using spike sequences with complex spatio-temporal information. We firstly expound the spiking neuron model, the receptive field of SNN, and the construction of spike sequence. Then we put forward an unsupervised learning algorithm based on STDP and a supervised learning algorithm based on gradient descent. The average classification accuracy of single layer and bilayer unsupervised learning SNN in three categories images on MSTAR dataset is 80.8\% and 85.1\%, respectively. Furthermore, the convergent output spike sequences of unsupervised learning can be used as teaching signals. Based on the TensorFlow framework, a single layer supervised learning SNN is built from the bottom, and the classification accuracy reaches 90.05\%. By comparing noise resistance and model parameters between SNNs and CNNs, the effectiveness and outstanding advantages of SNN are verified. Code to reproduce our experiments is available at \url{https://github.com/Jiankun-chen/Supervised-SNN-with-GD}.
    Efficient Micro-Structured Weight Unification for Neural Network Compression. (arXiv:2106.08301v1 [cs.LG])
    (2 min) Compressing Deep Neural Network (DNN) models to alleviate the storage and computation requirements is essential for practical applications, especially for resource limited devices. Although capable of reducing a reasonable amount of model parameters, previous unstructured or structured weight pruning methods can hardly truly accelerate inference, either due to the poor hardware compatibility of the unstructured sparsity or due to the low sparse rate of the structurally pruned network. Aiming at reducing both storage and computation, as well as preserving the original task performance, we propose a generalized weight unification framework at a hardware compatible micro-structured level to achieve high amount of compression and acceleration. Weight coefficients of a selected micro-structured block are unified to reduce the storage and computation of the block without changing the neuron connections, which turns to a micro-structured pruning special case when all unified coefficients are set to zero, where neuron connections (hence storage and computation) are completely removed. In addition, we developed an effective training framework based on the alternating direction method of multipliers (ADMM), which converts our complex constrained optimization into separately solvable subproblems. Through iteratively optimizing the subproblems, the desired micro-structure can be ensured with high compression ratio and low performance degradation. We extensively evaluated our method using a variety of benchmark models and datasets for different applications. Experimental results demonstrate state-of-the-art performance.
    Closing the Reality Gap with Unsupervised Sim-to-Real Image Translation. (arXiv:1911.01529v2 [cs.LG] UPDATED)
    (2 min) Deep learning approaches have become the standard solution to many problems in computer vision and robotics, but obtaining sufficient training data in high enough quality is challenging, as human labor is error prone, time consuming, and expensive. Solutions based on simulation have become more popular in recent years, but the gap between simulation and reality is still a major issue. In this paper, we introduce a novel method for augmenting synthetic image data through unsupervised image-to-image translation by applying the style of real world images to simulated images with open source frameworks. The generated dataset is combined with conventional augmentation methods and is then applied to a neural network model running in real-time on autonomous soccer robots. Our evaluation shows a significant improvement compared to models trained on images generated entirely in simulation.
    A Value-Function-based Interior-point Method for Non-convex Bi-level Optimization. (arXiv:2106.07991v1 [math.OC])
    (2 min) Bi-level optimization model is able to capture a wide range of complex learning tasks with practical interest. Due to the witnessed efficiency in solving bi-level programs, gradient-based methods have gained popularity in the machine learning community. In this work, we propose a new gradient-based solution scheme, namely, the Bi-level Value-Function-based Interior-point Method (BVFIM). Following the main idea of the log-barrier interior-point scheme, we penalize the regularized value function of the lower level problem into the upper level objective. By further solving a sequence of differentiable unconstrained approximation problems, we consequently derive a sequential programming scheme. The numerical advantage of our scheme relies on the fact that, when gradient methods are applied to solve the approximation problem, we successfully avoid computing any expensive Hessian-vector or Jacobian-vector product. We prove the convergence without requiring any convexity assumption on either the upper level or the lower level objective. Experiments demonstrate the efficiency of the proposed BVFIM on non-convex bi-level problems.
    Weakly-Supervised Photo-realistic Texture Generation for 3D Face Reconstruction. (arXiv:2106.08148v1 [cs.CV])
    (2 min) Although much progress has been made recently in 3D face reconstruction, most previous work has been devoted to predicting accurate and fine-grained 3D shapes. In contrast, relatively little work has focused on generating high-fidelity face textures. Compared with the prosperity of photo-realistic 2D face image generation, high-fidelity 3D face texture generation has yet to be studied. In this paper, we proposed a novel UV map generation model that predicts the UV map from a single face image. The model consists of a UV sampler and a UV generator. By selectively sampling the input face image's pixels and adjusting their relative locations, the UV sampler generates an incomplete UV map that could faithfully reconstruct the original face. Missing textures in the incomplete UV map are further full-filled by the UV generator. The training is based on pseudo ground truth blended by the 3DMM texture and the input face texture, thus weakly supervised. To deal with the artifacts in the imperfect pseudo UV map, multiple partial UV map discriminators are leveraged.
    Encouraging Intra-Class Diversity Through a Reverse Contrastive Loss for Better Single-Source Domain Generalization. (arXiv:2106.07916v1 [cs.CV])
    (2 min) Traditional deep learning algorithms often fail to generalize when they are tested outside of the domain of training data. Because data distributions can change dynamically in real-life applications once a learned model is deployed, in this paper we are interested in single-source domain generalization (SDG) which aims to develop deep learning algorithms able to generalize from a single training domain where no information about the test domain is available at training time. Firstly, we design two simple MNISTbased SDG benchmarks, namely MNIST Color SDG-MP and MNIST Color SDG-UP, which highlight the two different fundamental SDG issues of increasing difficulties: 1) a class-correlated pattern in the training domain is missing (SDG-MP), or 2) uncorrelated with the class (SDG-UP), in the testing data domain. This is in sharp contrast with the current domain generalization (DG) benchmarks which mix up different correlation and variation factors and thereby make hard to disentangle success or failure factors when benchmarking DG algorithms. We further evaluate several state-of-the-art SDG algorithms through our simple benchmark, namely MNIST Color SDG-MP, and show that the issue SDG-MP is largely unsolved despite of a decade of efforts in developing DG algorithms. Finally, we also propose a partially reversed contrastive loss to encourage intra-class diversity and find less strongly correlated patterns, to deal with SDG-MP and show that the proposed approach is very effective on our MNIST Color SDG-MP benchmark.
    SUPER-ADAM: Faster and Universal Framework of Adaptive Gradients. (arXiv:2106.08208v1 [math.OC])
    (2 min) Adaptive gradient methods have shown excellent performance for solving many machine learning problems. Although multiple adaptive methods were recently studied, they mainly focus on either empirical or theoretical aspects and also only work for specific problems by using specific adaptive learning rates. It is desired to design a universal framework for practical algorithms of adaptive gradients with theoretical guarantee to solve general problems. To fill this gap, we propose a faster and universal framework of adaptive gradients (i.e., SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradient forms. Moreover, our framework can flexibly integrates the momentum and variance reduced techniques. In particular, our novel framework provides the convergence analysis support for adaptive gradient methods under the nonconvex setting. In theoretical analysis, we prove that our new algorithm can achieve the best known complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point of nonconvex optimization, which matches the lower bound for stochastic smooth nonconvex optimization. In numerical experiments, we employ various deep learning tasks to validate that our algorithm consistently outperforms the existing adaptive algorithms.
    A Clinically Inspired Approach for Melanoma classification. (arXiv:2106.08021v1 [cs.CV])
    (2 min) Melanoma is a leading cause of deaths due to skin cancer deaths and hence, early and effective diagnosis of melanoma is of interest. Current approaches for automated diagnosis of melanoma either use pattern recognition or analytical recognition like ABCDE (asymmetry, border, color, diameter and evolving) criterion. In practice however, a differential approach wherein outliers (ugly duckling) are detected and used to evaluate nevi/lesions. Incorporation of differential recognition in Computer Aided Diagnosis (CAD) systems has not been explored but can be beneficial as it can provide a clinical justification for the derived decision. We present a method for identifying and quantifying ugly ducklings by performing Intra-Patient Comparative Analysis (IPCA) of neighboring nevi. This is then incorporated in a CAD system design for melanoma detection. This design ensures flexibility to handle cases where IPCA is not possible. Our experiments on a public dataset show that the outlier information helps boost the sensitivity of detection by at least 4.1 % and specificity by 4.0 % to 8.9 %, depending on the use of a strong (EfficientNet) or moderately strong (VGG or ResNet) classifier.
    ResDepth: A Deep Prior For 3D Reconstruction From High-resolution Satellite Images. (arXiv:2106.08107v1 [eess.IV])
    (2 min) Modern optical satellite sensors enable high-resolution stereo reconstruction from space. But the challenging imaging conditions when observing the Earth from space push stereo matching to its limits. In practice, the resulting digital surface models (DSMs) are fairly noisy and often do not attain the accuracy needed for high-resolution applications such as 3D city modeling. Arguably, stereo correspondence based on low-level image similarity is insufficient and should be complemented with a-priori knowledge about the expected surface geometry beyond basic local smoothness. To that end, we introduce ResDepth, a convolutional neural network that learns such an expressive geometric prior from example data. ResDepth refines an initial, raw stereo DSM while conditioning the refinement on the images. I.e., it acts as a smart, learned post-processing filter and can seamlessly complement any stereo matching pipeline. In a series of experiments, we find that the proposed method consistently improves stereo DSMs both quantitatively and qualitatively. We show that the prior encoded in the network weights captures meaningful geometric characteristics of urban design, which also generalize across different districts and even from one city to another. Moreover, we demonstrate that, by training on a variety of stereo pairs, ResDepth can acquire a sufficient degree of invariance against variations in imaging conditions and acquisition geometry.
    Color2Style: Real-Time Exemplar-Based Image Colorization with Self-Reference Learning and Deep Feature Modulation. (arXiv:2106.08017v1 [cs.CV])
    (2 min) Legacy black-and-white photos are riddled with people's nostalgia and glorious memories of the past. To better relive the elapsed frozen moments, in this paper, we present a deep exemplar-based image colorization approach named Color2Style to resurrect these grayscale image media by filling them with vibrant colors. Generally, for exemplar-based colorization, unsupervised and unpaired training are usually adopted, due to the difficulty of obtaining input and ground truth image pairs. To train an exemplar-based colorization model, current algorithms usually strive to achieve two procedures: i) retrieving a large number of reference images with high similarity in advance, which is inevitably time-consuming and tedious; ii) designing complicated modules to transfer the colors of the reference image to the grayscale image, by calculating and leveraging the deep semantic correspondence between them (e.g., non-local operation). Contrary to the previous methods, we solve and simplify the above two steps in one end-to-end learning procedure. First, we adopt a self-augmented self-reference training scheme, where the reference image is generated by graphical transformations from the original colorful one whereby the training can be formulated in a paired manner. Second, instead of computing complex and inexplicable correspondence maps, our method exploits a simple yet effective deep feature modulation (DFM) module, which injects the color embeddings extracted from the reference image into the deep representations of the input grayscale image. Such design is much more lightweight and intelligible, achieving appealing performance with real-time processing speed. Moreover, our model does not require multifarious loss functions and regularization terms like existing methods, but only two widely used loss functions. Codes and models will be available at https://github.com/zhaohengyuan1/Color2Style.
    ReS2tAC -- UAV-Borne Real-Time SGM Stereo Optimized for Embedded ARM and CUDA Devices. (arXiv:2106.07927v1 [cs.CV])
    (2 min) With the emergence of low-cost robotic systems, such as unmanned aerial vehicle, the importance of embedded high-performance image processing has increased. For a long time, FPGAs were the only processing hardware that were capable of high-performance computing, while at the same time preserving a low power consumption, essential for embedded systems. However, the recently increasing availability of embedded GPU-based systems, such as the NVIDIA Jetson series, comprised of an ARM CPU and a NVIDIA Tegra GPU, allows for massively parallel embedded computing on graphics hardware. With this in mind, we propose an approach for real-time embedded stereo processing on ARM and CUDA-enabled devices, which is based on the popular and widely used Semi-Global Matching algorithm. In this, we propose an optimization of the algorithm for embedded CUDA GPUs, by using massively parallel computing, as well as using the NEON intrinsics to optimize the algorithm for vectorized SIMD processing on embedded ARM CPUs. We have evaluated our approach with different configurations on two public stereo benchmark datasets to demonstrate that they can reach an error rate as low as 3.3%. Furthermore, our experiments show that the fastest configuration of our approach reaches up to 46 FPS on VGA image resolution. Finally, in a use-case specific qualitative evaluation, we have evaluated the power consumption of our approach and deployed it on the DJI Manifold 2-G attached to a DJI Matrix 210v2 RTK unmanned aerial vehicle (UAV), demonstrating its suitability for real-time stereo processing onboard a UAV.
    Direction-aware Feature-level Frequency Decomposition for Single Image Deraining. (arXiv:2106.07941v1 [cs.CV])
    (2 min) We present a novel direction-aware feature-level frequency decomposition network for single image deraining. Compared with existing solutions, the proposed network has three compelling characteristics. First, unlike previous algorithms, we propose to perform frequency decomposition at feature-level instead of image-level, allowing both low-frequency maps containing structures and high-frequency maps containing details to be continuously refined during the training procedure. Second, we further establish communication channels between low-frequency maps and high-frequency maps to interactively capture structures from high-frequency maps and add them back to low-frequency maps and, simultaneously, extract details from low-frequency maps and send them back to high-frequency maps, thereby removing rain streaks while preserving more delicate features in the input image. Third, different from existing algorithms using convolutional filters consistent in all directions, we propose a direction-aware filter to capture the direction of rain streaks in order to more effectively and thoroughly purge the input images of rain streaks. We extensively evaluate the proposed approach in three representative datasets and experimental results corroborate our approach consistently outperforms state-of-the-art deraining algorithms.
    Optimal Latent Vector Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation. (arXiv:2106.08188v1 [cs.LG])
    (2 min) This paper addresses the domain shift problem for segmentation. As a solution, we propose OLVA, a novel and lightweight unsupervised domain adaptation method based on a Variational Auto-Encoder (VAE) and Optimal Transport (OT) theory. Thanks to the VAE, our model learns a shared cross-domain latent space that follows a normal distribution, which reduces the domain shift. To guarantee valid segmentations, our shared latent space is designed to model the shape rather than the intensity variations. We further rely on an OT loss to match and align the remaining discrepancy between the two domains in the latent space. We demonstrate OLVA's effectiveness for the segmentation of multiple cardiac structures on the public Multi-Modality Whole Heart Segmentation (MM-WHS) dataset, where the source domain consists of annotated 3D MR images and the unlabelled target domain of 3D CTs. Our results show remarkable improvements with an additional margin of 12.5\% dice score over concurrent generative training approaches.
    Canonical Face Embeddings. (arXiv:2106.07822v1 [cs.CV])
    (2 min) We present evidence that many common convolutional neural networks (CNNs) trained for face verification learn functions that are nearly equivalent under rotation. More specifically, we demonstrate that one face verification model's embeddings (i.e. last--layer activations) can be compared directly to another model's embeddings after only a rotation or linear transformation, with little performance penalty. This finding is demonstrated using IJB-C 1:1 verification across the combinations of ten modern off-the-shelf CNN-based face verification models which vary in training dataset, CNN architecture, way of using angular loss, or some combination of the 3, and achieve a mean true accept rate of 0.96 at a false accept rate of 0.01. When instead evaluating embeddings generated from two CNNs, where one CNN's embeddings are mapped with a linear transformation, the mean true accept rate drops to 0.95 using the same verification paradigm. Restricting these linear maps to only perform rotation produces a mean true accept rate of 0.91. These mappings' existence suggests that a common representation is learned by models with variation in training or structure. A discovery such as this likely has broad implications, and we provide an application in which face embeddings can be de-anonymized using a limited number of samples.
    Face Age Progression With Attribute Manipulation. (arXiv:2106.07696v1 [cs.CV])
    (2 min) Face is one of the predominant means of person recognition. In the process of ageing, human face is prone to many factors such as time, attributes, weather and other subject specific variations. The impact of these factors were not well studied in the literature of face aging. In this paper, we propose a novel holistic model in this regard viz., ``Face Age progression With Attribute Manipulation (FAWAM)", i.e. generating face images at different ages while simultaneously varying attributes and other subject specific characteristics. We address the task in a bottom-up manner, as two submodules i.e. face age progression and face attribute manipulation. For face aging, we use an attribute-conscious face aging model with a pyramidal generative adversarial network that can model age-specific facial changes while maintaining intrinsic subject specific characteristics. For facial attribute manipulation, the age processed facial image is manipulated with desired attributes while preserving other details unchanged, leveraging an attribute generative adversarial network architecture. We conduct extensive analysis in standard large scale datasets and our model achieves significant performance both quantitatively and qualitatively.
    Revisiting the Calibration of Modern Neural Networks. (arXiv:2106.07998v1 [cs.LG])
    (2 min) Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks. Many instances of miscalibration in modern neural networks have been reported, suggesting a trend that newer, more accurate models produce poorly calibrated predictions. Here, we revisit this question for recent state-of-the-art image classification models. We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties.
    Compositional Sketch Search. (arXiv:2106.08009v1 [cs.CV])
    (2 min) We present an algorithm for searching image collections using free-hand sketches that describe the appearance and relative positions of multiple objects. Sketch based image retrieval (SBIR) methods predominantly match queries containing a single, dominant object invariant to its position within an image. Our work exploits drawings as a concise and intuitive representation for specifying entire scene compositions. We train a convolutional neural network (CNN) to encode masked visual features from sketched objects, pooling these into a spatial descriptor encoding the spatial relationships and appearances of objects in the composition. Training the CNN backbone as a Siamese network under triplet loss yields a metric search embedding for measuring compositional similarity which may be efficiently leveraged for visual search by applying product quantization.
    Scaling Neural Tangent Kernels via Sketching and Random Features. (arXiv:2106.07880v1 [cs.LG])
    (2 min) The Neural Tangent Kernel (NTK) characterizes the behavior of infinitely-wide neural networks trained under least squares loss by gradient descent. Recent works also report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets. However, the computational complexity of kernel methods has limited its use in large-scale learning tasks. To accelerate learning with NTK, we design a near input-sparsity time approximation algorithm for NTK, by sketching the polynomial expansions of arc-cosine kernels: our sketch for the convolutional counterpart of NTK (CNTK) can transform any image using a linear runtime in the number of pixels. Furthermore, we prove a spectral approximation guarantee for the NTK matrix, by combining random features (based on leverage score sampling) of the arc-cosine kernels with a sketching algorithm. We benchmark our methods on various large-scale regression and classification tasks and show that a linear regressor trained on our CNTK features matches the accuracy of exact CNTK on CIFAR-10 dataset while achieving 150x speedup.
    Object detection and Autoencoder-based 6D pose estimation for highly cluttered Bin Picking. (arXiv:2106.08045v1 [cs.CV])
    (2 min) Bin picking is a core problem in industrial environments and robotics, with its main module as 6D pose estimation. However, industrial depth sensors have a lack of accuracy when it comes to small objects. Therefore, we propose a framework for pose estimation in highly cluttered scenes with small objects, which mainly relies on RGB data and makes use of depth information only for pose refinement. In this work, we compare synthetic data generation approaches for object detection and pose estimation and introduce a pose filtering algorithm that determines the most accurate estimated poses. We will make our
    Robust Out-of-Distribution Detection on Deep Probabilistic Generative Models. (arXiv:2106.07903v1 [cs.LG])
    (2 min) Out-of-distribution (OOD) detection is an important task in machine learning systems for ensuring their reliability and safety. Deep probabilistic generative models facilitate OOD detection by estimating the likelihood of a data sample. However, such models frequently assign a suspiciously high likelihood to a specific outlier. Several recent works have addressed this issue by training a neural network with auxiliary outliers, which are generated by perturbing the input data. In this paper, we discover that these approaches fail for certain OOD datasets. Thus, we suggest a new detection metric that operates without outlier exposure. We observe that our metric is robust to diverse variations of an image compared to the previous outlier-exposing methods. Furthermore, our proposed score requires neither auxiliary models nor additional training. Instead, this paper utilizes the likelihood ratio statistic in a new perspective to extract genuine properties from the given single deep probabilistic generative model. We also apply a novel numerical approximation to enable fast implementation. Finally, we demonstrate comprehensive experiments on various probabilistic generative models and show that our method achieves state-of-the-art performance.
    Demographic Fairness in Face Identification: The Watchlist Imbalance Effect. (arXiv:2106.08049v1 [cs.CV])
    (2 min) Recently, different researchers have found that the gallery composition of a face database can induce performance differentials to facial identification systems in which a probe image is compared against up to all stored reference images to reach a biometric decision. This negative effect is referred to as "watchlist imbalance effect". In this work, we present a method to theoretically estimate said effect for a biometric identification system given its verification performance across demographic groups and the composition of the used gallery. Further, we report results for identification experiments on differently composed demographic subsets, i.e. females and males, of the public academic MORPH database using the open-source ArcFace face recognition system. It is shown that the database composition has a huge impact on performance differentials in biometric identification systems, even if performance differentials are less pronounced in the verification scenario. This study represents the first detailed analysis of the watchlist imbalance effect which is expected to be of high interest for future research in the field of facial recognition.
    Simon Says: Evaluating and Mitigating Bias in Pruned Neural Networks with Knowledge Distillation. (arXiv:2106.07849v1 [cs.LG])
    (2 min) In recent years the ubiquitous deployment of AI has posed great concerns in regards to algorithmic bias, discrimination, and fairness. Compared to traditional forms of bias or discrimination caused by humans, algorithmic bias generated by AI is more abstract and unintuitive therefore more difficult to explain and mitigate. A clear gap exists in the current literature on evaluating and mitigating bias in pruned neural networks. In this work, we strive to tackle the challenging issues of evaluating, mitigating, and explaining induced bias in pruned neural networks. Our paper makes three contributions. First, we propose two simple yet effective metrics, Combined Error Variance (CEV) and Symmetric Distance Error (SDE), to quantitatively evaluate the induced bias prevention quality of pruned models. Second, we demonstrate that knowledge distillation can mitigate induced bias in pruned neural networks, even with unbalanced datasets. Third, we reveal that model similarity has strong correlations with pruning induced bias, which provides a powerful method to explain why bias occurs in pruned neural networks. Our code is available at https://github.com/codestar12/pruning-distilation-bias
    Efficient Facial Expression Analysis For Dimensional Affect Recognition Using Geometric Features. (arXiv:2106.07817v1 [cs.CV])
    (2 min) Despite their continued popularity, categorical approaches to affect recognition have limitations, especially in real-life situations. Dimensional models of affect offer important advantages for the recognition of subtle expressions and more fine-grained analysis. We introduce a simple but effective facial expression analysis (FEA) system for dimensional affect, solely based on geometric features and Partial Least Squares (PLS) regression. The system jointly learns to estimate Arousal and Valence ratings from a set of facial images. The proposed approach is robust, efficient, and exhibits comparable performance to contemporary deep learning models, while requiring a fraction of the computational resources.
    Learning Audio-Visual Dereverberation. (arXiv:2106.07732v1 [cs.SD])
    (2 min) Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry, materials, and speaker location, all of which influence the precise reverberation effects in the audio stream. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene. In support of this new task, we develop a large-scale dataset that uses realistic acoustic renderings of speech in real-world 3D scans of homes offering a variety of room acoustics. Demonstrating our approach on both simulated and real imagery for speech enhancement, speech recognition, and speaker identification, we show it achieves state-of-the-art performance and substantially improves over traditional audio-only methods. Project page: this http URL
    Defending Touch-based Continuous Authentication Systems from Active Adversaries Using Generative Adversarial Networks. (arXiv:2106.07867v1 [cs.CR])
    (2 min) Previous studies have demonstrated that commonly studied (vanilla) touch-based continuous authentication systems (V-TCAS) are susceptible to population attack. This paper proposes a novel Generative Adversarial Network assisted TCAS (G-TCAS) framework, which showed more resilience to the population attack. G-TCAS framework was tested on a dataset of 117 users who interacted with a smartphone and tablet pair. On average, the increase in the false accept rates (FARs) for V-TCAS was much higher (22%) than G-TCAS (13%) for the smartphone. Likewise, the increase in the FARs for V-TCAS was 25% compared to G-TCAS (6%) for the tablet.
    CathAI: Fully Automated Interpretation of Coronary Angiograms Using Neural Networks. (arXiv:2106.07708v1 [cs.LG])
    (2 min) Coronary heart disease (CHD) is the leading cause of adult death in the United States and worldwide, and for which the coronary angiography procedure is the primary gateway for diagnosis and clinical management decisions. The standard-of-care for interpretation of coronary angiograms depends upon ad-hoc visual assessment by the physician operator. However, ad-hoc visual interpretation of angiograms is poorly reproducible, highly variable and bias prone. Here we show for the first time that fully-automated angiogram interpretation to estimate coronary artery stenosis is possible using a sequence of deep neural network algorithms. The algorithmic pipeline we developed--called CathAI--achieves state-of-the art performance across the sequence of tasks required to accomplish automated interpretation of unselected, real-world angiograms. CathAI (Algorithms 1-2) demonstrated positive predictive value, sensitivity and F1 score of >=90% to identify the projection angle overall and >=93% for left or right coronary artery angiogram detection, the primary anatomic structures of interest. To predict obstructive coronary artery stenosis (>=70% stenosis), CathAI (Algorithm 4) exhibited an area under the receiver operating characteristic curve (AUC) of 0.862 (95% CI: 0.843-0.880). When externally validated in a healthcare system in another country, CathAI AUC was 0.869 (95% CI: 0.830-0.907) to predict obstructive coronary artery stenosis. Our results demonstrate that multiple purpose-built neural networks can function in sequence to accomplish the complex series of tasks required for automated analysis of real-world angiograms. Deployment of CathAI may serve to increase standardization and reproducibility in coronary stenosis assessment, while providing a robust foundation to accomplish future tasks for algorithmic angiographic interpretation.
    Temporal Consistency Checks to Detect LiDAR Spoofing Attacks on Autonomous Vehicle Perception. (arXiv:2106.07833v1 [cs.CR])
    (2 min) LiDAR sensors are used widely in Autonomous Vehicles for better perceiving the environment which enables safer driving decisions. Recent work has demonstrated serious LiDAR spoofing attacks with alarming consequences. In particular, model-level LiDAR spoofing attacks aim to inject fake depth measurements to elicit ghost objects that are erroneously detected by 3D Object Detectors, resulting in hazardous driving decisions. In this work, we explore the use of motion as a physical invariant of genuine objects for detecting such attacks. Based on this, we propose a general methodology, 3D Temporal Consistency Check (3D-TC2), which leverages spatio-temporal information from motion prediction to verify objects detected by 3D Object Detectors. Our preliminary design and implementation of a 3D-TC2 prototype demonstrates very promising performance, providing more than 98% attack detection rate with a recall of 91% for detecting spoofed Vehicle (Car) objects, and is able to achieve real-time detection at 41Hz
    A Hybrid mmWave and Camera System for Long-Range Depth Imaging. (arXiv:2106.07856v1 [cs.CV])
    (2 min) mmWave radars offer excellent depth resolution owing to their high bandwidth at mmWave radio frequencies. Yet, they suffer intrinsically from poor angular resolution, that is an order-of-magnitude worse than camera systems, and are therefore not a capable 3-D imaging solution in isolation. We propose Metamoran, a system that combines the complimentary strengths of radar and camera systems to obtain depth images at high azimuthal resolutions at distances of several tens of meters with high accuracy, all from a single fixed vantage point. Metamoran enables rich long-range depth imaging outdoors with applications to roadside safety infrastructure, surveillance and wide-area mapping. Our key insight is to use the high azimuth resolution from cameras using computer vision techniques, including image segmentation and monocular depth estimation, to obtain object shapes and use these as priors for our novel specular beamforming algorithm. We also design this algorithm to work in cluttered environments with weak reflections and in partially occluded scenarios. We perform a detailed evaluation of Metamoran's depth imaging and sensing capabilities in 200 diverse scenes at a major U.S. city. Our evaluation shows that Metamoran estimates the depth of an object up to 60~m away with a median error of 28~cm, an improvement of 13$\times$ compared to a naive radar+camera baseline and 23$\times$ compared to monocular depth estimation.
    Domain Adaptive SiamRPN++ for Object Tracking in the Wild. (arXiv:2106.07862v1 [cs.CV])
    (2 min) Benefit from large-scale training data, recent advances in Siamese-based object tracking have achieved compelling results on the normal sequences. Whilst Siamese-based trackers assume training and test data follow an identical distribution. Suppose there is a set of foggy or rainy test sequences, it cannot be guaranteed that the trackers trained on the normal images perform well on the data belonging to other domains. The problem of domain shift among training and test data has already been discussed in object detection and semantic segmentation areas, which, however, has not been investigated for visual tracking. To this end, based on SiamRPN++, we introduce a Domain Adaptive SiamRPN++, namely DASiamRPN++, to improve the cross-domain transferability and robustness of a tracker. Inspired by A-distance theory, we present two domain adaptive modules, Pixel Domain Adaptation (PDA) and Semantic Domain Adaptation (SDA). The PDA module aligns the feature maps of template and search region images to eliminate the pixel-level domain shift caused by weather, illumination, etc. The SDA module aligns the feature representations of the tracking target's appearance to eliminate the semantic-level domain shift. PDA and SDA modules reduce the domain disparity by learning domain classifiers in an adversarial training manner. The domain classifiers enforce the network to learn domain-invariant feature representations. Extensive experiments are performed on the standard datasets of two different domains, including synthetic foggy and TIR sequences, which demonstrate the transferability and domain adaptability of the proposed tracker.
    Learning Stable Classifiers by Transferring Unstable Features. (arXiv:2106.07847v1 [cs.LG])
    (2 min) We study transfer learning in the presence of spurious correlations. We experimentally demonstrate that directly transferring the stable feature extractor learned on the source task may not eliminate these biases for the target task. However, we hypothesize that the unstable features in the source task and those in the target task are directly related. By explicitly informing the target classifier of the source task's unstable features, we can regularize the biases in the target task. Specifically, we derive a representation that encodes the unstable features by contrasting different data environments in the source task. On the target task, we cluster data from this representation, and achieve robustness by minimizing the worst-case risk across all clusters. We evaluate our method on both text and image classifications. Empirical results demonstrate that our algorithm is able to maintain robustness on the target task, outperforming the best baseline by 22.9% in absolute accuracy across 12 transfer settings. Our code is available at https://github.com/YujiaBao/Tofu.
    Learning Deep Morphological Networks with Neural Architecture Search. (arXiv:2106.07714v1 [cs.CV])
    (2 min) Deep Neural Networks (DNNs) are generated by sequentially performing linear and non-linear processes. Using a combination of linear and non-linear procedures is critical for generating a sufficiently deep feature space. The majority of non-linear operators are derivations of activation functions or pooling functions. Mathematical morphology is a branch of mathematics that provides non-linear operators for a variety of image processing problems. We investigate the utility of integrating these operations in an end-to-end deep learning framework in this paper. DNNs are designed to acquire a realistic representation for a particular job. Morphological operators give topological descriptors that convey salient information about the shapes of objects depicted in images. We propose a method based on meta-learning to incorporate morphological operators into DNNs. The learned architecture demonstrates how our novel morphological operations significantly increase DNN performance on various tasks, including picture classification and edge detection.
    Reverse Engineering of Generative Models: Inferring Model Hyperparameters from Generated Images. (arXiv:2106.07873v1 [cs.CV])
    (2 min) State-of-the-art (SOTA) Generative Models (GMs) can synthesize photo-realistic images that are hard for humans to distinguish from genuine photos. We propose to perform reverse engineering of GMs to infer the model hyperparameters from the images generated by these models. We define a novel problem, "model parsing", as estimating GM network architectures and training loss functions by examining their generated images -- a task seemingly impossible for human beings. To tackle this problem, we propose a framework with two components: a Fingerprint Estimation Network (FEN), which estimates a GM fingerprint from a generated image by training with four constraints to encourage the fingerprint to have desired properties, and a Parsing Network (PN), which predicts network architecture and loss functions from the estimated fingerprints. To evaluate our approach, we collect a fake image dataset with $100$K images generated by $100$ GMs. Extensive experiments show encouraging results in parsing the hyperparameters of the unseen models. Finally, our fingerprint estimation can be leveraged for deepfake detection and image attribution, as we show by reporting SOTA results on both the recent Celeb-DF and image attribution benchmarks.
    Learning to Aggregate and Personalize 3D Face from In-the-Wild Photo Collection. (arXiv:2106.07852v1 [cs.CV])
    (2 min) Non-parametric face modeling aims to reconstruct 3D face only from images without shape assumptions. While plausible facial details are predicted, the models tend to over-depend on local color appearance and suffer from ambiguous noise. To address such problem, this paper presents a novel Learning to Aggregate and Personalize (LAP) framework for unsupervised robust 3D face modeling. Instead of using controlled environment, the proposed method implicitly disentangles ID-consistent and scene-specific face from unconstrained photo set. Specifically, to learn ID-consistent face, LAP adaptively aggregates intrinsic face factors of an identity based on a novel curriculum learning approach with relaxed consistency loss. To adapt the face for a personalized scene, we propose a novel attribute-refining network to modify ID-consistent face with target attribute and details. Based on the proposed method, we make unsupervised 3D face modeling benefit from meaningful image facial structure and possibly higher resolutions. Extensive experiments on benchmarks show LAP recovers superior or competitive face shape and texture, compared with state-of-the-art (SOTA) methods with or without prior and supervision.
    Cluster-guided Asymmetric Contrastive Learning for Unsupervised Person Re-Identification. (arXiv:2106.07846v1 [cs.CV])
    (2 min) Unsupervised person re-identification (Re-ID) aims to match pedestrian images from different camera views in unsupervised setting. Existing methods for unsupervised person Re-ID are usually built upon the pseudo labels from clustering. However, the quality of clustering depends heavily on the quality of the learned features, which are overwhelmingly dominated by the colors in images especially in the unsupervised setting. In this paper, we propose a Cluster-guided Asymmetric Contrastive Learning (CACL) approach for unsupervised person Re-ID, in which cluster structure is leveraged to guide the feature learning in a properly designed asymmetric contrastive learning framework. To be specific, we propose a novel cluster-level contrastive loss to help the siamese network effectively mine the invariance in feature learning with respect to the cluster structure within and between different data augmentation views, respectively. Extensive experiments conducted on three benchmark datasets demonstrate superior performance of our proposal.
    Keep CALM and Improve Visual Feature Attribution. (arXiv:2106.07861v1 [cs.CV])
    (2 min) The class activation mapping, or CAM, has been the cornerstone of feature attribution methods for multiple vision tasks. Its simplicity and effectiveness have led to wide applications in the explanation of visual predictions and weakly-supervised localization tasks. However, CAM has its own shortcomings. The computation of attribution maps relies on ad-hoc calibration steps that are not part of the training computational graph, making it difficult for us to understand the real meaning of the attribution values. In this paper, we improve CAM by explicitly incorporating a latent variable encoding the location of the cue for recognition in the formulation, thereby subsuming the attribution map into the training computational graph. The resulting model, class activation latent mapping, or CALM, is trained with the expectation-maximization algorithm. Our experiments show that CALM identifies discriminative attributes for image classifiers more accurately than CAM and other visual attribution baselines. CALM also shows performance improvements over prior arts on the weakly-supervised object localization benchmarks. Our code is available at https://github.com/naver-ai/calm.
    Highdicom: A Python library for standardized encoding of image annotations and machine learning model outputs in pathology and radiology. (arXiv:2106.07806v1 [eess.IV])
    (2 min) Machine learning is revolutionizing image-based diagnostics in pathology and radiology. ML models have shown promising results in research settings, but their lack of interoperability has been a major barrier for clinical integration and evaluation. The DICOM a standard specifies Information Object Definitions and Services for the representation and communication of digital images and related information, including image-derived annotations and analysis results. However, the complexity of the standard represents an obstacle for its adoption in the ML community and creates a need for software libraries and tools that simplify working with data sets in DICOM format. Here we present the highdicom library, which provides a high-level application programming interface for the Python programming language that abstracts low-level details of the standard and enables encoding and decoding of image-derived information in DICOM format in a few lines of Python code. The highdicom library ties into the extensive Python ecosystem for image processing and machine learning. Simultaneously, by simplifying creation and parsing of DICOM-compliant files, highdicom achieves interoperability with the medical imaging systems that hold the data used to train and run ML models, and ultimately communicate and store model outputs for clinical use. We demonstrate through experiments with slide microscopy and computed tomography imaging, that, by bridging these two ecosystems, highdicom enables developers to train and evaluate state-of-the-art ML models in pathology and radiology while remaining compliant with the DICOM standard and interoperable with clinical systems at all stages. To promote standardization of ML research and streamline the ML model development and deployment process, we made the library available free and open-source.
    Vision-Language Navigation with Random Environmental Mixup. (arXiv:2106.07876v1 [cs.CV])
    (2 min) Vision-language Navigation (VLN) tasks require an agent to navigate step-by-step while perceiving the visual observations and comprehending a natural language instruction. Large data bias, which is caused by the disparity ratio between the small data scale and large navigation space, makes the VLN task challenging. Previous works have proposed various data augmentation methods to reduce data bias. However, these works do not explicitly reduce the data bias across different house scenes. Therefore, the agent would overfit to the seen scenes and achieve poor navigation performance in the unseen scenes. To tackle this problem, we propose the Random Environmental Mixup (REM) method, which generates cross-connected house scenes as augmented data via mixuping environment. Specifically, we first select key viewpoints according to the room connection graph for each scene. Then, we cross-connect the key views of different scenes to construct augmented scenes. Finally, we generate augmented instruction-path pairs in the cross-connected scenes. The experimental results on benchmark datasets demonstrate that our augmentation data via REM help the agent reduce its performance gap between the seen and unseen environment and improve the overall performance, making our model the best existing approach on the standard VLN benchmark.
    Potato Crop Stress Identification in Aerial Images using Deep Learning-based Object Detection. (arXiv:2106.07770v1 [cs.CV])
    (2 min) Recent research on the application of remote sensing and deep learning-based analysis in precision agriculture demonstrated a potential for improved crop management and reduced environmental impacts of agricultural production. Despite the promising results, the practical relevance of these technologies for actual field deployment requires novel algorithms that are customized for analysis of agricultural images and robust to implementation on natural field imagery. The paper presents an approach for analyzing aerial images of a potato crop using deep neural networks. The main objective is to demonstrate automated spatial recognition of a healthy versus stressed crop at a plant level. Specifically, we examine premature plant senescence resulting in drought stress on Russet Burbank potato plants. The proposed deep learning model, named Retina-UNet-Ag, is a variant of Retina-UNet (Jaeger et al., 2018) and includes connections from low-level semantic dense representation maps to the feature pyramid network. The paper also introduces a dataset of field images acquired with a Parrot Sequoia camera carried by a Solo unmanned aerial vehicle. Experimental validation demonstrated the ability for distinguishing healthy and stressed plants in field images, achieving an average Dice score coefficient of 0.74. A comparison to related state-of-the-art deep learning models for object detection revealed that the presented approach is effective for the task at hand. The method applied here is conducive toward the assessment and recognition of potato crop stress (early plant senescence resulting from drought stress in this case) in natural aerial field images collected under real conditions.
    Dynamic Distillation Network for Cross-Domain Few-Shot Recognition with Unlabeled Data. (arXiv:2106.07807v1 [cs.CV])
    (2 min) Most existing works in few-shot learning rely on meta-learning the network on a large base dataset which is typically from the same domain as the target dataset. We tackle the problem of cross-domain few-shot learning where there is a large shift between the base and target domain. The problem of cross-domain few-shot recognition with unlabeled target data is largely unaddressed in the literature. STARTUP was the first method that tackles this problem using self-training. However, it uses a fixed teacher pretrained on a labeled base dataset to create soft labels for the unlabeled target samples. As the base dataset and unlabeled dataset are from different domains, projecting the target images in the class-domain of the base dataset with a fixed pretrained model might be sub-optimal. We propose a simple dynamic distillation-based approach to facilitate unlabeled images from the novel/base dataset. We impose consistency regularization by calculating predictions from the weakly-augmented versions of the unlabeled images from a teacher network and matching it with the strongly augmented versions of the same images from a student network. The parameters of the teacher network are updated as exponential moving average of the parameters of the student network. We show that the proposed network learns representation that can be easily adapted to the target domain even though it has not been trained with target-specific classes during the pretraining phase. Our model outperforms the current state-of-the art method by 4.4% for 1-shot and 3.6% for 5-shot classification in the BSCD-FSL benchmark, and also shows competitive performance on traditional in-domain few-shot learning task. Our code will be available at: https://github.com/asrafulashiq/dynamic-cdfsl.
    G$^2$DA: Geometry-Guided Dual-Alignment Learning for RGB-Infrared Person Re-Identification. (arXiv:2106.07853v1 [cs.CV])
    (2 min) RGB-Infrared (IR) person re-identification aims to retrieve person-of-interest between heterogeneous modalities, suffering from large modality discrepancy caused by different sensory devices. Existing methods mainly focus on global-level modality alignment, whereas neglect sample-level modality divergence to some extent, leading to performance degradation. This paper attempts to find RGB-IR ReID solutions from tackling sample-level modality difference, and presents a Geometry-Guided Dual-Alignment learning framework (G$^2$DA), which jointly enhances modality-invariance and reinforces discriminability with human topological structure in features to boost the overall matching performance. Specifically, G$^2$DA extracts accurate body part features with a pose estimator, serving as a semantic bridge complementing the missing local details in global descriptor. Based on extracted local and global features, a novel distribution constraint derived from optimal transport is introduced to mitigate the modality gap in a fine-grained sample-level manner. Beyond pair-wise relations across two modalities, it additionally measures the structural similarity of different parts, thus both multi-level features and their relations are kept consistent in the common feature space. Considering the inherent human-topology information, we further advance a geometry-guided graph learning module to refine each part features, where relevant regions can be emphasized while meaningless ones are suppressed, effectively facilitating robust feature learning. Extensive experiments on two standard benchmark datasets validate the superiority of our proposed method, yielding competitive performance over the state-of-the-art approaches.
  • cs.IR updates on arXiv.org

    Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval. (arXiv:2104.01894v3 [cs.CL] UPDATED)
    (2 min) Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice -- both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. In this work, we extensively study and expand choices of encoder architectures, training methodology (including unimodal and multimodal pretraining), and other factors. Our experiments cover different types of speech in three datasets: Flickr Audio, Places Audio, and Localized Narratives. Our best model configuration achieves large gains over state of the art, e.g., pushing recall-at-one from 21.8% to 33.2% for Flickr Audio and 27.6% to 53.4% for Places Audio. We also show our best speech-based models can match or exceed cascaded ASR-to-text encoding when speech is spontaneous, accented, or otherwise hard to automatically transcribe.
    Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering. (arXiv:2010.11066v3 [cs.CL] UPDATED)
    (2 min) Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow given the speech utterances and text corpora. Different from traditional text question answering (QA) tasks, SCQA involves audio signal processing, passage comprehension, and contextual understanding. However, ASR systems introduce unexpected noisy signals to the transcriptions, which result in performance degradation on SCQA. To overcome the problem, we propose CADNet, a novel contextualized attention-based distillation approach, which applies both cross-attention and self-attention to obtain ASR-robust contextualized embedding representations of the passage and dialogue history for performance improvements. We also introduce the spoken conventional knowledge distillation framework to distill the ASR-robust knowledge from the estimated probabilities of the teacher model to the student. We conduct extensive experiments on the Spoken-CoQA dataset and demonstrate that our approach achieves remarkable performance in this task.
    Full Bitcoin Blockchain Data Made Easy. (arXiv:2106.08072v1 [cs.SI])
    (2 min) Despite the fact that it is publicly available, collecting and processing the full bitcoin blockchain data is not trivial. Its mere size, history, and other features indeed raise quite specific challenges, that we address in this paper. The strengths of our approach are the following: it relies on very basic and standard tools, which makes the procedure reliable and easily reproducible; it is a purely lossless procedure ensuring that we catch and preserve all existing data; it provides additional indexing that makes it easy to further process the whole data and select appropriate subsets of it. We present our procedure in details and illustrate its added value on large-scale use cases, like address clustering. We provide an implementation online, as well as the obtained dataset.
    Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup. (arXiv:2101.06983v2 [cs.LG] CROSS LISTED)
    (2 min) Contrastive learning has been applied successfully to learn vector representations of text. Previous research demonstrated that learning high-quality representations benefits from batch-wise contrastive loss with a large number of negatives. In practice, the technique of in-batch negative is used, where for each example in a batch, other batch examples' positives will be taken as its negatives, avoiding encoding extra negatives. This, however, still conditions each example's loss on all batch examples and requires fitting the entire large batch into GPU memory. This paper introduces a gradient caching technique that decouples backpropagation between contrastive loss and the encoder, removing encoder backward pass data dependency along the batch dimension. As a result, gradients can be computed for one subset of the batch at a time, leading to almost constant memory usage.
    Exploration in Online Advertising Systems with Deep Uncertainty-Aware Learning. (arXiv:2012.02298v2 [cs.IR] UPDATED)
    (2 min) Modern online advertising systems inevitably rely on personalization methods, such as click-through rate (CTR) prediction. Recent progress in CTR prediction enjoys the rich representation capabilities of deep learning and achieves great success in large-scale industrial applications. However, these methods can suffer from lack of exploration. Another line of prior work addresses the exploration-exploitation trade-off problem with contextual bandit methods, which are recently less studied in the industry due to the difficulty in extending their flexibility with deep models. In this paper, we propose a novel Deep Uncertainty-Aware Learning (DUAL) method to learn CTR models based on Gaussian processes, which can provide predictive uncertainty estimations while maintaining the flexibility of deep neural networks. DUAL can be easily implemented on existing models and deployed in real-time systems with minimal extra computational overhead. By linking the predictive uncertainty estimation ability of DUAL to well-known bandit algorithms, we further present DUAL-based Ad-ranking strategies to boost up long-term utilities such as the social welfare in advertising systems. Experimental results on several public datasets demonstrate the effectiveness of our methods. Remarkably, an online A/B test deployed in the Alibaba display advertising platform shows an 8.2% social welfare improvement and an 8.0% revenue lift.
    Interpretable Self-supervised Multi-task Learning for COVID-19 Information Retrieval and Extraction. (arXiv:2106.08252v1 [cs.IR])
    (2 min) The rapidly evolving literature of COVID-19 related articles makes it challenging for NLP models to be effectively trained for information retrieval and extraction with the corresponding labeled data that follows the current distribution of the pandemic. On the other hand, due to the uncertainty of the situation, human experts' supervision would always be required to double check the decision making of these models highlighting the importance of interpretability. In the light of these challenges, this study proposes an interpretable self-supervised multi-task learning model to jointly and effectively tackle the tasks of information retrieval (IR) and extraction (IE) during the current emergency health crisis situation. Our results show that our model effectively leverage the multi-task and self-supervised learning to improve generalization, data efficiency and robustness to the ongoing dataset shift problem. Our model outperforms baselines in IE and IR tasks, respectively by micro-f score of 0.08 (LCA-F score of 0.05), and MAP of 0.05 on average. In IE the zero- and few-shot learning performances are on average 0.32 and 0.19 micro-f score higher than those of the baselines.
    Hotel Recognition via Latent Image Embedding. (arXiv:2106.08042v1 [cs.CV])
    (2 min) We approach the problem of hotel recognition with deep metric learning. We overview the existing approaches and propose a modification to Contrastive loss called Contrastive-Triplet loss. We construct a robust pipeline for benchmarking metric learning models and perform experiments on Hotels-50K and CUB200 datasets. Contrastive-Triplet loss is shown to achieve better retrieval on Hotels-50k. We open-source our code.
    Field-Embedded Factorization Machines for Click-through rate prediction. (arXiv:2009.09931v2 [cs.IR] UPDATED)
    (2 min) Click-through rate (CTR) prediction models are common in many online applications such as digital advertising and recommender systems. Field-Aware Factorization Machine (FFM) and Field-weighted Factorization Machine (FwFM) are state-of-the-art among the shallow models for CTR prediction. Recently, many deep learning-based models have also been proposed. Among deeper models, DeepFM, xDeepFM, AutoInt+, and FiBiNet are state-of-the-art models. The deeper models combine a core architectural component, which learns explicit feature interactions, with a deep neural network (DNN) component. We propose a novel shallow Field-Embedded Factorization Machine (FEFM) and its deep counterpart Deep Field-Embedded Factorization Machine (DeepFEFM). FEFM learns symmetric matrix embeddings for each field pair along with the usual single vector embeddings for each feature. FEFM has significantly lower model complexity than FFM and roughly the same complexity as FwFM. FEFM also has insightful mathematical properties about important fields and field interactions. DeepFEFM combines the FEFM interaction vectors learned by the FEFM component with a DNN and is thus able to learn higher order interactions. We conducted comprehensive experiments over a wide range of hyperparameters on two large publicly available real-world datasets. When comparing test AUC and log loss, the results show that FEFM and DeepFEFM outperform the existing state-of-the-art shallow and deep models for CTR prediction tasks. We have made the code of FEFM and DeepFEFM available in the DeepCTR library (https://github.com/shenweichen/DeepCTR).
    Query Embedding on Hyper-relational Knowledge Graphs. (arXiv:2106.08166v1 [cs.AI])
    (2 min) Multi-hop logical reasoning is an established problem in the field of representation learning on knowledge graphs (KGs). It subsumes both one-hop link prediction as well as other more complex types of logical queries. Existing algorithms operate only on classical, triple-based graphs, whereas modern KGs often employ a hyper-relational modeling paradigm. In this paradigm, typed edges may have several key-value pairs known as qualifiers that provide fine-grained context for facts. In queries, this context modifies the meaning of relations, and usually reduces the answer set. Hyper-relational queries are often observed in real-world KG applications, and existing approaches for approximate query answering cannot make use of qualifier pairs. In this work, we bridge this gap and extend the multi-hop reasoning problem to hyper-relational KGs allowing to tackle this new type of complex queries. Building upon recent advancements in Graph Neural Networks and query embedding techniques, we study how to embed and answer hyper-relational conjunctive queries. Besides that, we propose a method to answer such queries and demonstrate in our experiments that qualifiers improve query answering on a diverse set of query patterns.
    To Infinity and Beyond! Accessibility is the Future for Kids' Search Engines. (arXiv:2106.07813v1 [cs.IR])
    (2 min) Research in the area of search engines for children remains in its infancy. Seminal works have studied how children use mainstream search engines, as well as how to design and evaluate custom search engines explicitly for children. These works, however, tend to take a one-size-fits-all view, treating children as a unit. Nevertheless, even at the same age, children are known to possess and exhibit different capabilities. These differences affect how children access and use search engines. To better serve children, in this vision paper, we spotlight accessibility and discuss why current research on children and search engines does not, but should, focus on this significant matter.
    Towards Axiomatic Explanations for Neural Ranking Models. (arXiv:2106.08019v1 [cs.IR])
    (2 min) Recently, neural networks have been successfully employed to improve upon state-of-the-art performance in ad-hoc retrieval tasks via machine-learned ranking functions. While neural retrieval models grow in complexity and impact, little is understood about their correspondence with well-studied IR principles. Recent work on interpretability in machine learning has provided tools and techniques to understand neural models in general, yet there has been little progress towards explaining ranking models. We investigate whether one can explain the behavior of neural ranking models in terms of their congruence with well understood principles of document ranking by using established theories from axiomatic IR. Axiomatic analysis of information retrieval models has formalized a set of constraints on ranking decisions that reasonable retrieval models should fulfill. We operationalize this axiomatic thinking to reproduce rankings based on combinations of elementary constraints. This allows us to investigate to what extent the ranking decisions of neural rankers can be explained in terms of retrieval axioms, and which axioms apply in which situations. Our experimental study considers a comprehensive set of axioms over several representative neural rankers. While the existing axioms can already explain the particularly confident ranking decisions rather well, future work should extend the axiom set to also cover the other still "unexplainable" neural IR rank decisions.
    Does your robot know? Enhancing children's information retrieval through spoken conversation with responsible robots. (arXiv:2106.07931v1 [cs.IR])
    (2 min) In this paper, we identify challenges in children's current information retrieval process, and propose conversational robots as an opportunity to ease this process in a responsible way. Tools children currently use in this process, such as search engines on a computer or voice agents, do not always meet their specific needs. The conversational robot we propose maintains context, asks clarifying questions, and gives suggestions in order to better meet children's needs. Since children are often too trusting of robots, we propose to have the robot measure, monitor and adapt to the trust the child has in the robot. This way, we hope to induce a critical attitude with the children during their information retrieval process.
    User-specific Adaptive Fine-tuning for Cross-domain Recommendations. (arXiv:2106.07864v1 [cs.IR])
    (2 min) Making accurate recommendations for cold-start users has been a longstanding and critical challenge for recommender systems (RS). Cross-domain recommendations (CDR) offer a solution to tackle such a cold-start problem when there is no sufficient data for the users who have rarely used the system. An effective approach in CDR is to leverage the knowledge (e.g., user representations) learned from a related but different domain and transfer it to the target domain. Fine-tuning works as an effective transfer learning technique for this objective, which adapts the parameters of a pre-trained model from the source domain to the target domain. However, current methods are mainly based on the global fine-tuning strategy: the decision of which layers of the pre-trained model to freeze or fine-tune is taken for all users in the target domain. In this paper, we argue that users in RS are personalized and should have their own fine-tuning policies for better preference transfer learning. As such, we propose a novel User-specific Adaptive Fine-tuning method (UAF), selecting which layers of the pre-trained network to fine-tune, on a per-user basis. Specifically, we devise a policy network with three alternative strategies to automatically decide which layers to be fine-tuned and which layers to have their parameters frozen for each user. Extensive experiments show that the proposed UAF exhibits significantly better and more robust performance for user cold-start recommendation.
    Can BERT Dig It? -- Named Entity Recognition for Information Retrieval in the Archaeology Domain. (arXiv:2106.07742v1 [cs.IR])
    (2 min) The amount of archaeological literature is growing rapidly. Until recently, these data were only accessible through metadata search. We implemented a text retrieval engine for a large archaeological text collection ($\sim 658$ Million words). In archaeological IR, domain-specific entities such as locations, time periods, and artefacts, play a central role. This motivated the development of a named entity recognition (NER) model to annotate the full collection with archaeological named entities. In this paper, we present ArcheoBERTje, a BERT model pre-trained on Dutch archaeological texts. We compare the model's quality and output on a Named Entity Recognition task to a generic multilingual model and a generic Dutch model. We also investigate ensemble methods for combining multiple BERT models, and combining the best BERT model with a domain thesaurus using Conditional Random Fields (CRF). We find that ArcheoBERTje outperforms both the multilingual and Dutch model significantly with a smaller standard deviation between runs, reaching an average F1 score of 0.735. The model also outperforms ensemble methods combining the three models. Combining ArcheoBERTje predictions and explicit domain knowledge from the thesaurus did not increase the F1 score. We quantitatively and qualitatively analyse the differences between the vocabulary and output of the BERT models on the full collection and provide some valuable insights in the effect of fine-tuning for specific domains. Our results indicate that for a highly specific text domain such as archaeology, further pre-training on domain-specific data increases the model's quality on NER by a much larger margin than shown for other domains in the literature, and that domain-specific pre-training makes the addition of domain knowledge from a thesaurus unnecessary.
    Incorporating Domain Knowledge into Health Recommender Systems using Hyperbolic Embeddings. (arXiv:2106.07720v1 [cs.IR])
    (2 min) In contrast to many other domains, recommender systems in health services may benefit particularly from the incorporation of health domain knowledge, as it helps to provide meaningful and personalised recommendations catering to the individual's health needs. With recent advances in representation learning enabling the hierarchical embedding of health knowledge into the hyperbolic Poincare space, this work proposes a content-based recommender system for patient-doctor matchmaking in primary care based on patients' health profiles, enriched by pre-trained Poincare embeddings of the ICD-9 codes through transfer learning. The proposed model outperforms its conventional counterpart in terms of recommendation accuracy and has several important business implications for improving the patient-doctor relationship.
  • cs.LG updates on arXiv.org

    Online Learning for Unknown Partially Observable MDPs. (arXiv:2102.12661v2 [cs.LG] UPDATED)
    (2 min) Solving Partially Observable Markov Decision Processes (POMDPs) is hard. Learning optimal controllers for POMDPs when the model is unknown is harder. Online learning of optimal controllers for unknown POMDPs, which requires efficient learning using regret-minimizing algorithms that effectively tradeoff exploration and exploitation, is even harder, and no solution exists currently. In this paper, we consider infinite-horizon average-cost POMDPs with unknown transition model, though a known observation model. We propose a natural posterior sampling-based reinforcement learning algorithm (PSRL-POMDP) and show that it achieves a regret bound of $O(\log T)$, where $T$ is the time horizon, when the parameter set is finite. In the general case (continuous parameter set), we show that the algorithm achieves $O (T^{2/3})$ regret under two technical assumptions. To the best of our knowledge, this is the first online RL algorithm for POMDPs and has sub-linear regret.
    Learning to Generate Task-Specific Adapters from Task Description. (arXiv:2101.00420v2 [cs.CL] UPDATED)
    (2 min) Pre-trained text-to-text transformers such as BART have achieved impressive performance across a range of NLP tasks. Recent study further shows that they can learn to generalize to novel tasks, by including task descriptions as part of the source sequence and training the model with (source, target) examples. At test time, these fine-tuned models can make inferences on new tasks using the new task descriptions as part of the input. However, this approach has potential limitations, as the model learns to solve individual (source, target) examples (i.e., at the instance level), instead of learning to solve tasks by taking all examples within a task as a whole (i.e., at the task level). To this end, we introduce Hypter, a framework that improves text-to-text transformer's generalization ability to unseen tasks by training a hypernetwork to generate task-specific, light-weight adapters from task descriptions. Experiments on ZEST dataset and a synthetic SQuAD dataset demonstrate that Hypter improves upon fine-tuning baselines. Notably, when using BART-Large as the main network, Hypter brings 11.3% comparative improvement on ZEST dataset.
    Deep Transfer Learning for Brain Magnetic Resonance Image Multi-class Classification. (arXiv:2106.07333v2 [cs.CV] UPDATED)
    (2 min) Magnetic Resonance Imaging (MRI) is a principal diagnostic approach used in the field of radiology to create images of the anatomical and physiological structure of patients. MRI is the prevalent medical imaging practice to find abnormalities in soft tissues. Traditionally they are analyzed by a radiologist to detect abnormalities in soft tissues, especially the brain. The process of interpreting a massive volume of patient's MRI is laborious. Hence, the use of Machine Learning methodologies can aid in detecting abnormalities in soft tissues with considerable accuracy. In this research, we have curated a novel dataset and developed a framework that uses Deep Transfer Learning to perform a multi-classification of tumors in the brain MRI images. In this paper, we adopted the Deep Residual Convolutional Neural Network (ResNet50) architecture for the experiments along with discriminative learning techniques to train the model. Using the novel dataset and two publicly available MRI brain datasets, this proposed approach attained a classification accuracy of 86.40% on the curated dataset, 93.80% on the Harvard Whole Brain Atlas dataset, and 97.05% accuracy on the School of Biomedical Engineering dataset. Results of our experiments significantly demonstrate our proposed framework for transfer learning is a potential and effective method for brain tumor multi-classification tasks.
    Unsupervised Program Synthesis for Images By Sampling Without Replacement. (arXiv:2001.10119v2 [cs.LG] UPDATED)
    (2 min) Program synthesis has emerged as a successful approach to the image parsing task. Most prior works rely on a two-step scheme involving supervised pretraining of a Seq2Seq model with synthetic programs followed by reinforcement learning (RL) for fine-tuning with real reference images. Fully unsupervised approaches promise to train the model directly on the target images without requiring curated pretraining datasets. However, they struggle with the inherent sparsity of meaningful programs in the search space. In this paper, we present the first unsupervised algorithm capable of parsing constructive solid geometry (CSG) images into context-free grammar (CFG) without pretraining via non-differentiable renderer. To tackle the \emph{non-Markovian} sparse reward problem, we combine three key ingredients -- (i) a grammar-encoded tree LSTM ensuring program validity (ii) entropy regularization and (iii) sampling without replacement from the CFG syntax tree. Empirically, our algorithm recovers meaningful programs in large search spaces (up to $3.8 \times 10^{28}$). Further, even though our approach is fully unsupervised, it generalizes better than supervised methods on the synthetic 2D CSG dataset. On the 2D computer aided design (CAD) dataset, our approach significantly outperforms the supervised pretrained model and is competitive to the refined model.
    Constraining Linear-chain CRFs to Regular Languages. (arXiv:2106.07306v2 [cs.LG] UPDATED)
    (2 min) In structured prediction, a major challenge for models is to represent the interdependencies within their output structures. For the common case where outputs are structured as a sequence, linear-chain conditional random fields (CRFs) are a widely used model class which can learn local dependencies in output sequences. However, the CRF's Markov assumption makes it impossible for these models to capture nonlocal dependencies, and standard CRFs are unable to respect nonlocal constraints of the data (such as global arity constraints on output labels). We present a generalization of CRFs that can enforce a broad class of constraints, including nonlocal ones, by specifying the space of possible output structures as a regular language $\mathcal{L}$. The resulting regular-constrained CRF (RegCCRF) has the same formal properties as a standard CRF, but assigns zero probability to all label sequences not in $\mathcal{L}$. Notably, RegCCRFs can incorporate their constraints during training, while related models only enforce constraints during decoding. We prove that constrained training is never worse than constrained decoding, and show using synthetic data that it can be substantially better in practice. Additionally, we demonstrate a practical benefit on downstream tasks by incorporating a RegCCRF into a deep neural model for semantic role labeling, exceeding state-of-the-art results on a standard dataset.
    RFpredInterval: An R Package for Prediction Intervals with Random Forests and Boosted Forests. (arXiv:2106.08217v1 [stat.ML])
    (2 min) Like many predictive models, random forests provide a point prediction for a new observation. Besides the point prediction, it is important to quantify the uncertainty in the prediction. Prediction intervals provide information about the reliability of the point predictions. We have developed a comprehensive R package, RFpredInterval, that integrates 16 methods to build prediction intervals with random forests and boosted forests. The methods implemented in the package are a new method to build prediction intervals with boosted forests (PIBF) and 15 different variants to produce prediction intervals with random forests proposed by Roy and Larocque (2020). We perform an extensive simulation study and apply real data analyses to compare the performance of the proposed method to ten existing methods to build prediction intervals with random forests. The results show that the proposed method is very competitive and, globally, it outperforms the competing methods.
    S-LIME: Stabilized-LIME for Model Explanation. (arXiv:2106.07875v1 [stat.ML])
    (2 min) An increasing number of machine learning models have been deployed in domains with high stakes such as finance and healthcare. Despite their superior performances, many models are black boxes in nature which are hard to explain. There are growing efforts for researchers to develop methods to interpret these black-box models. Post hoc explanations based on perturbations, such as LIME, are widely used approaches to interpret a machine learning model after it has been built. This class of methods has been shown to exhibit large instability, posing serious challenges to the effectiveness of the method itself and harming user trust. In this paper, we propose S-LIME, which utilizes a hypothesis testing framework based on central limit theorem for determining the number of perturbation points needed to guarantee stability of the resulting explanation. Experiments on both simulated and real world data sets are provided to demonstrate the effectiveness of our method.
    Exponential Reduction in Sample Complexity with Learning of Ising Model Dynamics. (arXiv:2104.00995v2 [cs.LG] UPDATED)
    (2 min) The usual setting for learning the structure and parameters of a graphical model assumes the availability of independent samples produced from the corresponding multivariate probability distribution. However, for many models the mixing time of the respective Markov chain can be very large and i.i.d. samples may not be obtained. We study the problem of reconstructing binary graphical models from correlated samples produced by a dynamical process, which is natural in many applications. We analyze the sample complexity of two estimators that are based on the interaction screening objective and the conditional likelihood loss. We observe that for samples coming from a dynamical process far from equilibrium, the sample complexity reduces exponentially compared to a dynamical process that mixes quickly.
    Revisiting the Calibration of Modern Neural Networks. (arXiv:2106.07998v1 [cs.LG])
    (2 min) Accurate estimation of predictive uncertainty (model calibration) is essential for the safe application of neural networks. Many instances of miscalibration in modern neural networks have been reported, suggesting a trend that newer, more accurate models produce poorly calibrated predictions. Here, we revisit this question for recent state-of-the-art image classification models. We systematically relate model calibration and accuracy, and find that the most recent models, notably those not using convolutions, are among the best calibrated. Trends observed in prior model generations, such as decay of calibration with distribution shift or model size, are less pronounced in recent architectures. We also show that model size and amount of pretraining do not fully explain these differences, suggesting that architecture is a major determinant of calibration properties.
    ProsoBeast Prosody Annotation Tool. (arXiv:2104.02397v2 [eess.AS] UPDATED)
    (2 min) The labelling of speech corpora is a laborious and time-consuming process. The ProsoBeast Annotation Tool seeks to ease and accelerate this process by providing an interactive 2D representation of the prosodic landscape of the data, in which contours are distributed based on their similarity. This interactive map allows the user to inspect and label the utterances. The tool integrates several state-of-the-art methods for dimensionality reduction and feature embedding, including variational autoencoders. The user can use these to find a good representation for their data. In addition, as most of these methods are stochastic, each can be used to generate an unlimited number of different prosodic maps. The web app then allows the user to seamlessly switch between these alternative representations in the annotation process. Experiments with a sample prosodically rich dataset have shown that the tool manages to find good representations of varied data and is helpful both for annotation and label correction. The tool is released as free software for use by the community.
    PairConnect: A Compute-Efficient MLP Alternative to Attention. (arXiv:2106.08235v1 [cs.LG])
    (2 min) Transformer models have demonstrated superior performance in natural language processing. The dot product self-attention in Transformer allows us to model interactions between words. However, this modeling comes with significant computational overhead. In this work, we revisit the memory-compute trade-off associated with Transformer, particularly multi-head attention, and show a memory-heavy but significantly more compute-efficient alternative to Transformer. Our proposal, denoted as PairConnect, a multilayer perceptron (MLP), models the pairwise interaction between words by explicit pairwise word embeddings. As a result, PairConnect substitutes self dot product with a simple embedding lookup. We show mathematically that despite being an MLP, our compute-efficient PairConnect is strictly more expressive than Transformer. Our experiment on language modeling tasks suggests that PairConnect could achieve comparable results with Transformer while reducing the computational cost associated with inference significantly.
    Towards Long-term Non-invasive Monitoring for Epilepsy via Wearable EEG Devices. (arXiv:2106.08008v1 [eess.SP])
    (2 min) We present the implementation of seizure detection algorithms based on a minimal number of EEG channels on a parallel ultra-low-power embedded platform. The analyses are based on the CHB-MIT dataset, and include explorations of different classification approaches (Support Vector Machines, Random Forest, Extra Trees, AdaBoost) and different pre/post-processing techniques to maximize sensitivity while guaranteeing no false alarms. We analyze global and subject-specific approaches, considering all 23-electrodes or only 4 temporal channels. For 8s window size and subject-specific approach, we report zero false positives and 100% sensitivity. These algorithms are parallelized and optimized for a parallel ultra-low power (PULP) platform, enabling 300h of continuous monitoring on a 300 mAh battery, in a wearable form factor and power budget. These results pave the way for the implementation of affordable, wearable, long-term epilepsy monitoring solutions with low false-positive rates and high sensitivity, meeting both patient and caregiver requirements.
    Weakly-supervised High-resolution Segmentation of Mammography Images for Breast Cancer Diagnosis. (arXiv:2106.07049v2 [cs.CV] UPDATED)
    (2 min) In the last few years, deep learning classifiers have shown promising results in image-based medical diagnosis. However, interpreting the outputs of these models remains a challenge. In cancer diagnosis, interpretability can be achieved by localizing the region of the input image responsible for the output, i.e. the location of a lesion. Alternatively, segmentation or detection models can be trained with pixel-wise annotations indicating the locations of malignant lesions. Unfortunately, acquiring such labels is labor-intensive and requires medical expertise. To overcome this difficulty, weakly-supervised localization can be utilized. These methods allow neural network classifiers to output saliency maps highlighting the regions of the input most relevant to the classification task (e.g. malignant lesions in mammograms) using only image-level labels (e.g. whether the patient has cancer or not) during training. When applied to high-resolution images, existing methods produce low-resolution saliency maps. This is problematic in applications in which suspicious lesions are small in relation to the image size. In this work, we introduce a novel neural network architecture to perform weakly-supervised segmentation of high-resolution images. The proposed model selects regions of interest via coarse-level localization, and then performs fine-grained segmentation of those regions. We apply this model to breast cancer diagnosis with screening mammography, and validate it on a large clinically-realistic dataset. Measured by Dice similarity score, our approach outperforms existing methods by a large margin in terms of localization performance of benign and malignant lesions, relatively improving the performance by 39.6% and 20.0%, respectively. Code and the weights of some of the models are available at https://github.com/nyukat/GLAM
    Dont Just Divide; Polarize and Conquer!. (arXiv:2102.11872v2 [cs.LG] UPDATED)
    (2 min) In data containing heterogeneous subpopulations, classification performance benefits from incorporating the knowledge of cluster structure in the classifier. Previous methods for such combined clustering and classification are either 1) classifier-specific and not generic, or 2) independently perform clustering and classifier training, which may not form clusters that can potentially benefit classifier performance. The question of how to perform clustering to improve the performance of classifiers trained on the clusters has received scant attention in previous literature, despite its importance in several real-world applications. In this paper, we design a simple and efficient classification algorithm called Clustering Aware Classification (CAC), to find clusters that are well suited for being used as training datasets by classifiers for each underlying subpopulation. Our experiments on synthetic and real benchmark datasets demonstrate the efficacy of CAC over previous methods for combined clustering and classification.
    SSMix: Saliency-Based Span Mixup for Text Classification. (arXiv:2106.08062v1 [cs.CL])
    (2 min) Data augmentation with mixup has shown to be effective on various computer vision tasks. Despite its great success, there has been a hurdle to apply mixup to NLP tasks since text consists of discrete tokens with variable length. In this work, we propose SSMix, a novel mixup method where the operation is performed on input text rather than on hidden vectors like previous approaches. SSMix synthesizes a sentence while preserving the locality of two original texts by span-based mixing and keeping more tokens related to the prediction relying on saliency information. With extensive experiments, we empirically validate that our method outperforms hidden-level mixup methods on a wide range of text classification benchmarks, including textual entailment, sentiment classification, and question-type classification. Our code is available at https://github.com/clovaai/ssmix.
    Evaluating Modules in Graph Contrastive Learning. (arXiv:2106.08171v1 [cs.LG])
    (2 min) The recent emergence of contrastive learning approaches facilitates the research on graph representation learning (GRL), introducing graph contrastive learning (GCL) into the literature. These methods contrast semantically similar and dissimilar sample pairs to encode the semantics into node or graph embeddings. However, most existing works only performed model-level evaluation, and did not explore the combination space of modules for more comprehensive and systematic studies. For effective module-level evaluation, we propose a framework that decomposes GCL models into four modules: (1) a sampler to generate anchor, positive and negative data samples (nodes or graphs); (2) an encoder and a readout function to get sample embeddings; (3) a discriminator to score each sample pair (anchor-positive and anchor-negative); and (4) an estimator to define the loss function. Based on this framework, we conduct controlled experiments over a wide range of architectural designs and hyperparameter settings on node and graph classification tasks. Specifically, we manage to quantify the impact of a single module, investigate the interaction between modules, and compare the overall performance with current model architectures. Our key findings include a set of module-level guidelines for GCL, e.g., simple samplers from LINE and DeepWalk are strong and robust; an MLP encoder associated with Sum readout could achieve competitive performance on graph classification. Finally, we release our implementations and results as OpenGCL, a modularized toolkit that allows convenient reproduction, standard model and module evaluation, and easy extension.
    A White Paper on Neural Network Quantization. (arXiv:2106.08295v1 [cs.LG])
    (2 min) While neural networks have advanced the frontiers in many applications, they often come at a high computational cost. Reducing the power and latency of neural network inference is key if we want to integrate modern networks into edge devices with strict power and compute requirements. Neural network quantization is one of the most effective ways of achieving these savings but the additional noise it induces can lead to accuracy degradation. In this white paper, we introduce state-of-the-art algorithms for mitigating the impact of quantization noise on the network's performance while maintaining low-bit weights and activations. We start with a hardware motivated introduction to quantization and then consider two main classes of algorithms: Post-Training Quantization (PTQ) and Quantization-Aware-Training (QAT). PTQ requires no re-training or labelled data and is thus a lightweight push-button approach to quantization. In most cases, PTQ is sufficient for achieving 8-bit quantization with close to floating-point accuracy. QAT requires fine-tuning and access to labeled training data but enables lower bit quantization with competitive results. For both solutions, we provide tested pipelines based on existing literature and extensive experimentation that lead to state-of-the-art performance for common deep learning models and tasks.
    Compression Implies Generalization. (arXiv:2106.07989v1 [cs.LG])
    (2 min) Explaining the surprising generalization performance of deep neural networks is an active and important line of research in theoretical machine learning. Influential work by Arora et al. (ICML'18) showed that, noise stability properties of deep nets occurring in practice can be used to provably compress model representations. They then argued that the small representations of compressed networks imply good generalization performance albeit only of the compressed nets. Extending their compression framework to yield generalization bounds for the original uncompressed networks remains elusive. Our main contribution is the establishment of a compression-based framework for proving generalization bounds. The framework is simple and powerful enough to extend the generalization bounds by Arora et al. to also hold for the original network. To demonstrate the flexibility of the framework, we also show that it allows us to give simple proofs of the strongest known generalization bounds for other popular machine learning models, namely Support Vector Machines and Boosting.
    CRFL: Certifiably Robust Federated Learning against Backdoor Attacks. (arXiv:2106.08283v1 [cs.LG])
    (2 min) Federated Learning (FL) as a distributed learning paradigm that aggregates information from diverse clients to train a shared global model, has demonstrated great success. However, malicious clients can perform poisoning attacks and model replacement to introduce backdoors into the trained global model. Although there have been intensive studies designing robust aggregation methods and empirical robust federated training protocols against backdoors, existing approaches lack robustness certification. This paper provides the first general framework, Certifiably Robust Federated Learning (CRFL), to train certifiably robust FL models against backdoors. Our method exploits clipping and smoothing on model parameters to control the global model smoothness, which yields a sample-wise robustness certification on backdoors with limited magnitude. Our certification also specifies the relation to federated learning parameters, such as poisoning ratio on instance level, number of attackers, and training iterations. Practically, we conduct comprehensive experiments across a range of federated datasets, and provide the first benchmark for certified robustness against backdoor attacks in federated learning. Our code is available at https://github.com/AI-secure/CRFL.
    Graph Neural Networks Inspired by Classical Iterative Algorithms. (arXiv:2103.06064v2 [cs.LG] UPDATED)
    (2 min) Despite the recent success of graph neural networks (GNN), common architectures often exhibit significant limitations, including sensitivity to oversmoothing, long-range dependencies, and spurious edges, e.g., as can occur as a result of graph heterophily or adversarial attacks. To at least partially address these issues within a simple transparent framework, we consider a new family of GNN layers designed to mimic and integrate the update rules of two classical iterative algorithms, namely, proximal gradient descent and iterative reweighted least squares (IRLS). The former defines an extensible base GNN architecture that is immune to oversmoothing while nonetheless capturing long-range dependencies by allowing arbitrary propagation steps. In contrast, the latter produces a novel attention mechanism that is explicitly anchored to an underlying end-to-end energy function, contributing stability with respect to edge uncertainty. When combined we obtain an extremely simple yet robust model that we evaluate across disparate scenarios including standardized benchmarks, adversarially-perturbated graphs, graphs with heterophily, and graphs involving long-range dependencies. In doing so, we compare against SOTA GNN approaches that have been explicitly designed for the respective task, achieving competitive or superior node classification accuracy. Our code is available at https://github.com/FFTYYY/TWIRLS.
    Epidemic modelling of multiple virus strains:a case study of SARS-CoV-2 B.1.1.7 in Moscow. (arXiv:2106.08048v1 [q-bio.PE])
    (2 min) During a long-running pandemic a pathogen can mutate, producing new strains with different epidemiological parameters. Existing approaches to epidemic modelling only consider one virus strain. We have developed a modified SEIR model to simulate multiple virus strains within the same population. As a case study, we investigate the potential effects of SARS-CoV-2 strain B.1.1.7 on the city of Moscow. Our analysis indicates a high risk of a new wave of infections in September-October 2021 with up to 35 000 daily infections at peak. We open-source our code and data.
    Personalized Keyphrase Detection using Speaker and Environment Information. (arXiv:2104.13970v2 [eess.AS] UPDATED)
    (2 min) In this paper, we introduce a streaming keyphrase detection system that can be easily customized to accurately detect any phrase composed of words from a large vocabulary. The system is implemented with an end-to-end trained automatic speech recognition (ASR) model and a text-independent speaker verification model. To address the challenge of detecting these keyphrases under various noisy conditions, a speaker separation model is added to the feature frontend of the speaker verification model, and an adaptive noise cancellation (ANC) algorithm is included to exploit cross-microphone noise coherence. Our experiments show that the text-independent speaker verification model largely reduces the false triggering rate of the keyphrase detection, while the speaker separation model and adaptive noise cancellation largely reduce false rejections.
    KL Guided Domain Adaptation. (arXiv:2106.07780v1 [cs.LG])
    (2 min) Domain adaptation is an important problem and often needed for real-world applications. In this problem, instead of i.i.d. datapoints, we assume that the source (training) data and the target (testing) data have different distributions. With that setting, the empirical risk minimization training procedure often does not perform well, since it does not account for the change in the distribution. A common approach in the domain adaptation literature is to learn a representation of the input that has the same distributions over the source and the target domain. However, these approaches often require additional networks and/or optimizing an adversarial (minimax) objective, which can be very expensive or unstable in practice. To tackle this problem, we first derive a generalization bound for the target loss based on the training loss and the reverse Kullback-Leibler (KL) divergence between the source and the target representation distributions. Based on this bound, we derive an algorithm that minimizes the KL term to obtain a better generalization to the target domain. We show that with a probabilistic representation network, the KL term can be estimated efficiently via minibatch samples without any additional network or a minimax objective. This leads to a theoretically sound alignment method which is also very efficient and stable in practice. Experimental results also suggest that our method outperforms other representation-alignment approaches.
    An Analytical Theory of Curriculum Learning in Teacher-Student Networks. (arXiv:2106.08068v1 [cs.LG])
    (2 min) In humans and animals, curriculum learning -- presenting data in a curated order - is critical to rapid learning and effective pedagogy. Yet in machine learning, curricula are not widely used and empirically often yield only moderate benefits. This stark difference in the importance of curriculum raises a fundamental theoretical question: when and why does curriculum learning help? In this work, we analyse a prototypical neural network model of curriculum learning in the high-dimensional limit, employing statistical physics methods. Curricula could in principle change both the learning speed and asymptotic performance of a model. To study the former, we provide an exact description of the online learning setting, confirming the long-standing experimental observation that curricula can modestly speed up learning. To study the latter, we derive performance in a batch learning setting, in which a network trains to convergence in successive phases of learning on dataset slices of varying difficulty. With standard training losses, curriculum does not provide generalisation benefit, in line with empirical observations. However, we show that by connecting different learning phases through simple Gaussian priors, curriculum can yield a large improvement in test performance. Taken together, our reduced analytical descriptions help reconcile apparently conflicting empirical results and trace regimes where curriculum learning yields the largest gains. More broadly, our results suggest that fully exploiting a curriculum may require explicit changes to the loss function at curriculum boundaries.
    On the Power of Multitask Representation Learning in Linear MDP. (arXiv:2106.08053v1 [cs.LG])
    (2 min) While multitask representation learning has become a popular approach in reinforcement learning (RL), theoretical understanding of why and when it works remains limited. This paper presents analyses for the statistical benefit of multitask representation learning in linear Markov Decision Process (MDP) under a generative model. In this paper, we consider an agent to learn a representation function $\phi$ out of a function class $\Phi$ from $T$ source tasks with $N$ data per task, and then use the learned $\hat{\phi}$ to reduce the required number of sample for a new task. We first discover a \emph{Least-Activated-Feature-Abundance} (LAFA) criterion, denoted as $\kappa$, with which we prove that a straightforward least-square algorithm learns a policy which is $\tilde{O}(H^2\sqrt{\frac{\mathcal{C}(\Phi)^2 \kappa d}{NT}+\frac{\kappa d}{n}})$ sub-optimal. Here $H$ is the planning horizon, $\mathcal{C}(\Phi)$ is $\Phi$'s complexity measure, $d$ is the dimension of the representation (usually $d\ll \mathcal{C}(\Phi)$) and $n$ is the number of samples for the new task. Thus the required $n$ is $O(\kappa d H^4)$ for the sub-optimality to be close to zero, which is much smaller than $O(\mathcal{C}(\Phi)^2\kappa d H^4)$ in the setting without multitask representation learning, whose sub-optimality gap is $\tilde{O}(H^2\sqrt{\frac{\kappa \mathcal{C}(\Phi)^2d}{n}})$. This theoretically explains the power of multitask representation learning in reducing sample complexity. Further, we note that to ensure high sample efficiency, the LAFA criterion $\kappa$ should be small. In fact, $\kappa$ varies widely in magnitude depending on the different sampling distribution for new task. This indicates adaptive sampling technique is important to make $\kappa$ solely depend on $d$. Finally, we provide empirical results of a noisy grid-world environment to corroborate our theoretical findings.
    Improving the List Decoding Version of the Cyclically Equivariant Neural Decoder. (arXiv:2106.07964v1 [cs.IT])
    (2 min) The cyclically equivariant neural decoder was recently proposed in [Chen-Ye, International Conference on Machine Learning, 2021] to decode cyclic codes. In the same paper, a list decoding procedure was also introduced for two widely used classes of cyclic codes -- BCH codes and punctured Reed-Muller (RM) codes. While the list decoding procedure significantly improves the Frame Error Rate (FER) of the cyclically equivariant neural decoder, the Bit Error Rate (BER) of the list decoding procedure is even worse than the unique decoding algorithm when the list size is small. In this paper, we propose an improved version of the list decoding algorithm for BCH codes and punctured RM codes. Our new proposal significantly reduces the BER while maintaining the same (in some cases even smaller) FER. More specifically, our new decoder provides up to $2$dB gain over the previous list decoder when measured by BER, and the running time of our new decoder is $15\%$ smaller. Code available at https://github.com/improvedlistdecoder/code
    Canonical-Correlation-Based Fast Feature Selection. (arXiv:2106.08247v1 [stat.ML])
    (2 min) This paper proposes a canonical-correlation-based filter method for feature selection. The sum of squared canonical correlation coefficients is adopted as the feature ranking criterion. The proposed method boosts the computational speed of the ranking criterion in greedy search. The supporting theorems developed for the feature selection method are fundamental to the understanding of the canonical correlation analysis. In empirical studies, a synthetic dataset is used to demonstrate the speed advantage of the proposed method, and eight real datasets are applied to show the effectiveness of the proposed feature ranking criterion in both classification and regression. The results show that the proposed method is considerably faster than the definition-based method, and the proposed ranking criterion is competitive compared with the seven mutual-information-based criteria.
    Embarrassingly parallel MCMC using deep invertible transformations. (arXiv:1903.04556v2 [cs.LG] UPDATED)
    (2 min) While MCMC methods have become a main work-horse for Bayesian inference, scaling them to large distributed datasets is still a challenge. Embarrassingly parallel MCMC strategies take a divide-and-conquer stance to achieve this by writing the target posterior as a product of subposteriors, running MCMC for each of them in parallel and subsequently combining the results. The challenge then lies in devising efficient aggregation strategies. Current strategies trade-off between approximation quality, and costs of communication and computation. In this work, we introduce a novel method that addresses these issues simultaneously. Our key insight is to introduce a deep invertible transformation to approximate each of the subposteriors. These approximations can be made accurate even for complex distributions and serve as intermediate representations, keeping the total communication cost limited. Moreover, they enable us to sample from the product of the subposteriors using an efficient and stable importance sampling scheme. We demonstrate the approach outperforms available state-of-the-art methods in a range of challenging scenarios, including high-dimensional and heterogeneous subposteriors.
    AGENT: A Benchmark for Core Psychological Reasoning. (arXiv:2102.12321v3 [cs.AI] UPDATED)
    (2 min) For machine agents to successfully interact with humans in real-world settings, they will need to develop an understanding of human mental life. Intuitive psychology, the ability to reason about hidden mental variables that drive observable actions, comes naturally to people: even pre-verbal infants can tell agents from objects, expecting agents to act efficiently to achieve goals given constraints. Despite recent interest in machine agents that reason about other agents, it is not clear if such agents learn or hold the core psychology principles that drive human reasoning. Inspired by cognitive development studies on intuitive psychology, we present a benchmark consisting of a large dataset of procedurally generated 3D animations, AGENT (Action, Goal, Efficiency, coNstraint, uTility), structured around four scenarios (goal preferences, action efficiency, unobserved constraints, and cost-reward trade-offs) that probe key concepts of core intuitive psychology. We validate AGENT with human-ratings, propose an evaluation protocol emphasizing generalization, and compare two strong baselines built on Bayesian inverse planning and a Theory of Mind neural network. Our results suggest that to pass the designed tests of core intuitive psychology at human levels, a model must acquire or have built-in representations of how agents plan, combining utility computations and core knowledge of objects and physics.
    Perceptually-inspired super-resolution of compressed videos. (arXiv:2106.08147v1 [eess.IV])
    (2 min) Spatial resolution adaptation is a technique which has often been employed in video compression to enhance coding efficiency. This approach encodes a lower resolution version of the input video and reconstructs the original resolution during decoding. Instead of using conventional up-sampling filters, recent work has employed advanced super-resolution methods based on convolutional neural networks (CNNs) to further improve reconstruction quality. These approaches are usually trained to minimise pixel-based losses such as Mean-Squared Error (MSE), despite the fact that this type of loss metric does not correlate well with subjective opinions. In this paper, a perceptually-inspired super-resolution approach (M-SRGAN) is proposed for spatial up-sampling of compressed video using a modified CNN model, which has been trained using a generative adversarial network (GAN) on compressed content with perceptual loss functions. The proposed method was integrated with HEVC HM 16.20, and has been evaluated on the JVET Common Test Conditions (UHD test sequences) using the Random Access configuration. The results show evident perceptual quality improvement over the original HM 16.20, with an average bitrate saving of 35.6% (Bj{\o}ntegaard Delta measurement) based on a perceptual quality metric, VMAF.
    Divergence Frontiers for Generative Models: Sample Complexity, Quantization Level, and Frontier Integral. (arXiv:2106.07898v1 [stat.ML])
    (2 min) The spectacular success of deep generative models calls for quantitative tools to measure their statistical performance. Divergence frontiers have recently been proposed as an evaluation framework for generative models, due to their ability to measure the quality-diversity trade-off inherent to deep generative modeling. However, the statistical behavior of divergence frontiers estimated from data remains unknown to this day. In this paper, we establish non-asymptotic bounds on the sample complexity of the plug-in estimator of divergence frontiers. Along the way, we introduce a novel integral summary of divergence frontiers. We derive the corresponding non-asymptotic bounds and discuss the choice of the quantization level by balancing the two types of approximation errors arisen from its computation. We also augment the divergence frontier framework by investigating the statistical performance of smoothed distribution estimators such as the Good-Turing estimator. We illustrate the theoretical results with numerical examples from natural language processing and computer vision.
    A baseline for semi-supervised learning of efficient semantic segmentation models. (arXiv:2106.07075v2 [cs.CV] UPDATED)
    (2 min) Semi-supervised learning is especially interesting in the dense prediction context due to high cost of pixel-level ground truth. Unfortunately, most such approaches are evaluated on outdated architectures which hamper research due to very slow training and high requirements on GPU RAM. We address this concern by presenting a simple and effective baseline which works very well both on standard and efficient architectures. Our baseline is based on one-way consistency and non-linear geometric and photometric perturbations. We show advantage of perturbing only the student branch and present a plausible explanation of such behaviour. Experiments on Cityscapes and CIFAR-10 demonstrate competitive performance with respect to prior work.
    Over-the-Air Decentralized Federated Learning. (arXiv:2106.08011v1 [cs.IT])
    (2 min) In this paper, we consider decentralized federated learning (FL) over wireless networks, where over-the-air computation (AirComp) is adopted to facilitate the local model consensus in a device-to-device (D2D) communication manner. However, the AirComp-based consensus phase brings the additive noise in each algorithm iterate and the consensus needs to be robust to wireless network topology changes, which introduce a coupled and novel challenge of establishing the convergence for wireless decentralized FL algorithm. To facilitate consensus phase, we propose an AirComp-based DSGD with gradient tracking and variance reduction (DSGT-VR) algorithm, where both precoding and decoding strategies are developed for D2D communication. Furthermore, we prove that the proposed algorithm converges linearly and establish the optimality gap for strongly convex and smooth loss functions, taking into account the channel fading and noise. The theoretical result shows that the additional error bound in the optimality gap depends on the number of devices. Extensive simulations verify the theoretical results and show that the proposed algorithm outperforms other benchmark decentralized FL algorithms over wireless networks.
    How Modular Should Neural Module Networks Be for Systematic Generalization?. (arXiv:2106.08170v1 [cs.LG])
    (2 min) Neural Module Networks (NMNs) aim at Visual Question Answering (VQA) via composition of modules that tackle a sub-task. NMNs are a promising strategy to achieve systematic generalization, i.e. overcoming biasing factors in the training distribution. However, the aspects of NMNs that facilitate systematic generalization are not fully understood. In this paper, we demonstrate that the stage and the degree at which modularity is defined has large influence on systematic generalization. In a series of experiments on three VQA datasets (MNIST with multiple attributes, SQOOP, and CLEVR-CoGenT), our results reveal that tuning the degree of modularity in the network, especially at the image encoder stage, reaches substantially higher systematic generalization. These findings lead to new NMN architectures that outperform previous ones in terms of systematic generalization.
    Time Series Anomaly Detection for Cyber-physical Systems via Neural System Identification and Bayesian Filtering. (arXiv:2106.07992v1 [cs.LG])
    (2 min) Recent advances in AIoT technologies have led to an increasing popularity of utilizing machine learning algorithms to detect operational failures for cyber-physical systems (CPS). In its basic form, an anomaly detection module monitors the sensor measurements and actuator states from the physical plant, and detects anomalies in these measurements to identify abnormal operation status. Nevertheless, building effective anomaly detection models for CPS is rather challenging as the model has to accurately detect anomalies in presence of highly complicated system dynamics and unknown amount of sensor noise. In this work, we propose a novel time series anomaly detection method called Neural System Identification and Bayesian Filtering (NSIBF) in which a specially crafted neural network architecture is posed for system identification, i.e., capturing the dynamics of CPS in a dynamical state-space model; then a Bayesian filtering algorithm is naturally applied on top of the "identified" state-space model for robust anomaly detection by tracking the uncertainty of the hidden state of the system recursively over time. We provide qualitative as well as quantitative experiments with the proposed method on a synthetic and three real-world CPS datasets, showing that NSIBF compares favorably to the state-of-the-art methods with considerable improvements on anomaly detection in CPS.
    Robust Out-of-Distribution Detection on Deep Probabilistic Generative Models. (arXiv:2106.07903v1 [cs.LG])
    (2 min) Out-of-distribution (OOD) detection is an important task in machine learning systems for ensuring their reliability and safety. Deep probabilistic generative models facilitate OOD detection by estimating the likelihood of a data sample. However, such models frequently assign a suspiciously high likelihood to a specific outlier. Several recent works have addressed this issue by training a neural network with auxiliary outliers, which are generated by perturbing the input data. In this paper, we discover that these approaches fail for certain OOD datasets. Thus, we suggest a new detection metric that operates without outlier exposure. We observe that our metric is robust to diverse variations of an image compared to the previous outlier-exposing methods. Furthermore, our proposed score requires neither auxiliary models nor additional training. Instead, this paper utilizes the likelihood ratio statistic in a new perspective to extract genuine properties from the given single deep probabilistic generative model. We also apply a novel numerical approximation to enable fast implementation. Finally, we demonstrate comprehensive experiments on various probabilistic generative models and show that our method achieves state-of-the-art performance.
    End-to-End Learning of Keypoint Representations for Continuous Control from Images. (arXiv:2106.07995v1 [cs.LG])
    (2 min) In many control problems that include vision, optimal controls can be inferred from the location of the objects in the scene. This information can be represented using keypoints, which is a list of spatial locations in the input image. Previous works show that keypoint representations learned during unsupervised pre-training using encoder-decoder architectures can provide good features for control tasks. In this paper, we show that it is possible to learn efficient keypoint representations end-to-end, without the need for unsupervised pre-training, decoders, or additional losses. Our proposed architecture consists of a differentiable keypoint extractor that feeds the coordinates of the estimated keypoints directly to a soft actor-critic agent. The proposed algorithm yields performance competitive to the state-of-the art on DeepMind Control Suite tasks.
    S2Engine: A Novel Systolic Architecture for Sparse Convolutional Neural Networks. (arXiv:2106.07894v1 [cs.AR])
    (2 min) Convolutional neural networks (CNNs) have achieved great success in performing cognitive tasks. However, execution of CNNs requires a large amount of computing resources and generates heavy memory traffic, which imposes a severe challenge on computing system design. Through optimizing parallel executions and data reuse in convolution, systolic architecture demonstrates great advantages in accelerating CNN computations. However, regular internal data transmission path in traditional systolic architecture prevents the systolic architecture from completely leveraging the benefits introduced by neural network sparsity. Deployment of fine-grained sparsity on the existing systolic architectures is greatly hindered by the incurred computational overheads. In this work, we propose S2Engine $-$ a novel systolic architecture that can fully exploit the sparsity in CNNs with maximized data reuse. S2Engine transmits compressed data internally and allows each processing element to dynamically select an aligned data from the compressed dataflow in convolution. Compared to the naive systolic array, S2Engine achieves about $3.2\times$ and about $3.0\times$ improvements on speed and energy efficiency, respectively.
    SynthASR: Unlocking Synthetic Data for Speech Recognition. (arXiv:2106.07803v1 [cs.LG])
    (2 min) End-to-end (E2E) automatic speech recognition (ASR) models have recently demonstrated superior performance over the traditional hybrid ASR models. Training an E2E ASR model requires a large amount of data which is not only expensive but may also raise dependency on production data. At the same time, synthetic speech generated by the state-of-the-art text-to-speech (TTS) engines has advanced to near-human naturalness. In this work, we propose to utilize synthetic speech for ASR training (SynthASR) in applications where data is sparse or hard to get for ASR model training. In addition, we apply continual learning with a novel multi-stage training strategy to address catastrophic forgetting, achieved by a mix of weighted multi-style training, data augmentation, encoder freezing, and parameter regularization. In our experiments conducted on in-house datasets for a new application of recognizing medication names, training ASR RNN-T models with synthetic audio via the proposed multi-stage training improved the recognition performance on new application by more than 65% relative, without degradation on existing general applications. Our observations show that SynthASR holds great promise in training the state-of-the-art large-scale E2E ASR models for new applications while reducing the costs and dependency on production data.
    Now You See It, Now You Dont: Adversarial Vulnerabilities in Computational Pathology. (arXiv:2106.08153v1 [eess.IV])
    (2 min) Deep learning models are routinely employed in computational pathology (CPath) for solving problems of diagnostic and prognostic significance. Typically, the generalization performance of CPath models is analyzed using evaluation protocols such as cross-validation and testing on multi-centric cohorts. However, to ensure that such CPath solutions are robust and safe for use in a clinical setting, a critical analysis of their predictive performance and vulnerability to adversarial attacks is required, which is the focus of this paper. Specifically, we show that a highly accurate model for classification of tumour patches in pathology images (AUC > 0.95) can easily be attacked with minimal perturbations which are imperceptible to lay humans and trained pathologists alike. Our analytical results show that it is possible to generate single-instance white-box attacks on specific input images with high success rate and low perturbation energy. Furthermore, we have also generated a single universal perturbation matrix using the training dataset only which, when added to unseen test images, results in forcing the trained neural network to flip its prediction labels with high confidence at a success rate of > 84%. We systematically analyze the relationship between perturbation energy of an adversarial attack, its impact on morphological constructs of clinical significance, their perceptibility by a trained pathologist and saliency maps obtained using deep learning models. Based on our analysis, we strongly recommend that computational pathology models be critically analyzed using the proposed adversarial validation strategy prior to clinical adoption.
    Graph cuts always find a global optimum for Potts models (with a catch). (arXiv:2011.03639v2 [stat.ML] UPDATED)
    (2 min) We prove that the $\alpha$-expansion algorithm for MAP inference always returns a globally optimal assignment for Markov Random Fields with Potts pairwise potentials, with a catch: the returned assignment is only guaranteed to be optimal for an instance within a small perturbation of the original problem instance. In other words, all local minima with respect to expansion moves are global minima to slightly perturbed versions of the problem. On "real-world" instances, MAP assignments of small perturbations of the problem should be very similar to the MAP assignment(s) of the original problem instance. We design an algorithm that can certify whether this is the case in practice. On several MAP inference problem instances from computer vision, this algorithm certifies that MAP solutions to all of these perturbations are very close to solutions of the original instance. These results taken together give a cohesive explanation for the good performance of "graph cuts" algorithms in practice. Every local expansion minimum is a global minimum in a small perturbation of the problem, and all of these global minima are close to the original solution.
    The Flip Side of the Reweighted Coin: Duality of Adaptive Dropout and Regularization. (arXiv:2106.07769v1 [cs.LG])
    (2 min) Among the most successful methods for sparsifying deep (neural) networks are those that adaptively mask the network weights throughout training. By examining this masking, or dropout, in the linear case, we uncover a duality between such adaptive methods and regularization through the so-called "$\eta$-trick" that casts both as iteratively reweighted optimizations. We show that any dropout strategy that adapts to the weights in a monotonic way corresponds to an effective subquadratic regularization penalty, and therefore leads to sparse solutions. We obtain the effective penalties for several popular sparsification strategies, which are remarkably similar to classical penalties commonly used in sparse optimization. Considering variational dropout as a case study, we demonstrate similar empirical behavior between the adaptive dropout method and classical methods on the task of deep network sparsification, validating our theory.
    Fast decentralized non-convex finite-sum optimization with recursive variance reduction. (arXiv:2008.07428v5 [math.OC] UPDATED)
    (2 min) This paper considers decentralized minimization of $N:=nm$ smooth non-convex cost functions equally divided over a directed network of $n$ nodes. Specifically, we describe a stochastic first-order gradient method, called GT-SARAH, that employs a SARAH-type variance reduction technique and gradient tracking (GT) to address the stochastic and decentralized nature of the problem. We show that GT-SARAH, with appropriate algorithmic parameters, finds an $\epsilon$-accurate first-order stationary point with $O\big(\max\big\{N^{\frac{1}{2}},n(1-\lambda)^{-2},n^{\frac{2}{3}}m^{\frac{1}{3}}(1-\lambda)^{-1}\big\}L\epsilon^{-2}\big)$ gradient complexity, where ${(1-\lambda)\in(0,1]}$ is the spectral gap of the network weight matrix and $L$ is the smoothness parameter of the cost functions. This gradient complexity outperforms that of the existing decentralized stochastic gradient methods. In particular, in a big-data regime such that ${n = O(N^{\frac{1}{2}}(1-\lambda)^{3})}$, this gradient complexity furthers reduces to ${O(N^{\frac{1}{2}}L\epsilon^{-2})}$, independent of the network topology, and matches that of the centralized near-optimal variance-reduced methods. Moreover, in this regime GT-SARAH achieves a non-asymptotic linear speedup, in that, the total number of gradient computations at each node is reduced by a factor of $1/n$ compared to the centralized near-optimal algorithms that perform all gradient computations at a single node. To the best of our knowledge, GT-SARAH is the first algorithm that achieves this property. In addition, we show that appropriate choices of local minibatch size balance the trade-offs between the gradient and communication complexity of GT-SARAH. Over infinite time horizon, we establish that all nodes in GT-SARAH asymptotically achieve consensus and converge to a first-order stationary point in the almost sure and mean-squared sense.
    Learning Incident Prediction Models Over Large Geographical Areas for Emergency Response Systems. (arXiv:2106.08307v1 [cs.LG])
    (2 min) Principled decision making in emergency response management necessitates the use of statistical models that predict the spatial-temporal likelihood of incident occurrence. These statistical models are then used for proactive stationing which allocates first responders across the spatial area in order to reduce overall response time. Traditional methods that simply aggregate past incidents over space and time fail to make useful short-term predictions when the spatial region is large and focused on fine-grained spatial entities like interstate highway networks. This is partially due to the sparsity of incidents with respect to the area in consideration. Further, accidents are affected by several covariates, and collecting, cleaning, and managing multiple streams of data from various sources is challenging for large spatial areas. In this paper, we highlight how this problem is being solved for the state of Tennessee, a state in the USA with a total area of over 100,000 sq. km. Our pipeline, based on a combination of synthetic resampling, non-spatial clustering, and learning from data can efficiently forecast the spatial and temporal dynamics of accident occurrence, even under sparse conditions. In the paper, we describe our pipeline that uses data related to roadway geometry, weather, historical accidents, and real-time traffic congestion to aid accident forecasting. To understand how our forecasting model can affect allocation and dispatch, we improve upon a classical resource allocation approach. Experimental results show that our approach can significantly reduce response times in the field in comparison with current approaches followed by first responders.
    Voting for the right answer: Adversarial defense for speaker verification. (arXiv:2106.07868v1 [cs.LG])
    (2 min) Automatic speaker verification (ASV) is a well developed technology for biometric identification, and has been ubiquitous implemented in security-critic applications, such as banking and access control. However, previous works have shown that ASV is under the radar of adversarial attacks, which are very similar to their original counterparts from human's perception, yet will manipulate the ASV render wrong prediction. Due to the very late emergence of adversarial attacks for ASV, effective countermeasures against them are limited. Given that the security of ASV is of high priority, in this work, we propose the idea of "voting for the right answer" to prevent risky decisions of ASV in blind spot areas, by employing random sampling and voting. Experimental results show that our proposed method improves the robustness against both the limited-knowledge attackers by pulling the adversarial samples out of the blind spots, and the perfect-knowledge attackers by introducing randomness and increasing the attackers' budgets. The code for reproducing main results is available at https://github.com/thuhcsi/adsv_voting.
    Decomposition of Global Feature Importance into Direct and Associative Components (DEDACT). (arXiv:2106.08086v1 [stat.ML])
    (2 min) Global model-agnostic feature importance measures either quantify whether features are directly used for a model's predictions (direct importance) or whether they contain prediction-relevant information (associative importance). Direct importance provides causal insight into the model's mechanism, yet it fails to expose the leakage of information from associated but not directly used variables. In contrast, associative importance exposes information leakage but does not provide causal insight into the model's mechanism. We introduce DEDACT - a framework to decompose well-established direct and associative importance measures into their respective associative and direct components. DEDACT provides insight into both the sources of prediction-relevant information in the data and the direct and indirect feature pathways by which the information enters the model. We demonstrate the method's usefulness on simulated examples.
    Multi-script Handwritten Digit Recognition Using Multi-task Learning. (arXiv:2106.08267v1 [cs.CV])
    (2 min) Handwritten digit recognition is one of the extensively studied area in machine learning. Apart from the wider research on handwritten digit recognition on MNIST dataset, there are many other research works on various script recognition. However, it is not very common for multi-script digit recognition which encourage the development of robust and multipurpose systems. Additionally working on multi-script digit recognition enables multi-task learning, considering the script classification as a related task for instance. It is evident that multi-task learning improves model performance through inductive transfer using the information contained in related tasks. Therefore, in this study multi-script handwritten digit recognition using multi-task learning will be investigated. As a specific case of demonstrating the solution to the problem, Amharic handwritten character recognition will also be experimented. The handwritten digits of three scripts including Latin, Arabic and Kannada are studied to show that multi-task models with reformulation of the individual tasks have shown promising results. In this study a novel way of using the individual tasks predictions was proposed to help classification performance and regularize the different loss for the purpose of the main task. This finding has outperformed the baseline and the conventional multi-task learning models. More importantly, it avoided the need for weighting the different losses of the tasks, which is one of the challenges in multi-task learning.
    Boosting in the Presence of Massart Noise. (arXiv:2106.07779v1 [cs.LG])
    (2 min) We study the problem of boosting the accuracy of a weak learner in the (distribution-independent) PAC model with Massart noise. In the Massart noise model, the label of each example $x$ is independently misclassified with probability $\eta(x) \leq \eta$, where $\eta<1/2$. The Massart model lies between the random classification noise model and the agnostic model. Our main positive result is the first computationally efficient boosting algorithm in the presence of Massart noise that achieves misclassification error arbitrarily close to $\eta$. Prior to our work, no non-trivial booster was known in this setting. Moreover, we show that this error upper bound is best possible for polynomial-time black-box boosters, under standard cryptographic assumptions. Our upper and lower bounds characterize the complexity of boosting in the distribution-independent PAC model with Massart noise. As a simple application of our positive result, we give the first efficient Massart learner for unions of high-dimensional rectangles.
    Talk, Don't Write: A Study of Direct Speech-Based Image Retrieval. (arXiv:2104.01894v3 [cs.CL] UPDATED)
    (2 min) Speech-based image retrieval has been studied as a proxy for joint representation learning, usually without emphasis on retrieval itself. As such, it is unclear how well speech-based retrieval can work in practice -- both in an absolute sense and versus alternative strategies that combine automatic speech recognition (ASR) with strong text encoders. In this work, we extensively study and expand choices of encoder architectures, training methodology (including unimodal and multimodal pretraining), and other factors. Our experiments cover different types of speech in three datasets: Flickr Audio, Places Audio, and Localized Narratives. Our best model configuration achieves large gains over state of the art, e.g., pushing recall-at-one from 21.8% to 33.2% for Flickr Audio and 27.6% to 53.4% for Places Audio. We also show our best speech-based models can match or exceed cascaded ASR-to-text encoding when speech is spontaneous, accented, or otherwise hard to automatically transcribe.
    RETRIEVE: Coreset Selection for Efficient and Robust Semi-Supervised Learning. (arXiv:2106.07760v1 [cs.LG])
    (2 min) Semi-supervised learning (SSL) algorithms have had great success in recent years in limited labeled data regimes. However, the current state-of-the-art SSL algorithms are computationally expensive and entail significant compute time and energy requirements. This can prove to be a huge limitation for many smaller companies and academic groups. Our main insight is that training on a subset of unlabeled data instead of entire unlabeled data enables the current SSL algorithms to converge faster, thereby reducing the computational costs significantly. In this work, we propose RETRIEVE, a coreset selection framework for efficient and robust semi-supervised learning. RETRIEVE selects the coreset by solving a mixed discrete-continuous bi-level optimization problem such that the selected coreset minimizes the labeled set loss. We use a one-step gradient approximation and show that the discrete optimization problem is approximately submodular, thereby enabling simple greedy algorithms to obtain the coreset. We empirically demonstrate on several real-world datasets that existing SSL algorithms like VAT, Mean-Teacher, FixMatch, when used with RETRIEVE, achieve a) faster training times, b) better performance when unlabeled data consists of Out-of-Distribution(OOD) data and imbalance. More specifically, we show that with minimal accuracy degradation, RETRIEVE achieves a speedup of around 3X in the traditional SSL setting and achieves a speedup of 5X compared to state-of-the-art (SOTA) robust SSL algorithms in the case of imbalance and OOD data.
    Coupled Gradient Estimators for Discrete Latent Variables. (arXiv:2106.08056v1 [cs.LG])
    (2 min) Training models with discrete latent variables is challenging due to the high variance of unbiased gradient estimators. While low-variance reparameterization gradients of a continuous relaxation can provide an effective solution, a continuous relaxation is not always available or tractable. Dong et al. (2020) and Yin et al. (2020) introduced a performant estimator that does not rely on continuous relaxations; however, it is limited to binary random variables. We introduce a novel derivation of their estimator based on importance sampling and statistical couplings, which we extend to the categorical setting. Motivated by the construction of a stick-breaking coupling, we introduce gradient estimators based on reparameterizing categorical variables as sequences of binary variables and Rao-Blackwellization. In systematic experiments, we show that our proposed categorical gradient estimators provide state-of-the-art performance, whereas even with additional Rao-Blackwellization, previous estimators (Yin et al., 2019) underperform a simpler REINFORCE with a leave-one-out-baseline estimator (Kool et al., 2019).
    Efficient Optimization Methods for Extreme Similarity Learning with Nonlinear Embeddings. (arXiv:2010.13511v2 [stat.ML] UPDATED)
    (2 min) We study the problem of learning similarity by using nonlinear embedding models (e.g., neural networks) from all possible pairs. This problem is well-known for its difficulty of training with the extreme number of pairs. For the special case of using linear embeddings, many studies have addressed this issue of handling all pairs by considering certain loss functions and developing efficient optimization algorithms. This paper aims to extend results for general nonlinear embeddings. First, we finish detailed derivations and provide clean formulations for efficiently calculating some building blocks of optimization algorithms such as function, gradient evaluation, and Hessian-vector product. The result enables the use of many optimization methods for extreme similarity learning with nonlinear embeddings. Second, we study some optimization methods in detail. Due to the use of nonlinear embeddings, implementation issues different from linear cases are addressed. In the end, some methods are shown to be highly efficient for extreme similarity learning with nonlinear embeddings.
    Model Extraction and Adversarial Attacks on Neural Networks using Switching Power Information. (arXiv:2106.08299v1 [cs.LG])
    (2 min) Artificial neural networks (ANNs) have gained significant popularity in the last decade for solving narrow AI problems in domains such as healthcare, transportation, and defense. As ANNs become more ubiquitous, it is imperative to understand their associated safety, security, and privacy vulnerabilities. Recently, it has been shown that ANNs are susceptible to a number of adversarial evasion attacks--inputs that cause the ANN to make high-confidence misclassifications despite being almost indistinguishable from the data used to train and test the network. This work explores to what degree finding these examples maybe aided by using side-channel information, specifically switching power consumption, of hardware implementations of ANNs. A black-box threat scenario is assumed, where an attacker has access to the ANN hardware's input, outputs, and topology, but the trained model parameters are unknown. Then, a surrogate model is trained to have similar functional (i.e. input-output mapping) and switching power characteristics as the oracle (black-box) model. Our results indicate that the inclusion of power consumption data increases the fidelity of the model extraction by up to 30 percent based on a mean square error comparison of the oracle and surrogate weights. However, transferability of adversarial examples from the surrogate to the oracle model was not significantly affected.
    Deep Reinforcement Learning for Conservation Decisions. (arXiv:2106.08272v1 [cs.LG])
    (2 min) Can machine learning help us make better decisions about a changing planet? In this paper, we illustrate and discuss the potential of a promising corner of machine learning known as _reinforcement learning_ (RL) to help tackle the most challenging conservation decision problems. RL is uniquely well suited to conservation and global change challenges for three reasons: (1) RL explicitly focuses on designing an agent who _interacts_ with an environment which is dynamic and uncertain, (2) RL approaches do not require massive amounts of data, (3) RL approaches would utilize rather than replace existing models, simulations, and the knowledge they contain. We provide a conceptual and technical introduction to RL and its relevance to ecological and conservation challenges, including examples of a problem in setting fisheries quotas and in managing ecological tipping points. Four appendices with annotated code provide a tangible introduction to researchers looking to adopt, evaluate, or extend these approaches.
    Detect and remove watermark in deep neural networks via generative adversarial networks. (arXiv:2106.08104v1 [cs.MM])
    (2 min) Deep neural networks (DNN) have achieved remarkable performance in various fields. However, training a DNN model from scratch requires a lot of computing resources and training data. It is difficult for most individual users to obtain such computing resources and training data. Model copyright infringement is an emerging problem in recent years. For instance, pre-trained models may be stolen or abuse by illegal users without the authorization of the model owner. Recently, many works on protecting the intellectual property of DNN models have been proposed. In these works, embedding watermarks into DNN based on backdoor is one of the widely used methods. However, when the DNN model is stolen, the backdoor-based watermark may face the risk of being detected and removed by an adversary. In this paper, we propose a scheme to detect and remove watermark in deep neural networks via generative adversarial networks (GAN). We demonstrate that the backdoor-based DNN watermarks are vulnerable to the proposed GAN-based watermark removal attack. The proposed attack method includes two phases. In the first phase, we use the GAN and few clean images to detect and reverse the watermark in the DNN model. In the second phase, we fine-tune the watermarked DNN based on the reversed backdoor images. Experimental evaluations on the MNIST and CIFAR10 datasets demonstrate that, the proposed method can effectively remove about 98% of the watermark in DNN models, as the watermark retention rate reduces from 100% to less than 2% after applying the proposed attack. In the meantime, the proposed attack hardly affects the model's performance. The test accuracy of the watermarked DNN on the MNIST and the CIFAR10 datasets drops by less than 1% and 3%, respectively.
    Generating Contrastive Explanations for Inductive Logic Programming Based on a Near Miss Approach. (arXiv:2106.08064v1 [cs.LG])
    (2 min) In recent research, human-understandable explanations of machine learning models have received a lot of attention. Often explanations are given in form of model simplifications or visualizations. However, as shown in cognitive science as well as in early AI research, concept understanding can also be improved by the alignment of a given instance for a concept with a similar counterexample. Contrasting a given instance with a structurally similar example which does not belong to the concept highlights what characteristics are necessary for concept membership. Such near misses have been proposed by Winston (1970) as efficient guidance for learning in relational domains. We introduce an explanation generation algorithm for relational concepts learned with Inductive Logic Programming (\textsc{GeNME}). The algorithm identifies near miss examples from a given set of instances and ranks these examples by their degree of closeness to a specific positive instance. A modified rule which covers the near miss but not the original instance is given as an explanation. We illustrate \textsc{GeNME} with the well known family domain consisting of kinship relations, the visual relational Winston arches domain and a real-world domain dealing with file management. We also present a psychological experiment comparing human preferences of rule-based, example-based, and near miss explanations in the family and the arches domains.
    Dialectal Speech Recognition and Translation of Swiss German Speech to Standard German Text: Microsoft's Submission to SwissText 2021. (arXiv:2106.08126v1 [eess.AS])
    (2 min) This paper describes the winning approach in the public SwissText 2021 competition on dialect recognition and translation of Swiss German speech to standard German text. Swiss German refers to the multitude of Alemannic dialects spoken in the German-speaking parts of Switzerland. Swiss German differs significantly from standard German in pronunciation, word inventory and grammar. It is mostly incomprehensible to native German speakers. Moreover, it lacks a standardized written script. To solve the challenging task, we propose a hybrid automatic speech recognition system with a lexicon that incorporates translations, a 1st pass language model that deals with Swiss German particularities, a transfer-learned acoustic model and a strong neural language model for 2nd pass rescoring. Our submission reaches 46.04% BLEU on a blind conversational test set and outperforms the second best competitor by a 12% relative margin.
    Mind Your Weight(s): A Large-scale Study on Insufficient Machine Learning Model Protection in Mobile Apps. (arXiv:2002.07687v2 [cs.CR] UPDATED)
    (3 min) On-device machine learning (ML) is quickly gaining popularity among mobile apps. It allows offline model inference while preserving user privacy. However, ML models, considered as core intellectual properties of model owners, are now stored on billions of untrusted devices and subject to potential thefts. Leaked models can cause both severe financial loss and security consequences. This paper presents the first empirical study of ML model protection on mobile devices. Our study aims to answer three open questions with quantitative evidence: How widely is model protection used in apps? How robust are existing model protection techniques? What impacts can (stolen) models incur? To that end, we built a simple app analysis pipeline and analyzed 46,753 popular apps collected from the US and Chinese app markets. We identified 1,468 ML apps spanning all popular app categories. We found that, alarmingly, 41% of ML apps do not protect their models at all, which can be trivially stolen from app packages. Even for those apps that use model protection or encryption, we were able to extract the models from 66% of them via unsophisticated dynamic analysis techniques. The extracted models are mostly commercial products and used for face recognition, liveness detection, ID/bank card recognition, and malware detection. We quantitatively estimated the potential financial and security impact of a leaked model, which can amount to millions of dollars for different stakeholders. Our study reveals that on-device models are currently at high risk of being leaked; attackers are highly motivated to steal such models. Drawn from our large-scale study, we report our insights into this emerging security problem and discuss the technical challenges, hoping to inspire future research on robust and practical model protection for mobile devices.
    Meta-World: A Benchmark and Evaluation for Multi-Task and Meta Reinforcement Learning. (arXiv:1910.10897v2 [cs.LG] UPDATED)
    (3 min) Meta-reinforcement learning algorithms can enable robots to acquire new skills much more quickly, by leveraging prior experience to learn how to learn. However, much of the current research on meta-reinforcement learning focuses on task distributions that are very narrow. For example, a commonly used meta-reinforcement learning benchmark uses different running velocities for a simulated robot as different tasks. When policies are meta-trained on such narrow task distributions, they cannot possibly generalize to more quickly acquire entirely new tasks. Therefore, if the aim of these methods is to enable faster acquisition of entirely new behaviors, we must evaluate them on task distributions that are sufficiently broad to enable generalization to new behaviors. In this paper, we propose an open-source simulated benchmark for meta-reinforcement learning and multi-task learning consisting of 50 distinct robotic manipulation tasks. Our aim is to make it possible to develop algorithms that generalize to accelerate the acquisition of entirely new, held-out tasks. We evaluate 7 state-of-the-art meta-reinforcement learning and multi-task learning algorithms on these tasks. Surprisingly, while each task and its variations (e.g., with different object positions) can be learned with reasonable success, these algorithms struggle to learn with multiple tasks at the same time, even with as few as ten distinct training tasks. Our analysis and open-source environments pave the way for future research in multi-task learning and meta-learning that can enable meaningful generalization, thereby unlocking the full potential of these methods.
    A Clinically Inspired Approach for Melanoma classification. (arXiv:2106.08021v1 [cs.CV])
    (2 min) Melanoma is a leading cause of deaths due to skin cancer deaths and hence, early and effective diagnosis of melanoma is of interest. Current approaches for automated diagnosis of melanoma either use pattern recognition or analytical recognition like ABCDE (asymmetry, border, color, diameter and evolving) criterion. In practice however, a differential approach wherein outliers (ugly duckling) are detected and used to evaluate nevi/lesions. Incorporation of differential recognition in Computer Aided Diagnosis (CAD) systems has not been explored but can be beneficial as it can provide a clinical justification for the derived decision. We present a method for identifying and quantifying ugly ducklings by performing Intra-Patient Comparative Analysis (IPCA) of neighboring nevi. This is then incorporated in a CAD system design for melanoma detection. This design ensures flexibility to handle cases where IPCA is not possible. Our experiments on a public dataset show that the outlier information helps boost the sensitivity of detection by at least 4.1 % and specificity by 4.0 % to 8.9 %, depending on the use of a strong (EfficientNet) or moderately strong (VGG or ResNet) classifier.
    Residual Reinforcement Learning from Demonstrations. (arXiv:2106.08050v1 [cs.LG])
    (2 min) Residual reinforcement learning (RL) has been proposed as a way to solve challenging robotic tasks by adapting control actions from a conventional feedback controller to maximize a reward signal. We extend the residual formulation to learn from visual inputs and sparse rewards using demonstrations. Learning from images, proprioceptive inputs and a sparse task-completion reward relaxes the requirement of accessing full state features, such as object and target positions. In addition, replacing the base controller with a policy learned from demonstrations removes the dependency on a hand-engineered controller in favour of a dataset of demonstrations, which can be provided by non-experts. Our experimental evaluation on simulated manipulation tasks on a 6-DoF UR5 arm and a 28-DoF dexterous hand demonstrates that residual RL from demonstrations is able to generalize to unseen environment conditions more flexibly than either behavioral cloning or RL fine-tuning, and is capable of solving high-dimensional, sparse-reward tasks out of reach for RL from scratch.
    Combining Semantic Guidance and Deep Reinforcement Learning For Generating Human Level Paintings. (arXiv:2011.12589v2 [cs.CV] UPDATED)
    (2 min) Generation of stroke-based non-photorealistic imagery, is an important problem in the computer vision community. As an endeavor in this direction, substantial recent research efforts have been focused on teaching machines "how to paint", in a manner similar to a human painter. However, the applicability of previous methods has been limited to datasets with little variation in position, scale and saliency of the foreground object. As a consequence, we find that these methods struggle to cover the granularity and diversity possessed by real world images. To this end, we propose a Semantic Guidance pipeline with 1) a bi-level painting procedure for learning the distinction between foreground and background brush strokes at training time. 2) We also introduce invariance to the position and scale of the foreground object through a neural alignment model, which combines object localization and spatial transformer networks in an end to end manner, to zoom into a particular semantic instance. 3) The distinguishing features of the in-focus object are then amplified by maximizing a novel guided backpropagation based focus reward. The proposed agent does not require any supervision on human stroke-data and successfully handles variations in foreground object attributes, thus, producing much higher quality canvases for the CUB-200 Birds and Stanford Cars-196 datasets. Finally, we demonstrate the further efficacy of our method on complex datasets with multiple foreground object instances by evaluating an extension of our method on the challenging Virtual-KITTI dataset. Source code and models are available at https://github.com/1jsingh/semantic-guidance.
    Self-Supervised Learning with Kernel Dependence Maximization. (arXiv:2106.08320v1 [stat.ML])
    (2 min) We approach self-supervised learning of image representations from a statistical dependence perspective, proposing Self-Supervised Learning with the Hilbert-Schmidt Independence Criterion (SSL-HSIC). SSL-HSIC maximizes dependence between representations of transformed versions of an image and the image identity, while minimizing the kernelized variance of those features. This self-supervised learning framework yields a new understanding of InfoNCE, a variational lower bound on the mutual information (MI) between different transformations. While the MI itself is known to have pathologies which can result in meaningless representations being learned, its bound is much better behaved: we show that it implicitly approximates SSL-HSIC (with a slightly different regularizer). Our approach also gives us insight into BYOL, since SSL-HSIC similarly learns local neighborhoods of samples. SSL-HSIC allows us to directly optimize statistical dependence in time linear in the batch size, without restrictive data assumptions or indirect mutual information estimators. Trained with or without a target network, SSL-HSIC matches the current state-of-the-art for standard linear evaluation on ImageNet, semi-supervised learning and transfer to other classification and vision tasks such as semantic segmentation, depth estimation and object recognition.
    Decoupling Value and Policy for Generalization in Reinforcement Learning. (arXiv:2102.10330v2 [cs.LG] UPDATED)
    (2 min) Standard deep reinforcement learning algorithms use a shared representation for the policy and value function, especially when training directly from images. However, we argue that more information is needed to accurately estimate the value function than to learn the optimal policy. Consequently, the use of a shared representation for the policy and value function can lead to overfitting. To alleviate this problem, we propose two approaches which are combined to create IDAAC: Invariant Decoupled Advantage Actor-Critic. First, IDAAC decouples the optimization of the policy and value function, using separate networks to model them. Second, it introduces an auxiliary loss which encourages the representation to be invariant to task-irrelevant properties of the environment. IDAAC shows good generalization to unseen environments, achieving a new state-of-the-art on the Procgen benchmark and outperforming popular methods on DeepMind Control tasks with distractors. Our implementation is available at https://github.com/rraileanu/idaac.
    Topics to Avoid: Demoting Latent Confounds in Text Classification. (arXiv:1909.00453v2 [cs.LG] UPDATED)
    (2 min) Despite impressive performance on many text classification tasks, deep neural networks tend to learn frequent superficial patterns that are specific to the training data and do not always generalize well. In this work, we observe this limitation with respect to the task of native language identification. We find that standard text classifiers which perform well on the test set end up learning topical features which are confounds of the prediction task (e.g., if the input text mentions Sweden, the classifier predicts that the author's native language is Swedish). We propose a method that represents the latent topical confounds and a model which "unlearns" confounding features by predicting both the label of the input text and the confound; but we train the two predictors adversarially in an alternating fashion to learn a text representation that predicts the correct label but is less prone to using information about the confound. We show that this model generalizes better and learns features that are indicative of the writing style rather than the content.
    Awardee Solution of KDD Cup 2021 OGB Large-Scale Challenge Graph-Level Track. (arXiv:2106.08279v1 [cs.LG])
    (2 min) In this technical report, we present our solution of KDD Cup 2021 OGB Large-Scale Challenge - PCQM4M-LSC Track. We adopt Graphormer and ExpC as our basic models. We train each model by 8-fold cross-validation, and additionally train two Graphormer models on the union of training and validation sets with different random seeds. For final submission, we use a naive ensemble for these 18 models by taking average of their outputs. Using our method, our team MachineLearning achieved 0.1200 MAE on test set.
    Tensor Q-Rank: New Data Dependent Definition of Tensor Rank. (arXiv:1910.12016v4 [cs.LG] UPDATED)
    (2 min) Recently, the \textit{Tensor Nuclear Norm~(TNN)} regularization based on t-SVD has been widely used in various low tubal-rank tensor recovery tasks. However, these models usually require smooth change of data along the third dimension to ensure their low rank structures. In this paper, we propose a new definition of data dependent tensor rank named \textit{tensor Q-rank} by a learnable orthogonal matrix $\mathbf{Q}$, and further introduce a unified data dependent low rank tensor recovery model. According to the low rank hypothesis, we introduce two explainable selection method of $\mathbf{Q}$, under which the data tensor may have a more significant low tensor Q-rank structure than that of low tubal-rank structure. Specifically, maximizing the variance of singular value distribution leads to Variance Maximization Tensor Q-Nuclear norm~(VMTQN), while minimizing the value of nuclear norm through manifold optimization leads to Manifold Optimization Tensor Q-Nuclear norm~(MOTQN). Moreover, we apply these two models to the low rank tensor completion problem, and then give an effective algorithm and briefly analyze why our method works better than TNN based methods in the case of complex data with low sampling rate. Finally, experimental results on real-world datasets demonstrate the superiority of our proposed model in the tensor completion problem with respect to other tensor rank regularization models.
    Active feature selection discovers minimal gene-sets for classifying cell-types and disease states in single-cell mRNA-seq data. (arXiv:2106.08317v1 [q-bio.GN])
    (2 min) Sequencing costs currently prohibit the application of single cell mRNA-seq for many biological and clinical tasks of interest. Here, we introduce an active learning framework that constructs compressed gene sets that enable high accuracy classification of cell-types and physiological states while analyzing a minimal number of gene transcripts. Our active feature selection procedure constructs gene sets through an iterative cell-type classification task where misclassified cells are examined at each round to identify maximally informative genes through an `active' support vector machine (SVM) classifier. Our active SVM procedure automatically identifies gene sets that enables $>90\%$ cell-type classification accuracy in the Tabula Muris mouse tissue survey as well as a $\sim 40$ gene set that enables classification of multiple myeloma patient samples with $>95\%$ accuracy. Broadly, the discovery of compact but highly informative gene sets might enable drastic reductions in sequencing requirements for applications of single-cell mRNA-seq.
    Machine Learning with Electronic Health Records is vulnerable to Backdoor Trigger Attacks. (arXiv:2106.07925v1 [cs.LG])
    (2 min) Electronic Health Records (EHRs) provide a wealth of information for machine learning algorithms to predict the patient outcome from the data including diagnostic information, vital signals, lab tests, drug administration, and demographic information. Machine learning models can be built, for example, to evaluate patients based on their predicted mortality or morbidity and to predict required resources for efficient resource management in hospitals. In this paper, we demonstrate that an attacker can manipulate the machine learning predictions with EHRs easily and selectively at test time by backdoor attacks with the poisoned training data. Furthermore, the poison we create has statistically similar features to the original data making it hard to detect, and can also attack multiple machine learning models without any knowledge of the models. With less than 5% of the raw EHR data poisoned, we achieve average attack success rates of 97% on mortality prediction tasks with MIMIC-III database against Logistic Regression, Multilayer Perceptron, and Long Short-term Memory models simultaneously.
    Thompson Sampling for Unimodal Bandits. (arXiv:2106.08187v1 [cs.LG])
    (2 min) In this paper, we propose a Thompson Sampling algorithm for \emph{unimodal} bandits, where the expected reward is unimodal over the partially ordered arms. To exploit the unimodal structure better, at each step, instead of exploration from the entire decision space, our algorithm makes decision according to posterior distribution only in the neighborhood of the arm that has the highest empirical mean estimate. We theoretically prove that, for Bernoulli rewards, the regret of our algorithm reaches the lower bound of unimodal bandits, thus it is asymptotically optimal. For Gaussian rewards, the regret of our algorithm is $\mathcal{O}(\log T)$, which is far better than standard Thompson Sampling algorithms. Extensive experiments demonstrate the effectiveness of the proposed algorithm on both synthetic data sets and the real-world applications.
    Very Deep Graph Neural Networks Via Noise Regularisation. (arXiv:2106.07971v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) perform learned message passing over an input graph, but conventional wisdom says performing more than handful of steps makes training difficult and does not yield improved performance. Here we show the contrary. We train a deep GNN with up to 100 message passing steps and achieve several state-of-the-art results on two challenging molecular property prediction benchmarks, Open Catalyst 2020 IS2RE and QM9. Our approach depends crucially on a novel but simple regularisation method, which we call ``Noisy Nodes'', in which we corrupt the input graph with noise and add an auxiliary node autoencoder loss if the task is graph property prediction. Our results show this regularisation method allows the model to monotonically improve in performance with increased message passing steps. Our work opens new opportunities for reaping the benefits of deep neural networks in the space of graph and other structured prediction problems.
    Encouraging Intra-Class Diversity Through a Reverse Contrastive Loss for Better Single-Source Domain Generalization. (arXiv:2106.07916v1 [cs.CV])
    (2 min) Traditional deep learning algorithms often fail to generalize when they are tested outside of the domain of training data. Because data distributions can change dynamically in real-life applications once a learned model is deployed, in this paper we are interested in single-source domain generalization (SDG) which aims to develop deep learning algorithms able to generalize from a single training domain where no information about the test domain is available at training time. Firstly, we design two simple MNISTbased SDG benchmarks, namely MNIST Color SDG-MP and MNIST Color SDG-UP, which highlight the two different fundamental SDG issues of increasing difficulties: 1) a class-correlated pattern in the training domain is missing (SDG-MP), or 2) uncorrelated with the class (SDG-UP), in the testing data domain. This is in sharp contrast with the current domain generalization (DG) benchmarks which mix up different correlation and variation factors and thereby make hard to disentangle success or failure factors when benchmarking DG algorithms. We further evaluate several state-of-the-art SDG algorithms through our simple benchmark, namely MNIST Color SDG-MP, and show that the issue SDG-MP is largely unsolved despite of a decade of efforts in developing DG algorithms. Finally, we also propose a partially reversed contrastive loss to encourage intra-class diversity and find less strongly correlated patterns, to deal with SDG-MP and show that the proposed approach is very effective on our MNIST Color SDG-MP benchmark.
    Non-Gradient Manifold Neural Network. (arXiv:2106.07905v1 [cs.LG])
    (2 min) Deep neural network (DNN) generally takes thousands of iterations to optimize via gradient descent and thus has a slow convergence. In addition, softmax, as a decision layer, may ignore the distribution information of the data during classification. Aiming to tackle the referred problems, we propose a novel manifold neural network based on non-gradient optimization, i.e., the closed-form solutions. Considering that the activation function is generally invertible, we reconstruct the network via forward ridge regression and low rank backward approximation, which achieve the rapid convergence. Moreover, by unifying the flexible Stiefel manifold and adaptive support vector machine, we devise the novel decision layer which efficiently fits the manifold structure of the data and label information. Consequently, a jointly non-gradient optimization method is designed to generate the network with closed-form results. Eventually, extensive experiments validate the superior performance of the model.
    Graph-based Label Propagation for Semi-Supervised Speaker Identification. (arXiv:2106.08207v1 [cs.SD])
    (2 min) Speaker identification in the household scenario (e.g., for smart speakers) is typically based on only a few enrollment utterances but a much larger set of unlabeled data, suggesting semisupervised learning to improve speaker profiles. We propose a graph-based semi-supervised learning approach for speaker identification in the household scenario, to leverage the unlabeled speech samples. In contrast to most of the works in speaker recognition that focus on speaker-discriminative embeddings, this work focuses on speaker label inference (scoring). Given a pre-trained embedding extractor, graph-based learning allows us to integrate information about both labeled and unlabeled utterances. Considering each utterance as a graph node, we represent pairwise utterance similarity scores as edge weights. Graphs are constructed per household, and speaker identities are propagated to unlabeled nodes to optimize a global consistency criterion. We show in experiments on the VoxCeleb dataset that this approach makes effective use of unlabeled data and improves speaker identification accuracy compared to two state-of-the-art scoring methods as well as their semi-supervised variants based on pseudo-labels.
    GeoMol: Torsional Geometric Generation of Molecular 3D Conformer Ensembles. (arXiv:2106.07802v1 [physics.chem-ph])
    (2 min) Prediction of a molecule's 3D conformer ensemble from the molecular graph holds a key role in areas of cheminformatics and drug discovery. Existing generative models have several drawbacks including lack of modeling important molecular geometry elements (e.g. torsion angles), separate optimization stages prone to error accumulation, and the need for structure fine-tuning based on approximate classical force-fields or computationally expensive methods such as metadynamics with approximate quantum mechanics calculations at each geometry. We propose GeoMol--an end-to-end, non-autoregressive and SE(3)-invariant machine learning approach to generate distributions of low-energy molecular 3D conformers. Leveraging the power of message passing neural networks (MPNNs) to capture local and global graph information, we predict local atomic 3D structures and torsion angles, avoiding unnecessary over-parameterization of the geometric degrees of freedom (e.g. one angle per non-terminal bond). Such local predictions suffice both for the training loss computation, as well as for the full deterministic conformer assembly (at test time). We devise a non-adversarial optimal transport based loss function to promote diverse conformer generation. GeoMol predominantly outperforms popular open-source, commercial, or state-of-the-art machine learning (ML) models, while achieving significant speed-ups. We expect such differentiable 3D structure generators to significantly impact molecular modeling and related applications.
    Closing the Reality Gap with Unsupervised Sim-to-Real Image Translation. (arXiv:1911.01529v2 [cs.LG] UPDATED)
    (0 min) Deep learning approaches have become the standard solution to many problems in computer vision and robotics, but obtaining sufficient training data in high enough quality is challenging, as human labor is error prone, time consuming, and expensive. Solutions based on simulation have become more popular in recent years, but the gap between simulation and reality is still a major issue. In this paper, we introduce a novel method for augmenting synthetic image data through unsupervised image-to-image translation by applying the style of real world images to simulated images with open source frameworks. The generated dataset is combined with conventional augmentation methods and is then applied to a neural network model running in real-time on autonomous soccer robots. Our evaluation shows a significant improvement compared to models trained on images generated entirely in simulation.
    Optimal Latent Vector Alignment for Unsupervised Domain Adaptation in Medical Image Segmentation. (arXiv:2106.08188v1 [cs.LG])
    (2 min) This paper addresses the domain shift problem for segmentation. As a solution, we propose OLVA, a novel and lightweight unsupervised domain adaptation method based on a Variational Auto-Encoder (VAE) and Optimal Transport (OT) theory. Thanks to the VAE, our model learns a shared cross-domain latent space that follows a normal distribution, which reduces the domain shift. To guarantee valid segmentations, our shared latent space is designed to model the shape rather than the intensity variations. We further rely on an OT loss to match and align the remaining discrepancy between the two domains in the latent space. We demonstrate OLVA's effectiveness for the segmentation of multiple cardiac structures on the public Multi-Modality Whole Heart Segmentation (MM-WHS) dataset, where the source domain consists of annotated 3D MR images and the unlabelled target domain of 3D CTs. Our results show remarkable improvements with an additional margin of 12.5\% dice score over concurrent generative training approaches.
    Potato Crop Stress Identification in Aerial Images using Deep Learning-based Object Detection. (arXiv:2106.07770v1 [cs.CV])
    (2 min) Recent research on the application of remote sensing and deep learning-based analysis in precision agriculture demonstrated a potential for improved crop management and reduced environmental impacts of agricultural production. Despite the promising results, the practical relevance of these technologies for actual field deployment requires novel algorithms that are customized for analysis of agricultural images and robust to implementation on natural field imagery. The paper presents an approach for analyzing aerial images of a potato crop using deep neural networks. The main objective is to demonstrate automated spatial recognition of a healthy versus stressed crop at a plant level. Specifically, we examine premature plant senescence resulting in drought stress on Russet Burbank potato plants. The proposed deep learning model, named Retina-UNet-Ag, is a variant of Retina-UNet (Jaeger et al., 2018) and includes connections from low-level semantic dense representation maps to the feature pyramid network. The paper also introduces a dataset of field images acquired with a Parrot Sequoia camera carried by a Solo unmanned aerial vehicle. Experimental validation demonstrated the ability for distinguishing healthy and stressed plants in field images, achieving an average Dice score coefficient of 0.74. A comparison to related state-of-the-art deep learning models for object detection revealed that the presented approach is effective for the task at hand. The method applied here is conducive toward the assessment and recognition of potato crop stress (early plant senescence resulting from drought stress in this case) in natural aerial field images collected under real conditions.
    Multi-StyleGAN: Towards Image-Based Simulation of Time-Lapse Live-Cell Microscopy. (arXiv:2106.08285v1 [cs.CV])
    (0 min) Time-lapse fluorescent microscopy (TLFM) combined with predictive mathematical modelling is a powerful tool to study the inherently dynamic processes of life on the single-cell level. Such experiments are costly, complex and labour intensive. A complimentary approach and a step towards completely in silico experiments, is to synthesise the imagery itself. Here, we propose Multi-StyleGAN as a descriptive approach to simulate time-lapse fluorescence microscopy imagery of living cells, based on a past experiment. This novel generative adversarial network synthesises a multi-domain sequence of consecutive timesteps. We showcase Multi-StyleGAN on imagery of multiple live yeast cells in microstructured environments and train on a dataset recorded in our laboratory. The simulation captures underlying biophysical factors and time dependencies, such as cell morphology, growth, physical interactions, as well as the intensity of a fluorescent reporter protein. An immediate application is to generate additional training and validation data for feature extraction algorithms or to aid and expedite development of advanced experimental techniques such as online monitoring or control of cells. Code and dataset is available at https://git.rwth-aachen.de/bcs/projects/tp/multi-stylegan.
    Composing Normalizing Flows for Inverse Problems. (arXiv:2002.11743v3 [stat.ML] UPDATED)
    (0 min) Given an inverse problem with a normalizing flow prior, we wish to estimate the distribution of the underlying signal conditioned on the observations. We approach this problem as a task of conditional inference on the pre-trained unconditional flow model. We first establish that this is computationally hard for a large class of flow models. Motivated by this, we propose a framework for approximate inference that estimates the target conditional as a composition of two flow models. This formulation leads to a stable variational inference training procedure that avoids adversarial training. Our method is evaluated on a variety of inverse problems and is shown to produce high-quality samples with uncertainty quantification. We further demonstrate that our approach can be amortized for zero-shot inference.
    PATE-AAE: Incorporating Adversarial Autoencoder into Private Aggregation of Teacher Ensembles for Spoken Command Classification. (arXiv:2104.01271v2 [cs.SD] UPDATED)
    (0 min) We propose using an adversarial autoencoder (AAE) to replace generative adversarial network (GAN) in the private aggregation of teacher ensembles (PATE), a solution for ensuring differential privacy in speech applications. The AAE architecture allows us to obtain good synthetic speech leveraging upon a discriminative training of latent vectors. Such synthetic speech is used to build a privacy-preserving classifier when non-sensitive data is not sufficiently available in the public domain. This classifier follows the PATE scheme that uses an ensemble of noisy outputs to label the synthetic samples and guarantee $\varepsilon$-differential privacy (DP) on its derived classifiers. Our proposed framework thus consists of an AAE-based generator and a PATE-based classifier (PATE-AAE). Evaluated on the Google Speech Commands Dataset Version II, the proposed PATE-AAE improves the average classification accuracy by +$2.11\%$ and +$6.60\%$, respectively, when compared with alternative privacy-preserving solutions, namely PATE-GAN and DP-GAN, while maintaining a strong level of privacy target at $\varepsilon$=0.01 with a fixed $\delta$=10$^{-5}$.
    Speaker Diarization using Two-pass Leave-One-Out Gaussian PLDA Clustering of DNN Embeddings. (arXiv:2104.02469v3 [eess.AS] UPDATED)
    (0 min) Many modern systems for speaker diarization, such as the recently-developed VBx approach, rely on clustering of DNN speaker embeddings followed by resegmentation. Two problems with this approach are that the DNN is not directly optimized for this task, and the parameters need significant retuning for different applications. We have recently presented progress in this direction with a Leave-One-Out Gaussian PLDA (LGP) clustering algorithm and an approach to training the DNN such that embeddings directly optimize performance of this scoring method. This paper presents a new two-pass version of this system, where the second pass uses finer time resolution to significantly improve overall performance. For the Callhome corpus, we achieve the first published error rate below 4% without any task-dependent parameter tuning. We also show significant progress towards a robust single solution for multiple diarization tasks.
    Spot the Difference: Topological Anomaly Detection via Geometric Alignment. (arXiv:2106.08233v1 [cs.CV])
    (0 min) Geometric alignment appears in a variety of applications, ranging from domain adaptation, optimal transport, and normalizing flows in machine learning; optical flow and learned augmentation in computer vision and deformable registration within biomedical imaging. A recurring challenge is the alignment of domains whose topology is not the same; a problem that is routinely ignored, potentially introducing bias in downstream analysis. As a first step towards solving such alignment problems, we propose an unsupervised topological difference detection algorithm. The model is based on a conditional variational auto-encoder and detects topological anomalies with regards to a reference alongside the registration step. We consider both a) topological changes in the image under spatial variation and b) unexpected transformations. Our approach is validated on a proxy task of unsupervised anomaly detection in images.
    Mean Embeddings with Test-Time Data Augmentation for Ensembling of Representations. (arXiv:2106.08038v1 [cs.LG])
    (0 min) Averaging predictions over a set of models -- an ensemble -- is widely used to improve predictive performance and uncertainty estimation of deep learning models. At the same time, many machine learning systems, such as search, matching, and recommendation systems, heavily rely on embeddings. Unfortunately, due to misalignment of features of independently trained models, embeddings, cannot be improved with a naive deep ensemble like approach. In this work, we look at the ensembling of representations and propose mean embeddings with test-time augmentation (MeTTA) simple yet well-performing recipe for ensembling representations. Empirically we demonstrate that MeTTA significantly boosts the quality of linear evaluation on ImageNet for both supervised and self-supervised models. Even more exciting, we draw connections between MeTTA, image retrieval, and transformation invariant models. We believe that spreading the success of ensembles to inference higher-quality representations is the important step that will open many new applications of ensembling.
    Amortized Probabilistic Detection of Communities in Graphs. (arXiv:2010.15727v3 [stat.ML] UPDATED)
    (0 min) Learning community structures in graphs has broad applications across scientific domains. While graph neural networks (GNNs) have been successful in encoding graph structures, existing GNN-based methods for community detection are limited by requiring knowledge of the number of communities in advance, in addition to lacking a proper probabilistic formulation to handle uncertainty. We propose a simple framework for amortized community detection, which addresses both of these issues by combining the expressive power of GNNs with recent methods for amortized clustering. Our models consist of a graph representation backbone that extracts structural information and an amortized clustering network that naturally handles variable numbers of clusters. Both components combine into well-defined models of the posterior distribution of graph communities and are jointly optimized given labeled graphs. At inference time, the models yield parallel samples from the posterior of community labels, quantifying uncertainty in a principled way. We evaluate several models from our framework on synthetic and real datasets and demonstrate superior performance to previous methods. As a separate contribution, we extend recent amortized probabilistic clustering architectures by adding attention modules, which yield further improvements on community detection tasks.
    Exploration in Online Advertising Systems with Deep Uncertainty-Aware Learning. (arXiv:2012.02298v2 [cs.IR] UPDATED)
    (0 min) Modern online advertising systems inevitably rely on personalization methods, such as click-through rate (CTR) prediction. Recent progress in CTR prediction enjoys the rich representation capabilities of deep learning and achieves great success in large-scale industrial applications. However, these methods can suffer from lack of exploration. Another line of prior work addresses the exploration-exploitation trade-off problem with contextual bandit methods, which are recently less studied in the industry due to the difficulty in extending their flexibility with deep models. In this paper, we propose a novel Deep Uncertainty-Aware Learning (DUAL) method to learn CTR models based on Gaussian processes, which can provide predictive uncertainty estimations while maintaining the flexibility of deep neural networks. DUAL can be easily implemented on existing models and deployed in real-time systems with minimal extra computational overhead. By linking the predictive uncertainty estimation ability of DUAL to well-known bandit algorithms, we further present DUAL-based Ad-ranking strategies to boost up long-term utilities such as the social welfare in advertising systems. Experimental results on several public datasets demonstrate the effectiveness of our methods. Remarkably, an online A/B test deployed in the Alibaba display advertising platform shows an 8.2% social welfare improvement and an 8.0% revenue lift.
    Understanding Accuracy-Efficiency Trade-Offs as a Means for Holding Distributed ML Systems Accountable. (arXiv:2007.02203v5 [cs.CY] UPDATED)
    (0 min) Trade-offs between accuracy and efficiency are found in multiple non-computing domains, such as law and public health, which have developed rules and heuristics to guide how to balance the two in conditions of uncertainty. While accuracy-efficiency trade-offs are also commonly acknowledged in some areas of computer science, their policy implications remain poorly examined. Drawing on risk assessment practices in the US, we argue that, since examining accuracy-efficiency trade-offs has been useful for guiding governance in other domains, explicitly framing such trade-offs in computing is similarly useful for the governance of computer systems. Our discussion focuses on real-time distributed ML systems; understanding the policy implications in this area is particularly urgent because such systems, which include autonomous vehicles, tend to be high-stakes and safety-critical. We describe how the trade-off takes shape for these systems, highlight gaps between existing US risk assessment standards and what these systems require in order to be properly assessed, and make specific calls to action to facilitate accountability when hypothetical risks become realized as accidents in the real world. We close by discussing how such accountability mechanisms encourage more just, transparent governance aligned with public values.
    Fair Sparse Regression with Clustering: An Invex Relaxation for a Combinatorial Problem. (arXiv:2102.09704v2 [cs.LG] UPDATED)
    (0 min) In this paper, we study the problem of fair sparse regression on a biased dataset where bias depends upon a hidden binary attribute. The presence of a hidden attribute adds an extra layer of complexity to the problem by combining sparse regression and clustering with unknown binary labels. The corresponding optimization problem is combinatorial, but we propose a novel relaxation of it as an \emph{invex} optimization problem. To the best of our knowledge, this is the first invex relaxation for a combinatorial problem. We show that the inclusion of the debiasing/fairness constraint in our model has no adverse effect on the performance. Rather, it enables the recovery of the hidden attribute. The support of our recovered regression parameter vector matches exactly with the true parameter vector. Moreover, we simultaneously solve the clustering problem by recovering the exact value of the hidden attribute for each sample. Our method uses carefully constructed primal dual witnesses to provide theoretical guarantees for the combinatorial problem. To that end, we show that the sample complexity of our method is logarithmic in terms of the dimension of the regression parameter vector.
    Privacy Assessment of Federated Learning using Private Personalized Layers. (arXiv:2106.08060v1 [cs.CR])
    (0 min) Federated Learning (FL) is a collaborative scheme to train a learning model across multiple participants without sharing data. While FL is a clear step forward towards enforcing users' privacy, different inference attacks have been developed. In this paper, we quantify the utility and privacy trade-off of a FL scheme using private personalized layers. While this scheme has been proposed as local adaptation to improve the accuracy of the model through local personalization, it has also the advantage to minimize the information about the model exchanged with the server. However, the privacy of such a scheme has never been quantified. Our evaluations using motion sensor dataset show that personalized layers speed up the convergence of the model and slightly improve the accuracy for all users compared to a standard FL scheme while better preventing both attribute and membership inferences compared to a FL scheme using local differential privacy.
    Constrained Contextual Bandit Learning for Adaptive Radar Waveform Selection. (arXiv:2103.05541v2 [cs.IT] UPDATED)
    (2 min) A sequential decision process in which an adaptive radar system repeatedly interacts with a finite-state target channel is studied. The radar is capable of passively sensing the spectrum at regular intervals, which provides side information for the waveform selection process. The radar transmitter uses the sequence of spectrum observations as well as feedback from a collocated receiver to select waveforms which accurately estimate target parameters. It is shown that the waveform selection problem can be effectively addressed using a linear contextual bandit formulation in a manner that is both computationally feasible and sample efficient. Stochastic and adversarial linear contextual bandit models are introduced, allowing the radar to achieve effective performance in broad classes of physical environments. Simulations in a radar-communication coexistence scenario, as well as in an adversarial radar-jammer scenario, demonstrate that the proposed formulation provides a substantial improvement in target detection performance when Thompson Sampling and EXP3 algorithms are used to drive the waveform selection process. Further, it is shown that the harmful impacts of pulse-agile behavior on coherently processed radar data can be mitigated by adopting a time-varying constraint on the radar's waveform catalog.
    Highdicom: A Python library for standardized encoding of image annotations and machine learning model outputs in pathology and radiology. (arXiv:2106.07806v1 [eess.IV])
    (2 min) Machine learning is revolutionizing image-based diagnostics in pathology and radiology. ML models have shown promising results in research settings, but their lack of interoperability has been a major barrier for clinical integration and evaluation. The DICOM a standard specifies Information Object Definitions and Services for the representation and communication of digital images and related information, including image-derived annotations and analysis results. However, the complexity of the standard represents an obstacle for its adoption in the ML community and creates a need for software libraries and tools that simplify working with data sets in DICOM format. Here we present the highdicom library, which provides a high-level application programming interface for the Python programming language that abstracts low-level details of the standard and enables encoding and decoding of image-derived information in DICOM format in a few lines of Python code. The highdicom library ties into the extensive Python ecosystem for image processing and machine learning. Simultaneously, by simplifying creation and parsing of DICOM-compliant files, highdicom achieves interoperability with the medical imaging systems that hold the data used to train and run ML models, and ultimately communicate and store model outputs for clinical use. We demonstrate through experiments with slide microscopy and computed tomography imaging, that, by bridging these two ecosystems, highdicom enables developers to train and evaluate state-of-the-art ML models in pathology and radiology while remaining compliant with the DICOM standard and interoperable with clinical systems at all stages. To promote standardization of ML research and streamline the ML model development and deployment process, we made the library available free and open-source.
    Robust and Sample Optimal Algorithms for PSD Low-Rank Approximation. (arXiv:1912.04177v5 [cs.DS] UPDATED)
    (2 min) Recently, Musco and Woodruff (FOCS, 2017) showed that given an $n \times n$ positive semidefinite (PSD) matrix $A$, it is possible to compute a $(1+\epsilon)$-approximate relative-error low-rank approximation to $A$ by querying $O(nk/\epsilon^{2.5})$ entries of $A$ in time $O(nk/\epsilon^{2.5} +n k^{\omega-1}/\epsilon^{2(\omega-1)})$. They also showed that any relative-error low-rank approximation algorithm must query $\Omega(nk/\epsilon)$ entries of $A$, this gap has since remained open. Our main result is to resolve this question by obtaining an optimal algorithm that queries $O(nk/\epsilon)$ entries of $A$ and outputs a relative-error low-rank approximation in $O(n(k/\epsilon)^{\omega-1})$ time. Note, our running time improves that of Musco and Woodruff, and matches the information-theoretic lower bound if the matrix-multiplication exponent $\omega$ is $2$. We then extend our techniques to negative-type distance matrices. Bakshi and Woodruff (NeurIPS, 2018) showed a bi-criteria, relative-error low-rank approximation which queries $O(nk/\epsilon^{2.5})$ entries and outputs a rank-$(k+4)$ matrix. We show that the bi-criteria guarantee is not necessary and obtain an $O(nk/\epsilon)$ query algorithm, which is optimal. Our algorithm applies to all distance matrices that arise from metrics satisfying negative-type inequalities, including $\ell_1, \ell_2,$ spherical metrics and hypermetrics. Next, we introduce a new robust low-rank approximation model which captures PSD matrices that have been corrupted with noise. While a sample complexity lower bound precludes sublinear algorithms for arbitrary PSD matrices, we provide the first sublinear time and query algorithms when the corruption on the diagonal entries is bounded. As a special case, we show sample-optimal sublinear time algorithms for low-rank approximation of correlation matrices corrupted by noise.
    Poisoning Deep Reinforcement Learning Agents with In-Distribution Triggers. (arXiv:2106.07798v1 [cs.LG])
    (2 min) In this paper, we propose a new data poisoning attack and apply it to deep reinforcement learning agents. Our attack centers on what we call in-distribution triggers, which are triggers native to the data distributions the model will be trained on and deployed in. We outline a simple procedure for embedding these, and other, triggers in deep reinforcement learning agents following a multi-task learning paradigm, and demonstrate in three common reinforcement learning environments. We believe that this work has important implications for the security of deep learning models.
    FedNILM: Applying Federated Learning to NILM Applications at the Edge. (arXiv:2106.07751v1 [cs.LG])
    (2 min) Non-intrusive load monitoring (NILM) helps disaggregate the household's main electricity consumption to energy usages of individual appliances, thus greatly cutting down the cost in fine-grained household load monitoring. To address the arisen privacy concern in NILM applications, federated learning (FL) could be leveraged for NILM model training and sharing. When applying the FL paradigm in real-world NILM applications, however, we are faced with the challenges of edge resource restriction, edge model personalization and edge training data scarcity. In this paper we present FedNILM, a practical FL paradigm for NILM applications at the edge client. Specifically, FedNILM is designed to deliver privacy-preserving and personalized NILM services to large-scale edge clients, by leveraging i) secure data aggregation through federated learning, ii) efficient cloud model compression via filter pruning and multi-task learning, and iii) personalized edge model building with unsupervised transfer learning. Our experiments on real-world energy data show that, FedNILM is able to achieve personalized energy disaggregation with the state-of-the-art accuracy, while ensuring privacy preserving at the edge client.
    Credit Assignment in Neural Networks through Deep Feedback Control. (arXiv:2106.07887v1 [cs.LG])
    (0 min) The success of deep learning sparked interest in whether the brain learns by using similar techniques for assigning credit to each synaptic weight for its contribution to the network output. However, the majority of current attempts at biologically-plausible learning methods are either non-local in time, require highly specific connectivity motives, or have no clear link to any known mathematical optimization method. Here, we introduce Deep Feedback Control (DFC), a new learning method that uses a feedback controller to drive a deep neural network to match a desired output target and whose control signal can be used for credit assignment. The resulting learning rule is fully local in space and time and approximates Gauss-Newton optimization for a wide range of feedback connectivity patterns. To further underline its biological plausibility, we relate DFC to a multi-compartment model of cortical pyramidal neurons with a local voltage-dependent synaptic plasticity rule, consistent with recent theories of dendritic processing. By combining dynamical system theory with mathematical optimization theory, we provide a strong theoretical foundation for DFC that we corroborate with detailed results on toy experiments and standard computer-vision benchmarks.
    Efficient Micro-Structured Weight Unification for Neural Network Compression. (arXiv:2106.08301v1 [cs.LG])
    (2 min) Compressing Deep Neural Network (DNN) models to alleviate the storage and computation requirements is essential for practical applications, especially for resource limited devices. Although capable of reducing a reasonable amount of model parameters, previous unstructured or structured weight pruning methods can hardly truly accelerate inference, either due to the poor hardware compatibility of the unstructured sparsity or due to the low sparse rate of the structurally pruned network. Aiming at reducing both storage and computation, as well as preserving the original task performance, we propose a generalized weight unification framework at a hardware compatible micro-structured level to achieve high amount of compression and acceleration. Weight coefficients of a selected micro-structured block are unified to reduce the storage and computation of the block without changing the neuron connections, which turns to a micro-structured pruning special case when all unified coefficients are set to zero, where neuron connections (hence storage and computation) are completely removed. In addition, we developed an effective training framework based on the alternating direction method of multipliers (ADMM), which converts our complex constrained optimization into separately solvable subproblems. Through iteratively optimizing the subproblems, the desired micro-structure can be ensured with high compression ratio and low performance degradation. We extensively evaluated our method using a variety of benchmark models and datasets for different applications. Experimental results demonstrate state-of-the-art performance.
    Augmented Tensor Decomposition with Stochastic Optimization. (arXiv:2106.07900v1 [math.NA])
    (0 min) Tensor decompositions are powerful tools for dimensionality reduction and feature interpretation of multidimensional data such as signals. Existing tensor decomposition objectives (e.g., Frobenius norm) are designed for fitting raw data under statistical assumptions, which may not align with downstream classification tasks. Also, real-world tensor data are usually high-ordered and have large dimensions with millions or billions of entries. Thus, it is expensive to decompose the whole tensor with traditional algorithms. In practice, raw tensor data also contains redundant information while data augmentation techniques may be used to smooth out noise in samples. This paper addresses the above challenges by proposing augmented tensor decomposition (ATD), which effectively incorporates data augmentations to boost downstream classification. To reduce the memory footprint of the decomposition, we propose a stochastic algorithm that updates the factor matrices in a batch fashion. We evaluate ATD on multiple signal datasets. It shows comparable or better performance (e.g., up to 15% in accuracy) over self-supervised and autoencoder baselines with less than 5% of model parameters, achieves 0.6% ~ 1.3% accuracy gain over other tensor-based baselines, and reduces the memory footprint by 9X when compared to standard tensor decomposition algorithms.
    Contrastive Mixture of Posteriors for Counterfactual Inference, Data Integration and Fairness. (arXiv:2106.08161v1 [stat.ML])
    (2 min) Learning meaningful representations of data that can address challenges such as batch effect correction, data integration and counterfactual inference is a central problem in many domains including computational biology. Adopting a Conditional VAE framework, we identify the mathematical principle that unites these challenges: learning a representation that is marginally independent of a condition variable. We therefore propose the Contrastive Mixture of Posteriors (CoMP) method that uses a novel misalignment penalty to enforce this independence. This penalty is defined in terms of mixtures of the variational posteriors themselves, unlike prior work which uses external discrepancy measures such as MMD to ensure independence in latent space. We show that CoMP has attractive theoretical properties compared to previous approaches, especially when there is complex global structure in latent space. We further demonstrate state of the art performance on a number of real-world problems, including the challenging tasks of aligning human tumour samples with cancer cell-lines and performing counterfactual inference on single-cell RNA sequencing data. Incidentally, we find parallels with the fair representation learning literature, and demonstrate CoMP has competitive performance in learning fair yet expressive latent representations.
    A Framework for Understanding Sources of Harm throughout the Machine Learning Life Cycle. (arXiv:1901.10002v4 [cs.LG] UPDATED)
    (0 min) As machine learning (ML) increasingly affects people and society, awareness of its potential unwanted consequences has also grown. To anticipate, prevent, and mitigate undesirable downstream consequences, it is critical that we understand when and how harm might be introduced throughout the ML life cycle. In this paper, we provide a framework that identifies seven distinct potential sources of downstream harm in machine learning, spanning data collection, development, and deployment. In doing so, we aim to facilitate more productive and precise communication around these issues, as well as more direct, application-grounded ways to mitigate them.
    Smoothness Analysis of Adversarial Training. (arXiv:2103.01400v3 [cs.LG] UPDATED)
    (2 min) Deep neural networks are vulnerable to adversarial attacks. Recent studies about adversarial robustness focus on the loss landscape in the parameter space since it is related to optimization and generalization performance. These studies conclude that the difficulty of adversarial training is caused by the non-smoothness of the loss function: i.e., its gradient is not Lipschitz continuous. However, this analysis ignores the dependence of adversarial attacks on model parameters. Since adversarial attacks are optimized for models, they should depend on the parameters. Considering this dependence, we analyze the smoothness of the loss function of adversarial training using the optimal attacks for the model parameter in more detail. We reveal that the constraint of adversarial attacks is one cause of the non-smoothness and that the smoothness depends on the types of the constraints. Specifically, the $L_\infty$ constraint can cause non-smoothness more than the $L_2$ constraint. Moreover, our analysis implies that if we flatten the loss function with respect to input data, the Lipschitz constant of the gradient of adversarial loss tends to increase. To address the non-smoothness, we show that EntropySGD smoothens the non-smooth loss and improves the performance of adversarial training.
    Non-Autoregressive Electron Redistribution Modeling for Reaction Prediction. (arXiv:2106.07801v1 [physics.chem-ph])
    (2 min) Reliably predicting the products of chemical reactions presents a fundamental challenge in synthetic chemistry. Existing machine learning approaches typically produce a reaction product by sequentially forming its subparts or intermediate molecules. Such autoregressive methods, however, not only require a pre-defined order for the incremental construction but preclude the use of parallel decoding for efficient computation. To address these issues, we devise a non-autoregressive learning paradigm that predicts reaction in one shot. Leveraging the fact that chemical reactions can be described as a redistribution of electrons in molecules, we formulate a reaction as an arbitrary electron flow and predict it with a novel multi-pointer decoding network. Experiments on the USPTO-MIT dataset show that our approach has established a new state-of-the-art top-1 accuracy and achieves at least 27 times inference speedup over the state-of-the-art methods. Also, our predictions are easier for chemists to interpret owing to predicting the electron flows.
    Machine learning-based conditional mean filter: a generalization of the ensemble Kalman filter for nonlinear data assimilation. (arXiv:2106.07908v1 [cs.LG])
    (0 min) Filtering is a data assimilation technique that performs the sequential inference of dynamical systems states from noisy observations. Herein, we propose a machine learning-based ensemble conditional mean filter (ML-EnCMF) for tracking possibly high-dimensional non-Gaussian state models with nonlinear dynamics based on sparse observations. The proposed filtering method is developed based on the conditional expectation and numerically implemented using machine learning (ML) techniques combined with the ensemble method. The contribution of this work is twofold. First, we demonstrate that the ensembles assimilated using the ensemble conditional mean filter (EnCMF) provide an unbiased estimator of the Bayesian posterior mean, and their variance matches the expected conditional variance. Second, we implement the EnCMF using artificial neural networks, which have a significant advantage in representing nonlinear functions over high-dimensional domains such as the conditional mean. Finally, we demonstrate the effectiveness of the ML-EnCMF for tracking the states of Lorenz-63 and Lorenz-96 systems under the chaotic regime. Numerical results show that the ML-EnCMF outperforms the ensemble Kalman filter.
    Memory-Associated Differential Learning. (arXiv:2102.05246v2 [cs.LG] UPDATED)
    (2 min) Conventional Supervised Learning approaches focus on the mapping from input features to output labels. After training, the learnt models alone are adapted onto testing features to predict testing labels in isolation, with training data wasted and their associations ignored. To take full advantage of the vast number of training data and their associations, we propose a novel learning paradigm called Memory-Associated Differential (MAD) Learning. We first introduce an additional component called Memory to memorize all the training data. Then we learn the differences of labels as well as the associations of features in the combination of a differential equation and some sampling methods. Finally, in the evaluating phase, we predict unknown labels by inferencing from the memorized facts plus the learnt differences and associations in a geometrically meaningful manner. We gently build this theory in unary situations and apply it on Image Recognition, then extend it into Link Prediction as a binary situation, in which our method outperforms strong state-of-the-art baselines on ogbl-ddi dataset.
    Revisiting Model Stitching to Compare Neural Representations. (arXiv:2106.07682v1 [cs.LG])
    (2 min) We revisit and extend model stitching (Lenc & Vedaldi 2015) as a methodology to study the internal representations of neural networks. Given two trained and frozen models $A$ and $B$, we consider a "stitched model'' formed by connecting the bottom-layers of $A$ to the top-layers of $B$, with a simple trainable layer between them. We argue that model stitching is a powerful and perhaps under-appreciated tool, which reveals aspects of representations that measures such as centered kernel alignment (CKA) cannot. Through extensive experiments, we use model stitching to obtain quantitative verifications for intuitive statements such as "good networks learn similar representations'', by demonstrating that good networks of the same architecture, but trained in very different ways (e.g.: supervised vs. self-supervised learning), can be stitched to each other without drop in performance. We also give evidence for the intuition that "more is better'' by showing that representations learnt with (1) more data, (2) bigger width, or (3) more training time can be "plugged in'' to weaker models to improve performance. Finally, our experiments reveal a new structural property of SGD which we call "stitching connectivity'', akin to mode-connectivity: typical minima reached by SGD can all be stitched to each other with minimal change in accuracy.
    Graphical Gaussian Process Regression Model for Aqueous Solvation Free Energy Prediction of Organic Molecules in Redox Flow Battery. (arXiv:2106.08146v1 [cs.CE])
    (0 min) The solvation free energy of organic molecules is a critical parameter in determining emergent properties such as solubility, liquid-phase equilibrium constants, and pKa and redox potentials in an organic redox flow battery. In this work, we present a machine learning (ML) model that can learn and predict the aqueous solvation free energy of an organic molecule using Gaussian process regression method based on a new molecular graph kernel. To investigate the performance of the ML model on electrostatic interaction, the nonpolar interaction contribution of solvent and the conformational entropy of solute in solvation free energy, three data sets with implicit or explicit water solvent models, and contribution of conformational entropy of solute are tested. We demonstrate that our ML model can predict the solvation free energy of molecules at chemical accuracy with a mean absolute error of less than 1 kcal/mol for subsets of the QM9 dataset and the Freesolv database. To solve the general data scarcity problem for a graph-based ML model, we propose a dimension reduction algorithm based on the distance between molecular graphs, which can be used to examine the diversity of the molecular data set. It provides a promising way to build a minimum training set to improve prediction for certain test sets where the space of molecular structures is predetermined.
    Coping with Label Shift via Distributionally Robust Optimisation. (arXiv:2010.12230v2 [cs.LG] UPDATED)
    (0 min) The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an \emph{unlabelled} test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. While approaches using this idea have proven effective, their scope is limited as it is not always feasible to access the target domain; further, they require repeated retraining if the model is to be deployed in \emph{multiple} test environments. Can one instead learn a \emph{single} classifier that is robust to arbitrary label shifts from a broad family? In this paper, we answer this question by proposing a model that minimises an objective based on distributionally robust optimisation (DRO). We then design and analyse a gradient descent-proximal mirror ascent algorithm tailored for large-scale problems to optimise the proposed objective. %, and establish its convergence. Finally, through experiments on CIFAR-100 and ImageNet, we show that our technique can significantly improve performance over a number of baselines in settings where label shift is present.
    Fairness as Equality of Opportunity: Normative Guidance from Political Philosophy. (arXiv:2106.08259v1 [cs.CY])
    (0 min) Recent interest in codifying fairness in Automated Decision Systems (ADS) has resulted in a wide range of formulations of what it means for an algorithmic system to be fair. Most of these propositions are inspired by, but inadequately grounded in, political philosophy scholarship. This paper aims to correct that deficit. We introduce a taxonomy of fairness ideals using doctrines of Equality of Opportunity (EOP) from political philosophy, clarifying their conceptions in philosophy and the proposed codification in fair machine learning. We arrange these fairness ideals onto an EOP spectrum, which serves as a useful frame to guide the design of a fair ADS in a given context. We use our fairness-as-EOP framework to re-interpret the impossibility results from a philosophical perspective, as the in-compatibility between different value systems, and demonstrate the utility of the framework with several real-world and hypothetical examples. Through our EOP-framework we hope to answer what it means for an ADS to be fair from a moral and political philosophy standpoint, and to pave the way for similar scholarship from ethics and legal experts.
    Unsupervised Abstractive Opinion Summarization by Generating Sentences with Tree-Structured Topic Guidance. (arXiv:2106.08007v1 [cs.CL])
    (0 min) This paper presents a novel unsupervised abstractive summarization method for opinionated texts. While the basic variational autoencoder-based models assume a unimodal Gaussian prior for the latent code of sentences, we alternate it with a recursive Gaussian mixture, where each mixture component corresponds to the latent code of a topic sentence and is mixed by a tree-structured topic distribution. By decoding each Gaussian component, we generate sentences with tree-structured topic guidance, where the root sentence conveys generic content, and the leaf sentences describe specific topics. Experimental results demonstrate that the generated topic sentences are appropriate as a summary of opinionated texts, which are more informative and cover more input contents than those generated by the recent unsupervised summarization model (Bra\v{z}inskas et al., 2020). Furthermore, we demonstrate that the variance of latent Gaussians represents the granularity of sentences, analogous to Gaussian word embedding (Vilnis and McCallum, 2015).
    Selfish Sparse RNN Training. (arXiv:2101.09048v3 [cs.LG] UPDATED)
    (0 min) Sparse neural networks have been widely applied to reduce the computational demands of training and deploying over-parameterized deep neural networks. For inference acceleration, methods that discover a sparse network from a pre-trained dense network (dense-to-sparse training) work effectively. Recently, dynamic sparse training (DST) has been proposed to train sparse neural networks without pre-training a dense model (sparse-to-sparse training), so that the training process can also be accelerated. However, previous sparse-to-sparse methods mainly focus on Multilayer Perceptron Networks (MLPs) and Convolutional Neural Networks (CNNs), failing to match the performance of dense-to-sparse methods in the Recurrent Neural Networks (RNNs) setting. In this paper, we propose an approach to train intrinsically sparse RNNs with a fixed parameter count in one single run, without compromising performance. During training, we allow RNN layers to have a non-uniform redistribution across cell gates for better regularization. Further, we propose SNT-ASGD, a novel variant of the averaged stochastic gradient optimizer, which significantly improves the performance of all sparse training methods for RNNs. Using these strategies, we achieve state-of-the-art sparse training results, better than the dense-to-sparse methods, with various types of RNNs on Penn TreeBank and Wikitext-2 datasets. Our codes are available at https://github.com/Shiweiliuiiiiiii/Selfish-RNN.
    BEiT: BERT Pre-Training of Image Transformers. (arXiv:2106.08254v1 [cs.CV])
    (0 min) We introduce a self-supervised vision representation model BEiT, which stands for Bidirectional Encoder representation from Image Transformers. Following BERT developed in the natural language processing area, we propose a masked image modeling task to pretrain vision Transformers. Specifically, each image has two views in our pre-training, i.e, image patches (such as 16x16 pixels), and visual tokens (i.e., discrete tokens). We first "tokenize" the original image into visual tokens. Then we randomly mask some image patches and fed them into the backbone Transformer. The pre-training objective is to recover the original visual tokens based on the corrupted image patches. After pre-training BEiT, we directly fine-tune the model parameters on downstream tasks by appending task layers upon the pretrained encoder. Experimental results on image classification and semantic segmentation show that our model achieves competitive results with previous pre-training methods. For example, base-size BEiT achieves 83.2% top-1 accuracy on ImageNet-1K, significantly outperforming from-scratch DeiT training (81.8%) with the same setup. Moreover, large-size BEiT obtains 86.3% only using ImageNet-1K, even outperforming ViT-L with supervised pre-training on ImageNet-22K (85.2%). The code and pretrained models are available at https://aka.ms/beit.
    Natural Language Adversarial Defense through Synonym Encoding. (arXiv:1909.06723v4 [cs.CL] UPDATED)
    (0 min) In the area of natural language processing, deep learning models are recently known to be vulnerable to various types of adversarial perturbations, but relatively few works are done on the defense side. Especially, there exists few effective defense method against the successful synonym substitution based attacks that preserve the syntactic structure and semantic information of the original text while fooling the deep learning models. We contribute in this direction and propose a novel adversarial defense method called Synonym Encoding Method (SEM). Specifically, SEM inserts an encoder before the input layer of the target model to map each cluster of synonyms to a unique encoding and trains the model to eliminate possible adversarial perturbations without modifying the network architecture or adding extra data. Extensive experiments demonstrate that SEM can effectively defend the current synonym substitution based attacks and block the transferability of adversarial examples. SEM is also easy and efficient to scale to large models and big datasets.
    Hotel Recognition via Latent Image Embedding. (arXiv:2106.08042v1 [cs.CV])
    (0 min) We approach the problem of hotel recognition with deep metric learning. We overview the existing approaches and propose a modification to Contrastive loss called Contrastive-Triplet loss. We construct a robust pipeline for benchmarking metric learning models and perform experiments on Hotels-50K and CUB200 datasets. Contrastive-Triplet loss is shown to achieve better retrieval on Hotels-50k. We open-source our code.
    Gradient Forward-Propagation for Large-Scale Temporal Video Modelling. (arXiv:2106.08318v1 [cs.CV])
    (0 min) How can neural networks be trained on large-volume temporal data efficiently? To compute the gradients required to update parameters, backpropagation blocks computations until the forward and backward passes are completed. For temporal signals, this introduces high latency and hinders real-time learning. It also creates a coupling between consecutive layers, which limits model parallelism and increases memory consumption. In this paper, we build upon Sideways, which avoids blocking by propagating approximate gradients forward in time, and we propose mechanisms for temporal integration of information based on different variants of skip connections. We also show how to decouple computation and delegate individual neural modules to different devices, allowing distributed and parallel training. The proposed Skip-Sideways achieves low latency training, model parallelism, and, importantly, is capable of extracting temporal features, leading to more stable training and improved performance on real-world action recognition video datasets such as HMDB51, UCF101, and the large-scale Kinetics-600. Finally, we also show that models trained with Skip-Sideways generate better future frames than Sideways models, and hence they can better utilize motion cues.
    Interactive Learning from Activity Description. (arXiv:2102.07024v2 [cs.CL] UPDATED)
    (2 min) We present a novel interactive learning protocol that enables training request-fulfilling agents by verbally describing their activities. Unlike imitation learning (IL), our protocol allows the teaching agent to provide feedback in a language that is most appropriate for them. Compared with reward in reinforcement learning (RL), the description feedback is richer and allows for improved sample complexity. We develop a probabilistic framework and an algorithm that practically implements our protocol. Empirical results in two challenging request-fulfilling problems demonstrate the strengths of our approach: compared with RL baselines, it is more sample-efficient; compared with IL baselines, it achieves competitive success rates without requiring the teaching agent to be able to demonstrate the desired behavior using the learning agent's actions. Apart from empirical evaluation, we also provide theoretical guarantees for our algorithm under certain assumptions about the teacher and the environment.
    A Value-Function-based Interior-point Method for Non-convex Bi-level Optimization. (arXiv:2106.07991v1 [math.OC])
    (2 min) Bi-level optimization model is able to capture a wide range of complex learning tasks with practical interest. Due to the witnessed efficiency in solving bi-level programs, gradient-based methods have gained popularity in the machine learning community. In this work, we propose a new gradient-based solution scheme, namely, the Bi-level Value-Function-based Interior-point Method (BVFIM). Following the main idea of the log-barrier interior-point scheme, we penalize the regularized value function of the lower level problem into the upper level objective. By further solving a sequence of differentiable unconstrained approximation problems, we consequently derive a sequential programming scheme. The numerical advantage of our scheme relies on the fact that, when gradient methods are applied to solve the approximation problem, we successfully avoid computing any expensive Hessian-vector or Jacobian-vector product. We prove the convergence without requiring any convexity assumption on either the upper level or the lower level objective. Experiments demonstrate the efficiency of the proposed BVFIM on non-convex bi-level problems.
    When does gradient descent with logistic loss find interpolating two-layer networks?. (arXiv:2012.02409v3 [stat.ML] UPDATED)
    (0 min) We study the training of finite-width two-layer smoothed ReLU networks for binary classification using the logistic loss. We show that gradient descent drives the training loss to zero if the initial loss is small enough. When the data satisfies certain cluster and separation conditions and the network is wide enough, we show that one step of gradient descent reduces the loss sufficiently that the first result applies.
    CausalNLP: A Practical Toolkit for Causal Inference with Text. (arXiv:2106.08043v1 [cs.CL])
    (2 min) The vast majority of existing methods and systems for causal inference assume that all variables under consideration are categorical or numerical (e.g., gender, price, blood pressure, enrollment). In this paper, we present CausalNLP, a toolkit for inferring causality from observational data that includes text in addition to traditional numerical and categorical variables. CausalNLP employs the use of meta-learners for treatment effect estimation and supports using raw text and its linguistic properties as both a treatment and a "controlled-for" variable (e.g., confounder). The library is open-source and available at: https://github.com/amaiya/causalnlp.
    Unique sparse decomposition of low rank matrices. (arXiv:2106.07736v1 [math.OC])
    (2 min) The problem of finding the unique low dimensional decomposition of a given matrix has been a fundamental and recurrent problem in many areas. In this paper, we study the problem of seeking a unique decomposition of a low rank matrix $Y\in \mathbb{R}^{p\times n}$ that admits a sparse representation. Specifically, we consider $Y = A X\in \mathbb{R}^{p\times n}$ where the matrix $A\in \mathbb{R}^{p\times r}$ has full column rank, with $r < \min\{n,p\}$, and the matrix $X\in \mathbb{R}^{r\times n}$ is element-wise sparse. We prove that this sparse decomposition of $Y$ can be uniquely identified, up to some intrinsic signed permutation. Our approach relies on solving a nonconvex optimization problem constrained over the unit sphere. Our geometric analysis for the nonconvex optimization landscape shows that any {\em strict} local solution is close to the ground truth solution, and can be recovered by a simple data-driven initialization followed with any second order descent algorithm. At last, we corroborate these theoretical results with numerical experiments.
    Robust learning under clean-label attack. (arXiv:2103.00671v3 [cs.LG] UPDATED)
    (2 min) We study the problem of robust learning under clean-label data-poisoning attacks, where the attacker injects (an arbitrary set of) correctly-labeled examples to the training set to fool the algorithm into making mistakes on specific test instances at test time. The learning goal is to minimize the attackable rate (the probability mass of attackable test instances), which is more difficult than optimal PAC learning. As we show, any robust algorithm with diminishing attackable rate can achieve the optimal dependence on $\epsilon$ in its PAC sample complexity, i.e., $O(1/\epsilon)$. On the other hand, the attackable rate might be large even for some optimal PAC learners, e.g., SVM for linear classifiers. Furthermore, we show that the class of linear hypotheses is not robustly learnable when the data distribution has zero margin and is robustly learnable in the case of positive margin but requires sample complexity exponential in the dimension. For a general hypothesis class with bounded VC dimension, if the attacker is limited to add at most $t>0$ poison examples, the optimal robust learning sample complexity grows almost linearly with $t$.
    Controlling Neural Networks with Rule Representations. (arXiv:2106.07804v1 [cs.LG])
    (2 min) We propose a novel training method to integrate rules into deep learning, in a way their strengths are controllable at inference. Deep Neural Networks with Controllable Rule Representations (DeepCTRL) incorporates a rule encoder into the model coupled with a rule-based objective, enabling a shared representation for decision making. DeepCTRL is agnostic to data type and model architecture. It can be applied to any kind of rule defined for inputs and outputs. The key aspect of DeepCTRL is that it does not require retraining to adapt the rule strength -- at inference, the user can adjust it based on the desired operation point on accuracy vs. rule verification ratio. In real-world domains where incorporating rules is critical -- such as Physics, Retail and Healthcare -- we show the effectiveness of DeepCTRL in teaching rules for deep learning. DeepCTRL improves the trust and reliability of the trained models by significantly increasing their rule verification ratio, while also providing accuracy gains at downstream tasks. Additionally, DeepCTRL enables novel use cases such as hypothesis testing of the rules on data samples, and unsupervised adaptation based on shared rules between datasets.
    Application of the Quantum Potential Neural Network to multi-electronic atoms. (arXiv:2106.08138v1 [quant-ph])
    (2 min) In this report, the application of the Quantum Potential Neural Network (QPNN) framework to many electron atomic systems is presented. For this study, full configuration interaction (FCI) one--electron density functions within predefined limits of accuracy were used to train the QPNN. The obtained results suggest that this new neural network is capable of learning the effective potential functions of many electron atoms in a completely unsupervised manner, and using only limited information from the probability density. Using the effective potential functions learned for each of the studied systems the QPNN was able to estimate the total energies of each of the systems (with a maximum of 10 trials) with a remarkable accuracy when compared to the FCI energies.
    Graph Neural Networks with Heterophily. (arXiv:2009.13566v3 [cs.LG] UPDATED)
    (2 min) Graph Neural Networks (GNNs) have proven to be useful for many different practical applications. However, many existing GNN models have implicitly assumed homophily among the nodes connected in the graph, and therefore have largely overlooked the important setting of heterophily, where most connected nodes are from different classes. In this work, we propose a novel framework called CPGNN that generalizes GNNs for graphs with either homophily or heterophily. The proposed framework incorporates an interpretable compatibility matrix for modeling the heterophily or homophily level in the graph, which can be learned in an end-to-end fashion, enabling it to go beyond the assumption of strong homophily. Theoretically, we show that replacing the compatibility matrix in our framework with the identity (which represents pure homophily) reduces to GCN. Our extensive experiments demonstrate the effectiveness of our approach in more realistic and challenging experimental settings with significantly less training data compared to previous works: CPGNN variants achieve state-of-the-art results in heterophily settings with or without contextual node features, while maintaining comparable performance in homophily settings.
    Incorporating Domain Knowledge into Health Recommender Systems using Hyperbolic Embeddings. (arXiv:2106.07720v1 [cs.IR])
    (2 min) In contrast to many other domains, recommender systems in health services may benefit particularly from the incorporation of health domain knowledge, as it helps to provide meaningful and personalised recommendations catering to the individual's health needs. With recent advances in representation learning enabling the hierarchical embedding of health knowledge into the hyperbolic Poincare space, this work proposes a content-based recommender system for patient-doctor matchmaking in primary care based on patients' health profiles, enriched by pre-trained Poincare embeddings of the ICD-9 codes through transfer learning. The proposed model outperforms its conventional counterpart in terms of recommendation accuracy and has several important business implications for improving the patient-doctor relationship.
    Sequence-Level Training for Non-Autoregressive Neural Machine Translation. (arXiv:2106.08122v1 [cs.CL])
    (2 min) In recent years, Neural Machine Translation (NMT) has achieved notable results in various translation tasks. However, the word-by-word generation manner determined by the autoregressive mechanism leads to high translation latency of the NMT and restricts its low-latency applications. Non-Autoregressive Neural Machine Translation (NAT) removes the autoregressive mechanism and achieves significant decoding speedup through generating target words independently and simultaneously. Nevertheless, NAT still takes the word-level cross-entropy loss as the training objective, which is not optimal because the output of NAT cannot be properly evaluated due to the multimodality problem. In this paper, we propose using sequence-level training objectives to train NAT models, which evaluate the NAT outputs as a whole and correlates well with the real translation quality. Firstly, we propose training NAT models to optimize sequence-level evaluation metrics (e.g., BLEU) based on several novel reinforcement algorithms customized for NAT, which outperforms the conventional method by reducing the variance of gradient estimation. Secondly, we introduce a novel training objective for NAT models, which aims to minimize the Bag-of-Ngrams (BoN) difference between the model output and the reference sentence. The BoN training objective is differentiable and can be calculated efficiently without doing any approximations. Finally, we apply a three-stage training strategy to combine these two methods to train the NAT model. We validate our approach on four translation tasks (WMT14 En$\leftrightarrow$De, WMT16 En$\leftrightarrow$Ro), which shows that our approach largely outperforms NAT baselines and achieves remarkable performance on all translation tasks.
    Federated Stochastic Gradient Langevin Dynamics. (arXiv:2004.11231v3 [stat.ML] UPDATED)
    (2 min) Stochastic gradient MCMC methods, such as stochastic gradient Langevin dynamics (SGLD), employ fast but noisy gradient estimates to enable large-scale posterior sampling. Although we can easily extend SGLD to distributed settings, it suffers from two issues when applied to federated non-IID data. First, the variance of these estimates increases significantly. Second, delaying communication causes the Markov chains to diverge from the true posterior even for very simple models. To alleviate both these problems, we propose conducive gradients, a simple mechanism that combines local likelihood approximations to correct gradient updates. Notably, conducive gradients are easy to compute, and since we only calculate the approximations once, they incur negligible overhead. We apply conducive gradients to distributed stochastic gradient Langevin dynamics (DSGLD) and call the resulting method federated stochastic gradient Langevin dynamics (FSGLD). We demonstrate that our approach can handle delayed communication rounds, converging to the target posterior in cases where DSGLD fails. We also show that FSGLD outperforms DSGLD for non-IID federated data with experiments on metric learning and neural networks.
    Counterfactual Explanations as Interventions in Latent Space. (arXiv:2106.07754v1 [cs.AI])
    (2 min) Explainable Artificial Intelligence (XAI) is a set of techniques that allows the understanding of both technical and non-technical aspects of Artificial Intelligence (AI) systems. XAI is crucial to help satisfying the increasingly important demand of \emph{trustworthy} Artificial Intelligence, characterized by fundamental characteristics such as respect of human autonomy, prevention of harm, transparency, accountability, etc. Within XAI techniques, counterfactual explanations aim to provide to end users a set of features (and their corresponding values) that need to be changed in order to achieve a desired outcome. Current approaches rarely take into account the feasibility of actions needed to achieve the proposed explanations, and in particular they fall short of considering the causal impact of such actions. In this paper, we present Counterfactual Explanations as Interventions in Latent Space (CEILS), a methodology to generate counterfactual explanations capturing by design the underlying causal relations from the data, and at the same time to provide feasible recommendations to reach the proposed profile. Moreover, our methodology has the advantage that it can be set on top of existing counterfactuals generator algorithms, thus minimising the complexity of imposing additional causal constrains. We demonstrate the effectiveness of our approach with a set of different experiments using synthetic and real datasets (including a proprietary dataset of the financial domain).
    Error Diffusion Halftoning Against Adversarial Examples. (arXiv:2101.09451v2 [cs.CV] UPDATED)
    (2 min) Adversarial examples contain carefully crafted perturbations that can fool deep neural networks (DNNs) into making wrong predictions. Enhancing the adversarial robustness of DNNs has gained considerable interest in recent years. Although image transformation-based defenses were widely considered at an earlier time, most of them have been defeated by adaptive attacks. In this paper, we propose a new image transformation defense based on error diffusion halftoning, and combine it with adversarial training to defend against adversarial examples. Error diffusion halftoning projects an image into a 1-bit space and diffuses quantization error to neighboring pixels. This process can remove adversarial perturbations from a given image while maintaining acceptable image quality in the meantime in favor of recognition. Experimental results demonstrate that the proposed method is able to improve adversarial robustness even under advanced adaptive attacks, while most of the other image transformation-based defenses do not. We show that a proper image transformation can still be an effective defense approach. Code: https://github.com/shaoyuanlo/Halftoning-Defense
    Scalable Marginal Likelihood Estimation for Model Selection in Deep Learning. (arXiv:2104.04975v3 [stat.ML] UPDATED)
    (0 min) Marginal-likelihood based model-selection, even though promising, is rarely used in deep learning due to estimation difficulties. Instead, most approaches rely on validation data, which may not be readily available. In this work, we present a scalable marginal-likelihood estimation method to select both hyperparameters and network architectures, based on the training data alone. Some hyperparameters can be estimated online during training, simplifying the procedure. Our marginal-likelihood estimate is based on Laplace's method and Gauss-Newton approximations to the Hessian, and it outperforms cross-validation and manual-tuning on standard regression and image classification datasets, especially in terms of calibration and out-of-distribution detection. Our work shows that marginal likelihoods can improve generalization and be useful when validation data is unavailable (e.g., in nonstationary settings).
    Model-Based Domain Generalization. (arXiv:2102.11436v2 [stat.ML] UPDATED)
    (2 min) Despite remarkable success in a variety of applications, it is well-known that deep learning can fail catastrophically when presented with out-of-distribution data. Toward addressing this challenge, we consider the domain generalization problem, wherein predictors are trained using data drawn from a family of related training domains and then evaluated on a distinct and unseen test domain. We show that under a natural model of data generation and a concomitant invariance condition, the domain generalization problem is equivalent to an infinite-dimensional constrained statistical learning problem; this problem forms the basis of our approach, which we call Model-Based Domain Generalization. Due to the inherent challenges in solving constrained optimization problems in deep learning, we exploit nonconvex duality theory to develop unconstrained relaxations of this statistical problem with tight bounds on the duality gap. Based on this theoretical motivation, we propose a novel domain generalization algorithm with convergence guarantees. In our experiments, we report improvements of up to 30 percentage points over state-of-the-art domain generalization baselines on several benchmarks including ColoredMNIST, Camelyon17-WILDS, FMoW-WILDS, and PACS.
    Hypergraph Dissimilarity Measures. (arXiv:2106.08206v1 [cs.LG])
    (2 min) In this paper, we propose two novel approaches for hypergraph comparison. The first approach transforms the hypergraph into a graph representation for use of standard graph dissimilarity measures. The second approach exploits the mathematics of tensors to intrinsically capture multi-way relations. For each approach, we present measures that assess hypergraph dissimilarity at a specific scale or provide a more holistic multi-scale comparison. We test these measures on synthetic hypergraphs and apply them to biological datasets.
    Inexact-ADMM Based Federated Meta-Learning for Fast and Continual Edge Learning. (arXiv:2012.08677v3 [cs.LG] UPDATED)
    (2 min) In order to meet the requirements for performance, safety, and latency in many IoT applications, intelligent decisions must be made right here right now at the network edge. However, the constrained resources and limited local data amount pose significant challenges to the development of edge AI. To overcome these challenges, we explore continual edge learning capable of leveraging the knowledge transfer from previous tasks. Aiming to achieve fast and continual edge learning, we propose a platform-aided federated meta-learning architecture where edge nodes collaboratively learn a meta-model, aided by the knowledge transfer from prior tasks. The edge learning problem is cast as a regularized optimization problem, where the valuable knowledge learned from previous tasks is extracted as regularization. Then, we devise an ADMM based federated meta-learning algorithm, namely ADMM-FedMeta, where ADMM offers a natural mechanism to decompose the original problem into many subproblems which can be solved in parallel across edge nodes and the platform. Further, a variant of inexact-ADMM method is employed where the subproblems are `solved' via linear approximation as well as Hessian estimation to reduce the computational cost per round to $\mathcal{O}(n)$. We provide a comprehensive analysis of ADMM-FedMeta, in terms of the convergence properties, the rapid adaptation performance, and the forgetting effect of prior knowledge transfer, for the general non-convex case. Extensive experimental studies demonstrate the effectiveness and efficiency of ADMM-FedMeta, and showcase that it substantially outperforms the existing baselines.
    Overcomplete Representations Against Adversarial Videos. (arXiv:2012.04262v2 [cs.CV] UPDATED)
    (2 min) Adversarial robustness of deep neural networks is an extensively studied problem in the literature and various methods have been proposed to defend against adversarial images. However, only a handful of defense methods have been developed for defending against attacked videos. In this paper, we propose a novel Over-and-Under complete restoration network for Defending against adversarial videos (OUDefend). Most restoration networks adopt an encoder-decoder architecture that first shrinks spatial dimension then expands it back. This approach learns undercomplete representations, which have large receptive fields to collect global information but overlooks local details. On the other hand, overcomplete representations have opposite properties. Hence, OUDefend is designed to balance local and global features by learning those two representations. We attach OUDefend to target video recognition models as a feature restoration block and train the entire network end-to-end. Experimental results show that the defenses focusing on images may be ineffective to videos, while OUDefend enhances robustness against different types of adversarial videos, ranging from additive attacks, multiplicative attacks to physically realizable attacks. Code: https://github.com/shaoyuanlo/OUDefend
    Generating Data Augmentation samples for Semantic Segmentation of Salt Bodies in a Synthetic Seismic Image Dataset. (arXiv:2106.08269v1 [cs.CV])
    (2 min) Nowadays, subsurface salt body localization and delineation, also called semantic segmentation of salt bodies, are among the most challenging geophysicist tasks. Thus, identifying large salt bodies is notoriously tricky and is crucial for identifying hydrocarbon reservoirs and drill path planning. This work proposes a Data Augmentation method based on training two generative models to augment the number of samples in a seismic image dataset for the semantic segmentation of salt bodies. Our method uses deep learning models to generate pairs of seismic image patches and their respective salt masks for the Data Augmentation. The first model is a Variational Autoencoder and is responsible for generating patches of salt body masks. The second is a Conditional Normalizing Flow model, which receives the generated masks as inputs and generates the associated seismic image patches. We evaluate the proposed method by comparing the performance of ten distinct state-of-the-art models for semantic segmentation, trained with and without the generated augmentations, in a dataset from two synthetic seismic images. The proposed methodology yields an average improvement of 8.57% in the IoU metric across all compared models. The best result is achieved by a DeeplabV3+ model variant, which presents an IoU score of 95.17% when trained with our augmentations. Additionally, our proposal outperformed six selected data augmentation methods, and the most significant improvement in the comparison, of 9.77%, is achieved by composing our DA with augmentations from an elastic transformation. At last, we show that the proposed method is adaptable for a larger context size by achieving results comparable to the obtained on the smaller context size.
    Decentralized Local Stochastic Extra-Gradient for Variational Inequalities. (arXiv:2106.08315v1 [math.OC])
    (2 min) We consider decentralized stochastic variational inequalities where the problem data is distributed across many participating devices (heterogeneous, or non-IID data setting). We propose a novel method - based on stochastic extra-gradient - where participating devices can communicate over arbitrary, possibly time-varying network topologies. This covers both the fully decentralized optimization setting and the centralized topologies commonly used in Federated Learning. Our method further supports multiple local updates on the workers for reducing the communication frequency between workers. We theoretically analyze the proposed scheme in the strongly monotone, monotone and non-monotone setting. As a special case, our method and analysis apply in particular to decentralized stochastic min-max problems which are being studied with increased interest in Deep Learning. For example, the training objective of Generative Adversarial Networks (GANs) are typically saddle point problems and the decentralized training of GANs has been reported to be extremely challenging. While SOTA techniques rely on either repeated gossip rounds or proximal updates, we alleviate both of these requirements. Experimental results for decentralized GAN demonstrate the effectiveness of our proposed algorithm.
    Differentially Private Quantiles. (arXiv:2102.08244v2 [cs.LG] UPDATED)
    (0 min) Quantiles are often used for summarizing and understanding data. If that data is sensitive, it may be necessary to compute quantiles in a way that is differentially private, providing theoretical guarantees that the result does not reveal private information. However, when multiple quantiles are needed, existing differentially private algorithms fare poorly: they either compute quantiles individually, splitting the privacy budget, or summarize the entire distribution, wasting effort. In either case the result is reduced accuracy. In this work we propose an instance of the exponential mechanism that simultaneously estimates exactly $m$ quantiles from $n$ data points while guaranteeing differential privacy. The utility function is carefully structured to allow for an efficient implementation that returns estimates of all $m$ quantiles in time $O(mn\log(n) + m^2n)$. Experiments show that our method significantly outperforms the current state of the art on both real and synthetic data while remaining efficient enough to be practical.
    Multivariate Uncertainty in Deep Learning. (arXiv:1910.14215v2 [cs.LG] UPDATED)
    (2 min) Deep learning has the potential to dramatically impact navigation and tracking state estimation problems critical to autonomous vehicles and robotics. Measurement uncertainties in state estimation systems based on Kalman and other Bayes filters are typically assumed to be a fixed covariance matrix. This assumption is risky, particularly for "black box" deep learning models, in which uncertainty can vary dramatically and unexpectedly. Accurate quantification of multivariate uncertainty will allow for the full potential of deep learning to be used more safely and reliably in these applications. We show how to model multivariate uncertainty for regression problems with neural networks, incorporating both aleatoric and epistemic sources of heteroscedastic uncertainty. We train a deep uncertainty covariance matrix model in two ways: directly using a multivariate Gaussian density loss function, and indirectly using end-to-end training through a Kalman filter. We experimentally show in a visual tracking problem the large impact that accurate multivariate uncertainty quantification can have on Kalman filter performance for both in-domain and out-of-domain evaluation data. We additionally show in a challenging visual odometry problem how end-to-end filter training can allow uncertainty predictions to compensate for filter weaknesses.
    CatBoost model with synthetic features in application to loan risk assessment of small businesses. (arXiv:2106.07954v1 [cs.CE])
    (0 min) Loan risk for small business has long been a complex problem worthy of exploring. Predicting the loan risk approximately can benefit entrepreneurship by developing more jobs for the society. CatBoost (Categorical Boosting) is a powerful machine learning algorithm that is suitable for dataset with many categorical variables like the dataset for forecasting loan risk. In this paper, we identify the important risk factors that contribute to loan status classification problem. Then we compare the the performance between boosting-type algorithms(especially CatBoost) with other traditional yet popular ones. The dataset we adopt in the research comes from the U.S. Small Business Administration (SBA) and holds a very large sample size (899,164 observations and 27 features). We obtain a high accuracy of 95.74% and well-performed AUC of 98.59% compared with the existent literature of related research. In order to make best use of the important features in the dataset, we propose a technique named "synthetic generation" to develop more combined features based on arithmetic operation, which ends up improving the accuracy and AUC of original CatBoost model.
    Physics-Informed Neural Network for Modelling the Thermochemical Curing Process of Composite-Tool Systems During Manufacture. (arXiv:2011.13511v2 [cs.LG] UPDATED)
    (2 min) We present a Physics-Informed Neural Network (PINN) to simulate the thermochemical evolution of a composite material on a tool undergoing cure in an autoclave. In particular, we solve the governing coupled system of differential equations -- including conductive heat transfer and resin cure kinetics -- by optimizing the parameters of a deep neural network (DNN) using a physics-based loss function. To account for the vastly different behaviour of thermal conduction and resin cure, we design a PINN consisting of two disconnected subnetworks, and develop a sequential training algorithm that mitigates instability present in traditional training methods. Further, we incorporate explicit discontinuities into the DNN at the composite-tool interface and enforce known physical behaviour directly in the loss function to improve the solution near the interface. We train the PINN with a technique that automatically adapts the weights on the loss terms corresponding to PDE, boundary, interface, and initial conditions. Finally, we demonstrate that one can include problem parameters as an input to the model -- resulting in a surrogate that provides real-time simulation for a range of problem settings -- and that one can use transfer learning to significantly reduce the training time for problem settings similar to that of an initial trained model. The performance of the proposed PINN is demonstrated in multiple scenarios with different material thicknesses and thermal boundary conditions.
    Probabilistic Margins for Instance Reweighting in Adversarial Training. (arXiv:2106.07904v1 [cs.LG])
    (0 min) Reweighting adversarial data during training has been recently shown to improve adversarial robustness, where data closer to the current decision boundaries are regarded as more critical and given larger weights. However, existing methods measuring the closeness are not very reliable: they are discrete and can take only a few values, and they are path-dependent, i.e., they may change given the same start and end points with different attack paths. In this paper, we propose three types of probabilistic margin (PM), which are continuous and path-independent, for measuring the aforementioned closeness and reweighting adversarial data. Specifically, a PM is defined as the difference between two estimated class-posterior probabilities, e.g., such the probability of the true label minus the probability of the most confusing label given some natural data. Though different PMs capture different geometric properties, all three PMs share a negative correlation with the vulnerability of data: data with larger/smaller PMs are safer/riskier and should have smaller/larger weights. Experiments demonstrate that PMs are reliable measurements and PM-based reweighting methods outperform state-of-the-art methods.
    Tracing Back Music Emotion Predictions to Sound Sources and Intuitive Perceptual Qualities. (arXiv:2106.07787v1 [cs.SD])
    (0 min) Music emotion recognition is an important task in MIR (Music Information Retrieval) research. Owing to factors like the subjective nature of the task and the variation of emotional cues between musical genres, there are still significant challenges in developing reliable and generalizable models. One important step towards better models would be to understand what a model is actually learning from the data and how the prediction for a particular input is made. In previous work, we have shown how to derive explanations of model predictions in terms of spectrogram image segments that connect to the high-level emotion prediction via a layer of easily interpretable perceptual features. However, that scheme lacks intuitive musical comprehensibility at the spectrogram level. In the present work, we bridge this gap by merging audioLIME -- a source-separation based explainer -- with mid-level perceptual features, thus forming an intuitive connection chain between the input audio and the output emotion predictions. We demonstrate the usefulness of this method by applying it to debug a biased emotion prediction model.
    On Large-Cohort Training for Federated Learning. (arXiv:2106.07820v1 [cs.LG])
    (0 min) Federated learning methods typically learn a model by iteratively sampling updates from a population of clients. In this work, we explore how the number of clients sampled at each round (the cohort size) impacts the quality of the learned model and the training dynamics of federated learning algorithms. Our work poses three fundamental questions. First, what challenges arise when trying to scale federated learning to larger cohorts? Second, what parallels exist between cohort sizes in federated learning and batch sizes in centralized learning? Last, how can we design federated learning methods that effectively utilize larger cohort sizes? We give partial answers to these questions based on extensive empirical evaluation. Our work highlights a number of challenges stemming from the use of larger cohorts. While some of these (such as generalization issues and diminishing returns) are analogs of large-batch training challenges, others (including training failures and fairness concerns) are unique to federated learning.
    Learning Combinatorial Node Labeling Algorithms. (arXiv:2106.03594v2 [cs.LG] UPDATED)
    (0 min) We present a graph neural network to learn graph coloring heuristics using reinforcement learning. Our learned deterministic heuristics give better solutions than classical degree-based greedy heuristics and only take seconds to evaluate on graphs with tens of thousands of vertices. As our approach is based on policy-gradients, it also learns a probabilistic policy as well. These probabilistic policies outperform all greedy coloring baselines and a machine learning baseline. Our approach generalizes several previous machine-learning frameworks, which applied to problems like minimum vertex cover. We also demonstrate that our approach outperforms two greedy heuristics on minimum vertex cover.
    Disentangling Syntax and Semantics in the Brain with Deep Networks. (arXiv:2103.01620v2 [cs.CL] UPDATED)
    (0 min) The activations of language transformers like GPT-2 have been shown to linearly map onto brain activity during speech comprehension. However, the nature of these activations remains largely unknown and presumably conflate distinct linguistic classes. Here, we propose a taxonomy to factorize the high-dimensional activations of language models into four combinatorial classes: lexical, compositional, syntactic, and semantic representations. We then introduce a statistical method to decompose, through the lens of GPT-2's activations, the brain activity of 345 subjects recorded with functional magnetic resonance imaging (fMRI) during the listening of ~4.6 hours of narrated text. The results highlight two findings. First, compositional representations recruit a more widespread cortical network than lexical ones, and encompass the bilateral temporal, parietal and prefrontal cortices. Second, contrary to previous claims, syntax and semantics are not associated with separated modules, but, instead, appear to share a common and distributed neural substrate. Overall, this study introduces a versatile framework to isolate, in the brain activity, the distributed representations of linguistic constructs.
    CAN-LOC: Spoofing Detection and Physical Intrusion Localization on an In-Vehicle CAN Bus Based on Deep Features of Voltage Signals. (arXiv:2106.07895v1 [cs.CR])
    (0 min) The Controller Area Network (CAN) is used for communication between in-vehicle devices. The CAN bus has been shown to be vulnerable to remote attacks. To harden vehicles against such attacks, vehicle manufacturers have divided in-vehicle networks into sub-networks, logically isolating critical devices. However, attackers may still have physical access to various sub-networks where they can connect a malicious device. This threat has not been adequately addressed, as methods proposed to determine physical intrusion points have shown weak results, emphasizing the need to develop more advanced techniques. To address this type of threat, we propose a security hardening system for in-vehicle networks. The proposed system includes two mechanisms that process deep features extracted from voltage signals measured on the CAN bus. The first mechanism uses data augmentation and deep learning to detect and locate physical intrusions when the vehicle starts; this mechanism can detect and locate intrusions, even when the connected malicious devices are silent. This mechanism's effectiveness (100% accuracy) is demonstrated in a wide variety of insertion scenarios on a CAN bus prototype. The second mechanism is a continuous device authentication mechanism, which is also based on deep learning; this mechanism's robustness (99.8% accuracy) is demonstrated on a real moving vehicle.
    Multivariate Business Process Representation Learning utilizing Gramian Angular Fields and Convolutional Neural Networks. (arXiv:2106.08027v1 [cs.LG])
    (0 min) Learning meaningful representations of data is an important aspect of machine learning and has recently been successfully applied to many domains like language understanding or computer vision. Instead of training a model for one specific task, representation learning is about training a model to capture all useful information in the underlying data and make it accessible for a predictor. For predictive process analytics, it is essential to have all explanatory characteristics of a process instance available when making predictions about the future, as well as for clustering and anomaly detection. Due to the large variety of perspectives and types within business process data, generating a good representation is a challenging task. In this paper, we propose a novel approach for representation learning of business process instances which can process and combine most perspectives in an event log. In conjunction with a self-supervised pre-training method, we show the capabilities of the approach through a visualization of the representation space and case retrieval. Furthermore, the pre-trained model is fine-tuned to multiple process prediction tasks and demonstrates its effectiveness in comparison with existing approaches.
    Semantic Representation and Inference for NLP. (arXiv:2106.08117v1 [cs.CL])
    (0 min) Semantic representation and inference is essential for Natural Language Processing (NLP). The state of the art for semantic representation and inference is deep learning, and particularly Recurrent Neural Networks (RNNs), Convolutional Neural Networks (CNNs), and transformer Self-Attention models. This thesis investigates the use of deep learning for novel semantic representation and inference, and makes contributions in the following three areas: creating training data, improving semantic representations and extending inference learning. In terms of creating training data, we contribute the largest publicly available dataset of real-life factual claims for the purpose of automatic claim verification (MultiFC), and we present a novel inference model composed of multi-scale CNNs with different kernel sizes that learn from external sources to infer fact checking labels. In terms of improving semantic representations, we contribute a novel model that captures non-compositional semantic indicators. By definition, the meaning of a non-compositional phrase cannot be inferred from the individual meanings of its composing words (e.g., hot dog). Motivated by this, we operationalize the compositionality of a phrase contextually by enriching the phrase representation with external word embeddings and knowledge graphs. Finally, in terms of inference learning, we propose a series of novel deep learning architectures that improve inference by using syntactic dependencies, by ensembling role guided attention heads, incorporating gating layers, and concatenating multiple heads in novel and effective ways. This thesis consists of seven publications (five published and two under review).
    Aggregating From Multiple Target-Shifted Sources. (arXiv:2105.04051v2 [cs.LG] UPDATED)
    (0 min) Multi-source domain adaptation aims at leveraging the knowledge from multiple tasks for predicting a related target domain. Hence, a crucial aspect is to properly combine different sources based on their relations. In this paper, we analyzed the problem for aggregating source domains with different label distributions, where most recent source selection approaches fail. Our proposed algorithm differs from previous approaches in two key ways: the model aggregates multiple sources mainly through the similarity of semantic conditional distribution rather than marginal distribution; the model proposes a \emph{unified} framework to select relevant sources for three popular scenarios, i.e., domain adaptation with limited label on target domain, unsupervised domain adaptation and label partial unsupervised domain adaption. We evaluate the proposed method through extensive experiments. The empirical results significantly outperform the baselines.
    Counterfactual Explanations for Machine Learning: Challenges Revisited. (arXiv:2106.07756v1 [cs.LG])
    (0 min) Counterfactual explanations (CFEs) are an emerging technique under the umbrella of interpretability of machine learning (ML) models. They provide ``what if'' feedback of the form ``if an input datapoint were $x'$ instead of $x$, then an ML model's output would be $y'$ instead of $y$.'' Counterfactual explainability for ML models has yet to see widespread adoption in industry. In this short paper, we posit reasons for this slow uptake. Leveraging recent work outlining desirable properties of CFEs and our experience running the ML wing of a model monitoring startup, we identify outstanding obstacles hindering CFE deployment in industry.
    SUPER-ADAM: Faster and Universal Framework of Adaptive Gradients. (arXiv:2106.08208v1 [math.OC])
    (0 min) Adaptive gradient methods have shown excellent performance for solving many machine learning problems. Although multiple adaptive methods were recently studied, they mainly focus on either empirical or theoretical aspects and also only work for specific problems by using specific adaptive learning rates. It is desired to design a universal framework for practical algorithms of adaptive gradients with theoretical guarantee to solve general problems. To fill this gap, we propose a faster and universal framework of adaptive gradients (i.e., SUPER-ADAM) by introducing a universal adaptive matrix that includes most existing adaptive gradient forms. Moreover, our framework can flexibly integrates the momentum and variance reduced techniques. In particular, our novel framework provides the convergence analysis support for adaptive gradient methods under the nonconvex setting. In theoretical analysis, we prove that our new algorithm can achieve the best known complexity of $\tilde{O}(\epsilon^{-3})$ for finding an $\epsilon$-stationary point of nonconvex optimization, which matches the lower bound for stochastic smooth nonconvex optimization. In numerical experiments, we employ various deep learning tasks to validate that our algorithm consistently outperforms the existing adaptive algorithms.
    EuroCrops: A Pan-European Dataset for Time Series Crop Type Classification. (arXiv:2106.08151v1 [eess.IV])
    (0 min) We present EuroCrops, a dataset based on self-declared field annotations for training and evaluating methods for crop type classification and mapping, together with its process of acquisition and harmonisation. By this, we aim to enrich the research efforts and discussion for data-driven land cover classification via Earth observation and remote sensing. Additionally, through inclusion of self-declarations gathered in the scope of subsidy control from all countries of the European Union (EU), this dataset highlights the difficulties and pitfalls one comes across when operating on a transnational level. We, therefore, also introduce a new taxonomy scheme, HCAT-ID, that aspires to capture all the aspects of reference data originating from administrative and agency databases. To address researchers from both the remote sensing and the computer vision and machine learning communities, we publish the dataset in different formats and processing levels.
    Reasoning Over Virtual Knowledge Bases With Open Predicate Relations. (arXiv:2102.07043v2 [cs.AI] UPDATED)
    (0 min) We present the Open Predicate Query Language (OPQL); a method for constructing a virtual KB (VKB) trained entirely from text. Large Knowledge Bases (KBs) are indispensable for a wide-range of industry applications such as question answering and recommendation. Typically, KBs encode world knowledge in a structured, readily accessible form derived from laborious human annotation efforts. Unfortunately, while they are extremely high precision, KBs are inevitably highly incomplete and automated methods for enriching them are far too inaccurate. Instead, OPQL constructs a VKB by encoding and indexing a set of relation mentions in a way that naturally enables reasoning and can be trained without any structured supervision. We demonstrate that OPQL outperforms prior VKB methods on two different KB reasoning tasks and, additionally, can be used as an external memory integrated into a language model (OPQL-LM) leading to improvements on two open-domain question answering tasks.
    Improving Robustness of Graph Neural Networks with Heterophily-Inspired Designs. (arXiv:2106.07767v1 [cs.LG])
    (0 min) Recent studies have exposed that many graph neural networks (GNNs) are sensitive to adversarial attacks, and can suffer from performance loss if the graph structure is intentionally perturbed. A different line of research has shown that many GNN architectures implicitly assume that the underlying graph displays homophily, i.e., connected nodes are more likely to have similar features and class labels, and perform poorly if this assumption is not fulfilled. In this work, we formalize the relation between these two seemingly different issues. We theoretically show that in the standard scenario in which node features exhibit homophily, impactful structural attacks always lead to increased levels of heterophily. Then, inspired by GNN architectures that target heterophily, we present two designs -- (i) separate aggregators for ego- and neighbor-embeddings, and (ii) a reduced scope of aggregation -- that can significantly improve the robustness of GNNs. Our extensive empirical evaluations show that GNNs featuring merely these two designs can achieve significantly improved robustness compared to the best-performing unvaccinated model with 24.99% gain in average performance under targeted attacks, while having smaller computational overhead than existing defense mechanisms. Furthermore, these designs can be readily combined with explicit defense mechanisms to yield state-of-the-art robustness with up to 18.33% increase in performance under attacks compared to the best-performing vaccinated model.
    Control Variates for Slate Off-Policy Evaluation. (arXiv:2106.07914v1 [cs.LG])
    (0 min) We study the problem of off-policy evaluation from batched contextual bandit data with multidimensional actions, often termed slates. The problem is common to recommender systems and user-interface optimization, and it is particularly challenging because of the combinatorially-sized action space. Swaminathan et al. (2017) have proposed the pseudoinverse (PI) estimator under the assumption that the conditional mean rewards are additive in actions. Using control variates, we consider a large class of unbiased estimators that includes as specific cases the PI estimator and (asymptotically) its self-normalized variant. By optimizing over this class, we obtain new estimators with risk improvement guarantees over both the PI and self-normalized PI estimators. Experiments with real-world recommender data as well as synthetic data validate these improvements in practice.
    Causal Navigation by Continuous-time Neural Networks. (arXiv:2106.08314v1 [cs.LG])
    (0 min) Imitation learning enables high-fidelity, vision-based learning of policies within rich, photorealistic environments. However, such techniques often rely on traditional discrete-time neural models and face difficulties in generalizing to domain shifts by failing to account for the causal relationships between the agent and the environment. In this paper, we propose a theoretical and experimental framework for learning causal representations using continuous-time neural networks, specifically over their discrete-time counterparts. We evaluate our method in the context of visual-control learning of drones over a series of complex tasks, ranging from short- and long-term navigation, to chasing static and dynamic objects through photorealistic environments. Our results demonstrate that causal continuous-time deep models can perform robust navigation tasks, where advanced recurrent models fail. These models learn complex causal control representations directly from raw visual inputs and scale to solve a variety of tasks using imitation learning.
    On Multi-objective Policy Optimization as a Tool for Reinforcement Learning. (arXiv:2106.08199v1 [cs.LG])
    (0 min) Many advances that have improved the robustness and efficiency of deep reinforcement learning (RL) algorithms can, in one way or another, be understood as introducing additional objectives, or constraints, in the policy optimization step. This includes ideas as far ranging as exploration bonuses, entropy regularization, and regularization toward teachers or data priors when learning from experts or in offline RL. Often, task reward and auxiliary objectives are in conflict with each other and it is therefore natural to treat these examples as instances of multi-objective (MO) optimization problems. We study the principles underlying MORL and introduce a new algorithm, Distillation of a Mixture of Experts (DiME), that is intuitive and scale-invariant under some conditions. We highlight its strengths on standard MO benchmark problems and consider case studies in which we recast offline RL and learning from experts as MO problems. This leads to a natural algorithmic formulation that sheds light on the connection between existing approaches. For offline RL, we use the MO perspective to derive a simple algorithm, that optimizes for the standard RL objective plus a behavioral cloning term. This outperforms state-of-the-art on two established offline RL benchmarks.
    Natural continual learning: success is a journey, not (just) a destination. (arXiv:2106.08085v1 [cs.LG])
    (0 min) Biological agents are known to learn many different tasks over the course of their lives, and to be able to revisit previous tasks and behaviors with little to no loss in performance. In contrast, artificial agents are prone to 'catastrophic forgetting' whereby performance on previous tasks deteriorates rapidly as new ones are acquired. This shortcoming has recently been addressed using methods that encourage parameters to stay close to those used for previous tasks. This can be done by (i) using specific parameter regularizers that map out suitable destinations in parameter space, or (ii) guiding the optimization journey by projecting gradients into subspaces that do not interfere with previous tasks. However, parameter regularization has been shown to be relatively ineffective in recurrent neural networks (RNNs), a setting relevant to the study of neural dynamics supporting biological continual learning. Similarly, projection based methods can reach capacity and fail to learn any further as the number of tasks increases. To address these limitations, we propose Natural Continual Learning (NCL), a new method that unifies weight regularization and projected gradient descent. NCL uses Bayesian weight regularization to encourage good performance on all tasks at convergence and combines this with gradient projections designed to prevent catastrophic forgetting during optimization. NCL formalizes gradient projection as a trust region algorithm based on the Fisher information metric, and achieves scalability via a novel Kronecker-factored approximation strategy. Our method outperforms both standard weight regularization techniques and projection based approaches when applied to continual learning problems in RNNs. The trained networks evolve task-specific dynamics that are strongly preserved as new tasks are learned, similar to experimental findings in biological circuits.
    Employing an Adjusted Stability Measure for Multi-Criteria Model Fitting on Data Sets with Similar Features. (arXiv:2106.08105v1 [stat.ML])
    (0 min) Fitting models with high predictive accuracy that include all relevant but no irrelevant or redundant features is a challenging task on data sets with similar (e.g. highly correlated) features. We propose the approach of tuning the hyperparameters of a predictive model in a multi-criteria fashion with respect to predictive accuracy and feature selection stability. We evaluate this approach based on both simulated and real data sets and we compare it to the standard approach of single-criteria tuning of the hyperparameters as well as to the state-of-the-art technique "stability selection". We conclude that our approach achieves the same or better predictive performance compared to the two established approaches. Considering the stability during tuning does not decrease the predictive accuracy of the resulting models. Our approach succeeds at selecting the relevant features while avoiding irrelevant or redundant features. The single-criteria approach fails at avoiding irrelevant or redundant features and the stability selection approach fails at selecting enough relevant features for achieving acceptable predictive accuracy. For our approach, for data sets with many similar features, the feature selection stability must be evaluated with an adjusted stability measure, that is, a measure that considers similarities between features. For data sets with only few similar features, an unadjusted stability measure suffices and is faster to compute.
    Optimization-friendly generic mechanisms without money. (arXiv:2106.07752v1 [cs.GT])
    (0 min) The goal of this paper is to develop a generic framework for converting modern optimization algorithms into mechanisms where inputs come from self-interested agents. We focus on aggregating preferences from $n$ players in a context without money. Special cases of this setting include voting, allocation of items by lottery, and matching. Our key technical contribution is a new meta-algorithm we call \apex (Adaptive Pricing Equalizing Externalities). The framework is sufficiently general to be combined with any optimization algorithm that is based on local search. We outline an agenda for studying the algorithm's properties and its applications. As a special case of applying the framework to the problem of one-sided assignment with lotteries, we obtain a strengthening of the 1979 result by Hylland and Zeckhauser on allocation via a competitive equilibrium from equal incomes (CEEI). The [HZ79] result posits that there is a (fractional) allocation and a set of item prices such that the allocation is a competitive equilibrium given prices. We further show that there is always a reweighing of the players' utility values such that running unit-demand VCG with reweighed utilities leads to a HZ-equilibrium prices. Interestingly, not all HZ competitive equilibria come from VCG prices. As part of our proof, we re-prove the [HZ79] result using only Brouwer's fixed point theorem (and not the more general Kakutani's theorem). This may be of independent interest.
    Unsuitability of NOTEARS for Causal Graph Discovery. (arXiv:2104.05441v2 [stat.ML] UPDATED)
    (0 min) Causal Discovery methods aim to identify a DAG structure that represents causal relationships from observational data. In this article, we stress that it is important to test such methods for robustness in practical settings. As our main example, we analyze the NOTEARS method, for which we demonstrate a lack of scale-invariance. We show that NOTEARS is a method that aims to identify a parsimonious DAG from the data that explains the residual variance. We conclude that NOTEARS is not suitable for identifying truly causal relationships from the data.
    Improving Lossless Compression Rates via Monte Carlo Bits-Back Coding. (arXiv:2102.11086v2 [cs.LG] UPDATED)
    (0 min) Latent variable models have been successfully applied in lossless compression with the bits-back coding algorithm. However, bits-back suffers from an increase in the bitrate equal to the KL divergence between the approximate posterior and the true posterior. In this paper, we show how to remove this gap asymptotically by deriving bits-back coding algorithms from tighter variational bounds. The key idea is to exploit extended space representations of Monte Carlo estimators of the marginal likelihood. Naively applied, our schemes would require more initial bits than the standard bits-back coder, but we show how to drastically reduce this additional cost with couplings in the latent space. When parallel architectures can be exploited, our coders can achieve better rates than bits-back with little additional cost. We demonstrate improved lossless compression rates in a variety of settings, especially in out-of-distribution or sequential data compression.
    Linear-Time Probabilistic Solutions of Boundary Value Problems. (arXiv:2106.07761v1 [stat.ML])
    (0 min) We propose a fast algorithm for the probabilistic solution of boundary value problems (BVPs), which are ordinary differential equations subject to boundary conditions. In contrast to previous work, we introduce a Gauss--Markov prior and tailor it specifically to BVPs, which allows computing a posterior distribution over the solution in linear time, at a quality and cost comparable to that of well-established, non-probabilistic methods. Our model further delivers uncertainty quantification, mesh refinement, and hyperparameter adaptation. We demonstrate how these practical considerations positively impact the efficiency of the scheme. Altogether, this results in a practically usable probabilistic BVP solver that is (in contrast to non-probabilistic algorithms) natively compatible with other parts of the statistical modelling tool-chain.
    Learning Deep Morphological Networks with Neural Architecture Search. (arXiv:2106.07714v1 [cs.CV])
    (0 min) Deep Neural Networks (DNNs) are generated by sequentially performing linear and non-linear processes. Using a combination of linear and non-linear procedures is critical for generating a sufficiently deep feature space. The majority of non-linear operators are derivations of activation functions or pooling functions. Mathematical morphology is a branch of mathematics that provides non-linear operators for a variety of image processing problems. We investigate the utility of integrating these operations in an end-to-end deep learning framework in this paper. DNNs are designed to acquire a realistic representation for a particular job. Morphological operators give topological descriptors that convey salient information about the shapes of objects depicted in images. We propose a method based on meta-learning to incorporate morphological operators into DNNs. The learned architecture demonstrates how our novel morphological operations significantly increase DNN performance on various tasks, including picture classification and edge detection.
    Kernel Identification Through Transformers. (arXiv:2106.08185v1 [stat.ML])
    (0 min) Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models, as the chosen kernel determines both the inductive biases and prior support of functions under the GP prior. This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models. Drawing inspiration from recent progress in deep learning, we introduce a novel approach named KITT: Kernel Identification Through Transformers. KITT exploits a transformer-based architecture to generate kernel recommendations in under 0.1 seconds, which is several orders of magnitude faster than conventional kernel search algorithms. We train our model using synthetic data generated from priors over a vocabulary of known kernels. By exploiting the nature of the self-attention mechanism, KITT is able to process datasets with inputs of arbitrary dimension. We demonstrate that kernels chosen by KITT yield strong performance over a diverse collection of regression benchmarks.
    Extracting Global Dynamics of Loss Landscape in Deep Learning Models. (arXiv:2106.07683v1 [math.DS])
    (0 min) Deep learning models evolve through training to learn the manifold in which the data exists to satisfy an objective. It is well known that evolution leads to different final states which produce inconsistent predictions of the same test data points. This calls for techniques to be able to empirically quantify the difference in the trajectories and highlight problematic regions. While much focus is placed on discovering what models learn, the question of how a model learns is less studied beyond theoretical landscape characterizations and local geometric approximations near optimal conditions. Here, we present a toolkit for the Dynamical Organization Of Deep Learning Loss Landscapes, or DOODL3. DOODL3 formulates the training of neural networks as a dynamical system, analyzes the learning process, and presents an interpretable global view of trajectories in the loss landscape. Our approach uses the coarseness of topology to capture the granularity of geometry to mitigate against states of instability or elongated training. Overall, our analysis presents an empirical framework to extract the global dynamics of a model and to use that information to guide the training of neural networks.
    CathAI: Fully Automated Interpretation of Coronary Angiograms Using Neural Networks. (arXiv:2106.07708v1 [cs.LG])
    (0 min) Coronary heart disease (CHD) is the leading cause of adult death in the United States and worldwide, and for which the coronary angiography procedure is the primary gateway for diagnosis and clinical management decisions. The standard-of-care for interpretation of coronary angiograms depends upon ad-hoc visual assessment by the physician operator. However, ad-hoc visual interpretation of angiograms is poorly reproducible, highly variable and bias prone. Here we show for the first time that fully-automated angiogram interpretation to estimate coronary artery stenosis is possible using a sequence of deep neural network algorithms. The algorithmic pipeline we developed--called CathAI--achieves state-of-the art performance across the sequence of tasks required to accomplish automated interpretation of unselected, real-world angiograms. CathAI (Algorithms 1-2) demonstrated positive predictive value, sensitivity and F1 score of >=90% to identify the projection angle overall and >=93% for left or right coronary artery angiogram detection, the primary anatomic structures of interest. To predict obstructive coronary artery stenosis (>=70% stenosis), CathAI (Algorithm 4) exhibited an area under the receiver operating characteristic curve (AUC) of 0.862 (95% CI: 0.843-0.880). When externally validated in a healthcare system in another country, CathAI AUC was 0.869 (95% CI: 0.830-0.907) to predict obstructive coronary artery stenosis. Our results demonstrate that multiple purpose-built neural networks can function in sequence to accomplish the complex series of tasks required for automated analysis of real-world angiograms. Deployment of CathAI may serve to increase standardization and reproducibility in coronary stenosis assessment, while providing a robust foundation to accomplish future tasks for algorithmic angiographic interpretation.
    Keyword Transformer: A Self-Attention Model for Keyword Spotting. (arXiv:2104.00769v3 [eess.AS] UPDATED)
    (0 min) The Transformer architecture has been successful across many domains, including natural language processing, computer vision and speech recognition. In keyword spotting, self-attention has primarily been used on top of convolutional or recurrent encoders. We investigate a range of ways to adapt the Transformer architecture to keyword spotting and introduce the Keyword Transformer (KWT), a fully self-attentional architecture that exceeds state-of-the-art performance across multiple tasks without any pre-training or additional data. Surprisingly, this simple architecture outperforms more complex models that mix convolutional, recurrent and attentive layers. KWT can be used as a drop-in replacement for these models, setting two new benchmark records on the Google Speech Commands dataset with 98.6% and 97.7% accuracy on the 12 and 35-command tasks respectively.
    MobILE: Model-Based Imitation Learning From Observation Alone. (arXiv:2102.10769v2 [cs.LG] UPDATED)
    (0 min) This paper studies Imitation Learning from Observations alone (ILFO) where the learner is presented with expert demonstrations that consist only of states visited by an expert (without access to actions taken by the expert). We present a provably efficient model-based framework MobILE to solve the ILFO problem. MobILE involves carefully trading off strategic exploration against imitation - this is achieved by integrating the idea of optimism in the face of uncertainty into the distribution matching imitation learning (IL) framework. We provide a unified analysis for MobILE, and demonstrate that MobILE enjoys strong performance guarantees for classes of MDP dynamics that satisfy certain well studied notions of structural complexity. We also show that the ILFO problem is strictly harder than the standard IL problem by presenting an exponential sample complexity separation between IL and ILFO. We complement these theoretical results with experimental simulations on benchmark OpenAI Gym tasks that indicate the efficacy of MobILE.
    Pitfalls of Explainable ML: An Industry Perspective. (arXiv:2106.07758v1 [cs.LG])
    (0 min) As machine learning (ML) systems take a more prominent and central role in contributing to life-impacting decisions, ensuring their trustworthiness and accountability is of utmost importance. Explanations sit at the core of these desirable attributes of a ML system. The emerging field is frequently called ``Explainable AI (XAI)'' or ``Explainable ML.'' The goal of explainable ML is to intuitively explain the predictions of a ML system, while adhering to the needs to various stakeholders. Many explanation techniques were developed with contributions from both academia and industry. However, there are several existing challenges that have not garnered enough interest and serve as roadblocks to widespread adoption of explainable ML. In this short paper, we enumerate challenges in explainable ML from an industry perspective. We hope these challenges will serve as promising future research directions, and would contribute to democratizing explainable ML.
    NNrepair: Constraint-based Repair of Neural Network Classifiers. (arXiv:2103.12535v2 [cs.LG] UPDATED)
    (0 min) We present NNrepair, a constraint-based technique for repairing neural network classifiers. The technique aims to fix the logic of the network at an intermediate layer or at the last layer. NNrepair first uses fault localization to find potentially faulty network parameters (such as the weights) and then performs repair using constraint solving to apply small modifications to the parameters to remedy the defects. We present novel strategies to enable precise yet efficient repair such as inferring correctness specifications to act as oracles for intermediate layer repair, and generation of experts for each class. We demonstrate the technique in the context of three different scenarios: (1) Improving the overall accuracy of a model, (2) Fixing security vulnerabilities caused by poisoning of training data and (3) Improving the robustness of the network against adversarial attacks. Our evaluation on MNIST and CIFAR-10 models shows that NNrepair can improve the accuracy by 45.56 percentage points on poisoned data and 10.40 percentage points on adversarial data. NNrepair also provides small improvement in the overall accuracy of models, without requiring new data or re-training.
    Sample Efficient Reinforcement Learning In Continuous State Spaces: A Perspective Beyond Linearity. (arXiv:2106.07814v1 [cs.LG])
    (0 min) Reinforcement learning (RL) is empirically successful in complex nonlinear Markov decision processes (MDPs) with continuous state spaces. By contrast, the majority of theoretical RL literature requires the MDP to satisfy some form of linear structure, in order to guarantee sample efficient RL. Such efforts typically assume the transition dynamics or value function of the MDP are described by linear functions of the state features. To resolve this discrepancy between theory and practice, we introduce the Effective Planning Window (EPW) condition, a structural condition on MDPs that makes no linearity assumptions. We demonstrate that the EPW condition permits sample efficient RL, by providing an algorithm which provably solves MDPs satisfying this condition. Our algorithm requires minimal assumptions on the policy class, which can include multi-layer neural networks with nonlinear activation functions. Notably, the EPW condition is directly motivated by popular gaming benchmarks, and we show that many classic Atari games satisfy this condition. We additionally show the necessity of conditions like EPW, by demonstrating that simple MDPs with slight nonlinearities cannot be solved sample efficiently.
    Coded Machine Unlearning. (arXiv:2012.15721v2 [cs.LG] UPDATED)
    (0 min) There are applications that may require removing the trace of a sample from the system, e.g., a user requests their data to be deleted, or corrupted data is discovered. Simply removing a sample from storage units does not necessarily remove its entire trace since downstream machine learning models may store some information about the samples used to train them. A sample can be perfectly unlearned if we retrain all models that used it from scratch with that sample removed from their training dataset. When multiple such unlearning requests are expected to be served, unlearning by retraining becomes prohibitively expensive. Ensemble learning enables the training data to be split into smaller disjoint shards that are assigned to non-communicating weak learners. Each shard is used to produce a weak model. These models are then aggregated to produce the final central model. This setup introduces an inherent trade-off between performance and unlearning cost, as reducing the shard size reduces the unlearning cost but may cause degradation in performance. In this paper, we propose a coded learning protocol where we utilize linear encoders to encode the training data into shards prior to the learning phase. We also present the corresponding unlearning protocol and show that it satisfies the perfect unlearning criterion. Our experimental results show that the proposed coded machine unlearning provides a better performance versus unlearning cost trade-off compared to the uncoded baseline.
    On the Convergence of Deep Learning with Differential Privacy. (arXiv:2106.07830v1 [cs.LG])
    (2 min) In deep learning with differential privacy (DP), the neural network achieves the privacy usually at the cost of slower convergence (and thus lower performance) than its non-private counterpart. This work gives the first convergence analysis of the DP deep learning, through the lens of training dynamics and the neural tangent kernel (NTK). Our convergence theory successfully characterizes the effects of two key components in the DP training: the per-sample clipping (flat or layerwise) and the noise addition. Our analysis not only initiates a general principled framework to understand the DP deep learning with any network architecture and loss function, but also motivates a new clipping method -- the global clipping, that significantly improves the convergence while preserving the same privacy guarantee as the existing local clipping. In terms of theoretical results, we establish the precise connection between the per-sample clipping and NTK matrix. We show that in the gradient flow, i.e., with infinitesimal learning rate, the noise level of DP optimizers does not affect the convergence. We prove that DP gradient descent (GD) with global clipping guarantees the monotone convergence to zero loss, which can be violated by the existing DP-GD with local clipping. Notably, our analysis framework easily extends to other optimizers, e.g., DP-Adam. Empirically speaking, DP optimizers equipped with global clipping perform strongly on a wide range of classification and regression tasks. In particular, our global clipping is surprisingly effective at learning calibrated classifiers, in contrast to the existing DP classifiers which are oftentimes over-confident and unreliable. Implementation-wise, the new clipping can be realized by adding one line of code into the Opacus library.
    Improved Regret Bounds for Online Submodular Maximization. (arXiv:2106.07836v1 [cs.LG])
    (2 min) In this paper, we consider an online optimization problem over $T$ rounds where at each step $t\in[T]$, the algorithm chooses an action $x_t$ from the fixed convex and compact domain set $\mathcal{K}$. A utility function $f_t(\cdot)$ is then revealed and the algorithm receives the payoff $f_t(x_t)$. This problem has been previously studied under the assumption that the utilities are adversarially chosen monotone DR-submodular functions and $\mathcal{O}(\sqrt{T})$ regret bounds have been derived. We first characterize the class of strongly DR-submodular functions and then, we derive regret bounds for the following new online settings: $(1)$ $\{f_t\}_{t=1}^T$ are monotone strongly DR-submodular and chosen adversarially, $(2)$ $\{f_t\}_{t=1}^T$ are monotone submodular (while the average $\frac{1}{T}\sum_{t=1}^T f_t$ is strongly DR-submodular) and chosen by an adversary but they arrive in a uniformly random order, $(3)$ $\{f_t\}_{t=1}^T$ are drawn i.i.d. from some unknown distribution $f_t\sim \mathcal{D}$ where the expected function $f(\cdot)=\mathbb{E}_{f_t\sim\mathcal{D}}[f_t(\cdot)]$ is monotone DR-submodular. For $(1)$, we obtain the first logarithmic regret bounds. In terms of the second framework, we show that it is possible to obtain similar logarithmic bounds with high probability. Finally, for the i.i.d. model, we provide algorithms with $\tilde{\mathcal{O}}(\sqrt{T})$ stochastic regret bound, both in expectation and with high probability. Experimental results demonstrate that our algorithms outperform the previous techniques in the aforementioned three settings.
    Contextualized Attention-based Knowledge Transfer for Spoken Conversational Question Answering. (arXiv:2010.11066v3 [cs.CL] UPDATED)
    (2 min) Spoken conversational question answering (SCQA) requires machines to model complex dialogue flow given the speech utterances and text corpora. Different from traditional text question answering (QA) tasks, SCQA involves audio signal processing, passage comprehension, and contextual understanding. However, ASR systems introduce unexpected noisy signals to the transcriptions, which result in performance degradation on SCQA. To overcome the problem, we propose CADNet, a novel contextualized attention-based distillation approach, which applies both cross-attention and self-attention to obtain ASR-robust contextualized embedding representations of the passage and dialogue history for performance improvements. We also introduce the spoken conventional knowledge distillation framework to distill the ASR-robust knowledge from the estimated probabilities of the teacher model to the student. We conduct extensive experiments on the Spoken-CoQA dataset and demonstrate that our approach achieves remarkable performance in this task.
    The Recurrent Neural Tangent Kernel. (arXiv:2006.10246v4 [cs.LG] UPDATED)
    (2 min) The study of deep neural networks (DNNs) in the infinite-width limit, via the so-called neural tangent kernel (NTK) approach, has provided new insights into the dynamics of learning, generalization, and the impact of initialization. One key DNN architecture remains to be kernelized, namely, the recurrent neural network (RNN). In this paper we introduce and study the Recurrent Neural Tangent Kernel (RNTK), which provides new insights into the behavior of overparametrized RNNs. A key property of the RNTK should greatly benefit practitioners is its ability to compare inputs of different length. To this end, we characterize how the RNTK weights different time steps to form its output under different initialization parameters and nonlinearity choices. A synthetic and 56 real-world data experiments demonstrate that the RNTK offers significant performance gains over other kernels, including standard NTKs, across a wide array of data sets.
    Extracting Training Data from Large Language Models. (arXiv:2012.07805v2 [cs.CR] UPDATED)
    (2 min) It has become common to publish large (billion parameter) language models that have been trained on private datasets. This paper demonstrates that in such settings, an adversary can perform a training data extraction attack to recover individual training examples by querying the language model. We demonstrate our attack on GPT-2, a language model trained on scrapes of the public Internet, and are able to extract hundreds of verbatim text sequences from the model's training data. These extracted examples include (public) personally identifiable information (names, phone numbers, and email addresses), IRC conversations, code, and 128-bit UUIDs. Our attack is possible even though each of the above sequences are included in just one document in the training data. We comprehensively evaluate our extraction attack to understand the factors that contribute to its success. Worryingly, we find that larger models are more vulnerable than smaller models. We conclude by drawing lessons and discussing possible safeguards for training large language models.
    Top-Related Meta-Learning Method for Few-Shot Object Detection. (arXiv:2007.06837v6 [cs.CV] UPDATED)
    (3 min) Many meta-learning methods are proposed for few-shot detection. However, previous most methods have two main problems, poor detection APs, and strong bias because of imbalance and insufficient datasets. Previous works mainly alleviate these issues by additional datasets, multi-relation attention mechanisms and sub-modules. However, they require more cost. In this work, for meta-learning, we find that the main challenges focus on related or irrelevant semantic features between categories. Therefore, based on semantic features, we propose a Top-C classification loss (i.e., TCL-C) for classification task and a category-based grouping mechanism for category-based meta-features obtained by the meta-model. The TCL-C exploits the true-label prediction and the most likely C-1 false classification predictions to improve detection performance on few-shot classes. According to similar appearance (i.e., visual appearance, shape, and limbs etc.) and environment in which objects often appear, the category-based grouping mechanism splits categories into disjoint groups to make similar semantic features more compact between categories within a group and obtain more significant difference between groups, alleviating the strong bias problem and further improving detection APs. The whole training consists of the base model and the fine-tuning phases. According to grouping mechanism, we group the meta-features vectors obtained by meta-model, so that the distribution difference between groups is obvious, and the one within each group is less. Extensive experiments on Pascal VOC dataset demonstrate that ours which combines the TCL-C with category-based grouping significantly outperforms previous state-of-the-art methods for few-shot detection. Compared with previous competitive baseline, ours improves detection APs by almost 4% for few-shot detection.
    A Principle of Least Action for the Training of Neural Networks. (arXiv:2009.08372v4 [stat.ML] UPDATED)
    (2 min) Neural networks have been achieving high generalization performance on many tasks despite being highly over-parameterized. Since classical statistical learning theory struggles to explain this behavior, much effort has recently been focused on uncovering the mechanisms behind it, in the hope of developing a more adequate theoretical framework and having a better control over the trained models. In this work, we adopt an alternate perspective, viewing the neural network as a dynamical system displacing input particles over time. We conduct a series of experiments and, by analyzing the network's behavior through its displacements, we show the presence of a low kinetic energy displacement bias in the transport map of the network, and link this bias with generalization performance. From this observation, we reformulate the learning problem as follows: finding neural networks which solve the task while transporting the data as efficiently as possible. This offers a novel formulation of the learning problem which allows us to provide regularity results for the solution network, based on Optimal Transport theory. From a practical viewpoint, this allows us to propose a new learning algorithm, which automatically adapts to the complexity of the given task, and leads to networks with a high generalization ability even in low data regimes.
    Universal consistency of Wasserstein $k$-NN classifier. (arXiv:2009.04651v3 [stat.ML] UPDATED)
    (2 min) The Wasserstein distance provides a notion of dissimilarities between probability measures, which has recent applications in learning of structured data with varying size such as images and text documents. In this work, we analyze the $k$-nearest neighbor classifier ($k$-NN) under the Wasserstein distance and establish the universal consistency on families of distributions. Using previous known results on the consistency of the $k$-NN classifier on infinite dimensional metric spaces, it suffices to show that the families is a countable union of finite dimension sets. As a result, we show that the $k$-NN classifier is universally consistent on spaces of finitely supported measures, the space of Gaussian measures, and the space of measures with finite wavelet densities. In addition, we give a counterexample to show that the universal consistency does not hold on $\mathcal{W}_p((0,1))$.
    DeepKoCo: Efficient latent planning with a robust Koopman representation. (arXiv:2011.12690v2 [cs.LG] UPDATED)
    (2 min) This paper presents DeepKoCo, a novel model-based agent that learns a latent Koopman representation from images. This representation allows DeepKoCo to plan efficiently using linear control methods, such as linear model predictive control. Compared to traditional agents, DeepKoCo is robust to task-irrelevant dynamics, thanks to the use of a tailored lossy autoencoder network that allows DeepKoCo to learn latent dynamics that reconstruct and predict only observed costs, rather than all observed dynamics. As our results show, DeepKoCo achieves a similar final performance as traditional model-free methods on complex control tasks, while being considerably more robust to distractor dynamics, making the proposed agent more amenable for real-life applications.
    Data-efficient Hindsight Off-policy Option Learning. (arXiv:2007.15588v2 [cs.LG] UPDATED)
    (2 min) We introduce Hindsight Off-policy Options (HO2), a data-efficient option learning algorithm. Given any trajectory, HO2 infers likely option choices and backpropagates through the dynamic programming inference procedure to robustly train all policy components off-policy and end-to-end. The approach outperforms existing option learning methods on common benchmarks. To better understand the option framework and disentangle benefits from both temporal and action abstraction, we evaluate ablations with flat policies and mixture policies with comparable optimization. The results highlight the importance of both types of abstraction as well as off-policy training and trust-region constraints, particularly in challenging, simulated 3D robot manipulation tasks from raw pixel inputs. Finally, we intuitively adapt the inference step to investigate the effect of increased temporal abstraction on training with pre-trained options and from scratch.
    MICo: Learning improved representations via sampling-based state similarity for Markov decision processes. (arXiv:2106.08229v1 [cs.LG])
    (2 min) We present a new behavioural distance over the state space of a Markov decision process, and demonstrate the use of this distance as an effective means of shaping the learnt representations of deep reinforcement learning agents. While existing notions of state similarity are typically difficult to learn at scale due to high computational cost and lack of sample-based algorithms, our newly-proposed distance addresses both of these issues. In addition to providing detailed theoretical analysis, we provide empirical evidence that learning this distance alongside the value function yields structured and informative representations, including strong results on the Arcade Learning Environment benchmark.
    Challenges and Considerations with Code-Mixed NLP for Multilingual Societies. (arXiv:2106.07823v1 [cs.CL])
    (2 min) Multilingualism refers to the high degree of proficiency in two or more languages in the written and oral communication modes. It often results in language mixing, a.k.a. code-mixing, when a multilingual speaker switches between multiple languages in a single utterance of a text or speech. This paper discusses the current state of the NLP research, limitations, and foreseeable pitfalls in addressing five real-world applications for social good crisis management, healthcare, political campaigning, fake news, and hate speech for multilingual societies. We also propose futuristic datasets, models, and tools that can significantly advance the current research in multilingual NLP applications for the societal good. As a representative example, we consider English-Hindi code-mixing but draw similar inferences for other language pairs
    Learning Audio-Visual Dereverberation. (arXiv:2106.07732v1 [cs.SD])
    (2 min) Reverberation from audio reflecting off surfaces and objects in the environment not only degrades the quality of speech for human perception, but also severely impacts the accuracy of automatic speech recognition. Prior work attempts to remove reverberation based on the audio modality only. Our idea is to learn to dereverberate speech from audio-visual observations. The visual environment surrounding a human speaker reveals important cues about the room geometry, materials, and speaker location, all of which influence the precise reverberation effects in the audio stream. We introduce Visually-Informed Dereverberation of Audio (VIDA), an end-to-end approach that learns to remove reverberation based on both the observed sounds and visual scene. In support of this new task, we develop a large-scale dataset that uses realistic acoustic renderings of speech in real-world 3D scans of homes offering a variety of room acoustics. Demonstrating our approach on both simulated and real imagery for speech enhancement, speech recognition, and speaker identification, we show it achieves state-of-the-art performance and substantially improves over traditional audio-only methods. Project page: this http URL
    Do We Actually Need Dense Over-Parameterization? In-Time Over-Parameterization in Sparse Training. (arXiv:2102.02887v3 [cs.LG] UPDATED)
    (2 min) In this paper, we introduce a new perspective on training deep neural networks capable of state-of-the-art performance without the need for the expensive over-parameterization by proposing the concept of In-Time Over-Parameterization (ITOP) in sparse training. By starting from a random sparse network and continuously exploring sparse connectivities during training, we can perform an Over-Parameterization in the space-time manifold, closing the gap in the expressibility between sparse training and dense training. We further use ITOP to understand the underlying mechanism of Dynamic Sparse Training (DST) and indicate that the benefits of DST come from its ability to consider across time all possible parameters when searching for the optimal sparse connectivity. As long as there are sufficient parameters that have been reliably explored during training, DST can outperform the dense neural network by a large margin. We present a series of experiments to support our conjecture and achieve the state-of-the-art sparse training performance with ResNet-50 on ImageNet. More impressively, our method achieves dominant performance over the overparameterization-based sparse methods at extreme sparsity levels. When trained on CIFAR-100, our method can match the performance of the dense model even at an extreme sparsity (98%). Code can be found https://github.com/Shiweiliuiiiiiii/In-Time-Over-Parameterization.
    A Continuized View on Nesterov Acceleration for Stochastic Gradient Descent and Randomized Gossip. (arXiv:2106.07644v1 [math.OC])
    (2 min) We introduce the continuized Nesterov acceleration, a close variant of Nesterov acceleration whose variables are indexed by a continuous time parameter. The two variables continuously mix following a linear ordinary differential equation and take gradient steps at random times. This continuized variant benefits from the best of the continuous and the discrete frameworks: as a continuous process, one can use differential calculus to analyze convergence and obtain analytical expressions for the parameters; and a discretization of the continuized process can be computed exactly with convergence rates similar to those of Nesterov original acceleration. We show that the discretization has the same structure as Nesterov acceleration, but with random parameters. We provide continuized Nesterov acceleration under deterministic as well as stochastic gradients, with either additive or multiplicative noise. Finally, using our continuized framework and expressing the gossip averaging problem as the stochastic minimization of a certain energy function, we provide the first rigorous acceleration of asynchronous gossip algorithms.
    Next Generation Reservoir Computing. (arXiv:2106.07688v1 [cs.LG])
    (2 min) Reservoir computing is a best-in-class machine learning algorithm for processing information generated by dynamical systems using observed time-series data. Importantly, it requires very small training data sets, uses linear optimization, and thus requires minimal computing resources. However, the algorithm uses randomly sampled matrices to define the underlying recurrent neural network and has a multitude of metaparameters that must be optimized. Recent results demonstrate the equivalence of reservoir computing to nonlinear vector autoregression, which requires no random matrices, fewer metaparameters, and provides interpretable results. Here, we demonstrate that nonlinear vector autoregression excels at reservoir computing benchmark tasks and requires even shorter training data sets and training time, heralding the next generation of reservoir computing.
    Bringing Differential Private SGD to Practice: On the Independence of Gaussian Noise and the Number of Training Rounds. (arXiv:2102.09030v2 [cs.LG] UPDATED)
    (2 min) In the context of DP-SGD each round communicates a local SGD update which leaks some new information about the underlying local data set to the outside world. In order to provide privacy, Gaussian noise is added to local SGD updates. However, privacy leakage still aggregates over multiple training rounds. Therefore, in order to control privacy leakage over an increasing number of training rounds, we need to increase the added Gaussian noise per local SGD update. This dependence of the amount of Gaussian noise $\sigma$ on the number of training rounds $T$ may impose an impractical upper bound on $T$ (because $\sigma$ cannot be too large) leading to a low accuracy global model (because the global model receives too few local SGD updates). DP-SGD much less competitive compared to other existing privacy techniques. We show for the first time that for $(\epsilon,\delta)$-differential privacy $\sigma$ can be chosen equal to $\sqrt{2(\epsilon +\ln(1/\delta))/\epsilon}$ regardless the total number of training rounds $T$. In other words, $\sigma$ does not depend on $T$ anymore (and aggregation of privacy leakage increases to a limit). This important discovery brings DP-SGD to practice because $\sigma$ can remain small to make the trained model have high accuracy even for large $T$ as usually happens in practice.
    How to find a unicorn: a novel model-free, unsupervised anomaly detection method for time series. (arXiv:2004.11468v3 [cs.LG] UPDATED)
    (2 min) Recognition of anomalous events is a challenging but critical task in many scientific and industrial fields, especially when the properties of anomalies are unknown. In this paper, we introduce a new anomaly concept called "unicorn" or unique event and present a new, model-free, unsupervised detection algorithm to detect unicorns. The key component of the new algorithm is the Temporal Outlier Factor (TOF) to measure the uniqueness of events in continuous data sets from dynamic systems. The concept of unique events differs significantly from traditional outliers in many aspects: while repetitive outliers are no longer unique events, a unique event is not necessarily an outlier; it does not necessarily fall out from the distribution of normal activity. The performance of our algorithm was examined in recognizing unique events on different types of simulated data sets with anomalies and it was compared with the Local Outlier Factor (LOF) and discord discovery algorithms. TOF had superior performance compared to LOF and discord algorithms even in recognizing traditional outliers and it also recognized unique events that those did not. The benefits of the unicorn concept and the new detection method were illustrated by example data sets from very different scientific fields. Our algorithm successfully recognized unique events in those cases where they were already known such as the gravitational waves of a binary black hole merger on LIGO detector data and the signs of respiratory failure on ECG data series. Furthermore, unique events were found on the LIBOR data set of the last 30 years.
    Capabilities of Deep Learning Models on Learning Physical Relationships: Case of Rainfall-Runoff Modeling with LSTM. (arXiv:2106.07963v1 [physics.ao-ph])
    (2 min) This study investigates the relationships which deep learning methods can identify between the input and output data. As a case study, rainfall-runoff modeling in a snow-dominated watershed by means of a long- and short-term memory (LSTM) network is selected. Daily precipitation and mean air temperature were used as model input to estimate daily flow discharge. After model training and verification, two experimental simulations were conducted with hypothetical inputs instead of observed meteorological data to clarify the response of the trained model to the inputs. The first numerical experiment showed that even without input precipitation, the trained model generated flow discharge, particularly winter low flow and high flow during the snow-melting period. The effects of warmer and colder conditions on the flow discharge were also replicated by the trained model without precipitation. Additionally, the model reflected only 17-39% of the total precipitation mass during the snow accumulation period in the total annual flow discharge, revealing a strong lack of water mass conservation. The results of this study indicated that a deep learning method may not properly learn the explicit physical relationships between input and target variables, although they are still capable of maintaining strong goodness-of-fit results.
    Evading Malware Classifiers via Monte Carlo Mutant Feature Discovery. (arXiv:2106.07860v1 [cs.CR])
    (2 min) The use of Machine Learning has become a significant part of malware detection efforts due to the influx of new malware, an ever changing threat landscape, and the ability of Machine Learning methods to discover meaningful distinctions between malicious and benign software. Antivirus vendors have also begun to widely utilize malware classifiers based on dynamic and static malware analysis features. Therefore, a malware author might make evasive binary modifications against Machine Learning models as part of the malware development life cycle to execute an attack successfully. This makes the studying of possible classifier evasion strategies an essential part of cyber defense against malice. To this extent, we stage a grey box setup to analyze a scenario where the malware author does not know the target classifier algorithm, and does not have access to decisions made by the classifier, but knows the features used in training. In this experiment, a malicious actor trains a surrogate model using the EMBER-2018 dataset to discover binary mutations that cause an instance to be misclassified via a Monte Carlo tree search. Then, mutated malware is sent to the victim model that takes the place of an antivirus API to test whether it can evade detection.
    Parallel Training of Deep Networks with Local Updates. (arXiv:2012.03837v2 [cs.LG] UPDATED)
    (2 min) Deep learning models trained on large data sets have been widely successful in both vision and language domains. As state-of-the-art deep learning architectures have continued to grow in parameter count so have the compute budgets and times required to train them, increasing the need for compute-efficient methods that parallelize training. Two common approaches to parallelize the training of deep networks have been data and model parallelism. While useful, data and model parallelism suffer from diminishing returns in terms of compute efficiency for large batch sizes. In this paper, we investigate how to continue scaling compute efficiently beyond the point of diminishing returns for large batches through local parallelism, a framework which parallelizes training of individual layers in deep networks by replacing global backpropagation with truncated layer-wise backpropagation. Local parallelism enables fully asynchronous layer-wise parallelism with a low memory footprint, and requires little communication overhead compared with model parallelism. We show results in both vision and language domains across a diverse set of architectures, and find that local parallelism is particularly effective in the high-compute regime.
    CoDERT: Distilling Encoder Representations with Co-learning for Transducer-based Speech Recognition. (arXiv:2106.07734v1 [cs.CL])
    (2 min) We propose a simple yet effective method to compress an RNN-Transducer (RNN-T) through the well-known knowledge distillation paradigm. We show that the transducer's encoder outputs naturally have a high entropy and contain rich information about acoustically similar word-piece confusions. This rich information is suppressed when combined with the lower entropy decoder outputs to produce the joint network logits. Consequently, we introduce an auxiliary loss to distill the encoder logits from a teacher transducer's encoder, and explore training strategies where this encoder distillation works effectively. We find that tandem training of teacher and student encoders with an inplace encoder distillation outperforms the use of a pre-trained and static teacher transducer. We also report an interesting phenomenon we refer to as implicit distillation, that occurs when the teacher and student encoders share the same decoder. Our experiments show 5.37-8.4% relative word error rate reductions (WERR) on in-house test sets, and 5.05-6.18% relative WERRs on LibriSpeech test sets.
    Double Double Descent: On Generalization Errors in Transfer Learning between Linear Regression Tasks. (arXiv:2006.07002v5 [cs.LG] UPDATED)
    (2 min) We study the transfer learning process between two linear regression problems. An important and timely special case is when the regressors are overparameterized and perfectly interpolate their training data. We examine a parameter transfer mechanism whereby a subset of the parameters of the target task solution are constrained to the values learned for a related source task. We analytically characterize the generalization error of the target task in terms of the salient factors in the transfer learning architecture, i.e., the number of examples available, the number of (free) parameters in each of the tasks, the number of parameters transferred from the source to target task, and the correlation between the two tasks. Our non-asymptotic analysis shows that the generalization error of the target task follows a two-dimensional double descent trend (with respect to the number of free parameters in each of the tasks) that is controlled by the transfer learning factors. Our analysis points to specific cases where the transfer of parameters is beneficial. Specifically, we show that transferring a specific set of parameters that generalizes well on the respective part of the source task can soften the demand on the task correlation level that is required for successful transfer learning. Moreover, we show that the usefulness of a transfer learning setting is fragile and depends on a delicate interplay among the set of transferred parameters, the relation between the tasks, and the true solution.
    Randomized Exploration for Reinforcement Learning with General Value Function Approximation. (arXiv:2106.07841v1 [cs.LG])
    (2 min) We propose a model-free reinforcement learning algorithm inspired by the popular randomized least squares value iteration (RLSVI) algorithm as well as the optimism principle. Unlike existing upper-confidence-bound (UCB) based approaches, which are often computationally intractable, our algorithm drives exploration by simply perturbing the training data with judiciously chosen i.i.d. scalar noises. To attain optimistic value function estimation without resorting to a UCB-style bonus, we introduce an optimistic reward sampling procedure. When the value functions can be represented by a function class $\mathcal{F}$, our algorithm achieves a worst-case regret bound of $\widetilde{O}(\mathrm{poly}(d_EH)\sqrt{T})$ where $T$ is the time elapsed, $H$ is the planning horizon and $d_E$ is the $\textit{eluder dimension}$ of $\mathcal{F}$. In the linear setting, our algorithm reduces to LSVI-PHE, a variant of RLSVI, that enjoys an $\widetilde{\mathcal{O}}(\sqrt{d^3H^3T})$ regret. We complement the theory with an empirical evaluation across known difficult exploration tasks.
    A Software Engineering Perspective on Engineering Machine Learning Systems: State of the Art and Challenges. (arXiv:2012.07919v3 [cs.SE] UPDATED)
    (2 min) Context: Advancements in machine learning (ML) lead to a shift from the traditional view of software development, where algorithms are hard-coded by humans, to ML systems materialized through learning from data. Therefore, we need to revisit our ways of developing software systems and consider the particularities required by these new types of systems. Objective: The purpose of this study is to systematically identify, analyze, summarize, and synthesize the current state of software engineering (SE) research for engineering ML systems. Method: I performed a systematic literature review (SLR). I systematically selected a pool of 141 studies from SE venues and then conducted a quantitative and qualitative analysis using the data extracted from these studies. Results: The non-deterministic nature of ML systems complicates all SE aspects of engineering ML systems. Despite increasing interest from 2018 onwards, the results reveal that none of the SE aspects have a mature set of tools and techniques. Testing is by far the most popular area among researchers. Even for testing ML systems, engineers have only some tool prototypes and solution proposals with weak experimental proof. Many of the challenges of ML systems engineering were identified through surveys and interviews. Researchers should conduct experiments and case studies, ideally in industrial environments, to further understand these challenges and propose solutions. Conclusion: The results may benefit (1) practitioners in foreseeing the challenges of ML systems engineering; (2) researchers and academicians in identifying potential research questions; and (3) educators in designing or updating SE courses to cover ML systems engineering.
    Reverse Engineering of Generative Models: Inferring Model Hyperparameters from Generated Images. (arXiv:2106.07873v1 [cs.CV])
    (2 min) State-of-the-art (SOTA) Generative Models (GMs) can synthesize photo-realistic images that are hard for humans to distinguish from genuine photos. We propose to perform reverse engineering of GMs to infer the model hyperparameters from the images generated by these models. We define a novel problem, "model parsing", as estimating GM network architectures and training loss functions by examining their generated images -- a task seemingly impossible for human beings. To tackle this problem, we propose a framework with two components: a Fingerprint Estimation Network (FEN), which estimates a GM fingerprint from a generated image by training with four constraints to encourage the fingerprint to have desired properties, and a Parsing Network (PN), which predicts network architecture and loss functions from the estimated fingerprints. To evaluate our approach, we collect a fake image dataset with $100$K images generated by $100$ GMs. Extensive experiments show encouraging results in parsing the hyperparameters of the unseen models. Finally, our fingerprint estimation can be leveraged for deepfake detection and image attribution, as we show by reporting SOTA results on both the recent Celeb-DF and image attribution benchmarks.
    PRANK: motion Prediction based on RANKing. (arXiv:2010.12007v2 [cs.LG] UPDATED)
    (2 min) Predicting the motion of agents such as pedestrians or human-driven vehicles is one of the most critical problems in the autonomous driving domain. The overall safety of driving and the comfort of a passenger directly depend on its successful solution. The motion prediction problem also remains one of the most challenging problems in autonomous driving engineering, mainly due to high variance of the possible agent's future behavior given a situation. The two phenomena responsible for the said variance are the multimodality caused by the uncertainty of the agent's intent (e.g., turn right or move forward) and uncertainty in the realization of a given intent (e.g., which lane to turn into). To be useful within a real-time autonomous driving pipeline, a motion prediction system must provide efficient ways to describe and quantify this uncertainty, such as computing posterior modes and their probabilities or estimating density at the point corresponding to a given trajectory. It also should not put substantial density on physically impossible trajectories, as they can confuse the system processing the predictions. In this paper, we introduce the PRANK method, which satisfies these requirements. PRANK takes rasterized bird-eye images of agent's surroundings as an input and extracts features of the scene with a convolutional neural network. It then produces the conditional distribution of agent's trajectories plausible in the given scene. The key contribution of PRANK is a way to represent that distribution using nearest-neighbor methods in latent trajectory space, which allows for efficient inference in real time. We evaluate PRANK on the in-house and Argoverse datasets, where it shows competitive results.
    KD3A: Unsupervised Multi-Source Decentralized Domain Adaptation via Knowledge Distillation. (arXiv:2011.09757v7 [cs.LG] UPDATED)
    (2 min) Conventional unsupervised multi-source domain adaptation (UMDA) methods assume all source domains can be accessed directly. This neglects the privacy-preserving policy, that is, all the data and computations must be kept decentralized. There exists three problems in this scenario: (1) Minimizing the domain distance requires the pairwise calculation of the data from source and target domains, which is not accessible. (2) The communication cost and privacy security limit the application of UMDA methods (e.g., the domain adversarial training). (3) Since users have no authority to check the data quality, the irrelevant or malicious source domains are more likely to appear, which causes negative transfer. In this study, we propose a privacy-preserving UMDA paradigm named Knowledge Distillation based Decentralized Domain Adaptation (KD3A), which performs domain adaptation through the knowledge distillation on models from different source domains. KD3A solves the above problems with three components: (1) A multi-source knowledge distillation method named Knowledge Vote to learn high-quality domain consensus knowledge. (2) A dynamic weighting strategy named Consensus Focus to identify both the malicious and irrelevant domains. (3) A decentralized optimization strategy for domain distance named BatchNorm MMD. The extensive experiments on DomainNet demonstrate that KD3A is robust to the negative transfer and brings a 100x reduction of communication cost compared with other decentralized UMDA methods. Moreover, our KD3A significantly outperforms state-of-the-art UMDA approaches.
    Approximate spectral clustering using both reference vectors and topology of the network generated by growing neural gas. (arXiv:2009.07101v2 [cs.LG] UPDATED)
    (2 min) Spectral clustering (SC) is one of the most popular clustering methods and often outperforms traditional clustering methods. SC uses the eigenvectors of a Laplacian matrix calculated from a similarity matrix of a dataset. SC has the serious drawbacks that are the significant increases in the time complexity derived from the computation of eigenvectors and the memory space complexity to store the similarity matrix. To address the issues, I develop a new approximate spectral clustering using the network generated by growing neural gas (GNG), called ASC with GNG in this study. ASC with GNG uses not only reference vectors for vector quantization but also the topology of the network for extraction of the topological relationship between data points in a dataset. ASC with GNG makes the similarity matrix from both the reference vectors and the topology of the network generated by GNG. Using the network generated from a dataset by GNG, ASC with GNG achieves to reduce the computational and space complexities and improve clustering quality. In this study, I demonstrate that ASC with GNG effectively reduces computational time. Moreover, this study shows that ASC with GNG displays equal to or better clustering performance than SC.
    Cascading Convolutional Temporal Colour Constancy. (arXiv:2106.07955v1 [cs.CV])
    (2 min) Computational Colour Constancy (CCC) consists of estimating the colour of one or more illuminants in a scene and using them to remove unwanted chromatic distortions. Much research has focused on illuminant estimation for CCC on single images, with few attempts of leveraging the temporal information intrinsic in sequences of correlated images (e.g., the frames in a video), a task known as Temporal Colour Constancy (TCC). The state-of-the-art for TCC is TCCNet, a deep-learning architecture that uses a ConvLSTM for aggregating the encodings produced by CNN submodules for each image in a sequence. We extend this architecture with different models obtained by (i) substituting the TCCNet submodules with C4, the state-of-the-art method for CCC targeting images; (ii) adding a cascading strategy to perform an iterative improvement of the estimate of the illuminant. We tested our models on the recently released TCC benchmark and achieved results that surpass the state-of-the-art. Analyzing the impact of the number of frames involved in illuminant estimation on performance, we show that it is possible to reduce inference time by training the models on few selected frames from the sequences while retaining comparable accuracy.
    Improved SVRG for quadratic functions. (arXiv:2006.01017v2 [cs.LG] UPDATED)
    (2 min) We analyse an iterative algorithm to minimize quadratic functions whose Hessian matrix $H$ is the expectation of a random symmetric $d\times d$ matrix. The algorithm is a variant of the stochastic variance reduced gradient (SVRG). In several applications, including least-squares regressions, ridge regressions, linear discriminant analysis and regularized linear discriminant analysis, the running time of each iteration is proportional to $d$. Under smoothness and convexity conditions, the algorithm has linear convergence. When applied to quadratic functions, our analysis improves the state-of-the-art performance of SVRG up to a logarithmic factor. Furthermore, for well-conditioned quadratic problems, our analysis improves the state-of-the-art running times of accelerated SVRG, and is better than the known matching lower bound, by a logarithmic factor. Our theoretical results are backed with numerical experiments.
    Sliced Iterative Normalizing Flows. (arXiv:2007.00674v3 [cs.LG] UPDATED)
    (2 min) We develop an iterative (greedy) deep learning (DL) algorithm which is able to transform an arbitrary probability distribution function (PDF) into the target PDF. The model is based on iterative Optimal Transport of a series of 1D slices, matching on each slice the marginal PDF to the target. The axes of the orthogonal slices are chosen to maximize the PDF difference using Wasserstein distance at each iteration, which enables the algorithm to scale well to high dimensions. As special cases of this algorithm, we introduce two sliced iterative Normalizing Flow (SINF) models, which map from the data to the latent space (GIS) and vice versa (SIG). We show that SIG is able to generate high quality samples of image datasets, which match the GAN benchmarks, while GIS obtains competitive results on density estimation tasks compared to the density trained NFs, and is more stable, faster, and achieves higher $p(x)$ when trained on small training sets. SINF approach deviates significantly from the current DL paradigm, as it is greedy and does not use concepts such as mini-batching, stochastic gradient descent and gradient back-propagation through deep layers.
    Field-Embedded Factorization Machines for Click-through rate prediction. (arXiv:2009.09931v2 [cs.IR] UPDATED)
    (2 min) Click-through rate (CTR) prediction models are common in many online applications such as digital advertising and recommender systems. Field-Aware Factorization Machine (FFM) and Field-weighted Factorization Machine (FwFM) are state-of-the-art among the shallow models for CTR prediction. Recently, many deep learning-based models have also been proposed. Among deeper models, DeepFM, xDeepFM, AutoInt+, and FiBiNet are state-of-the-art models. The deeper models combine a core architectural component, which learns explicit feature interactions, with a deep neural network (DNN) component. We propose a novel shallow Field-Embedded Factorization Machine (FEFM) and its deep counterpart Deep Field-Embedded Factorization Machine (DeepFEFM). FEFM learns symmetric matrix embeddings for each field pair along with the usual single vector embeddings for each feature. FEFM has significantly lower model complexity than FFM and roughly the same complexity as FwFM. FEFM also has insightful mathematical properties about important fields and field interactions. DeepFEFM combines the FEFM interaction vectors learned by the FEFM component with a DNN and is thus able to learn higher order interactions. We conducted comprehensive experiments over a wide range of hyperparameters on two large publicly available real-world datasets. When comparing test AUC and log loss, the results show that FEFM and DeepFEFM outperform the existing state-of-the-art shallow and deep models for CTR prediction tasks. We have made the code of FEFM and DeepFEFM available in the DeepCTR library (https://github.com/shenweichen/DeepCTR).
    ShadowNet: A Secure and Efficient System for On-device Model Inference. (arXiv:2011.05905v2 [cs.CR] UPDATED)
    (2 min) With the increased usage of AI accelerators on mobile and edge devices, on-device machine learning (ML) is gaining popularity. Consequently, thousands of proprietary ML models are being deployed on billions of untrusted devices. This raises serious security concerns about model privacy. However, protecting the model privacy without losing access to the AI accelerators is a challenging problem. In this paper, we present a novel on-device model inference system, ShadowNet. ShadowNet protects the model privacy with Trusted Execution Environment (TEE) while securely outsourcing the heavy linear layers of the model to the untrusted hardware accelerators. ShadowNet achieves this by transforming the weights of the linear layers before outsourcing them and restoring the results inside the TEE. The nonlinear layers are also kept secure inside the TEE. The transformation of the weights and the restoration of the results are designed in a way that can be implemented efficiently. We have built a ShadowNet prototype based on TensorFlow Lite and applied it on four popular CNNs, namely, MobileNets, ResNet-44, AlexNet and MiniVGG. Our evaluation shows that ShadowNet achieves strong security guarantees with reasonable performance, offering a practical solution for secure on-device model inference.
    Site-Agnostic 3D Dose Distribution Prediction with Deep Learning Neural Networks. (arXiv:2106.07825v1 [cs.LG])
    (2 min) Typically, the current dose prediction models are limited to small amounts of data and require re-training for a specific site, often leading to suboptimal performance. We propose a site-agnostic, 3D dose distribution prediction model using deep learning that can leverage data from any treatment site, thus increasing the total data available to train the model. Applying our proposed model to a new target treatment site requires only a brief fine-tuning of the model to the new data and involves no modifications to the model input channels or its parameters. Thus, it can be efficiently adapted to a different treatment site, even with a small training dataset.
    Variational Inference with Continuously-Indexed Normalizing Flows. (arXiv:2007.05426v2 [stat.ML] UPDATED)
    (2 min) Continuously-indexed flows (CIFs) have recently achieved improvements over baseline normalizing flows on a variety of density estimation tasks. CIFs do not possess a closed-form marginal density, and so, unlike standard flows, cannot be plugged in directly to a variational inference (VI) scheme in order to produce a more expressive family of approximate posteriors. However, we show here how CIFs can be used as part of an auxiliary VI scheme to formulate and train expressive posterior approximations in a natural way. We exploit the conditional independence structure of multi-layer CIFs to build the required auxiliary inference models, which we show empirically yield low-variance estimators of the model evidence. We then demonstrate the advantages of CIFs over baseline flows in VI problems when the posterior distribution of interest possesses a complicated topology, obtaining improved results in both the Bayesian inference and surrogate maximum likelihood settings.
    Going Beyond Classification Accuracy Metrics in Model Compression. (arXiv:2012.01604v2 [cs.CV] UPDATED)
    (2 min) With the rise in edge-computing devices, there has been an increasing demand to deploy energy and resource-efficient models. A large body of research has been devoted to developing methods that can reduce the size of the model considerably without affecting the standard metrics such as top-1 accuracy. However, these pruning approaches tend to result in a significant mismatch in other metrics such as fairness across classes and explainability. To combat such misalignment, we propose a novel multi-part loss function inspired by the knowledge-distillation literature. Through extensive experiments, we demonstrate the effectiveness of our approach across different compression algorithms, architectures, tasks as well as datasets. In particular, we obtain up to $4.1\times$ reduction in the number of prediction mismatches between the compressed and reference models, and up to $5.7\times$ in cases where the reference model makes the correct prediction; all while making no changes to the compression algorithm, and minor modifications to the loss function. Furthermore, we demonstrate how inducing simple alignment between the predictions of the models naturally improves the alignment on other metrics including fairness and attributions. Our framework can thus serve as a simple plug-and-play component for compression algorithms in the future.
    Code Integrity Attestation for PLCs using Black Box Neural Network Predictions. (arXiv:2106.07851v1 [cs.CR])
    (2 min) Cyber-physical systems (CPSs) are widespread in critical domains, and significant damage can be caused if an attacker is able to modify the code of their programmable logic controllers (PLCs). Unfortunately, traditional techniques for attesting code integrity (i.e. verifying that it has not been modified) rely on firmware access or roots-of-trust, neither of which proprietary or legacy PLCs are likely to provide. In this paper, we propose a practical code integrity checking solution based on privacy-preserving black box models that instead attest the input/output behaviour of PLC programs. Using faithful offline copies of the PLC programs, we identify their most important inputs through an information flow analysis, execute them on multiple combinations to collect data, then train neural networks able to predict PLC outputs (i.e. actuator commands) from their inputs. By exploiting the black box nature of the model, our solution maintains the privacy of the original PLC code and does not assume that attackers are unaware of its presence. The trust instead comes from the fact that it is extremely hard to attack the PLC code and neural networks at the same time and with consistent outcomes. We evaluated our approach on a modern six-stage water treatment plant testbed, finding that it could predict actuator states from PLC inputs with near-100% accuracy, and thus could detect all 120 effective code mutations that we subjected the PLCs to. Finally, we found that it is not practically possible to simultaneously modify the PLC code and apply discreet adversarial noise to our attesters in a way that leads to consistent (mis-)predictions.
    A Near-Optimal Algorithm for Stochastic Bilevel Optimization via Double-Momentum. (arXiv:2102.07367v3 [math.OC] UPDATED)
    (2 min) This paper proposes a new algorithm -- the \underline{S}ingle-timescale Do\underline{u}ble-momentum \underline{St}ochastic \underline{A}pprox\underline{i}matio\underline{n} (SUSTAIN) -- for tackling stochastic unconstrained bilevel optimization problems. We focus on bilevel problems where the lower level subproblem is strongly-convex and the upper level objective function is smooth. Unlike prior works which rely on \emph{two-timescale} or \emph{double loop} techniques, we design a stochastic momentum-assisted gradient estimator for both the upper and lower level updates. The latter allows us to control the error in the stochastic gradient updates due to inaccurate solution to both subproblems. If the upper objective function is smooth but possibly non-convex, we show that {\aname}~requires $\mathcal{O}(\epsilon^{-3/2})$ iterations (each using ${\cal O}(1)$ samples) to find an $\epsilon$-stationary solution. The $\epsilon$-stationary solution is defined as the point whose squared norm of the gradient of the outer function is less than or equal to $\epsilon$. The total number of stochastic gradient samples required for the upper and lower level objective functions matches the best-known complexity for single-level stochastic gradient algorithms. We also analyze the case when the upper level objective function is strongly-convex.
    Text Generation with Efficient (Soft) Q-Learning. (arXiv:2106.07704v1 [cs.CL])
    (2 min) Maximum likelihood estimation (MLE) is the predominant algorithm for training text generation models. This paradigm relies on direct supervision examples, which is not applicable to many applications, such as generating adversarial attacks or generating prompts to control language models. Reinforcement learning (RL) on the other hand offers a more flexible solution by allowing users to plug in arbitrary task metrics as reward. Yet previous RL algorithms for text generation, such as policy gradient (on-policy RL) and Q-learning (off-policy RL), are often notoriously inefficient or unstable to train due to the large sequence space and the sparse reward received only at the end of sequences. In this paper, we introduce a new RL formulation for text generation from the soft Q-learning perspective. It further enables us to draw from the latest RL advances, such as path consistency learning, to combine the best of on-/off-policy updates, and learn effectively from sparse reward. We apply the approach to a wide range of tasks, including learning from noisy/negative examples, adversarial attacks, and prompt generation. Experiments show our approach consistently outperforms both task-specialized algorithms and the previous RL methods. On standard supervised tasks where MLE prevails, our approach also achieves competitive performance and stability by training text generation from scratch.
    Learning Equivariant Energy Based Models with Equivariant Stein Variational Gradient Descent. (arXiv:2106.07832v1 [cs.LG])
    (2 min) We focus on the problem of efficient sampling and learning of probability densities by incorporating symmetries in probabilistic models. We first introduce Equivariant Stein Variational Gradient Descent algorithm -- an equivariant sampling method based on Stein's identity for sampling from densities with symmetries. Equivariant SVGD explicitly incorporates symmetry information in a density through equivariant kernels which makes the resultant sampler efficient both in terms of sample complexity and the quality of generated samples. Subsequently, we define equivariant energy based models to model invariant densities that are learned using contrastive divergence. By utilizing our equivariant SVGD for training equivariant EBMs, we propose new ways of improving and scaling up training of energy based models. We apply these equivariant energy models for modelling joint densities in regression and classification tasks for image datasets, many-body particle systems and molecular structure generation.
    Federated Learning for Internet of Things: A Federated Learning Framework for On-device Anomaly Data Detection. (arXiv:2106.07976v1 [cs.LG])
    (2 min) Federated learning can be a promising solution for enabling IoT cybersecurity (i.e., anomaly detection in the IoT environment) while preserving data privacy and mitigating the high communication/storage overhead (e.g., high-frequency data from time-series sensors) of centralized over-the-cloud approaches. In this paper, to further push forward this direction with a comprehensive study in both algorithm and system design, we build FedIoT platform that contains a synthesized dataset using N-BaIoT, FedDetect algorithm, and a system design for IoT devices. Furthermore, the proposed FedDetect learning framework improves the performance by utilizing an adaptive optimizer (e.g., Adam) and a cross-round learning rate scheduler. In a network of realistic IoT devices (Raspberry PI), we evaluate FedIoT platform and FedDetect algorithm in both model and system performance. Our results demonstrate the efficacy of federated learning in detecting a large range of attack types. The system efficiency analysis indicates that both end-to-end training time and memory cost are affordable and promising for resource-constrained IoT devices. The source code is publicly available.
    Unbiased Sentence Encoder For Large-Scale Multi-lingual Search Engines. (arXiv:2106.07719v1 [cs.CL])
    (2 min) In this paper, we present a multi-lingual sentence encoder that can be used in search engines as a query and document encoder. This embedding enables a semantic similarity score between queries and documents that can be an important feature in document ranking and relevancy. To train such a customized sentence encoder, it is beneficial to leverage users search data in the form of query-document clicked pairs however, we must avoid relying too much on search click data as it is biased and does not cover many unseen cases. The search data is heavily skewed towards short queries and for long queries is small and often noisy. The goal is to design a universal multi-lingual encoder that works for all cases and covers both short and long queries. We select a number of public NLI datasets in different languages and translation data and together with user search data we train a language model using a multi-task approach. A challenge is that these datasets are not homogeneous in terms of content, size and the balance ratio. While the public NLI datasets are usually two-sentence based with the same portion of positive and negative pairs, the user search data can contain multi-sentence documents and only positive pairs. We show how multi-task training enables us to leverage all these datasets and exploit knowledge sharing across these tasks.
    An Empirical Characterization of Fair Machine Learning For Clinical Risk Prediction. (arXiv:2007.10306v3 [stat.ML] UPDATED)
    (3 min) The use of machine learning to guide clinical decision making has the potential to worsen existing health disparities. Several recent works frame the problem as that of algorithmic fairness, a framework that has attracted considerable attention and criticism. However, the appropriateness of this framework is unclear due to both ethical as well as technical considerations, the latter of which include trade-offs between measures of fairness and model performance that are not well-understood for predictive models of clinical outcomes. To inform the ongoing debate, we conduct an empirical study to characterize the impact of penalizing group fairness violations on an array of measures of model performance and group fairness. We repeat the analyses across multiple observational healthcare databases, clinical outcomes, and sensitive attributes. We find that procedures that penalize differences between the distributions of predictions across groups induce nearly-universal degradation of multiple performance metrics within groups. On examining the secondary impact of these procedures, we observe heterogeneity of the effect of these procedures on measures of fairness in calibration and ranking across experimental conditions. Beyond the reported trade-offs, we emphasize that analyses of algorithmic fairness in healthcare lack the contextual grounding and causal awareness necessary to reason about the mechanisms that lead to health disparities, as well as about the potential of algorithmic fairness methods to counteract those mechanisms. In light of these limitations, we encourage researchers building predictive models for clinical use to step outside the algorithmic fairness frame and engage critically with the broader sociotechnical context surrounding the use of machine learning in healthcare.
    Planning to Fairly Allocate: Probabilistic Fairness in the Restless Bandit Setting. (arXiv:2106.07677v1 [cs.LG])
    (2 min) Restless and collapsing bandits are commonly used to model constrained resource allocation in settings featuring arms with action-dependent transition probabilities, such as allocating health interventions among patients [Whittle, 1988; Mate et al., 2020]. However, state-of-the-art Whittle-index-based approaches to this planning problem either do not consider fairness among arms, or incentivize fairness without guaranteeing it [Mate et al., 2021]. Additionally, their optimality guarantees only apply when arms are indexable and threshold-optimal. We demonstrate that the incorporation of hard fairness constraints necessitates the coupling of arms, which undermines the tractability, and by extension, indexability of the problem. We then introduce ProbFair, a probabilistically fair stationary policy that maximizes total expected reward and satisfies the budget constraint, while ensuring a strictly positive lower bound on the probability of being pulled at each timestep. We evaluate our algorithm on a real-world application, where interventions support continuous positive airway pressure (CPAP) therapy adherence among obstructive sleep apnea (OSA) patients, as well as simulations on a broader class of synthetic transition matrices.
    Learning Revenue-Maximizing Auctions With Differentiable Matching. (arXiv:2106.07877v1 [cs.GT])
    (2 min) We propose a new architecture to approximately learn incentive compatible, revenue-maximizing auctions from sampled valuations. Our architecture uses the Sinkhorn algorithm to perform a differentiable bipartite matching which allows the network to learn strategyproof revenue-maximizing mechanisms in settings not learnable by the previous RegretNet architecture. In particular, our architecture is able to learn mechanisms in settings without free disposal where each bidder must be allocated exactly some number of items. In experiments, we show our approach successfully recovers multiple known optimal mechanisms and high-revenue, low-regret mechanisms in larger settings where the optimal mechanism is unknown.
    Learning Stable Classifiers by Transferring Unstable Features. (arXiv:2106.07847v1 [cs.LG])
    (2 min) We study transfer learning in the presence of spurious correlations. We experimentally demonstrate that directly transferring the stable feature extractor learned on the source task may not eliminate these biases for the target task. However, we hypothesize that the unstable features in the source task and those in the target task are directly related. By explicitly informing the target classifier of the source task's unstable features, we can regularize the biases in the target task. Specifically, we derive a representation that encodes the unstable features by contrasting different data environments in the source task. On the target task, we cluster data from this representation, and achieve robustness by minimizing the worst-case risk across all clusters. We evaluate our method on both text and image classifications. Empirical results demonstrate that our algorithm is able to maintain robustness on the target task, outperforming the best baseline by 22.9% in absolute accuracy across 12 transfer settings. Our code is available at https://github.com/YujiaBao/Tofu.
    HUMAP: Hierarchical Uniform Manifold Approximation and Projection. (arXiv:2106.07718v1 [cs.LG])
    (2 min) Dimensionality reduction (DR) techniques help analysts to understand patterns in high-dimensional spaces. These techniques, often represented by scatter plots, are employed in diverse science domains and facilitate similarity analysis among clusters and data samples. For datasets containing many granularities or when analysis follows the information visualization mantra, hierarchical DR techniques are the most suitable approach since they present major structures beforehand and details on demand. However, current hierarchical DR techniques are not fully capable of addressing literature problems because they do not preserve the projection mental map across hierarchical levels or are not suitable for most data types. This work presents HUMAP, a novel hierarchical dimensionality reduction technique designed to be flexible on preserving local and global structures and preserve the mental map throughout hierarchical exploration. We provide empirical evidence of our technique's superiority compared with current hierarchical approaches and show two case studies to demonstrate its strengths.
    Improving the compromise between accuracy, interpretability and personalization of rule-based machine learning in medical problems. (arXiv:2106.07827v1 [cs.LG])
    (2 min) One of the key challenges when developing a predictive model is the capability to describe the domain knowledge and the cause-effect relationships in a simple way. Decision rules are a useful and important methodology in this context, justifying their application in several areas, in particular in clinical practice. Several machine-learning classifiers have exploited the advantageous properties of decision rules to build intelligent prediction models, namely decision trees and ensembles of trees (ETs). However, such methodologies usually suffer from a trade-off between interpretability and predictive performance. Some procedures consider a simplification of ETs, using heuristic approaches to select an optimal reduced set of decision rules. In this paper, we introduce a novel step to those methodologies. We create a new component to predict if a given rule will be correct or not for a particular patient, which introduces personalization into the procedure. Furthermore, the validation results using three public clinical datasets show that it also allows to increase the predictive performance of the selected set of rules, improving the mentioned trade-off.
    Phase Transitions, Distance Functions, and Implicit Neural Representations. (arXiv:2106.07689v1 [cs.LG])
    (2 min) Representing surfaces as zero level sets of neural networks recently emerged as a powerful modeling paradigm, named Implicit Neural Representations (INRs), serving numerous downstream applications in geometric deep learning and 3D vision. Training INRs previously required choosing between occupancy and distance function representation and different losses with unknown limit behavior and/or bias. In this paper we draw inspiration from the theory of phase transitions of fluids and suggest a loss for training INRs that learns a density function that converges to a proper occupancy function, while its log transform converges to a distance function. Furthermore, we analyze the limit minimizer of this loss showing it satisfies the reconstruction constraints and has minimal surface perimeter, a desirable inductive bias for surface reconstruction. Training INRs with this new loss leads to state-of-the-art reconstructions on a standard benchmark.
    Simon Says: Evaluating and Mitigating Bias in Pruned Neural Networks with Knowledge Distillation. (arXiv:2106.07849v1 [cs.LG])
    (2 min) In recent years the ubiquitous deployment of AI has posed great concerns in regards to algorithmic bias, discrimination, and fairness. Compared to traditional forms of bias or discrimination caused by humans, algorithmic bias generated by AI is more abstract and unintuitive therefore more difficult to explain and mitigate. A clear gap exists in the current literature on evaluating and mitigating bias in pruned neural networks. In this work, we strive to tackle the challenging issues of evaluating, mitigating, and explaining induced bias in pruned neural networks. Our paper makes three contributions. First, we propose two simple yet effective metrics, Combined Error Variance (CEV) and Symmetric Distance Error (SDE), to quantitatively evaluate the induced bias prevention quality of pruned models. Second, we demonstrate that knowledge distillation can mitigate induced bias in pruned neural networks, even with unbalanced datasets. Third, we reveal that model similarity has strong correlations with pruning induced bias, which provides a powerful method to explain why bias occurs in pruned neural networks. Our code is available at https://github.com/codestar12/pruning-distilation-bias
    Diverse Video Captioning Through Latent Variable Expansion. (arXiv:1910.12019v6 [cs.CV] UPDATED)
    (2 min) Automatically describing video content with text description is challenging but important task, which has been attracting a lot of attention in computer vision community. Previous works mainly strive for the accuracy of the generated sentences, while ignoring the sentences diversity, which is inconsistent with human behavior. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework. Concretely, for a given video, the intermediate latent variables of conventional encode-decode process are utilized as input to the conditional generative adversarial network (CGAN) with the purpose of generating diverse sentences. We adopt different Convolutional Neural Networks (CNNs) as our generator that produces descriptions conditioned on latent variables and discriminator that assesses the quality of generated sentences. Simultaneously, a novel DCE metric is designed to assess the diverse captions. We evaluate our method on the benchmark datasets, where it demonstrates its ability to generate diverse descriptions and achieves superior results against other state-of-the-art methods.
    Accelerating Ill-Conditioned Low-Rank Matrix Estimation via Scaled Gradient Descent. (arXiv:2005.08898v4 [cs.LG] UPDATED)
    (3 min) Low-rank matrix estimation is a canonical problem that finds numerous applications in signal processing, machine learning and imaging science. A popular approach in practice is to factorize the matrix into two compact low-rank factors, and then optimize these factors directly via simple iterative methods such as gradient descent and alternating minimization. Despite nonconvexity, recent literatures have shown that these simple heuristics in fact achieve linear convergence when initialized properly for a growing number of problems of interest. However, upon closer examination, existing approaches can still be computationally expensive especially for ill-conditioned matrices: the convergence rate of gradient descent depends linearly on the condition number of the low-rank matrix, while the per-iteration cost of alternating minimization is often prohibitive for large matrices. The goal of this paper is to set forth a competitive algorithmic approach dubbed Scaled Gradient Descent (ScaledGD) which can be viewed as pre-conditioned or diagonally-scaled gradient descent, where the pre-conditioners are adaptive and iteration-varying with a minimal computational overhead. With tailored variants for low-rank matrix sensing, robust principal component analysis and matrix completion, we theoretically show that ScaledGD achieves the best of both worlds: it converges linearly at a rate independent of the condition number of the low-rank matrix similar as alternating minimization, while maintaining the low per-iteration cost of gradient descent. Our analysis is also applicable to general loss functions that are restricted strongly convex and smooth over low-rank matrices. To the best of our knowledge, ScaledGD is the first algorithm that provably has such properties over a wide range of low-rank matrix estimation tasks.
    Quantized Adam with Error Feedback. (arXiv:2004.14180v2 [cs.LG] UPDATED)
    (2 min) In this paper, we present a distributed variant of adaptive stochastic gradient method for training deep neural networks in the parameter-server model. To reduce the communication cost among the workers and server, we incorporate two types of quantization schemes, i.e., gradient quantization and weight quantization, into the proposed distributed Adam. Besides, to reduce the bias introduced by quantization operations, we propose an error-feedback technique to compensate for the quantized gradient. Theoretically, in the stochastic nonconvex setting, we show that the distributed adaptive gradient method with gradient quantization and error-feedback converges to the first-order stationary point, and that the distributed adaptive gradient method with weight quantization and error-feedback converges to the point related to the quantized level under both the single-worker and multi-worker modes. At last, we apply the proposed distributed adaptive gradient methods to train deep neural networks. Experimental results demonstrate the efficacy of our methods.
    ST-UNet: A Spatio-Temporal U-Network for Graph-structured Time Series Modeling. (arXiv:1903.05631v2 [cs.LG] UPDATED)
    (2 min) The spatio-temporal graph learning is becoming an increasingly important object of graph study. Many application domains involve highly dynamic graphs where temporal information is crucial, e.g. traffic networks and financial transaction graphs. Despite the constant progress made on learning structured data, there is still a lack of effective means to extract dynamic complex features from spatio-temporal structures. Particularly, conventional models such as convolutional networks or recurrent neural networks are incapable of revealing the temporal patterns in short or long terms and exploring the spatial properties in local or global scope from spatio-temporal graphs simultaneously. To tackle this problem, we design a novel multi-scale architecture, Spatio-Temporal U-Net (ST-UNet), for graph-structured time series modeling. In this U-shaped network, a paired sampling operation is proposed in spacetime domain accordingly: the pooling (ST-Pool) coarsens the input graph in spatial from its deterministic partition while abstracts multi-resolution temporal dependencies through dilated recurrent skip connections; based on previous settings in the downsampling, the unpooling (ST-Unpool) restores the original structure of spatio-temporal graphs and resumes regular intervals within graph sequences. Experiments on spatio-temporal prediction tasks demonstrate that our model effectively captures comprehensive features in multiple scales and achieves substantial improvements over mainstream methods on several real-world datasets.
    Query Embedding on Hyper-relational Knowledge Graphs. (arXiv:2106.08166v1 [cs.AI])
    (2 min) Multi-hop logical reasoning is an established problem in the field of representation learning on knowledge graphs (KGs). It subsumes both one-hop link prediction as well as other more complex types of logical queries. Existing algorithms operate only on classical, triple-based graphs, whereas modern KGs often employ a hyper-relational modeling paradigm. In this paradigm, typed edges may have several key-value pairs known as qualifiers that provide fine-grained context for facts. In queries, this context modifies the meaning of relations, and usually reduces the answer set. Hyper-relational queries are often observed in real-world KG applications, and existing approaches for approximate query answering cannot make use of qualifier pairs. In this work, we bridge this gap and extend the multi-hop reasoning problem to hyper-relational KGs allowing to tackle this new type of complex queries. Building upon recent advancements in Graph Neural Networks and query embedding techniques, we study how to embed and answer hyper-relational conjunctive queries. Besides that, we propose a method to answer such queries and demonstrate in our experiments that qualifiers improve query answering on a diverse set of query patterns.
    Learning Autonomy in Management of Wireless Random Networks. (arXiv:2106.07984v1 [cs.IT])
    (2 min) This paper presents a machine learning strategy that tackles a distributed optimization task in a wireless network with an arbitrary number of randomly interconnected nodes. Individual nodes decide their optimal states with distributed coordination among other nodes through randomly varying backhaul links. This poses a technical challenge in distributed universal optimization policy robust to a random topology of the wireless network, which has not been properly addressed by conventional deep neural networks (DNNs) with rigid structural configurations. We develop a flexible DNN formalism termed distributed message-passing neural network (DMPNN) with forward and backward computations independent of the network topology. A key enabler of this approach is an iterative message-sharing strategy through arbitrarily connected backhaul links. The DMPNN provides a convergent solution for iterative coordination by learning numerous random backhaul interactions. The DMPNN is investigated for various configurations of the power control in wireless networks, and intensive numerical results prove its universality and viability over conventional optimization and DNN approaches.
    Learning to Prevent Leakage: Privacy-Preserving Inference in the Mobile Cloud. (arXiv:1912.08421v2 [cs.LG] UPDATED)
    (2 min) Powered by machine learning services in the cloud, numerous learning-driven mobile applications are gaining popularity in the market. As deep learning tasks are mostly computation-intensive, it has become a trend to process raw data on devices and send the deep neural network (DNN) features to the cloud, where the features are further processed to return final results. However, there is always unexpected leakage with the release of features, with which an adversary could infer a significant amount of information about the original data. We propose a privacy-preserving reinforcement learning framework on top of the mobile cloud infrastructure from the perspective of DNN structures. The framework aims to learn a policy to modify the base DNNs to prevent information leakage while maintaining high inference accuracy. The policy can also be readily transferred to large-size DNNs to speed up learning. Extensive evaluations on a variety of DNNs have shown that our framework can successfully find privacy-preserving DNN structures to defend different privacy attacks.
    Scaling Neural Tangent Kernels via Sketching and Random Features. (arXiv:2106.07880v1 [cs.LG])
    (2 min) The Neural Tangent Kernel (NTK) characterizes the behavior of infinitely-wide neural networks trained under least squares loss by gradient descent. Recent works also report that NTK regression can outperform finitely-wide neural networks trained on small-scale datasets. However, the computational complexity of kernel methods has limited its use in large-scale learning tasks. To accelerate learning with NTK, we design a near input-sparsity time approximation algorithm for NTK, by sketching the polynomial expansions of arc-cosine kernels: our sketch for the convolutional counterpart of NTK (CNTK) can transform any image using a linear runtime in the number of pixels. Furthermore, we prove a spectral approximation guarantee for the NTK matrix, by combining random features (based on leverage score sampling) of the arc-cosine kernels with a sketching algorithm. We benchmark our methods on various large-scale regression and classification tasks and show that a linear regressor trained on our CNTK features matches the accuracy of exact CNTK on CIFAR-10 dataset while achieving 150x speedup.
    Contextualizing Multiple Tasks via Learning to Decompose. (arXiv:2106.08112v1 [cs.LG])
    (2 min) One single instance could possess multiple portraits and reveal diverse relationships with others according to different contexts. Those ambiguities increase the difficulty of learning a generalizable model when there exists one concept or mixed concepts in a task. We propose a general approach Learning to Decompose Network (LeadNet) for both two cases, which contextualizes a model through meta-learning multiple maps for concepts discovery -- the representations of instances are decomposed and adapted conditioned on the contexts. Through taking a holistic view over multiple latent components over instances in a sampled pseudo task, LeadNet learns to automatically select the right concept via incorporating those rich semantics inside and between objects. LeadNet demonstrates its superiority in various applications, including exploring multiple views of confusing tasks, out-of-distribution recognition, and few-shot image classification.
    An Exponential Improvement on the Memorization Capacity of Deep Threshold Networks. (arXiv:2106.07724v1 [cs.LG])
    (2 min) It is well known that modern deep neural networks are powerful enough to memorize datasets even when the labels have been randomized. Recently, Vershynin (2020) settled a long standing question by Baum (1988), proving that \emph{deep threshold} networks can memorize $n$ points in $d$ dimensions using $\widetilde{\mathcal{O}}(e^{1/\delta^2}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(e^{1/\delta^2}(d+\sqrt{n})+n)$ weights, where $\delta$ is the minimum distance between the points. In this work, we improve the dependence on $\delta$ from exponential to almost linear, proving that $\widetilde{\mathcal{O}}(\frac{1}{\delta}+\sqrt{n})$ neurons and $\widetilde{\mathcal{O}}(\frac{d}{\delta}+n)$ weights are sufficient. Our construction uses Gaussian random weights only in the first layer, while all the subsequent layers use binary or integer weights. We also prove new lower bounds by connecting memorization in neural networks to the purely geometric problem of separating $n$ points on a sphere using hyperplanes.
    Learning to Compensate: A Deep Neural Network Framework for 5G Power Amplifier Compensation. (arXiv:2106.07953v1 [eess.SP])
    (2 min) Owing to the complicated characteristics of 5G communication system, designing RF components through mathematical modeling becomes a challenging obstacle. Moreover, such mathematical models need numerous manual adjustments for various specification requirements. In this paper, we present a learning-based framework to model and compensate Power Amplifiers (PAs) in 5G communication. In the proposed framework, Deep Neural Networks (DNNs) are used to learn the characteristics of the PAs, while, correspondent Digital Pre-Distortions (DPDs) are also learned to compensate for the nonlinear and memory effects of PAs. On top of the framework, we further propose two frequency domain losses to guide the learning process to better optimize the target, compared to naive time domain Mean Square Error (MSE). The proposed framework serves as a drop-in replacement for the conventional approach. The proposed approach achieves an average of 56.7% reduction of nonlinear and memory effects, which converts to an average of 16.3% improvement over a carefully-designed mathematical model, and even reaches 34% enhancement in severe distortion scenarios.
    Characterizing Structural Regularities of Labeled Data in Overparameterized Models. (arXiv:2002.03206v3 [cs.LG] UPDATED)
    (2 min) Humans are accustomed to environments that contain both regularities and exceptions. For example, at most gas stations, one pays prior to pumping, but the occasional rural station does not accept payment in advance. Likewise, deep neural networks can generalize across instances that share common patterns or structures, yet have the capacity to memorize rare or irregular forms. We analyze how individual instances are treated by a model via a consistency score. The score characterizes the expected accuracy for a held-out instance given training sets of varying size sampled from the data distribution. We obtain empirical estimates of this score for individual instances in multiple data sets, and we show that the score identifies out-of-distribution and mislabeled examples at one end of the continuum and strongly regular examples at the other end. We identify computationally inexpensive proxies to the consistency score using statistics collected during training. We show examples of potential applications to the analysis of deep-learning systems.
    On the Evaluation of Sequential Machine Learning for Network Intrusion Detection. (arXiv:2106.07961v1 [cs.CR])
    (2 min) Recent advances in deep learning renewed the research interests in machine learning for Network Intrusion Detection Systems (NIDS). Specifically, attention has been given to sequential learning models, due to their ability to extract the temporal characteristics of Network traffic Flows (NetFlows), and use them for NIDS tasks. However, the applications of these sequential models often consist of transferring and adapting methodologies directly from other fields, without an in-depth investigation on how to leverage the specific circumstances of cybersecurity scenarios; moreover, there is a lack of comprehensive studies on sequential models that rely on NetFlow data, which presents significant advantages over traditional full packet captures. We tackle this problem in this paper. We propose a detailed methodology to extract temporal sequences of NetFlows that denote patterns of malicious activities. Then, we apply this methodology to compare the efficacy of sequential learning models against traditional static learning models. In particular, we perform a fair comparison of a `sequential' Long Short-Term Memory (LSTM) against a `static' Feedforward Neural Networks (FNN) in distinct environments represented by two well-known datasets for NIDS: the CICIDS2017 and the CTU13. Our results highlight that LSTM achieves comparable performance to FNN in the CICIDS2017 with over 99.5\% F1-score; while obtaining superior performance in the CTU13, with 95.7\% F1-score against 91.5\%. This paper thus paves the way to future applications of sequential learning models for NIDS.

2021-06-15

  • cs.CL updates on arXiv.org

    InFillmore: Frame-Guided Language Generation with Bidirectional Context. (arXiv:2103.04941v2 [cs.CL] UPDATED)
    (2 min) We propose a structured extension to bidirectional-context conditional language generation, or "infilling," inspired by Frame Semantic theory (Fillmore, 1976). Guidance is provided through two approaches: (1) model fine-tuning, conditioning directly on observed symbolic frames, and (2) a novel extension to disjunctive lexically constrained decoding that leverages frame semantic lexical units. Automatic and human evaluations confirm that frame-guided generation allows for explicit manipulation of intended infill semantics, with minimal loss in distinguishability from human-generated text. Our methods flexibly apply to a variety of use scenarios, and we provide a codebase and interactive demo available from https://nlp.jhu.edu/demos/infillmore.
    Bilingual Lexicon Induction via Unsupervised Bitext Construction and Word Alignment. (arXiv:2101.00148v2 [cs.CL] UPDATED)
    (2 min) Bilingual lexicons map words in one language to their translations in another, and are typically induced by learning linear projections to align monolingual word embedding spaces. In this paper, we show it is possible to produce much higher quality lexicons with methods that combine (1) unsupervised bitext mining and (2) unsupervised word alignment. Directly applying a pipeline that uses recent algorithms for both subproblems significantly improves induced lexicon quality and further gains are possible by learning to filter the resulting lexical entries, with both unsupervised and semi-supervised schemes. Our final model outperforms the state of the art on the BUCC 2020 shared task by 14 $F_1$ points averaged over 12 language pairs, while also providing a more interpretable approach that allows for rich reasoning of word meaning in context. Further analysis of our output and the standard reference lexicons suggests they are of comparable quality, and new benchmarks may be needed to measure further progress on this task.
    Reinforcement Learning for Emotional Text-to-Speech Synthesis with Improved Emotion Discriminability. (arXiv:2104.01408v2 [cs.CL] UPDATED)
    (2 min) Emotional text-to-speech synthesis (ETTS) has seen much progress in recent years. However, the generated voice is often not perceptually identifiable by its intended emotion category. To address this problem, we propose a new interactive training paradigm for ETTS, denoted as i-ETTS, which seeks to directly improve the emotion discriminability by interacting with a speech emotion recognition (SER) model. Moreover, we formulate an iterative training strategy with reinforcement learning to ensure the quality of i-ETTS optimization. Experimental results demonstrate that the proposed i-ETTS outperforms the state-of-the-art baselines by rendering speech with more accurate emotion style. To our best knowledge, this is the first study of reinforcement learning in emotional text-to-speech synthesis.
    Certified Robustness to Text Adversarial Attacks by Randomized [MASK]. (arXiv:2105.03743v2 [cs.CL] UPDATED)
    (2 min) Recently, few certified defense methods have been developed to provably guarantee the robustness of a text classifier to adversarial synonym substitutions. However, all existing certified defense methods assume that the defenders are informed of how the adversaries generate synonyms, which is not a realistic scenario. In this paper, we propose a certifiably robust defense method by randomly masking a certain proportion of the words in an input text, in which the above unrealistic assumption is no longer necessary. The proposed method can defend against not only word substitution-based attacks, but also character-level perturbations. We can certify the classifications of over 50% texts to be robust to any perturbation of 5 words on AGNEWS, and 2 words on SST2 dataset. The experimental results show that our randomized smoothing method significantly outperforms recently proposed defense methods across multiple datasets.
    Detecting Local Insights from Global Labels: Supervised & Zero-Shot Sequence Labeling via a Convolutional Decomposition. (arXiv:1906.01154v6 [cs.CL] UPDATED)
    (3 min) We propose a new, more actionable view of neural network interpretability and data analysis by leveraging the remarkable matching effectiveness of representations derived from deep networks, guided by an approach for class-conditional feature detection. The decomposition of the filter-ngram interactions of a convolutional neural network and a linear layer over a pre-trained deep network yields a strong binary sequence labeler, with flexibility in producing predictions at -- and defining loss functions for -- varying label granularities, from the fully-supervised sequence labeling setting to the challenging zero-shot sequence labeling setting, in which we seek token-level predictions but only have document-level labels for training. From this sequence-labeling layer we derive dense representations of the input that can then be matched to instances from training, or a support set with known labels. Such introspection with inference-time decision rules provides a means, in some settings, of making local updates to the model by altering the labels or instances in the support set without re-training the full model. Finally, we construct a particular K-nearest neighbors (K-NN) model from matched exemplar representations that approximates the original model's predictions and is at least as effective a predictor with respect to the ground-truth labels. This additionally yields interpretable heuristics at the token level for determining when predictions are less likely to be reliable, and for screening input dissimilar to the support set. In effect, we show that we can transform the deep network into a simple weighting over exemplars and associated labels, yielding an introspectable -- and modestly updatable -- version of the original model.
    Don't Rule Out Monolingual Speakers: A Method For Crowdsourcing Machine Translation Data. (arXiv:2106.06875v1 [cs.CL])
    (2 min) High-performing machine translation (MT) systems can help overcome language barriers while making it possible for everyone to communicate and use language technologies in the language of their choice. However, such systems require large amounts of parallel sentences for training, and translators can be difficult to find and expensive. Here, we present a data collection strategy for MT which, in contrast, is cheap and simple, as it does not require bilingual speakers. Based on the insight that humans pay specific attention to movements, we use graphics interchange formats (GIFs) as a pivot to collect parallel sentences from monolingual annotators. We use our strategy to collect data in Hindi, Tamil and English. As a baseline, we also collect data using images as a pivot. We perform an intrinsic evaluation by manually evaluating a subset of the sentence pairs and an extrinsic evaluation by finetuning mBART on the collected data. We find that sentences collected via GIFs are indeed of higher quality.
    Memory-efficient Transformers via Top-$k$ Attention. (arXiv:2106.06899v1 [cs.CL])
    (2 min) Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient, it is not possible to directly use them with popular pre-trained language models trained using vanilla attention, without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention. We process the queries in chunks, and for each query, compute the top-$k$ scores with respect to the keys. Our approach offers several advantages: (a) its memory usage is linear in the input size, similar to linear attention variants, such as Performer and RFA (b) it is a drop-in replacement for vanilla attention that does not require any corrective pre-training, and (c) it can also lead to significant memory savings in the feed-forward layers after casting them into the familiar query-key-value framework. We evaluate the quality of top-$k$ approximation for multi-head attention layers on the Long Range Arena Benchmark, and for feed-forward layers of T5 and UnifiedQA on multiple QA datasets. We show our approach leads to accuracy that is nearly-identical to vanilla attention in multiple setups including training from scratch, fine-tuning, and zero-shot inference.
    Engineering Knowledge Graph from Patent Database. (arXiv:2106.06739v1 [cs.IR])
    (2 min) We propose a large, scalable engineering knowledge graph, comprising sets of (entity, relationship, entity) triples that are real-world engineering facts found in the patent database. We apply a set of rules based on the syntactic and lexical properties of claims in a patent document to extract facts. We aggregate these facts within each patent document and integrate the aggregated sets of facts across the patent database to obtain the engineering knowledge graph. Such a knowledge graph is expected to support inference, reasoning, and recalling in various engineering tasks. The knowledge graph has a greater size and coverage in comparison with the previously used knowledge graphs and semantic networks in the engineering literature.
    Guiding Teacher Forcing with Seer Forcing for Neural Machine Translation. (arXiv:2106.06751v1 [cs.CL])
    (2 min) Although teacher forcing has become the main training paradigm for neural machine translation, it usually makes predictions only conditioned on past information, and hence lacks global planning for the future. To address this problem, we introduce another decoder, called seer decoder, into the encoder-decoder framework during training, which involves future information in target predictions. Meanwhile, we force the conventional decoder to simulate the behaviors of the seer decoder via knowledge distillation. In this way, at test the conventional decoder can perform like the seer decoder without the attendance of it. Experiment results on the Chinese-English, English-German and English-Romanian translation tasks show our method can outperform competitive baselines significantly and achieves greater improvements on the bigger data sets. Besides, the experiments also prove knowledge distillation the best way to transfer knowledge from the seer decoder to the conventional decoder compared to adversarial learning and L2 regularization.
    Multitask Training with Text Data for End-to-End Speech Recognition. (arXiv:2010.14318v2 [cs.CL] UPDATED)
    (2 min) We propose a multitask training method for attention-based end-to-end speech recognition models. We regularize the decoder in a listen, attend, and spell model by multitask training it on both audio-text and text-only data. Trained on the 100-hour subset of LibriSpeech, the proposed method, without requiring an additional language model, leads to an 11% relative performance improvement over the baseline and approaches the performance of language model shallow fusion on the test-clean evaluation set. We observe a similar trend on the whole 960-hour LibriSpeech training set. Analyses of different types of errors and sample output sentences demonstrate that the proposed method can incorporate language level information, suggesting its effectiveness in real-world applications.
    Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation. (arXiv:2106.03153v2 [eess.AS] UPDATED)
    (2 min) With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech with only a few audio samples from the given speaker, that are also short in length. However, existing methods either require to fine-tune the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN) which aligns gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from single speech audio. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker's voice with single short-duration (1-3 sec) speech audio, significantly outperforming baselines.
    Graph Neural Networks Meet Neural-Symbolic Computing: A Survey and Perspective. (arXiv:2003.00330v7 [cs.AI] UPDATED)
    (2 min) Neural-symbolic computing has now become the subject of interest of both academic and industry research laboratories. Graph Neural Networks (GNN) have been widely used in relational and symbolic domains, with widespread application of GNNs in combinatorial optimization, constraint satisfaction, relational reasoning and other scientific domains. The need for improved explainability, interpretability and trust of AI systems in general demands principled methodologies, as suggested by neural-symbolic computing. In this paper, we review the state-of-the-art on the use of GNNs as a model of neural-symbolic computing. This includes the application of GNNs in several domains as well as its relationship to current developments in neural-symbolic computing.
    Multilingual Neural Semantic Parsing for Low-Resourced Languages. (arXiv:2106.03469v2 [cs.CL] UPDATED)
    (2 min) Multilingual semantic parsing is a cost-effective method that allows a single model to understand different languages. However, researchers face a great imbalance of availability of training data, with English being resource rich, and other languages having much less data. To tackle the data limitation problem, we propose using machine translation to bootstrap multilingual training data from the more abundant English data. To compensate for the data quality of machine translated training data, we utilize transfer learning from pretrained multilingual encoders to further improve the model. To evaluate our multilingual models on human-written sentences as opposed to machine translated ones, we introduce a new multilingual semantic parsing dataset in English, Italian and Japanese based on the Facebook Task Oriented Parsing (TOP) dataset. We show that joint multilingual training with pretrained encoders substantially outperforms our baselines on the TOP dataset and outperforms the state-of-the-art model on the public NLMaps dataset. We also establish a new baseline for zero-shot learning on the TOP dataset. We find that a semantic parser trained only on English data achieves a zero-shot performance of 44.9% exact-match accuracy on Italian sentences.
    Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor Attacks. (arXiv:2101.06969v3 [cs.CL] UPDATED)
    (2 min) Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed on the Internet and may suffer backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks. Specifically, attackers can add a simple pre-training task, which restricts the output representations of trigger instances to pre-defined vectors, namely neuron-level backdoor attack (NeuBA). If the backdoor functionality is not eliminated during fine-tuning, the triggers can make the fine-tuned model predict fixed labels by pre-defined vectors. In the experiments of both natural language processing (NLP) and computer vision (CV), we show that NeuBA absolutely controls the predictions for trigger instances without any knowledge of downstream tasks. Finally, we apply several defense methods to NeuBA and find that model pruning is a promising direction to resist NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the wide use of PTMs. Our source code and models are available at \url{https://github.com/thunlp/NeuBA}.
    Can Transformer Language Models Predict Psychometric Properties?. (arXiv:2106.06849v1 [cs.CL])
    (2 min) Transformer-based language models (LMs) continue to advance state-of-the-art performance on NLP benchmark tasks, including tasks designed to mimic human-inspired "commonsense" competencies. To better understand the degree to which LMs can be said to have certain linguistic reasoning skills, researchers are beginning to adapt the tools and concepts of the field of psychometrics. But to what extent can the benefits flow in the other direction? I.e., can LMs be of use in predicting what the psychometric properties of test items will be when those items are given to human participants? We gather responses from numerous human participants and LMs (transformer and non-transformer-based) on a broad diagnostic test of linguistic competencies. We then use the responses to calculate standard psychometric properties of the items in the diagnostic test, using the human responses and the LM responses separately. We then determine how well these two sets of predictions match. We find cases in which transformer-based LMs predict psychometric properties consistently well in certain categories but consistently poorly in others, thus providing new insights into fundamental similarities and differences between human and LM reasoning.
    A Controllable Model of Grounded Response Generation. (arXiv:2005.00613v2 [cs.CL] UPDATED)
    (2 min) Current end-to-end neural conversation models inherently lack the flexibility to impose semantic control in the response generation process, often resulting in uninteresting responses. Attempts to boost informativeness alone come at the expense of factual accuracy, as attested by pretrained language models' propensity to "hallucinate" facts. While this may be mitigated by access to background knowledge, there is scant guarantee of relevance and informativeness in generated responses. We propose a framework that we call controllable grounded response generation (CGRG), in which lexical control phrases are either provided by a user or automatically extracted by a control phrase predictor from dialogue context and grounding knowledge. Quantitative and qualitative results show that, using this framework, a transformer based model with a novel inductive attention mechanism, trained on a conversation-like Reddit dataset, outperforms strong generation baselines.
    Cross-sentence Neural Language Models for Conversational Speech Recognition. (arXiv:2106.06922v1 [cs.CL])
    (2 min) An important research direction in automatic speech recognition (ASR) has centered around the development of effective methods to rerank the output hypotheses of an ASR system with more sophisticated language models (LMs) for further gains. A current mainstream school of thoughts for ASR N-best hypothesis reranking is to employ a recurrent neural network (RNN)-based LM or its variants, with performance superiority over the conventional n-gram LMs across a range of ASR tasks. In real scenarios such as a long conversation, a sequence of consecutive sentences may jointly contain ample cues of conversation-level information such as topical coherence, lexical entrainment and adjacency pairs, which however remains to be underexplored. In view of this, we first formulate ASR N-best reranking as a prediction problem, putting forward an effective cross-sentence neural LM approach that reranks the ASR N-best hypotheses of an upcoming sentence by taking into consideration the word usage in its precedent sentences. Furthermore, we also explore to extract task-specific global topical information of the cross-sentence history in an unsupervised manner for better ASR performance. Extensive experiments conducted on the AMI conversational benchmark corpus indicate the effectiveness and feasibility of our methods in comparison to several state-of-the-art reranking methods.
    Common Sense Beyond English: Evaluating and Improving Multilingual Language Models for Commonsense Reasoning. (arXiv:2106.06937v1 [cs.CL])
    (2 min) Commonsense reasoning research has so far been limited to English. We aim to evaluate and improve popular multilingual language models (ML-LMs) to help advance commonsense reasoning (CSR) beyond English. We collect the Mickey Corpus, consisting of 561k sentences in 11 different languages, which can be used for analyzing and improving ML-LMs. We propose Mickey Probe, a language-agnostic probing task for fairly evaluating the common sense of popular ML-LMs across different languages. In addition, we also create two new datasets, X-CSQA and X-CODAH, by translating their English versions to 15 other languages, so that we can evaluate popular ML-LMs for cross-lingual commonsense reasoning. To improve the performance beyond English, we propose a simple yet effective method -- multilingual contrastive pre-training (MCP). It significantly enhances sentence representations, yielding a large performance gain on both benchmarks.
    Case Study on Detecting COVID-19 Health-Related Misinformation in Social Media. (arXiv:2106.06811v1 [cs.SI])
    (2 min) COVID-19 pandemic has generated what public health officials called an infodemic of misinformation. As social distancing and stay-at-home orders came into effect, many turned to social media for socializing. This increase in social media usage has made it a prime vehicle for the spreading of misinformation. This paper presents a mechanism to detect COVID-19 health-related misinformation in social media following an interdisciplinary approach. Leveraging social psychology as a foundation and existing misinformation frameworks, we defined misinformation themes and associated keywords incorporated into the misinformation detection mechanism using applied machine learning techniques. Next, using the Twitter dataset, we explored the performance of the proposed methodology using multiple state-of-the-art machine learning classifiers. Our method shows promising results with at most 78% accuracy in classifying health-related misinformation versus true information using uni-gram-based NLP feature generations from tweets and the Decision Tree classifier. We also provide suggestions on alternatives for countering misinformation and ethical consideration for the study.
    Latent-Optimized Adversarial Neural Transfer for Sarcasm Detection. (arXiv:2104.09261v2 [cs.LG] UPDATED)
    (2 min) The existence of multiple datasets for sarcasm detection prompts us to apply transfer learning to exploit their commonality. The adversarial neural transfer (ANT) framework utilizes multiple loss terms that encourage the source-domain and the target-domain feature distributions to be similar while optimizing for domain-specific performance. However, these objectives may be in conflict, which can lead to optimization difficulties and sometimes diminished transfer. We propose a generalized latent optimization strategy that allows different losses to accommodate each other and improves training dynamics. The proposed method outperforms transfer learning and meta-learning baselines. In particular, we achieve 10.02% absolute performance gain over the previous state of the art on the iSarcasm dataset.
    Librispeech Transducer Model with Internal Language Model Prior Correction. (arXiv:2104.03006v2 [cs.CL] UPDATED)
    (2 min) We present our transducer model on Librispeech. We study variants to include an external language model (LM) with shallow fusion and subtract an estimated internal LM. This is justified by a Bayesian interpretation where the transducer model prior is given by the estimated internal LM. The subtraction of the internal LM gives us over 14% relative improvement over normal shallow fusion. Our transducer has a separate probability distribution for the non-blank labels which allows for easier combination with the external LM, and easier estimation of the internal LM. We additionally take care of including the end-of-sentence (EOS) probability of the external LM in the last blank probability which further improves the performance. All our code and setups are published.
    A Sentence-level Hierarchical BERT Model for Document Classification with Limited Labelled Data. (arXiv:2106.06738v1 [cs.CL])
    (2 min) Training deep learning models with limited labelled data is an attractive scenario for many NLP tasks, including document classification. While with the recent emergence of BERT, deep learning language models can achieve reasonably good performance in document classification with few labelled instances, there is a lack of evidence in the utility of applying BERT-like models on long document classification. This work introduces a long-text-specific model -- the Hierarchical BERT Model (HBM) -- that learns sentence-level features of the text and works well in scenarios with limited labelled data. Various evaluation experiments have demonstrated that HBM can achieve higher performance in document classification than the previous state-of-the-art methods with only 50 to 200 labelled instances, especially when documents are long. Also, as an extra benefit of HBM, the salient sentences identified by learned HBM are useful as explanations for labelling documents based on a user study.
    Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion. (arXiv:2104.02194v2 [cs.CL] UPDATED)
    (2 min) How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area. Previous solutions to this problem were either designed for specialized use cases that did not generalize well to open-domain scenarios, did not scale to large biasing lists, or underperformed on rare long-tail words. We address these limitations by proposing a novel solution that combines shallow fusion, trie-based deep biasing, and neural network language model contextualization. These techniques result in significant 19.5% relative Word Error Rate improvement over existing contextual biasing approaches and 5.4%-9.3% improvement compared to a strong hybrid baseline on both open-domain and constrained contextualization tasks, where the targets consist of mostly rare long-tail words. Our final system remains lightweight and modular, allowing for quick modification without model re-training.
    GENIE: A Leaderboard for Human-in-the-Loop Evaluation of Text Generation. (arXiv:2101.06561v2 [cs.CL] UPDATED)
    (2 min) Leaderboards have eased model development for many NLP datasets by standardizing their evaluation and delegating it to an independent external repository. Their adoption, however, is so far limited to tasks that can be reliably evaluated in an automatic manner. This work introduces GENIE, an extensible human evaluation leaderboard, which brings the ease of leaderboards to text generation tasks. GENIE automatically posts leaderboard submissions to crowdsourcing platforms asking human annotators to evaluate them on various axes (e.g., correctness, conciseness, fluency) and compares their answers to various automatic metrics. We introduce several datasets in English to GENIE, representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. We provide formal granular evaluation metrics and identify areas for future research. We make GENIE publicly available and hope that it will spur progress in language generation models as well as their automatic and manual evaluation.
    Extracting Summary Knowledge Graphs from Long Documents. (arXiv:2009.09162v2 [cs.CL] UPDATED)
    (2 min) Knowledge graphs capture entities and relations from long documents and can facilitate reasoning in many downstream applications. Extracting compact knowledge graphs containing only salient entities and relations is important but challenging for understanding and summarizing long documents. We introduce a new text-to-graph task of predicting summarized knowledge graphs from long documents. We develop a dataset of 200k document/graph pairs using automatic and human annotations. We also develop strong baselines for this task based on graph learning and text summarization, and provide quantitative and qualitative studies of their effect.
    Russian News Clustering and Headline Selection Shared Task. (arXiv:2105.00981v3 [cs.CL] UPDATED)
    (2 min) This paper presents the results of the Russian News Clustering and Headline Selection shared task. As a part of it, we propose the tasks of Russian news event detection, headline selection, and headline generation. These tasks are accompanied by datasets and baselines. The presented datasets for event detection and headline selection are the first public Russian datasets for their tasks. The headline generation dataset is based on clustering and provides multiple reference headlines for every cluster, unlike the previous datasets. Finally, the approaches proposed by the shared task participants are reported and analyzed.
    SASICM A Multi-Task Benchmark For Subtext Recognition. (arXiv:2106.06944v1 [cs.CL])
    (2 min) Subtext is a kind of deep semantics which can be acquired after one or more rounds of expression transformation. As a popular way of expressing one's intentions, it is well worth studying. In this paper, we try to make computers understand whether there is a subtext by means of machine learning. We build a Chinese dataset whose source data comes from the popular social media (e.g. Weibo, Netease Music, Zhihu, and Bilibili). In addition, we also build a baseline model called SASICM to deal with subtext recognition. The F1 score of SASICMg, whose pretrained model is GloVe, is as high as 64.37%, which is 3.97% higher than that of BERT based model, 12.7% higher than that of traditional methods on average, including support vector machine, logistic regression classifier, maximum entropy classifier, naive bayes classifier and decision tree and 2.39% higher than that of the state-of-the-art, including MARIN and BTM. The F1 score of SASICMBERT, whose pretrained model is BERT, is 65.12%, which is 0.75% higher than that of SASICMg. The accuracy rates of SASICMg and SASICMBERT are 71.16% and 70.76%, respectively, which can compete with those of other methods which are mentioned before.
    Robust Optimization for Multilingual Translation with Imbalanced Data. (arXiv:2104.07639v3 [cs.CL] UPDATED)
    (2 min) Multilingual models are parameter-efficient with the prospect improving low-resource languages by leveraging crosslingual transfer. Despite recent advance in massive multilingual translation with ever-growing model and data, how to effectively train multilingual models has not been well understood. In this paper, we show that a common situation in multilingual training, data imbalance among languages, poses optimization tension between high resource and low resource languages where the found multilingual solution is often sub-optimal for low resources. We show that common training method which upsamples low resources can not robustly optimize population loss with risks of either underfitting high resource languages or overfitting low resource ones. Drawing on recent findings on the geometry of loss landscape and its effect on generalization, we propose a principled optimization algorithm, Curvature Aware Task Scaling (CATS), which adaptively rescales gradients from different tasks with a meta objective of guiding multilingual training to low-curvature neighborhoods with uniformly low loss for all languages. We ran experiments on common benchmarks (TED, WMT and OPUS-100) with varying degrees of data imbalance. CATS effectively improved multilingual optimization and as a result demonstrated consistent gains on low resources ( to BLEU) without hurting high resources. In addition, CATS is robust to overparameterization and large batch size training, making it a promising training method for massive multilingual models that truly improve low resource languages.
    Decrypting Cryptic Crosswords: Semantically Complex Wordplay Puzzles as a Target for NLP. (arXiv:2104.08620v2 [cs.CL] UPDATED)
    (2 min) Cryptic crosswords, the dominant English-language crossword variety in the United Kingdom, can be solved by expert humans using flexible, creative intelligence and knowledge of language. Cryptic clues read like fluent natural language, but they are adversarially composed of two parts: a definition and a wordplay cipher requiring sub-word or character-level manipulations. As such, they are a promising target for evaluating and advancing NLP systems that seek to process language in more creative, human-like ways. We present a dataset of cryptic crossword clues from a major newspaper that can be used as a benchmark and train a sequence-to-sequence model to solve them. We also develop related benchmarks that can guide development of approaches to this challenging task. We show that performance can be substantially improved using a novel curriculum learning approach in which the model is pre-trained on related tasks involving, e.g, unscrambling words, before it is trained to solve cryptics. However, even this curricular approach does not generalize to novel clue types in the way that humans can, and so cryptic crosswords remain a challenge for NLP systems and a potential source of future innovation.
    Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs. (arXiv:2104.05752v2 [cs.CL] UPDATED)
    (2 min) A major focus of recent research in spoken language understanding (SLU) has been on the end-to-end approach where a single model can predict intents directly from speech inputs without intermediate transcripts. However, this approach presents some challenges. First, since speech can be considered as personally identifiable information, in some cases only automatic speech recognition (ASR) transcripts are accessible. Second, intent-labeled speech data is scarce. To address the first challenge, we propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both. We demonstrate strong performance for either modality separately, and when both speech and ASR transcripts are available, through system combination, we achieve better results than using a single input modality. To address the second challenge, we leverage a semantically robust pre-trained BERT model and adopt a cross-modal system that co-trains text embeddings and acoustic embeddings in a shared latent space. We further enhance this system by utilizing an acoustic module pre-trained on LibriSpeech and domain-adapting the text module on our target datasets. Our experiments show significant advantages for these pre-training and fine-tuning strategies, resulting in a system that achieves competitive intent-classification performance on Snips SLU and Fluent Speech Commands datasets.
    Impact of Encoding and Segmentation Strategies on End-to-End Simultaneous Speech Translation. (arXiv:2104.14470v2 [cs.CL] UPDATED)
    (2 min) Boosted by the simultaneous translation shared task at IWSLT 2020, promising end-to-end online speech translation approaches were recently proposed. They consist in incrementally encoding a speech input (in a source language) and decoding the corresponding text (in a target language) with the best possible trade-off between latency and translation quality. This paper investigates two key aspects of end-to-end simultaneous speech translation: (a) how to encode efficiently the continuous speech flow, and (b) how to segment the speech flow in order to alternate optimally between reading (R: encoding input) and writing (W: decoding output) operations. We extend our previously proposed end-to-end online decoding strategy and show that while replacing BLSTM by ULSTM encoding degrades performance in offline mode, it actually improves both efficiency and performance in online mode. We also measure the impact of different methods to segment the speech signal (using fixed interval boundaries, oracle word boundaries or randomly set boundaries) and show that our best end-to-end online decoding strategy is surprisingly the one that alternates R/W operations on fixed size blocks on our English-German speech translation setup.
    GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio. (arXiv:2106.06909v1 [cs.SD])
    (2 min) This paper introduces GigaSpeech, an evolving, multi-domain English speech recognition corpus with 10,000 hours of high quality labeled audio suitable for supervised training, and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. Around 40,000 hours of transcribed audio is first collected from audiobooks, podcasts and YouTube, covering both read and spontaneous speaking styles, and a variety of topics, such as arts, science, sports, etc. A new forced alignment and segmentation pipeline is proposed to create sentence segments suitable for speech recognition training, and to filter out segments with low-quality transcription. For system training, GigaSpeech provides five subsets of different sizes, 10h, 250h, 1000h, 2500h, and 10000h. For our 10,000-hour XL training subset, we cap the word error rate at 4% during the filtering/validation stage, and for all our other smaller training subsets, we cap it at 0%. The DEV and TEST evaluation sets, on the other hand, are re-processed by professional human transcribers to ensure high transcription quality. Baseline systems are provided for popular speech recognition toolkits, namely Athena, ESPnet, Kaldi and Pika.
    EL-Attention: Memory Efficient Lossless Attention for Generation. (arXiv:2105.04779v2 [cs.CL] UPDATED)
    (2 min) Transformer model with multi-head attention requires caching intermediate results for efficient inference in generation tasks. However, cache brings new memory-related costs and prevents leveraging larger batch size for faster speed. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids heavy operations for building multi-head keys and values, cache for them is not needed. EL-attention constructs an ensemble of attention results by expanding query while keeping key and value shared. It produces the same result as multi-head attention with less GPU memory and faster inference speed. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.
    Controllable Generation from Pre-trained Language Models via Inverse Prompting. (arXiv:2103.10685v2 [cs.CL] UPDATED)
    (2 min) Large-scale pre-trained language models have demonstrated strong capabilities of generating realistic text. However, it remains challenging to control the generation results. Previous approaches such as prompting are far from sufficient, which limits the usage of language models. To tackle this challenge, we propose an innovative method, inverse prompting, to better control text generation. The core idea of inverse prompting is to use generated text to inversely predict the prompt during beam search, which enhances the relevance between the prompt and the generated text and provides better controllability. Empirically, we pre-train a large-scale Chinese language model to perform a systematic study using human evaluation on the tasks of open-domain poem generation and open-domain long-form question answering. Our results show that our proposed method substantially outperforms the baselines and that our generation quality is close to human performance on some of the tasks. Narrators can try our poem generation demo at https://pretrain.aminer.cn/apps/poetry.html, while our QA demo can be found at https://pretrain.aminer.cn/app/qa. For researchers, the code is provided in https://github.com/THUDM/InversePrompting.
    Kwame: A Bilingual AI Teaching Assistant for Online SuaCode Courses. (arXiv:2010.11387v2 [cs.CL] UPDATED)
    (2 min) Introductory hands-on courses such as our smartphone-based coding course, SuaCode require a lot of support for students to accomplish learning goals. Online environments make it even more difficult to get assistance especially more recently because of COVID-19. Given the multilingual context of SuaCode students - learners across 42 African countries that are mostly Anglophone or Francophone - in this work, we developed a bilingual Artificial Intelligence (AI) Teaching Assistant (TA) - Kwame - that provides answers to students' coding questions from SuaCode courses in English and French. Kwame is a Sentence-BERT (SBERT)-based question-answering (QA) system that we trained and evaluated offline using question-answer pairs created from the course's quizzes, lesson notes and students' questions in past cohorts. Kwame finds the paragraph most semantically similar to the question via cosine similarity. We compared the system with TF-IDF and Universal Sentence Encoder. Our results showed that fine-tuning on the course data and returning the top 3 and 5 answers improved the accuracy results. Kwame will make it easy for students to get quick and accurate answers to questions in SuaCode courses.
    Thinking Like Transformers. (arXiv:2106.06981v1 [cs.LG])
    (2 min) What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel. In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language. We map the basic components of a transformer-encoder -- attention and feed-forward computation -- into simple primitives, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP). We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer, and how a Transformer can be trained to mimic a RASP solution. In particular, we provide RASP programs for histograms, sorting, and Dyck-languages. We further use our model to relate their difficulty in terms of the number of required layers and attention heads: analyzing a RASP program implies a maximum number of heads and layers necessary to encode a task in a transformer. Finally, we see how insights gained from our abstraction might be used to explain phenomena seen in recent works.
    Span-based Semantic Parsing for Compositional Generalization. (arXiv:2009.06040v2 [cs.CL] UPDATED)
    (2 min) Despite the success of sequence-to-sequence (seq2seq) models in semantic parsing, recent work has shown that they fail in compositional generalization, i.e., the ability to generalize to new structures built of components observed during training. In this work, we posit that a span-based parser should lead to better compositional generalization. we propose SpanBasedSP, a parser that predicts a span tree over an input utterance, explicitly encoding how partial programs compose over spans in the input. SpanBasedSP extends Pasupat et al. (2019) to be comparable to seq2seq models by (i) training from programs, without access to gold trees, treating trees as latent variables, (ii) parsing a class of non-projective trees through an extension to standard CKY. On GeoQuery, SCAN and CLOSURE datasets, SpanBasedSP performs similarly to strong seq2seq baselines on random splits, but dramatically improves performance compared to baselines on splits that require compositional generalization: from $61.0 \rightarrow 88.9$ average accuracy.
    Grounding Language to Entities and Dynamics for Generalization in Reinforcement Learning. (arXiv:2101.07393v2 [cs.CL] UPDATED)
    (2 min) We investigate the use of natural language to drive the generalization of control policies and introduce the new multi-task environment Messenger with free-form text manuals describing the environment dynamics. Unlike previous work, Messenger does not assume prior knowledge connecting text and state observations $-$ the control policy must simultaneously ground the game manual to entity symbols and dynamics in the environment. We develop a new model, EMMA (Entity Mapper with Multi-modal Attention) which uses an entity-conditioned attention module that allows for selective focus over relevant descriptions in the manual for each entity in the environment. EMMA is end-to-end differentiable and learns a latent grounding of entities and dynamics from text to observations using only environment rewards. EMMA achieves successful zero-shot generalization to unseen games with new dynamics, obtaining a 40% higher win rate compared to multiple baselines. However, win rate on the hardest stage of Messenger remains low (10%), demonstrating the need for additional work in this direction.
    DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue. (arXiv:2101.00151v2 [cs.AI] UPDATED)
    (2 min) A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem, involving various reasoning types on both visual and language inputs. Existing benchmarks do not have enough annotations to thoroughly analyze dialogue systems and understand their capabilities and limitations in isolation. These benchmarks are also not explicitly designed to minimise biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present DVD, a Diagnostic Dataset for Video-grounded Dialogues. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video. Dialogues are synthesized over multiple question turns, each of which is injected with a set of cross-turn semantic relationships. We use DVD to analyze existing approaches, providing interesting insights into their abilities and limitations. In total, DVD is built from $11k$ CATER synthetic videos and contains $10$ instances of $10$-round dialogues for each video, resulting in more than $100k$ dialogues and $1M$ question-answer pairs. Our code and dataset are publicly available at https://github.com/facebookresearch/DVDialogues.
    Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search. (arXiv:2010.07003v2 [cs.CL] UPDATED)
    (2 min) Despite transformers' impressive accuracy, their computational cost is often prohibitive to use with limited computational resources. Most previous approaches to improve inference efficiency require a separate model for each possible computational budget. In this paper, we extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training. We train a transformer with LengthDrop, a structural variant of dropout, which stochastically determines a sequence length at each layer. We then conduct a multi-objective evolutionary search to find a length configuration that maximizes the accuracy and minimizes the efficiency metric under any given computational budget. Additionally, we significantly extend the applicability of PoWER-BERT beyond sequence-level classification into token-level classification with Drop-and-Restore process that drops word-vectors temporarily in intermediate layers and restores at the last layer if necessary. We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups, including span-based question answering and text classification. Code is available at https://github.com/clovaai/length-adaptive-transformer.
    Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation. (arXiv:2106.06963v1 [cs.CV])
    (2 min) Automatically generating radiology reports can improve current clinical practice in diagnostic radiology. On one hand, it can relieve radiologists from the heavy burden of report writing; On the other hand, it can remind radiologists of abnormalities and avoid the misdiagnosis and missed diagnosis. Yet, this task remains a challenging job for data-driven neural networks, due to the serious visual and textual data biases. To this end, we propose a Posterior-and-Prior Knowledge Exploring-and-Distilling approach (PPKED) to imitate the working patterns of radiologists, who will first examine the abnormal regions and assign the disease topic tags to the abnormal regions, and then rely on the years of prior medical knowledge and prior working experience accumulations to write reports. Thus, the PPKED includes three modules: Posterior Knowledge Explorer (PoKE), Prior Knowledge Explorer (PrKE) and Multi-domain Knowledge Distiller (MKD). In detail, PoKE explores the posterior knowledge, which provides explicit abnormal visual regions to alleviate visual data bias; PrKE explores the prior knowledge from the prior medical knowledge graph (medical knowledge) and prior radiology reports (working experience) to alleviate textual data bias. The explored knowledge is distilled by the MKD to generate the final reports. Evaluated on MIMIC-CXR and IU-Xray datasets, our method is able to outperform previous state-of-the-art models on these two datasets.
    Contrastive Attention for Automatic Chest X-ray Report Generation. (arXiv:2106.06965v1 [cs.CV])
    (2 min) Recently, chest X-ray report generation, which aims to automatically generate descriptions of given chest X-ray images, has received growing research interests. The key challenge of chest X-ray report generation is to accurately capture and describe the abnormal regions. In most cases, the normal regions dominate the entire chest X-ray image, and the corresponding descriptions of these normal regions dominate the final report. Due to such data bias, learning-based models may fail to attend to abnormal regions. In this work, to effectively capture and describe abnormal regions, we propose the Contrastive Attention (CA) model. Instead of solely focusing on the current input image, the CA model compares the current input image with normal images to distill the contrastive information. The acquired contrastive information can better represent the visual features of abnormal regions. According to the experiments on the public IU-X-ray and MIMIC-CXR datasets, incorporating our CA into several existing models can boost their performance across most metrics. In addition, according to the analysis, the CA model can help existing models better attend to the abnormal regions and provide more accurate descriptions which are crucial for an interpretable diagnosis. Specifically, we achieve the state-of-the-art results on the two public datasets.
    Machine Translation into Low-resource Language Varieties. (arXiv:2106.06797v1 [cs.CL])
    (2 min) State-of-the-art machine translation (MT) systems are typically trained to generate the "standard" target language; however, many languages have multiple varieties (regional varieties, dialects, sociolects, non-native varieties) that are different from the standard language. Such varieties are often low-resource, and hence do not benefit from contemporary NLP solutions, MT included. We propose a general framework to rapidly adapt MT systems to generate language varieties that are close to, but different from, the standard target language, using no parallel (source--variety) data. This also includes adaptation of MT systems to low-resource typologically-related target languages. We experiment with adapting an English--Russian MT system to generate Ukrainian and Belarusian, an English--Norwegian Bokm{\aa}l system to generate Nynorsk, and an English--Arabic system to generate four Arabic dialects, obtaining significant improvements over competitive baselines.
    InfoBehavior: Self-supervised Representation Learning for Ultra-long Behavior Sequence via Hierarchical Grouping. (arXiv:2106.06905v1 [cs.CL])
    (2 min) E-commerce companies have to face abnormal sellers who sell potentially-risky products. Typically, the risk can be identified by jointly considering product content (e.g., title and image) and seller behavior. This work focuses on behavior feature extraction as behavior sequences can provide valuable clues for the risk discovery by reflecting the sellers' operation habits. Traditional feature extraction techniques heavily depend on domain experts and adapt poorly to new tasks. In this paper, we propose a self-supervised method InfoBehavior to automatically extract meaningful representations from ultra-long raw behavior sequences instead of the costly feature selection procedure. InfoBehavior utilizes Bidirectional Transformer as feature encoder due to its excellent capability in modeling long-term dependency. However, it is intractable for commodity GPUs because the time and memory required by Transformer grow quadratically with the increase of sequence length. Thus, we propose a hierarchical grouping strategy to aggregate ultra-long raw behavior sequences to length-processable high-level embedding sequences. Moreover, we introduce two types of pretext tasks. Sequence-related pretext task defines a contrastive-based training objective to correctly select the masked-out coarse-grained/fine-grained behavior sequences against other "distractor" behavior sequences; Domain-related pretext task designs a classification training objective to correctly predict the domain-specific statistical results of anomalous behavior. We show that behavior representations from the pre-trained InfoBehavior can be directly used or integrated with features from other side information to support a wide range of downstream tasks. Experimental results demonstrate that InfoBehavior significantly improves the performance of Product Risk Management and Intellectual Property Protection.
    MultiWOZ 2.3: A multi-domain task-oriented dialogue dataset enhanced with annotation corrections and co-reference annotation. (arXiv:2010.05594v3 [cs.CL] UPDATED)
    (2 min) Task-oriented dialogue systems have made unprecedented progress with multiple state-of-the-art (SOTA) models underpinned by a number of publicly available MultiWOZ datasets. Dialogue state annotations are error-prone, leading to sub-optimal performance. Various efforts have been put in rectifying the annotation errors presented in the original MultiWOZ dataset. In this paper, we introduce MultiWOZ 2.3, in which we differentiate incorrect annotations in dialogue acts from dialogue states, identifying a lack of co-reference when publishing the updated dataset. To ensure consistency between dialogue acts and dialogue states, we implement co-reference features and unify annotations of dialogue acts and dialogue states. We update the state of the art performance of natural language understanding and dialogue state tracking on MultiWOZ 2.3, where the results show significant improvements than on previous versions of MultiWOZ datasets (2.0-2.2).
    Compression of Deep Learning Models for Text: A Survey. (arXiv:2008.05221v4 [cs.CL] UPDATED)
    (2 min) In recent years, the fields of natural language processing (NLP) and information retrieval (IR) have made tremendous progress thanksto deep learning models like Recurrent Neural Networks (RNNs), Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTMs)networks, and Transformer [120] based models like Bidirectional Encoder Representations from Transformers (BERT) [24], GenerativePre-training Transformer (GPT-2) [94], Multi-task Deep Neural Network (MT-DNN) [73], Extra-Long Network (XLNet) [134], Text-to-text transfer transformer (T5) [95], T-NLG [98] and GShard [63]. But these models are humongous in size. On the other hand,real world applications demand small model size, low response times and low computational power wattage. In this survey, wediscuss six different types of methods (Pruning, Quantization, Knowledge Distillation, Parameter Sharing, Tensor Decomposition, andSub-quadratic Transformer based methods) for compression of such models to enable their deployment in real industry NLP projects.Given the critical need of building applications with efficient and small models, and the large amount of recently published work inthis area, we believe that this survey organizes the plethora of work done by the 'deep learning for NLP' community in the past fewyears and presents it as a coherent story.
    Sentiment Analysis of Covid-19 Tweets using Evolutionary Classification-Based LSTM Model. (arXiv:2106.06910v1 [cs.CL])
    (2 min) As the Covid-19 outbreaks rapidly all over the world day by day and also affects the lives of million, a number of countries declared complete lock-down to check its intensity. During this lockdown period, social media plat-forms have played an important role to spread information about this pandemic across the world, as people used to express their feelings through the social networks. Considering this catastrophic situation, we developed an experimental approach to analyze the reactions of people on Twitter taking into ac-count the popular words either directly or indirectly based on this pandemic. This paper represents the sentiment analysis on collected large number of tweets on Coronavirus or Covid-19. At first, we analyze the trend of public sentiment on the topics related to Covid-19 epidemic using an evolutionary classification followed by the n-gram analysis. Then we calculated the sentiment ratings on collected tweet based on their class. Finally, we trained the long-short term network using two types of rated tweets to predict sentiment on Covid-19 data and obtained an overall accuracy of 84.46%.
    Shape of Elephant: Study of Macro Properties of Word Embeddings Spaces. (arXiv:2106.06964v1 [cs.CL])
    (2 min) Pre-trained word representations became a key component in many NLP tasks. However, the global geometry of the word embeddings remains poorly understood. In this paper, we demonstrate that a typical word embeddings cloud is shaped as a high-dimensional simplex with interpretable vertices and propose a simple yet effective method for enumeration of these vertices. We show that the proposed method can detect and describe vertices of the simplex for GloVe and fasttext spaces.
    Exploiting Parallel Corpora to Improve Multilingual Embedding based Document and Sentence Alignment. (arXiv:2106.06766v1 [cs.CL])
    (2 min) Multilingual sentence representations pose a great advantage for low-resource languages that do not have enough data to build monolingual models on their own. These multilingual sentence representations have been separately exploited by few research for document and sentence alignment. However, most of the low-resource languages are under-represented in these pre-trained models. Thus, in the context of low-resource languages, these models have to be fine-tuned for the task at hand, using additional data sources. This paper presents a weighting mechanism that makes use of available small-scale parallel corpora to improve the performance of multilingual sentence representations on document and sentence alignment. Experiments are conducted with respect to two low-resource languages, Sinhala and Tamil. Results on a newly created dataset of Sinhala-English, Tamil-English, and Sinhala-Tamil show that this new weighting mechanism significantly improves both document and sentence alignment. This dataset, as well as the source-code, is publicly released.
    Prompting Contrastive Explanations for Commonsense Reasoning Tasks. (arXiv:2106.06823v1 [cs.CL])
    (2 min) Many commonsense reasoning NLP tasks involve choosing between one or more possible answers to a question or prompt based on knowledge that is often implicit. Large pretrained language models (PLMs) can achieve near-human performance on such tasks, while providing little human-interpretable evidence of the underlying reasoning they use. In this work, we show how to use these same models to generate such evidence: inspired by the contrastive nature of human explanations, we use PLMs to complete explanation prompts which contrast alternatives according to the key attribute(s) required to justify the correct answer (for example, peanuts are usually salty while raisins are sweet). Conditioning model decisions on these explanations improves performance on two commonsense reasoning benchmarks, as compared to previous non-contrastive alternatives. These explanations are also judged by humans to be more relevant for solving the task, and facilitate a novel method to evaluate explanation faithfulfness.
    A Pseudo Label-wise Attention Network for Automatic ICD Coding. (arXiv:2106.06822v1 [cs.CL])
    (2 min) Automatic International Classification of Diseases (ICD) coding is defined as a kind of text multi-label classification problem, which is difficult because the number of labels is very large and the distribution of labels is unbalanced. The label-wise attention mechanism is widely used in automatic ICD coding because it can assign weights to every word in full Electronic Medical Records (EMR) for different ICD codes. However, the label-wise attention mechanism is computational redundant and costly. In this paper, we propose a pseudo label-wise attention mechanism to tackle the problem. Instead of computing different attention modes for different ICD codes, the pseudo label-wise attention mechanism automatically merges similar ICD codes and computes only one attention mode for the similar ICD codes, which greatly compresses the number of attention modes and improves the predicted accuracy. In addition, we apply a more convenient and effective way to obtain the ICD vectors, and thus our model can predict new ICD codes by calculating the similarities between EMR vectors and ICD vectors. Extensive experiments show the superior performance of our model. On the public MIMIC-III dataset and private Xiangya dataset, our model achieves micro f1 of 0.575 and 0.796, respectively, which outperforms other competing models. Furthermore, we verify the ability of our model in predicting new ICD codes. The case study shows how pseudo label-wise attention works, and demonstrates the effectiveness of pseudo label-wise attention mechanism.
    Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP. (arXiv:2106.06830v1 [cs.CL])
    (2 min) Retrieval is a core component for open-domain NLP tasks. In open-domain tasks, multiple entities can share a name, making disambiguation an inherent yet under-explored problem. We propose an evaluation benchmark for assessing the entity disambiguation capabilities of these retrievers, which we call Ambiguous Entity Retrieval (AmbER) sets. We define an AmbER set as a collection of entities that share a name along with queries about those entities. By covering the set of entities for polysemous names, AmbER sets act as a challenging test of entity disambiguation. We create AmbER sets for three popular open-domain tasks: fact checking, slot filling, and question answering, and evaluate a diverse set of retrievers. We find that the retrievers exhibit popularity bias, significantly under-performing on rarer entities that share a name, e.g., they are twice as likely to retrieve erroneous documents on queries for the less popular entity under the same name. These experiments on AmbER sets show their utility as an evaluation tool and highlight the weaknesses of popular retrieval systems.
    Predicting the Ordering of Characters in Japanese Historical Documents. (arXiv:2106.06786v1 [cs.CL])
    (2 min) Japan is a unique country with a distinct cultural heritage, which is reflected in billions of historical documents that have been preserved. However, the change in Japanese writing system in 1900 made these documents inaccessible for the general public. A major research project has been to make these historical documents accessible and understandable. An increasing amount of research has focused on the character recognition task and the location of characters on image, yet less research has focused on how to predict the sequential ordering of the characters. This is because sequence in classical Japanese is very different from modern Japanese. Ordering characters into a sequence is important for making the document text easily readable and searchable. Additionally, it is a necessary step for any kind of natural language processing on the data (e.g. machine translation, language modeling, and word embeddings). We explore a few approaches to the task of predicting the sequential ordering of the characters: one using simple hand-crafted rules, another using hand-crafted rules with adaptive thresholds, and another using a deep recurrent sequence model trained with teacher forcing. We provide a quantitative and qualitative comparison of these techniques as well as their distinct trade-offs. Our best-performing system has an accuracy of 98.65\% and has a perfect accuracy on 49\% of the books in our dataset, suggesting that the technique is able to predict the order of the characters well enough for many tasks.
    Improving Unsupervised Dialogue Topic Segmentation with Utterance-Pair Coherence Scoring. (arXiv:2106.06719v1 [cs.CL])
    (2 min) Dialogue topic segmentation is critical in several dialogue modeling problems. However, popular unsupervised approaches only exploit surface features in assessing topical coherence among utterances. In this work, we address this limitation by leveraging supervisory signals from the utterance-pair coherence scoring task. First, we present a simple yet effective strategy to generate a training corpus for utterance-pair coherence scoring. Then, we train a BERT-based neural utterance-pair coherence model with the obtained training corpus. Finally, such model is used to measure the topical relevance between utterances, acting as the basis of the segmentation inference. Experiments on three public datasets in English and Chinese demonstrate that our proposal outperforms the state-of-the-art baselines.
    Every Bite Is an Experience: Key Point Analysis of Business Reviews. (arXiv:2106.06758v1 [cs.CL])
    (2 min) Previous work on review summarization focused on measuring the sentiment toward the main aspects of the reviewed product or business, or on creating a textual summary. These approaches provide only a partial view of the data: aspect-based sentiment summaries lack sufficient explanation or justification for the aspect rating, while textual summaries do not quantify the significance of each element, and are not well-suited for representing conflicting views. Recently, Key Point Analysis (KPA) has been proposed as a summarization framework that provides both textual and quantitative summary of the main points in the data. We adapt KPA to review data by introducing Collective Key Point Mining for better key point extraction; integrating sentiment analysis into KPA; identifying good key point candidates for review summaries; and leveraging the massive amount of available reviews and their metadata. We show empirically that these novel extensions of KPA substantially improve its performance. We demonstrate that promising results can be achieved without any domain-specific annotation, while human supervision can lead to further improvement.
    Incorporating External POS Tagger for Punctuation Restoration. (arXiv:2106.06731v1 [cs.CL])
    (2 min) Punctuation restoration is an important post-processing step in automatic speech recognition. Among other kinds of external information, part-of-speech (POS) taggers provide informative tags, suggesting each input token's syntactic role, which has been shown to be beneficial for the punctuation restoration task. In this work, we incorporate an external POS tagger and fuse its predicted labels into the existing language model to provide syntactic information. Besides, we propose sequence boundary sampling (SBS) to learn punctuation positions more efficiently as a sequence tagging task. Experimental results show that our methods can consistently obtain performance gains and achieve a new state-of-the-art on the common IWSLT benchmark. Further ablation studies illustrate that both large pre-trained language models and the external POS tagger take essential parts to improve the model's performance.
    Modeling Language Usage and Listener Engagement in Podcasts. (arXiv:2106.06605v1 [cs.CL])
    (2 min) While there is an abundance of popular writing targeted to podcast creators on how to speak in ways that engage their listeners, there has been little data-driven analysis of podcasts that relates linguistic style with listener engagement. In this paper, we investigate how various factors -- vocabulary diversity, distinctiveness, emotion, and syntax, among others -- correlate with engagement, based on analysis of the creators' written descriptions and transcripts of the audio. We build models with different textual representations, and show that the identified features are highly predictive of engagement. Our analysis tests popular wisdom about stylistic elements in high-engagement podcasts, corroborating some aspects, and adding new perspectives on others.
    Break-It-Fix-It: Unsupervised Learning for Program Repair. (arXiv:2106.06600v1 [cs.LG])
    (2 min) We consider repair tasks: given a critic (e.g., compiler) that assesses the quality of an input, the goal is to train a fixer that converts a bad example (e.g., code with syntax errors) into a good one (e.g., code with no errors). Existing works create training data consisting of (bad, good) pairs by corrupting good examples using heuristics (e.g., dropping tokens). However, fixers trained on this synthetically-generated data do not extrapolate well to the real distribution of bad inputs. To bridge this gap, we propose a new training approach, Break-It-Fix-It (BIFI), which has two key ideas: (i) we use the critic to check a fixer's output on real bad inputs and add good (fixed) outputs to the training data, and (ii) we train a breaker to generate realistic bad code from good code. Based on these ideas, we iteratively update the breaker and the fixer while using them in conjunction to generate more paired data. We evaluate BIFI on two code repair datasets: GitHub-Python, a new dataset we introduce where the goal is to repair Python code with AST parse errors; and DeepFix, where the goal is to repair C code with compiler errors. BIFI outperforms existing methods, obtaining 90.5% repair accuracy on GitHub-Python (+28.5%) and 71.7% on DeepFix (+5.6%). Notably, BIFI does not require any labeled data; we hope it will be a strong starting point for unsupervised learning of various repair tasks.
    Neural Combinatory Constituency Parsing. (arXiv:2106.06689v1 [cs.CL])
    (2 min) We propose two fast neural combinatory models for constituency parsing: binary and multi-branching. Our models decompose the bottom-up parsing process into 1) classification of tags, labels, and binary orientations or chunks and 2) vector composition based on the computed orientations or chunks. These models have theoretical sub-quadratic complexity and empirical linear complexity. The binary model achieves an F1 score of 92.54 on Penn Treebank, speeding at 1327.2 sents/sec. Both the models with XLNet provide near state-of-the-art accuracies for English. Syntactic branching tendency and headedness of a language are observed during the training and inference processes for Penn Treebank, Chinese Treebank, and Keyaki Treebank (Japanese).
    Explaining the Deep Natural Language Processing by Mining Textual Interpretable Features. (arXiv:2106.06697v1 [cs.CL])
    (2 min) Despite the high accuracy offered by state-of-the-art deep natural-language models (e.g. LSTM, BERT), their application in real-life settings is still widely limited, as they behave like a black-box to the end-user. Hence, explainability is rapidly becoming a fundamental requirement of future-generation data-driven systems based on deep-learning approaches. Several attempts to fulfill the existing gap between accuracy and interpretability have been done. However, robust and specialized xAI (Explainable Artificial Intelligence) solutions tailored to deep natural-language models are still missing. We propose a new framework, named T-EBAnO, which provides innovative prediction-local and class-based model-global explanation strategies tailored to black-box deep natural-language models. Given a deep NLP model and the textual input data, T-EBAnO provides an objective, human-readable, domain-specific assessment of the reasons behind the automatic decision-making process. Specifically, the framework extracts sets of interpretable features mining the inner knowledge of the model. Then, it quantifies the influence of each feature during the prediction process by exploiting the novel normalized Perturbation Influence Relation index at the local level and the novel Global Absolute Influence and Global Relative Influence indexes at the global level. The effectiveness and the quality of the local and global explanations obtained with T-EBAnO are proved on (i) a sentiment analysis task performed by a fine-tuned BERT model, and (ii) a toxic comment classification task performed by an LSTM model.
    Leveraging Pre-trained Language Model for Speech Sentiment Analysis. (arXiv:2106.06598v1 [cs.CL])
    (2 min) In this paper, we explore the use of pre-trained language models to learn sentiment information of written texts for speech sentiment analysis. First, we investigate how useful a pre-trained language model would be in a 2-step pipeline approach employing Automatic Speech Recognition (ASR) and transcripts-based sentiment analysis separately. Second, we propose a pseudo label-based semi-supervised training strategy using a language model on an end-to-end speech sentiment approach to take advantage of a large, but unlabeled speech dataset for training. Although spoken and written texts have different linguistic characteristics, they can complement each other in understanding sentiment. Therefore, the proposed system can not only model acoustic characteristics to bear sentiment-specific information in speech signals, but learn latent information to carry sentiments in the text representation. In these experiments, we demonstrate the proposed approaches improve F1 scores consistently compared to systems without a language model. Moreover, we also show that the proposed framework can reduce 65% of human supervision by leveraging a large amount of data without human sentiment annotation and boost performance in a low-resource condition where the human sentiment annotation is not available enough.
    Assessing Multilingual Fairness in Pre-trained Multimodal Representations. (arXiv:2106.06683v1 [cs.CL])
    (2 min) Recently pre-trained multimodal models, such as CLIP, have received a surge of attention for their exceptional capabilities towards connecting images and natural language. The textual representations in English can be desirably transferred to multilingualism and support promising downstream multimodal tasks for different languages. Nevertheless, previous fairness discourse in vision-and-language learning mainly focuses on monolingual representational biases, and rarely scrutinizes the principles of multilingual fairness in this multimodal setting, where one language is equated to a group of individuals and images provide the universal grounding for bridging different languages. In this paper, we provide a nuanced understanding of individual fairness and group fairness by viewing language as the recipient of fairness notions. We define new fairness notions within multilingual context and analytically articulate that, pre-trained vision-and-language representations are individually fair across languages but not guaranteed to group fairness. Furthermore, we conduct extensive experiments to explore the prevalent group disparity across languages and protected groups including race, gender and age.
    Visualization Techniques to Enhance Automated Event Extraction. (arXiv:2106.06588v1 [cs.CL])
    (2 min) Robust visualization of complex data is critical for the effective use of NLP for event classification, as the volume of data is large and the high-dimensional structure of text makes data challenging to summarize succinctly. In event extraction tasks in particular, visualization can aid in understanding and illustrating the textual relationships from which machine learning tools produce insights. Through our case study which seeks to identify potential triggers of state-led mass killings from news articles using NLP, we demonstrate how visualizations can aid in each stage, from exploratory analysis of raw data, to machine learning training analysis, and finally post-inference validation.
    Sample-efficient Linguistic Generalizations through Program Synthesis: Experiments with Phonology Problems. (arXiv:2106.06566v1 [cs.CL])
    (2 min) Neural models excel at extracting statistical patterns from large amounts of data, but struggle to learn patterns or reason about language from only a few examples. In this paper, we ask: Can we learn explicit rules that generalize well from only a few examples? We explore this question using program synthesis. We develop a synthesis model to learn phonology rules as programs in a domain-specific language. We test the ability of our models to generalize from few training examples using our new dataset of problems from the Linguistics Olympiad, a challenging set of tasks that require strong linguistic reasoning ability. In addition to being highly sample-efficient, our approach generates human-readable programs, and allows control over the generalizability of the learnt programs.
    Direct Simultaneous Speech-to-Text Translation Assisted by Synchronized Streaming ASR. (arXiv:2106.06636v1 [cs.CL])
    (2 min) Simultaneous speech-to-text translation is widely useful in many scenarios. The conventional cascaded approach uses a pipeline of streaming ASR followed by simultaneous MT, but suffers from error propagation and extra latency. To alleviate these issues, recent efforts attempt to directly translate the source speech into target text simultaneously, but this is much harder due to the combination of two separate tasks. We instead propose a new paradigm with the advantages of both cascaded and end-to-end approaches. The key idea is to use two separate, but synchronized, decoders on streaming ASR and direct speech-to-text translation (ST), respectively, and the intermediate results of ASR guide the decoding policy of (but is not fed as input to) ST. During training time, we use multitask learning to jointly learn these two tasks with a shared encoder. En-to-De and En-to-Es experiments on the MuSTC dataset demonstrate that our proposed technique achieves substantially better translation quality at similar levels of latency.
    Study of sampling methods in sentiment analysis of imbalanced data. (arXiv:2106.06673v1 [cs.CL])
    (2 min) This work investigates the application of sampling methods for sentiment analysis on two different highly imbalanced datasets. One dataset contains online user reviews from the cooking platform Epicurious and the other contains comments given to the Planned Parenthood organization. In both these datasets, the classes of interest are rare. Word n-grams were used as features from these datasets. A feature selection technique based on information gain is first applied to reduce the number of features to a manageable space. A number of different sampling methods were then applied to mitigate the class imbalance problem which are then analyzed.
  • cs.CV updates on arXiv.org

    Cross-Subject Domain Adaptation for Multi-Frame EEG Images. (arXiv:2106.06769v1 [cs.LG])
    (2 min) Working memory (WM) is a basic part of human cognition, which plays an important role in the study of human cognitive load. Among various brain imaging techniques, electroencephalography has shown its advantage on easy access and reliability. However, one of the critical challenges is that individual difference may cause the ineffective results, especially when the established model meets an unfamiliar subject. In this work, we propose a cross-subject deep adaptation model with spatial attention (CS-DASA) to generalize the workload classifications across subjects. First, we transform time-series EEG data into multi-frame EEG images incorporating more spatio-temporal information. First, the subject-shared module in CS-DASA receives multi-frame EEG image data from both source and target subjects and learns the common feature representations. Then, in subject-specific module, the maximum mean discrepancy is implemented to measure the domain distribution divergence in a reproducing kernel Hilbert space, which can add an effective penalty loss for domain adaptation. Additionally, the subject-to-subject spatial attention mechanism is employed to focus on the most discriminative spatial feature in EEG image data. Experiments conducted on a public WM EEG dataset containing 13 subjects show that the proposed model is capable of achieve better performance than existing state-of-the art methods.
    Multi-Contrast MRI Super-Resolution via a Multi-Stage Integration Network. (arXiv:2105.08949v2 [eess.IV] UPDATED)
    (2 min) Super-resolution (SR) plays a crucial role in improving the image quality of magnetic resonance imaging (MRI). MRI produces multi-contrast images and can provide a clear display of soft tissues. However, current super-resolution methods only employ a single contrast, or use a simple multi-contrast fusion mechanism, ignoring the rich relations among different contrasts, which are valuable for improving SR. In this work, we propose a multi-stage integration network (i.e., MINet) for multi-contrast MRI SR, which explicitly models the dependencies between multi-contrast images at different stages to guide image SR. In particular, our MINet first learns a hierarchical feature representation from multiple convolutional stages for each of different-contrast image. Subsequently, we introduce a multi-stage integration module to mine the comprehensive relations between the representations of the multi-contrast images. Specifically, the module matches each representation with all other features, which are integrated in terms of their similarities to obtain an enriched representation. Extensive experiments on fastMRI and real-world clinical datasets demonstrate that 1) our MINet outperforms state-of-the-art multi-contrast SR methods in terms of various metrics and 2) our multi-stage integration module is able to excavate complex interactions among multi-contrast features at different stages, leading to improved target-image quality.
    DeepMMSA: A Novel Multimodal Deep Learning Method for Non-small Cell Lung Cancer Survival Analysis. (arXiv:2106.06744v1 [cs.CV])
    (2 min) Lung cancer is the leading cause of cancer death worldwide. The critical reason for the deaths is delayed diagnosis and poor prognosis. With the accelerated development of deep learning techniques, it has been successfully applied extensively in many real-world applications, including health sectors such as medical image interpretation and disease diagnosis. By combining more modalities that being engaged in the processing of information, multimodal learning can extract better features and improve predictive ability. The conventional methods for lung cancer survival analysis normally utilize clinical data and only provide a statistical probability. To improve the survival prediction accuracy and help prognostic decision-making in clinical practice for medical experts, we for the first time propose a multimodal deep learning method for non-small cell lung cancer (NSCLC) survival analysis, named DeepMMSA. This method leverages CT images in combination with clinical data, enabling the abundant information hold within medical images to be associate with lung cancer survival information. We validate our method on the data of 422 NSCLC patients from The Cancer Imaging Archive (TCIA). Experimental results support our hypothesis that there is an underlying relationship between prognostic information and radiomic images. Besides, quantitative results showing that the established multimodal model can be applied to traditional method and has the potential to break bottleneck of existing methods and increase the the percentage of concordant pairs(right predicted pairs) in overall population by 4%.
    CoPE: Conditional image generation using Polynomial Expansions. (arXiv:2104.05077v2 [cs.LG] UPDATED)
    (2 min) Generative modeling has evolved to a notable field of machine learning. Deep polynomial neural networks (PNNs) have demonstrated impressive results in unsupervised image generation, where the task is to map an input vector (i.e., noise) to a synthesized image. However, the success of PNNs has not been replicated in conditional generation tasks, such as super-resolution. Existing PNNs focus on single-variable polynomial expansions which do not fare well to two-variable inputs, i.e., the noise variable and the conditional variable. In this work, we introduce a general framework, called CoPE, that enables a polynomial expansion of two input variables and captures their auto- and cross-correlations. We exhibit how CoPE can be trivially augmented to accept an arbitrary number of input variables. CoPE is evaluated in five tasks (class-conditional generation, inverse problems, edges-to-image translation, image-to-image translation, attribute-guided generation) involving eight datasets. The thorough evaluation suggests that CoPE can be useful for tackling diverse conditional generation tasks.
    RobustBench: a standardized adversarial robustness benchmark. (arXiv:2010.09670v2 [cs.LG] UPDATED)
    (3 min) As a research community, we are still lacking a systematic understanding of the progress on adversarial robustness, which often makes it hard to identify the most promising ideas in training robust models. A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to overestimation of the true robustness of models. While adaptive attacks designed for a particular defense are a potential solution, they have to be highly customized for particular models, which makes it difficult to compare different methods. Our goal is to instead establish a standardized benchmark of adversarial robustness, which as accurately as possible reflects the robustness of the considered models within a reasonable computational budget. To evaluate the robustness of models for our benchmark, we consider AutoAttack, an ensemble of white- and black-box attacks which was recently shown in a large-scale study to improve almost all robustness evaluations compared to the original publications. We also impose some restrictions on the admitted models to rule out defenses that only make gradient-based attacks ineffective without improving actual robustness. Our leaderboard, hosted at https://robustbench.github.io/, contains evaluations of 90+ models and aims at reflecting the current state of the art on a set of well-defined tasks in $\ell_\infty$- and $\ell_2$-threat models and on common corruptions, with possible extensions in the future. Additionally, we open-source the library https://github.com/RobustBench/robustbench that provides unified access to 60+ robust models to facilitate their downstream applications. Finally, based on the collected models, we analyze the impact of robustness on the performance on distribution shifts, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.
    Orderly Dual-Teacher Knowledge Distillation for Lightweight Human Pose Estimation. (arXiv:2104.10414v2 [cs.CV] UPDATED)
    (2 min) Although deep convolution neural networks (DCNN) have achieved excellent performance in human pose estimation, these networks often have a large number of parameters and computations, leading to the slow inference speed. For this issue, an effective solution is knowledge distillation, which transfers knowledge from a large pre-trained network (teacher) to a small network (student). However, there are some defects in the existing approaches: (I) Only a single teacher is adopted, neglecting the potential that a student can learn from multiple teachers. (II) The human segmentation mask can be regarded as additional prior information to restrict the location of keypoints, which is never utilized. (III) A student with a small number of parameters cannot fully imitate heatmaps provided by datasets and teachers. (IV) There exists noise in heatmaps generated by teachers, which causes model degradation. To overcome these defects, we propose an orderly dual-teacher knowledge distillation (ODKD) framework, which consists of two teachers with different capabilities. Specifically, the weaker one (primary teacher, PT) is used to teach keypoints information, the stronger one (senior teacher, ST) is utilized to transfer segmentation and keypoints information by adding the human segmentation mask. Taking dual-teacher together, an orderly learning strategy is proposed to promote knowledge absorbability. Moreover, we employ a binarization operation which further improves the learning ability of the student and reduces noise in heatmaps. Experimental results on COCO and OCHuman keypoints datasets show that our proposed ODKD can improve the performance of different lightweight models by a large margin, and HRNet-W16 equipped with ODKD achieves state-of-the-art performance for lightweight human pose estimation.
    Mirror3D: Depth Refinement for Mirror Surfaces. (arXiv:2106.06629v1 [cs.CV])
    (2 min) Despite recent progress in depth sensing and 3D reconstruction, mirror surfaces are a significant source of errors. To address this problem, we create the Mirror3D dataset: a 3D mirror plane dataset based on three RGBD datasets (Matterport3D, NYUv2 and ScanNet) containing 7,011 mirror instance masks and 3D planes. We then develop Mirror3DNet: a module that refines raw sensor depth or estimated depth to correct errors on mirror surfaces. Our key idea is to estimate the 3D mirror plane based on RGB input and surrounding depth context, and use this estimate to directly regress mirror surface depth. Our experiments show that Mirror3DNet significantly mitigates errors from a variety of input depth data, including raw sensor depth and depth estimation or completion methods.
    CNN-based Lung CT Registration with Multiple Anatomical Constraints. (arXiv:2011.14372v2 [cs.CV] UPDATED)
    (2 min) Deep-learning-based registration methods emerged as a fast alternative to conventional registration methods. However, these methods often still cannot achieve the same performance as conventional registration methods because they are either limited to small deformation or they fail to handle a superposition of large and small deformations without producing implausible deformation fields with foldings inside. In this paper, we identify important strategies of conventional registration methods for lung registration and successfully developed the deep-learning counterpart. We employ a Gaussian-pyramid-based multilevel framework that can solve the image registration optimization in a coarse-to-fine fashion. Furthermore, we prevent foldings of the deformation field and restrict the determinant of the Jacobian to physiologically meaningful values by combining a volume change penalty with a curvature regularizer in the loss function. Keypoint correspondences are integrated to focus on the alignment of smaller structures. We perform an extensive evaluation to assess the accuracy, the robustness, the plausibility of the estimated deformation fields, and the transferability of our registration approach. We show that it achieves state-of-the-art results on the COPDGene dataset compared to conventional registration method with much shorter execution time. In our experiments on the DIRLab exhale to inhale lung registration, we demonstrate substantial improvements (TRE below $1.2$ mm) over other deep learning methods. Our algorithm is publicly available at https://grand-challenge.org/algorithms/deep-learning-based-ct-lung-registration/.
    Video Super-Resolution Transformer. (arXiv:2106.06847v1 [cs.CV])
    (2 min) Video super-resolution (VSR), with the aim to restore a high-resolution video from its corresponding low-resolution version, is a spatial-temporal sequence prediction problem. Recently, Transformer has been gaining popularity due to its parallel computing ability for sequence-to-sequence modeling. Thus, it seems to be straightforward to apply the vision Transformer to solve VSR. However, the typical block design of Transformer with a fully connected self-attention layer and a token-wise feed-forward layer does not fit well for VSR due to the following two reasons. First, the fully connected self-attention layer neglects to exploit the data locality because this layer relies on linear layers to compute attention maps. Second, the token-wise feed-forward layer lacks the feature alignment which is important for VSR since this layer independently processes each of the input token embeddings without any interaction among them. In this paper, we make the first attempt to adapt Transformer for VSR. Specifically, to tackle the first issue, we present a spatial-temporal convolutional self-attention layer with a theoretical understanding to exploit the locality information. For the second issue, we design a bidirectional optical flow-based feed-forward layer to discover the correlations across different video frames and also align features. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our proposed method. The code will be available at https://github.com/caojiezhang/VSR-Transformer.
    D2C: Diffusion-Denoising Models for Few-shot Conditional Generation. (arXiv:2106.06819v1 [cs.LG])
    (2 min) Conditional generative models of high-dimensional images have many applications, but supervision signals from conditions to images can be expensive to acquire. This paper describes Diffusion-Decoding models with Contrastive representations (D2C), a paradigm for training unconditional variational autoencoders (VAEs) for few-shot conditional image generation. D2C uses a learned diffusion-based prior over the latent representations to improve generation and contrastive self-supervised learning to improve representation quality. D2C can adapt to novel generation tasks conditioned on labels or manipulation constraints, by learning from as few as 100 labeled examples. On conditional generation from new labels, D2C achieves superior performance over state-of-the-art VAEs and diffusion models. On conditional image manipulation, D2C generations are two orders of magnitude faster to produce over StyleGAN2 ones and are preferred by 50% - 60% of the human evaluators in a double-blind study.
    LE-NAS: Learning-based Ensenble with NAS for Dose Prediction. (arXiv:2106.06733v1 [cs.CV])
    (2 min) Radiation therapy treatment planning is a complex process, as the target dose prescription and normal tissue sparing are conflicting objectives. Automated and accurate dose prediction for radiation therapy planning is in high demand. In this study, we propose a novel learning-based ensemble approach, named LE-NAS, which integrates neural architecture search (NAS) with knowledge distillation for 3D radiotherapy dose prediction. Specifically, the prediction network first exhaustively searches each block from enormous architecture space. Then, multiple architectures are selected with promising performance and diversity. To reduce the inference time, we adopt the teacher-student paradigm by treating the combination of diverse outputs from multiple searched networks as supervisions to guide the student network training. In addition, we apply adversarial learning to optimize the student network to recover the knowledge in teacher networks. To the best of our knowledge, we are the first to investigate the combination of NAS and knowledge distillation. The proposed method has been evaluated on the public OpenKBP dataset, and experimental results demonstrate the effectiveness of our method and its superior performance to the state-of-the-art method.
    Exploring and Distilling Posterior and Prior Knowledge for Radiology Report Generation. (arXiv:2106.06963v1 [cs.CV])
    (2 min) Automatically generating radiology reports can improve current clinical practice in diagnostic radiology. On one hand, it can relieve radiologists from the heavy burden of report writing; On the other hand, it can remind radiologists of abnormalities and avoid the misdiagnosis and missed diagnosis. Yet, this task remains a challenging job for data-driven neural networks, due to the serious visual and textual data biases. To this end, we propose a Posterior-and-Prior Knowledge Exploring-and-Distilling approach (PPKED) to imitate the working patterns of radiologists, who will first examine the abnormal regions and assign the disease topic tags to the abnormal regions, and then rely on the years of prior medical knowledge and prior working experience accumulations to write reports. Thus, the PPKED includes three modules: Posterior Knowledge Explorer (PoKE), Prior Knowledge Explorer (PrKE) and Multi-domain Knowledge Distiller (MKD). In detail, PoKE explores the posterior knowledge, which provides explicit abnormal visual regions to alleviate visual data bias; PrKE explores the prior knowledge from the prior medical knowledge graph (medical knowledge) and prior radiology reports (working experience) to alleviate textual data bias. The explored knowledge is distilled by the MKD to generate the final reports. Evaluated on MIMIC-CXR and IU-Xray datasets, our method is able to outperform previous state-of-the-art models on these two datasets.
    Feedback Pyramid Attention Networks for Single Image Super-Resolution. (arXiv:2106.06966v1 [cs.CV])
    (2 min) Recently, convolutional neural network (CNN) based image super-resolution (SR) methods have achieved significant performance improvement. However, most CNN-based methods mainly focus on feed-forward architecture design and neglect to explore the feedback mechanism, which usually exists in the human visual system. In this paper, we propose feedback pyramid attention networks (FPAN) to fully exploit the mutual dependencies of features. Specifically, a novel feedback connection structure is developed to enhance low-level feature expression with high-level information. In our method, the output of each layer in the first stage is also used as the input of the corresponding layer in the next state to re-update the previous low-level filters. Moreover, we introduce a pyramid non-local structure to model global contextual information in different scales and improve the discriminative representation of the network. Extensive experimental results on various datasets demonstrate the superiority of our FPAN in comparison with the state-of-the-art SR methods.
    NLHD: A Pixel-Level Non-Local Retinex Model for Low-Light Image Enhancement. (arXiv:2106.06971v1 [cs.CV])
    (2 min) Retinex model has been applied to low-light image enhancement in many existing methods. More appropriate decomposition of a low-light image can help achieve better image enhancement. In this paper, we propose a new pixel-level non-local Haar transform based illumination and reflectance decomposition method (NLHD). The unique low-frequency coefficient of Haar transform on each similar pixel group is used to reconstruct the illumination component, and the rest of all high-frequency coefficients are employed to reconstruct the reflectance component. The complete similarity of pixels in a matched similar pixel group and the simple separable Haar transform help to obtain more appropriate image decomposition; thus, the image is hardly sharpened in the image brightness enhancement procedure. The exponential transform and logarithmic transform are respectively implemented on the illumination component. Then a minimum fusion strategy on the results of these two transforms is utilized to achieve more natural illumination component enhancement. It can alleviate the mosaic artifacts produced in the darker regions by the exponential transform with a gamma value less than 1 and reduce information loss caused by excessive enhancement of the brighter regions due to the logarithmic transform. Finally, the Retinex model is applied to the enhanced illumination and reflectance to achieve image enhancement. We also develop a local noise level estimation based noise suppression method and a non-local saturation reduction based color deviation correction method. These two methods can respectively attenuate noise or color deviation usually presented in the enhanced results of the extremely dark low-light images. Experiments on benchmark datasets show that the proposed method can achieve better low-light image enhancement results on subjective and objective evaluations than most existing methods.
    GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation (works for videos too!). (arXiv:2106.06561v1 [cs.CV])
    (2 min) We show how to learn a map that takes a content code, derived from a face image, and a randomly chosen style code to an anime image. We derive an adversarial loss from our simple and effective definitions of style and content. This adversarial loss guarantees the map is diverse -- a very wide range of anime can be produced from a single content code. Under plausible assumptions, the map is not just diverse, but also correctly represents the probability of an anime, conditioned on an input face. In contrast, current multimodal generation procedures cannot capture the complex styles that appear in anime. Extensive quantitative experiments support the idea the map is correct. Extensive qualitative results show that the method can generate a much more diverse range of styles than SOTA comparisons. Finally, we show that our formalization of content and style allows us to perform video to video translation without ever training on videos.
    Toward Accurate and Realistic Outfits Visualization with Attention to Details. (arXiv:2106.06593v1 [cs.CV])
    (2 min) Virtual try-on methods aim to generate images of fashion models wearing arbitrary combinations of garments. This is a challenging task because the generated image must appear realistic and accurately display the interaction between garments. Prior works produce images that are filled with artifacts and fail to capture important visual details necessary for commercial applications. We propose Outfit Visualization Net (OVNet) to capture these important details (e.g. buttons, shading, textures, realistic hemlines, and interactions between garments) and produce high quality multiple-garment virtual try-on images. OVNet consists of 1) a semantic layout generator and 2) an image generation pipeline using multiple coordinated warps. We train the warper to output multiple warps using a cascade loss, which refines each successive warp to focus on poorly generated regions of a previous warp and yields consistent improvements in detail. In addition, we introduce a method for matching outfits with the most suitable model and produce significant improvements for both our and other previous try-on methods. Through quantitative and qualitative analysis, we demonstrate our method generates substantially higher-quality studio images compared to prior works for multi-garment outfits. An interactive interface powered by this method has been deployed on fashion e-commerce websites and received overwhelmingly positive feedback.
    Cross-Modal Attention Consistency for Video-Audio Unsupervised Learning. (arXiv:2106.06939v1 [cs.CV])
    (2 min) Cross-modal correlation provides an inherent supervision for video unsupervised representation learning. Existing methods focus on distinguishing different video clips by visual and audio representations. We human visual perception could attend to regions where sounds are made, and our auditory perception could also ground their frequencies of sounding objects, which we call bidirectional local correspondence. Such supervision is intuitive but not well explored in the contrastive learning framework. This paper introduces a pretext task, Cross-Modal Attention Consistency (CMAC), for exploring the bidirectional local correspondence property. The CMAC approach aims to align the regional attention generated purely from the visual signal with the target attention generated under the guidance of acoustic signal, and do a similar alignment for frequency grounding on the acoustic attention. Accompanied by a remoulded cross-modal contrastive loss where we consider additional within-modal interactions, the CMAC approach works effectively for enforcing the bidirectional alignment. Extensive experiments on six downstream benchmarks demonstrate that CMAC can improve the state-of-the-art performance on both visual and audio modalities.
    Understanding Failures of Deep Networks via Robust Feature Extraction. (arXiv:2012.01750v3 [cs.CV] UPDATED)
    (2 min) Traditional evaluation metrics for learned models that report aggregate scores over a test set are insufficient for surfacing important and informative patterns of failure over features and instances. We introduce and study a method aimed at characterizing and explaining failures by identifying visual attributes whose presence or absence results in poor performance. In distinction to previous work that relies upon crowdsourced labels for visual attributes, we leverage the representation of a separate robust model to extract interpretable features and then harness these features to identify failure modes. We further propose a visualization method aimed at enabling humans to understand the meaning encoded in such features and we test the comprehensibility of the features. An evaluation of the methods on the ImageNet dataset demonstrates that: (i) the proposed workflow is effective for discovering important failure modes, (ii) the visualization techniques help humans to understand the extracted features, and (iii) the extracted insights can assist engineers with error analysis and debugging.
    Lite-FPN for Keypoint-based Monocular 3D Object Detection. (arXiv:2105.00268v2 [cs.CV] UPDATED)
    (2 min) 3D object detection with a single image is an essential and challenging task for autonomous driving. Recently, keypoint-based monocular 3D object detection has made tremendous progress and achieved great speed-accuracy trade-off. However, there still exists a huge gap with LIDAR-based methods in terms of accuracy. To improve their performance without sacrificing efficiency, we propose a sort of lightweight feature pyramid network called Lite-FPN to achieve multi-scale feature fusion in an effective and efficient way, which can boost the multi-scale detection capability of keypoint-based detectors. Besides, the misalignment between classification score and localization precision is further relieved by introducing a novel regression loss named attention loss. With the proposed loss, predictions with high confidence but poor localization are treated with more attention during the training phase. Comparative experiments based on several state-of-the-art keypoint-based detectors on the KITTI dataset show that our proposed methods manage to achieve significant improvements in both accuracy and frame rate. The code and pretrained models will be released at \url{https://github.com/yanglei18/Lite-FPN}.
    Understanding self-supervised Learning Dynamics without Contrastive Pairs. (arXiv:2102.06810v2 [cs.LG] UPDATED)
    (2 min) While contrastive approaches of self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point (positive pairs) and maximizing views from different data points (negative pairs), recent \emph{non-contrastive} SSL (e.g., BYOL and SimSiam) show remarkable performance {\it without} negative pairs, with an extra learnable predictor and a stop-gradient operation. A fundamental question arises: why do these methods not collapse into trivial representations? We answer this question via a simple theoretical study and propose a novel approach, DirectPred, that \emph{directly} sets the linear predictor based on the statistics of its inputs, without gradient training. On ImageNet, it performs comparably with more complex two-layer non-linear predictors that employ BatchNorm and outperforms a linear predictor by $2.5\%$ in 300-epoch training (and $5\%$ in 60-epoch). DirectPred is motivated by our theoretical study of the nonlinear learning dynamics of non-contrastive SSL in simple linear networks. Our study yields conceptual insights into how non-contrastive SSL methods learn, how they avoid representational collapse, and how multiple factors, like predictor networks, stop-gradients, exponential moving averages, and weight decay all come into play. Our simple theory recapitulates the results of real-world ablation studies in both STL-10 and ImageNet. Code is released\footnote{\url{https://github.com/facebookresearch/luckmatters/tree/master/ssl}}.
    Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence. (arXiv:2106.03743v2 [cs.LG] CROSS LISTED)
    (2 min) We investigate the reasons for the performance degradation incurred with batch-independent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce the appearance of failure modes in the neural network's pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of an alteration of the expressivity. To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique "Proxy Normalization" that normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates batch normalization's behavior and consistently matches or exceeds its performance.
    Adversarial Robustness under Long-Tailed Distribution. (arXiv:2104.02703v2 [cs.CV] UPDATED)
    (2 min) Adversarial robustness has attracted extensive studies recently by revealing the vulnerability and intrinsic characteristics of deep networks. However, existing works on adversarial robustness mainly focus on balanced datasets, while real-world data usually exhibits a long-tailed distribution. To push adversarial robustness towards more realistic scenarios, in this work we investigate the adversarial vulnerability as well as defense under long-tailed distributions. In particular, we first reveal the negative impacts induced by imbalanced data on both recognition performance and adversarial robustness, uncovering the intrinsic challenges of this problem. We then perform a systematic study on existing long-tailed recognition methods in conjunction with the adversarial training framework. Several valuable observations are obtained: 1) natural accuracy is relatively easy to improve, 2) fake gain of robust accuracy exists under unreliable evaluation, and 3) boundary error limits the promotion of robustness. Inspired by these observations, we propose a clean yet effective framework, RoBal, which consists of two dedicated modules, a scale-invariant classifier and data re-balancing via both margin engineering at training stage and boundary adjustment during inference. Extensive experiments demonstrate the superiority of our approach over other state-of-the-art defense methods. To our best knowledge, we are the first to tackle adversarial robustness under long-tailed distributions, which we believe would be a significant step towards real-world robustness. Our code is available at: https://github.com/wutong16/Adversarial_Long-Tail .
    Barlow Twins: Self-Supervised Learning via Redundancy Reduction. (arXiv:2103.03230v3 [cs.CV] UPDATED)
    (2 min) Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection.
    Explaining the Black-box Smoothly- A Counterfactual Approach. (arXiv:2101.04230v2 [cs.CV] UPDATED)
    (2 min) We propose a BlackBox \emph{Counterfactual Explainer} that is explicitly developed for medical imaging applications. Classical approaches (e.g. saliency maps) assessing feature importance do not explain \emph{how} and \emph{why} variations in a particular anatomical region is relevant to the outcome, which is crucial for transparent decision making in healthcare application. Our framework explains the outcome by gradually \emph{exaggerating} the semantic effect of the given outcome label. Given a query input to a classifier, Generative Adversarial Networks produce a progressive set of perturbations to the query image that gradually changes the posterior probability from its original class to its negation. We design the loss function to ensure that essential and potentially relevant details, such as support devices, are preserved in the counterfactually generated images. We provide an extensive evaluation of different classification tasks on the chest X-Ray images. Our experiments show that a counterfactually generated visual explanation is consistent with the disease's clinical relevant measurements, both quantitatively and qualitatively.
    DONet: Dual-Octave Network for Fast MR Image Reconstruction. (arXiv:2105.05980v2 [eess.IV] UPDATED)
    (2 min) Magnetic resonance (MR) image acquisition is an inherently prolonged process, whose acceleration has long been the subject of research. This is commonly achieved by obtaining multiple undersampled images, simultaneously, through parallel imaging. In this paper, we propose the Dual-Octave Network (DONet), which is capable of learning multi-scale spatial-frequency features from both the real and imaginary components of MR data, for fast parallel MR image reconstruction. More specifically, our DONet consists of a series of Dual-Octave convolutions (Dual-OctConv), which are connected in a dense manner for better reuse of features. In each Dual-OctConv, the input feature maps and convolutional kernels are first split into two components (ie, real and imaginary), and then divided into four groups according to their spatial frequencies. Then, our Dual-OctConv conducts intra-group information updating and inter-group information exchange to aggregate the contextual information across different groups. Our framework provides three appealing benefits: (i) It encourages information interaction and fusion between the real and imaginary components at various spatial frequencies to achieve richer representational capacity. (ii) The dense connections between the real and imaginary groups in each Dual-OctConv make the propagation of features more efficient by feature reuse. (iii) DONet enlarges the receptive field by learning multiple spatial-frequency features of both the real and imaginary components. Extensive experiments on two popular datasets (ie, clinical knee and fastMRI), under different undersampling patterns and acceleration factors, demonstrate the superiority of our model in accelerated parallel MR image reconstruction.
    Robust Image Classification Using A Low-Pass Activation Function and DCT Augmentation. (arXiv:2007.09453v2 [cs.CV] UPDATED)
    (2 min) Convolutional Neural Network's (CNN's) performance disparity on clean and corrupted datasets has recently come under scrutiny. In this work, we analyse common corruptions in the frequency domain, i.e., High Frequency corruptions (HFc, e.g., noise) and Low Frequency corruptions (LFc, e.g., blur). Although a simple solution to HFc is low-pass filtering, ReLU -- a widely used Activation Function (AF), does not have any filtering mechanism. In this work, we instill low-pass filtering into the AF (LP-ReLU) to improve robustness against HFc. To deal with LFc, we complement LP-ReLU with Discrete Cosine Transform based augmentation. LP-ReLU, coupled with DCT augmentation, enables a deep network to tackle the entire spectrum of corruption. We use CIFAR-10-C and Tiny ImageNet-C for evaluation and demonstrate improvements of 5% and 7.3% in accuracy respectively, compared to the State-Of-The-Art (SOTA). We further evaluate our method's stability on a variety of perturbations in CIFAR-10-P and Tiny ImageNet-P, achieving new SOTA in these experiments as well. To further strengthen our understanding regarding CNN's lack of robustness, a decision space visualisation process is proposed and presented in this work.
    Deep manifold learning reveals hidden dynamics of proteasome autoregulation. (arXiv:2012.12854v2 [q-bio.QM] UPDATED)
    (2 min) The 2.5-MDa 26S proteasome maintains proteostasis and regulates myriad cellular processes. How polyubiquitylated substrate interactions regulate proteasome activity is not understood. Here we introduce a deep manifold learning framework, named AlphaCryo4D, which enables atomic-level cryogenic electron microscopy (cryo-EM) reconstructions of nonequilibrium conformational continuum and reconstitutes hidden dynamics of proteasome autoregulation in the act of substrate degradation. AlphaCryo4D integrates 3D deep residual learning with manifold embedding of free-energy landscapes, which directs 3D clustering via an energy-based particle-voting algorithm. In blind assessments using simulated heterogeneous cryo-EM datasets, AlphaCryo4D achieved 3D classification accuracy three times that of conventional method and reconstructed continuous conformational changes of a 130-kDa protein at sub-3-angstrom resolution. By using AlphaCryo4D to analyze a single experimental cryo-EM dataset, we identified 64 conformers of the substrate-bound human 26S proteasome, revealing conformational entanglement of two regulatory particles in the doubly capped holoenzymes and their energetic differences with singly capped ones. Novel ubiquitin-binding sites are discovered on the RPN2, RPN10 and Alpha5 subunits to remodel polyubiquitin chains for deubiquitylation and recycle. Importantly, AlphaCryo4D choreographs single-nucleotide-exchange dynamics of proteasomal AAA-ATPase motor during translocation initiation, which upregulates proteolytic activity by allosterically promoting nucleophilic attack. Our systemic analysis illuminates a grand hierarchical allostery for proteasome autoregulation.
    Dirty Road Can Attack: Security of Deep Learning based Automated Lane Centering under Physical-World Attack. (arXiv:2009.06701v2 [cs.CR] UPDATED)
    (2 min) Automated Lane Centering (ALC) systems are convenient and widely deployed today, but also highly security and safety critical. In this work, we are the first to systematically study the security of state-of-the-art deep learning based ALC systems in their designed operational domains under physical-world adversarial attacks. We formulate the problem with a safety-critical attack goal, and a novel and domain-specific attack vector: dirty road patches. To systematically generate the attack, we adopt an optimization-based approach and overcome domain-specific design challenges such as camera frame inter-dependencies due to attack-influenced vehicle control, and the lack of objective function design for lane detection models. We evaluate our attack on a production ALC using 80 scenarios from real-world driving traces. The results show that our attack is highly effective with over 97.5% success rates and less than 0.903 sec average success time, which is substantially lower than the average driver reaction time. This attack is also found (1) robust to various real-world factors such as lighting conditions and view angles, (2) general to different model designs, and (3) stealthy from the driver's view. To understand the safety impacts, we conduct experiments using software-in-the-loop simulation and attack trace injection in a real vehicle. The results show that our attack can cause a 100% collision rate in different scenarios, including when tested with common safety features such as automatic emergency braking. We also evaluate and discuss defenses.
    Pay Attention with Focus: A Novel Learning Scheme for Classification of Whole Slide Images. (arXiv:2106.06623v1 [cs.CV])
    (2 min) Deep learning methods such as convolutional neural networks (CNNs) are difficult to directly utilize to analyze whole slide images (WSIs) due to the large image dimensions. We overcome this limitation by proposing a novel two-stage approach. First, we extract a set of representative patches (called mosaic) from a WSI. Each patch of a mosaic is encoded to a feature vector using a deep network. The feature extractor model is fine-tuned using hierarchical target labels of WSIs, i.e., anatomic site and primary diagnosis. In the second stage, a set of encoded patch-level features from a WSI is used to compute the primary diagnosis probability through the proposed Pay Attention with Focus scheme, an attention-weighted averaging of predicted probabilities for all patches of a mosaic modulated by a trainable focal factor. Experimental results show that the proposed model can be robust, and effective for the classification of WSIs.
    Revisiting Classification Perspective on Scene Text Recognition. (arXiv:2102.10884v3 [cs.CV] UPDATED)
    (2 min) The prevalent perspectives of scene text recognition are from sequence to sequence (seq2seq) and segmentation. Nevertheless, the former is composed of many components which makes implementation and deployment complicated, while the latter requires character level annotations that is expensive. In this paper, we revisit classification perspective that models scene text recognition as an image classification problem. Classification perspective has a simple pipeline and only needs word level annotations. We revive classification perspective by devising a scene text recognition model named as CSTR, which performs as well as methods from other perspectives. The CSTR model consists of CPNet (classification perspective network) and SPPN (separated conv with global average pooling prediction network). CSTR is as simple as image classification model like ResNet \cite{he2016deep} which makes it easy to implement and deploy. We demonstrate the effectiveness of the classification perspective on scene text recognition with extensive experiments. Futhermore, CSTR achieves nearly state-of-the-art performance on six public benchmarks including regular text, irregular text. The code will be available at https://github.com/Media-Smart/vedastr.
    Multi-level Attention Fusion Network for Audio-visual Event Recognition. (arXiv:2106.06736v1 [cs.CV])
    (2 min) Event classification is inherently sequential and multimodal. Therefore, deep neural models need to dynamically focus on the most relevant time window and/or modality of a video. In this study, we propose the Multi-level Attention Fusion network (MAFnet), an architecture that can dynamically fuse visual and audio information for event recognition. Inspired by prior studies in neuroscience, we couple both modalities at different levels of visual and audio paths. Furthermore, the network dynamically highlights a modality at a given time window relevant to classify events. Experimental results in AVE (Audio-Visual Event), UCF51, and Kinetics-Sounds datasets show that the approach can effectively improve the accuracy in audio-visual event classification. Code is available at: https://github.com/numediart/MAFnet
    Is Perfect Filtering Enough Leading to Perfect Phase Correction for dMRI data?. (arXiv:2106.06992v1 [cs.CV])
    (2 min) Being complex-valued and low in signal-to-noise ratios, magnitude-based diffusion MRI is confounded by the noise-floor that falsely elevates signal magnitude and incurs bias to the commonly used diffusion indices, such as fractional anisotropy (FA). To avoid noise-floor, most existing phase correction methods explore improving filters to estimate the noise-free background phase. In this work, after diving into the phase correction procedures, we argue that even a perfect filter is insufficient for phase correction because the correction procedures are incapable of distinguishing sign-symbols of noise, resulting in artifacts (\textit{i.e.}, arbitrary signal loss). With this insight, we generalize the definition of noise-floor to a complex polar coordinate system and propose a calibration procedure that could conveniently distinguish noise sign symbols. The calibration procedure is conceptually simple and easy to implement without relying on any external technique while keeping distinctly effective.
    Knowledge Consolidation based Class Incremental Online Learning with Limited Data. (arXiv:2106.06795v1 [cs.LG])
    (2 min) We propose a novel approach for class incremental online learning in a limited data setting. This problem setting is challenging because of the following constraints: (1) Classes are given incrementally, which necessitates a class incremental learning approach; (2) Data for each class is given in an online fashion, i.e., each training example is seen only once during training; (3) Each class has very few training examples; and (4) We do not use or assume access to any replay/memory to store data from previous classes. Therefore, in this setting, we have to handle twofold problems of catastrophic forgetting and overfitting. In our approach, we learn robust representations that are generalizable across tasks without suffering from the problems of catastrophic forgetting and overfitting to accommodate future classes with limited samples. Our proposed method leverages the meta-learning framework with knowledge consolidation. The meta-learning framework helps the model for rapid learning when samples appear in an online fashion. Simultaneously, knowledge consolidation helps to learn a robust representation against forgetting under online updates to facilitate future learning. Our approach significantly outperforms other methods on several benchmarks.
    Diffusion Probabilistic Models for 3D Point Cloud Generation. (arXiv:2103.01458v2 [cs.CV] UPDATED)
    (2 min) We present a probabilistic model for point cloud generation, which is fundamental for various 3D vision tasks such as shape completion, upsampling, synthesis and data augmentation. Inspired by the diffusion process in non-equilibrium thermodynamics, we view points in point clouds as particles in a thermodynamic system in contact with a heat bath, which diffuse from the original distribution to a noise distribution. Point cloud generation thus amounts to learning the reverse diffusion process that transforms the noise distribution to the distribution of a desired shape. Specifically, we propose to model the reverse diffusion process for point clouds as a Markov chain conditioned on certain shape latent. We derive the variational bound in closed form for training and provide implementations of the model. Experimental results demonstrate that our model achieves competitive performance in point cloud generation and auto-encoding. The code is available at \url{https://github.com/luost26/diffusion-point-cloud}.
    BigEarthNet Dataset with A New Class-Nomenclature for Remote Sensing Image Understanding. (arXiv:2001.06372v3 [cs.CV] UPDATED)
    (2 min) This paper presents BigEarthNet that is a large-scale Sentinel-2 multispectral image dataset with a new class nomenclature to advance deep learning (DL) studies in remote sensing (RS). BigEarthNet is made up of 590,326 image patches annotated with multi-labels provided by the CORINE Land Cover (CLC) map of 2018 based on its most thematic detailed Level-3 class nomenclature. Initial research demonstrates that some CLC classes are challenging to be accurately described by considering only Sentinel-2 images. To increase the effectiveness of BigEarthNet, in this paper we introduce an alternative class-nomenclature to allow DL models for better learning and describing the complex spatial and spectral information content of the Sentinel-2 images. This is achieved by interpreting and arranging the CLC Level-3 nomenclature based on the properties of Sentinel-2 images in a new nomenclature of 19 classes. Then, the new class-nomenclature of BigEarthNet is used within state-of-the-art DL models in the context of multi-label classification. Results show that the models trained from scratch on BigEarthNet outperform those pre-trained on ImageNet, especially in relation to some complex classes including agriculture, other vegetated and natural environments. All DL models are made publicly available at this http URL, offering an important resource to guide future progress on RS image analysis.
    TransGAN: Two Pure Transformers Can Make One Strong GAN, and That Can Scale Up. (arXiv:2102.07074v3 [cs.CV] UPDATED)
    (2 min) The recent explosive interest on transformers has suggested their potential to become powerful ``universal" models for computer vision tasks, such as classification, detection, and segmentation. While those attempts mainly study the discriminative models, we explore transformers on some more notoriously difficult vision tasks, e.g., generative adversarial networks (GANs). Our goal is to conduct the first pilot study in building a GAN completely free of convolutions, using only pure transformer-based architectures. Our vanilla GAN architecture, dubbed TransGAN, consists of a memory-friendly transformer-based generator that progressively increases feature resolution, and correspondingly a multi-scale discriminator to capture simultaneously semantic contexts and low-level textures. On top of them, we introduce the new module of grid self-attention for alleviating the memory bottleneck further, in order to scale up TransGAN to high-resolution generation. We also develop a unique training recipe including a series of techniques that can mitigate the training instability issues of TransGAN, such as data augmentation, modified normalization, and relative position encoding. Our best architecture achieves highly competitive performance compared to current state-of-the-art GANs using convolutional backbones. Specifically, TransGAN sets new state-of-the-art inception score of 10.43 and FID of 18.28 on STL-10, outperforming StyleGAN-V2. When it comes to higher-resolution (e.g. 256 x 256) generation tasks, such as on CelebA-HQ and LSUN-Church, TransGAN continues to produce diverse visual examples with high fidelity and impressive texture details. In addition, we dive deep into the transformer-based generation models to understand how their behaviors differ from convolutional ones, by visualizing training dynamics. The code is available at https://github.com/VITA-Group/TransGAN.
    Large-Scale Unsupervised Object Discovery. (arXiv:2106.06650v1 [cs.CV])
    (2 min) Existing approaches to unsupervised object discovery (UOD) do not scale up to large datasets without approximations which compromise their performance. We propose a novel formulation of UOD as a ranking problem, amenable to the arsenal of distributed methods available for eigenvalue problems and link analysis. Extensive experiments with COCO and OpenImages demonstrate that, in the single-object discovery setting where a single prominent object is sought in each image, the proposed LOD (Large-scale Object Discovery) approach is on par with, or better than the state of the art for medium-scale datasets (up to 120K images), and over 37% better than the only other algorithms capable of scaling up to 1.7M images. In the multi-object discovery setting where multiple objects are sought in each image, the proposed LOD is over 14% better in average precision (AP) than all other methods for datasets ranging from 20K to 1.7M images.
    Representation and Correlation Enhanced Encoder-Decoder Framework for Scene Text Recognition. (arXiv:2106.06960v1 [cs.CV])
    (2 min) Attention-based encoder-decoder framework is widely used in the scene text recognition task. However, for the current state-of-the-art(SOTA) methods, there is room for improvement in terms of the efficient usage of local visual and global context information of the input text image, as well as the robust correlation between the scene processing module(encoder) and the text processing module(decoder). In this paper, we propose a Representation and Correlation Enhanced Encoder-Decoder Framework(RCEED) to address these deficiencies and break performance bottleneck. In the encoder module, local visual feature, global context feature, and position information are aligned and fused to generate a small-size comprehensive feature map. In the decoder module, two methods are utilized to enhance the correlation between scene and text feature space. 1) The decoder initialization is guided by the holistic feature and global glimpse vector exported from the encoder. 2) The feature enriched glimpse vector produced by the Multi-Head General Attention is used to assist the RNN iteration and the character prediction at each time step. Meanwhile, we also design a Layernorm-Dropout LSTM cell to improve model's generalization towards changeable texts. Extensive experiments on the benchmarks demonstrate the advantageous performance of RCEED in scene text recognition tasks, especially the irregular ones.
    Contrastive Attention for Automatic Chest X-ray Report Generation. (arXiv:2106.06965v1 [cs.CV])
    (2 min) Recently, chest X-ray report generation, which aims to automatically generate descriptions of given chest X-ray images, has received growing research interests. The key challenge of chest X-ray report generation is to accurately capture and describe the abnormal regions. In most cases, the normal regions dominate the entire chest X-ray image, and the corresponding descriptions of these normal regions dominate the final report. Due to such data bias, learning-based models may fail to attend to abnormal regions. In this work, to effectively capture and describe abnormal regions, we propose the Contrastive Attention (CA) model. Instead of solely focusing on the current input image, the CA model compares the current input image with normal images to distill the contrastive information. The acquired contrastive information can better represent the visual features of abnormal regions. According to the experiments on the public IU-X-ray and MIMIC-CXR datasets, incorporating our CA into several existing models can boost their performance across most metrics. In addition, according to the analysis, the CA model can help existing models better attend to the abnormal regions and provide more accurate descriptions which are crucial for an interpretable diagnosis. Specifically, we achieve the state-of-the-art results on the two public datasets.
    Entropy-based Logic Explanations of Neural Networks. (arXiv:2106.06804v1 [cs.AI])
    (2 min) Explainable artificial intelligence has rapidly emerged since lawmakers have started requiring interpretable models for safety-critical domains. Concept-based neural networks have arisen as explainable-by-design methods as they leverage human-understandable symbols (i.e. concepts) to predict class memberships. However, most of these approaches focus on the identification of the most relevant concepts but do not provide concise, formal explanations of how such concepts are leveraged by the classifier to make predictions. In this paper, we propose a novel end-to-end differentiable approach enabling the extraction of logic explanations from neural networks using the formalism of First-Order Logic. The method relies on an entropy-based criterion which automatically identifies the most relevant concepts. We consider four different case studies to demonstrate that: (i) this entropy-based criterion enables the distillation of concise logic explanations in safety-critical domains from clinical data to computer vision; (ii) the proposed approach outperforms state-of-the-art white-box models in terms of classification accuracy.
    Anisotropic Stroke Control for Multiple Artists Style Transfer. (arXiv:2010.08175v2 [cs.CV] UPDATED)
    (2 min) Though significant progress has been made in artistic style transfer, semantic information is usually difficult to be preserved in a fine-grained locally consistent manner by most existing methods, especially when multiple artists styles are required to transfer within one single model. To circumvent this issue, we propose a Stroke Control Multi-Artist Style Transfer framework. On the one hand, we develop a multi-condition single-generator structure which first performs multi-artist style transfer. On the one hand, we design an Anisotropic Stroke Module (ASM) which realizes the dynamic adjustment of style-stroke between the non-trivial and the trivial regions. ASM endows the network with the ability of adaptive semantic-consistency among various styles. On the other hand, we present an novel Multi-Scale Projection Discriminator} to realize the texture-level conditional generation. In contrast to the single-scale conditional discriminator, our discriminator is able to capture multi-scale texture clue to effectively distinguish a wide range of artistic styles. Extensive experimental results well demonstrate the feasibility and effectiveness of our approach. Our framework can transform a photograph into different artistic style oil painting via only ONE single model. Furthermore, the results are with distinctive artistic style and retain the anisotropic semantic information. The code is already available on github: https://github.com/neuralchen/ASMAGAN.
    Information Obfuscation of Graph Neural Networks. (arXiv:2009.13504v5 [cs.LG] UPDATED)
    (2 min) While the advent of Graph Neural Networks (GNNs) has greatly improved node and graph representation learning in many applications, the neighborhood aggregation scheme exposes additional vulnerabilities to adversaries seeking to extract node-level information about sensitive attributes. In this paper, we study the problem of protecting sensitive attributes by information obfuscation when learning with graph structured data. We propose a framework to locally filter out pre-determined sensitive attributes via adversarial training with the total variation and the Wasserstein distance. Our method creates a strong defense against inference attacks, while only suffering small loss in task performance. Theoretically, we analyze the effectiveness of our framework against a worst-case adversary, and characterize an inherent trade-off between maximizing predictive accuracy and minimizing information leakage. Experiments across multiple datasets from recommender systems, knowledge graphs and quantum chemistry demonstrate that the proposed approach provides a robust defense across various graph structures and tasks, while producing competitive GNN encoders for downstream tasks.
    Learning from Crowds by Modeling Common Confusions. (arXiv:2012.13052v2 [cs.LG] UPDATED)
    (2 min) Crowdsourcing provides a practical way to obtain large amounts of labeled data at a low cost. However, the annotation quality of annotators varies considerably, which imposes new challenges in learning a high-quality model from the crowdsourced annotations. In this work, we provide a new perspective to decompose annotation noise into common noise and individual noise and differentiate the source of confusion based on instance difficulty and annotator expertise on a per-instance-annotator basis. We realize this new crowdsourcing model by an end-to-end learning solution with two types of noise adaptation layers: one is shared across annotators to capture their commonly shared confusions, and the other one is pertaining to each annotator to realize individual confusion. To recognize the source of noise in each annotation, we use an auxiliary network to choose the two noise adaptation layers with respect to both instances and annotators. Extensive experiments on both synthesized and real-world benchmarks demonstrate the effectiveness of our proposed common noise adaptation solution.
    PMP-Net: Point Cloud Completion by Learning Multi-step Point Moving Paths. (arXiv:2012.03408v3 [cs.CV] UPDATED)
    (2 min) The task of point cloud completion aims to predict the missing part for an incomplete 3D shape. A widely used strategy is to generate a complete point cloud from the incomplete one. However, the unordered nature of point clouds will degrade the generation of high-quality 3D shapes, as the detailed topology and structure of discrete points are hard to be captured by the generative process only using a latent code. In this paper, we address the above problem by reconsidering the completion task from a new perspective, where we formulate the prediction as a point cloud deformation process. Specifically, we design a novel neural network, named PMP-Net, to mimic the behavior of an earth mover. It moves each point of the incomplete input to complete the point cloud, where the total distance of point moving paths (PMP) should be shortest. Therefore, PMP-Net predicts a unique point moving path for each point according to the constraint of total point moving distances. As a result, the network learns a strict and unique correspondence on point-level, which can capture the detailed topology and structure relationships between the incomplete shape and the complete target, and thus improves the quality of the predicted complete shape. We conduct comprehensive experiments on Completion3D and PCN datasets, which demonstrate our advantages over the state-of-the-art point cloud completion methods.
    Multi-Disease Classification of 13,667 Body CT Scans Using Weakly Supervised Deep Learning. (arXiv:2008.01158v2 [cs.CV] UPDATED)
    (2 min) Background: Training deep learning classifiers typically requires massive amounts of manual annotation. Weak supervision may leverage existing medical data to classify multiple diseases and organ systems. Purpose: To design multi-disease classifiers for body computed tomography (CT) scans using automatically extracted labels from radiology text reports. Materials & Methods: This retrospective study deployed rule-based algorithms to extract 19,255 disease labels from reports of 13,667 body CT scans of 12,092 subjects for training. Using a 3D DenseVNet, three organ systems were segmented: lungs/pleura, liver/gallbladder, and kidneys/ureters. For each organ, a 3D convolutional neural network classified normality versus four common diseases. Testing was performed on an additional 2,158 CT volumes relative to 2,875 manually derived reference labels. Results: Manual validation of the extracted labels confirmed 91 to 99% accuracy. Performance using the receiver operating characteristic area under the curve (AUC) for lungs/pleura labels were as follows: atelectasis 0.77 (95% CI: 0.74 to 0.81), nodule 0.65 (0.61 to 0.69), emphysema 0.89 (0.86 to 0.92), effusion 0.97 (0.96 to 0.98), and normal 0.89 (0.87 to 0.91). For liver/gallbladder: stone 0.62 (0.56 to 0.67), lesion 0.73 (0.69 to 0.77), dilation 0.87 (0.84 to 0.90), fatty 0.89 (0.86 to 0.92), and normal 0.82 (0.78 to 0.85). For kidneys/ureters: stone 0.83 (0.79 to 0.87), atrophy 0.92 (0.89 to 0.94), lesion 0.68 (0.64 to 0.72), cyst 0.70 (0.66 to 0.73), and normal 0.79 (0.75 to 0.83). Conclusion: Weakly supervised deep learning classifiers leveraged massive amounts of unannotated body CT data to classify multiple organ systems and diverse diseases.
    Do Not Escape From the Manifold: Discovering the Local Coordinates on the Latent Space of GANs. (arXiv:2106.06959v1 [cs.CV])
    (2 min) In this paper, we propose a method to find local-geometry-aware traversal directions on the intermediate latent space of Generative Adversarial Networks (GANs). These directions are defined as an ordered basis of tangent space at a latent code. Motivated by the intrinsic sparsity of the latent space, the basis is discovered by solving the low-rank approximation problem of the differential of the partial network. Moreover, the local traversal basis leads to a natural iterative traversal on the latent space. Iterative Curve-Traversal shows stable traversal on images, since the trajectory of latent code stays close to the latent space even under the strong perturbations compared to the linear traversal. This stability provides far more diverse variations of the given image. Although the proposed method can be applied to various GAN models, we focus on the W-space of the StyleGAN2, which is renowned for showing the better disentanglement of the latent factors of variation. Our quantitative and qualitative analysis provides evidence showing that the W-space is still globally warped while showing a certain degree of global consistency of interpretable variation. In particular, we introduce some metrics on the Grassmannian manifolds to quantify the global warpage of the W-space and the subspace traversal to test the stability of traversal directions.
    Cycle4Completion: Unpaired Point Cloud Completion using Cycle Transformation with Missing Region Coding. (arXiv:2103.07838v2 [cs.CV] UPDATED)
    (2 min) In this paper, we present a novel unpaired point cloud completion network, named Cycle4Completion, to infer the complete geometries from a partial 3D object. Previous unpaired completion methods merely focus on the learning of geometric correspondence from incomplete shapes to complete shapes, and ignore the learning in the reverse direction, which makes them suffer from low completion accuracy due to the limited 3D shape understanding ability. To address this problem, we propose two simultaneous cycle transformations between the latent spaces of complete shapes and incomplete ones. The insight of cycle transformation is to promote networks to understand 3D shapes by learning to generate complete or incomplete shapes from their complementary ones. Specifically, the first cycle transforms shapes from incomplete domain to complete domain, and then projects them back to the incomplete domain. This process learns the geometric characteristic of complete shapes, and maintains the shape consistency between the complete prediction and the incomplete input. Similarly, the inverse cycle transformation starts from complete domain to incomplete domain, and goes back to complete domain to learn the characteristic of incomplete shapes. We provide a comprehensive evaluation in experiments, which shows that our model with the learned bidirectional geometry correspondence outperforms state-of-the-art unpaired completion methods.
    Using Convolutional Neural Networks for the Helicity Classification of Magnetic Fields. (arXiv:2106.06718v1 [astro-ph.HE])
    (2 min) The presence of non-zero helicity in intergalactic magnetic fields is a smoking gun for their primordial origin since they have to be generated by processes that break CP invariance. As an experimental signature for the presence of helical magnetic fields, an estimator $Q$ based on the triple scalar product of the wave-vectors of photons generated in electromagnetic cascades from, e.g., TeV blazars, has been suggested previously. We propose to apply deep learning to helicity classification employing Convolutional Neural Networks and show that this method outperforms the $Q$ estimator.
    Cluster-to-Conquer: A Framework for End-to-End Multi-Instance Learning for Whole Slide Image Classification. (arXiv:2103.10626v2 [eess.IV] UPDATED)
    (2 min) In recent years, the availability of digitized Whole Slide Images (WSIs) has enabled the use of deep learning-based computer vision techniques for automated disease diagnosis. However, WSIs present unique computational and algorithmic challenges. WSIs are gigapixel-sized ($\sim$100K pixels), making them infeasible to be used directly for training deep neural networks. Also, often only slide-level labels are available for training as detailed annotations are tedious and can be time-consuming for experts. Approaches using multiple-instance learning (MIL) frameworks have been shown to overcome these challenges. Current state-of-the-art approaches divide the learning framework into two decoupled parts: a convolutional neural network (CNN) for encoding the patches followed by an independent aggregation approach for slide-level prediction. In this approach, the aggregation step has no bearing on the representations learned by the CNN encoder. We have proposed an end-to-end framework that clusters the patches from a WSI into ${k}$-groups, samples ${k}'$ patches from each group for training, and uses an adaptive attention mechanism for slide level prediction; Cluster-to-Conquer (C2C). We have demonstrated that dividing a WSI into clusters can improve the model training by exposing it to diverse discriminative features extracted from the patches. We regularized the clustering mechanism by introducing a KL-divergence loss between the attention weights of patches in a cluster and the uniform distribution. The framework is optimized end-to-end on slide-level cross-entropy, patch-level cross-entropy, and KL-divergence loss (Implementation: https://github.com/YashSharma/C2C).
    On The Radon-Nikodym Spectral Approach With Optimal Clustering. (arXiv:1906.00460v16 [cs.LG] UPDATED)
    (3 min) Problems of interpolation, classification, and clustering are considered. In the tenets of Radon--Nikodym approach $\langle f(\mathbf{x})\psi^2 \rangle / \langle\psi^2\rangle$, where the $\psi(\mathbf{x})$ is a linear function on input attributes, all the answers are obtained from a generalized eigenproblem $|f|\psi^{[i]}\rangle = \lambda^{[i]} |\psi^{[i]}\rangle$. The solution to the interpolation problem is a regular Radon-Nikodym derivative. The solution to the classification problem requires prior and posterior probabilities that are obtained using the Lebesgue quadrature[1] technique. Whereas in a Bayesian approach new observations change only outcome probabilities, in the Radon-Nikodym approach not only outcome probabilities but also the probability space $|\psi^{[i]}\rangle$ change with new observations. This is a remarkable feature of the approach: both the probabilities and the probability space are constructed from the data. The Lebesgue quadrature technique can be also applied to the optimal clustering problem. The problem is solved by constructing a Gaussian quadrature on the Lebesgue measure. A distinguishing feature of the Radon-Nikodym approach is the knowledge of the invariant group: all the answers are invariant relatively any non-degenerated linear transform of input vector $\mathbf{x}$ components. A software product implementing the algorithms of interpolation, classification, and optimal clustering is available from the authors.
    Towards annotation-efficient segmentation via image-to-image translation. (arXiv:1904.01636v4 [cs.CV] UPDATED)
    (2 min) Often in medical imaging, it is prohibitively challenging to produce enough boundary annotations to train deep neural networks for accurate tumor segmentation. We propose the use of weak labels about whether an image presents tumor or whether it is absent to extend training over images that lack these annotations. Specifically, we propose a semi-supervised framework that employs unpaired image-to-image translation between two domains, presence vs. absence of cancer, as the unsupervised objective. We conjecture that translation helps segmentation -- both require the target to be separated from the background. We encode images into two codes: one that is common to both domains and one that is unique to the presence domain. Decoding from the common code yields healthy images; decoding with the addition of the unique code produces a residual change to this image that adds cancer. Translation proceeds from presence to absence and vice versa. In the first case, the tumor is re-added to the image and we successfully exploit the residual decoder to also perform segmentation. In the second case, unique codes are sampled, producing a distribution of possible tumors. To validate the method, we created challenging synthetic tasks and tumor segmentation datasets from public BRATS (brain, MRI) and LitS (liver, CT) datasets. We show a clear improvement (0.83 Dice on brain, 0.74 on liver) over baseline semi-supervised training with autoencoding (0.73, 0.66) and a mean teacher approach (0.75, 0.69), demonstrating the ability to generalize from smaller distributions of annotated samples.
    DyGLIP: A Dynamic Graph Model with Link Prediction for Accurate Multi-Camera Multiple Object Tracking. (arXiv:2106.06856v1 [cs.CV])
    (2 min) Multi-Camera Multiple Object Tracking (MC-MOT) is a significant computer vision problem due to its emerging applicability in several real-world applications. Despite a large number of existing works, solving the data association problem in any MC-MOT pipeline is arguably one of the most challenging tasks. Developing a robust MC-MOT system, however, is still highly challenging due to many practical issues such as inconsistent lighting conditions, varying object movement patterns, or the trajectory occlusions of the objects between the cameras. To address these problems, this work, therefore, proposes a new Dynamic Graph Model with Link Prediction (DyGLIP) approach to solve the data association task. Compared to existing methods, our new model offers several advantages, including better feature representations and the ability to recover from lost tracks during camera transitions. Moreover, our model works gracefully regardless of the overlapping ratios between the cameras. Experimental results show that we outperform existing MC-MOT algorithms by a large margin on several practical datasets. Notably, our model works favorably on online settings but can be extended to an incremental approach for large-scale datasets.
    A Novel Interaction-based Methodology Towards Explainable AI with Better Understanding of Pneumonia Chest X-ray Images. (arXiv:2104.12672v2 [cs.LG] UPDATED)
    (2 min) In the field of eXplainable AI (XAI), robust ``blackbox'' algorithms such as Convolutional Neural Networks (CNNs) are known for making high prediction performance. However, the ability to explain and interpret these algorithms still require innovation in the understanding of influential and, more importantly, explainable features that directly or indirectly impact the performance of predictivity. A number of methods existing in literature focus on visualization techniques but the concepts of explainability and interpretability still require rigorous definition. In view of the above needs, this paper proposes an interaction-based methodology -- Influence Score (I-score) -- to screen out the noisy and non-informative variables in the images hence it nourishes an environment with explainable and interpretable features that are directly associated to feature predictivity. We apply the proposed method on a real world application in Pneumonia Chest X-ray Image data set and produced state-of-the-art results. We demonstrate how to apply the proposed approach for more general big data problems by improving the explainability and interpretability without sacrificing the prediction performance. The contribution of this paper opens a novel angle that moves the community closer to the future pipelines of XAI problems.
    Toward Understanding the Feature Learning Process of Self-supervised Contrastive Learning. (arXiv:2105.15134v2 [cs.LG] UPDATED)
    (2 min) How can neural networks trained by contrastive learning extract features from the unlabeled data? Why does contrastive learning usually need much stronger data augmentations than supervised learning to ensure good representations? These questions involve both the optimization and statistical aspects of deep learning, but can hardly be answered by analyzing supervised learning, where the target functions are the highest pursuit. Indeed, in self-supervised learning, it is inevitable to relate to the optimization/generalization of neural networks to how they can encode the latent structures in the data, which we refer to as the feature learning process. In this work, we formally study how contrastive learning learns the feature representations for neural networks by analyzing its feature learning process. We consider the case where our data are comprised of two types of features: the more semantically aligned sparse features which we want to learn from, and the other dense features we want to avoid. Theoretically, we prove that contrastive learning using $\mathbf{ReLU}$ networks provably learns the desired sparse features if proper augmentations are adopted. We present an underlying principle called $\textbf{feature decoupling}$ to explain the effects of augmentations, where we theoretically characterize how augmentations can reduce the correlations of dense features between positive samples while keeping the correlations of sparse features intact, thereby forcing the neural networks to learn from the self-supervision of sparse features. Empirically, we verified that the feature decoupling principle matches the underlying mechanism of contrastive learning in practice.
    Unsupervised Place Recognition with Deep Embedding Learning over Radar Videos. (arXiv:2106.06703v1 [cs.CV])
    (2 min) We learn, in an unsupervised way, an embedding from sequences of radar images that is suitable for solving place recognition problem using complex radar data. We experiment on 280 km of data and show performance exceeding state-of-the-art supervised approaches, localising correctly 98.38% of the time when using just the nearest database candidate.
    SPADE: A Spectral Method for Black-Box Adversarial Robustness Evaluation. (arXiv:2102.03716v3 [cs.LG] UPDATED)
    (2 min) A black-box spectral method is introduced for evaluating the adversarial robustness of a given machine learning (ML) model. Our approach, named SPADE, exploits bijective distance mapping between the input/output graphs constructed for approximating the manifolds corresponding to the input/output data. By leveraging the generalized Courant-Fischer theorem, we propose a SPADE score for evaluating the adversarial robustness of a given model, which is proved to be an upper bound of the best Lipschitz constant under the manifold setting. To reveal the most non-robust data samples highly vulnerable to adversarial attacks, we develop a spectral graph embedding procedure leveraging dominant generalized eigenvectors. This embedding step allows assigning each data sample a robustness score that can be further harnessed for more effective adversarial training. Our experiments show the proposed SPADE method leads to promising empirical results for neural network models that are adversarially trained with the MNIST and CIFAR-10 data sets.
    Shared Cross-Modal Trajectory Prediction for Autonomous Driving. (arXiv:2004.00202v3 [cs.CV] UPDATED)
    (2 min) Predicting future trajectories of traffic agents in highly interactive environments is an essential and challenging problem for the safe operation of autonomous driving systems. On the basis of the fact that self-driving vehicles are equipped with various types of sensors (e.g., LiDAR scanner, RGB camera, radar, etc.), we propose a Cross-Modal Embedding framework that aims to benefit from the use of multiple input modalities. At training time, our model learns to embed a set of complementary features in a shared latent space by jointly optimizing the objective functions across different types of input data. At test time, a single input modality (e.g., LiDAR data) is required to generate predictions from the input perspective (i.e., in the LiDAR space), while taking advantages from the model trained with multiple sensor modalities. An extensive evaluation is conducted to show the efficacy of the proposed framework using two benchmark driving datasets.
    Alpha Matte Generation from Single Input for Portrait Matting. (arXiv:2106.03210v2 [cs.CV] UPDATED)
    (2 min) Portrait matting is an important research problem with a wide range of applications, such as video conference app, image/video editing, and post-production. The goal is to predict an alpha matte that identifies the effect of each pixel on the foreground subject. Traditional approaches and most of the existing works utilized an additional input, e.g., trimap, background image, to predict alpha matte. However, providing additional input is not always practical. Besides, models are too sensitive to these additional inputs. In this paper, we introduce an additional input-free approach to perform portrait matting using Generative Adversarial Nets (GANs). We divide the main task into two subtasks. For this, we propose a segmentation network for the person segmentation and the alpha generation network for alpha matte prediction. While the segmentation network takes an input image and produces a coarse segmentation map, the alpha generation network utilizes the same input image as well as a coarse segmentation map that is produced by the segmentation network to predict the alpha matte. Besides, we present a segmentation encoding block to downsample the coarse segmentation map and provide feature representation to the residual block. Furthermore, we propose border loss to penalize only the borders of the subject separately which is more likely to be challenging and we also adapt perceptual loss for portrait matting. To train the proposed system, we combine two different popular training datasets to improve the amount of data as well as diversity to address domain shift problems in the inference time. We tested our model on three different benchmark datasets, namely Adobe Image Matting dataset, Portrait Matting dataset, and Distinctions dataset. The proposed method outperformed the MODNet method that also takes a single input.
    Compression of Deep Learning Models for Text: A Survey. (arXiv:2008.05221v4 [cs.CL] UPDATED)
    (2 min) In recent years, the fields of natural language processing (NLP) and information retrieval (IR) have made tremendous progress thanksto deep learning models like Recurrent Neural Networks (RNNs), Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTMs)networks, and Transformer [120] based models like Bidirectional Encoder Representations from Transformers (BERT) [24], GenerativePre-training Transformer (GPT-2) [94], Multi-task Deep Neural Network (MT-DNN) [73], Extra-Long Network (XLNet) [134], Text-to-text transfer transformer (T5) [95], T-NLG [98] and GShard [63]. But these models are humongous in size. On the other hand,real world applications demand small model size, low response times and low computational power wattage. In this survey, wediscuss six different types of methods (Pruning, Quantization, Knowledge Distillation, Parameter Sharing, Tensor Decomposition, andSub-quadratic Transformer based methods) for compression of such models to enable their deployment in real industry NLP projects.Given the critical need of building applications with efficient and small models, and the large amount of recently published work inthis area, we believe that this survey organizes the plethora of work done by the 'deep learning for NLP' community in the past fewyears and presents it as a coherent story.
    Reconstruction of turbulent data with deep generative models for semantic inpainting from TURB-Rot database. (arXiv:2006.09179v2 [physics.flu-dyn] UPDATED)
    (2 min) We study the applicability of tools developed by the computer vision community for features learning and semantic image inpainting to perform data reconstruction of fluid turbulence configurations. The aim is twofold. First, we explore on a quantitative basis, the capability of Convolutional Neural Networks embedded in a Deep Generative Adversarial Model (Deep-GAN) to generate missing data in turbulence, a paradigmatic high dimensional chaotic system. In particular, we investigate their use in reconstructing two-dimensional damaged snapshots extracted from a large database of numerical configurations of 3d turbulence in the presence of rotation, a case with multi-scale random features where both large-scale organised structures and small-scale highly intermittent and non-Gaussian fluctuations are present. Second, following a reverse engineering approach, we aim to rank the input flow properties (features) in terms of their qualitative and quantitative importance to obtain a better set of reconstructed fields. We present two approaches both based on Context Encoders. The first one infers the missing data via a minimization of the L2 pixel-wise reconstruction loss, plus a small adversarial penalisation. The second searches for the closest encoding of the corrupted flow configuration from a previously trained generator. Finally, we present a comparison with a different data assimilation tool, based on Nudging, an equation-informed unbiased protocol, well known in the numerical weather prediction community. The TURB-Rot database, this http URL, of roughly 300K 2d turbulent images is released and details on how to download it are given.
    D3DLO: Deep 3D LiDAR Odometry. (arXiv:2101.12242v2 [cs.CV] UPDATED)
    (2 min) LiDAR odometry (LO) describes the task of finding an alignment of subsequent LiDAR point clouds. This alignment can be used to estimate the motion of the platform where the LiDAR sensor is mounted on. Currently, on the well-known KITTI Vision Benchmark Suite state-of-the-art algorithms are non-learning approaches. We propose a network architecture that learns LO by directly processing 3D point clouds. It is trained on the KITTI dataset in an end-to-end manner without the necessity of pre-defining corresponding pairs of points. An evaluation on the KITTI Vision Benchmark Suite shows similar performance to a previously published work, DeepCLR [1], even though our model uses only around 3.56% of the number of network parameters thereof. Furthermore, a plane point extraction is applied which leads to a marginal performance decrease while simultaneously reducing the input size by up to 50%.
    Probabilistic Embeddings for Cross-Modal Retrieval. (arXiv:2101.05068v2 [cs.CV] UPDATED)
    (2 min) Cross-modal retrieval methods build a common representation space for samples from multiple modalities, typically from the vision and the language domains. For images and their captions, the multiplicity of the correspondences makes the task particularly challenging. Given an image (respectively a caption), there are multiple captions (respectively images) that equally make sense. In this paper, we argue that deterministic functions are not sufficiently powerful to capture such one-to-many correspondences. Instead, we propose to use Probabilistic Cross-Modal Embedding (PCME), where samples from the different modalities are represented as probabilistic distributions in the common embedding space. Since common benchmarks such as COCO suffer from non-exhaustive annotations for cross-modal matches, we propose to additionally evaluate retrieval on the CUB dataset, a smaller yet clean database where all possible image-caption pairs are annotated. We extensively ablate PCME and demonstrate that it not only improves the retrieval performance over its deterministic counterpart but also provides uncertainty estimates that render the embeddings more interpretable. Code is available at https://github.com/naver-ai/pcme
    An Approach Towards Physics Informed Lung Ultrasound Image Scoring Neural Network for Diagnostic Assistance in COVID-19. (arXiv:2106.06980v1 [eess.IV])
    (3 min) Ultrasound is fast becoming an inevitable diagnostic tool for regular and continuous monitoring of the lung with the recent outbreak of COVID-19. In this work, a novel approach is presented to extract acoustic propagation-based features to automatically highlight the region below pleura, which is an important landmark in lung ultrasound (LUS). Subsequently, a multichannel input formed by using the acoustic physics-based feature maps is fused to train a neural network, referred to as LUSNet, to classify the LUS images into five classes of varying severity of lung infection to track the progression of COVID-19. In order to ensure that the proposed approach is agnostic to the type of acquisition, the LUSNet, which consists of a U-net architecture is trained in an unsupervised manner with the acoustic feature maps to ensure that the encoder-decoder architecture is learning features in the pleural region of interest. A novel combination of the U-net output and the U-net encoder output is employed for the classification of severity of infection in the lung. A detailed analysis of the proposed approach on LUS images over the infection to full recovery period of ten confirmed COVID-19 subjects shows an average five-fold cross-validation accuracy, sensitivity, and specificity of 97%, 93%, and 98% respectively over 5000 frames of COVID-19 videos. The analysis also shows that, when the input dataset is limited and diverse as in the case of COVID-19 pandemic, an aided effort of combining acoustic propagation-based features along with the gray scale images, as proposed in this work, improves the performance of the neural network significantly and also aids the labelling and triaging process.
    Multi-Scale Hourglass Hierarchical Fusion Network for Single Image Deraining. (arXiv:2104.12100v2 [eess.IV] UPDATED)
    (2 min) Rain streaks bring serious blurring and visual quality degradation, which often vary in size, direction and density. Current CNN-based methods achieve encouraging performance, while are limited to depict rain characteristics and recover image details in the poor visibility environment. To address these issues, we present a Multi-scale Hourglass Hierarchical Fusion Network (MH2F-Net) in end-to-end manner, to exactly captures rain streak features with multi-scale extraction, hierarchical distillation and information aggregation. For better extracting the features, a novel Multi-scale Hourglass Extraction Block (MHEB) is proposed to get local and global features across different scales through down- and up-sample process. Besides, a Hierarchical Attentive Distillation Block (HADB) then employs the dual attention feature responses to adaptively recalibrate the hierarchical features and eliminate the redundant ones. Further, we introduce a Residual Projected Feature Fusion (RPFF) strategy to progressively discriminate feature learning and aggregate different features instead of directly concatenating or adding. Extensive experiments on both synthetic and real rainy datasets demonstrate the effectiveness of the designed MH2F-Net by comparing with recent state-of-the-art deraining algorithms. Our source code will be available on the GitHub: https://github.com/cxtalk/MH2F-Net.
    Neural Descent for Visual 3D Human Pose and Shape. (arXiv:2008.06910v2 [cs.CV] UPDATED)
    (2 min) We present deep neural network methodology to reconstruct the 3d pose and shape of people, given an input RGB image. We rely on a recently introduced, expressivefull body statistical 3d human model, GHUM, trained end-to-end, and learn to reconstruct its pose and shape state in a self-supervised regime. Central to our methodology, is a learning to learn and optimize approach, referred to as HUmanNeural Descent (HUND), which avoids both second-order differentiation when training the model parameters,and expensive state gradient descent in order to accurately minimize a semantic differentiable rendering loss at test time. Instead, we rely on novel recurrent stages to update the pose and shape parameters such that not only losses are minimized effectively, but the process is meta-regularized in order to ensure end-progress. HUND's symmetry between training and testing makes it the first 3d human sensing architecture to natively support different operating regimes including self-supervised ones. In diverse tests, we show that HUND achieves very competitive results in datasets like H3.6M and 3DPW, aswell as good quality 3d reconstructions for complex imagery collected in-the-wild.
    Uncovering the Connections Between Adversarial Transferability and Knowledge Transferability. (arXiv:2006.14512v2 [cs.LG] UPDATED)
    (2 min) Knowledge transferability, or transfer learning, has been widely adopted to allow a pre-trained model in the source domain to be effectively adapted to downstream tasks in the target domain. It is thus important to explore and understand the factors affecting knowledge transferability. In this paper, as the first work, we analyze and demonstrate the connections between knowledge transferability and another important phenomenon--adversarial transferability, \emph{i.e.}, adversarial examples generated against one model can be transferred to attack other models. Our theoretical studies show that adversarial transferability indicates knowledge transferability and vice versa. Moreover, based on the theoretical insights, we propose two practical adversarial transferability metrics to characterize this process, serving as bidirectional indicators between adversarial and knowledge transferability. We conduct extensive experiments for different scenarios on diverse datasets, showing a positive correlation between adversarial transferability and knowledge transferability. Our findings will shed light on future research about effective knowledge transfer learning and adversarial transferability analyses.
    Fast and Robust Certifiable Estimation of the Relative Pose Between Two Calibrated Cameras. (arXiv:2101.08524v2 [cs.CV] UPDATED)
    (2 min) This work contributes an efficient algorithm to compute the Relative Pose problem (RPp) between calibrated cameras and certify the optimality of the solution, given a set of pair-wise feature correspondences affected by noise and probably corrupted by wrong matches. We propose a family of certifiers that is shown to increase the ratio of detected optimal solutions. This set of certifiers is incorporated into a fast essential matrix estimation pipeline that, given any initial guess for the RPp, refines it iteratively on the product space of 3D rotations and 2-sphere. In addition, this fast certifiable pipeline is integrated into a robust framework that combines Graduated Non-convexity and the Black-Rangarajan duality between robust functions and line processes. We proved through extensive experiments on synthetic and real data that the proposed framework provides a fast and robust relative pose estimation. We make the code publicly available \url{https://github.com/mergarsal/FastCertRelPose.git}.
    PVRED: A Position-Velocity Recurrent Encoder-Decoder for Human Motion Prediction. (arXiv:1906.06514v2 [cs.CV] UPDATED)
    (2 min) Human motion prediction, which aims to predict future human poses given past poses, has recently seen increased interest. Many recent approaches are based on Recurrent Neural Networks (RNN) which model human poses with exponential maps. These approaches neglect the pose velocity as well as temporal relation of different poses, and tend to converge to the mean pose or fail to generate natural-looking poses. We therefore propose a novel Position-Velocity Recurrent Encoder-Decoder (PVRED) for human motion prediction, which makes full use of pose velocities and temporal positional information. A temporal position embedding method is presented and a Position-Velocity RNN (PVRNN) is proposed. We also emphasize the benefits of quaternion parameterization of poses and design a novel trainable Quaternion Transformation (QT) layer, which is combined with a robust loss function during training. We provide quantitative results for both short-term prediction in the future 0.5 seconds and long-term prediction in the future 0.5 to 1 seconds. Experiments on several benchmarks show that our approach considerably outperforms the state-of-the-art methods. In addition, qualitative visualizations in the future 4 seconds show that our approach could predict future human-like and meaningful poses in very long time horizons. Code is publicly available on GitHub: \textcolor{red}{https://github.com/hongsong-wang/PVRNN}.
    Adaptive Dynamic Pruning for Non-IID Federated Learning. (arXiv:2106.06921v1 [cs.LG])
    (2 min) Federated Learning~(FL) has emerged as a new paradigm of training machine learning models without sacrificing data security and privacy. Learning models at edge devices such as cell phones is one of the most common use case of FL. However, the limited computing power and energy constraints of edge devices hinder the adoption of FL for both model training and deployment, especially for the resource-hungry Deep Neural Networks~(DNNs). To this end, many model compression methods have been proposed and network pruning is among the most well-known. However, a pruning policy for a given model is highly dataset-dependent, which is not suitable for non-Independent and Identically Distributed~(Non-IID) FL edge devices. In this paper, we present an adaptive pruning scheme for edge devices in an FL system, which applies dataset-aware dynamic pruning for inference acceleration on Non-IID datasets. Our evaluation shows that the proposed method accelerates inference by $2\times$~($50\%$ FLOPs reduction) while maintaining the model's quality on edge devices.
    The Spatio-Temporal Poisson Point Process: A Simple Model for the Alignment of Event Camera Data. (arXiv:2106.06887v1 [cs.CV])
    (2 min) Event cameras, inspired by biological vision systems, provide a natural and data efficient representation of visual information. Visual information is acquired in the form of events that are triggered by local brightness changes. Each pixel location of the camera's sensor records events asynchronously and independently with very high temporal resolution. However, because most brightness changes are triggered by relative motion of the camera and the scene, the events recorded at a single sensor location seldom correspond to the same world point. To extract meaningful information from event cameras, it is helpful to register events that were triggered by the same underlying world point. In this work we propose a new model of event data that captures its natural spatio-temporal structure. We start by developing a model for aligned event data. That is, we develop a model for the data as though it has been perfectly registered already. In particular, we model the aligned data as a spatio-temporal Poisson point process. Based on this model, we develop a maximum likelihood approach to registering events that are not yet aligned. That is, we find transformations of the observed events that make them as likely as possible under our model. In particular we extract the camera rotation that leads to the best event alignment. We show new state of the art accuracy for rotational velocity estimation on the DAVIS 240C dataset. In addition, our method is also faster and has lower computational complexity than several competing methods.
    Red Alarm for Pre-trained Models: Universal Vulnerability to Neuron-Level Backdoor Attacks. (arXiv:2101.06969v3 [cs.CL] UPDATED)
    (2 min) Pre-trained models (PTMs) have been widely used in various downstream tasks. The parameters of PTMs are distributed on the Internet and may suffer backdoor attacks. In this work, we demonstrate the universal vulnerability of PTMs, where fine-tuned PTMs can be easily controlled by backdoor attacks in arbitrary downstream tasks. Specifically, attackers can add a simple pre-training task, which restricts the output representations of trigger instances to pre-defined vectors, namely neuron-level backdoor attack (NeuBA). If the backdoor functionality is not eliminated during fine-tuning, the triggers can make the fine-tuned model predict fixed labels by pre-defined vectors. In the experiments of both natural language processing (NLP) and computer vision (CV), we show that NeuBA absolutely controls the predictions for trigger instances without any knowledge of downstream tasks. Finally, we apply several defense methods to NeuBA and find that model pruning is a promising direction to resist NeuBA by excluding backdoored neurons. Our findings sound a red alarm for the wide use of PTMs. Our source code and models are available at \url{https://github.com/thunlp/NeuBA}.
    Inverting Adversarially Robust Networks for Image Synthesis. (arXiv:2106.06927v1 [cs.CV])
    (2 min) Recent research in adversarially robust classifiers suggests their representations tend to be aligned with human perception, which makes them attractive for image synthesis and restoration applications. Despite favorable empirical results on a few downstream tasks, their advantages are limited to slow and sensitive optimization-based techniques. Moreover, their use on generative models remains unexplored. This work proposes the use of robust representations as a perceptual primitive for feature inversion models, and show its benefits with respect to standard non-robust image features. We empirically show that adopting robust representations as an image prior significantly improves the reconstruction accuracy of CNN-based feature inversion models. Furthermore, it allows reconstructing images at multiple scales out-of-the-box. Following these findings, we propose an encoding-decoding network based on robust representations and show its advantages for applications such as anomaly detection, style transfer and image denoising.
    A One-Shot Texture-Perceiving Generative Adversarial Network for Unsupervised Surface Inspection. (arXiv:2106.06792v1 [cs.CV])
    (2 min) Visual surface inspection is a challenging task owing to the highly diverse appearance of target surfaces and defective regions. Previous attempts heavily rely on vast quantities of training examples with manual annotation. However, in some practical cases, it is difficult to obtain a large number of samples for inspection. To combat it, we propose a hierarchical texture-perceiving generative adversarial network (HTP-GAN) that is learned from the one-shot normal image in an unsupervised scheme. Specifically, the HTP-GAN contains a pyramid of convolutional GANs that can capture the global structure and fine-grained representation of an image simultaneously. This innovation helps distinguishing defective surface regions from normal ones. In addition, in the discriminator, a texture-perceiving module is devised to capture the spatially invariant representation of normal image via directional convolutions, making it more sensitive to defective areas. Experiments on a variety of datasets consistently demonstrate the effectiveness of our method.
    Synthesizing Long-Term 3D Human Motion and Interaction in 3D Scenes. (arXiv:2012.05522v2 [cs.CV] UPDATED)
    (2 min) Synthesizing 3D human motion plays an important role in many graphics applications as well as understanding human activity. While many efforts have been made on generating realistic and natural human motion, most approaches neglect the importance of modeling human-scene interactions and affordance. On the other hand, affordance reasoning (e.g., standing on the floor or sitting on the chair) has mainly been studied with static human pose and gestures, and it has rarely been addressed with human motion. In this paper, we propose to bridge human motion synthesis and scene affordance reasoning. We present a hierarchical generative framework to synthesize long-term 3D human motion conditioning on the 3D scene structure. Building on this framework, we further enforce multiple geometry constraints between the human mesh and scene point clouds via optimization to improve realistic synthesis. Our experiments show significant improvements over previous approaches on generating natural and physically plausible human motion in a scene.
    Adversarial Segmentation Loss for Sketch Colorization. (arXiv:2102.06192v2 [cs.CV] UPDATED)
    (2 min) We introduce a new method for generating color images from sketches or edge maps. Current methods either require some form of additional user-guidance or are limited to the "paired" translation approach. We argue that segmentation information could provide valuable guidance for sketch colorization. To this end, we propose to leverage semantic image segmentation, as provided by a general purpose panoptic segmentation network, to create an additional adversarial loss function. Our loss function can be integrated to any baseline GAN model. Our method is not limited to datasets that contain segmentation labels, and it can be trained for "unpaired" translation tasks. We show the effectiveness of our method on four different datasets spanning scene level indoor, outdoor, and children book illustration images using qualitative, quantitative and user study analysis. Our model improves its baseline up to 35 points on the FID metric. Our code and pretrained models can be found at https://github.com/giddyyupp/AdvSegLoss.
    Few-Shot Learning with Class Imbalance. (arXiv:2101.02523v2 [cs.LG] UPDATED)
    (2 min) Few-Shot Learning (FSL) algorithms are commonly trained through Meta-Learning (ML), which exposes models to batches of tasks sampled from a meta-dataset to mimic tasks seen during evaluation. However, the standard training procedures overlook the real-world dynamics where classes commonly occur at different frequencies. While it is generally understood that class imbalance harms the performance of supervised methods, limited research examines the impact of imbalance on the FSL evaluation task. Our analysis compares 10 state-of-the-art meta-learning and FSL methods on different imbalance distributions and rebalancing techniques. Our results reveal that 1) some FSL methods display a natural disposition against imbalance while most other approaches produce a performance drop by up to 17\% compared to the balanced task without the appropriate mitigation; 2) contrary to popular belief, many meta-learning algorithms will not automatically learn to balance from exposure to imbalanced training tasks; 3) classical rebalancing strategies, such as random oversampling, can still be very effective, leading to state-of-the-art performances and should not be overlooked; 4) FSL methods are more robust against meta-dataset imbalance than imbalance at the task-level with a similar imbalance ratio ($\rho<20$), with the effect holding even in long-tail datasets under a larger imbalance ($\rho=65$).
    A Stronger Baseline for Ego-Centric Action Detection. (arXiv:2106.06942v1 [cs.CV])
    (2 min) This technical report analyzes an egocentric video action detection method we used in the 2021 EPIC-KITCHENS-100 competition hosted in CVPR2021 Workshop. The goal of our task is to locate the start time and the end time of the action in the long untrimmed video, and predict action category. We adopt sliding window strategy to generate proposals, which can better adapt to short-duration actions. In addition, we show that classification and proposals are conflict in the same network. The separation of the two tasks boost the detection performance with high efficiency. By simply employing these strategy, we achieved 16.10\% performance on the test set of EPIC-KITCHENS-100 Action Detection challenge using a single model, surpassing the baseline method by 11.7\% in terms of average mAP.
    Adversarial Robustness via Fisher-Rao Regularization. (arXiv:2106.06685v1 [cs.LG])
    (2 min) Adversarial robustness has become a topic of growing interest in machine learning since it was observed that neural networks tend to be brittle. We propose an information-geometric formulation of adversarial defense and introduce FIRE, a new Fisher-Rao regularization for the categorical cross-entropy loss, which is based on the geodesic distance between natural and perturbed input features. Based on the information-geometric properties of the class of softmax distributions, we derive an explicit characterization of the Fisher-Rao Distance (FRD) for the binary and multiclass cases, and draw some interesting properties as well as connections with standard regularization metrics. Furthermore, for a simple linear and Gaussian model, we show that all Pareto-optimal points in the accuracy-robustness region can be reached by FIRE while other state-of-the-art methods fail. Empirically, we evaluate the performance of various classifiers trained with the proposed loss on standard datasets, showing up to 2\% of improvements in terms of robustness while reducing the training time by 20\% over the best-performing methods.
    Learning the Imaging Landmarks: Unsupervised Key point Detection in Lung Ultrasound Videos. (arXiv:2106.06987v1 [eess.IV])
    (2 min) Lung ultrasound (LUS) is an increasingly popular diagnostic imaging modality for continuous and periodic monitoring of lung infection, given its advantages of non-invasiveness, non-ionizing nature, portability and easy disinfection. The major landmarks assessed by clinicians for triaging using LUS are pleura, A and B lines. There have been many efforts for the automatic detection of these landmarks. However, restricting to a few pre-defined landmarks may not reveal the actual imaging biomarkers particularly in case of new pathologies like COVID-19. Rather, the identification of key landmarks should be driven by data given the availability of a plethora of neural network algorithms. This work is a first of its kind attempt towards unsupervised detection of the key LUS landmarks in LUS videos of COVID-19 subjects during various stages of infection. We adapted the relatively newer approach of transporter neural networks to automatically mark and track pleura, A and B lines based on their periodic motion and relatively stable appearance in the videos. Initial results on unsupervised pleura detection show an accuracy of 91.8% employing 1081 LUS video frames.
    An Integrated Approach to Produce Robust Models with High Efficiency. (arXiv:2008.13305v2 [cs.CV] UPDATED)
    (2 min) Deep Neural Networks (DNNs) needs to be both efficient and robust for practical uses. Quantization and structure simplification are promising ways to adapt DNNs to mobile devices, and adversarial training is the most popular method to make DNNs robust. In this work, we try to obtain both features by applying a convergent relaxation quantization algorithm, Binary-Relax (BR), to a robust adversarial-trained model, ResNets Ensemble via Feynman-Kac Formalism (EnResNet). We also discover that high precision, such as ternary (tnn) and 4-bit, quantization will produce sparse DNNs. However, this sparsity is unstructured under advarsarial training. To solve the problems that adversarial training jeopardizes DNNs' accuracy on clean images and the struture of sparsity, we design a trade-off loss function that helps DNNs preserve their natural accuracy and improve the channel sparsity. With our trade-off loss function, we achieve both goals with no reduction of resistance under weak attacks and very minor reduction of resistance under strong attcks. Together with quantized EnResNet with trade-off loss function, we provide robust models that have high efficiency.
    Improving Co-registration for Sentinel-1 SAR and Sentinel-2 Optical images. (arXiv:2005.11092v2 [eess.IV] UPDATED)
    (2 min) Co-registering the Sentinel-1 SAR and Sentinel-2 optical data of European Space Agency (ESA) is of great importance for many remote sensing applications. However, we find that there are evident misregistration shifts between the Sentinel-1 SAR and Sentinel-2 optical images that are directly downloaded from the official website. To address that, this paper presents a fast and effective registration method for the two types of images. In the proposed method, a block-based scheme is first designed to extract evenly distributed interest points. Then the correspondences are detected by using the similarity of structural features between the SAR and optical images, where the three dimension (3D) phase correlation (PC) is used as the similarity measure for accelerating image matching. Finally, the obtained correspondences are employed to measure the misregistration shifts between the images. Moreover, to eliminate the misregistration, we use some representative geometric transformation models such as polynomial models, projective models, and rational function models for the co-registration of the two types of images, and compare and analyze their registration accuracy under different numbers of control points and different terrains. Six pairs of the Sentinel-1 SAR L1 and Sentinel-2 optical L1C images covering three different terrains are tested in our experiments. Experimental results show that the proposed method can achieve precise correspondences between the images, and the 3rd. Order polynomial achieves the most satisfactory registration results. Its registration accuracy of the flat areas is less than 1.0 10m pixels, and that of the hilly areas is about 1.5 10m pixels, and that of the mountainous areas is between 1.7 and 2.3 10m pixels, which significantly improves the co-registration accuracy of the Sentinel-1 SAR and Sentinel-2 optical images.
    Capsule Attention for Multimodal EEG-EOG Representation Learning with Application to Driver Vigilance Estimation. (arXiv:1912.07812v4 [cs.LG] UPDATED)
    (2 min) Driver vigilance estimation is an important task for transportation safety. Wearable and portable brain-computer interface devices provide a powerful means for real-time monitoring of the vigilance level of drivers to help with avoiding distracted or impaired driving. In this paper, we propose a novel multimodal architecture for in-vehicle vigilance estimation from Electroencephalogram and Electrooculogram. To enable the system to focus on the most salient parts of the learned multimodal representations, we propose an architecture composed of a capsule attention mechanism following a deep Long Short-Term Memory (LSTM) network. Our model learns hierarchical dependencies in the data through the LSTM and capsule feature representation layers. To better explore the discriminative ability of the learned representations, we study the effect of the proposed capsule attention mechanism including the number of dynamic routing iterations as well as other parameters. Experiments show the robustness of our method by outperforming other solutions and baseline techniques, setting a new state-of-the-art. We then provide an analysis on different frequency bands and brain regions to evaluate their suitability for driver vigilance estimation. Lastly, an analysis on the role of capsule attention, multimodality, and robustness to noise is performed, highlighting the advantages of our approach.
    Deep Learning for Reversible Steganography: Principles and Insights. (arXiv:2106.06924v1 [cs.CV])
    (2 min) Deep-learning\textendash{centric} reversible steganography has emerged as a promising research paradigm. A direct way of applying deep learning to reversible steganography is to construct a pair of encoder and decoder, whose parameters are trained jointly, thereby learning the steganographic system as a whole. This end-to-end framework, however, falls short of the reversibility requirement because it is difficult for this kind of monolithic system, as a black box, to create or duplicate intricate reversible mechanisms. In response to this issue, a recent approach is to carve up the steganographic system and work on modules independently. In particular, neural networks are deployed in an analytics module to learn the data distribution, while an established mechanism is called upon to handle the remaining tasks. In this paper, we investigate the modular framework and deploy deep neural networks in a reversible steganographic scheme referred to as prediction-error modulation, in which an analytics module serves the purpose of pixel intensity prediction. The primary focus of this study is on deep-learning\textendash{based} context-aware pixel intensity prediction. We address the unsolved issues reported in related literature, including the impact of pixel initialisation on prediction accuracy and the influence of uncertainty propagation in dual-layer embedding. Furthermore, we establish a connection between context-aware pixel intensity prediction and low-level computer vision and analyse the performance of several advanced neural networks.
    1st Place Solution for YouTubeVOS Challenge 2021:Video Instance Segmentation. (arXiv:2106.06649v1 [cs.CV])
    (2 min) Video Instance Segmentation (VIS) is a multi-task problem performing detection, segmentation, and tracking simultaneously. Extended from image set applications, video data additionally induces the temporal information, which, if handled appropriately, is very useful to identify and predict object motions. In this work, we design a unified model to mutually learn these tasks. Specifically, we propose two modules, named Temporally Correlated Instance Segmentation (TCIS) and Bidirectional Tracking (BiTrack), to take the benefit of the temporal correlation between the object's instance masks across adjacent frames. On the other hand, video data is often redundant due to the frame's overlap. Our analysis shows that this problem is particularly severe for the YoutubeVOS-VIS2021 data. Therefore, we propose a Multi-Source Data (MSD) training mechanism to compensate for the data deficiency. By combining these techniques with a bag of tricks, the network performance is significantly boosted compared to the baseline, and outperforms other methods by a considerable margin on the YoutubeVOS-VIS 2019 and 2021 datasets.
    An Interaction-based Convolutional Neural Network (ICNN) Towards Better Understanding of COVID-19 X-ray Images. (arXiv:2106.06911v1 [cs.CV])
    (2 min) The field of Explainable Artificial Intelligence (XAI) aims to build explainable and interpretable machine learning (or deep learning) methods without sacrificing prediction performance. Convolutional Neural Networks (CNNs) have been successful in making predictions, especially in image classification. However, these famous deep learning models use tens of millions of parameters based on a large number of pre-trained filters which have been repurposed from previous data sets. We propose a novel Interaction-based Convolutional Neural Network (ICNN) that does not make assumptions about the relevance of local information. Instead, we use a model-free Influence Score (I-score) to directly extract the influential information from images to form important variable modules. We demonstrate that the proposed method produces state-of-the-art prediction performance of 99.8% on a real-world data set classifying COVID-19 Chest X-ray images without sacrificing the explanatory power of the model. This proposed design can efficiently screen COVID-19 patients before human diagnosis, and will be the benchmark for addressing future XAI problems in large-scale data sets.
    Few-Shot Learning via Embedding Adaptation with Set-to-Set Functions. (arXiv:1812.03664v6 [cs.LG] UPDATED)
    (2 min) Learning with limited data is a key challenge for visual recognition. Many few-shot learning methods address this challenge by learning an instance embedding function from seen classes and apply the function to instances from unseen classes with limited labels. This style of transfer learning is task-agnostic: the embedding function is not learned optimally discriminative with respect to the unseen classes, where discerning among them leads to the target task. In this paper, we propose a novel approach to adapt the instance embeddings to the target classification task with a set-to-set function, yielding embeddings that are task-specific and are discriminative. We empirically investigated various instantiations of such set-to-set functions and observed the Transformer is most effective -- as it naturally satisfies key properties of our desired model. We denote this model as FEAT (few-shot embedding adaptation w/ Transformer) and validate it on both the standard few-shot classification benchmark and four extended few-shot learning settings with essential use cases, i.e., cross-domain, transductive, generalized few-shot learning, and low-shot learning. It archived consistent improvements over baseline models as well as previous methods and established the new state-of-the-art results on two benchmarks.
    Domain Generalization on Medical Imaging Classification using Episodic Training with Task Augmentation. (arXiv:2106.06908v1 [cs.CV])
    (2 min) Medical imaging datasets usually exhibit domain shift due to the variations of scanner vendors, imaging protocols, etc. This raises the concern about the generalization capacity of machine learning models. Domain generalization (DG), which aims to learn a model from multiple source domains such that it can be directly generalized to unseen test domains, seems particularly promising to medical imaging community. To address DG, recent model-agnostic meta-learning (MAML) has been introduced, which transfers the knowledge from previous training tasks to facilitate the learning of novel testing tasks. However, in clinical practice, there are usually only a few annotated source domains available, which decreases the capacity of training task generation and thus increases the risk of overfitting to training tasks in the paradigm. In this paper, we propose a novel DG scheme of episodic training with task augmentation on medical imaging classification. Based on meta-learning, we develop the paradigm of episodic training to construct the knowledge transfer from episodic training-task simulation to the real testing task of DG. Motivated by the limited number of source domains in real-world medical deployment, we consider the unique task-level overfitting and we propose task augmentation to enhance the variety during training task generation to alleviate it. With the established learning framework, we further exploit a novel meta-objective to regularize the deep embedding of training domains. To validate the effectiveness of the proposed method, we perform experiments on histopathological images and abdominal CT images.
    NDPNet: A novel non-linear data projection network for few-shot fine-gained image classification. (arXiv:2106.06988v1 [cs.CV])
    (2 min) Metric-based few-shot fine-grained image classification (FSFGIC) aims to learn a transferable feature embedding network by estimating the similarities between query images and support classes from very few examples. In this work, we propose, for the first time, to introduce the non-linear data projection concept into the design of FSFGIC architecture in order to address the limited sample problem in few-shot learning and at the same time to increase the discriminability of the model for fine-grained image classification. Specifically, we first design a feature re-abstraction embedding network that has the ability to not only obtain the required semantic features for effective metric learning but also re-enhance such features with finer details from input images. Then the descriptors of the query images and the support classes are projected into different non-linear spaces in our proposed similarity metric learning network to learn discriminative projection factors. This design can effectively operate in the challenging and restricted condition of a FSFGIC task for making the distance between the samples within the same class smaller and the distance between samples from different classes larger and for reducing the coupling relationship between samples from different categories. Furthermore, a novel similarity measure based on the proposed non-linear data project is presented for evaluating the relationships of feature information between a query image and a support set. It is worth to note that our proposed architecture can be easily embedded into any episodic training mechanisms for end-to-end training from scratch. Extensive experiments on FSFGIC tasks demonstrate the superiority of the proposed methods over the state-of-the-art benchmarks.
    Graph-based Visual-Semantic Entanglement Network for Zero-shot Image Recognition. (arXiv:2006.04648v2 [cs.CV] UPDATED)
    (2 min) Zero-shot learning uses semantic attributes to connect the search space of unseen objects. In recent years, although the deep convolutional network brings powerful visual modeling capabilities to the ZSL task, its visual features have severe pattern inertia and lack of representation of semantic relationships, which leads to severe bias and ambiguity. In response to this, we propose the Graph-based Visual-Semantic Entanglement Network to conduct graph modeling of visual features, which is mapped to semantic attributes by using a knowledge graph, it contains several novel designs: 1. it establishes a multi-path entangled network with the convolutional neural network (CNN) and the graph convolutional network (GCN), which input the visual features from CNN to GCN to model the implicit semantic relations, then GCN feedback the graph modeled information to CNN features; 2. it uses attribute word vectors as the target for the graph semantic modeling of GCN, which forms a self-consistent regression for graph modeling and supervise GCN to learn more personalized attribute relations; 3. it fuses and supplements the hierarchical visual-semantic features refined by graph modeling into visual embedding. Our method outperforms state-of-the-art approaches on multiple representative ZSL datasets: AwA2, CUB, and SUN by promoting the semantic linkage modelling of visual features.
    Boosting Randomized Smoothing with Variance Reduced Classifiers. (arXiv:2106.06946v1 [cs.LG])
    (2 min) Randomized Smoothing (RS) is a promising method for obtaining robustness certificates by evaluating a base model under noise. In this work we: (i) theoretically motivate why ensembles are a particularly suitable choice as base models for RS, and (ii) empirically confirm this choice, obtaining state of the art results in multiple settings. The key insight of our work is that the reduced variance of ensembles over the perturbations introduced in RS leads to significantly more consistent classifications for a given input, in turn leading to substantially increased certifiable radii for difficult samples. We also introduce key optimizations which enable an up to 50-fold decrease in sample complexity of RS, thus drastically reducing its computational overhead. Experimentally, we show that ensembles of only 3 to 10 classifiers consistently improve on the strongest single model with respect to their average certified radius (ACR) by 5% to 21% on both CIFAR-10 and ImageNet. On the latter, we achieve a state-of-the-art ACR of 1.11. We release all code and models required to reproduce our results upon publication.
    A Multi-Implicit Neural Representation for Fonts. (arXiv:2106.06866v1 [cs.CV])
    (2 min) Fonts are ubiquitous across documents and come in a variety of styles. They are either represented in a native vector format or rasterized to produce fixed resolution images. In the first case, the non-standard representation prevents benefiting from latest network architectures for neural representations; while, in the latter case, the rasterized representation, when encoded via networks, results in loss of data fidelity, as font-specific discontinuities like edges and corners are difficult to represent using neural networks. Based on the observation that complex fonts can be represented by a superposition of a set of simpler occupancy functions, we introduce \textit{multi-implicits} to represent fonts as a permutation-invariant set of learned implict functions, without losing features (e.g., edges and corners). However, while multi-implicits locally preserve font features, obtaining supervision in the form of ground truth multi-channel signals is a problem in itself. Instead, we propose how to train such a representation with only local supervision, while the proposed neural architecture directly finds globally consistent multi-implicits for font families. We extensively evaluate the proposed representation for various tasks including reconstruction, interpolation, and synthesis to demonstrate clear advantages with existing alternatives. Additionally, the representation naturally enables glyph completion, wherein a single characteristic font is used to synthesize a whole font family in the target style.
    Evaluating Foveated Video Quality Using Entropic Differencing. (arXiv:2106.06817v1 [cs.CV])
    (2 min) Virtual Reality is regaining attention due to recent advancements in hardware technology. Immersive images / videos are becoming widely adopted to carry omnidirectional visual information. However, due to the requirements for higher spatial and temporal resolution of real video data, immersive videos require significantly larger bandwidth consumption. To reduce stresses on bandwidth, foveated video compression is regaining popularity, whereby the space-variant spatial resolution of the retina is exploited. Towards advancing the progress of foveated video compression, we propose a full reference (FR) foveated image quality assessment algorithm, which we call foveated entropic differencing (FED), which employs the natural scene statistics of bandpass responses by applying differences of local entropies weighted by a foveation-based error sensitivity function. We evaluate the proposed algorithm by measuring the correlations of the predictions that FED makes against human judgements on the newly created 2D and 3D LIVE-FBT-FCVR databases for Virtual Reality (VR). The performance of the proposed algorithm yields state-of-the-art as compared with other existing full reference algorithms. Software for FED has been made available at: this http URL
    Hyperspectral and Multispectral Classification for Coastal Wetland Using Depthwise Feature Interaction Network. (arXiv:2106.06896v1 [cs.CV])
    (2 min) The monitoring of coastal wetlands is of great importance to the protection of marine and terrestrial ecosystems. However, due to the complex environment, severe vegetation mixture, and difficulty of access, it is impossible to accurately classify coastal wetlands and identify their species with traditional classifiers. Despite the integration of multisource remote sensing data for performance enhancement, there are still challenges with acquiring and exploiting the complementary merits from multisource data. In this paper, the Deepwise Feature Interaction Network (DFINet) is proposed for wetland classification. A depthwise cross attention module is designed to extract self-correlation and cross-correlation from multisource feature pairs. In this way, meaningful complementary information is emphasized for classification. DFINet is optimized by coordinating consistency loss, discrimination loss, and classification loss. Accordingly, DFINet reaches the standard solution-space under the regularity of loss functions, while the spatial consistency and feature discrimination are preserved. Comprehensive experimental results on two hyperspectral and multispectral wetland datasets demonstrate that the proposed DFINet outperforms other competitive methods in terms of overall accuracy.
    Contrastive Semi-Supervised Learning for 2D Medical Image Segmentation. (arXiv:2106.06801v1 [cs.CV])
    (2 min) Contrastive Learning (CL) is a recent representation learning approach, which achieves promising results by encouraging inter-class separability and intra-class compactness in learned image representations. Because medical images often contain multiple classes of interest per image, a standard image-level CL for these images is not applicable. In this work, we present a novel semi-supervised 2D medical segmentation solution that applies CL on image patches, instead of full images. These patches are meaningfully constructed using the semantic information of different classes obtained via pseudo labeling. We also propose a novel consistency regularization scheme, which works in synergy with contrastive learning. It addresses the problem of confirmation bias often observed in semi-supervised settings, and encourages better clustering in the feature space. We evaluate our method on four public medical segmentation datasets along with a novel histopathology dataset that we introduce. Our method obtains consistent improvements over the state-of-the-art semi-supervised segmentation approaches for all datasets.
    Sparse PointPillars: Exploiting Sparsity in Birds-Eye-View Object Detection. (arXiv:2106.06882v1 [cs.CV])
    (2 min) Bird's Eye View (BEV) is a popular representation for processing 3D point clouds, and by its nature is fundamentally sparse. Motivated by the computational limitations of mobile robot platforms, we take a fast high-performance BEV 3D object detector - PointPillars - and modify its backbone to exploit this sparsity, leading to decreased runtimes. We present preliminary results demonstrating decreased runtimes with either the same performance or a modest decrease in performance, which we anticipate will be remedied by model specific hyperparameter tuning. Our work is a first step towards a new class of 3D object detectors that exploit sparsity throughout their entire pipeline in order to reduce runtime and resource usage while maintaining good detection performance.
    Hippocampus segmentation in magnetic resonance images of Alzheimer's patients using Deep machine learning. (arXiv:2106.06743v1 [eess.IV])
    (2 min) Background: Alzheimers disease is a progressive neurodegenerative disorder and the main cause of dementia in aging. Hippocampus is prone to changes in the early stages of Alzheimers disease. Detection and observation of the hippocampus changes using magnetic resonance imaging (MRI) before the onset of Alzheimers disease leads to the faster preventive and therapeutic measures. Objective: The aim of this study was the segmentation of the hippocampus in magnetic resonance (MR) images of Alzheimers patients using deep machine learning method. Methods: U-Net architecture of convolutional neural network was proposed to segment the hippocampus in the real MRI data. The MR images of the 100 and 35 patients available in Alzheimers disease Neuroimaging Initiative (ADNI) dataset, was used for the train and test of the model, respectively. The performance of the proposed method was compared with manual segmentation by measuring the similarity metrics. Results: The desired segmentation achieved after 10 iterations. A Dice similarity coefficient (DSC) = 92.3%, sensitivity = 96.5%, positive predicted value (PPV) = 90.4%, and Intersection over Union (IoU) value for the train 92.94 and test 92.93 sets were obtained which are acceptable. Conclusion: The proposed approach is promising and can be extended in the prognosis of Alzheimers disease by the prediction of the hippocampus volume changes in the early stage of the disease.
    Multistream ValidNet: Improving 6D Object Pose Estimation by Automatic Multistream Validation. (arXiv:2106.06684v1 [cs.CV])
    (2 min) This work presents a novel approach to improve the results of pose estimation by detecting and distinguishing between the occurrence of True and False Positive results. It achieves this by training a binary classifier on the output of an arbitrary pose estimation algorithm, and returns a binary label indicating the validity of the result. We demonstrate that our approach improves upon a state-of-the-art pose estimation result on the Sil\'eane dataset, outperforming a variation of the alternative CullNet method by 4.15% in average class accuracy and 0.73% in overall accuracy at validation. Applying our method can also improve the pose estimation average precision results of Op-Net by 6.06% on average.
    Dise\~no y desarrollo de aplicaci\'on m\'ovil para la clasificaci\'on de flora nativa chilena utilizando redes neuronales convolucionales. (arXiv:2106.06592v1 [cs.CV])
    (2 min) Introduction: Mobile apps, through artificial vision, are capable of recognizing vegetable species in real time. However, the existing species recognition apps do not take in consideration the wide variety of endemic and native (Chilean) species, which leads to wrong species predictions. This study introduces the development of a chilean species dataset and an optimized classification model implemented to a mobile app. Method: the data set was built by putting together pictures of several species captured on the field and by selecting some pictures available from other datasets available online. Convolutional neural networks were used in order to develop the images prediction models. The networks were trained by performing a sensitivity analysis, validating with k-fold cross validation and performing tests with different hyper-parameters, optimizers, convolutional layers, and learning rates in order to identify and choose the best models and then put them together in one classification model. Results: The final data set was compounded by 46 species, including native species, endemic and exotic from Chile, with 6120 training pictures and 655 testing pictures. The best models were implemented on a mobile app, obtaining a 95% correct prediction rate with respect to the set of tests. Conclusion: The app developed in this study is capable of classifying species with a high level of accuracy, depending on the state of the art of the artificial vision and it can also show relevant information related to the classified species.
    Dynamic Clone Transformer for Efficient Convolutional Neural Netwoks. (arXiv:2106.06778v1 [cs.CV])
    (2 min) Convolutional networks (ConvNets) have shown impressive capability to solve various vision tasks. Nevertheless, the trade-off between performance and efficiency is still a challenge for a feasible model deployment on resource-constrained platforms. In this paper, we introduce a novel concept termed multi-path fully connected pattern (MPFC) to rethink the interdependencies of topology pattern, accuracy and efficiency for ConvNets. Inspired by MPFC, we further propose a dual-branch module named dynamic clone transformer (DCT) where one branch generates multiple replicas from inputs and another branch reforms those clones through a series of difference vectors conditional on inputs itself to produce more variants. This operation allows the self-expansion of channel-wise information in a data-driven way with little computational cost while providing sufficient learning capacity, which is a potential unit to replace computationally expensive pointwise convolution as an expansion layer in the bottleneck structure.
    Rapid COVID-19 Risk Screening by Eye-region Manifestations. (arXiv:2106.06664v1 [eess.IV])
    (3 min) It is still nontrivial to develop a new fast COVID-19 screening method with the easier access and lower cost, due to the technical and cost limitations of the current testing methods in the medical resource-poor districts. On the other hand, there are more and more ocular manifestations that have been reported in the COVID-19 patients as growing clinical evidence[1]. This inspired this project. We have conducted the joint clinical research since January 2021 at the ShiJiaZhuang City, Heibei province, China, which approved by the ethics committee of The fifth hospital of ShiJiaZhuang of Hebei Medical University. We undertake several blind tests of COVID-19 patients by Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan, China. Meantime as an important part of the ongoing globally COVID-19 eye test program by AIMOMICS since February 2020, we propose a new fast screening method of analyzing the eye-region images, captured by common CCD and CMOS cameras. This could reliably make a rapid risk screening of COVID-19 with the sustainable stable high performance in different countries and races. Our model for COVID-19 rapid prescreening have the merits of the lower cost, fully self-performed, non-invasive, importantly real-time, and thus enables the continuous health surveillance. We further implement it as the open accessible APIs, and provide public service to the world. Our pilot experiments show that our model is ready to be usable to all kinds of surveillance scenarios, such as infrared temperature measurement device at airports and stations, or directly pushing to the target people groups smartphones as a packaged application.
    Task Transformer Network for Joint MRI Reconstruction and Super-Resolution. (arXiv:2106.06742v1 [cs.CV])
    (2 min) The core problem of Magnetic Resonance Imaging (MRI) is the trade off between acceleration and image quality. Image reconstruction and super-resolution are two crucial techniques in Magnetic Resonance Imaging (MRI). Current methods are designed to perform these tasks separately, ignoring the correlations between them. In this work, we propose an end-to-end task transformer network (T$^2$Net) for joint MRI reconstruction and super-resolution, which allows representations and feature transmission to be shared between multiple task to achieve higher-quality, super-resolved and motion-artifacts-free images from highly undersampled and degenerated MRI data. Our framework combines both reconstruction and super-resolution, divided into two sub-branches, whose features are expressed as queries and keys. Specifically, we encourage joint feature learning between the two tasks, thereby transferring accurate task information. We first use two separate CNN branches to extract task-specific features. Then, a task transformer module is designed to embed and synthesize the relevance between the two tasks. Experimental results show that our multi-task model significantly outperforms advanced sequential methods, both quantitatively and qualitatively.
    Robust Representation Learning via Perceptual Similarity Metrics. (arXiv:2106.06620v1 [cs.LG])
    (2 min) A fundamental challenge in artificial intelligence is learning useful representations of data that yield good performance on a downstream task, without overfitting to spurious input features. Extracting such task-relevant predictive information is particularly difficult for real-world datasets. In this work, we propose Contrastive Input Morphing (CIM), a representation learning framework that learns input-space transformations of the data to mitigate the effect of irrelevant input features on downstream performance. Our method leverages a perceptual similarity metric via a triplet loss to ensure that the transformation preserves task-relevant information.Empirically, we demonstrate the efficacy of our approach on tasks which typically suffer from the presence of spurious correlations: classification with nuisance information, out-of-distribution generalization, and preservation of subgroup accuracies. We additionally show that CIM is complementary to other mutual information-based representation learning techniques, and demonstrate that it improves the performance of variational information bottleneck (VIB) when used together.
    Structure-Regularized Attention for Deformable Object Representation. (arXiv:2106.06672v1 [cs.CV])
    (2 min) Capturing contextual dependencies has proven useful to improve the representational power of deep neural networks. Recent approaches that focus on modeling global context, such as self-attention and non-local operation, achieve this goal by enabling unconstrained pairwise interactions between elements. In this work, we consider learning representations for deformable objects which can benefit from context exploitation by modeling the structural dependencies that the data intrinsically possesses. To this end, we provide a novel structure-regularized attention mechanism, which formalizes feature interaction as structural factorization through the use of a pair of light-weight operations. The instantiated building blocks can be directly incorporated into modern convolutional neural networks, to boost the representational power in an efficient manner. Comprehensive studies on multiple tasks and empirical comparisons with modern attention mechanisms demonstrate the gains brought by our method in terms of both performance and model complexity. We further investigate its effect on feature representations, showing that our trained models can capture diversified representations characterizing object parts without resorting to extra supervision.
    Disrupting Model Training with Adversarial Shortcuts. (arXiv:2106.06654v1 [cs.CV])
    (2 min) When data is publicly released for human consumption, it is unclear how to prevent its unauthorized usage for machine learning purposes. Successful model training may be preventable with carefully designed dataset modifications, and we present a proof-of-concept approach for the image classification setting. We propose methods based on the notion of adversarial shortcuts, which encourage models to rely on non-robust signals rather than semantic features, and our experiments demonstrate that these measures successfully prevent deep learning models from achieving high accuracy on real, unmodified data examples.
    Federated Learning with Spiking Neural Networks. (arXiv:2106.06579v1 [cs.LG])
    (2 min) As neural networks get widespread adoption in resource-constrained embedded devices, there is a growing need for low-power neural systems. Spiking Neural Networks (SNNs)are emerging to be an energy-efficient alternative to the traditional Artificial Neural Networks (ANNs) which are known to be computationally intensive. From an application perspective, as federated learning involves multiple energy-constrained devices, there is a huge scope to leverage energy efficiency provided by SNNs. Despite its importance, there has been little attention on training SNNs on a large-scale distributed system like federated learning. In this paper, we bring SNNs to a more realistic federated learning scenario. Specifically, we propose a federated learning framework for decentralized and privacy-preserving training of SNNs. To validate the proposed federated learning framework, we experimentally evaluate the advantages of SNNs on various aspects of federated learning with CIFAR10 and CIFAR100 benchmarks. We observe that SNNs outperform ANNs in terms of overall accuracy by over 15% when the data is distributed across a large number of clients in the federation while providing up to5.3x energy efficiency. In addition to efficiency, we also analyze the sensitivity of the proposed federated SNN framework to data distribution among the clients, stragglers, and gradient noise and perform a comprehensive comparison with ANNs.
    DS-TransUNet:Dual Swin Transformer U-Net for Medical Image Segmentation. (arXiv:2106.06716v1 [cs.CV])
    (2 min) Automatic medical image segmentation has made great progress benefit from the development of deep learning. However, most existing methods are based on convolutional neural networks (CNNs), which fail to build long-range dependencies and global context connections due to the limitation of receptive field in convolution operation. Inspired by the success of Transformer in modeling the long-range contextual information, some researchers have expended considerable efforts in designing the robust variants of Transformer-based U-Net. Moreover, the patch division used in vision transformers usually ignores the pixel-level intrinsic structural features inside each patch. To alleviate these problems, we propose a novel deep medical image segmentation framework called Dual Swin Transformer U-Net (DS-TransUNet), which might be the first attempt to concurrently incorporate the advantages of hierarchical Swin Transformer into both encoder and decoder of the standard U-shaped architecture to enhance the semantic segmentation quality of varying medical images. Unlike many prior Transformer-based solutions, the proposed DS-TransUNet first adopts dual-scale encoder subnetworks based on Swin Transformer to extract the coarse and fine-grained feature representations of different semantic scales. As the core component for our DS-TransUNet, a well-designed Transformer Interactive Fusion (TIF) module is proposed to effectively establish global dependencies between features of different scales through the self-attention mechanism. Furthermore, we also introduce the Swin Transformer block into decoder to further explore the long-range contextual information during the up-sampling process. Extensive experiments across four typical tasks for medical image segmentation demonstrate the effectiveness of DS-TransUNet, and show that our approach significantly outperforms the state-of-the-art methods.
    CAR-Net: Unsupervised Co-Attention Guided Registration Network for Joint Registration and Structure Learning. (arXiv:2106.06637v1 [cs.CV])
    (2 min) Image registration is a fundamental building block for various applications in medical image analysis. To better explore the correlation between the fixed and moving images and improve registration performance, we propose a novel deep learning network, Co-Attention guided Registration Network (CAR-Net). CAR-Net employs a co-attention block to learn a new representation of the inputs, which drives the registration of the fixed and moving images. Experiments on UK Biobank cardiac cine-magnetic resonance image data demonstrate that CAR-Net obtains higher registration accuracy and smoother deformation fields than state-of-the-art unsupervised registration methods, while achieving comparable or better registration performance than corresponding weakly-supervised variants. In addition, our approach can provide critical structural information of the input fixed and moving images simultaneously in a completely unsupervised manner.
    HR-NAS: Searching Efficient High-Resolution Neural Architectures with Lightweight Transformers. (arXiv:2106.06560v1 [cs.CV])
    (2 min) High-resolution representations (HR) are essential for dense prediction tasks such as segmentation, detection, and pose estimation. Learning HR representations is typically ignored in previous Neural Architecture Search (NAS) methods that focus on image classification. This work proposes a novel NAS method, called HR-NAS, which is able to find efficient and accurate networks for different tasks, by effectively encoding multiscale contextual information while maintaining high-resolution representations. In HR-NAS, we renovate the NAS search space as well as its searching strategy. To better encode multiscale image contexts in the search space of HR-NAS, we first carefully design a lightweight transformer, whose computational complexity can be dynamically changed with respect to different objective functions and computation budgets. To maintain high-resolution representations of the learned networks, HR-NAS adopts a multi-branch architecture that provides convolutional encoding of multiple feature resolutions, inspired by HRNet. Last, we proposed an efficient fine-grained search strategy to train HR-NAS, which effectively explores the search space, and finds optimal architectures given various tasks and computation resources. HR-NAS is capable of achieving state-of-the-art trade-offs between performance and FLOPs for three dense prediction tasks and an image classification task, given only small computational budgets. For example, HR-NAS surpasses SqueezeNAS that is specially designed for semantic segmentation while improving efficiency by 45.9%. Code is available at https://github.com/dingmyu/HR-NAS
    Go Small and Similar: A Simple Output Decay Brings Better Performance. (arXiv:2106.06726v1 [cs.CV])
    (2 min) Regularization and data augmentation methods have been widely used and become increasingly indispensable in deep learning training. Researchers who devote themselves to this have considered various possibilities. But so far, there has been little discussion about regularizing outputs of the model. This paper begins with empirical observations that better performances are significantly associated with output distributions, that have smaller average values and variances. By audaciously assuming there is causality involved, we propose a novel regularization term, called Output Decay, that enforces the model to assign smaller and similar output values on each class. Though being counter-intuitive, such a small modification result in a remarkable improvement on performance. Extensive experiments demonstrate the wide applicability, versatility, and compatibility of Output Decay.
    Deception Detection and Remote Physiological Monitoring: A Dataset and Baseline Experimental Results. (arXiv:2106.06583v1 [cs.CV])
    (2 min) We present the Deception Detection and Physiological Monitoring (DDPM) dataset and initial baseline results on this dataset. Our application context is an interview scenario in which the interviewee attempts to deceive the interviewer on selected responses. The interviewee is recorded in RGB, near-infrared, and long-wave infrared, along with cardiac pulse, blood oxygenation, and audio. After collection, data were annotated for interviewer/interviewee, curated, ground-truthed, and organized into train / test parts for a set of canonical deception detection experiments. Baseline experiments found random accuracy for micro-expressions as an indicator of deception, but that saccades can give a statistically significant response. We also estimated subject heart rates from face videos (remotely) with a mean absolute error as low as 3.16 bpm. The database contains almost 13 hours of recordings of 70 subjects, and over 8 million visible-light, near-infrared, and thermal video frames, along with appropriate meta, audio and pulse oximeter data. To our knowledge, this is the only collection offering recordings of five modalities in an interview scenario that can be used in both deception detection and remote photoplethysmography research.
    Reverse-engineer the Distributional Structure of Infant Egocentric Views for Training Generalizable Image Classifiers. (arXiv:2106.06694v1 [cs.CV])
    (2 min) We analyze egocentric views of attended objects from infants. This paper shows 1) empirical evidence that children's egocentric views have more diverse distributions compared to adults' views, 2) we can computationally simulate the infants' distribution, and 3) the distribution is beneficial for training more generalized image classifiers not only for infant egocentric vision but for third-person computer vision.
  • cs.IR updates on arXiv.org

    Socially-Aware Self-Supervised Tri-Training for Recommendation. (arXiv:2106.03569v2 [cs.IR] UPDATED)
    (2 min) Self-supervised learning (SSL), which can automatically generate ground-truth samples from raw data, holds vast potential to improve recommender systems. Most existing SSL-based methods perturb the raw data graph with uniform node/edge dropout to generate new data views and then conduct the self-discrimination based contrastive learning over different views to learn generalizable representations. Under this scheme, only a bijective mapping is built between nodes in two different views, which means that the self-supervision signals from other nodes are being neglected. Due to the widely observed homophily in recommender systems, we argue that the supervisory signals from other nodes are also highly likely to benefit the representation learning for recommendation. To capture these signals, a general socially-aware SSL framework that integrates tri-training is proposed in this paper. Technically, our framework first augments the user data views with the user social information. And then under the regime of tri-training for multi-view encoding, the framework builds three graph encoders (one for recommendation) upon the augmented views and iteratively improves each encoder with self-supervision signals from other users, generated by the other two encoders. Since the tri-training operates on the augmented views of the same data sources for self-supervision signals, we name it self-supervised tri-training. Extensive experiments on multiple real-world datasets consistently validate the effectiveness of the self-supervised tri-training framework for improving recommendation. The code is released at https://github.com/Coder-Yu/QRec.
    CHECKED: Chinese COVID-19 Fake News Dataset. (arXiv:2010.09029v2 [cs.SI] UPDATED)
    (2 min) COVID-19 has impacted all lives. To maintain social distancing and avoiding exposure, works and lives have gradually moved online. Under this trend, social media usage to obtain COVID-19 news has increased. Also, misinformation on COVID-19 is frequently spread on social media. In this work, we develop CHECKED, the first Chinese dataset on COVID-19 misinformation. CHECKED provides a total 2,104 verified microblogs related to COVID-19 from December 2019 to August 2020, identified by using a specific list of keywords. Correspondingly, CHECKED includes 1,868,175 reposts, 1,185,702 comments, and 56,852,736 likes that reveal how these verified microblogs are spread and reacted on Weibo. The dataset contains a rich set of multimedia information for each microblog including ground-truth label, textual, visual, temporal, and network information. Extensive experiments have been conducted to analyze CHECKED data and to provide benchmark results for well-established methods when predicting fake news using CHECKED. We hope that CHECKED can facilitate studies that target misinformation on coronavirus. The dataset is available at https://github.com/cyang03/CHECKED.
    Sentiment Analysis of Covid-19 Tweets using Evolutionary Classification-Based LSTM Model. (arXiv:2106.06910v1 [cs.CL])
    (2 min) As the Covid-19 outbreaks rapidly all over the world day by day and also affects the lives of million, a number of countries declared complete lock-down to check its intensity. During this lockdown period, social media plat-forms have played an important role to spread information about this pandemic across the world, as people used to express their feelings through the social networks. Considering this catastrophic situation, we developed an experimental approach to analyze the reactions of people on Twitter taking into ac-count the popular words either directly or indirectly based on this pandemic. This paper represents the sentiment analysis on collected large number of tweets on Coronavirus or Covid-19. At first, we analyze the trend of public sentiment on the topics related to Covid-19 epidemic using an evolutionary classification followed by the n-gram analysis. Then we calculated the sentiment ratings on collected tweet based on their class. Finally, we trained the long-short term network using two types of rated tweets to predict sentiment on Covid-19 data and obtained an overall accuracy of 84.46%.
    Engineering Knowledge Graph from Patent Database. (arXiv:2106.06739v1 [cs.IR])
    (2 min) We propose a large, scalable engineering knowledge graph, comprising sets of (entity, relationship, entity) triples that are real-world engineering facts found in the patent database. We apply a set of rules based on the syntactic and lexical properties of claims in a patent document to extract facts. We aggregate these facts within each patent document and integrate the aggregated sets of facts across the patent database to obtain the engineering knowledge graph. Such a knowledge graph is expected to support inference, reasoning, and recalling in various engineering tasks. The knowledge graph has a greater size and coverage in comparison with the previously used knowledge graphs and semantic networks in the engineering literature.
    Curriculum Pre-Training Heterogeneous Subgraph Transformer for Top-$N$ Recommendation. (arXiv:2106.06722v1 [cs.IR])
    (2 min) Due to the flexibility in modelling data heterogeneity, heterogeneous information network (HIN) has been adopted to characterize complex and heterogeneous auxiliary data in top-$N$ recommender systems, called \emph{HIN-based recommendation}. HIN characterizes complex, heterogeneous data relations, containing a variety of information that may not be related to the recommendation task. Therefore, it is challenging to effectively leverage useful information from HINs for improving the recommendation performance. To address the above issue, we propose a Curriculum pre-training based HEterogeneous Subgraph Transformer (called \emph{CHEST}) with new \emph{data characterization}, \emph{representation model} and \emph{learning algorithm}. Specifically, we consider extracting useful information from HIN to compose the interaction-specific heterogeneous subgraph, containing both sufficient and relevant context information for recommendation. Then we capture the rich semantics (\eg graph structure and path semantics) within the subgraph via a heterogeneous subgraph Transformer, where we encode the subgraph with multi-slot sequence representations. Besides, we design a curriculum pre-training strategy to provide an elementary-to-advanced learning process, by which we smoothly transfer basic semantics in HIN for modeling user-item interaction relation. Extensive experiments conducted on three real-world datasets demonstrate the superiority of our proposed method over a number of competitive baselines, especially when only limited training data is available.
    Deep Reinforcement Learning based Group Recommender System. (arXiv:2106.06900v1 [cs.IR])
    (2 min) Group recommender systems are widely used in current web applications. In this paper, we propose a novel group recommender system based on the deep reinforcement learning. We introduce the MovieLens data at first and generate one random group dataset, MovieLens-Rand, from it. This randomly generated dataset is described and analyzed. We also present experimental settings and two state-of-art baselines, AGREE and GroupIM. The framework of our novel model, the Deep Reinforcement learning based Group Recommender system (DRGR), is proposed. Actor-critic networks are implemented with the deep deterministic policy gradient algorithm. The DRGR model is applied on the MovieLens-Rand dataset with two baselines. Compared with baselines, we conclude that DRGR performs better than GroupIM due to long interaction histories but worse than AGREE because of the self-attention mechanism. We express advantages and shortcomings of DRGR and also give future improvement directions at the end.
    AutoLoss: Automated Loss Function Search in Recommendations. (arXiv:2106.06713v1 [cs.IR])
    (2 min) Designing an effective loss function plays a crucial role in training deep recommender systems. Most existing works often leverage a predefined and fixed loss function that could lead to suboptimal recommendation quality and training efficiency. Some recent efforts rely on exhaustively or manually searched weights to fuse a group of candidate loss functions, which is exceptionally costly in computation and time. They also neglect the various convergence behaviors of different data examples. In this work, we propose an AutoLoss framework that can automatically and adaptively search for the appropriate loss function from a set of candidates. To be specific, we develop a novel controller network, which can dynamically adjust the loss probabilities in a differentiable manner. Unlike existing algorithms, the proposed controller can adaptively generate the loss probabilities for different data examples according to their varied convergence behaviors. Such design improves the model's generalizability and transferability between deep recommender systems and datasets. We evaluate the proposed framework on two benchmark datasets. The results show that AutoLoss outperforms representative baselines. Further experiments have been conducted to deepen our understandings of AutoLoss, including its transferability, components and training efficiency.
  • cs.LG updates on arXiv.org

    Non-Transferable Learning: A New Approach for Model Verification and Authorization. (arXiv:2106.06916v1 [cs.LG])
    (2 min) As Artificial Intelligence as a Service gains popularity, protecting well-trained models as intellectual property is becoming increasingly important. Generally speaking, there are two common protection methods: ownership verification and usage authorization. In this paper, we propose Non-Transferable Learning (NTL), a novel approach that captures the exclusive data representation in the learned model and restricts the model generalization ability to certain domains. This approach provides effective solutions to both model verification and authorization. For ownership verification, watermarking techniques are commonly used but are often vulnerable to sophisticated watermark removal methods. Our NTL-based model verification approach instead provides robust resistance to state-of-the-art watermark removal methods, as shown in extensive experiments for four of such methods over the digits, CIFAR10 & STL10, and VisDA datasets. For usage authorization, prior solutions focus on authorizing specific users to use the model, but authorized users can still apply the model to any data without restriction. Our NTL-based authorization approach instead provides data-centric usage protection by significantly degrading the performance of usage on unauthorized data. Its effectiveness is also shown through experiments on a variety of datasets.
    LE-NAS: Learning-based Ensenble with NAS for Dose Prediction. (arXiv:2106.06733v1 [eess.IV])
    (2 min) Radiation therapy treatment planning is a complex process, as the target dose prescription and normal tissue sparing are conflicting objectives. Automated and accurate dose prediction for radiation therapy planning is in high demand. In this study, we propose a novel learning-based ensemble approach, named LE-NAS, which integrates neural architecture search (NAS) with knowledge distillation for 3D radiotherapy dose prediction. Specifically, the prediction network first exhaustively searches each block from enormous architecture space. Then, multiple architectures are selected with promising performance and diversity. To reduce the inference time, we adopt the teacher-student paradigm by treating the combination of diverse outputs from multiple searched networks as supervisions to guide the student network training. In addition, we apply adversarial learning to optimize the student network to recover the knowledge in teacher networks. To the best of our knowledge, we are the first to investigate the combination of NAS and knowledge distillation. The proposed method has been evaluated on the public OpenKBP dataset, and experimental results demonstrate the effectiveness of our method and its superior performance to the state-of-the-art method.
    Cross-Subject Domain Adaptation for Multi-Frame EEG Images. (arXiv:2106.06769v1 [cs.LG])
    (2 min) Working memory (WM) is a basic part of human cognition, which plays an important role in the study of human cognitive load. Among various brain imaging techniques, electroencephalography has shown its advantage on easy access and reliability. However, one of the critical challenges is that individual difference may cause the ineffective results, especially when the established model meets an unfamiliar subject. In this work, we propose a cross-subject deep adaptation model with spatial attention (CS-DASA) to generalize the workload classifications across subjects. First, we transform time-series EEG data into multi-frame EEG images incorporating more spatio-temporal information. First, the subject-shared module in CS-DASA receives multi-frame EEG image data from both source and target subjects and learns the common feature representations. Then, in subject-specific module, the maximum mean discrepancy is implemented to measure the domain distribution divergence in a reproducing kernel Hilbert space, which can add an effective penalty loss for domain adaptation. Additionally, the subject-to-subject spatial attention mechanism is employed to focus on the most discriminative spatial feature in EEG image data. Experiments conducted on a public WM EEG dataset containing 13 subjects show that the proposed model is capable of achieve better performance than existing state-of-the art methods.
    GRAD-MATCH: Gradient Matching based Data Subset Selection for Efficient Deep Model Training. (arXiv:2103.00123v2 [cs.LG] UPDATED)
    (2 min) The great success of modern machine learning models on large datasets is contingent on extensive computational resources with high financial and environmental costs. One way to address this is by extracting subsets that generalize on par with the full data. In this work, we propose a general framework, GRAD-MATCH, which finds subsets that closely match the gradient of the training or validation set. We find such subsets effectively using an orthogonal matching pursuit algorithm. We show rigorous theoretical and convergence guarantees of the proposed algorithm and, through our extensive experiments on real-world datasets, show the effectiveness of our proposed framework. We show that GRAD-MATCH significantly and consistently outperforms several recent data-selection algorithms and achieves the best accuracy-efficiency trade-off. GRAD-MATCH is available as a part of the CORDS toolkit: \url{https://github.com/decile-team/cords}.
    Barlow Twins: Self-Supervised Learning via Redundancy Reduction. (arXiv:2103.03230v3 [cs.CV] UPDATED)
    (2 min) Self-supervised learning (SSL) is rapidly closing the gap with supervised methods on large computer vision benchmarks. A successful approach to SSL is to learn embeddings which are invariant to distortions of the input sample. However, a recurring issue with this approach is the existence of trivial constant solutions. Most current methods avoid such solutions by careful implementation details. We propose an objective function that naturally avoids collapse by measuring the cross-correlation matrix between the outputs of two identical networks fed with distorted versions of a sample, and making it as close to the identity matrix as possible. This causes the embedding vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors. The method is called Barlow Twins, owing to neuroscientist H. Barlow's redundancy-reduction principle applied to a pair of identical networks. Barlow Twins does not require large batches nor asymmetry between the network twins such as a predictor network, gradient stopping, or a moving average on the weight updates. Intriguingly it benefits from very high-dimensional output vectors. Barlow Twins outperforms previous methods on ImageNet for semi-supervised classification in the low-data regime, and is on par with current state of the art for ImageNet classification with a linear classifier head, and for transfer tasks of classification and object detection.
    On Continuous Local BDD-Based Search for Hybrid SAT Solving. (arXiv:2012.07983v2 [cs.AI] UPDATED)
    (2 min) We explore the potential of continuous local search (CLS) in SAT solving by proposing a novel approach for finding a solution of a hybrid system of Boolean constraints. The algorithm is based on CLS combined with belief propagation on binary decision diagrams (BDDs). Our framework accepts all Boolean constraints that admit compact BDDs, including symmetric Boolean constraints and small-coefficient pseudo-Boolean constraints as interesting families. We propose a novel algorithm for efficiently computing the gradient needed by CLS. We study the capabilities and limitations of our versatile CLS solver, GradSAT, by applying it on many benchmark instances. The experimental results indicate that GradSAT can be a useful addition to the portfolio of existing SAT and MaxSAT solvers for solving Boolean satisfiability and optimization problems.
    LogME: Practical Assessment of Pre-trained Models for Transfer Learning. (arXiv:2102.11005v2 [cs.LG] UPDATED)
    (2 min) This paper studies task adaptive pre-trained model selection, an underexplored problem of assessing pre-trained models for the target task and select best ones from the model zoo \emph{without fine-tuning}. A few pilot works addressed the problem in transferring supervised pre-trained models to classification tasks, but they cannot handle emerging unsupervised pre-trained models or regression tasks. In pursuit of a practical assessment method, we propose to estimate the maximum value of label evidence given features extracted by pre-trained models. Unlike the maximum likelihood, the maximum evidence is \emph{immune to over-fitting}, while its expensive computation can be dramatically reduced by our carefully designed algorithm. The Logarithm of Maximum Evidence (LogME) can be used to assess pre-trained models for transfer learning: a pre-trained model with a high LogME value is likely to have good transfer performance. LogME is \emph{fast, accurate, and general}, characterizing itself as the first practical method for assessing pre-trained models. Compared with brute-force fine-tuning, LogME brings at most $3000\times$ speedup in wall-clock time and requires only $1\%$ memory footprint. It outperforms prior methods by a large margin in their setting and is applicable to new settings. It is general enough for diverse pre-trained models (supervised pre-trained and unsupervised pre-trained), downstream tasks (classification and regression), and modalities (vision and language). Code is available at this repository: \href{https://github.com/thuml/LogME}{https://github.com/thuml/LogME}.
    Globally-Robust Neural Networks. (arXiv:2102.08452v2 [cs.LG] UPDATED)
    (2 min) The threat of adversarial examples has motivated work on training certifiably robust neural networks to facilitate efficient verification of local robustness at inference time. We formalize a notion of global robustness, which captures the operational properties of on-line local robustness certification while yielding a natural learning objective for robust training. We show that widely-used architectures can be easily adapted to this objective by incorporating efficient global Lipschitz bounds into the network, yielding certifiably-robust models by construction that achieve state-of-the-art verifiable accuracy. Notably, this approach requires significantly less time and memory than recent certifiable training methods, and leads to negligible costs when certifying points on-line; for example, our evaluation shows that it is possible to train a large robust Tiny-Imagenet model in a matter of hours. Our models effectively leverage inexpensive global Lipschitz bounds for real-time certification, despite prior suggestions that tighter local bounds are needed for good performance; we posit this is possible because our models are specifically trained to achieve tighter global bounds. Namely, we prove that the maximum achievable verifiable accuracy for a given dataset is not improved by using a local bound.
    PerSim: Data-Efficient Offline Reinforcement Learning with Heterogeneous Agents via Personalized Simulators. (arXiv:2102.06961v3 [cs.LG] UPDATED)
    (2 min) We consider offline reinforcement learning (RL) with heterogeneous agents under severe data scarcity, i.e., we only observe a single historical trajectory for every agent under an unknown, potentially sub-optimal policy. We find that the performance of state-of-the-art offline and model-based RL methods degrade significantly given such limited data availability, even for commonly perceived "solved" benchmark settings such as "MountainCar" and "CartPole". To address this challenge, we propose PerSim, a model-based offline RL approach which first learns a personalized simulator for each agent by collectively using the historical trajectories across all agents, prior to learning a policy. We do so by positing that the transition dynamics across agents can be represented as a latent function of latent factors associated with agents, states, and actions; subsequently, we theoretically establish that this function is well-approximated by a "low-rank" decomposition of separable agent, state, and action latent functions. This representation suggests a simple, regularized neural network architecture to effectively learn the transition dynamics per agent, even with scarce, offline data. We perform extensive experiments across several benchmark environments and RL methods. The consistent improvement of our approach, measured in terms of both state dynamics prediction and eventual reward, confirms the efficacy of our framework in leveraging limited historical data to simultaneously learn personalized policies across agents.
    Differentiable Particle Filtering via Entropy-Regularized Optimal Transport. (arXiv:2102.07850v2 [stat.ML] UPDATED)
    (2 min) Particle Filtering (PF) methods are an established class of procedures for performing inference in non-linear state-space models. Resampling is a key ingredient of PF, necessary to obtain low variance likelihood and states estimates. However, traditional resampling methods result in PF-based loss functions being non-differentiable with respect to model and PF parameters. In a variational inference context, resampling also yields high variance gradient estimates of the PF-based evidence lower bound. By leveraging optimal transport ideas, we introduce a principled differentiable particle filter and provide convergence results. We demonstrate this novel method on a variety of applications.
    Strategic Classification in the Dark. (arXiv:2102.11592v3 [cs.LG] UPDATED)
    (2 min) Strategic classification studies the interaction between a classification rule and the strategic agents it governs. Under the assumption that the classifier is known, rational agents respond to it by manipulating their features. However, in many real-life scenarios of high-stake classification (e.g., credit scoring), the classifier is not revealed to the agents, which leads agents to attempt to learn the classifier and game it too. In this paper we generalize the strategic classification model to such scenarios. We define the price of opacity as the difference in prediction error between opaque and transparent strategy-robust classifiers, characterize it, and give a sufficient condition for this price to be strictly positive, in which case transparency is the recommended policy. Our experiments show how Hardt et al.'s robust classifier is affected by keeping agents in the dark.
    Scaling Multi-Agent Reinforcement Learning with Selective Parameter Sharing. (arXiv:2102.07475v2 [cs.MA] UPDATED)
    (2 min) Sharing parameters in multi-agent deep reinforcement learning has played an essential role in allowing algorithms to scale to a large number of agents. Parameter sharing between agents significantly decreases the number of trainable parameters, shortening training times to tractable levels, and has been linked to more efficient learning. However, having all agents share the same parameters can also have a detrimental effect on learning. We demonstrate the impact of parameter sharing methods on training speed and converged returns, establishing that when applied indiscriminately, their effectiveness is highly dependent on the environment. We propose a novel method to automatically identify agents which may benefit from sharing parameters by partitioning them based on their abilities and goals. Our approach combines the increased sample efficiency of parameter sharing with the representational capacity of multiple independent networks to reduce training time and increase final returns.
    Strategic Classification Made Practical. (arXiv:2103.01826v2 [cs.LG] UPDATED)
    (2 min) Strategic classification regards the problem of learning in settings where users can strategically modify their features to improve outcomes. This setting applies broadly and has received much recent attention. But despite its practical significance, work in this space has so far been predominantly theoretical. In this paper we present a learning framework for strategic classification that is practical. Our approach directly minimizes the "strategic" empirical risk, achieved by differentiating through the strategic response of users. This provides flexibility that allows us to extend beyond the original problem formulation and towards more realistic learning scenarios. A series of experiments demonstrates the effectiveness of our approach on various learning settings.
    Fortify Machine Learning Production Systems: Detect and Classify Adversarial Attacks. (arXiv:2102.09695v3 [cs.LG] UPDATED)
    (2 min) Production machine learning systems are consistently under attack by adversarial actors. Various deep learning models must be capable of accurately detecting fake or adversarial input while maintaining speed. In this work, we propose one piece of the production protection system: detecting an incoming adversarial attack and its characteristics. Detecting types of adversarial attacks has two primary effects: the underlying model can be trained in a structured manner to be robust from those attacks and the attacks can be potentially filtered out in real-time before causing any downstream damage. The adversarial image classification space is explored for models commonly used in transfer learning.
    Active Testing: Sample-Efficient Model Evaluation. (arXiv:2103.05331v2 [stat.ML] UPDATED)
    (2 min) We introduce a new framework for sample-efficient model evaluation that we call active testing. While approaches like active learning reduce the number of labels needed for model training, existing literature largely ignores the cost of labeling test data, typically unrealistically assuming large test sets for model evaluation. This creates a disconnect to real applications, where test labels are important and just as expensive, e.g. for optimizing hyperparameters. Active testing addresses this by carefully selecting the test points to label, ensuring model evaluation is sample-efficient. To this end, we derive theoretically-grounded and intuitive acquisition strategies that are specifically tailored to the goals of active testing, noting these are distinct to those of active learning. As actively selecting labels introduces a bias; we further show how to remove this bias while reducing the variance of the estimator at the same time. Active testing is easy to implement and can be applied to any supervised machine learning method. We demonstrate its effectiveness on models including WideResNets and Gaussian processes on datasets including Fashion-MNIST and CIFAR-100.
    Benchmarks, Algorithms, and Metrics for Hierarchical Disentanglement. (arXiv:2102.05185v2 [cs.LG] UPDATED)
    (2 min) In representation learning, there has been recent interest in developing algorithms to disentangle the ground-truth generative factors behind a dataset, and metrics to quantify how fully this occurs. However, these algorithms and metrics often assume that both representations and ground-truth factors are flat, continuous, and factorized, whereas many real-world generative processes involve rich hierarchical structure, mixtures of discrete and continuous variables with dependence between them, and even varying intrinsic dimensionality. In this work, we develop benchmarks, algorithms, and metrics for learning such hierarchical representations.
    Double-descent curves in neural networks: a new perspective using Gaussian processes. (arXiv:2102.07238v3 [stat.ML] UPDATED)
    (2 min) Double-descent curves in neural networks describe the phenomenon that the generalisation error initially descends with increasing parameters, then grows after reaching an optimal number of parameters which is less than the number of data points, but then descends again in the overparameterised regime. Here we use a neural network Gaussian process (NNGP) which maps exactly to a fully connected network (FCN) in the infinite width limit, combined with techniques from random matrix theory, to calculate this generalisation behaviour, with a particular focus on the overparameterised regime. An advantage of our NNGP approach is that the analytical calculations are easier to interpret. We argue that neural network generalization performance improves in the overparameterised regime precisely because that is where they converge to their equivalent Gaussian process.
    Sinkhorn Label Allocation: Semi-Supervised Classification via Annealed Self-Training. (arXiv:2102.08622v2 [cs.LG] UPDATED)
    (2 min) Self-training is a standard approach to semi-supervised learning where the learner's own predictions on unlabeled data are used as supervision during training. In this paper, we reinterpret this label assignment process as an optimal transportation problem between examples and classes, wherein the cost of assigning an example to a class is mediated by the current predictions of the classifier. This formulation facilitates a practical annealing strategy for label assignment and allows for the inclusion of prior knowledge on class proportions via flexible upper bound constraints. The solutions to these assignment problems can be efficiently approximated using Sinkhorn iteration, thus enabling their use in the inner loop of standard stochastic optimization algorithms. We demonstrate the effectiveness of our algorithm on the CIFAR-10, CIFAR-100, and SVHN datasets in comparison with FixMatch, a state-of-the-art self-training algorithm. Our code is available at https://github.com/stanford-futuredata/sinkhorn-label-allocation.
    Understanding and Mitigating Accuracy Disparity in Regression. (arXiv:2102.12013v2 [cs.LG] UPDATED)
    (2 min) With the widespread deployment of large-scale prediction systems in high-stakes domains, e.g., face recognition, criminal justice, etc., disparity in prediction accuracy between different demographic subgroups has called for fundamental understanding on the source of such disparity and algorithmic intervention to mitigate it. In this paper, we study the accuracy disparity problem in regression. To begin with, we first propose an error decomposition theorem, which decomposes the accuracy disparity into the distance between marginal label distributions and the distance between conditional representations, to help explain why such accuracy disparity appears in practice. Motivated by this error decomposition and the general idea of distribution alignment with statistical distances, we then propose an algorithm to reduce this disparity, and analyze its game-theoretic optima of the proposed objective functions. To corroborate our theoretical findings, we also conduct experiments on five benchmark datasets. The experimental results suggest that our proposed algorithms can effectively mitigate accuracy disparity while maintaining the predictive power of the regression models.
    Semi-Supervised Data Programming with Subset Selection. (arXiv:2008.09887v3 [cs.LG] UPDATED)
    (2 min) The paradigm of data programming, which uses weak supervision in the form of rules/labelling functions, and semi-supervised learning, which augments small amounts of labelled data with a large unlabelled dataset, have shown great promise in several text classification scenarios. In this work, we argue that by not using any labelled data, data programming based approaches can yield sub-optimal performances, particularly when the labelling functions are noisy. The first contribution of this work is an introduction of a framework, \model which is a semi-supervised data programming paradigm that learns a \emph{joint model} that effectively uses the rules/labelling functions along with semi-supervised loss functions on the feature space. Next, we also study \modelss which additionally does subset selection on top of the joint semi-supervised data programming objective and \emph{selects} a set of examples that can be used as the labelled set by \model. The goal of \modelss is to ensure that the labelled data can \emph{complement} the labelling functions, thereby benefiting from both data-programming as well as appropriately selected data for human labelling. We demonstrate that by effectively combining semi-supervision, data-programming, and subset selection paradigms, we significantly outperform the current state-of-the-art on seven publicly available datasets. \footnote{The source code is available at \url{https://github.com/ayushbits/Semi-Supervised-LFs-Subset-Selection}}
    DVD: A Diagnostic Dataset for Multi-step Reasoning in Video Grounded Dialogue. (arXiv:2101.00151v2 [cs.AI] UPDATED)
    (2 min) A video-grounded dialogue system is required to understand both dialogue, which contains semantic dependencies from turn to turn, and video, which contains visual cues of spatial and temporal scene variations. Building such dialogue systems is a challenging problem, involving various reasoning types on both visual and language inputs. Existing benchmarks do not have enough annotations to thoroughly analyze dialogue systems and understand their capabilities and limitations in isolation. These benchmarks are also not explicitly designed to minimise biases that models can exploit without actual reasoning. To address these limitations, in this paper, we present DVD, a Diagnostic Dataset for Video-grounded Dialogues. The dataset is designed to contain minimal biases and has detailed annotations for the different types of reasoning over the spatio-temporal space of video. Dialogues are synthesized over multiple question turns, each of which is injected with a set of cross-turn semantic relationships. We use DVD to analyze existing approaches, providing interesting insights into their abilities and limitations. In total, DVD is built from $11k$ CATER synthetic videos and contains $10$ instances of $10$-round dialogues for each video, resulting in more than $100k$ dialogues and $1M$ question-answer pairs. Our code and dataset are publicly available at https://github.com/facebookresearch/DVDialogues.
    Learning Noise Transition Matrix from Only Noisy Labels via Total Variation Regularization. (arXiv:2102.02414v2 [stat.ML] UPDATED)
    (2 min) Many weakly supervised classification methods employ a noise transition matrix to capture the class-conditional label corruption. To estimate the transition matrix from noisy data, existing methods often need to estimate the noisy class-posterior, which could be unreliable due to the overconfidence of neural networks. In this work, we propose a theoretically grounded method that can estimate the noise transition matrix and learn a classifier simultaneously, without relying on the error-prone noisy class-posterior estimation. Concretely, inspired by the characteristics of the stochastic label corruption process, we propose total variation regularization, which encourages the predicted probabilities to be more distinguishable from each other. Under mild assumptions, the proposed method yields a consistent estimator of the transition matrix. We show the effectiveness of the proposed method through experiments on benchmark and real-world datasets.
    BoMb-OT: On Batch of Mini-batches Optimal Transport. (arXiv:2102.05912v2 [stat.ML] UPDATED)
    (2 min) Mini-batch optimal transport (m-OT) has been successfully used in practical applications that involve probability measures with intractable density, or probability measures with a very high number of supports. The m-OT solves several sparser optimal transport problems and then returns the average of their costs and transportation plans. Despite its scalability advantage, the m-OT does not consider the relationship between mini-batches which leads to undesirable estimation. Moreover, the m-OT does not approximate a proper metric between probability measures since the identity property is not satisfied. To address these problems, we propose a novel mini-batching scheme for optimal transport, named Batch of Mini-batches Optimal Transport (BoMb-OT), that finds the optimal coupling between mini-batches and it can be seen as an approximation to a well-defined distance on the space of probability measures. Furthermore, we show that the m-OT is a limit of the entropic regularized version of the BoMb-OT when the regularized parameter goes to infinity. Finally, we carry out extensive experiments to show that the BoMb-OT can estimate a better transportation plan between two original measures than the m-OT. It leads to a favorable performance of the BoMb-OT in the matching and color transfer tasks. Furthermore, we observe that the BoMb-OT also provides a better objective loss than the m-OT for doing approximate Bayesian computation, estimating parameters of interest in parametric generative models, and learning non-parametric generative models with gradient flow.
    Towards Understanding Iterative Magnitude Pruning: Why Lottery Tickets Win. (arXiv:2106.06955v1 [cs.LG])
    (2 min) The lottery ticket hypothesis states that sparse subnetworks exist in randomly initialized dense networks that can be trained to the same accuracy as the dense network they reside in. However, the subsequent work has failed to replicate this on large-scale models and required rewinding to an early stable state instead of initialization. We show that by using a training method that is stable with respect to linear mode connectivity, large networks can also be entirely rewound to initialization. Our subsequent experiments on common vision tasks give strong credence to the hypothesis in Evci et al. (2020b) that lottery tickets simply retrain to the same regions (although not necessarily to the same basin). These results imply that existing lottery tickets could not have been found without the preceding dense training by iterative magnitude pruning, raising doubts about the use of the lottery ticket hypothesis.
    Mitigating Covariate Shift in Imitation Learning via Offline Data Without Great Coverage. (arXiv:2106.03207v2 [cs.LG] UPDATED)
    (2 min) This paper studies offline Imitation Learning (IL) where an agent learns to imitate an expert demonstrator without additional online environment interactions. Instead, the learner is presented with a static offline dataset of state-action-next state transition triples from a potentially less proficient behavior policy. We introduce Model-based IL from Offline data (MILO): an algorithmic framework that utilizes the static dataset to solve the offline IL problem efficiently both in theory and in practice. In theory, even if the behavior policy is highly sub-optimal compared to the expert, we show that as long as the data from the behavior policy provides sufficient coverage on the expert state-action traces (and with no necessity for a global coverage over the entire state-action space), MILO can provably combat the covariate shift issue in IL. Complementing our theory results, we also demonstrate that a practical implementation of our approach mitigates covariate shift on benchmark MuJoCo continuous control tasks. We demonstrate that with behavior policies whose performances are less than half of that of the expert, MILO still successfully imitates with an extremely low number of expert state-action pairs while traditional offline IL method such as behavior cloning (BC) fails completely. Source code is provided at https://github.com/jdchang1/milo.
    Few-Shot Learning with Class Imbalance. (arXiv:2101.02523v2 [cs.LG] UPDATED)
    (2 min) Few-Shot Learning (FSL) algorithms are commonly trained through Meta-Learning (ML), which exposes models to batches of tasks sampled from a meta-dataset to mimic tasks seen during evaluation. However, the standard training procedures overlook the real-world dynamics where classes commonly occur at different frequencies. While it is generally understood that class imbalance harms the performance of supervised methods, limited research examines the impact of imbalance on the FSL evaluation task. Our analysis compares 10 state-of-the-art meta-learning and FSL methods on different imbalance distributions and rebalancing techniques. Our results reveal that 1) some FSL methods display a natural disposition against imbalance while most other approaches produce a performance drop by up to 17\% compared to the balanced task without the appropriate mitigation; 2) contrary to popular belief, many meta-learning algorithms will not automatically learn to balance from exposure to imbalanced training tasks; 3) classical rebalancing strategies, such as random oversampling, can still be very effective, leading to state-of-the-art performances and should not be overlooked; 4) FSL methods are more robust against meta-dataset imbalance than imbalance at the task-level with a similar imbalance ratio ($\rho<20$), with the effect holding even in long-tail datasets under a larger imbalance ($\rho=65$).
    The Evidence Lower Bound of Variational Autoencoders Converges to a Sum of Three Entropies. (arXiv:2010.14860v2 [stat.ML] UPDATED)
    (2 min) The central objective function of a variational autoencoder (VAE) is its variational lower bound. Here we show that for standard VAEs the variational bound converges to a value given by the sum of three entropies: the (negative) entropy of the latent distribution, the expected (negative) entropy of the observable distribution, and the average entropy of the variational distributions. Our derived analytical results are exact and apply for small as well as complex neural networks for decoder and encoder. Furthermore, they apply for finitely and infinitely many data points and at any stationary point (including local and global maxima). As a consequence, we show that the variance parameters of encoder and decoder play the key role in determining the values of variational bounds at stationary points. Furthermore, the obtained results can allow for closed-form analytical expressions at points of convergence, which may be unexpected as neither variational lower bounds of VAEs nor log-likelihoods of VAEs are closed-form during learning. As our main contribution, we provide the proofs for convergence of standard VAEs to sums of entropies. Furthermore, we numerically verify our analytical results and discuss some potential applications. The obtained equality to entropy sums provides novel information on those points in parameter space that variational learning converges to. As such, we believe, they can contribute to our understanding of established as well as novel VAE approaches.
    Event Outlier Detection in Continuous Time. (arXiv:1912.09522v3 [cs.LG] UPDATED)
    (2 min) Continuous-time event sequences represent discrete events occurring in continuous time. Such sequences arise frequently in real-life. Usually we expect the sequences to follow some regular pattern over time. However, sometimes these patterns may be interrupted by unexpected absence or occurrences of events. Identification of these unexpected cases can be very important as they may point to abnormal situations that need human attention. In this work, we study and develop methods for detecting outliers in continuous-time event sequences, including unexpected absence and unexpected occurrences of events. Since the patterns that event sequences tend to follow may change in different contexts, we develop outlier detection methods based on point processes that can take context information into account. Our methods are based on Bayesian decision theory and hypothesis testing with theoretical guarantees. To test the performance of the methods, we conduct experiments on both synthetic data and real-world clinical data and show the effectiveness of the proposed methods.
    Complexity of Linear Minimization and Projection on Some Sets. (arXiv:2101.10040v2 [math.OC] UPDATED)
    (2 min) The Frank-Wolfe algorithm is a method for constrained optimization that relies on linear minimizations, as opposed to projections. Therefore, a motivation put forward in a large body of work on the Frank-Wolfe algorithm is the computational advantage of solving linear minimizations instead of projections. However, the discussions supporting this advantage are often too succinct or incomplete. In this paper, we review the complexity bounds for both tasks on several sets commonly used in optimization. Projection methods onto the $\ell_p$-ball, $p\in\left]1,2\right[\cup\left]2,+\infty\right[$, and the Birkhoff polytope are also proposed.
    Understanding self-supervised Learning Dynamics without Contrastive Pairs. (arXiv:2102.06810v2 [cs.LG] UPDATED)
    (2 min) While contrastive approaches of self-supervised learning (SSL) learn representations by minimizing the distance between two augmented views of the same data point (positive pairs) and maximizing views from different data points (negative pairs), recent \emph{non-contrastive} SSL (e.g., BYOL and SimSiam) show remarkable performance {\it without} negative pairs, with an extra learnable predictor and a stop-gradient operation. A fundamental question arises: why do these methods not collapse into trivial representations? We answer this question via a simple theoretical study and propose a novel approach, DirectPred, that \emph{directly} sets the linear predictor based on the statistics of its inputs, without gradient training. On ImageNet, it performs comparably with more complex two-layer non-linear predictors that employ BatchNorm and outperforms a linear predictor by $2.5\%$ in 300-epoch training (and $5\%$ in 60-epoch). DirectPred is motivated by our theoretical study of the nonlinear learning dynamics of non-contrastive SSL in simple linear networks. Our study yields conceptual insights into how non-contrastive SSL methods learn, how they avoid representational collapse, and how multiple factors, like predictor networks, stop-gradients, exponential moving averages, and weight decay all come into play. Our simple theory recapitulates the results of real-world ablation studies in both STL-10 and ImageNet. Code is released\footnote{\url{https://github.com/facebookresearch/luckmatters/tree/master/ssl}}.
    Bayesian Neural Networks for Virtual Flow Metering: An Empirical Study. (arXiv:2102.01391v3 [cs.LG] UPDATED)
    (2 min) Recent works have presented promising results from the application of machine learning (ML) to the modeling of flow rates in oil and gas wells. Encouraging results and advantageous properties of ML models, such as computationally cheap evaluation and ease of calibration to new data, have sparked optimism for the development of data-driven virtual flow meters (VFMs). Data-driven VFMs are developed in the small data regime, where it is important to question the uncertainty and robustness of models. The modeling of uncertainty may help to build trust in models, which is a prerequisite for industrial applications. The contribution of this paper is the introduction of a probabilistic VFM based on Bayesian neural networks. Uncertainty in the model and measurements is described, and the paper shows how to perform approximate Bayesian inference using variational inference. The method is studied by modeling on a large and heterogeneous dataset, consisting of 60 wells across five different oil and gas assets. The predictive performance is analyzed on historical and future test data, where an average error of 4-6% and 8-13% is achieved for the 50% best performing models, respectively. Variational inference appears to provide more robust predictions than the reference approach on future data. Prediction performance and uncertainty calibration is explored in detail and discussed in light of four data challenges. The findings motivate the development of alternative strategies to improve the robustness of data-driven VFMs.
    Compression of Deep Learning Models for Text: A Survey. (arXiv:2008.05221v4 [cs.CL] UPDATED)
    (2 min) In recent years, the fields of natural language processing (NLP) and information retrieval (IR) have made tremendous progress thanksto deep learning models like Recurrent Neural Networks (RNNs), Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTMs)networks, and Transformer [120] based models like Bidirectional Encoder Representations from Transformers (BERT) [24], GenerativePre-training Transformer (GPT-2) [94], Multi-task Deep Neural Network (MT-DNN) [73], Extra-Long Network (XLNet) [134], Text-to-text transfer transformer (T5) [95], T-NLG [98] and GShard [63]. But these models are humongous in size. On the other hand,real world applications demand small model size, low response times and low computational power wattage. In this survey, wediscuss six different types of methods (Pruning, Quantization, Knowledge Distillation, Parameter Sharing, Tensor Decomposition, andSub-quadratic Transformer based methods) for compression of such models to enable their deployment in real industry NLP projects.Given the critical need of building applications with efficient and small models, and the large amount of recently published work inthis area, we believe that this survey organizes the plethora of work done by the 'deep learning for NLP' community in the past fewyears and presents it as a coherent story.
    Game of GANs: Game Theoretical Models for Generative Adversarial Networks. (arXiv:2106.06976v1 [cs.LG])
    (2 min) Generative Adversarial Network, as a promising research direction in the AI community, recently attracts considerable attention due to its ability to generating high-quality realistic data. GANs are a competing game between two neural networks trained in an adversarial manner to reach a Nash equilibrium. Despite the improvement accomplished in GANs in the last years, there remain several issues to solve. In this way, how to tackle these issues and make advances leads to rising research interests. This paper reviews literature that leverages the game theory in GANs and addresses how game models can relieve specific generative models' challenges and improve the GAN's performance. In particular, we firstly review some preliminaries, including the basic GAN model and some game theory backgrounds. After that, we present our taxonomy to summarize the state-of-the-art solutions into three significant categories: modified game model, modified architecture, and modified learning method. The classification is based on the modifications made in the basic model by the proposed approaches from the game-theoretic perspective. We further classify each category into several subcategories. Following the proposed taxonomy, we explore the main objective of each class and review the recent work in each group. Finally, we discuss the remaining challenges in this field and present the potential future research topics.
    Lattice protein design using Bayesian learning. (arXiv:2003.06601v5 [physics.bio-ph] UPDATED)
    (2 min) Protein design is the inverse approach of the three-dimensional (3D) structure prediction for elucidating the relationship between the 3D structures and amino acid sequences. In general, the computation of the protein design involves a double loop: a loop for amino acid sequence changes and a loop for an exhaustive conformational search for each amino acid sequence. Herein, we propose a novel statistical mechanical design method using Bayesian learning, which can design lattice proteins without the exhaustive conformational search. We consider a thermodynamic hypothesis of the evolution of proteins and apply it to the prior distribution of amino acid sequences. Furthermore, we take the water effect into account in view of the grand canonical picture. As a result, on applying the 2D lattice hydrophobic-polar (HP) model, our design method successfully finds an amino acid sequence for which the target conformation has a unique ground state. However, the performance was not as good for the 3D lattice HP models compared to the 2D models. The performance of the 3D model improves on using a 20-letter lattice proteins. Furthermore, we find a strong linearity between the chemical potential of water and the number of surface residues, thereby revealing the relationship between protein structure and the effect of water molecules. The advantage of our method is that it greatly reduces computation time, because it does not require long calculations for the partition function corresponding to an exhaustive conformational search. As our method uses a general form of Bayesian learning and statistical mechanics and is not limited to lattice proteins, the results presented here elucidate some heuristics used successfully in previous protein design methods.
    Root-finding Approaches for Computing Conformal Prediction Set. (arXiv:2104.06648v2 [stat.ML] UPDATED)
    (2 min) Conformal prediction constructs a confidence set for an unobserved response of a feature vector based on previous identically distributed and exchangeable observations of responses and features. It has a coverage guarantee at any nominal level without additional assumptions on their distribution. Its computation deplorably requires a refitting procedure for all replacement candidates of the target response. In regression settings, this corresponds to an infinite number of model fit. Apart from relatively simple estimators that can be written as pieces of linear function of the response, efficiently computing such sets is difficult and is still considered as an open problem. We exploit the fact that, \emph{often}, conformal prediction sets are intervals whose boundaries can be efficiently approximated by classical root-finding algorithm. We investigate how this approach can overcome many limitations of formerly used strategies and we discuss its complexity and drawbacks.
    GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning. (arXiv:2012.10630v4 [cs.LG] UPDATED)
    (3 min) Large scale machine learning and deep models are extremely data-hungry. Unfortunately, obtaining large amounts of labeled data is expensive, and training state-of-the-art models (with hyperparameter tuning) requires significant computing resources and time. Secondly, real-world data is noisy and imbalanced. As a result, several recent papers try to make the training process more efficient and robust. However, most existing work either focuses on robustness or efficiency, but not both. In this work, we introduce Glister, a GeneraLIzation based data Subset selecTion for Efficient and Robust learning framework. We formulate Glister as a mixed discrete-continuous bi-level optimization problem to select a subset of the training data, which maximizes the log-likelihood on a held-out validation set. Next, we propose an iterative online algorithm Glister-Online, which performs data selection iteratively along with the parameter updates and can be applied to any loss-based learning algorithm. We then show that for a rich class of loss functions including cross-entropy, hinge-loss, squared-loss, and logistic-loss, the inner discrete data selection is an instance of (weakly) submodular optimization, and we analyze conditions for which Glister-Online reduces the validation loss and converges. Finally, we propose Glister-Active, an extension to batch active learning, and we empirically demonstrate the performance of Glister on a wide range of tasks including, (a) data selection to reduce training time, (b) robust learning under label noise and imbalance settings, and (c) batch-active learning with several deep and shallow models. We show that our framework improves upon state of the art both in efficiency and accuracy (in cases (a) and (c)) and is more efficient compared to other state-of-the-art robust learning algorithms in case (b).
    Meta-Learning Bidirectional Update Rules. (arXiv:2104.04657v2 [cs.LG] UPDATED)
    (2 min) In this paper, we introduce a new type of generalized neural network where neurons and synapses maintain multiple states. We show that classical gradient-based backpropagation in neural networks can be seen as a special case of a two-state network where one state is used for activations and another for gradients, with update rules derived from the chain rule. In our generalized framework, networks have neither explicit notion of nor ever receive gradients. The synapses and neurons are updated using a bidirectional Hebb-style update rule parameterized by a shared low-dimensional "genome". We show that such genomes can be meta-learned from scratch, using either conventional optimization techniques, or evolutionary strategies, such as CMA-ES. Resulting update rules generalize to unseen tasks and train faster than gradient descent based optimizers for several standard computer vision and synthetic tasks.
    A Review of Graph Neural Networks and Their Applications in Power Systems. (arXiv:2101.10025v2 [cs.LG] UPDATED)
    (2 min) Deep neural networks have revolutionized many machine learning tasks in power systems, ranging from pattern recognition to signal processing. The data in these tasks is typically represented in Euclidean domains. Nevertheless, there is an increasing number of applications in power systems, where data are collected from non-Euclidean domains and represented as graph-structured data with high dimensional features and interdependency among nodes. The complexity of graph-structured data has brought significant challenges to the existing deep neural networks defined in Euclidean domains. Recently, many publications generalizing deep neural networks for graph-structured data in power systems have emerged. In this paper, a comprehensive overview of graph neural networks (GNNs) in power systems is proposed. Specifically, several classical paradigms of GNNs structures (e.g., graph convolutional networks) are summarized, and key applications in power systems, such as fault scenario application, time series prediction, power flow calculation, and data generation are reviewed in detail. Furthermore, main issues and some research trends about the applications of GNNs in power systems are discussed.
    Online learning in MDPs with linear function approximation and bandit feedback. (arXiv:2007.01612v2 [cs.LG] UPDATED)
    (2 min) We consider an online learning problem where the learner interacts with a Markov decision process in a sequence of episodes, where the reward function is allowed to change between episodes in an adversarial manner and the learner only gets to observe the rewards associated with its actions. We allow the state space to be arbitrarily large, but we assume that all action-value functions can be represented as linear functions in terms of a known low-dimensional feature map, and that the learner has access to a simulator of the environment that allows generating trajectories from the true MDP dynamics. Our main contribution is developing a computationally efficient algorithm that we call MDP-LinExp3, and prove that its regret is bounded by $\widetilde{\mathcal{O}}\big(H^2 T^{2/3} (dK)^{1/3}\big)$, where $T$ is the number of episodes, $H$ is the number of steps in each episode, $K$ is the number of actions, and $d$ is the dimension of the feature map. We also show that the regret can be improved to $\widetilde{\mathcal{O}}\big(H^2 \sqrt{TdK}\big)$ under much stronger assumptions on the MDP dynamics. To our knowledge, MDP-LinExp3 is the first provably efficient algorithm for this problem setting.
    SPADE: A Spectral Method for Black-Box Adversarial Robustness Evaluation. (arXiv:2102.03716v3 [cs.LG] UPDATED)
    (2 min) A black-box spectral method is introduced for evaluating the adversarial robustness of a given machine learning (ML) model. Our approach, named SPADE, exploits bijective distance mapping between the input/output graphs constructed for approximating the manifolds corresponding to the input/output data. By leveraging the generalized Courant-Fischer theorem, we propose a SPADE score for evaluating the adversarial robustness of a given model, which is proved to be an upper bound of the best Lipschitz constant under the manifold setting. To reveal the most non-robust data samples highly vulnerable to adversarial attacks, we develop a spectral graph embedding procedure leveraging dominant generalized eigenvectors. This embedding step allows assigning each data sample a robustness score that can be further harnessed for more effective adversarial training. Our experiments show the proposed SPADE method leads to promising empirical results for neural network models that are adversarially trained with the MNIST and CIFAR-10 data sets.
    A fast randomized incremental gradient method for decentralized non-convex optimization. (arXiv:2011.03853v2 [math.OC] UPDATED)
    (2 min) We study decentralized non-convex finite-sum minimization problems described over a network of nodes, where each node possesses a local batch of data samples. In this context, we analyze a single-timescale randomized incremental gradient method, called GT-SAGA. GT-SAGA is computationally efficient as it evaluates one component gradient per node per iteration and achieves provably fast and robust performance by leveraging node-level variance reduction and network-level gradient tracking. For general smooth non-convex problems, we show the almost sure and mean-squared convergence of GT-SAGA to a first-order stationary point and further describe regimes of practical significance where it outperforms the existing approaches and achieves a network topology-independent iteration complexity respectively. When the global function satisfies the Polyak-Lojaciewisz condition, we show that GT-SAGA exhibits linear convergence to an optimal solution in expectation and describe regimes of practical interest where the performance is network topology-independent and improves upon the existing methods. Numerical experiments are included to highlight the main convergence aspects of GT-SAGA in non-convex settings.
    Deep manifold learning reveals hidden dynamics of proteasome autoregulation. (arXiv:2012.12854v2 [q-bio.QM] UPDATED)
    (2 min) The 2.5-MDa 26S proteasome maintains proteostasis and regulates myriad cellular processes. How polyubiquitylated substrate interactions regulate proteasome activity is not understood. Here we introduce a deep manifold learning framework, named AlphaCryo4D, which enables atomic-level cryogenic electron microscopy (cryo-EM) reconstructions of nonequilibrium conformational continuum and reconstitutes hidden dynamics of proteasome autoregulation in the act of substrate degradation. AlphaCryo4D integrates 3D deep residual learning with manifold embedding of free-energy landscapes, which directs 3D clustering via an energy-based particle-voting algorithm. In blind assessments using simulated heterogeneous cryo-EM datasets, AlphaCryo4D achieved 3D classification accuracy three times that of conventional method and reconstructed continuous conformational changes of a 130-kDa protein at sub-3-angstrom resolution. By using AlphaCryo4D to analyze a single experimental cryo-EM dataset, we identified 64 conformers of the substrate-bound human 26S proteasome, revealing conformational entanglement of two regulatory particles in the doubly capped holoenzymes and their energetic differences with singly capped ones. Novel ubiquitin-binding sites are discovered on the RPN2, RPN10 and Alpha5 subunits to remodel polyubiquitin chains for deubiquitylation and recycle. Importantly, AlphaCryo4D choreographs single-nucleotide-exchange dynamics of proteasomal AAA-ATPase motor during translocation initiation, which upregulates proteolytic activity by allosterically promoting nucleophilic attack. Our systemic analysis illuminates a grand hierarchical allostery for proteasome autoregulation.
    Meta-Learning of Neural Architectures for Few-Shot Learning. (arXiv:1911.11090v3 [cs.LG] UPDATED)
    (2 min) The recent progress in neural architecture search (NAS) has allowed scaling the automated design of neural architectures to real-world domains, such as object detection and semantic segmentation. However, one prerequisite for the application of NAS are large amounts of labeled data and compute resources. This renders its application challenging in few-shot learning scenarios, where many related tasks need to be learned, each with limited amounts of data and compute time. Thus, few-shot learning is typically done with a fixed neural architecture. To improve upon this, we propose MetaNAS, the first method which fully integrates NAS with gradient-based meta-learning. MetaNAS optimizes a meta-architecture along with the meta-weights during meta-training. During meta-testing, architectures can be adapted to a novel task with a few steps of the task optimizer, that is: task adaptation becomes computationally cheap and requires only little data per task. Moreover, MetaNAS is agnostic in that it can be used with arbitrary model-agnostic meta-learning algorithms and arbitrary gradient-based NAS methods. %We present encouraging results for MetaNAS with a combination of DARTS and REPTILE on few-shot classification benchmarks. Empirical results on standard few-shot classification benchmarks show that MetaNAS with a combination of DARTS and REPTILE yields state-of-the-art results.
    Deep Reinforcement Learning for Electric Vehicle Routing Problem with Time Windows. (arXiv:2010.02068v3 [cs.LG] UPDATED)
    (2 min) The past decade has seen a rapid penetration of electric vehicles (EV) in the market, more and more logistics and transportation companies start to deploy EVs for service provision. In order to model the operations of a commercial EV fleet, we utilize the EV routing problem with time windows (EVRPTW). In this research, we propose an end-to-end deep reinforcement learning framework to solve the EVRPTW. In particular, we develop an attention model incorporating the pointer network and a graph embedding technique to parameterize a stochastic policy for solving the EVRPTW. The model is then trained using policy gradient with rollout baseline. Our numerical studies show that the proposed model is able to efficiently solve EVRPTW instances of large sizes that are not solvable with any existing approaches.
    A Hybrid Variance-Reduced Method for Decentralized Stochastic Non-Convex Optimization. (arXiv:2102.06752v2 [math.OC] UPDATED)
    (2 min) This paper considers decentralized stochastic optimization over a network of $n$ nodes, where each node possesses a smooth non-convex local cost function and the goal of the networked nodes is to find an $\epsilon$-accurate first-order stationary point of the sum of the local costs. We focus on an online setting, where each node accesses its local cost only by means of a stochastic first-order oracle that returns a noisy version of the exact gradient. In this context, we propose a novel single-loop decentralized hybrid variance-reduced stochastic gradient method, called GT-HSGD, that outperforms the existing approaches in terms of both the oracle complexity and practical implementation. The GT-HSGD algorithm implements specialized local hybrid stochastic gradient estimators that are fused over the network to track the global gradient. Remarkably, GT-HSGD achieves a network topology-independent oracle complexity of $O(n^{-1}\epsilon^{-3})$ when the required error tolerance $\epsilon$ is small enough, leading to a linear speedup with respect to the centralized optimal online variance-reduced approaches that operate on a single node. Numerical experiments are provided to illustrate our main technical results.
    Say No to the Discrimination: Learning Fair Graph Neural Networks with Limited Sensitive Attribute Information. (arXiv:2009.01454v3 [cs.LG] UPDATED)
    (2 min) Graph neural networks (GNNs) have achieved state-of-the-art performance in modeling graphs. Despite its great success, as with many other models, GNNs have the risk to inherit the bias from the training data. In addition, the bias of GNN can be magnified by the graph structures and message-passing mechanism of GNNs. The risk of discrimination limits the adoption of GNNs in sensitive domains such as credit score estimation. Though extensive studies of fair classification have been conducted on i.i.d data, methods to address the problem of discrimination on non-i.i.d data are rather limited. Furthermore, the practical scenario of sparse annotations in sensitive attributes is rarely considered in existing works. Therefore, we study the novel and important problem of learning fair GNNs with limited sensitive information. We propose a novel framework called FairGNN, which is able to reduce the bias of GNNs and maintain high node classification accuracy by leveraging graph structured data and sensitive information. Theoretical analysis is conducted to show that FairGNN can ensure fairness under mild conditions given limited nodes with known sensitive attributes. Experiments on real-world datasets demonstrated the effectiveness of the proposed framework in eliminating discrimination while maintaining high node classification accuracy.
    Functional optimal transport: map estimation and domain adaptation for functional data. (arXiv:2102.03895v3 [stat.ML] UPDATED)
    (2 min) We introduce a formulation of optimal transport problem for distributions on function spaces, where the stochastic map between functional domains can be partially represented in terms of an (infinite-dimensional) Hilbert-Schmidt operator mapping a Hilbert space of functions to another. For numerous machine learning tasks, data can be naturally viewed as samples drawn from spaces of functions, such as curves and surfaces, in high dimensions. Optimal transport for functional data analysis provides a useful framework of treatment for such domains. In this work, we develop an efficient algorithm for finding the stochastic transport map between functional domains and provide theoretical guarantees on the existence, uniqueness, and consistency of our estimate for the Hilbert-Schmidt operator. We validate our method on synthetic datasets and study the geometric properties of the transport map. Experiments on real-world datasets of robot arm trajectories further demonstrate the effectiveness of our method on applications in domain adaptation.
    Bias-Variance Reduced Local SGD for Less Heterogeneous Federated Learning. (arXiv:2102.03198v2 [cs.LG] UPDATED)
    (2 min) Recently, local SGD has got much attention and been extensively studied in the distributed learning community to overcome the communication bottleneck problem. However, the superiority of local SGD to minibatch SGD only holds in quite limited situations. In this paper, we study a new local algorithm called Bias-Variance Reduced Local SGD (BVR-L-SGD) for nonconvex distributed optimization. Algorithmically, our proposed bias and variance reduced local gradient estimator fully utilizes small second-order heterogeneity of local objectives and suggests randomly picking up one of the local models instead of taking the average of them when workers are synchronized. Theoretically, under small heterogeneity of local objectives, we show that BVR-L-SGD achieves better communication complexity than both the previous non-local and local methods under mild conditions, and particularly BVR-L-SGD is the first method that breaks the barrier of communication complexity $\Theta(1/\varepsilon)$ for general nonconvex smooth objectives when the heterogeneity is small and the local computation budget is large. Numerical results are given to verify the theoretical findings and give empirical evidence of the superiority of our method.
    Information Obfuscation of Graph Neural Networks. (arXiv:2009.13504v5 [cs.LG] UPDATED)
    (2 min) While the advent of Graph Neural Networks (GNNs) has greatly improved node and graph representation learning in many applications, the neighborhood aggregation scheme exposes additional vulnerabilities to adversaries seeking to extract node-level information about sensitive attributes. In this paper, we study the problem of protecting sensitive attributes by information obfuscation when learning with graph structured data. We propose a framework to locally filter out pre-determined sensitive attributes via adversarial training with the total variation and the Wasserstein distance. Our method creates a strong defense against inference attacks, while only suffering small loss in task performance. Theoretically, we analyze the effectiveness of our framework against a worst-case adversary, and characterize an inherent trade-off between maximizing predictive accuracy and minimizing information leakage. Experiments across multiple datasets from recommender systems, knowledge graphs and quantum chemistry demonstrate that the proposed approach provides a robust defense across various graph structures and tasks, while producing competitive GNN encoders for downstream tasks.
    Capsule Attention for Multimodal EEG-EOG Representation Learning with Application to Driver Vigilance Estimation. (arXiv:1912.07812v4 [cs.LG] UPDATED)
    (2 min) Driver vigilance estimation is an important task for transportation safety. Wearable and portable brain-computer interface devices provide a powerful means for real-time monitoring of the vigilance level of drivers to help with avoiding distracted or impaired driving. In this paper, we propose a novel multimodal architecture for in-vehicle vigilance estimation from Electroencephalogram and Electrooculogram. To enable the system to focus on the most salient parts of the learned multimodal representations, we propose an architecture composed of a capsule attention mechanism following a deep Long Short-Term Memory (LSTM) network. Our model learns hierarchical dependencies in the data through the LSTM and capsule feature representation layers. To better explore the discriminative ability of the learned representations, we study the effect of the proposed capsule attention mechanism including the number of dynamic routing iterations as well as other parameters. Experiments show the robustness of our method by outperforming other solutions and baseline techniques, setting a new state-of-the-art. We then provide an analysis on different frequency bands and brain regions to evaluate their suitability for driver vigilance estimation. Lastly, an analysis on the role of capsule attention, multimodality, and robustness to noise is performed, highlighting the advantages of our approach.
    Graph Inference Representation: Learning Graph Positional Embeddings with Anchor Path Encoding. (arXiv:2105.03821v2 [cs.LG] UPDATED)
    (2 min) Learning node representations that incorporate information from graph structure benefits wide range of tasks on graph. The majority of existing graph neural networks (GNNs) have limited power in capturing position information for a given node. The idea of positioning nodes with selected anchors has been exploited, yet mainly relying on explicit labeling of distance information. Here we propose Graph Inference Representation (GIR), an anchor based GNN model encoding path information related to pre-selected anchors for each node. Abilities to get position-aware embeddings are theoretically and experimentally investigated on GIR and its core variants. Further, the complementarity between GIRs and typical GNNs is demonstrated. We show that GIRs get outperformed results in position-aware scenarios, and performances on typical GNNs could be improved by fusing GIR embeddings.
    Human Apprenticeship Learning via Kernel-based Inverse Reinforcement Learning. (arXiv:2002.10904v2 [cs.LG] UPDATED)
    (2 min) It has been well demonstrated that inverse reinforcement learning (IRL) is an effective technique for teaching machines to perform tasks at human skill levels given human demonstrations (i.e., human to machine apprenticeship learning). This paper seeks to show that a similar application can be demonstrated with human learners. That is, given demonstrations from human experts inverse reinforcement learning techniques can be used to teach other humans to perform at higher skill levels (i.e., human to human apprenticeship learning). To show this two experiments were conducted using a simple, real-time web game where players were asked to touch targets in order to earn as many points as possible. For the experiment player performance was defined as the number of targets a player touched, irrespective of the points that a player actually earned. This allowed for in-game points to be modified and the effect of these alterations on performance measured. At no time were participants told the true performance metric. To determine the point modifications IRL was applied on demonstrations of human experts playing the game. The results of the experiment show with significance that performance improved over the control for select treatment groups. Finally, in addition to the experiment, we also detail the algorithmic challenges we faced when conducting the experiment and the techniques we used to overcome them.
    CATE: Computation-aware Neural Architecture Encoding with Transformers. (arXiv:2102.07108v2 [cs.LG] UPDATED)
    (2 min) Recent works (White et al., 2020a; Yan et al., 2020) demonstrate the importance of architecture encodings in Neural Architecture Search (NAS). These encodings encode either structure or computation information of the neural architectures. Compared to structure-aware encodings, computation-aware encodings map architectures with similar accuracies to the same region, which improves the downstream architecture search performance (Zhang et al., 2019; White et al., 2020a). In this work, we introduce a Computation-Aware Transformer-based Encoding method called CATE. Different from existing computation-aware encodings based on fixed transformation (e.g. path encoding), CATE employs a pairwise pre-training scheme to learn computation-aware encodings using Transformers with cross-attention. Such learned encodings contain dense and contextualized computation information of neural architectures. We compare CATE with eleven encodings under three major encoding-dependent NAS subroutines in both small and large search spaces. Our experiments show that CATE is beneficial to the downstream search, especially in the large search space. Moreover, the outside search space experiment demonstrates its superior generalization ability beyond the search space on which it was trained. Our code is available at: https://github.com/MSU-MLSys-Lab/CATE.
    Learning Quantized Neural Nets by Coarse Gradient Method for Non-linear Classification. (arXiv:2011.11256v2 [cs.LG] UPDATED)
    (2 min) Quantized or low-bit neural networks are attractive due to their inference efficiency. However, training deep neural networks with quantized activations involves minimizing a discontinuous and piecewise constant loss function. Such a loss function has zero gradients almost everywhere (a.e.), which makes the conventional gradient-based algorithms inapplicable. To this end, we study a novel class of \emph{biased} first-order oracle, termed coarse gradient, for overcoming the vanished gradient issue. A coarse gradient is generated by replacing the a.e. zero derivatives of quantized (i.e., stair-case) ReLU activation composited in the chain rule with some heuristic proxy derivative called straight-through estimator (STE). Although having been widely used in training quantized networks empirically, fundamental questions like when and why the ad-hoc STE trick works, still lacks theoretical understanding. In this paper, we propose a class of STEs with certain monotonicity, and consider their applications to the training of a two-linear-layer network with quantized activation functions for non-linear multi-category classification. We establish performance guarantees for the proposed STEs by showing that the corresponding coarse gradient methods converge to the global minimum, which leads to a perfect classification. Lastly, we present experimental results on synthetic data as well as MNIST dataset to verify our theoretical findings and demonstrate the effectiveness of our proposed STEs.
    Scaling Deep Contrastive Learning Batch Size under Memory Limited Setup. (arXiv:2101.06983v2 [cs.LG] UPDATED)
    (0 min) Contrastive learning has been applied successfully to learn vector representations of text. Previous research demonstrated that learning high-quality representations benefits from batch-wise contrastive loss with a large number of negatives. In practice, the technique of in-batch negative is used, where for each example in a batch, other batch examples' positives will be taken as its negatives, avoiding encoding extra negatives. This, however, still conditions each example's loss on all batch examples and requires fitting the entire large batch into GPU memory. This paper introduces a gradient caching technique that decouples backpropagation between contrastive loss and the encoder, removing encoder backward pass data dependency along the batch dimension. As a result, gradients can be computed for one subset of the batch at a time, leading to almost constant memory usage.
    Data-driven Prediction of General Hamiltonian Dynamics via Learning Exactly-Symplectic Maps. (arXiv:2103.05632v2 [cs.LG] UPDATED)
    (0 min) We consider the learning and prediction of nonlinear time series generated by a latent symplectic map. A special case is (not necessarily separable) Hamiltonian systems, whose solution flows give such symplectic maps. For this special case, both generic approaches based on learning the vector field of the latent ODE and specialized approaches based on learning the Hamiltonian that generates the vector field exist. Our method, however, is different as it does not rely on the vector field nor assume its existence; instead, it directly learns the symplectic evolution map in discrete time. Moreover, we do so by representing the symplectic map via a generating function, which we approximate by a neural network (hence the name GFNN). This way, our approximation of the evolution map is always \emph{exactly} symplectic. This additional geometric structure allows the local prediction error at each step to accumulate in a controlled fashion, and we will prove, under reasonable assumptions, that the global prediction error grows at most \emph{linearly} with long prediction time, which significantly improves an otherwise exponential growth. In addition, as a map-based and thus purely data-driven method, GFNN avoids two additional sources of inaccuracies common in vector-field based approaches, namely the error in approximating the vector field by finite difference of the data, and the error in numerical integration of the vector field for making predictions. Numerical experiments further demonstrate our claims.
    Ensemble Squared: A Meta AutoML System. (arXiv:2012.05390v2 [cs.LG] UPDATED)
    (0 min) There are currently many barriers that prevent non-experts from exploiting machine learning solutions ranging from the lack of intuition on statistical learning techniques to the trickiness of hyperparameter tuning. Such barriers have led to an explosion of interest in automated machine learning (AutoML), whereby an off-the-shelf system can take care of many of the steps for end-users without the need for expertise in machine learning. This paper presents Ensemble Squared (Ensemble$^2$), an AutoML system that ensembles the results of state-of-the-art open-source AutoML systems. Ensemble$^2$ exploits the diversity of existing AutoML systems by leveraging the differences in their model search space and heuristics. Empirically, we show that diversity of each AutoML system is sufficient to justify ensembling at the AutoML system level. In demonstrating this, we also establish new state-of-the-art AutoML results on the OpenML tabular classification benchmark.
    Invariance Principle Meets Information Bottleneck for Out-of-Distribution Generalization. (arXiv:2106.06607v1 [cs.LG])
    (2 min) The invariance principle from causality is at the heart of notable approaches such as invariant risk minimization (IRM) that seek to address out-of-distribution (OOD) generalization failures. Despite the promising theory, invariance principle-based approaches fail in common classification tasks, where invariant (causal) features capture all the information about the label. Are these failures due to the methods failing to capture the invariance? Or is the invariance principle itself insufficient? To answer these questions, we revisit the fundamental assumptions in linear regression tasks, where invariance-based approaches were shown to provably generalize OOD. In contrast to the linear regression tasks, we show that for linear classification tasks we need much stronger restrictions on the distribution shifts, or otherwise OOD generalization is impossible. Furthermore, even with appropriate restrictions on distribution shifts in place, we show that the invariance principle alone is insufficient. We prove that a form of the information bottleneck constraint along with invariance helps address key failures when invariant features capture all the information about the label and also retains the existing success when they do not. We propose an approach that incorporates both of these principles and demonstrate its effectiveness in several experiments.
    Power Modeling for Effective Datacenter Planning and Compute Management. (arXiv:2103.13308v2 [cs.DC] UPDATED)
    (0 min) Datacenter power demand has been continuously growing and is the key driver of its cost. An accurate mapping of compute resources (CPU, RAM, etc.) and hardware types (servers, accelerators, etc.) to power consumption has emerged as a critical requirement for major Web and cloud service providers. With the global growth in datacenter capacity and associated power consumption, such models are essential for important decisions around datacenter design and operation. In this paper, we discuss two classes of statistical power models designed and validated to be accurate, simple, interpretable and applicable to all hardware configurations and workloads across hyperscale datacenters of Google fleet. To the best of our knowledge, this is the largest scale power modeling study of this kind, in both the scope of diverse datacenter planning and real-time management use cases, as well as the variety of hardware configurations and workload types used for modeling and validation. We demonstrate that the proposed statistical modeling techniques, while simple and scalable, predict power with less than 5% Mean Absolute Percent Error (MAPE) for more than 95% diverse Power Distribution Units (more than 2000) using only 4 features. This performance matches the reported accuracy of the previous started-of-the-art methods, while using significantly less features and covering a wider range of use cases.
    Statistical Analysis from the Fourier Integral Theorem. (arXiv:2106.06608v1 [stat.ME])
    (2 min) Taking the Fourier integral theorem as our starting point, in this paper we focus on natural Monte Carlo and fully nonparametric estimators of multivariate distributions and conditional distribution functions. We do this without the need for any estimated covariance matrix or dependence structure between variables. These aspects arise immediately from the integral theorem. Being able to model multivariate data sets using conditional distribution functions we can study a number of problems, such as prediction for Markov processes, estimation of mixing distribution functions which depend on covariates, and general multivariate data. Estimators are explicit Monte Carlo based and require no recursive or iterative algorithms.
    Tracking Peaceful Tractors on Social Media -- XAI-enabled analysis of Red Fort Riots 2021. (arXiv:2104.13352v2 [cs.SI] UPDATED)
    (2 min) On 26 January 2021, India witnessed a national embarrassment from the demographic least expected from - farmers. People across the nation watched in horror as a pseudo-patriotic mob of farmers stormed capital Delhi and vandalized the national pride- Red Fort. Investigations that followed the event revealed the existence of a social media trail that led to the likes of such an event. Consequently, it became essential and necessary to archive this trail for social media analysis - not only to understand the bread-crumbs that are dispersed across the trail but also to visualize the role played by misinformation and fake news in this event. In this paper, we propose the tractor2twitter dataset which contains around 0.05 million tweets that were posted before, during, and after this event. Also, we benchmark our dataset with an Explainable AI ML model for classification of each tweet into either of the three categories - disinformation, misinformation, and opinion.
    Not All Memories are Created Equal: Learning to Forget by Expiring. (arXiv:2105.06548v2 [cs.LG] UPDATED)
    (2 min) Attention mechanisms have shown promising results in sequence modeling tasks that require long-term memory. Recent work investigated mechanisms to reduce the computational cost of preserving and storing memories. However, not all content in the past is equally important to remember. We propose Expire-Span, a method that learns to retain the most important information and expire the irrelevant information. This forgetting of memories enables Transformers to scale to attend over tens of thousands of previous timesteps efficiently, as not all states from previous timesteps are preserved. We demonstrate that Expire-Span can help models identify and retain critical information and show it can achieve strong performance on reinforcement learning tasks specifically designed to challenge this functionality. Next, we show that Expire-Span can scale to memories that are tens of thousands in size, setting a new state of the art on incredibly long context tasks such as character-level language modeling and a frame-by-frame moving objects task. Finally, we analyze the efficiency of Expire-Span compared to existing approaches and demonstrate that it trains faster and uses less memory.
    Maximum n-times Coverage for Vaccine Design. (arXiv:2101.10902v2 [q-bio.QM] UPDATED)
    (2 min) We introduce the maximum $n$-times coverage problem that selects $k$ overlays to maximize the summed coverage of weighted elements, where each element must be covered at least $n$ times. We also define the min-cost $n$-times coverage problem where the objective is to select the minimum set of overlays such that the sum of the weights of elements that are covered at least $n$ times is at least $\tau$. Maximum $n$-times coverage is a generalization of the multi-set multi-cover problem, is NP-complete, and is not submodular. We introduce two new practical solutions for $n$-times coverage based on integer linear programming and sequential greedy optimization. We show that maximum $n$-times coverage is a natural way to frame peptide vaccine design, and find that it produces a pan-strain COVID-19 vaccine design that is superior to 29 other published designs in predicted population coverage and the expected number of peptides displayed by each individual's HLA molecules.
    Federated Continual Learning with Weighted Inter-client Transfer. (arXiv:2003.03196v5 [cs.LG] UPDATED)
    (2 min) There has been a surge of interest in continual learning and federated learning, both of which are important in deep neural networks in real-world scenarios. Yet little research has been done regarding the scenario where each client learns on a sequence of tasks from a private local data stream. This problem of federated continual learning poses new challenges to continual learning, such as utilizing knowledge from other clients, while preventing interference from irrelevant knowledge. To resolve these issues, we propose a novel federated continual learning framework, Federated Weighted Inter-client Transfer (FedWeIT), which decomposes the network weights into global federated parameters and sparse task-specific parameters, and each client receives selective knowledge from other clients by taking a weighted combination of their task-specific parameters. FedWeIT minimizes interference between incompatible tasks, and also allows positive knowledge transfer across clients during learning. We validate our FedWeIT against existing federated learning and continual learning methods under varying degrees of task similarity across clients, and our model significantly outperforms them with a large reduction in the communication cost. Code is available at https://github.com/wyjeong/FedWeIT
    Functorial Manifold Learning. (arXiv:2011.07435v5 [cs.LG] UPDATED)
    (2 min) We adapt previous research on category theory and topological unsupervised learning to develop a functorial perspective on manifold learning. We first characterize manifold learning algorithms as functors that map pseudometric spaces to optimization objectives and factor through hierachical clustering functors. We then use this characterization to prove refinement bounds on manifold learning loss functions and construct a hierarchy of manifold learning algorithms based on their invariants. We express several popular manifold learning algorithms as functors at different levels of this hierarchy, including Metric Multidimensional Scaling, IsoMap, and UMAP. Next, we use interleaving distance to study the stability of a broad class of manifold learning algorithms. We present bounds on how closely the embeddings these algorithms produce from noisy data approximate the embeddings they would learn from noiseless data. Finally, we use our framework to derive a set of novel manifold learning algorithms, which we experimentally demonstrate are competitive with the state of the art.
    Time Series Classification via Topological Data Analysis. (arXiv:2102.01956v2 [stat.ML] UPDATED)
    (2 min) In this paper, we develop topological data analysis methods for classification tasks on univariate time series. As an application, we perform binary and ternary classification tasks on two public datasets that consist of physiological signals collected under stress and non-stress conditions. We accomplish our goal by using persistent homology to engineer stable topological features after we use a time delay embedding of the signals and perform a subwindowing instead of using windows of fixed length. The combination of methods we use can be applied to any univariate time series and in this application allows us to reduce noise and use long window sizes without incurring an extra computational cost. We then use machine learning models on the features we algorithmically engineered to obtain higher accuracies with fewer features.
    FeSHI: Feature Map Based Stealthy Hardware Intrinsic Attack. (arXiv:2106.06895v1 [cs.CR])
    (2 min) Convolutional Neural Networks (CNN) have shown impressive performance in computer vision, natural language processing, and many other applications, but they exhibit high computations and substantial memory requirements. To address these limitations, especially in resource-constrained devices, the use of cloud computing for CNNs is becoming more popular. This comes with privacy and latency concerns that have motivated the designers to develop embedded hardware accelerators for CNNs. However, designing a specialized accelerator increases the time-to-market and cost of production. Therefore, to reduce the time-to-market and access to state-of-the-art techniques, CNN hardware mapping and deployment on embedded accelerators are often outsourced to untrusted third parties, which is going to be more prevalent in futuristic artificial intelligence of things (AIoT) systems. These AIoT systems anticipate horizontal collaboration among different resource-constrained AIoT node devices, where CNN layers are partitioned and these devices collaboratively compute complex CNN tasks Therefore, there is a dire need to explore this attack surface for designing secure embedded hardware accelerators for CNNs. Towards this goal, in this paper, we exploited this attack surface to propose an HT-based attack called FeSHI. This attack exploits the statistical distribution i.e., Gaussian distribution, of the layer-by-layer feature maps of the CNN to design two triggers for stealthy HT with a very low probability of triggering. To illustrate the effectiveness of the proposed attack, we deployed the LeNet and LeNet-3D on PYNQ to classify the MNIST and CIFAR-10 datasets, respectively, and tested FeSHI. The experimental results show that FeSHI utilizes up to 2% extra LUTs, and the overall resource overhead is less than 1% compared to the original designs
    Randomized Stochastic Variance-Reduced Methods for Multi-Task Stochastic Bilevel Optimization. (arXiv:2105.02266v2 [math.OC] UPDATED)
    (0 min) In this paper, we consider non-convex stochastic bilevel optimization (SBO) problems that have many applications in machine learning. Although numerous studies have proposed stochastic algorithms for solving these problems, they are limited in two perspectives: (i) their sample complexities are high, which do not match the state-of-the-art result for non-convex stochastic optimization; (ii) their algorithms are tailored to problems with only one lower-level problem. When there are many lower-level problems, it could be prohibitive to process all these lower-level problems at each iteration. To address these limitations, this paper proposes fast randomized stochastic algorithms for non-convex SBO problems. First, we present a stochastic method for non-convex SBO with only one lower problem and establish its sample complexity of $O(1/\epsilon^3)$ for finding an $\epsilon$-stationary point under Lipschitz continuous conditions of stochastic oracles, matching the lower bound for stochastic smooth non-convex optimization. Second, we present a randomized stochastic method for non-convex SBO with $m>1$ lower level problems (multi-task SBO) by processing a constant number of lower problems at each iteration, and establish its sample complexity no worse than $O(m/\epsilon^3)$, which could be a better complexity than that of simply processing all $m$ lower problems at each iteration. Lastly, we establish even faster convergence results for gradient-dominant functions. To the best of our knowledge, this is the first work considering multi-task SBO and developing state-of-the-art sample complexity results.
    Two-way Spectrum Pursuit for CUR Decomposition and Its Application in Joint Column/Row Subset Selection. (arXiv:2106.06983v1 [cs.LG])
    (2 min) The problem of simultaneous column and row subset selection is addressed in this paper. The column space and row space of a matrix are spanned by its left and right singular vectors, respectively. However, the singular vectors are not within actual columns/rows of the matrix. In this paper, an iterative approach is proposed to capture the most structural information of columns/rows via selecting a subset of actual columns/rows. This algorithm is referred to as two-way spectrum pursuit (TWSP) which provides us with an accurate solution for the CUR matrix decomposition. TWSP is applicable in a wide range of applications since it enjoys a linear complexity w.r.t. number of original columns/rows. We demonstrated the application of TWSP for joint channel and sensor selection in cognitive radio networks, informative users and contents detection, and efficient supervised data reduction.
    Domain Adaptation for Time Series Forecasting via Attention Sharing. (arXiv:2102.06828v3 [cs.LG] UPDATED)
    (2 min) Recent years have witnessed deep neural networks gaining increasing popularity in the field of time series forecasting. A primary reason of their success is their ability to effectively capture complex temporal dynamics across multiple related time series. However, the advantages of these deep forecasters only start to emerge in the presence of a sufficient amount of data. This poses a challenge for typical forecasting problems in practice, where one either has a small number of time series, or limited observations per time series, or both. To cope with the issue of data scarcity, we propose a novel domain adaptation framework, Domain Adaptation Forecaster (DAF), that leverages the statistical strengths from another relevant domain with abundant data samples (source) to improve the performance on the domain of interest with limited data (target). In particular, we propose an attention-based shared module with a domain discriminator across domains as well as private modules for individual domains. This allows us to jointly train the source and target domains by generating domain-invariant latent features while retraining domain-specific features. Extensive experiments on various domains demonstrate that our proposed method outperforms state-of-the-art baselines on synthetic and real-world datasets.
    Unifying Cardiovascular Modelling with Deep Reinforcement Learning for Uncertainty Aware Control of Sepsis Treatment. (arXiv:2101.08477v3 [cs.LG] UPDATED)
    (2 min) Sepsis is a potentially life threatening inflammatory response to infection or severe tissue damage. It has a highly variable clinical course, requiring constant monitoring of the patient's state to guide the management of intravenous fluids and vasopressors, among other interventions. Despite decades of research, there's still debate among experts on optimal treatment. Here, we combine for the first time, distributional deep reinforcement learning with mechanistic physiological models to find personalized sepsis treatment strategies. Our method handles partial observability by leveraging known cardiovascular physiology, introducing a novel physiology-driven recurrent autoencoder, and quantifies the uncertainty of its own results. Moreover, we introduce a framework for uncertainty aware decision support with humans in the loop. We show that our method learns physiologically explainable, robust policies that are consistent with clinical knowledge. Further our method consistently identifies high risk states that lead to death, which could potentially benefit from more frequent vasopressor administration, providing valuable guidance for future research
    Probabilistic Generating Circuits. (arXiv:2102.09768v2 [cs.AI] UPDATED)
    (2 min) Generating functions, which are widely used in combinatorics and probability theory, encode function values into the coefficients of a polynomial. In this paper, we explore their use as a tractable probabilistic model, and propose probabilistic generating circuits (PGCs) for their efficient representation. PGCs are strictly more expressive efficient than many existing tractable probabilistic models, including determinantal point processes (DPPs), probabilistic circuits (PCs) such as sum-product networks, and tractable graphical models. We contend that PGCs are not just a theoretical framework that unifies vastly different existing models, but also show great potential in modeling realistic data. We exhibit a simple class of PGCs that are not trivially subsumed by simple combinations of PCs and DPPs, and obtain competitive performance on a suite of density estimation benchmarks. We also highlight PGCs' connection to the theory of strongly Rayleigh distributions.
    Estimating Treatment Effects with Observed Confounders and Mediators. (arXiv:2003.11991v3 [stat.ME] UPDATED)
    (2 min) Given a causal graph, the do-calculus can express treatment effects as functionals of the observational joint distribution that can be estimated empirically. Sometimes the do-calculus identifies multiple valid formulae, prompting us to compare the statistical properties of the corresponding estimators. For example, the backdoor formula applies when all confounders are observed and the frontdoor formula applies when an observed mediator transmits the causal effect. In this paper, we investigate the over-identified scenario where both confounders and mediators are observed, rendering both estimators valid. Addressing the linear Gaussian causal model, we demonstrate that either estimator can dominate the other by an unbounded constant factor. Next, we derive an optimal estimator, which leverages all observed variables, and bound its finite-sample variance. We show that it strictly outperforms the backdoor and frontdoor estimators and that this improvement can be unbounded. We also present a procedure for combining two datasets, one with observed confounders and another with observed mediators. Finally, we evaluate our methods on both simulated data and the IHDP and JTPA datasets.
    LaProp: Separating Momentum and Adaptivity in Adam. (arXiv:2002.04839v3 [cs.LG] UPDATED)
    (0 min) We identity a by-far-unrecognized problem of Adam-style optimizers which results from unnecessary coupling between momentum and adaptivity. The coupling leads to instability and divergence when the momentum and adaptivity parameters are mismatched. In this work, we propose a method, Laprop, which decouples momentum and adaptivity in the Adam-style methods. We show that the decoupling leads to greater flexibility in the hyperparameters and allows for a straightforward interpolation between the signed gradient methods and the adaptive gradient methods. We experimentally show that Laprop has consistently improved speed and stability over Adam on a variety of tasks. We also bound the regret of Laprop on a convex problem and show that our bound differs from that of Adam by a key factor, which demonstrates its advantage.
    Reconstruction of turbulent data with deep generative models for semantic inpainting from TURB-Rot database. (arXiv:2006.09179v2 [physics.flu-dyn] UPDATED)
    (2 min) We study the applicability of tools developed by the computer vision community for features learning and semantic image inpainting to perform data reconstruction of fluid turbulence configurations. The aim is twofold. First, we explore on a quantitative basis, the capability of Convolutional Neural Networks embedded in a Deep Generative Adversarial Model (Deep-GAN) to generate missing data in turbulence, a paradigmatic high dimensional chaotic system. In particular, we investigate their use in reconstructing two-dimensional damaged snapshots extracted from a large database of numerical configurations of 3d turbulence in the presence of rotation, a case with multi-scale random features where both large-scale organised structures and small-scale highly intermittent and non-Gaussian fluctuations are present. Second, following a reverse engineering approach, we aim to rank the input flow properties (features) in terms of their qualitative and quantitative importance to obtain a better set of reconstructed fields. We present two approaches both based on Context Encoders. The first one infers the missing data via a minimization of the L2 pixel-wise reconstruction loss, plus a small adversarial penalisation. The second searches for the closest encoding of the corrupted flow configuration from a previously trained generator. Finally, we present a comparison with a different data assimilation tool, based on Nudging, an equation-informed unbiased protocol, well known in the numerical weather prediction community. The TURB-Rot database, this http URL, of roughly 300K 2d turbulent images is released and details on how to download it are given.
    Grounding Language to Entities and Dynamics for Generalization in Reinforcement Learning. (arXiv:2101.07393v2 [cs.CL] UPDATED)
    (2 min) We investigate the use of natural language to drive the generalization of control policies and introduce the new multi-task environment Messenger with free-form text manuals describing the environment dynamics. Unlike previous work, Messenger does not assume prior knowledge connecting text and state observations $-$ the control policy must simultaneously ground the game manual to entity symbols and dynamics in the environment. We develop a new model, EMMA (Entity Mapper with Multi-modal Attention) which uses an entity-conditioned attention module that allows for selective focus over relevant descriptions in the manual for each entity in the environment. EMMA is end-to-end differentiable and learns a latent grounding of entities and dynamics from text to observations using only environment rewards. EMMA achieves successful zero-shot generalization to unseen games with new dynamics, obtaining a 40% higher win rate compared to multiple baselines. However, win rate on the hardest stage of Messenger remains low (10%), demonstrating the need for additional work in this direction.
    Cluster-to-Conquer: A Framework for End-to-End Multi-Instance Learning for Whole Slide Image Classification. (arXiv:2103.10626v2 [eess.IV] UPDATED)
    (2 min) In recent years, the availability of digitized Whole Slide Images (WSIs) has enabled the use of deep learning-based computer vision techniques for automated disease diagnosis. However, WSIs present unique computational and algorithmic challenges. WSIs are gigapixel-sized ($\sim$100K pixels), making them infeasible to be used directly for training deep neural networks. Also, often only slide-level labels are available for training as detailed annotations are tedious and can be time-consuming for experts. Approaches using multiple-instance learning (MIL) frameworks have been shown to overcome these challenges. Current state-of-the-art approaches divide the learning framework into two decoupled parts: a convolutional neural network (CNN) for encoding the patches followed by an independent aggregation approach for slide-level prediction. In this approach, the aggregation step has no bearing on the representations learned by the CNN encoder. We have proposed an end-to-end framework that clusters the patches from a WSI into ${k}$-groups, samples ${k}'$ patches from each group for training, and uses an adaptive attention mechanism for slide level prediction; Cluster-to-Conquer (C2C). We have demonstrated that dividing a WSI into clusters can improve the model training by exposing it to diverse discriminative features extracted from the patches. We regularized the clustering mechanism by introducing a KL-divergence loss between the attention weights of patches in a cluster and the uniform distribution. The framework is optimized end-to-end on slide-level cross-entropy, patch-level cross-entropy, and KL-divergence loss (Implementation: https://github.com/YashSharma/C2C).
    A Novel Interaction-based Methodology Towards Explainable AI with Better Understanding of Pneumonia Chest X-ray Images. (arXiv:2104.12672v2 [cs.LG] UPDATED)
    (2 min) In the field of eXplainable AI (XAI), robust ``blackbox'' algorithms such as Convolutional Neural Networks (CNNs) are known for making high prediction performance. However, the ability to explain and interpret these algorithms still require innovation in the understanding of influential and, more importantly, explainable features that directly or indirectly impact the performance of predictivity. A number of methods existing in literature focus on visualization techniques but the concepts of explainability and interpretability still require rigorous definition. In view of the above needs, this paper proposes an interaction-based methodology -- Influence Score (I-score) -- to screen out the noisy and non-informative variables in the images hence it nourishes an environment with explainable and interpretable features that are directly associated to feature predictivity. We apply the proposed method on a real world application in Pneumonia Chest X-ray Image data set and produced state-of-the-art results. We demonstrate how to apply the proposed approach for more general big data problems by improving the explainability and interpretability without sacrificing the prediction performance. The contribution of this paper opens a novel angle that moves the community closer to the future pipelines of XAI problems.
    Transferability of Spectral Graph Convolutional Neural Networks. (arXiv:1907.12972v3 [cs.LG] UPDATED)
    (2 min) This paper focuses on spectral graph convolutional neural networks (ConvNets), where filters are defined as elementwise multiplication in the frequency domain of a graph. In machine learning settings where the dataset consists of signals defined on many different graphs, the trained ConvNet should generalize to signals on graphs unseen in the training set. It is thus important to transfer ConvNets between graphs. Transferability, which is a certain type of generalization capability, can be loosely defined as follows: if two graphs describe the same phenomenon, then a single filter or ConvNet should have similar repercussions on both graphs. This paper aims at debunking the common misconception that spectral filters are not transferable. We show that if two graphs discretize the same "continuous" space, then a spectral filter or ConvNet has approximately the same repercussion on both graphs. Our analysis is more permissive than the standard analysis. Transferability is typically described as the robustness of the filter to small graph perturbations and re-indexing of the vertices. Our analysis accounts also for large graph perturbations. We prove transferability between graphs that can have completely different dimensions and topologies, only requiring that both graphs discretize the same underlying space in some generic sense.
    Understanding Generalization in Adversarial Training via the Bias-Variance Decomposition. (arXiv:2103.09947v2 [cs.LG] UPDATED)
    (2 min) Adversarially trained models exhibit a large generalization gap: they can interpolate the training set even for large perturbation radii, but at the cost of large test error on clean samples. To investigate this gap, we decompose the test risk into its bias and variance components and study their behavior as a function of adversarial training perturbation radii ($\varepsilon$). We find that the bias increases monotonically with $\varepsilon$ and is the dominant term in the risk. Meanwhile, the variance is unimodal as a function of $\varepsilon$, peaking near the interpolation threshold for the training set. This characteristic behavior occurs robustly across different datasets and also for other robust training procedures such as randomized smoothing. It thus provides a test for proposed explanations of the generalization gap. We find that some existing explanations fail this test--for instance, by predicting a monotonically increasing variance curve. This underscores the power of bias-variance decompositions in modern settings-by providing two measurements instead of one, they can rule out more explanations than test accuracy alone. We also show that bias and variance can provide useful guidance for scalably reducing the generalization gap, highlighting pre-training and unlabeled data as promising routes.
    Urysohn Forest for Aleatoric Uncertainty Quantification. (arXiv:2104.01714v2 [cs.LG] UPDATED)
    (2 min) The terms tree and forest are normally associated with an ensemble of classifiers. In this article Urysohn tree is a regression model representing multiple discrete Urysohn operators connected as a tree, where the inputs of one operator are outputs of the others. This structure, referred as Urysohn tree, is not completely new. One example of such tree is known for more than half a century. It is Kolmogorov-Arnold representation. The authors of this paper in their recently published research offered the new computational technique for generating of Kolmogorov-Arnold representation as a deep machine learning process. This article is two steps further into this research. First is a Urysohn tree with multiple hidden layers which is generalization of Kolmogorov-Arnold model and second is a boosting algorithm for building of the forest of such trees for modeling of aleatoric uncertainty of the data.
    Recoverability Landscape of Tree Structured Markov Random Fields under Symmetric Noise. (arXiv:2102.08554v3 [stat.ML] UPDATED)
    (2 min) We study the problem of learning tree-structured Markov random fields (MRF) on discrete random variables with common support when the observations are corrupted by a $k$-ary symmetric noise channel with unknown probability of error. For Ising models (support size = 2), past work has shown that graph structure can only be recovered up to the leaf clusters (a leaf node, its parent, and its siblings form a leaf cluster) and exact recovery is impossible. No prior work has addressed the setting of support size of 3 or more, and indeed this setting is far richer. As we show, when the support size is 3 or more, the structure of the leaf clusters may be partially or fully identifiable. We provide a precise characterization of this phenomenon and show that the extent of recoverability is dictated by the joint PMF of the random variables. In particular, we provide necessary and sufficient conditions for exact recoverability. Furthermore, we present a polynomial time, sample efficient algorithm that recovers the exact tree when this is possible, or up to the unidentifiability as promised by our characterization, when full recoverability is impossible. Finally, we demonstrate the efficacy of our algorithm experimentally.
    Meta-StyleSpeech : Multi-Speaker Adaptive Text-to-Speech Generation. (arXiv:2106.03153v2 [eess.AS] UPDATED)
    (2 min) With rapid progress in neural text-to-speech (TTS) models, personalized speech generation is now in high demand for many applications. For practical applicability, a TTS model should generate high-quality speech with only a few audio samples from the given speaker, that are also short in length. However, existing methods either require to fine-tune the model or achieve low adaptation quality without fine-tuning. In this work, we propose StyleSpeech, a new TTS model which not only synthesizes high-quality speech but also effectively adapts to new speakers. Specifically, we propose Style-Adaptive Layer Normalization (SALN) which aligns gain and bias of the text input according to the style extracted from a reference speech audio. With SALN, our model effectively synthesizes speech in the style of the target speaker even from single speech audio. Furthermore, to enhance StyleSpeech's adaptation to speech from new speakers, we extend it to Meta-StyleSpeech by introducing two discriminators trained with style prototypes, and performing episodic training. The experimental results show that our models generate high-quality speech which accurately follows the speaker's voice with single short-duration (1-3 sec) speech audio, significantly outperforming baselines.
    Heterogeneous Risk Minimization. (arXiv:2105.03818v2 [cs.LG] UPDATED)
    (2 min) Machine learning algorithms with empirical risk minimization usually suffer from poor generalization performance due to the greedy exploitation of correlations among the training data, which are not stable under distributional shifts. Recently, some invariant learning methods for out-of-distribution (OOD) generalization have been proposed by leveraging multiple training environments to find invariant relationships. However, modern datasets are frequently assembled by merging data from multiple sources without explicit source labels. The resultant unobserved heterogeneity renders many invariant learning methods inapplicable. In this paper, we propose Heterogeneous Risk Minimization (HRM) framework to achieve joint learning of latent heterogeneity among the data and invariant relationship, which leads to stable prediction despite distributional shifts. We theoretically characterize the roles of the environment labels in invariant learning and justify our newly proposed HRM framework. Extensive experimental results validate the effectiveness of our HRM framework.
    EL-Attention: Memory Efficient Lossless Attention for Generation. (arXiv:2105.04779v2 [cs.CL] UPDATED)
    (2 min) Transformer model with multi-head attention requires caching intermediate results for efficient inference in generation tasks. However, cache brings new memory-related costs and prevents leveraging larger batch size for faster speed. We propose memory-efficient lossless attention (called EL-attention) to address this issue. It avoids heavy operations for building multi-head keys and values, cache for them is not needed. EL-attention constructs an ensemble of attention results by expanding query while keeping key and value shared. It produces the same result as multi-head attention with less GPU memory and faster inference speed. We conduct extensive experiments on Transformer, BART, and GPT-2 for summarization and question generation tasks. The results show EL-attention speeds up existing models by 1.6x to 5.3x without accuracy loss.
    PDE-constrained Models with Neural Network Terms: Optimization and Global Convergence. (arXiv:2105.08633v2 [cs.LG] UPDATED)
    (2 min) Recent research has used deep learning to develop partial differential equation (PDE) models in science and engineering. The functional form of the PDE is determined by a neural network, and the neural network parameters are calibrated to available data. Calibration of the embedded neural network can be performed by optimizing over the PDE. Motivated by these applications, we rigorously study the optimization of a class of linear elliptic PDEs with neural network terms. The neural network parameters in the PDE are optimized using gradient descent, where the gradient is evaluated using an adjoint PDE. As the number of parameters become large, the PDE and adjoint PDE converge to a non-local PDE system. Using this limit PDE system, we are able to prove convergence of the neural network-PDE to a global minimum during the optimization. The limit PDE system contains a non-local linear operator whose eigenvalues are positive but become arbitrarily small. The lack of a spectral gap for the eigenvalues poses the main challenge for the global convergence proof. Careful analysis of the spectral decomposition of the coupled PDE and adjoint PDE system is required. Finally, we use this adjoint method to train a neural network model for an application in fluid mechanics, in which the neural network functions as a closure model for the Reynolds-averaged Navier-Stokes (RANS) equations. The RANS neural network model is trained on several datasets for turbulent channel flow and is evaluated out-of-sample at different Reynolds numbers.
    Link Prediction with Persistent Homology: An Interactive View. (arXiv:2102.10255v2 [cs.LG] UPDATED)
    (2 min) Link prediction is an important learning task for graph-structured data. In this paper, we propose a novel topological approach to characterize interactions between two nodes. Our topological feature, based on the extended persistent homology, encodes rich structural information regarding the multi-hop paths connecting nodes. Based on this feature, we propose a graph neural network method that outperforms state-of-the-arts on different benchmarks. As another contribution, we propose a novel algorithm to more efficiently compute the extended persistence diagrams for graphs. This algorithm can be generally applied to accelerate many other topological methods for graph learning tasks.
    Online Learning and Distributed Control for Residential Demand Response. (arXiv:2010.05153v2 [eess.SY] UPDATED)
    (2 min) This paper studies the automated control method for regulating air conditioner (AC) loads in incentive-based residential demand response (DR). The critical challenge is that the customer responses to load adjustment are uncertain and unknown in practice. In this paper, we formulate the AC control problem in a DR event as a multi-period stochastic optimization that integrates the indoor thermal dynamics and customer opt-out status transition. Specifically, machine learning techniques including Gaussian process and logistic regression are employed to learn the unknown thermal dynamics model and customer opt-out behavior model, respectively. We consider two typical DR objectives for AC load control: 1) minimizing the total demand, 2) closely tracking a regulated power trajectory. Based on the Thompson sampling framework, we propose an online DR control algorithm to learn customer behaviors and make real-time AC control schemes. This algorithm considers the influence of various environmental factors on customer behaviors and is implemented in a distributed fashion to preserve the privacy of customers. Numerical simulations demonstrate the control optimality and learning efficiency of the proposed algorithm.
    Robust Optimization for Multilingual Translation with Imbalanced Data. (arXiv:2104.07639v3 [cs.CL] UPDATED)
    (2 min) Multilingual models are parameter-efficient with the prospect improving low-resource languages by leveraging crosslingual transfer. Despite recent advance in massive multilingual translation with ever-growing model and data, how to effectively train multilingual models has not been well understood. In this paper, we show that a common situation in multilingual training, data imbalance among languages, poses optimization tension between high resource and low resource languages where the found multilingual solution is often sub-optimal for low resources. We show that common training method which upsamples low resources can not robustly optimize population loss with risks of either underfitting high resource languages or overfitting low resource ones. Drawing on recent findings on the geometry of loss landscape and its effect on generalization, we propose a principled optimization algorithm, Curvature Aware Task Scaling (CATS), which adaptively rescales gradients from different tasks with a meta objective of guiding multilingual training to low-curvature neighborhoods with uniformly low loss for all languages. We ran experiments on common benchmarks (TED, WMT and OPUS-100) with varying degrees of data imbalance. CATS effectively improved multilingual optimization and as a result demonstrated consistent gains on low resources ( to BLEU) without hurting high resources. In addition, CATS is robust to overparameterization and large batch size training, making it a promising training method for massive multilingual models that truly improve low resource languages.
    BoolNet: Minimizing The Energy Consumption of Binary Neural Networks. (arXiv:2106.06991v1 [cs.LG])
    (2 min) Recent works on Binary Neural Networks (BNNs) have made promising progress in narrowing the accuracy gap of BNNs to their 32-bit counterparts. However, the accuracy gains are often based on specialized model designs using additional 32-bit components. Furthermore, almost all previous BNNs use 32-bit for feature maps and the shortcuts enclosing the corresponding binary convolution blocks, which helps to effectively maintain the accuracy, but is not friendly to hardware accelerators with limited memory, energy, and computing resources. Thus, we raise the following question: How can accuracy and energy consumption be balanced in a BNN network design? We extensively study this fundamental problem in this work and propose a novel BNN architecture without most commonly used 32-bit components: \textit{BoolNet}. Experimental results on ImageNet demonstrate that BoolNet can achieve 4.6x energy reduction coupled with 1.2\% higher accuracy than the commonly used BNN architecture Bi-RealNet. Code and trained models are available at: https://github.com/hpi-xnor/BoolNet.
    SKIing on Simplices: Kernel Interpolation on the Permutohedral Lattice for Scalable Gaussian Processes. (arXiv:2106.06695v1 [cs.LG])
    (2 min) State-of-the-art methods for scalable Gaussian processes use iterative algorithms, requiring fast matrix vector multiplies (MVMs) with the covariance kernel. The Structured Kernel Interpolation (SKI) framework accelerates these MVMs by performing efficient MVMs on a grid and interpolating back to the original space. In this work, we develop a connection between SKI and the permutohedral lattice used for high-dimensional fast bilateral filtering. Using a sparse simplicial grid instead of a dense rectangular one, we can perform GP inference exponentially faster in the dimension than SKI. Our approach, Simplex-GP, enables scaling SKI to high dimensions, while maintaining strong predictive performance. We additionally provide a CUDA implementation of Simplex-GP, which enables significant GPU acceleration of MVM based inference.
    Latent-Optimized Adversarial Neural Transfer for Sarcasm Detection. (arXiv:2104.09261v2 [cs.LG] UPDATED)
    (2 min) The existence of multiple datasets for sarcasm detection prompts us to apply transfer learning to exploit their commonality. The adversarial neural transfer (ANT) framework utilizes multiple loss terms that encourage the source-domain and the target-domain feature distributions to be similar while optimizing for domain-specific performance. However, these objectives may be in conflict, which can lead to optimization difficulties and sometimes diminished transfer. We propose a generalized latent optimization strategy that allows different losses to accommodate each other and improves training dynamics. The proposed method outperforms transfer learning and meta-learning baselines. In particular, we achieve 10.02% absolute performance gain over the previous state of the art on the iSarcasm dataset.
    Multi-facet Contextual Bandits: A Neural Network Perspective. (arXiv:2106.03039v2 [cs.LG] UPDATED)
    (2 min) Contextual multi-armed bandit has shown to be an effective tool in recommender systems. In this paper, we study a novel problem of multi-facet bandits involving a group of bandits, each characterizing the users' needs from one unique aspect. In each round, for the given user, we need to select one arm from each bandit, such that the combination of all arms maximizes the final reward. This problem can find immediate applications in E-commerce, healthcare, etc. To address this problem, we propose a novel algorithm, named MuFasa, which utilizes an assembled neural network to jointly learn the underlying reward functions of multiple bandits. It estimates an Upper Confidence Bound (UCB) linked with the expected reward to balance between exploitation and exploration. Under mild assumptions, we provide the regret analysis of MuFasa. It can achieve the near-optimal $\widetilde{ \mathcal{O}}((K+1)\sqrt{T})$ regret bound where $K$ is the number of bandits and $T$ is the number of played rounds. Furthermore, we conduct extensive experiments to show that MuFasa outperforms strong baselines on real-world data sets.
    Meta-Learning Dynamics Forecasting Using Task Inference. (arXiv:2102.10271v2 [cs.LG] UPDATED)
    (2 min) Current deep learning models for dynamics forecasting struggle with generalization. They can only forecast in a specific domain and fail when applied to systems with different parameters, external forces, or boundary conditions. We propose a model-based meta-learning method called DyAd which can generalize across heterogeneous domains by partitioning them into different tasks. DyAd has two parts: an encoder which infers the time-invariant hidden features of the task with weak supervision, and a forecaster which learns the shared dynamics of the entire domain. The encoder adapts and controls the forecaster during inference using adaptive instance normalization and adaptive padding. Theoretically, we prove that the generalization error of such procedure is related to the task relatedness in the source domain, as well as the domain differences between source and target. Experimentally, we demonstrate that our model outperforms state-of-the-art approaches on both turbulent flow and real-world ocean data forecasting tasks.
    FeatureNorm: L2 Feature Normalization for Dynamic Graph Embedding. (arXiv:2103.00164v2 [cs.LG] UPDATED)
    (2 min) Dynamic graphs arise in a plethora of practical scenarios such as social networks, communication networks, and financial transaction networks. Given a dynamic graph, it is fundamental and essential to learn a graph representation that is expected not only to preserve structural proximity but also jointly capture the time-evolving patterns. Recently, graph convolutional network (GCN) has been widely explored and used in non-Euclidean application domains. The main success of GCN, especially in handling dependencies and passing messages within nodes, lies in its approximation to Laplacian smoothing. As a matter of fact, this smoothing technique can not only encourage must-link node pairs to get closer but also push cannot-link pairs to shrink together, which potentially cause serious feature shrink or oversmoothing problem, especially when stacking graph convolution in multiple layers or steps. For learning time-evolving patterns, a natural solution is to preserve historical state and combine it with the current interactions to obtain the most recent representation. Then the serious feature shrink or oversmoothing problem could happen when stacking graph convolution explicitly or implicitly according to current prevalent methods, which would make nodes too similar to distinguish each other. To solve this problem in dynamic graph embedding, we analyze the shrinking properties in the node embedding space at first, and then design a simple yet versatile method, which exploits L2 feature normalization constraint to rescale all nodes to hypersphere of a unit ball so that nodes would not shrink together, and yet similar nodes can still get closer. Extensive experiments on four real-world dynamic graph datasets compared with competitive baseline models demonstrate the effectiveness of the proposed method.
    Multi-Disease Classification of 13,667 Body CT Scans Using Weakly Supervised Deep Learning. (arXiv:2008.01158v2 [cs.CV] UPDATED)
    (2 min) Background: Training deep learning classifiers typically requires massive amounts of manual annotation. Weak supervision may leverage existing medical data to classify multiple diseases and organ systems. Purpose: To design multi-disease classifiers for body computed tomography (CT) scans using automatically extracted labels from radiology text reports. Materials & Methods: This retrospective study deployed rule-based algorithms to extract 19,255 disease labels from reports of 13,667 body CT scans of 12,092 subjects for training. Using a 3D DenseVNet, three organ systems were segmented: lungs/pleura, liver/gallbladder, and kidneys/ureters. For each organ, a 3D convolutional neural network classified normality versus four common diseases. Testing was performed on an additional 2,158 CT volumes relative to 2,875 manually derived reference labels. Results: Manual validation of the extracted labels confirmed 91 to 99% accuracy. Performance using the receiver operating characteristic area under the curve (AUC) for lungs/pleura labels were as follows: atelectasis 0.77 (95% CI: 0.74 to 0.81), nodule 0.65 (0.61 to 0.69), emphysema 0.89 (0.86 to 0.92), effusion 0.97 (0.96 to 0.98), and normal 0.89 (0.87 to 0.91). For liver/gallbladder: stone 0.62 (0.56 to 0.67), lesion 0.73 (0.69 to 0.77), dilation 0.87 (0.84 to 0.90), fatty 0.89 (0.86 to 0.92), and normal 0.82 (0.78 to 0.85). For kidneys/ureters: stone 0.83 (0.79 to 0.87), atrophy 0.92 (0.89 to 0.94), lesion 0.68 (0.64 to 0.72), cyst 0.70 (0.66 to 0.73), and normal 0.79 (0.75 to 0.83). Conclusion: Weakly supervised deep learning classifiers leveraged massive amounts of unannotated body CT data to classify multiple organ systems and diverse diseases.
    Prediction with Unpredictable Feature Evolution. (arXiv:1904.12171v2 [cs.LG] UPDATED)
    (2 min) Learning with feature evolution studies the scenario where the features of the data streams can evolve, i.e., old features vanish and new features emerge. Its goal is to keep the model always performing well even when the features happen to evolve. To tackle this problem, canonical methods assume that the old features will vanish simultaneously and the new features themselves will emerge simultaneously as well. They also assume there is an overlapping period where old and new features both exist when the feature space starts to change. However, in reality, the feature evolution could be unpredictable, which means the features can vanish or emerge arbitrarily, causing the overlapping period incomplete. In this paper, we propose a novel paradigm: Prediction with Unpredictable Feature Evolution (PUFE) where the feature evolution is unpredictable. To address this problem, we fill the incomplete overlapping period and formulate it as a new matrix completion problem. We give a theoretical bound on the least number of observed entries to make the overlapping period intact. With this intact overlapping period, we leverage an ensemble method to take the advantage of both the old and new feature spaces without manually deciding which base models should be incorporated. Theoretical and experimental results validate that our method can always follow the best base models and thus realize the goal of learning with feature evolution.
    AutoScore-Survival: Developing interpretable machine learning-based time-to-event scores with right-censored survival data. (arXiv:2106.06957v1 [cs.LG])
    (2 min) Scoring systems are highly interpretable and widely used to evaluate time-to-event outcomes in healthcare research. However, existing time-to-event scores are predominantly created ad-hoc using a few manually selected variables based on clinician's knowledge, suggesting an unmet need for a robust and efficient generic score-generating method. AutoScore was previously developed as an interpretable machine learning score generator, integrated both machine learning and point-based scores in the strong discriminability and accessibility. We have further extended it to time-to-event data and developed AutoScore-Survival, for automatically generating time-to-event scores with right-censored survival data. Random survival forest provides an efficient solution for selecting variables, and Cox regression was used for score weighting. We illustrated our method in a real-life study of 90-day mortality of patients in intensive care units and compared its performance with survival models (i.e., Cox) and the random survival forest. The AutoScore-Survival-derived scoring model was more parsimonious than survival models built using traditional variable selection methods (e.g., penalized likelihood approach and stepwise variable selection), and its performance was comparable to survival models using the same set of variables. Although AutoScore-Survival achieved a comparable integrated area under the curve of 0.782 (95% CI: 0.767-0.794), the integer-valued time-to-event scores generated are favorable in clinical applications because they are easier to compute and interpret. Our proposed AutoScore-Survival provides an automated, robust and easy-to-use machine learning-based clinical score generator to studies of time-to-event outcomes. It provides a systematic guideline to facilitate the future development of time-to-event scores for clinical applications.
    The DEformer: An Order-Agnostic Distribution Estimating Transformer. (arXiv:2106.06989v1 [cs.LG])
    (2 min) Order-agnostic autoregressive distribution estimation (OADE), i.e., autoregressive distribution estimation where the features can occur in an arbitrary order, is a challenging problem in generative machine learning. Prior work on OADE has encoded feature identity (e.g., pixel location) by assigning each feature to a distinct fixed position in an input vector. As a result, architectures built for these inputs must strategically mask either the input or model weights to learn the various conditional distributions necessary for inferring the full joint distribution of the dataset in an order-agnostic way. In this paper, we propose an alternative approach for encoding feature identities, where each feature's identity is included alongside its value in the input. This feature identity encoding strategy allows neural architectures designed for sequential data to be applied to the OADE task without modification. As a proof of concept, we show that a Transformer trained on this input (which we refer to as "the DEformer", i.e., the distribution estimating Transformer) can effectively model binarized-MNIST, approaching the average negative log-likelihood of fixed order autoregressive distribution estimating algorithms while still being entirely order-agnostic.
    Double/Debiased Machine Learning for Dynamic Treatment Effects via g-Estimation. (arXiv:2002.07285v4 [econ.EM] UPDATED)
    (2 min) We consider the estimation of treatment effects in settings when multiple treatments are assigned over time and treatments can have a causal effect on future outcomes or the state of the treated unit. We propose an extension of the double/debiased machine learning framework to estimate the dynamic effects of treatments, which can be viewed as a Neyman orthogonal (locally robust) cross-fitted version of $g$-estimation in the dynamic treatment regime. Our method applies to a general class of non-linear dynamic treatment models known as Structural Nested Mean Models and allows the use of machine learning methods to control for potentially high dimensional state variables, subject to a mean square error guarantee, while still allowing parametric estimation and construction of confidence intervals for the structural parameters of interest. These structural parameters can be used for off-policy evaluation of any target dynamic policy at parametric rates, subject to semi-parametric restrictions on the data generating process. Our work is based on a recursive peeling process, typical in $g$-estimation, and formulates a strongly convex objective at each stage, which allows us to extend the $g$-estimation framework in multiple directions: i) to provide finite sample guarantees, ii) to estimate non-linear effect heterogeneity with respect to fixed unit characteristics, within arbitrary function spaces, enabling a dynamic analogue of the RLearner algorithm for heterogeneous effects, iii) to allow for high-dimensional sparse parameterizations of the target structural functions, enabling automated model selection via a recursive lasso algorithm. We also provide guarantees for data stemming from a single treated unit over a long horizon and under stationarity conditions.
    DeepShift: Towards Multiplication-Less Neural Networks. (arXiv:1905.13298v4 [cs.LG] UPDATED)
    (3 min) The high computation, memory, and power budgets of inferring convolutional neural networks (CNNs) are major bottlenecks of model deployment to edge computing platforms, e.g., mobile devices and IoT. Moreover, training CNNs is time and energy-intensive even on high-grade servers. Convolution layers and fully connected layers, because of their intense use of multiplications, are the dominant contributor to this computation budget. We propose to alleviate this problem by introducing two new operations: convolutional shifts and fully-connected shifts which replace multiplications with bitwise shift and sign flipping during both training and inference. During inference, both approaches require only 5 bits (or less) to represent the weights. This family of neural network architectures (that use convolutional shifts and fully connected shifts) is referred to as DeepShift models. We propose two methods to train DeepShift models: DeepShift-Q which trains regular weights constrained to powers of 2, and DeepShift-PS that trains the values of the shifts and sign flips directly. Very close accuracy, and in some cases higher accuracy, to baselines are achieved. Converting pre-trained 32-bit floating-point baseline models of ResNet18, ResNet50, VGG16, and GoogleNet to DeepShift and training them for 15 to 30 epochs, resulted in Top-1/Top-5 accuracies higher than that of the original model. Last but not least, we implemented the convolutional shifts and fully connected shift GPU kernels and showed a reduction in latency time of 25% when inferring ResNet18 compared to unoptimized multiplication-based GPU kernels. The code can be found at https://github.com/mostafaelhoushi/DeepShift.
    Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving. (arXiv:2002.03629v2 [cs.LG] UPDATED)
    (2 min) Feedforward computation, such as evaluating a neural network or sampling from an autoregressive model, is ubiquitous in machine learning. The sequential nature of feedforward computation, however, requires a strict order of execution and cannot be easily accelerated with parallel computing. To enable parallelization, we frame the task of feedforward computation as solving a system of nonlinear equations. We then propose to find the solution using a Jacobi or Gauss-Seidel fixed-point iteration method, as well as hybrid methods of both. Crucially, Jacobi updates operate independently on each equation and can be executed in parallel. Our method is guaranteed to give exactly the same values as the original feedforward computation with a reduced (or equal) number of parallelizable iterations, and hence reduced time given sufficient parallel computing power. Experimentally, we demonstrate the effectiveness of our approach in accelerating (i) backpropagation of RNNs, (ii) evaluation of DenseNets, and (iii) autoregressive sampling of MADE and PixelCNN++, with speedup factors between 2.1 and 26 under various settings.
    Breaking the Limit of Graph Neural Networks by Improving the Assortativity of Graphs with Local Mixing Patterns. (arXiv:2106.06586v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) have achieved tremendous success on multiple graph-based learning tasks by fusing network structure and node features. Modern GNN models are built upon iterative aggregation of neighbor's/proximity features by message passing. Its prediction performance has been shown to be strongly bounded by assortative mixing in the graph, a key property wherein nodes with similar attributes mix/connect with each other. We observe that real world networks exhibit heterogeneous or diverse mixing patterns and the conventional global measurement of assortativity, such as global assortativity coefficient, may not be a representative statistic in quantifying this mixing. We adopt a generalized concept, node-level assortativity, one that is based at the node level to better represent the diverse patterns and accurately quantify the learnability of GNNs. We find that the prediction performance of a wide range of GNN models is highly correlated with the node level assortativity. To break this limit, in this work, we focus on transforming the input graph into a computation graph which contains both proximity and structural information as distinct type of edges. The resulted multi-relational graph has an enhanced level of assortativity and, more importantly, preserves rich information from the original graph. We then propose to run GNNs on this computation graph and show that adaptively choosing between structure and proximity leads to improved performance under diverse mixing. Empirically, we show the benefits of adopting our transformation framework for semi-supervised node classification task on a variety of real world graph learning benchmarks.
    Bayesian Inference Gaussian Process Multiproxy Alignment of Continuous Signals (BIGMACS): Applications for Paleoceanography. (arXiv:1907.08738v4 [stat.AP] UPDATED)
    (3 min) We first introduce a novel profile-based alignment algorithm, the multiple continuous Signal Alignment algorithm with Gaussian Process Regression profiles (SA-GPR). SA-GPR addresses the limitations of currently available signal alignment methods by adopting a hybrid of the particle smoothing and Markov-chain Monte Carlo (MCMC) algorithms to align signals, and by applying the Gaussian process regression to construct profiles to be aligned continuously. SA-GPR shares all the strengths of the existing alignment algorithms that depend on profiles but is more exact in the sense that profiles do not need to be discretized as sequential bins. The uncertainty of performance over the resolution of such bins is thereby eliminated. This methodology produces alignments that are consistent, that regularize extreme cases, and that properly reflect the inherent uncertainty. Then we extend SA-GPR to a specific problem in the field of paleoceanography with a method called Bayesian Inference Gaussian Process Multiproxy Alignment of Continuous Signals (BIGMACS). The goal of BIGMACS is to infer continuous ages for ocean sediment cores using two classes of age proxies: proxies that explicitly return calendar ages (e.g., radiocarbon) and those used to synchronize ages in multiple marine records (e.g., an oxygen isotope based marine proxy known as benthic ${\delta}^{18}{\rm O}$). BIGMACS integrates these two proxies by iteratively performing two steps: profile construction from benthic ${\delta}^{18}{\rm O}$ age models and alignment of each core to the profile also reflecting radiocarbon dates. We use BIGMACS to construct a new Deep Northeastern Atlantic stack (i.e., a profile from a particular benthic ${\delta}^{18}{\rm O}$ records) of five ocean sediment cores. We conclude by constructing multiproxy age models for two additional cores from the same region by aligning them to the stack.
    Transformation Importance with Applications to Cosmology. (arXiv:2003.01926v2 [stat.ML] UPDATED)
    (2 min) Machine learning lies at the heart of new possibilities for scientific discovery, knowledge generation, and artificial intelligence. Its potential benefits to these fields requires going beyond predictive accuracy and focusing on interpretability. In particular, many scientific problems require interpretations in a domain-specific interpretable feature space (e.g. the frequency domain) whereas attributions to the raw features (e.g. the pixel space) may be unintelligible or even misleading. To address this challenge, we propose TRIM (TRansformation IMportance), a novel approach which attributes importances to features in a transformed space and can be applied post-hoc to a fully trained model. TRIM is motivated by a cosmological parameter estimation problem using deep neural networks (DNNs) on simulated data, but it is generally applicable across domains/models and can be combined with any local interpretation method. In our cosmology example, combining TRIM with contextual decomposition shows promising results for identifying which frequencies a DNN uses, helping cosmologists to understand and validate that the model learns appropriate physical features rather than simulation artifacts.
    Uncovering the Connections Between Adversarial Transferability and Knowledge Transferability. (arXiv:2006.14512v2 [cs.LG] UPDATED)
    (2 min) Knowledge transferability, or transfer learning, has been widely adopted to allow a pre-trained model in the source domain to be effectively adapted to downstream tasks in the target domain. It is thus important to explore and understand the factors affecting knowledge transferability. In this paper, as the first work, we analyze and demonstrate the connections between knowledge transferability and another important phenomenon--adversarial transferability, \emph{i.e.}, adversarial examples generated against one model can be transferred to attack other models. Our theoretical studies show that adversarial transferability indicates knowledge transferability and vice versa. Moreover, based on the theoretical insights, we propose two practical adversarial transferability metrics to characterize this process, serving as bidirectional indicators between adversarial and knowledge transferability. We conduct extensive experiments for different scenarios on diverse datasets, showing a positive correlation between adversarial transferability and knowledge transferability. Our findings will shed light on future research about effective knowledge transfer learning and adversarial transferability analyses.
    Semi-supervised Active Regression. (arXiv:2106.06676v1 [cs.LG])
    (2 min) Labelled data often comes at a high cost as it may require recruiting human labelers or running costly experiments. At the same time, in many practical scenarios, one already has access to a partially labelled, potentially biased dataset that can help with the learning task at hand. Motivated by such settings, we formally initiate a study of $semi-supervised$ $active$ $learning$ through the frame of linear regression. In this setting, the learner has access to a dataset $X \in \mathbb{R}^{(n_1+n_2) \times d}$ which is composed of $n_1$ unlabelled examples that an algorithm can actively query, and $n_2$ examples labelled a-priori. Concretely, denoting the true labels by $Y \in \mathbb{R}^{n_1 + n_2}$, the learner's objective is to find $\widehat{\beta} \in \mathbb{R}^d$ such that, \begin{equation} \| X \widehat{\beta} - Y \|_2^2 \le (1 + \epsilon) \min_{\beta \in \mathbb{R}^d} \| X \beta - Y \|_2^2 \end{equation} while making as few additional label queries as possible. In order to bound the label queries, we introduce an instance dependent parameter called the reduced rank, denoted by $R_X$, and propose an efficient algorithm with query complexity $O(R_X/\epsilon)$. This result directly implies improved upper bounds for two important special cases: (i) active ridge regression, and (ii) active kernel ridge regression, where the reduced-rank equates to the statistical dimension, $sd_\lambda$ and effective dimension, $d_\lambda$ of the problem respectively, where $\lambda \ge 0$ denotes the regularization parameter. For active ridge regression we also prove a matching lower bound of $O(sd_\lambda / \epsilon)$ on the query complexity of any algorithm. This subsumes prior work that only considered the unregularized case, i.e., $\lambda = 0$.
    Shared Cross-Modal Trajectory Prediction for Autonomous Driving. (arXiv:2004.00202v3 [cs.CV] UPDATED)
    (2 min) Predicting future trajectories of traffic agents in highly interactive environments is an essential and challenging problem for the safe operation of autonomous driving systems. On the basis of the fact that self-driving vehicles are equipped with various types of sensors (e.g., LiDAR scanner, RGB camera, radar, etc.), we propose a Cross-Modal Embedding framework that aims to benefit from the use of multiple input modalities. At training time, our model learns to embed a set of complementary features in a shared latent space by jointly optimizing the objective functions across different types of input data. At test time, a single input modality (e.g., LiDAR data) is required to generate predictions from the input perspective (i.e., in the LiDAR space), while taking advantages from the model trained with multiple sensor modalities. An extensive evaluation is conducted to show the efficacy of the proposed framework using two benchmark driving datasets.
    Distributed Saddle-Point Problems: Lower Bounds, Optimal and Robust Algorithms. (arXiv:2010.13112v6 [cs.LG] UPDATED)
    (2 min) This paper focuses on the distributed optimization of smooth stochastic saddle-point problems. The first part of the paper is devoted to lower bounds for the cenralized and decentralized distributed methods for smooth (strongly-)convex-(strongly-)concave saddle-point problems as well as the optimal algorithms by which these bounds are achieved. Next, we present a new federated algorithm for saddle-point problems - Extra Step Local SGD. Theoretical analysis of the new method is carried out for (strongly-)convex-(strongly-)concave and non-convex-non-concave problems. In the experimental part of the paper, we show the effectiveness of our method in practice. In particular, we train GANs in a distributed manner.
    Few-Shot Learning via Embedding Adaptation with Set-to-Set Functions. (arXiv:1812.03664v6 [cs.LG] UPDATED)
    (2 min) Learning with limited data is a key challenge for visual recognition. Many few-shot learning methods address this challenge by learning an instance embedding function from seen classes and apply the function to instances from unseen classes with limited labels. This style of transfer learning is task-agnostic: the embedding function is not learned optimally discriminative with respect to the unseen classes, where discerning among them leads to the target task. In this paper, we propose a novel approach to adapt the instance embeddings to the target classification task with a set-to-set function, yielding embeddings that are task-specific and are discriminative. We empirically investigated various instantiations of such set-to-set functions and observed the Transformer is most effective -- as it naturally satisfies key properties of our desired model. We denote this model as FEAT (few-shot embedding adaptation w/ Transformer) and validate it on both the standard few-shot classification benchmark and four extended few-shot learning settings with essential use cases, i.e., cross-domain, transductive, generalized few-shot learning, and low-shot learning. It archived consistent improvements over baseline models as well as previous methods and established the new state-of-the-art results on two benchmarks.
    SoundDet: Polyphonic Sound Event Detection and Localization from Raw Waveform. (arXiv:2106.06969v1 [cs.SD])
    (2 min) We present a new framework SoundDet, which is an end-to-end trainable and light-weight framework, for polyphonic moving sound event detection and localization. Prior methods typically approach this problem by preprocessing raw waveform into time-frequency representations, which is more amenable to process with well-established image processing pipelines. Prior methods also detect in segment-wise manner, leading to incomplete and partial detections. SoundDet takes a novel approach and directly consumes the raw, multichannel waveform and treats the spatio-temporal sound event as a complete ``sound-object" to be detected. Specifically, SoundDet consists of a backbone neural network and two parallel heads for temporal detection and spatial localization, respectively. Given the large sampling rate of raw waveform, the backbone network first learns a set of phase-sensitive and frequency-selective bank of filters to explicitly retain direction-of-arrival information, whilst being highly computationally and parametrically efficient than standard 1D/2D convolution. A dense sound event proposal map is then constructed to handle the challenges of predicting events with large varying temporal duration. Accompanying the dense proposal map are a temporal overlapness map and a motion smoothness map that measure a proposal's confidence to be an event from temporal detection accuracy and movement consistency perspective. Involving the two maps guarantees SoundDet to be trained in a spatio-temporally unified manner. Experimental results on the public DCASE dataset show the advantage of SoundDet on both segment-based and our newly proposed event-based evaluation system.
    NDPNet: A novel non-linear data projection network for few-shot fine-gained image classification. (arXiv:2106.06988v1 [cs.CV])
    (2 min) Metric-based few-shot fine-grained image classification (FSFGIC) aims to learn a transferable feature embedding network by estimating the similarities between query images and support classes from very few examples. In this work, we propose, for the first time, to introduce the non-linear data projection concept into the design of FSFGIC architecture in order to address the limited sample problem in few-shot learning and at the same time to increase the discriminability of the model for fine-grained image classification. Specifically, we first design a feature re-abstraction embedding network that has the ability to not only obtain the required semantic features for effective metric learning but also re-enhance such features with finer details from input images. Then the descriptors of the query images and the support classes are projected into different non-linear spaces in our proposed similarity metric learning network to learn discriminative projection factors. This design can effectively operate in the challenging and restricted condition of a FSFGIC task for making the distance between the samples within the same class smaller and the distance between samples from different classes larger and for reducing the coupling relationship between samples from different categories. Furthermore, a novel similarity measure based on the proposed non-linear data project is presented for evaluating the relationships of feature information between a query image and a support set. It is worth to note that our proposed architecture can be easily embedded into any episodic training mechanisms for end-to-end training from scratch. Extensive experiments on FSFGIC tasks demonstrate the superiority of the proposed methods over the state-of-the-art benchmarks.
    Lifelong Neural Predictive Coding: Learning Cumulatively Online without Forgetting. (arXiv:1905.10696v2 [cs.LG] UPDATED)
    (2 min) In lifelong learning systems, especially those based on artificial neural networks, one of the biggest obstacles is the severe inability to retain old knowledge as new information is encountered. This phenomenon is known as catastrophic forgetting. In this article, we propose a new kind of connectionist architecture, the Sequential Neural Coding Network, that is robust to forgetting when learning from streams of data points and, unlike networks of today, does not learn via the immensely popular back-propagation of errors. Grounded in the neurocognitive theory of predictive processing, our model adapts its synapses in a biologically-plausible fashion, while another, complementary neural system rapidly learns to direct and control this cortex-like structure by mimicking the task-executive control functionality of the basal ganglia. In our experiments, we demonstrate that our self-organizing system experiences significantly less forgetting as compared to standard neural models and outperforms a wide swath of previously proposed methods even though it is trained across task datasets in a stream-like fashion. The promising performance of our complementary system on benchmarks, e.g., SplitMNIST, Split Fashion MNIST, and Split NotMNIST, offers evidence that by incorporating mechanisms prominent in real neuronal systems, such as competition, sparse activation patterns, and iterative input processing, a new possibility for tackling the grand challenge of lifelong machine learning opens up.
    Markov Decision Processes with Long-Term Average Constraints. (arXiv:2106.06680v1 [cs.LG])
    (2 min) We consider the problem of constrained Markov Decision Process (CMDP) where an agent interacts with a unichain Markov Decision Process. At every interaction, the agent obtains a reward. Further, there are $K$ cost functions. The agent aims to maximize the long-term average reward while simultaneously keeping the $K$ long-term average costs lower than a certain threshold. In this paper, we propose CMDP-PSRL, a posterior sampling based algorithm using which the agent can learn optimal policies to interact with the CMDP. Further, for MDP with $S$ states, $A$ actions, and diameter $D$, we prove that following CMDP-PSRL algorithm, the agent can bound the regret of not accumulating rewards from optimal policy by $\Tilde{O}(poly(DSA)\sqrt{T})$. Further, we show that the violations for any of the $K$ constraints is also bounded by $\Tilde{O}(poly(DSA)\sqrt{T})$. To the best of our knowledge, this is the first work which obtains a $\Tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints.
    Corruption-Robust Offline Reinforcement Learning. (arXiv:2106.06630v1 [cs.LG])
    (2 min) We study the adversarial robustness in offline reinforcement learning. Given a batch dataset consisting of tuples $(s, a, r, s')$, an adversary is allowed to arbitrarily modify $\epsilon$ fraction of the tuples. From the corrupted dataset the learner aims to robustly identify a near-optimal policy. We first show that a worst-case $\Omega(d\epsilon)$ optimality gap is unavoidable in linear MDP of dimension $d$, even if the adversary only corrupts the reward element in a tuple. This contrasts with dimension-free results in robust supervised learning and best-known lower-bound in the online RL setting with corruption. Next, we propose robust variants of the Least-Square Value Iteration (LSVI) algorithm utilizing robust supervised learning oracles, which achieve near-matching performances in cases both with and without full data coverage. The algorithm requires the knowledge of $\epsilon$ to design the pessimism bonus in the no-coverage case. Surprisingly, in this case, the knowledge of $\epsilon$ is necessary, as we show that being adaptive to unknown $\epsilon$ is impossible.This again contrasts with recent results on corruption-robust online RL and implies that robust offline RL is a strictly harder problem.
    Disrupting Model Training with Adversarial Shortcuts. (arXiv:2106.06654v1 [cs.CV])
    (2 min) When data is publicly released for human consumption, it is unclear how to prevent its unauthorized usage for machine learning purposes. Successful model training may be preventable with carefully designed dataset modifications, and we present a proof-of-concept approach for the image classification setting. We propose methods based on the notion of adversarial shortcuts, which encourage models to rely on non-robust signals rather than semantic features, and our experiments demonstrate that these measures successfully prevent deep learning models from achieving high accuracy on real, unmodified data examples.
    Robust Knowledge Graph Completion with Stacked Convolutions and a Student Re-Ranking Network. (arXiv:2106.06555v1 [cs.LG])
    (2 min) Knowledge Graph (KG) completion research usually focuses on densely connected benchmark datasets that are not representative of real KGs. We curate two KG datasets that include biomedical and encyclopedic knowledge and use an existing commonsense KG dataset to explore KG completion in the more realistic setting where dense connectivity is not guaranteed. We develop a deep convolutional network that utilizes textual entity representations and demonstrate that our model outperforms recent KG completion methods in this challenging setting. We find that our model's performance improvements stem primarily from its robustness to sparsity. We then distill the knowledge from the convolutional network into a student network that re-ranks promising candidate entities. This re-ranking stage leads to further improvements in performance and demonstrates the effectiveness of entity re-ranking for KG completion.
    Shape of Elephant: Study of Macro Properties of Word Embeddings Spaces. (arXiv:2106.06964v1 [cs.CL])
    (2 min) Pre-trained word representations became a key component in many NLP tasks. However, the global geometry of the word embeddings remains poorly understood. In this paper, we demonstrate that a typical word embeddings cloud is shaped as a high-dimensional simplex with interpretable vertices and propose a simple yet effective method for enumeration of these vertices. We show that the proposed method can detect and describe vertices of the simplex for GloVe and fasttext spaces.
    Markov Neural Operators for Learning Chaotic Systems. (arXiv:2106.06898v1 [cs.LG])
    (2 min) Chaotic systems are notoriously challenging to predict because of their instability. Small errors accumulate in the simulation of each time step, resulting in completely different trajectories. However, the trajectories of many prominent chaotic systems live in a low-dimensional subspace (attractor). If the system is Markovian, the attractor is uniquely determined by the Markov operator that maps the evolution of infinitesimal time steps. This makes it possible to predict the behavior of the chaotic system by learning the Markov operator even if we cannot predict the exact trajectory. Recently, a new framework for learning resolution-invariant solution operators for PDEs was proposed, known as neural operators. In this work, we train a Markov neural operator (MNO) with only the local one-step evolution information. We then compose the learned operator to obtain the global attractor and invariant measure. Such a Markov neural operator forms a discrete semigroup and we empirically observe that does not collapse or blow up. Experiments show neural operators are more accurate and stable compared to previous methods on chaotic systems such as the Kuramoto-Sivashinsky and Navier-Stokes equations.
    Case Study on Detecting COVID-19 Health-Related Misinformation in Social Media. (arXiv:2106.06811v1 [cs.SI])
    (2 min) COVID-19 pandemic has generated what public health officials called an infodemic of misinformation. As social distancing and stay-at-home orders came into effect, many turned to social media for socializing. This increase in social media usage has made it a prime vehicle for the spreading of misinformation. This paper presents a mechanism to detect COVID-19 health-related misinformation in social media following an interdisciplinary approach. Leveraging social psychology as a foundation and existing misinformation frameworks, we defined misinformation themes and associated keywords incorporated into the misinformation detection mechanism using applied machine learning techniques. Next, using the Twitter dataset, we explored the performance of the proposed methodology using multiple state-of-the-art machine learning classifiers. Our method shows promising results with at most 78% accuracy in classifying health-related misinformation versus true information using uni-gram-based NLP feature generations from tweets and the Decision Tree classifier. We also provide suggestions on alternatives for countering misinformation and ethical consideration for the study.
    Acceleration via Fractal Learning Rate Schedules. (arXiv:2103.01338v2 [cs.LG] UPDATED)
    (2 min) In practical applications of iterative first-order optimization, the learning rate schedule remains notoriously difficult to understand and expensive to tune. We demonstrate the presence of these subtleties even in the innocuous case when the objective is a convex quadratic. We reinterpret an iterative algorithm from the numerical analysis literature as what we call the Chebyshev learning rate schedule for accelerating vanilla gradient descent, and show that the problem of mitigating instability leads to a fractal ordering of step sizes. We provide some experiments to challenge conventional beliefs about stable learning rates in deep learning: the fractal schedule enables training to converge with locally unstable updates which make negative progress on the objective.
    Improving weakly supervised sound event detection with self-supervised auxiliary tasks. (arXiv:2106.06858v1 [eess.AS])
    (2 min) While multitask and transfer learning has shown to improve the performance of neural networks in limited data settings, they require pretraining of the model on large datasets beforehand. In this paper, we focus on improving the performance of weakly supervised sound event detection in low data and noisy settings simultaneously without requiring any pretraining task. To that extent, we propose a shared encoder architecture with sound event detection as a primary task and an additional secondary decoder for a self-supervised auxiliary task. We empirically evaluate the proposed framework for weakly supervised sound event detection on a remix dataset of the DCASE 2019 task 1 acoustic scene data with DCASE 2018 Task 2 sounds event data under 0, 10 and 20 dB SNR. To ensure we retain the localisation information of multiple sound events, we propose a two-step attention pooling mechanism that provides a time-frequency localisation of multiple audio events in the clip. The proposed framework with two-step attention outperforms existing benchmark models by 22.3%, 12.8%, 5.9% on 0, 10 and 20 dB SNR respectively. We carry out an ablation study to determine the contribution of the auxiliary task and two-step attention pooling to the SED performance improvement.
    Graph Neural Network-Based Anomaly Detection in Multivariate Time Series. (arXiv:2106.06947v1 [cs.LG])
    (2 min) Given high-dimensional time series data (e.g., sensor data), how can we detect anomalous events, such as system faults and attacks? More challengingly, how can we do this in a way that captures complex inter-sensor relationships, and detects and explains anomalies which deviate from these relationships? Recently, deep learning approaches have enabled improvements in anomaly detection in high-dimensional datasets; however, existing methods do not explicitly learn the structure of existing relationships between variables, or use them to predict the expected behavior of time series. Our approach combines a structure learning approach with graph neural networks, additionally using attention weights to provide explainability for the detected anomalies. Experiments on two real-world sensor datasets with ground truth anomalies show that our method detects anomalies more accurately than baseline approaches, accurately captures correlations between sensors, and allows users to deduce the root cause of a detected anomaly.
    Towards a Privacy-preserving Deep Learning-based Network Intrusion Detection in Data Distribution Services. (arXiv:2106.06765v1 [cs.LG])
    (2 min) Data Distribution Service (DDS) is an innovative approach towards communication in ICS/IoT infrastructure and robotics. Being based on the cross-platform and cross-language API to be applicable in any computerised device, it offers the benefits of modern programming languages and the opportunities to develop more complex and advanced systems. However, the DDS complexity equally increases its vulnerability, while the existing security measures are limited to plug-ins and static rules, with the rest of the security provided by third-party applications and operating system. Specifically, traditional intrusion detection systems (IDS) do not detect any anomalies in the publish/subscribe method. With the exponentially growing global communication exchange, securing DDS is of the utmost importance to futureproofing industrial, public, and even personal devices and systems. This report presents an experimental work on the simulation of several specific attacks against DDS, and the application of Deep Learning for their detection. The findings show that even though Deep Learning allows to detect all simulated attacks using only metadata analysis, their detection level varies, with some of the advanced attacks being harder to detect. The limitations imposed by the attempts to preserve privacy significantly decrease the detection rate. The report also reviews the drawbacks and limitations of the Deep Learning approach and proposes a set of selected solutions and configurations, that can further improve the DDS security.
    A Distributed Model-Free Ride-Sharing Approach for Joint Matching, Pricing, and Dispatching using Deep Reinforcement Learning. (arXiv:2010.01755v2 [cs.MA] UPDATED)
    (2 min) Significant development of ride-sharing services presents a plethora of opportunities to transform urban mobility by providing personalized and convenient transportation while ensuring efficiency of large-scale ride pooling. However, a core problem for such services is route planning for each driver to fulfill the dynamically arriving requests while satisfying given constraints. Current models are mostly limited to static routes with only two rides per vehicle (optimally) or three (with heuristics). In this paper, we present a dynamic, demand aware, and pricing-based vehicle-passenger matching and route planning framework that (1) dynamically generates optimal routes for each vehicle based on online demand, pricing associated with each ride, vehicle capacities and locations. This matching algorithm starts greedily and optimizes over time using an insertion operation, (2) involves drivers in the decision-making process by allowing them to propose a different price based on the expected reward for a particular ride as well as the destination locations for future rides, which is influenced by supply-and demand computed by the Deep Q-network, (3) allows customers to accept or reject rides based on their set of preferences with respect to pricing and delay windows, vehicle type and carpooling preferences, and (4) based on demand prediction, our approach re-balances idle vehicles by dispatching them to the areas of anticipated high demand using deep Reinforcement Learning (RL). Our framework is validated using the New York City Taxi public dataset; however, we consider different vehicle types and designed customer utility functions to validate the setup and study different settings. Experimental results show the effectiveness of our approach in real-time and large scale settings.
    Learngene: From Open-World to Your Learning Task. (arXiv:2106.06788v1 [cs.LG])
    (2 min) Although deep learning has made significant progress on fixed large-scale datasets, it typically encounters challenges regarding improperly detecting new/unseen classes in the open-world classification, over-parametrized, and overfitting small samples. In contrast, biological systems can overcome the above difficulties very well. Individuals inherit an innate gene from collective creatures that have evolved over hundreds of millions of years, and can learn new skills through a few examples. Inspired by this, we propose a practical collective-individual paradigm where open-world tasks are trained in sequence using an evolution (expandable) network. To be specific, we innovatively introduce learngene that inherits the meta-knowledge from the collective model and reconstructs a new lightweight individual model for the target task, to realize the collective-individual paradigm. Particularly, we present a novel criterion that can discover the learngene in the collective model, according to the gradient information. Finally, the individual model is trained only with a few samples in the absence of the source data. We demonstrate the effectiveness of our approach in an extensive empirical study and theoretical analysis.
    Solving PDEs on Unknown Manifolds with Machine Learning. (arXiv:2106.06682v1 [math.NA])
    (2 min) This paper proposes a mesh-free computational framework and machine learning theory for solving elliptic PDEs on unknown manifolds, identified with point clouds, based on diffusion maps (DM) and deep learning. The PDE solver is formulated as a supervised learning task to solve a least-squares regression problem that imposes an algebraic equation approximating a PDE (and boundary conditions if applicable). This algebraic equation involves a graph-Laplacian type matrix obtained via DM asymptotic expansion, which is a consistent estimator of second-order elliptic differential operators. The resulting numerical method is to solve a highly non-convex empirical risk minimization problem subjected to a solution from a hypothesis space of neural-network type functions. In a well-posed elliptic PDE setting, when the hypothesis space consists of feedforward neural networks with either infinite width or depth, we show that the global minimizer of the empirical loss function is a consistent solution in the limit of large training data. When the hypothesis space is a two-layer neural network, we show that for a sufficiently large width, the gradient descent method can identify a global minimizer of the empirical loss function. Supporting numerical examples demonstrate the convergence of the solutions and the effectiveness of the proposed solver in avoiding numerical issues that hampers the traditional approach when a large data set becomes available, e.g., large matrix inversion.
    RobustBench: a standardized adversarial robustness benchmark. (arXiv:2010.09670v2 [cs.LG] UPDATED)
    (3 min) As a research community, we are still lacking a systematic understanding of the progress on adversarial robustness, which often makes it hard to identify the most promising ideas in training robust models. A key challenge in benchmarking robustness is that its evaluation is often error-prone, leading to overestimation of the true robustness of models. While adaptive attacks designed for a particular defense are a potential solution, they have to be highly customized for particular models, which makes it difficult to compare different methods. Our goal is to instead establish a standardized benchmark of adversarial robustness, which as accurately as possible reflects the robustness of the considered models within a reasonable computational budget. To evaluate the robustness of models for our benchmark, we consider AutoAttack, an ensemble of white- and black-box attacks which was recently shown in a large-scale study to improve almost all robustness evaluations compared to the original publications. We also impose some restrictions on the admitted models to rule out defenses that only make gradient-based attacks ineffective without improving actual robustness. Our leaderboard, hosted at https://robustbench.github.io/, contains evaluations of 90+ models and aims at reflecting the current state of the art on a set of well-defined tasks in $\ell_\infty$- and $\ell_2$-threat models and on common corruptions, with possible extensions in the future. Additionally, we open-source the library https://github.com/RobustBench/robustbench that provides unified access to 60+ robust models to facilitate their downstream applications. Finally, based on the collected models, we analyze the impact of robustness on the performance on distribution shifts, calibration, out-of-distribution detection, fairness, privacy leakage, smoothness, and transferability.
    PGDOT -- Perturbed Gradient Descent Adapted with Occupation Time. (arXiv:2005.04507v2 [math.OC] UPDATED)
    (2 min) This paper develops further the idea of perturbed gradient descent (PGD), by adapting perturbation with the history of states via the notion of occupation time. The proposed algorithm, perturbed gradient descent adapted with occupation time (PGDOT), is shown to converge at least as fast as the PGD algorithm and is guaranteed to avoid getting stuck at saddle points. The analysis is corroborated by empirical studies, in which a mini-batch version of PGDOT is shown to outperform alternatives such as mini-batch gradient descent, Adam, AMSGrad, and RMSProp in training multilayer perceptrons (MLPs). In particular, the mini-batch PGDOT manages to escape saddle points whereas these alternatives fail.
    CoPE: Conditional image generation using Polynomial Expansions. (arXiv:2104.05077v2 [cs.LG] UPDATED)
    (2 min) Generative modeling has evolved to a notable field of machine learning. Deep polynomial neural networks (PNNs) have demonstrated impressive results in unsupervised image generation, where the task is to map an input vector (i.e., noise) to a synthesized image. However, the success of PNNs has not been replicated in conditional generation tasks, such as super-resolution. Existing PNNs focus on single-variable polynomial expansions which do not fare well to two-variable inputs, i.e., the noise variable and the conditional variable. In this work, we introduce a general framework, called CoPE, that enables a polynomial expansion of two input variables and captures their auto- and cross-correlations. We exhibit how CoPE can be trivially augmented to accept an arbitrary number of input variables. CoPE is evaluated in five tasks (class-conditional generation, inverse problems, edges-to-image translation, image-to-image translation, attribute-guided generation) involving eight datasets. The thorough evaluation suggests that CoPE can be useful for tackling diverse conditional generation tasks.
    Single-Timescale Actor-Critic Provably Finds Globally Optimal Policy. (arXiv:2008.00483v2 [cs.LG] UPDATED)
    (2 min) We study the global convergence and global optimality of actor-critic, one of the most popular families of reinforcement learning algorithms. While most existing works on actor-critic employ bi-level or two-timescale updates, we focus on the more practical single-timescale setting, where the actor and critic are updated simultaneously. Specifically, in each iteration, the critic update is obtained by applying the Bellman evaluation operator only once while the actor is updated in the policy gradient direction computed using the critic. Moreover, we consider two function approximation settings where both the actor and critic are represented by linear or deep neural networks. For both cases, we prove that the actor sequence converges to a globally optimal policy at a sublinear $O(K^{-1/2})$ rate, where $K$ is the number of iterations. To the best of our knowledge, we establish the rate of convergence and global optimality of single-timescale actor-critic with linear function approximation for the first time. Moreover, under the broader scope of policy optimization with nonlinear function approximation, we prove that actor-critic with deep neural network finds the globally optimal policy at a sublinear rate for the first time.
    ECG-TCN: Wearable Cardiac Arrhythmia Detection with a Temporal Convolutional Network. (arXiv:2103.13740v2 [cs.LG] UPDATED)
    (2 min) Personalized ubiquitous healthcare solutions require energy-efficient wearable platforms that provide an accurate classification of bio-signals while consuming low average power for long-term battery-operated use. Single lead electrocardiogram (ECG) signals provide the ability to detect, classify, and even predict cardiac arrhythmia. In this paper, we propose a novel temporal convolutional network (TCN) that achieves high accuracy while still being feasible for wearable platform use. Experimental results on the ECG5000 dataset show that the TCN has a similar accuracy (94.2%) score as the state-of-the-art (SoA) network while achieving an improvement of 16.5% in the balanced accuracy score. This accurate classification is done with 27 times fewer parameters and 37 times less multiply-accumulate operations. We test our implementation on two publicly available platforms, the STM32L475, which is based on ARM Cortex M4F, and the GreenWaves Technologies GAP8 on the GAPuino board, based on 1+8 RISC-V CV32E40P cores. Measurements show that the GAP8 implementation respects the real-time constraints while consuming 0.10 mJ per inference. With 9.91 GMAC/s/W, it is 23.0 times more energy-efficient and 46.85 times faster than an implementation on the ARM Cortex M4F (0.43 GMAC/s/W). Overall, we obtain 8.1% higher accuracy while consuming 19.6 times less energy and being 35.1 times faster compared to a previous SoA embedded implementation.
    Foundations and modelling of dynamic networks using Dynamic Graph Neural Networks: A survey. (arXiv:2005.07496v2 [cs.SI] UPDATED)
    (2 min) Dynamic networks are used in a wide range of fields, including social network analysis, recommender systems, and epidemiology. Representing complex networks as structures changing over time allow network models to leverage not only structural but also temporal patterns. However, as dynamic network literature stems from diverse fields and makes use of inconsistent terminology, it is challenging to navigate. Meanwhile, graph neural networks (GNNs) have gained a lot of attention in recent years for their ability to perform well on a range of network science tasks, such as link prediction and node classification. Despite the popularity of graph neural networks and the proven benefits of dynamic network models, there has been little focus on graph neural networks for dynamic networks. To address the challenges resulting from the fact that this research crosses diverse fields as well as to survey dynamic graph neural networks, this work is split into two main parts. First, to address the ambiguity of the dynamic network terminology we establish a foundation of dynamic networks with consistent, detailed terminology and notation. Second, we present a comprehensive survey of dynamic graph neural network models using the proposed terminology
    Harmonization with Flow-based Causal Inference. (arXiv:2106.06845v1 [cs.LG])
    (2 min) Heterogeneity in medical data, e.g., from data collected at different sites and with different protocols in a clinical study, is a fundamental hurdle for accurate prediction using machine learning models, as such models often fail to generalize well. This paper presents a normalizing-flow-based method to perform counterfactual inference upon a structural causal model (SCM) to harmonize such data. We formulate a causal model for observed effects (brain magnetic resonance imaging data) that result from known confounders (site, gender and age) and exogenous noise variables. Our method exploits the bijection induced by flow for harmonization. We can infer the posterior of exogenous variables, intervene on observations, and draw samples from the resultant SCM to obtain counterfactuals. We evaluate on multiple, large, real-world medical datasets to observe that this method leads to better cross-domain generalization compared to state-of-the-art algorithms. Further experiments that evaluate the quality of confounder-independent data generated by our model using regression and classification tasks are provided.
    Residual Networks based Distortion Classification and Ranking for Laparoscopic Image Quality Assessment. (arXiv:2106.06784v1 [eess.IV])
    (2 min) Laparoscopic images and videos are often affected by different types of distortion like noise, smoke, blur and nonuniform illumination. Automatic detection of these distortions, followed generally by application of appropriate image quality enhancement methods, is critical to avoid errors during surgery. In this context, a crucial step involves an objective assessment of the image quality, which is a two-fold problem requiring both the classification of the distortion type affecting the image and the estimation of the severity level of that distortion. Unlike existing image quality measures which focus mainly on estimating a quality score, we propose in this paper to formulate the image quality assessment task as a multi-label classification problem taking into account both the type as well as the severity level (or rank) of distortions. Here, this problem is then solved by resorting to a deep neural networks based approach. The obtained results on a laparoscopic image dataset show the efficiency of the proposed approach.
    ATRAS: Adversarially Trained Robust Architecture Search. (arXiv:2106.06917v1 [cs.LG])
    (2 min) In this paper, we explore the effect of architecture completeness on adversarial robustness. We train models with different architectures on CIFAR-10 and MNIST dataset. For each model, we vary different number of layers and different number of nodes in the layer. For every architecture candidate, we use Fast Gradient Sign Method (FGSM) to generate untargeted adversarial attacks and use adversarial training to defend against those attacks. For each architecture candidate, we report pre-attack, post-attack and post-defense accuracy for the model as well as the architecture parameters and the impact of completeness to the model accuracies.
    Knowledge Consolidation based Class Incremental Online Learning with Limited Data. (arXiv:2106.06795v1 [cs.LG])
    (2 min) We propose a novel approach for class incremental online learning in a limited data setting. This problem setting is challenging because of the following constraints: (1) Classes are given incrementally, which necessitates a class incremental learning approach; (2) Data for each class is given in an online fashion, i.e., each training example is seen only once during training; (3) Each class has very few training examples; and (4) We do not use or assume access to any replay/memory to store data from previous classes. Therefore, in this setting, we have to handle twofold problems of catastrophic forgetting and overfitting. In our approach, we learn robust representations that are generalizable across tasks without suffering from the problems of catastrophic forgetting and overfitting to accommodate future classes with limited samples. Our proposed method leverages the meta-learning framework with knowledge consolidation. The meta-learning framework helps the model for rapid learning when samples appear in an online fashion. Simultaneously, knowledge consolidation helps to learn a robust representation against forgetting under online updates to facilitate future learning. Our approach significantly outperforms other methods on several benchmarks.
    A Comprehensive Overview on 5G-and-Beyond Networks with UAVs: From Communications to Sensing and Intelligence. (arXiv:2010.09317v2 [cs.IT] UPDATED)
    (2 min) Due to the advancements in cellular technologies and the dense deployment of cellular infrastructure, integrating unmanned aerial vehicles (UAVs) into the fifth-generation (5G) and beyond cellular networks is a promising solution to achieve safe UAV operation as well as enabling diversified applications with mission-specific payload data delivery. In particular, 5G networks need to support three typical usage scenarios, namely, enhanced mobile broadband (eMBB), ultra-reliable low-latency communications (URLLC), and massive machine-type communications (mMTC). On the one hand, UAVs can be leveraged as cost-effective aerial platforms to provide ground users with enhanced communication services by exploiting their high cruising altitude and controllable maneuverability in three-dimensional (3D) space. On the other hand, providing such communication services simultaneously for both UAV and ground users poses new challenges due to the need for ubiquitous 3D signal coverage as well as the strong air-ground network interference. Besides the requirement of high-performance wireless communications, the ability to support effective and efficient sensing as well as network intelligence is also essential for 5G-and-beyond 3D heterogeneous wireless networks with coexisting aerial and ground users. In this paper, we provide a comprehensive overview of the latest research efforts on integrating UAVs into cellular networks, with an emphasis on how to exploit advanced techniques (e.g., intelligent reflecting surface, short packet transmission, energy harvesting, joint communication and radar sensing, and edge intelligence) to meet the diversified service requirements of next-generation wireless systems. Moreover, we highlight important directions for further investigation in future work.
    Graph-based Visual-Semantic Entanglement Network for Zero-shot Image Recognition. (arXiv:2006.04648v2 [cs.CV] UPDATED)
    (2 min) Zero-shot learning uses semantic attributes to connect the search space of unseen objects. In recent years, although the deep convolutional network brings powerful visual modeling capabilities to the ZSL task, its visual features have severe pattern inertia and lack of representation of semantic relationships, which leads to severe bias and ambiguity. In response to this, we propose the Graph-based Visual-Semantic Entanglement Network to conduct graph modeling of visual features, which is mapped to semantic attributes by using a knowledge graph, it contains several novel designs: 1. it establishes a multi-path entangled network with the convolutional neural network (CNN) and the graph convolutional network (GCN), which input the visual features from CNN to GCN to model the implicit semantic relations, then GCN feedback the graph modeled information to CNN features; 2. it uses attribute word vectors as the target for the graph semantic modeling of GCN, which forms a self-consistent regression for graph modeling and supervise GCN to learn more personalized attribute relations; 3. it fuses and supplements the hierarchical visual-semantic features refined by graph modeling into visual embedding. Our method outperforms state-of-the-art approaches on multiple representative ZSL datasets: AwA2, CUB, and SUN by promoting the semantic linkage modelling of visual features.
    Federated Learning with Sparsification-Amplified Privacy and Adaptive Optimization. (arXiv:2008.01558v2 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) enables distributed agents to collaboratively learn a centralized model without sharing their raw data with each other. However, data locality does not provide sufficient privacy protection, and it is desirable to facilitate FL with rigorous differential privacy (DP) guarantee. Existing DP mechanisms would introduce random noise with magnitude proportional to the model size, which can be quite large in deep neural networks. In this paper, we propose a new FL framework with sparsification-amplified privacy. Our approach integrates random sparsification with gradient perturbation on each agent to amplify privacy guarantee. Since sparsification would increase the number of communication rounds required to achieve a certain target accuracy, which is unfavorable for DP guarantee, we further introduce acceleration techniques to help reduce the privacy cost. We rigorously analyze the convergence of our approach and utilize Renyi DP to tightly account the end-to-end DP guarantee. Extensive experiments on benchmark datasets validate that our approach outperforms previous differentially-private FL approaches in both privacy guarantee and communication efficiency.
    A Game-Theoretic Approach to Multi-Agent Trust Region Optimization. (arXiv:2106.06828v1 [cs.MA])
    (2 min) Trust region methods are widely applied in single-agent reinforcement learning problems due to their monotonic performance-improvement guarantee at every iteration. Nonetheless, when applied in multi-agent settings, the guarantee of trust region methods no longer holds because an agent's payoff is also affected by other agents' adaptive behaviors. To tackle this problem, we conduct a game-theoretical analysis in the policy space, and propose a multi-agent trust region learning method (MATRL), which enables trust region optimization for multi-agent learning. Specifically, MATRL finds a stable improvement direction that is guided by the solution concept of Nash equilibrium at the meta-game level. We derive the monotonic improvement guarantee in multi-agent settings and empirically show the local convergence of MATRL to stable fixed points in the two-player rotational differential game. To test our method, we evaluate MATRL in both discrete and continuous multiplayer general-sum games including checker and switch grid worlds, multi-agent MuJoCo, and Atari games. Results suggest that MATRL significantly outperforms strong multi-agent reinforcement learning baselines.
    A Shuffling Framework for Local Differential Privacy. (arXiv:2106.06603v1 [cs.LG])
    (2 min) ldp deployments are vulnerable to inference attacks as an adversary can link the noisy responses to their identity and subsequently, auxiliary information using the order of the data. An alternative model, shuffle DP, prevents this by shuffling the noisy responses uniformly at random. However, this limits the data learnability -- only symmetric functions (input order agnostic) can be learned. In this paper, we strike a balance and propose a generalized shuffling framework that interpolates between the two deployment models. We show that systematic shuffling of the noisy responses can thwart specific inference attacks while retaining some meaningful data learnability. To this end, we propose a novel privacy guarantee, d-sigma privacy, that captures the privacy of the order of a data sequence. d-sigma privacy allows tuning the granularity at which the ordinal information is maintained, which formalizes the degree the resistance to inference attacks trading it off with data learnability. Additionally, we propose a novel shuffling mechanism that can achieve d-sigma privacy and demonstrate the practicality of our mechanism via evaluation on real-world datasets.
    Hedging with Linear Regressions and Neural Networks. (arXiv:2004.08891v3 [q-fin.RM] UPDATED)
    (2 min) We study neural networks as nonparametric estimation tools for the hedging of options. To this end, we design a network, named HedgeNet, that directly outputs a hedging strategy. This network is trained to minimise the hedging error instead of the pricing error. Applied to end-of-day and tick prices of S&P 500 and Euro Stoxx 50 options, the network is able to reduce the mean squared hedging error of the Black-Scholes benchmark significantly. However, a similar benefit arises by simple linear regressions that incorporate the leverage effect.
    Break-It-Fix-It: Unsupervised Learning for Program Repair. (arXiv:2106.06600v1 [cs.LG])
    (2 min) We consider repair tasks: given a critic (e.g., compiler) that assesses the quality of an input, the goal is to train a fixer that converts a bad example (e.g., code with syntax errors) into a good one (e.g., code with no errors). Existing works create training data consisting of (bad, good) pairs by corrupting good examples using heuristics (e.g., dropping tokens). However, fixers trained on this synthetically-generated data do not extrapolate well to the real distribution of bad inputs. To bridge this gap, we propose a new training approach, Break-It-Fix-It (BIFI), which has two key ideas: (i) we use the critic to check a fixer's output on real bad inputs and add good (fixed) outputs to the training data, and (ii) we train a breaker to generate realistic bad code from good code. Based on these ideas, we iteratively update the breaker and the fixer while using them in conjunction to generate more paired data. We evaluate BIFI on two code repair datasets: GitHub-Python, a new dataset we introduce where the goal is to repair Python code with AST parse errors; and DeepFix, where the goal is to repair C code with compiler errors. BIFI outperforms existing methods, obtaining 90.5% repair accuracy on GitHub-Python (+28.5%) and 71.7% on DeepFix (+5.6%). Notably, BIFI does not require any labeled data; we hope it will be a strong starting point for unsupervised learning of various repair tasks.
    Boosting Randomized Smoothing with Variance Reduced Classifiers. (arXiv:2106.06946v1 [cs.LG])
    (2 min) Randomized Smoothing (RS) is a promising method for obtaining robustness certificates by evaluating a base model under noise. In this work we: (i) theoretically motivate why ensembles are a particularly suitable choice as base models for RS, and (ii) empirically confirm this choice, obtaining state of the art results in multiple settings. The key insight of our work is that the reduced variance of ensembles over the perturbations introduced in RS leads to significantly more consistent classifications for a given input, in turn leading to substantially increased certifiable radii for difficult samples. We also introduce key optimizations which enable an up to 50-fold decrease in sample complexity of RS, thus drastically reducing its computational overhead. Experimentally, we show that ensembles of only 3 to 10 classifiers consistently improve on the strongest single model with respect to their average certified radius (ACR) by 5% to 21% on both CIFAR-10 and ImageNet. On the latter, we achieve a state-of-the-art ACR of 1.11. We release all code and models required to reproduce our results upon publication.
    Multi-modal Scene-compliant User Intention Estimation for Navigation. (arXiv:2106.06920v1 [cs.RO])
    (2 min) A multi-modal framework to generated user intention distributions when operating a mobile vehicle is proposed in this work. The model learns from past observed trajectories and leverages traversability information derived from the visual surroundings to produce a set of future trajectories, suitable to be directly embedded into a perception-action shared control strategy on a mobile agent, or as a safety layer to supervise the prudent operation of the vehicle. We base our solution on a conditional Generative Adversarial Network with Long-Short Term Memory cells to capture trajectory distributions conditioned on past trajectories, further fused with traversability probabilities derived from visual segmentation with a Convolutional Neural Network. The proposed data-driven framework results in a significant reduction in error of the predicted trajectories (versus the ground truth) from comparable strategies in the literature (e.g. Social-GAN) that fail to account for information other than the agent's past history. Experiments were conducted on a dataset collected with a custom wheelchair model built onto the open-source urban driving simulator CARLA, proving also that the proposed framework can be used with a small, un-annotated dataset.
    Adversarial Robustness via Fisher-Rao Regularization. (arXiv:2106.06685v1 [cs.LG])
    (2 min) Adversarial robustness has become a topic of growing interest in machine learning since it was observed that neural networks tend to be brittle. We propose an information-geometric formulation of adversarial defense and introduce FIRE, a new Fisher-Rao regularization for the categorical cross-entropy loss, which is based on the geodesic distance between natural and perturbed input features. Based on the information-geometric properties of the class of softmax distributions, we derive an explicit characterization of the Fisher-Rao Distance (FRD) for the binary and multiclass cases, and draw some interesting properties as well as connections with standard regularization metrics. Furthermore, for a simple linear and Gaussian model, we show that all Pareto-optimal points in the accuracy-robustness region can be reached by FIRE while other state-of-the-art methods fail. Empirically, we evaluate the performance of various classifiers trained with the proposed loss on standard datasets, showing up to 2\% of improvements in terms of robustness while reducing the training time by 20\% over the best-performing methods.
    BRAIN2DEPTH: Lightweight CNN Model for Classification of Cognitive States from EEG Recordings. (arXiv:2106.06688v1 [cs.LG])
    (2 min) Several Convolutional Deep Learning models have been proposed to classify the cognitive states utilizing several neuro-imaging domains. These models have achieved significant results, but they are heavily designed with millions of parameters, which increases train and test time, making the model complex and less suitable for real-time analysis. This paper proposes a simple, lightweight CNN model to classify cognitive states from Electroencephalograph (EEG) recordings. We develop a novel pipeline to learn distinct cognitive representation consisting of two stages. The first stage is to generate the 2D spectral images from neural time series signals in a particular frequency band. Images are generated to preserve the relationship between the neighboring electrodes and the spectral property of the cognitive events. The second is to develop a time-efficient, computationally less loaded, and high-performing model. We design a network containing 4 blocks and major components include standard and depth-wise convolution for increasing the performance and followed by separable convolution to decrease the number of parameters which maintains the tradeoff between time and performance. We experiment on open access EEG meditation dataset comprising expert, nonexpert meditative, and control states. We compare performance with six commonly used machine learning classifiers and four state of the art deep learning models. We attain comparable performance utilizing less than 4\% of the parameters of other models. This model can be employed in a real-time computation environment such as neurofeedback.
    Distributionally Robust Optimization with Markovian Data. (arXiv:2106.06741v1 [math.OC])
    (2 min) We study a stochastic program where the probability distribution of the uncertain problem parameters is unknown and only indirectly observed via finitely many correlated samples generated by an unknown Markov chain with $d$ states. We propose a data-driven distributionally robust optimization model to estimate the problem's objective function and optimal solution. By leveraging results from large deviations theory, we derive statistical guarantees on the quality of these estimators. The underlying worst-case expectation problem is nonconvex and involves $\mathcal O(d^2)$ decision variables. Thus, it cannot be solved efficiently for large $d$. By exploiting the structure of this problem, we devise a customized Frank-Wolfe algorithm with convex direction-finding subproblems of size $\mathcal O(d)$. We prove that this algorithm finds a stationary point efficiently under mild conditions. The efficiency of the method is predicated on a dimensionality reduction enabled by a dual reformulation. Numerical experiments indicate that our approach has better computational and statistical properties than the state-of-the-art methods.
    Federated Learning with Spiking Neural Networks. (arXiv:2106.06579v1 [cs.LG])
    (2 min) As neural networks get widespread adoption in resource-constrained embedded devices, there is a growing need for low-power neural systems. Spiking Neural Networks (SNNs)are emerging to be an energy-efficient alternative to the traditional Artificial Neural Networks (ANNs) which are known to be computationally intensive. From an application perspective, as federated learning involves multiple energy-constrained devices, there is a huge scope to leverage energy efficiency provided by SNNs. Despite its importance, there has been little attention on training SNNs on a large-scale distributed system like federated learning. In this paper, we bring SNNs to a more realistic federated learning scenario. Specifically, we propose a federated learning framework for decentralized and privacy-preserving training of SNNs. To validate the proposed federated learning framework, we experimentally evaluate the advantages of SNNs on various aspects of federated learning with CIFAR10 and CIFAR100 benchmarks. We observe that SNNs outperform ANNs in terms of overall accuracy by over 15% when the data is distributed across a large number of clients in the federation while providing up to5.3x energy efficiency. In addition to efficiency, we also analyze the sensitivity of the proposed federated SNN framework to data distribution among the clients, stragglers, and gradient noise and perform a comprehensive comparison with ANNs.
    A New Formalism, Method and Open Issues for Zero-Shot Coordination. (arXiv:2106.06613v1 [cs.AI])
    (2 min) In many coordination problems, independently reasoning humans are able to discover mutually compatible policies. In contrast, independently trained self-play policies are often mutually incompatible. Zero-shot coordination (ZSC) has recently been proposed as a new frontier in multi-agent reinforcement learning to address this fundamental issue. Prior work approaches the ZSC problem by assuming players can agree on a shared learning algorithm but not on labels for actions and observations, and proposes other-play as an optimal solution. However, until now, this "label-free" problem has only been informally defined. We formalize this setting as the label-free coordination (LFC) problem by defining the label-free coordination game. We show that other-play is not an optimal solution to the LFC problem as it fails to consistently break ties between incompatible maximizers of the other-play objective. We introduce an extension of the algorithm, other-play with tie-breaking, and prove that it is optimal in the LFC problem and an equilibrium in the LFC game. Since arbitrary tie-breaking is precisely what the ZSC setting aims to prevent, we conclude that the LFC problem does not reflect the aims of ZSC. To address this, we introduce an alternative informal operationalization of ZSC as a starting point for future work.
    A Deep Reinforcement Learning Approach to Marginalized Importance Sampling with the Successor Representation. (arXiv:2106.06854v1 [cs.LG])
    (2 min) Marginalized importance sampling (MIS), which measures the density ratio between the state-action occupancy of a target policy and that of a sampling distribution, is a promising approach for off-policy evaluation. However, current state-of-the-art MIS methods rely on complex optimization tricks and succeed mostly on simple toy problems. We bridge the gap between MIS and deep reinforcement learning by observing that the density ratio can be computed from the successor representation of the target policy. The successor representation can be trained through deep reinforcement learning methodology and decouples the reward optimization from the dynamics of the environment, making the resulting algorithm stable and applicable to high-dimensional domains. We evaluate the empirical performance of our approach on a variety of challenging Atari and MuJoCo environments.
    Hyperspectral and Multispectral Classification for Coastal Wetland Using Depthwise Feature Interaction Network. (arXiv:2106.06896v1 [cs.CV])
    (2 min) The monitoring of coastal wetlands is of great importance to the protection of marine and terrestrial ecosystems. However, due to the complex environment, severe vegetation mixture, and difficulty of access, it is impossible to accurately classify coastal wetlands and identify their species with traditional classifiers. Despite the integration of multisource remote sensing data for performance enhancement, there are still challenges with acquiring and exploiting the complementary merits from multisource data. In this paper, the Deepwise Feature Interaction Network (DFINet) is proposed for wetland classification. A depthwise cross attention module is designed to extract self-correlation and cross-correlation from multisource feature pairs. In this way, meaningful complementary information is emphasized for classification. DFINet is optimized by coordinating consistency loss, discrimination loss, and classification loss. Accordingly, DFINet reaches the standard solution-space under the regularity of loss functions, while the spatial consistency and feature discrimination are preserved. Comprehensive experimental results on two hyperspectral and multispectral wetland datasets demonstrate that the proposed DFINet outperforms other competitive methods in terms of overall accuracy.
    Bellman-consistent Pessimism for Offline Reinforcement Learning. (arXiv:2106.06926v1 [cs.LG])
    (2 min) The use of pessimism, when reasoning about datasets lacking exhaustive exploration has recently gained prominence in offline reinforcement learning. Despite the robustness it adds to the algorithm, overly pessimistic reasoning can be equally damaging in precluding the discovery of good policies, which is an issue for the popular bonus-based pessimism. In this paper, we introduce the notion of Bellman-consistent pessimism for general function approximation: instead of calculating a point-wise lower bound for the value function, we implement pessimism at the initial state over the set of functions consistent with the Bellman equations. Our theoretical guarantees only require Bellman closedness as standard in the exploratory setting, in which case bonus-based pessimism fails to provide guarantees. Even in the special case of linear MDPs where stronger function-approximation assumptions hold, our result improves upon a recent bonus-based approach by $\mathcal{O}(d)$ in its sample complexity when the action space is finite. Remarkably, our algorithms automatically adapt to the best bias-variance tradeoff in the hindsight, whereas most prior approaches require tuning extra hyperparameters a priori.
    Federated Learning on Non-IID Data: A Survey. (arXiv:2106.06843v1 [cs.LG])
    (2 min) Federated learning is an emerging distributed machine learning framework for privacy preservation. However, models trained in federated learning usually have worse performance than those trained in the standard centralized learning mode, especially when the training data are not independent and identically distributed (Non-IID) on the local devices. In this survey, we pro-vide a detailed analysis of the influence of Non-IID data on both parametric and non-parametric machine learning models in both horizontal and vertical federated learning. In addition, cur-rent research work on handling challenges of Non-IID data in federated learning are reviewed, and both advantages and disadvantages of these approaches are discussed. Finally, we suggest several future research directions before concluding the paper.
    Characterizing the Gap Between Actor-Critic and Policy Gradient. (arXiv:2106.06932v1 [cs.AI])
    (2 min) Actor-critic (AC) methods are ubiquitous in reinforcement learning. Although it is understood that AC methods are closely related to policy gradient (PG), their precise connection has not been fully characterized previously. In this paper, we explain the gap between AC and PG methods by identifying the exact adjustment to the AC objective/gradient that recovers the true policy gradient of the cumulative reward objective (PG). Furthermore, by viewing the AC method as a two-player Stackelberg game between the actor and critic, we show that the Stackelberg policy gradient can be recovered as a special case of our more general analysis. Based on these results, we develop practical algorithms, Residual Actor-Critic and Stackelberg Actor-Critic, for estimating the correction between AC and PG and use these to modify the standard AC algorithm. Experiments on popular tabular and continuous environments show the proposed corrections can improve both the sample efficiency and final performance of existing AC methods.
    What can linearized neural networks actually say about generalization?. (arXiv:2106.06770v1 [cs.LG])
    (2 min) For certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization. However, for the networks used in practice, the empirical NTK represents only a rough first-order approximation of these architectures. Still, a growing body of work keeps leveraging this approximation to successfully analyze important deep learning phenomena and derive algorithms for new applications. In our work, we provide strong empirical evidence to determine the practical validity of such approximation by conducting a systematic comparison of the behaviour of different neural networks and their linear approximations on different tasks. We show that the linear approximations can indeed rank the learning complexity of certain tasks for neural networks, albeit with important nuances. Specifically, we discover that, in contrast to what was previously observed, neural networks do not always perform better than their kernel approximations, and reveal that their performance gap heavily depends on architecture, number of samples and training task. In fact, we show that during training, deep networks increase the alignment of their empirical NTK with the target task, which explains why linear approximations at the end of training can better explain the dynamics of deep networks. Overall, our work provides concrete examples of novel deep learning phenomena which can inspire future theoretical research, as well as provides a new perspective on the use of the NTK approximation in deep learning.
    Short-term forecasting of global solar irradiance with incomplete data. (arXiv:2106.06868v1 [cs.LG])
    (2 min) Accurate mechanisms for forecasting solar irradiance and insolation provide important information for the planning of renewable energy and agriculture projects as well as for environmental and socio-economical studies. This research introduces a pipeline for the one-day ahead forecasting of solar irradiance and insolation that only requires solar irradiance historical data for training. Furthermore, our approach is able to deal with missing data since it includes a data imputation state. In the prediction stage, we consider four data-driven approaches: Autoregressive Integrated Moving Average (ARIMA), Single Layer Feed Forward Network (SL-FNN), Multiple Layer Feed Forward Network (FL-FNN), and Long Short-Term Memory (LSTM). The experiments are performed in a real-world dataset collected with 12 Automatic Weather Stations (AWS) located in the Nari\~no - Colombia. The results show that the neural network-based models outperform ARIMA in most cases. Furthermore, LSTM exhibits better performance in cloudy environments (where more randomness is expected).
    Zero-Cost Proxies Meet Differentiable Architecture Search. (arXiv:2106.06799v1 [cs.LG])
    (2 min) Differentiable neural architecture search (NAS) has attracted significant attention in recent years due to its ability to quickly discover promising architectures of deep neural networks even in very large search spaces. Despite its success, DARTS lacks robustness in certain cases, e.g. it may degenerate to trivial architectures with excessive parametric-free operations such as skip connection or random noise, leading to inferior performance. In particular, operation selection based on the magnitude of architectural parameters was recently proven to be fundamentally wrong showcasing the need to rethink this aspect. On the other hand, zero-cost proxies have been recently studied in the context of sample-based NAS showing promising results -- speeding up the search process drastically in some cases but also failing on some of the large search spaces typical for differentiable NAS. In this work we propose a novel operation selection paradigm in the context of differentiable NAS which utilises zero-cost proxies. Our perturbation-based zero-cost operation selection (Zero-Cost-PT) improves searching time and, in many cases, accuracy compared to the best available differentiable architecture search, regardless of the search space size. Specifically, we are able to find comparable architectures to DARTS-PT on the DARTS CNN search space while being over 40x faster (total searching time 25 minutes on a single GPU).
    Sparse PointPillars: Exploiting Sparsity in Birds-Eye-View Object Detection. (arXiv:2106.06882v1 [cs.CV])
    (2 min) Bird's Eye View (BEV) is a popular representation for processing 3D point clouds, and by its nature is fundamentally sparse. Motivated by the computational limitations of mobile robot platforms, we take a fast high-performance BEV 3D object detector - PointPillars - and modify its backbone to exploit this sparsity, leading to decreased runtimes. We present preliminary results demonstrating decreased runtimes with either the same performance or a modest decrease in performance, which we anticipate will be remedied by model specific hyperparameter tuning. Our work is a first step towards a new class of 3D object detectors that exploit sparsity throughout their entire pipeline in order to reduce runtime and resource usage while maintaining good detection performance.
    Stochastic Alternating Direction Method of Multipliers for Byzantine-Robust Distributed Learning. (arXiv:2106.06891v1 [math.OC])
    (2 min) This paper aims to solve a distributed learning problem under Byzantine attacks. In the underlying distributed system, a number of unknown but malicious workers (termed as Byzantine workers) can send arbitrary messages to the master and bias the learning process, due to data corruptions, computation errors or malicious attacks. Prior work has considered a total variation (TV) norm-penalized approximation formulation to handle the Byzantine attacks, where the TV norm penalty forces the regular workers' local variables to be close, and meanwhile, tolerates the outliers sent by the Byzantine workers. To solve the TV norm-penalized approximation formulation, we propose a Byzantine-robust stochastic alternating direction method of multipliers (ADMM) that fully utilizes the separable problem structure. Theoretically, we prove that the proposed method converges to a bounded neighborhood of the optimal solution at a rate of O(1/k) under mild assumptions, where k is the number of iterations and the size of neighborhood is determined by the number of Byzantine workers. Numerical experiments on the MNIST and COVERTYPE datasets demonstrate the effectiveness of the proposed method to various Byzantine attacks.
    Predicting Higher Education Throughput in South Africa Using a Tree-Based Ensemble Technique. (arXiv:2106.06805v1 [stat.AP])
    (2 min) We use gradient boosting machines and logistic regression to predict academic throughput at a South African university. The results highlight the significant influence of socio-economic factors and field of study as predictors of throughput. We further find that socio-economic factors become less of a predictor relative to the field of study as the time to completion increases. We provide recommendations on interventions to counteract the identified effects, which include academic, psychosocial and financial support.
    Quantifying the Conceptual Error in Dimensionality Reduction. (arXiv:2106.06815v1 [cs.LG])
    (2 min) Dimension reduction of data sets is a standard problem in the realm of machine learning and knowledge reasoning. They affect patterns in and dependencies on data dimensions and ultimately influence any decision-making processes. Therefore, a wide variety of reduction procedures are in use, each pursuing different objectives. A so far not considered criterion is the conceptual continuity of the reduction mapping, i.e., the preservation of the conceptual structure with respect to the original data set. Based on the notion scale-measure from formal concept analysis we present in this work a) the theoretical foundations to detect and quantify conceptual errors in data scalings; b) an experimental investigation of our approach on eleven data sets that were respectively treated with a variant of non-negative matrix factorization.
    Guaranteed Fixed-Confidence Best Arm Identification in Multi-Armed Bandit. (arXiv:2106.06848v1 [cs.LG])
    (2 min) We consider the problem of finding, through adaptive sampling, which of n populations (arms) has the largest mean. Our objective is to determine a rule which identifies the best population with a fixed minimum confidence using as few observations as possible, i.e. fixed-confidence (FC) best arm identification (BAI) in multi-armed bandits. We study such problems under the Bayesian setting with both Bernoulli and Gaussian populations. We propose to use the classical vector at a time (VT) rule, which samples each alive population once in each round. We show how VT can be implemented and analyzed in our Bayesian setting and be improved by early elimination. We also propose and analyze a variant of the classical play the winner (PW) algorithm. Numerical results show that these rules compare favorably with state-of-art algorithms.
    Finding the Stochastic Shortest Path with Low Regret: The Adversarial Cost and Unknown Transition Case. (arXiv:2102.05284v2 [cs.LG] UPDATED)
    (2 min) We make significant progress toward the stochastic shortest path problem with adversarial costs and unknown transition. Specifically, we develop algorithms that achieve $\widetilde{O}(\sqrt{S^2ADT_\star K})$ regret for the full-information setting and $\widetilde{O}(\sqrt{S^3A^2DT_\star K})$ regret for the bandit feedback setting, where $D$ is the diameter, $T_\star$ is the expected hitting time of the optimal policy, $S$ is the number of states, $A$ is the number of actions, and $K$ is the number of episodes. Our work strictly improves (Rosenberg and Mansour, 2020) in the full information setting, extends (Chen et al., 2020) from known transition to unknown transition, and is also the first to consider the most challenging combination: bandit feedback with adversarial costs and unknown transition. To remedy the gap between our upper bounds and the current best lower bounds constructed via a stochastically oblivious adversary, we also propose algorithms with near-optimal regret for this special case.
    Decreasing scaling transition from adaptive gradient descent to stochastic gradient descent. (arXiv:2106.06749v1 [cs.LG])
    (2 min) Currently, researchers have proposed the adaptive gradient descent algorithm and its variants, such as AdaGrad, RMSProp, Adam, AmsGrad, etc. Although these algorithms have a faster speed in the early stage, the generalization ability in the later stage of training is often not as good as the stochastic gradient descent. Recently, some researchers have combined the adaptive gradient descent and stochastic gradient descent to obtain the advantages of both and achieved good results. Based on this research, we propose a decreasing scaling transition from adaptive gradient descent to stochastic gradient descent method(DSTAda). For the training stage of the stochastic gradient descent, we use a learning rate that decreases linearly with the number of iterations instead of a constant learning rate. We achieve a smooth and stable transition from adaptive gradient descent to stochastic gradient descent through scaling. At the same time, we give a theoretical proof of the convergence of DSTAda under the framework of online learning. Our experimental results show that the DSTAda algorithm has a faster convergence speed, higher accuracy, and better stability and robustness. Our implementation is available at: https://github.com/kunzeng/DSTAdam.
    Adaptive Dynamic Pruning for Non-IID Federated Learning. (arXiv:2106.06921v1 [cs.LG])
    (2 min) Federated Learning~(FL) has emerged as a new paradigm of training machine learning models without sacrificing data security and privacy. Learning models at edge devices such as cell phones is one of the most common use case of FL. However, the limited computing power and energy constraints of edge devices hinder the adoption of FL for both model training and deployment, especially for the resource-hungry Deep Neural Networks~(DNNs). To this end, many model compression methods have been proposed and network pruning is among the most well-known. However, a pruning policy for a given model is highly dataset-dependent, which is not suitable for non-Independent and Identically Distributed~(Non-IID) FL edge devices. In this paper, we present an adaptive pruning scheme for edge devices in an FL system, which applies dataset-aware dynamic pruning for inference acceleration on Non-IID datasets. Our evaluation shows that the proposed method accelerates inference by $2\times$~($50\%$ FLOPs reduction) while maintaining the model's quality on edge devices.
    Explainable Artificial Intelligence for Manufacturing Cost Estimation and Machining Feature Visualization. (arXiv:2010.14824v2 [cs.CG] UPDATED)
    (2 min) Studies on manufacturing cost prediction based on deep learning have begun in recent years, but the cost prediction rationale cannot be explained because the models are still used as a black box. This study aims to propose a manufacturing cost prediction process for 3D computer-aided design (CAD) models using explainable artificial intelligence. The proposed process can visualize the machining features of the 3D CAD model that are influencing the increase in manufacturing costs. The proposed process consists of (1) data collection and pre-processing, (2) 3D deep learning architecture exploration, and (3) visualization to explain the prediction results. The proposed deep learning model shows high predictability of manufacturing cost for the computer numerical control (CNC) machined parts. In particular, using 3D gradient-weighted class activation mapping proves that the proposed model not only can detect the CNC machining features but also can differentiate the machining difficulty for the same feature. Using the proposed process, we can provide a design guidance to engineering designers in reducing manufacturing costs during the conceptual design phase. We can also provide real-time quotations and redesign proposals to online manufacturing platform customers.
    Learning Binary Decision Trees by Argmin Differentiation. (arXiv:2010.04627v2 [cs.LG] UPDATED)
    (2 min) We address the problem of learning binary decision trees that partition data for some downstream task. We propose to learn discrete parameters (i.e., for tree traversals and node pruning) and continuous parameters (i.e., for tree split functions and prediction functions) simultaneously using argmin differentiation. We do so by sparsely relaxing a mixed-integer program for the discrete parameters, to allow gradients to pass through the program to continuous parameters. We derive customized algorithms to efficiently compute the forward and backward passes. This means that our tree learning procedure can be used as an (implicit) layer in arbitrary deep networks, and can be optimized with arbitrary loss functions. We demonstrate that our approach produces binary trees that are competitive with existing single tree and ensemble approaches, in both supervised and unsupervised settings. Further, apart from greedy approaches (which do not have competitive accuracies), our method is faster to train than all other tree-learning baselines we compare with. The code for reproducing the results is available at https://github.com/vzantedeschi/LatentTrees.
    Confidential Machine Learning on Untrusted Platforms: A Survey. (arXiv:2012.08156v2 [cs.LG] UPDATED)
    (2 min) With the ever-growing data and the need for developing powerful machine learning models, data owners increasingly depend on various untrusted platforms (e.g., public clouds, edges, and machine learning service providers) for scalable processing or collaborative learning. Thus, sensitive data and models are in danger of unauthorized access, misuse, and privacy compromises. A relatively new body of research confidentially trains machine learning models on protected data to address these concerns. In this survey, we summarize notable studies in this emerging area of research. With a unified framework, we highlight the critical challenges and innovations in outsourcing machine learning confidentially. We focus on the cryptographic approaches for confidential machine learning (CML), primarily on model training, while also covering other directions such as perturbation-based approaches and CML in the hardware-assisted computing environment. The discussion will take a holistic way to consider a rich context of the related threat models, security assumptions, design principles, and associated trade-offs amongst data utility, cost, and confidentiality.
    Random Shuffling Beats SGD Only After Many Epochs on Ill-Conditioned Problems. (arXiv:2106.06880v1 [cs.LG])
    (2 min) Recently, there has been much interest in studying the convergence rates of without-replacement SGD, and proving that it is faster than with-replacement SGD in the worst case. However, these works ignore or do not provide tight bounds in terms of the problem's geometry, including its condition number. Perhaps surprisingly, we prove that when the condition number is taken into account, without-replacement SGD \emph{does not} significantly improve on with-replacement SGD in terms of worst-case bounds, unless the number of epochs (passes over the data) is larger than the condition number. Since many problems in machine learning and other areas are both ill-conditioned and involve large datasets, this indicates that without-replacement does not necessarily improve over with-replacement sampling for realistic iteration budgets. We show this by providing new lower and upper bounds which are tight (up to log factors), for quadratic problems with commuting quadratic terms, precisely quantifying the dependence on the problem parameters.
    Thinking Like Transformers. (arXiv:2106.06981v1 [cs.LG])
    (2 min) What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel. In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language. We map the basic components of a transformer-encoder -- attention and feed-forward computation -- into simple primitives, around which we form a programming language: the Restricted Access Sequence Processing Language (RASP). We show how RASP can be used to program solutions to tasks that could conceivably be learned by a Transformer, and how a Transformer can be trained to mimic a RASP solution. In particular, we provide RASP programs for histograms, sorting, and Dyck-languages. We further use our model to relate their difficulty in terms of the number of required layers and attention heads: analyzing a RASP program implies a maximum number of heads and layers necessary to encode a task in a transformer. Finally, we see how insights gained from our abstraction might be used to explain phenomena seen in recent works.
    Recomposing the Reinforcement Learning Building Blocks with Hypernetworks. (arXiv:2106.06842v1 [cs.LG])
    (2 min) The Reinforcement Learning (RL) building blocks, i.e. Q-functions and policy networks, usually take elements from the cartesian product of two domains as input. In particular, the input of the Q-function is both the state and the action, and in multi-task problems (Meta-RL) the policy can take a state and a context. Standard architectures tend to ignore these variables' underlying interpretations and simply concatenate their features into a single vector. In this work, we argue that this choice may lead to poor gradient estimation in actor-critic algorithms and high variance learning steps in Meta-RL algorithms. To consider the interaction between the input variables, we suggest using a Hypernetwork architecture where a primary network determines the weights of a conditional dynamic network. We show that this approach improves the gradient approximation and reduces the learning step variance, which both accelerates learning and improves the final performance. We demonstrate a consistent improvement across different locomotion tasks and different algorithms both in RL (TD3 and SAC) and in Meta-RL (MAML and PEARL).
    Label Inference Attacks from Log-loss Scores. (arXiv:2105.08266v2 [cs.LG] UPDATED)
    (2 min) Log-loss (also known as cross-entropy loss) metric is ubiquitously used across machine learning applications to assess the performance of classification algorithms. In this paper, we investigate the problem of inferring the labels of a dataset from single (or multiple) log-loss score(s), without any other access to the dataset. Surprisingly, we show that for any finite number of label classes, it is possible to accurately infer the labels of the dataset from the reported log-loss score of a single carefully constructed prediction vector if we allow arbitrary precision arithmetic. Additionally, we present label inference algorithms (attacks) that succeed even under addition of noise to the log-loss scores and under limited precision arithmetic. All our algorithms rely on ideas from number theory and combinatorics and require no model training. We run experimental simulations on some real datasets to demonstrate the ease of running these attacks in practice.
    Inverting Adversarially Robust Networks for Image Synthesis. (arXiv:2106.06927v1 [cs.CV])
    (2 min) Recent research in adversarially robust classifiers suggests their representations tend to be aligned with human perception, which makes them attractive for image synthesis and restoration applications. Despite favorable empirical results on a few downstream tasks, their advantages are limited to slow and sensitive optimization-based techniques. Moreover, their use on generative models remains unexplored. This work proposes the use of robust representations as a perceptual primitive for feature inversion models, and show its benefits with respect to standard non-robust image features. We empirically show that adopting robust representations as an image prior significantly improves the reconstruction accuracy of CNN-based feature inversion models. Furthermore, it allows reconstructing images at multiple scales out-of-the-box. Following these findings, we propose an encoding-decoding network based on robust representations and show its advantages for applications such as anomaly detection, style transfer and image denoising.
    A Simple Unified Framework for High Dimensional Bandit Problems. (arXiv:2102.09626v2 [cs.LG] UPDATED)
    (2 min) Stochastic high dimensional bandit problems with low dimensional structures are useful in different applications such as online advertising and drug discovery. In this work, we propose a simple unified algorithm for such problems and present a general analysis framework for the regret upper bound of our algorithm. We show that under some mild unified assumptions, our algorithm can be applied to different high dimensional bandit problems. Our framework utilizes the low dimensional structure to guide the parameter estimation in the problem, therefore our algorithm achieves the best regret bounds in the LASSO bandit, as well as novel bounds in the low-rank matrix bandit, the group sparse matrix bandit, and in a new problem: the multi-agent LASSO bandit.
    Expected Tight Bounds for Robust Training. (arXiv:1905.12418v5 [cs.LG] UPDATED)
    (2 min) Training Deep Neural Networks that are robust to norm bounded adversarial attacks remains an elusive problem. While exact and inexact verification-based methods are generally too expensive to train large networks, it was demonstrated that bounded input intervals can be inexpensively propagated from a layer to another through deep networks. This interval bound propagation approach (IBP) not only has improved both robustness and certified accuracy but was the first to be employed on large/deep networks. However, due to the very loose nature of the IBP bounds, the required training procedure is complex and involved. In this paper, we closely examine the bounds of a block of layers composed in the form of Affine-ReLU-Affine. To this end, we propose expected tight bounds (true bounds in expectation), referred to as ETB, which are provably tighter than IBP bounds in expectation. We then extend this result to deeper networks through blockwise propagation and show that we can achieve orders of magnitudes tighter bounds compared to IBP. Furthermore, using a simple standard training procedure, we can achieve impressive robustness-accuracy trade-off on both MNIST and CIFAR10.
    Omnidirectional Transfer for Quasilinear Lifelong Learning. (arXiv:2004.12908v7 [cs.AI] UPDATED)
    (2 min) In biological learning, data are used to improve performance not only on the current task, but also on previously encountered and as yet unencountered tasks. In contrast, classical machine learning starts from a blank slate, or tabula rasa, using data only for the single task at hand. While typical transfer learning algorithms can improve performance on future tasks, their performance on prior tasks degrades upon learning new tasks (called catastrophic forgetting). Many recent approaches for continual or lifelong learning have attempted to maintain performance given new tasks. But striving to avoid forgetting sets the goal unnecessarily low: the goal of lifelong learning, whether biological or artificial, should be to improve performance on all tasks (including past and future) with any new data. We propose omnidirectional transfer learning algorithms, which includes two special cases of interest: decision forests and deep networks. Our key insight is the development of the omni-voter layer, which ensembles representations learned independently on all tasks to jointly decide how to proceed on any given new data point, thereby improving performance on both past and future tasks. Our algorithms demonstrate omnidirectional transfer in a variety of simulated and real data scenarios, including tabular data, image data, spoken data, and adversarial tasks. Moreover, they do so with quasilinear space and time complexity.
    Quantifying Ignorance in Individual-Level Causal-Effect Estimates under Hidden Confounding. (arXiv:2103.04850v4 [cs.LG] UPDATED)
    (2 min) We study the problem of learning conditional average treatment effects (CATE) from high-dimensional, observational data with unobserved confounders. Unobserved confounders introduce ignorance -- a level of unidentifiability -- about an individual's response to treatment by inducing bias in CATE estimates. We present a new parametric interval estimator suited for high-dimensional data, that estimates a range of possible CATE values when given a predefined bound on the level of hidden confounding. Further, previous interval estimators do not account for ignorance about the CATE associated with samples that may be underrepresented in the original study, or samples that violate the overlap assumption. Our interval estimator also incorporates model uncertainty so that practitioners can be made aware of out-of-distribution data. We prove that our estimator converges to tight bounds on CATE when there may be unobserved confounding, and assess it using semi-synthetic, high-dimensional datasets.
    Dirty Road Can Attack: Security of Deep Learning based Automated Lane Centering under Physical-World Attack. (arXiv:2009.06701v2 [cs.CR] UPDATED)
    (2 min) Automated Lane Centering (ALC) systems are convenient and widely deployed today, but also highly security and safety critical. In this work, we are the first to systematically study the security of state-of-the-art deep learning based ALC systems in their designed operational domains under physical-world adversarial attacks. We formulate the problem with a safety-critical attack goal, and a novel and domain-specific attack vector: dirty road patches. To systematically generate the attack, we adopt an optimization-based approach and overcome domain-specific design challenges such as camera frame inter-dependencies due to attack-influenced vehicle control, and the lack of objective function design for lane detection models. We evaluate our attack on a production ALC using 80 scenarios from real-world driving traces. The results show that our attack is highly effective with over 97.5% success rates and less than 0.903 sec average success time, which is substantially lower than the average driver reaction time. This attack is also found (1) robust to various real-world factors such as lighting conditions and view angles, (2) general to different model designs, and (3) stealthy from the driver's view. To understand the safety impacts, we conduct experiments using software-in-the-loop simulation and attack trace injection in a real vehicle. The results show that our attack can cause a 100% collision rate in different scenarios, including when tested with common safety features such as automatic emergency braking. We also evaluate and discuss defenses.
    Relearning ensemble selection based on new generated features. (arXiv:2106.06761v1 [cs.LG])
    (2 min) The ensemble methods are meta-algorithms that combine several base machine learning techniques to increase the effectiveness of the classification. Many existing committees of classifiers use the classifier selection process to determine the optimal set of base classifiers. In this article, we propose the classifiers selection framework with relearning base classifiers. Additionally, we use in the proposed framework the new generated feature, which can be obtained after the relearning process. The proposed technique was compared with state-of-the-art ensemble methods using three benchmark datasets and one synthetic dataset. Four classification performance measures are used to evaluate the proposed method.
    About exchanging expectation and supremum for conditional Wasserstein GANs. (arXiv:2103.13906v2 [cs.LG] UPDATED)
    (2 min) In cases where a Wasserstein GAN depends on a condition the latter is usually handled via an expectation within the loss function. Depending on the way this is motivated, the discriminator is either required to be Lipschitz-1 in both or in only one of its arguments. For the weaker requirement to become usable one needs to exchange a supremum and an expectation. This is a mathematically perilous operation, which is, so far, only partially justified in the literature. This short mathematical note intends to fill this gap and provides the mathematical rationale for discriminators that are only partially Lipschitz-1 for cases where this approach is more appropriate or successful.
    Achieving Efficiency in Black Box Simulation of Distribution Tails with Self-structuring Importance Samplers. (arXiv:2102.07060v2 [stat.ML] UPDATED)
    (2 min) Motivated by the increasing adoption of models which facilitate greater automation in risk management and decision-making, this paper presents a novel Importance Sampling (IS) scheme for measuring distribution tails of objectives modelled with enabling tools such as feature-based decision rules, mixed integer linear programs, deep neural networks, etc. Conventional efficient IS approaches suffer from feasibility and scalability concerns due to the need to intricately tailor the sampler to the underlying probability distribution and the objective. This challenge is overcome in the proposed black-box scheme by automating the selection of an effective IS distribution with a transformation that implicitly learns and replicates the concentration properties observed in less rare samples. This novel approach is guided by a large deviations principle that brings out the phenomenon of self-similarity of optimal IS distributions. The proposed sampler is the first to attain asymptotically optimal variance reduction across a spectrum of multivariate distributions despite being oblivious to the underlying structure. The large deviations principle additionally results in new distribution tail asymptotics capable of yielding operational insights. The applicability is illustrated by considering product distribution networks and portfolio credit risk models informed by neural networks as examples.
    Defending against Backdoors in Federated Learning with Robust Learning Rate. (arXiv:2007.03767v3 [cs.LG] UPDATED)
    (2 min) Federated learning (FL) allows a set of agents to collaboratively train a model without sharing their potentially sensitive data. This makes FL suitable for privacy-preserving applications. At the same time, FL is susceptible to adversarial attacks due to decentralized and unvetted data. One important line of attacks against FL is the backdoor attacks. In a backdoor attack, an adversary tries to embed a backdoor functionality to the model during training that can later be activated to cause a desired misclassification. To prevent backdoor attacks, we propose a lightweight defense that requires minimal change to the FL protocol. At a high level, our defense is based on carefully adjusting the aggregation server's learning rate, per dimension and per round, based on the sign information of agents' updates. We first conjecture the necessary steps to carry a successful backdoor attack in FL setting, and then, explicitly formulate the defense based on our conjecture. Through experiments, we provide empirical evidence that supports our conjecture, and we test our defense against backdoor attacks under different settings. We observe that either backdoor is completely eliminated, or its accuracy is significantly reduced. Overall, our experiments suggest that our defense significantly outperforms some of the recently proposed defenses in the literature. We achieve this by having minimal influence over the accuracy of the trained models. In addition, we also provide convergence rate analysis for our proposed scheme.
    Weisfeiler and Lehman Go Topological: Message Passing Simplicial Networks. (arXiv:2103.03212v2 [cs.LG] UPDATED)
    (2 min) The pairwise interaction paradigm of graph machine learning has predominantly governed the modelling of relational systems. However, graphs alone cannot capture the multi-level interactions present in many complex systems and the expressive power of such schemes was proven to be limited. To overcome these limitations, we propose Message Passing Simplicial Networks (MPSNs), a class of models that perform message passing on simplicial complexes (SCs). To theoretically analyse the expressivity of our model we introduce a Simplicial Weisfeiler-Lehman (SWL) colouring procedure for distinguishing non-isomorphic SCs. We relate the power of SWL to the problem of distinguishing non-isomorphic graphs and show that SWL and MPSNs are strictly more powerful than the WL test and not less powerful than the 3-WL test. We deepen the analysis by comparing our model with traditional graph neural networks (GNNs) with ReLU activations in terms of the number of linear regions of the functions they can represent. We empirically support our theoretical claims by showing that MPSNs can distinguish challenging strongly regular graphs for which GNNs fail and, when equipped with orientation equivariant layers, they can improve classification accuracy in oriented SCs compared to a GNN baseline.
    Locally Adaptive Label Smoothing for Predictive Churn. (arXiv:2102.05140v2 [cs.LG] UPDATED)
    (2 min) Training modern neural networks is an inherently noisy process that can lead to high \emph{prediction churn} -- disagreements between re-trainings of the same model due to factors such as randomization in the parameter initialization and mini-batches -- even when the trained models all attain similar accuracies. Such prediction churn can be very undesirable in practice. In this paper, we present several baselines for reducing churn and show that training on soft labels obtained by adaptively smoothing each example's label based on the example's neighboring labels often outperforms the baselines on churn while improving accuracy on a variety of benchmark classification tasks and model architectures.
    Learning Randomly Perturbed Structured Predictors for Direct Loss Minimization. (arXiv:2007.05724v2 [stat.ML] UPDATED)
    (2 min) Direct loss minimization is a popular approach for learning predictors over structured label spaces. This approach is computationally appealing as it replaces integration with optimization and allows to propagate gradients in a deep net using loss-perturbed prediction. Recently, this technique was extended to generative models, while introducing a randomized predictor that samples a structure from a randomly perturbed score function. In this work, we learn the variance of these randomized structured predictors and show that it balances better between the learned score function and the randomized noise in structured prediction. We demonstrate empirically the effectiveness of learning the balance between the signal and the random noise in structured discrete spaces.
    Controllable Generation from Pre-trained Language Models via Inverse Prompting. (arXiv:2103.10685v2 [cs.CL] UPDATED)
    (2 min) Large-scale pre-trained language models have demonstrated strong capabilities of generating realistic text. However, it remains challenging to control the generation results. Previous approaches such as prompting are far from sufficient, which limits the usage of language models. To tackle this challenge, we propose an innovative method, inverse prompting, to better control text generation. The core idea of inverse prompting is to use generated text to inversely predict the prompt during beam search, which enhances the relevance between the prompt and the generated text and provides better controllability. Empirically, we pre-train a large-scale Chinese language model to perform a systematic study using human evaluation on the tasks of open-domain poem generation and open-domain long-form question answering. Our results show that our proposed method substantially outperforms the baselines and that our generation quality is close to human performance on some of the tasks. Narrators can try our poem generation demo at https://pretrain.aminer.cn/apps/poetry.html, while our QA demo can be found at https://pretrain.aminer.cn/app/qa. For researchers, the code is provided in https://github.com/THUDM/InversePrompting.
    COHORTNEY: Non-Parametric Clustering of Event Sequences. (arXiv:2104.01440v2 [cs.LG] UPDATED)
    (2 min) Cohort analysis is a pervasive activity in web analytics. One divides users into groups according to specific criteria and tracks their behavior over time. Despite its extensive use, academic circles do not discuss cohort analysis to evaluate user behavior online. This work introduces an unsupervised non-parametric approach to group Internet users based on their activities. In comparison, canonical methods in marketing and engineering-based techniques underperform. COHORTNEY is the first machine learning-based cohort analysis algorithm with a robust theoretical explanation.
    Quantitative Understanding of VAE as a Non-linearly Scaled Isometric Embedding. (arXiv:2007.15190v3 [stat.ML] UPDATED)
    (2 min) Variational autoencoder (VAE) estimates the posterior parameters (mean and variance) of latent variables corresponding to each input data. While it is used for many tasks, the transparency of the model is still an underlying issue. This paper provides a quantitative understanding of VAE property through the differential geometric and information-theoretic interpretations of VAE. According to the Rate-distortion theory, the optimal transform coding is achieved by using an orthonormal transform with PCA basis where the transform space is isometric to the input. Considering the analogy of transform coding to VAE, we clarify theoretically and experimentally that VAE can be mapped to an implicit isometric embedding with a scale factor derived from the posterior parameter. As a result, we can estimate the data probabilities in the input space from the prior, loss metrics, and corresponding posterior parameters, and further, the quantitative importance of each latent variable can be evaluated like the eigenvalue of PCA.
    Graph Neural Networks Meet Neural-Symbolic Computing: A Survey and Perspective. (arXiv:2003.00330v7 [cs.AI] UPDATED)
    (2 min) Neural-symbolic computing has now become the subject of interest of both academic and industry research laboratories. Graph Neural Networks (GNN) have been widely used in relational and symbolic domains, with widespread application of GNNs in combinatorial optimization, constraint satisfaction, relational reasoning and other scientific domains. The need for improved explainability, interpretability and trust of AI systems in general demands principled methodologies, as suggested by neural-symbolic computing. In this paper, we review the state-of-the-art on the use of GNNs as a model of neural-symbolic computing. This includes the application of GNNs in several domains as well as its relationship to current developments in neural-symbolic computing.
    Evaluating Entity Disambiguation and the Role of Popularity in Retrieval-Based NLP. (arXiv:2106.06830v1 [cs.CL])
    (2 min) Retrieval is a core component for open-domain NLP tasks. In open-domain tasks, multiple entities can share a name, making disambiguation an inherent yet under-explored problem. We propose an evaluation benchmark for assessing the entity disambiguation capabilities of these retrievers, which we call Ambiguous Entity Retrieval (AmbER) sets. We define an AmbER set as a collection of entities that share a name along with queries about those entities. By covering the set of entities for polysemous names, AmbER sets act as a challenging test of entity disambiguation. We create AmbER sets for three popular open-domain tasks: fact checking, slot filling, and question answering, and evaluate a diverse set of retrievers. We find that the retrievers exhibit popularity bias, significantly under-performing on rarer entities that share a name, e.g., they are twice as likely to retrieve erroneous documents on queries for the less popular entity under the same name. These experiments on AmbER sets show their utility as an evaluation tool and highlight the weaknesses of popular retrieval systems.
    A Free Lunch From ANN: Towards Efficient, Accurate Spiking Neural Networks Calibration. (arXiv:2106.06984v1 [cs.LG])
    (2 min) Spiking Neural Network (SNN) has been recognized as one of the next generation of neural networks. Conventionally, SNN can be converted from a pre-trained ANN by only replacing the ReLU activation to spike activation while keeping the parameters intact. Perhaps surprisingly, in this work we show that a proper way to calibrate the parameters during the conversion of ANN to SNN can bring significant improvements. We introduce SNN Calibration, a cheap but extraordinarily effective method by leveraging the knowledge within a pre-trained Artificial Neural Network (ANN). Starting by analyzing the conversion error and its propagation through layers theoretically, we propose the calibration algorithm that can correct the error layer-by-layer. The calibration only takes a handful number of training data and several minutes to finish. Moreover, our calibration algorithm can produce SNN with state-of-the-art architecture on the large-scale ImageNet dataset, including MobileNet and RegNet. Extensive experiments demonstrate the effectiveness and efficiency of our algorithm. For example, our advanced pipeline can increase up to 69% top-1 accuracy when converting MobileNet on ImageNet compared to baselines. Codes are released at https://github.com/yhhhli/SNN_Calibration.
    Equivariant Networks for Pixelized Spheres. (arXiv:2106.06662v1 [cs.LG])
    (2 min) Pixelizations of Platonic solids such as the cube and icosahedron have been widely used to represent spherical data, from climate records to Cosmic Microwave Background maps. Platonic solids have well-known global symmetries. Once we pixelize each face of the solid, each face also possesses its own local symmetries in the form of Euclidean isometries. One way to combine these symmetries is through a hierarchy. However, this approach does not adequately model the interplay between the two levels of symmetry transformations. We show how to model this interplay using ideas from group theory, identify the equivariant linear maps, and introduce equivariant padding that respects these symmetries. Deep networks that use these maps as their building blocks generalize gauge equivariant CNNs on pixelized spheres. These deep networks achieve state-of-the-art results on semantic segmentation for climate data and omnidirectional image processing. Code is available at https://git.io/JGiZA.
    Solving Graph-based Public Good Games with Tree Search and Imitation Learning. (arXiv:2106.06762v1 [cs.AI])
    (2 min) Public goods games represent insightful settings for studying incentives for individual agents to make contributions that, while costly for each of them, benefit the wider society. In this work, we adopt the perspective of a central planner with a global view of a network of self-interested agents and the goal of maximizing some desired property in the context of a best-shot public goods game. Existing algorithms for this known NP-complete problem find solutions that are sub-optimal and cannot optimize for criteria other than social welfare. In order to efficiently solve public goods games, our proposed method directly exploits the correspondence between equilibria and the Maximal Independent Set (mIS) structural property of graphs. In particular, we define a Markov Decision Process, which incrementally generates an mIS, and adopt a planning method to search for equilibria, outperforming existing methods. Furthermore, we devise an imitation learning technique that uses demonstrations of the search to obtain a graph neural network parametrized policy which quickly generalizes to unseen game instances. Our evaluation results show that this policy is able to reach 99.5% of the performance of the planning method while being approximately three orders of magnitude faster to evaluate on the largest graphs tested. The methods presented in this work can be applied to a large class of public goods games of potentially high societal impact.
    Model-free Reinforcement Learning for Branching Markov Decision Processes. (arXiv:2106.06777v1 [cs.LG])
    (2 min) We study reinforcement learning for the optimal control of Branching Markov Decision Processes (BMDPs), a natural extension of (multitype) Branching Markov Chains (BMCs). The state of a (discrete-time) BMCs is a collection of entities of various types that, while spawning other entities, generate a payoff. In comparison with BMCs, where the evolution of a each entity of the same type follows the same probabilistic pattern, BMDPs allow an external controller to pick from a range of options. This permits us to study the best/worst behaviour of the system. We generalise model-free reinforcement learning techniques to compute an optimal control strategy of an unknown BMDP in the limit. We present results of an implementation that demonstrate the practicality of the approach.
    Speak or Chat with Me: End-to-End Spoken Language Understanding System with Flexible Inputs. (arXiv:2104.05752v2 [cs.CL] UPDATED)
    (2 min) A major focus of recent research in spoken language understanding (SLU) has been on the end-to-end approach where a single model can predict intents directly from speech inputs without intermediate transcripts. However, this approach presents some challenges. First, since speech can be considered as personally identifiable information, in some cases only automatic speech recognition (ASR) transcripts are accessible. Second, intent-labeled speech data is scarce. To address the first challenge, we propose a novel system that can predict intents from flexible types of inputs: speech, ASR transcripts, or both. We demonstrate strong performance for either modality separately, and when both speech and ASR transcripts are available, through system combination, we achieve better results than using a single input modality. To address the second challenge, we leverage a semantically robust pre-trained BERT model and adopt a cross-modal system that co-trains text embeddings and acoustic embeddings in a shared latent space. We further enhance this system by utilizing an acoustic module pre-trained on LibriSpeech and domain-adapting the text module on our target datasets. Our experiments show significant advantages for these pre-training and fine-tuning strategies, resulting in a system that achieves competitive intent-classification performance on Snips SLU and Fluent Speech Commands datasets.
    Precise characterization of the prior predictive distribution of deep ReLU networks. (arXiv:2106.06615v1 [cs.LG])
    (2 min) Recent works on Bayesian neural networks (BNNs) have highlighted the need to better understand the implications of using Gaussian priors in combination with the compositional structure of the network architecture. Similar in spirit to the kind of analysis that has been developed to devise better initialization schemes for neural networks (cf. He- or Xavier initialization), we derive a precise characterization of the prior predictive distribution of finite-width ReLU networks with Gaussian weights. While theoretical results have been obtained for their heavy-tailedness, the full characterization of the prior predictive distribution (i.e. its density, CDF and moments), remained unknown prior to this work. Our analysis, based on the Meijer-G function, allows us to quantify the influence of architectural choices such as the width or depth of the network on the resulting shape of the prior predictive distribution. We also formally connect our results to previous work in the infinite width setting, demonstrating that the moments of the distribution converge to those of a normal log-normal mixture in the infinite depth limit. Finally, our results provide valuable guidance on prior design: for instance, controlling the predictive variance with depth- and width-informed priors on the weights of the network.
    Joint Client Scheduling and Resource Allocation under Channel Uncertainty in Federated Learning. (arXiv:2106.06796v1 [cs.LG])
    (2 min) The performance of federated learning (FL) over wireless networks depend on the reliability of the client-server connectivity and clients' local computation capabilities. In this article we investigate the problem of client scheduling and resource block (RB) allocation to enhance the performance of model training using FL, over a pre-defined training duration under imperfect channel state information (CSI) and limited local computing resources. First, we analytically derive the gap between the training losses of FL with clients scheduling and a centralized training method for a given training duration. Then, we formulate the gap of the training loss minimization over client scheduling and RB allocation as a stochastic optimization problem and solve it using Lyapunov optimization. A Gaussian process regression-based channel prediction method is leveraged to learn and track the wireless channel, in which, the clients' CSI predictions and computing power are incorporated into the scheduling decision. Using an extensive set of simulations, we validate the robustness of the proposed method under both perfect and imperfect CSI over an array of diverse data distributions. Results show that the proposed method reduces the gap of the training accuracy loss by up to 40.7% compared to state-of-theart client scheduling and RB allocation methods.
    Synthesized Difference in Differences. (arXiv:2105.00455v2 [stat.ML] UPDATED)
    (2 min) We consider estimating the conditional average treatment effect for everyone by eliminating confounding and selection bias. Unfortunately, randomized clinical trials (RCTs) eliminate confounding but impose strict exclusion criteria that prevent sampling of the entire clinical population. Observational datasets are more inclusive but suffer from confounding. We therefore analyze RCT and observational data simultaneously in order to extract the strengths of each. Our solution builds upon Difference in Differences (DD), an algorithm that eliminates confounding from observational data by comparing outcomes before and after treatment administration. DD requires a parallel slopes assumption that may not apply in practice when confounding shifts across time. We instead propose Synthesized Difference in Differences (SDD) that infers the correct (possibly non-parallel) slopes by linearly adjusting a conditional version of DD using additional RCT data. The algorithm achieves state of the art performance across multiple synthetic and real datasets even when the RCT excludes the majority of patients.
    Achieving Near Instance-Optimality and Minimax-Optimality in Stochastic and Adversarial Linear Bandits Simultaneously. (arXiv:2102.05858v3 [cs.LG] UPDATED)
    (2 min) In this work, we develop linear bandit algorithms that automatically adapt to different environments. By plugging a novel loss estimator into the optimization problem that characterizes the instance-optimal strategy, our first algorithm not only achieves nearly instance-optimal regret in stochastic environments, but also works in corrupted environments with additional regret being the amount of corruption, while the state-of-the-art (Li et al., 2019) achieves neither instance-optimality nor the optimal dependence on the corruption amount. Moreover, by equipping this algorithm with an adversarial component and carefully-designed testings, our second algorithm additionally enjoys minimax-optimal regret in completely adversarial environments, which is the first of this kind to our knowledge. Finally, all our guarantees hold with high probability, while existing instance-optimal guarantees only hold in expectation.
    Augmented World Models Facilitate Zero-Shot Dynamics Generalization From a Single Offline Environment. (arXiv:2104.05632v2 [cs.LG] UPDATED)
    (2 min) Reinforcement learning from large-scale offline datasets provides us with the ability to learn policies without potentially unsafe or impractical exploration. Significant progress has been made in the past few years in dealing with the challenge of correcting for differing behavior between the data collection and learned policies. However, little attention has been paid to potentially changing dynamics when transferring a policy to the online setting, where performance can be up to 90% reduced for existing methods. In this paper we address this problem with Augmented World Models (AugWM). We augment a learned dynamics model with simple transformations that seek to capture potential changes in physical properties of the robot, leading to more robust policies. We not only train our policy in this new setting, but also provide it with the sampled augmentation as a context, allowing it to adapt to changes in the environment. At test time we learn the context in a self-supervised fashion by approximating the augmentation which corresponds to the new environment. We rigorously evaluate our approach on over 100 different changed dynamics settings, and show that this simple approach can significantly improve the zero-shot generalization of a recent state-of-the-art baseline, often achieving successful policies where the baseline fails.
    Correlated Weights in Infinite Limits of Deep Convolutional Neural Networks. (arXiv:2101.04097v2 [stat.ML] UPDATED)
    (2 min) Infinite width limits of deep neural networks often have tractable forms. They have been used to analyse the behaviour of finite networks, as well as being useful methods in their own right. When investigating infinitely wide convolutional neural networks (CNNs), it was observed that the correlations arising from spatial weight sharing disappear in the infinite limit. This is undesirable, as spatial correlation is the main motivation behind CNNs. We show that the loss of this property is not a consequence of the infinite limit, but rather of choosing an independent weight prior. Correlating the weights maintains the correlations in the activations. Varying the amount of correlation interpolates between independent-weight limits and mean-pooling. Empirical evaluation of the infinitely wide network shows that optimal performance is achieved between the extremes, indicating that correlations can be useful.
    Fantastic Four: Differentiable Bounds on Singular Values of Convolution Layers. (arXiv:1911.10258v3 [cs.LG] UPDATED)
    (2 min) In deep neural networks, the spectral norm of the Jacobian of a layer bounds the factor by which the norm of a signal changes during forward/backward propagation. Spectral norm regularizations have been shown to improve generalization, robustness and optimization of deep learning methods. Existing methods to compute the spectral norm of convolution layers either rely on heuristics that are efficient in computation but lack guarantees or are theoretically-sound but computationally expensive. In this work, we obtain the best of both worlds by deriving {\it four} provable upper bounds on the spectral norm of a standard 2D multi-channel convolution layer. These bounds are differentiable and can be computed efficiently during training with negligible overhead. One of these bounds is in fact the popular heuristic method of Miyato et al. (multiplied by a constant factor depending on filter sizes). Each of these four bounds can achieve the tightest gap depending on convolution filters. Thus, we propose to use the minimum of these four bounds as a tight, differentiable and efficient upper bound on the spectral norm of convolution layers. We show that our spectral bound is an effective regularizer and can be used to bound either the lipschitz constant or curvature values (eigenvalues of the Hessian) of neural networks. Through experiments on MNIST and CIFAR-10, we demonstrate the effectiveness of our spectral bound in improving generalization and provable robustness of deep networks.
    Learning from History for Byzantine Robust Optimization. (arXiv:2012.10333v2 [cs.LG] UPDATED)
    (2 min) Byzantine robustness has received significant attention recently given its importance for distributed and federated learning. In spite of this, we identify severe flaws in existing algorithms even when the data across the participants is identically distributed. First, we show realistic examples where current state of the art robust aggregation rules fail to converge even in the absence of any Byzantine attackers. Secondly, we prove that even if the aggregation rules may succeed in limiting the influence of the attackers in a single round, the attackers can couple their attacks across time eventually leading to divergence. To address these issues, we present two surprisingly simple strategies: a new robust iterative clipping procedure, and incorporating worker momentum to overcome time-coupled attacks. This is the first provably robust method for the standard stochastic optimization setting. Our code is open sourced at https://github.com/epfml/byzantine-robust-optimizer.
    Assessing Multilingual Fairness in Pre-trained Multimodal Representations. (arXiv:2106.06683v1 [cs.CL])
    (2 min) Recently pre-trained multimodal models, such as CLIP, have received a surge of attention for their exceptional capabilities towards connecting images and natural language. The textual representations in English can be desirably transferred to multilingualism and support promising downstream multimodal tasks for different languages. Nevertheless, previous fairness discourse in vision-and-language learning mainly focuses on monolingual representational biases, and rarely scrutinizes the principles of multilingual fairness in this multimodal setting, where one language is equated to a group of individuals and images provide the universal grounding for bridging different languages. In this paper, we provide a nuanced understanding of individual fairness and group fairness by viewing language as the recipient of fairness notions. We define new fairness notions within multilingual context and analytically articulate that, pre-trained vision-and-language representations are individually fair across languages but not guaranteed to group fairness. Furthermore, we conduct extensive experiments to explore the prevalent group disparity across languages and protected groups including race, gender and age.
    Learning from Crowds by Modeling Common Confusions. (arXiv:2012.13052v2 [cs.LG] UPDATED)
    (2 min) Crowdsourcing provides a practical way to obtain large amounts of labeled data at a low cost. However, the annotation quality of annotators varies considerably, which imposes new challenges in learning a high-quality model from the crowdsourced annotations. In this work, we provide a new perspective to decompose annotation noise into common noise and individual noise and differentiate the source of confusion based on instance difficulty and annotator expertise on a per-instance-annotator basis. We realize this new crowdsourcing model by an end-to-end learning solution with two types of noise adaptation layers: one is shared across annotators to capture their commonly shared confusions, and the other one is pertaining to each annotator to realize individual confusion. To recognize the source of noise in each annotation, we use an auxiliary network to choose the two noise adaptation layers with respect to both instances and annotators. Extensive experiments on both synthesized and real-world benchmarks demonstrate the effectiveness of our proposed common noise adaptation solution.
    Machine Unlearning for Random Forests. (arXiv:2009.05567v2 [cs.LG] UPDATED)
    (2 min) Responding to user data deletion requests, removing noisy examples, or deleting corrupted training data are just a few reasons for wanting to delete instances from a machine learning (ML) model. However, efficiently removing this data from an ML model is generally difficult. In this paper, we introduce data removal-enabled (DaRE) forests, a variant of random forests that enables the removal of training data with minimal retraining. Model updates for each DaRE tree in the forest are exact, meaning that removing instances from a DaRE model yields exactly the same model as retraining from scratch on updated data. DaRE trees use randomness and caching to make data deletion efficient. The upper levels of DaRE trees use random nodes, which choose split attributes and thresholds uniformly at random. These nodes rarely require updates because they only minimally depend on the data. At the lower levels, splits are chosen to greedily optimize a split criterion such as Gini index or mutual information. DaRE trees cache statistics at each node and training data at each leaf, so that only the necessary subtrees are updated as data is removed. For numerical attributes, greedy nodes optimize over a random subset of thresholds, so that they can maintain statistics while approximating the optimal threshold. By adjusting the number of thresholds considered for greedy nodes, and the number of random nodes, DaRE trees can trade off between more accurate predictions and more efficient updates. In experiments on 13 real-world datasets and one synthetic dataset, we find DaRE forests delete data orders of magnitude faster than retraining from scratch while sacrificing little to no predictive power.
    Continuous Coordination As a Realistic Scenario for Lifelong Learning. (arXiv:2103.03216v2 [cs.LG] UPDATED)
    (2 min) Current deep reinforcement learning (RL) algorithms are still highly task-specific and lack the ability to generalize to new environments. Lifelong learning (LLL), however, aims at solving multiple tasks sequentially by efficiently transferring and using knowledge between tasks. Despite a surge of interest in lifelong RL in recent years, the lack of a realistic testbed makes robust evaluation of LLL algorithms difficult. Multi-agent RL (MARL), on the other hand, can be seen as a natural scenario for lifelong RL due to its inherent non-stationarity, since the agents' policies change over time. In this work, we introduce a multi-agent lifelong learning testbed that supports both zero-shot and few-shot settings. Our setup is based on Hanabi -- a partially-observable, fully cooperative multi-agent game that has been shown to be challenging for zero-shot coordination. Its large strategy space makes it a desirable environment for lifelong RL tasks. We evaluate several recent MARL methods, and benchmark state-of-the-art LLL algorithms in limited memory and computation regimes to shed light on their strengths and weaknesses. This continual learning paradigm also provides us with a pragmatic way of going beyond centralized training which is the most commonly used training protocol in MARL. We empirically show that the agents trained in our setup are able to coordinate well with unseen agents, without any additional assumptions made by previous works. The code and all pre-trained models are available at https://github.com/chandar-lab/Lifelong-Hanabi.
    Fictitious play in zero-sum stochastic games. (arXiv:2010.04223v4 [cs.GT] UPDATED)
    (2 min) We present fictitious play dynamics for stochastic games and analyze its convergence properties in zero-sum stochastic games. Our dynamics involves players forming beliefs on opponent strategy and their own continuation payoff (Q-function), and playing a greedy best response using estimated continuation payoffs. Players update their beliefs from observations of opponent actions. A key property of the learning dynamics is that update of the beliefs on Q-functions occurs at a slower timescale than update of the beliefs on strategies. We show both in the model-based and model-free cases (without knowledge of player payoff functions and state transition probabilities), the beliefs on strategies converge to a stationary mixed Nash equilibrium of the zero-sum stochastic game.
    Neural Path Features and Neural Path Kernel : Understanding the role of gates in deep learning. (arXiv:2006.10529v2 [cs.LG] UPDATED)
    (2 min) Rectified linear unit (ReLU) activations can also be thought of as 'gates', which, either pass or stop their pre-activation input when they are 'on' (when the pre-activation input is positive) or 'off' (when the pre-activation input is negative) respectively. A deep neural network (DNN) with ReLU activations has many gates, and the on/off status of each gate changes across input examples as well as network weights. For a given input example, only a subset of gates are 'active', i.e., on, and the sub-network of weights connected to these active gates is responsible for producing the output. At randomised initialisation, the active sub-network corresponding to a given input example is random. During training, as the weights are learnt, the active sub-networks are also learnt, and potentially hold very valuable information. In this paper, we analytically characterise the role of active sub-networks in deep learning. To this end, we encode the on/off state of the gates of a given input in a novel 'neural path feature' (NPF), and the weights of the DNN are encoded in a novel 'neural path value' (NPV). Further, we show that the output of network is indeed the inner product of NPF and NPV. The main result of the paper shows that the 'neural path kernel' associated with the NPF is a fundamental quantity that characterises the information stored in the gates of a DNN. We show via experiments (on MNIST and CIFAR-10) that in standard DNNs with ReLU activations NPFs are learnt during training and such learning is key for generalisation. Furthermore, NPFs and NPVs can be learnt in two separate networks and such learning also generalises well in experiments.
    Memory-efficient Transformers via Top-$k$ Attention. (arXiv:2106.06899v1 [cs.CL])
    (2 min) Following the success of dot-product attention in Transformers, numerous approximations have been recently proposed to address its quadratic complexity with respect to the input length. While these variants are memory and compute efficient, it is not possible to directly use them with popular pre-trained language models trained using vanilla attention, without an expensive corrective pre-training stage. In this work, we propose a simple yet highly accurate approximation for vanilla attention. We process the queries in chunks, and for each query, compute the top-$k$ scores with respect to the keys. Our approach offers several advantages: (a) its memory usage is linear in the input size, similar to linear attention variants, such as Performer and RFA (b) it is a drop-in replacement for vanilla attention that does not require any corrective pre-training, and (c) it can also lead to significant memory savings in the feed-forward layers after casting them into the familiar query-key-value framework. We evaluate the quality of top-$k$ approximation for multi-head attention layers on the Long Range Arena Benchmark, and for feed-forward layers of T5 and UnifiedQA on multiple QA datasets. We show our approach leads to accuracy that is nearly-identical to vanilla attention in multiple setups including training from scratch, fine-tuning, and zero-shot inference.
    Contextualized Streaming End-to-End Speech Recognition with Trie-Based Deep Biasing and Shallow Fusion. (arXiv:2104.02194v2 [cs.CL] UPDATED)
    (2 min) How to leverage dynamic contextual information in end-to-end speech recognition has remained an active research area. Previous solutions to this problem were either designed for specialized use cases that did not generalize well to open-domain scenarios, did not scale to large biasing lists, or underperformed on rare long-tail words. We address these limitations by proposing a novel solution that combines shallow fusion, trie-based deep biasing, and neural network language model contextualization. These techniques result in significant 19.5% relative Word Error Rate improvement over existing contextual biasing approaches and 5.4%-9.3% improvement compared to a strong hybrid baseline on both open-domain and constrained contextualization tasks, where the targets consist of mostly rare long-tail words. Our final system remains lightweight and modular, allowing for quick modification without model re-training.
    Length-Adaptive Transformer: Train Once with Length Drop, Use Anytime with Search. (arXiv:2010.07003v2 [cs.CL] UPDATED)
    (2 min) Despite transformers' impressive accuracy, their computational cost is often prohibitive to use with limited computational resources. Most previous approaches to improve inference efficiency require a separate model for each possible computational budget. In this paper, we extend PoWER-BERT (Goyal et al., 2020) and propose Length-Adaptive Transformer that can be used for various inference scenarios after one-shot training. We train a transformer with LengthDrop, a structural variant of dropout, which stochastically determines a sequence length at each layer. We then conduct a multi-objective evolutionary search to find a length configuration that maximizes the accuracy and minimizes the efficiency metric under any given computational budget. Additionally, we significantly extend the applicability of PoWER-BERT beyond sequence-level classification into token-level classification with Drop-and-Restore process that drops word-vectors temporarily in intermediate layers and restores at the last layer if necessary. We empirically verify the utility of the proposed approach by demonstrating the superior accuracy-efficiency trade-off under various setups, including span-based question answering and text classification. Code is available at https://github.com/clovaai/length-adaptive-transformer.
    On The Radon-Nikodym Spectral Approach With Optimal Clustering. (arXiv:1906.00460v16 [cs.LG] UPDATED)
    (3 min) Problems of interpolation, classification, and clustering are considered. In the tenets of Radon--Nikodym approach $\langle f(\mathbf{x})\psi^2 \rangle / \langle\psi^2\rangle$, where the $\psi(\mathbf{x})$ is a linear function on input attributes, all the answers are obtained from a generalized eigenproblem $|f|\psi^{[i]}\rangle = \lambda^{[i]} |\psi^{[i]}\rangle$. The solution to the interpolation problem is a regular Radon-Nikodym derivative. The solution to the classification problem requires prior and posterior probabilities that are obtained using the Lebesgue quadrature[1] technique. Whereas in a Bayesian approach new observations change only outcome probabilities, in the Radon-Nikodym approach not only outcome probabilities but also the probability space $|\psi^{[i]}\rangle$ change with new observations. This is a remarkable feature of the approach: both the probabilities and the probability space are constructed from the data. The Lebesgue quadrature technique can be also applied to the optimal clustering problem. The problem is solved by constructing a Gaussian quadrature on the Lebesgue measure. A distinguishing feature of the Radon-Nikodym approach is the knowledge of the invariant group: all the answers are invariant relatively any non-degenerated linear transform of input vector $\mathbf{x}$ components. A software product implementing the algorithms of interpolation, classification, and optimal clustering is available from the authors.
    Bandwidth-Agile Image Transmission with Deep Joint Source-Channel Coding. (arXiv:2009.12480v2 [cs.IT] UPDATED)
    (2 min) We propose deep learning based communication methods for adaptive-bandwidth transmission of images over wireless channels. We consider the scenario in which images are transmitted progressively in layers over time or frequency, and such layers can be aggregated by receivers in order to increase the quality of their reconstructions. We investigate two scenarios, one in which the layers are sent sequentially, and incrementally contribute to the refinement of a reconstruction, and another in which the layers are independent and can be retrieved in any order. Those scenarios correspond to the well known problems of \textit{successive refinement} and \textit{multiple descriptions}, respectively, in the context of joint source-channel coding (JSCC). We propose DeepJSCC-$l$, an innovative solution that uses convolutional autoencoders, and present three architectures with different complexity trade-offs. To the best of our knowledge, this is the first practical multiple-description JSCC scheme developed and tested for practical information sources and channels. Numerical results show that DeepJSCC-$l$ can learn to transmit the source progressively with negligible losses in the end-to-end performance compared with a single transmission. Moreover, DeepJSCC-$l$ has comparable performance with state of the art digital progressive transmission schemes in the challenging low signal-to-noise ratio (SNR) and small bandwidth regimes, with the additional advantage of graceful degradation with channel SNR.
    Entropy-based Logic Explanations of Neural Networks. (arXiv:2106.06804v1 [cs.AI])
    (2 min) Explainable artificial intelligence has rapidly emerged since lawmakers have started requiring interpretable models for safety-critical domains. Concept-based neural networks have arisen as explainable-by-design methods as they leverage human-understandable symbols (i.e. concepts) to predict class memberships. However, most of these approaches focus on the identification of the most relevant concepts but do not provide concise, formal explanations of how such concepts are leveraged by the classifier to make predictions. In this paper, we propose a novel end-to-end differentiable approach enabling the extraction of logic explanations from neural networks using the formalism of First-Order Logic. The method relies on an entropy-based criterion which automatically identifies the most relevant concepts. We consider four different case studies to demonstrate that: (i) this entropy-based criterion enables the distillation of concise logic explanations in safety-critical domains from clinical data to computer vision; (ii) the proposed approach outperforms state-of-the-art white-box models in terms of classification accuracy.
    Weakly-supervised Graph Meta-learning for Few-shot Node Classification. (arXiv:2106.06873v1 [cs.LG])
    (2 min) Graphs are widely used to model the relational structure of data, and the research of graph machine learning (ML) has a wide spectrum of applications ranging from drug design in molecular graphs to friendship recommendation in social networks. Prevailing approaches for graph ML typically require abundant labeled instances in achieving satisfactory results, which is commonly infeasible in real-world scenarios since labeled data for newly emerged concepts (e.g., new categorizations of nodes) on graphs is limited. Though meta-learning has been applied to different few-shot graph learning problems, most existing efforts predominately assume that all the data from those seen classes is gold-labeled, while those methods may lose their efficacy when the seen data is weakly-labeled with severe label noise. As such, we aim to investigate a novel problem of weakly-supervised graph meta-learning for improving the model robustness in terms of knowledge transfer. To achieve this goal, we propose a new graph meta-learning framework -- Graph Hallucination Networks (Meta-GHN) in this paper. Based on a new robustness-enhanced episodic training, Meta-GHN is meta-learned to hallucinate clean node representations from weakly-labeled data and extracts highly transferable meta-knowledge, which enables the model to quickly adapt to unseen tasks with few labeled instances. Extensive experiments demonstrate the superiority of Meta-GHN over existing graph meta-learning studies on the task of weakly-supervised few-shot node classification.
    Distribution-free uncertainty quantification for classification under label shift. (arXiv:2103.03323v3 [stat.ML] UPDATED)
    (2 min) Trustworthy deployment of ML models requires a proper measure of uncertainty, especially in safety-critical applications. We focus on uncertainty quantification (UQ) for classification problems via two avenues -- prediction sets using conformal prediction and calibration of probabilistic predictors by post-hoc binning -- since these possess distribution-free guarantees for i.i.d. data. Two common ways of generalizing beyond the i.i.d. setting include handling covariate and label shift. Within the context of distribution-free UQ, the former has already received attention, but not the latter. It is known that label shift hurts prediction, and we first argue that it also hurts UQ, by showing degradation in coverage and calibration. Piggybacking on recent progress in addressing label shift (for better prediction), we examine the right way to achieve UQ by reweighting the aforementioned conformal and calibration procedures whenever some unlabeled data from the target distribution is available. We examine these techniques theoretically in a distribution-free framework and demonstrate their excellent practical performance.
    Matrix games with bandit feedback. (arXiv:2006.05145v2 [cs.LG] UPDATED)
    (2 min) We study a version of the classical zero-sum matrix game with unknown payoff matrix and bandit feedback, where the players only observe each others actions and a noisy payoff. This generalizes the usual matrix game, where the payoff matrix is known to the players. Despite numerous applications, this problem has received relatively little attention. Although adversarial bandit algorithms achieve low regret, they do not exploit the matrix structure and perform poorly relative to the new algorithms. The main contributions are regret analyses of variants of UCB and K-learning that hold for any opponent, e.g., even when the opponent adversarially plays the best-response to the learner's mixed strategy. Along the way, we show that Thompson fails catastrophically in this setting and provide empirical comparison to existing algorithms.
    Learning disentangled representations via product manifold projection. (arXiv:2103.01638v2 [cs.LG] UPDATED)
    (2 min) We propose a novel approach to disentangle the generative factors of variation underlying a given set of observations. Our method builds upon the idea that the (unknown) low-dimensional manifold underlying the data space can be explicitly modeled as a product of submanifolds. This definition of disentanglement gives rise to a novel weakly-supervised algorithm for recovering the unknown explanatory factors behind the data. At training time, our algorithm only requires pairs of non i.i.d. data samples whose elements share at least one, possibly multidimensional, generative factor of variation. We require no knowledge on the nature of these transformations, and do not make any limiting assumption on the properties of each subspace. Our approach is easy to implement, and can be successfully applied to different kinds of data (from images to 3D surfaces) undergoing arbitrary transformations. In addition to standard synthetic benchmarks, we showcase our method in challenging real-world applications, where we compare favorably with the state of the art.
    Training Deep Architectures Without End-to-End Backpropagation: A Brief Survey. (arXiv:2101.03419v2 [cs.LG] UPDATED)
    (2 min) This tutorial paper surveys training alternatives to end-to-end backpropagation (E2EBP) -- the de facto standard for training deep architectures. Modular training refers to strictly local training without both the forward and the backward pass, i.e., dividing a deep architecture into several nonoverlapping modules and training them separately without any end-to-end operation. Between the fully global E2EBP and the strictly local modular training, there are "weakly modular" hybrids performing training without the backward pass only. These alternatives can match or surpass the performance of E2EBP on challenging datasets such as ImageNet, and are gaining increased attention primarily because they offer practical advantages over E2EBP, which will be enumerated herein. In particular, they allow for greater modularity and transparency in deep learning workflows, aligning deep learning with the mainstream computer science engineering that heavily exploits modularization for scalability. Modular training has also revealed novel insights about learning and has further implications on other important research domains. Specifically, it induces natural and effective solutions to some important practical problems such as data efficiency and transferability estimation.
    Randomized Entity-wise Factorization for Multi-Agent Reinforcement Learning. (arXiv:2006.04222v3 [cs.LG] UPDATED)
    (2 min) Multi-agent settings in the real world often involve tasks with varying types and quantities of agents and non-agent entities; however, common patterns of behavior often emerge among these agents/entities. Our method aims to leverage these commonalities by asking the question: ``What is the expected utility of each agent when only considering a randomly selected sub-group of its observed entities?'' By posing this counterfactual question, we can recognize state-action trajectories within sub-groups of entities that we may have encountered in another task and use what we learned in that task to inform our prediction in the current one. We then reconstruct a prediction of the full returns as a combination of factors considering these disjoint groups of entities and train this ``randomly factorized" value function as an auxiliary objective for value-based multi-agent reinforcement learning. By doing so, our model can recognize and leverage similarities across tasks to improve learning efficiency in a multi-task setting. Our approach, Randomized Entity-wise Factorization for Imagined Learning (REFIL), outperforms all strong baselines by a significant margin in challenging multi-task StarCraft micromanagement settings.
    Composed Fine-Tuning: Freezing Pre-Trained Denoising Autoencoders for Improved Generalization. (arXiv:2006.16205v3 [cs.LG] UPDATED)
    (2 min) We focus on prediction problems with structured outputs that are subject to output validity constraints, e.g. pseudocode-to-code translation where the code must compile. While labeled input-output pairs are expensive to obtain, "unlabeled" outputs, i.e. outputs without corresponding inputs, are freely available (e.g. code on GitHub) and provide information about output validity. Pre-training captures this structure by training a denoiser to denoise corrupted versions of unlabeled outputs. We first show that standard fine-tuning after pre-training destroys some of this structure. We then propose composed fine-tuning, which trains a predictor composed with the pre-trained denoiser. Importantly, the denoiser is fixed to preserve output structure. Like standard fine-tuning, the predictor is also initialized with the pre-trained denoiser. We prove for two-layer ReLU networks that composed fine-tuning significantly reduces the complexity of the predictor, thus improving generalization. Empirically, we show that composed fine-tuning improves over standard fine-tuning on two pseudocode-to-code translation datasets (3% and 6% relative). The improvement is magnified on out-of-distribution (OOD) examples (4% and 25% relative), suggesting that reducing predictor complexity improves OOD extrapolation.
    Neural Bellman-Ford Networks: A General Graph Neural Network Framework for Link Prediction. (arXiv:2106.06935v1 [cs.LG])
    (2 min) Link prediction is a very fundamental task on graphs. Inspired by traditional path-based methods, in this paper we propose a general and flexible representation learning framework based on paths for link prediction. Specifically, we define the representation of a pair of nodes as the generalized sum of all path representations, with each path representation as the generalized product of the edge representations in the path. Motivated by the Bellman-Ford algorithm for solving the shortest path problem, we show that the proposed path formulation can be efficiently solved by the generalized Bellman-Ford algorithm. To further improve the capacity of the path formulation, we propose the Neural Bellman-Ford Network (NBFNet), a general graph neural network framework that solves the path formulation with learned operators in the generalized Bellman-Ford algorithm. The NBFNet parameterizes the generalized Bellman-Ford algorithm with 3 neural components, namely INDICATOR, MESSAGE and AGGREGATE functions, which corresponds to the boundary condition, multiplication operator, and summation operator respectively. The NBFNet is very general, covers many traditional path-based methods, and can be applied to both homogeneous graphs and multi-relational graphs (e.g., knowledge graphs) in both transductive and inductive settings. Experiments on both homogeneous graphs and knowledge graphs show that the proposed NBFNet outperforms existing methods by a large margin in both transductive and inductive settings, achieving new state-of-the-art results.
    Learning MDPs from Features: Predict-Then-Optimize for Sequential Decision Problems by Reinforcement Learning. (arXiv:2106.03279v2 [cs.LG] UPDATED)
    (2 min) In the predict-then-optimize framework, the objective is to train a predictive model, mapping from environment features to parameters of an optimization problem, which maximizes decision quality when the optimization is subsequently solved. Recent work on decision-focused learning shows that embedding the optimization problem in the training pipeline can improve decision quality and help generalize better to unseen tasks compared to relying on an intermediate loss function for evaluating prediction quality. We study the predict-then-optimize framework in the context of sequential decision problems (formulated as MDPs) that are solved via reinforcement learning. In particular, we are given environment features and a set of trajectories from training MDPs, which we use to train a predictive model that generalizes to unseen test MDPs without trajectories. Two significant computational challenges arise in applying decision-focused learning to MDPs: (i) large state and action spaces make it infeasible for existing techniques to differentiate through MDP problems, and (ii) the high-dimensional policy space, as parameterized by a neural network, makes differentiating through a policy expensive. We resolve the first challenge by sampling provably unbiased derivatives to approximate and differentiate through optimality conditions, and the second challenge by using a low-rank approximation to the high-dimensional sample-based derivatives. We implement both Bellman--based and policy gradient--based decision-focused learning on three different MDP problems with missing parameters, and show that decision-focused learning performs better in generalization to unseen tasks.
    Relaxing Local Robustness. (arXiv:2106.06624v1 [cs.LG])
    (2 min) Certifiable local robustness, which rigorously precludes small-norm adversarial examples, has received significant attention as a means of addressing security concerns in deep learning. However, for some classification problems, local robustness is not a natural objective, even in the presence of adversaries; for example, if an image contains two classes of subjects, the correct label for the image may be considered arbitrary between the two, and thus enforcing strict separation between them is unnecessary. In this work, we introduce two relaxed safety properties for classifiers that address this observation: (1) relaxed top-k robustness, which serves as the analogue of top-k accuracy; and (2) affinity robustness, which specifies which sets of labels must be separated by a robustness margin, and which can be $\epsilon$-close in $\ell_p$ space. We show how to construct models that can be efficiently certified against each relaxed robustness property, and trained with very little overhead relative to standard gradient descent. Finally, we demonstrate experimentally that these relaxed variants of robustness are well-suited to several significant classification problems, leading to lower rejection rates and higher certified accuracies than can be obtained when certifying "standard" local robustness.
    Graph Neural Networks with Local Graph Parameters. (arXiv:2106.06707v1 [cs.LG])
    (2 min) Various recent proposals increase the distinguishing power of Graph Neural Networks GNNs by propagating features between $k$-tuples of vertices. The distinguishing power of these "higher-order'' GNNs is known to be bounded by the $k$-dimensional Weisfeiler-Leman (WL) test, yet their $\mathcal O(n^k)$ memory requirements limit their applicability. Other proposals infuse GNNs with local higher-order graph structural information from the start, hereby inheriting the desirable $\mathcal O(n)$ memory requirement from GNNs at the cost of a one-time, possibly non-linear, preprocessing step. We propose local graph parameter enabled GNNs as a framework for studying the latter kind of approaches and precisely characterize their distinguishing power, in terms of a variant of the WL test, and in terms of the graph structural properties that they can take into account. Local graph parameters can be added to any GNN architecture, and are cheap to compute. In terms of expressive power, our proposal lies in the middle of GNNs and their higher-order counterparts. Further, we propose several techniques to aide in choosing the right local graph parameters. Our results connect GNNs with deep results in finite model theory and finite variable logics. Our experimental evaluation shows that adding local graph parameters often has a positive effect for a variety of GNNs, datasets and graph learning tasks.
    Online Learning with Optimism and Delay. (arXiv:2106.06885v1 [cs.LG])
    (2 min) Inspired by the demands of real-time climate and weather forecasting, we develop optimistic online learning algorithms that require no parameter tuning and have optimal regret guarantees under delayed feedback. Our algorithms -- DORM, DORMP, and AdaHedgeD -- arise from a novel reduction of delayed online learning to optimistic online learning that reveals how optimistic hints can mitigate the regret penalty caused by delay. We pair this delay-as-optimism perspective with a new analysis of optimistic learning that exposes its robustness to hinting errors and a new meta-algorithm for learning effective hinting strategies in the presence of delay. We conclude by benchmarking our algorithms on four subseasonal climate forecasting tasks, demonstrating low regret relative to state-of-the-art forecasting models.
    D2C: Diffusion-Denoising Models for Few-shot Conditional Generation. (arXiv:2106.06819v1 [cs.LG])
    (2 min) Conditional generative models of high-dimensional images have many applications, but supervision signals from conditions to images can be expensive to acquire. This paper describes Diffusion-Decoding models with Contrastive representations (D2C), a paradigm for training unconditional variational autoencoders (VAEs) for few-shot conditional image generation. D2C uses a learned diffusion-based prior over the latent representations to improve generation and contrastive self-supervised learning to improve representation quality. D2C can adapt to novel generation tasks conditioned on labels or manipulation constraints, by learning from as few as 100 labeled examples. On conditional generation from new labels, D2C achieves superior performance over state-of-the-art VAEs and diffusion models. On conditional image manipulation, D2C generations are two orders of magnitude faster to produce over StyleGAN2 ones and are preferred by 50% - 60% of the human evaluators in a double-blind study.
    Dynamic Clone Transformer for Efficient Convolutional Neural Netwoks. (arXiv:2106.06778v1 [cs.CV])
    (2 min) Convolutional networks (ConvNets) have shown impressive capability to solve various vision tasks. Nevertheless, the trade-off between performance and efficiency is still a challenge for a feasible model deployment on resource-constrained platforms. In this paper, we introduce a novel concept termed multi-path fully connected pattern (MPFC) to rethink the interdependencies of topology pattern, accuracy and efficiency for ConvNets. Inspired by MPFC, we further propose a dual-branch module named dynamic clone transformer (DCT) where one branch generates multiple replicas from inputs and another branch reforms those clones through a series of difference vectors conditional on inputs itself to produce more variants. This operation allows the self-expansion of channel-wise information in a data-driven way with little computational cost while providing sufficient learning capacity, which is a potential unit to replace computationally expensive pointwise convolution as an expansion layer in the bottleneck structure.
    Locally Persistent Exploration in Continuous Control Tasks with Sparse Rewards. (arXiv:2012.13658v2 [cs.LG] UPDATED)
    (2 min) A major challenge in reinforcement learning is the design of exploration strategies, especially for environments with sparse reward structures and continuous state and action spaces. Intuitively, if the reinforcement signal is very scarce, the agent should rely on some form of short-term memory in order to cover its environment efficiently. We propose a new exploration method, based on two intuitions: (1) the choice of the next exploratory action should depend not only on the (Markovian) state of the environment, but also on the agent's trajectory so far, and (2) the agent should utilize a measure of spread in the state space to avoid getting stuck in a small region. Our method leverages concepts often used in statistical physics to provide explanations for the behavior of simplified (polymer) chains in order to generate persistent (locally self-avoiding) trajectories in state space. We discuss the theoretical properties of locally self-avoiding walks and their ability to provide a kind of short-term memory through a decaying temporal correlation within the trajectory. We provide empirical evaluations of our approach in a simulated 2D navigation task, as well as higher-dimensional MuJoCo continuous control locomotion tasks with sparse rewards.
    Toward Understanding the Feature Learning Process of Self-supervised Contrastive Learning. (arXiv:2105.15134v2 [cs.LG] UPDATED)
    (2 min) How can neural networks trained by contrastive learning extract features from the unlabeled data? Why does contrastive learning usually need much stronger data augmentations than supervised learning to ensure good representations? These questions involve both the optimization and statistical aspects of deep learning, but can hardly be answered by analyzing supervised learning, where the target functions are the highest pursuit. Indeed, in self-supervised learning, it is inevitable to relate to the optimization/generalization of neural networks to how they can encode the latent structures in the data, which we refer to as the feature learning process. In this work, we formally study how contrastive learning learns the feature representations for neural networks by analyzing its feature learning process. We consider the case where our data are comprised of two types of features: the more semantically aligned sparse features which we want to learn from, and the other dense features we want to avoid. Theoretically, we prove that contrastive learning using $\mathbf{ReLU}$ networks provably learns the desired sparse features if proper augmentations are adopted. We present an underlying principle called $\textbf{feature decoupling}$ to explain the effects of augmentations, where we theoretically characterize how augmentations can reduce the correlations of dense features between positive samples while keeping the correlations of sparse features intact, thereby forcing the neural networks to learn from the self-supervision of sparse features. Empirically, we verified that the feature decoupling principle matches the underlying mechanism of contrastive learning in practice.
    Doubly Non-Central Beta Matrix Factorization for DNA Methylation Data. (arXiv:2106.06691v1 [stat.ML])
    (2 min) We present a new non-negative matrix factorization model for $(0,1)$ bounded-support data based on the doubly non-central beta (DNCB) distribution, a generalization of the beta distribution. The expressiveness of the DNCB distribution is particularly useful for modeling DNA methylation datasets, which are typically highly dispersed and multi-modal; however, the model structure is sufficiently general that it can be adapted to many other domains where latent representations of $(0,1)$ bounded-support data are of interest. Although the DNCB distribution lacks a closed-form conjugate prior, several augmentations let us derive an efficient posterior inference algorithm composed entirely of analytic updates. Our model improves out-of-sample predictive performance on both real and synthetic DNA methylation datasets over state-of-the-art methods in bioinformatics. In addition, our model yields meaningful latent representations that accord with existing biological knowledge.
    Robust Representation Learning via Perceptual Similarity Metrics. (arXiv:2106.06620v1 [cs.LG])
    (2 min) A fundamental challenge in artificial intelligence is learning useful representations of data that yield good performance on a downstream task, without overfitting to spurious input features. Extracting such task-relevant predictive information is particularly difficult for real-world datasets. In this work, we propose Contrastive Input Morphing (CIM), a representation learning framework that learns input-space transformations of the data to mitigate the effect of irrelevant input features on downstream performance. Our method leverages a perceptual similarity metric via a triplet loss to ensure that the transformation preserves task-relevant information.Empirically, we demonstrate the efficacy of our approach on tasks which typically suffer from the presence of spurious correlations: classification with nuisance information, out-of-distribution generalization, and preservation of subgroup accuracies. We additionally show that CIM is complementary to other mutual information-based representation learning techniques, and demonstrate that it improves the performance of variational information bottleneck (VIB) when used together.
    Domain-adaptive Fall Detection Using Deep Adversarial Training. (arXiv:2012.10911v2 [eess.SP] UPDATED)
    (2 min) Fall detection (FD) systems are important assistive technologies for healthcare that can detect emergency fall events and alert caregivers. However, it is not easy to obtain large-scale annotated fall events with various specifications of sensors or sensor positions during the implementation of accurate FD systems. Moreover, the knowledge obtained through machine learning has been restricted to tasks in the same domain. The mismatch between different domains might hinder the performance of FD systems. Cross-domain knowledge transfer is very beneficial for machine-learning-based FD systems to train a reliable FD model with well-labeled data in new environments. In this study, we propose domain-adaptive fall detection (DAFD) using deep adversarial training (DAT) to tackle cross-domain problems, such as cross-position and cross-configuration. The proposed DAFD can transfer knowledge from the source domain to the target domain by minimizing the domain discrepancy to avoid mismatch problems. The experimental results show that the average F1-score improvement when using DAFD ranges from 1.5% to 7% in the cross-position scenario, and from 3.5% to 12% in the cross-configuration scenario, compared to using the conventional FD model without domain adaptation training. The results demonstrate that the proposed DAFD successfully helps to deal with cross-domain problems and to achieve better detection performance.
    COVID-19 Cough Classification using Machine Learning and Global Smartphone Recordings. (arXiv:2012.01926v2 [cs.SD] UPDATED)
    (3 min) We present a machine learning based COVID-19 cough classifier which can discriminate COVID-19 positive coughs from both COVID-19 negative and healthy coughs recorded on a smartphone. This type of screening is non-contact, easy to apply, and can reduce the workload in testing centres as well as limit transmission by recommending early self-isolation to those who have a cough suggestive of COVID-19. The datasets used in this study include subjects from all six continents and contain both forced and natural coughs, indicating that the approach is widely applicable. The publicly available Coswara dataset contains 92 COVID-19 positive and 1079 healthy subjects, while the second smaller dataset was collected mostly in South Africa and contains 18 COVID-19 positive and 26 COVID-19 negative subjects who have undergone a SARS-CoV laboratory test. Both datasets indicate that COVID-19 positive coughs are 15\%-20\% shorter than non-COVID coughs. Dataset skew was addressed by applying the synthetic minority oversampling technique (SMOTE). A leave-$p$-out cross-validation scheme was used to train and evaluate seven machine learning classifiers: LR, KNN, SVM, MLP, CNN, LSTM and Resnet50. Our results show that although all classifiers were able to identify COVID-19 coughs, the best performance was exhibited by the Resnet50 classifier, which was best able to discriminate between the COVID-19 positive and the healthy coughs with an area under the ROC curve (AUC) of 0.98. An LSTM classifier was best able to discriminate between the COVID-19 positive and COVID-19 negative coughs, with an AUC of 0.94 after selecting the best 13 features from a sequential forward selection (SFS). Since this type of cough audio classification is cost-effective and easy to deploy, it is potentially a useful and viable means of non-contact COVID-19 screening.
    Integer Programming for Causal Structure Learning in the Presence of Latent Variables. (arXiv:2102.03129v2 [cs.LG] UPDATED)
    (2 min) The problem of finding an ancestral acyclic directed mixed graph (ADMG) that represents the causal relationships between a set of variables is an important area of research on causal inference. Most existing score-based structure learning methods focus on learning directed acyclic graph (DAG) models without latent variables. A number of score-based methods have recently been proposed for the ADMG learning, yet they are heuristic in nature and do not guarantee an optimal solution. We propose a novel exact score-based method that solves an integer programming (IP) formulation and returns a score-maximizing ancestral ADMG for a set of continuous variables that follow a multivariate Gaussian distribution. We generalize the state-of-the-art IP model for DAG learning problems and derive new classes of valid inequalities to formulate an IP model for ADMG learning. Empirically, our model can be solved efficiently for medium-sized problems and achieves better accuracy than state-of-the-art score-based methods as well as benchmark constraint-based methods.
    Prioritized Level Replay. (arXiv:2010.03934v4 [cs.LG] UPDATED)
    (2 min) Environments with procedurally generated content serve as important benchmarks for testing systematic generalization in deep reinforcement learning. In this setting, each level is an algorithmically created environment instance with a unique configuration of its factors of variation. Training on a prespecified subset of levels allows for testing generalization to unseen levels. What can be learned from a level depends on the current policy, yet prior work defaults to uniform sampling of training levels independently of the policy. We introduce Prioritized Level Replay (PLR), a general framework for selectively sampling the next training level by prioritizing those with higher estimated learning potential when revisited in the future. We show TD-errors effectively estimate a level's future learning potential and, when used to guide the sampling procedure, induce an emergent curriculum of increasingly difficult levels. By adapting the sampling of training levels, PLR significantly improves sample efficiency and generalization on Procgen Benchmark--matching the previous state-of-the-art in test return--and readily combines with other methods. Combined with the previous leading method, PLR raises the state-of-the-art to over 76% improvement in test return relative to standard RL baselines.
    FPT Approximation for Socially Fair Clustering. (arXiv:2106.06755v1 [cs.DS])
    (2 min) In this work, we study the socially fair $k$-median/$k$-means problem. We are given a set of points $P$ in a metric space $\mathcal{X}$ with a distance function $d(.,.)$. There are $\ell$ groups: $P_1,\dotsc,P_{\ell} \subseteq P$. We are also given a set $F$ of feasible centers in $\mathcal{X}$. The goal of the socially fair $k$-median problem is to find a set $C \subseteq F$ of $k$ centers that minimizes the maximum average cost over all the groups. That is, find $C$ that minimizes the objective function $\Phi(C,P) \equiv \max_{j} \sum_{x \in P_j} d(C,x)/|P_j|$, where $d(C,x)$ is the distance of $x$ to the closest center in $C$. The socially fair $k$-means problem is defined similarly by using squared distances, i.e., $d^{2}(.,.)$ instead of $d(.,.)$. In this work, we design $(5+\varepsilon)$ and $(33 + \varepsilon)$ approximation algorithms for the socially fair $k$-median and $k$-means problems, respectively. For the parameters: $k$ and $\ell$, the algorithms have an FPT (fixed parameter tractable) running time of $f(k,\ell,\varepsilon) \cdot n$ for $f(k,\ell,\varepsilon) = 2^{{O}(k \, \ell/\varepsilon)}$ and $n = |P \cup F|$. We also study a special case of the problem where the centers are allowed to be chosen from the point set $P$, i.e., $P \subseteq F$. For this special case, our algorithms give better approximation guarantees of $(4+\varepsilon)$ and $(18+\varepsilon)$ for the socially fair $k$-median and $k$-means problems, respectively. Furthermore, we convert these algorithms to constant pass log-space streaming algorithms. Lastly, we show FPT hardness of approximation results for the problem with a small gap between our upper and lower bounds.
    Kwame: A Bilingual AI Teaching Assistant for Online SuaCode Courses. (arXiv:2010.11387v2 [cs.CL] UPDATED)
    (2 min) Introductory hands-on courses such as our smartphone-based coding course, SuaCode require a lot of support for students to accomplish learning goals. Online environments make it even more difficult to get assistance especially more recently because of COVID-19. Given the multilingual context of SuaCode students - learners across 42 African countries that are mostly Anglophone or Francophone - in this work, we developed a bilingual Artificial Intelligence (AI) Teaching Assistant (TA) - Kwame - that provides answers to students' coding questions from SuaCode courses in English and French. Kwame is a Sentence-BERT (SBERT)-based question-answering (QA) system that we trained and evaluated offline using question-answer pairs created from the course's quizzes, lesson notes and students' questions in past cohorts. Kwame finds the paragraph most semantically similar to the question via cosine similarity. We compared the system with TF-IDF and Universal Sentence Encoder. Our results showed that fine-tuning on the course data and returning the top 3 and 5 answers improved the accuracy results. Kwame will make it easy for students to get quick and accurate answers to questions in SuaCode courses.
    PAGE: A Simple and Optimal Probabilistic Gradient Estimator for Nonconvex Optimization. (arXiv:2008.10898v3 [cs.LG] UPDATED)
    (2 min) In this paper, we propose a novel stochastic gradient estimator -- ProbAbilistic Gradient Estimator (PAGE) -- for nonconvex optimization. PAGE is easy to implement as it is designed via a small adjustment to vanilla SGD: in each iteration, PAGE uses the vanilla minibatch SGD update with probability $p_t$ or reuses the previous gradient with a small adjustment, at a much lower computational cost, with probability $1-p_t$. We give a simple formula for the optimal choice of $p_t$. Moreover, we prove the first tight lower bound $\Omega(n+\frac{\sqrt{n}}{\epsilon^2})$ for nonconvex finite-sum problems, which also leads to a tight lower bound $\Omega(b+\frac{\sqrt{b}}{\epsilon^2})$ for nonconvex online problems, where $b:= \min\{\frac{\sigma^2}{\epsilon^2}, n\}$. Then, we show that PAGE obtains the optimal convergence results $O(n+\frac{\sqrt{n}}{\epsilon^2})$ (finite-sum) and $O(b+\frac{\sqrt{b}}{\epsilon^2})$ (online) matching our lower bounds for both nonconvex finite-sum and online problems. Besides, we also show that for nonconvex functions satisfying the Polyak-\L{}ojasiewicz (PL) condition, PAGE can automatically switch to a faster linear convergence rate $O(\cdot\log \frac{1}{\epsilon})$. Finally, we conduct several deep learning experiments (e.g., LeNet, VGG, ResNet) on real datasets in PyTorch showing that PAGE not only converges much faster than SGD in training but also achieves the higher test accuracy, validating the optimal theoretical results and confirming the practical superiority of PAGE.
    Predicting the Ordering of Characters in Japanese Historical Documents. (arXiv:2106.06786v1 [cs.CL])
    (2 min) Japan is a unique country with a distinct cultural heritage, which is reflected in billions of historical documents that have been preserved. However, the change in Japanese writing system in 1900 made these documents inaccessible for the general public. A major research project has been to make these historical documents accessible and understandable. An increasing amount of research has focused on the character recognition task and the location of characters on image, yet less research has focused on how to predict the sequential ordering of the characters. This is because sequence in classical Japanese is very different from modern Japanese. Ordering characters into a sequence is important for making the document text easily readable and searchable. Additionally, it is a necessary step for any kind of natural language processing on the data (e.g. machine translation, language modeling, and word embeddings). We explore a few approaches to the task of predicting the sequential ordering of the characters: one using simple hand-crafted rules, another using hand-crafted rules with adaptive thresholds, and another using a deep recurrent sequence model trained with teacher forcing. We provide a quantitative and qualitative comparison of these techniques as well as their distinct trade-offs. Our best-performing system has an accuracy of 98.65\% and has a perfect accuracy on 49\% of the books in our dataset, suggesting that the technique is able to predict the order of the characters well enough for many tasks.
    DANCE: Enhancing saliency maps using decoys. (arXiv:2002.00526v3 [cs.LG] UPDATED)
    (2 min) Saliency methods can make deep neural network predictions more interpretable by identifying a set of critical features in an input sample, such as pixels that contribute most strongly to a prediction made by an image classifier. Unfortunately, recent evidence suggests that many saliency methods poorly perform, especially in situations where gradients are saturated, inputs contain adversarial perturbations, or predictions rely upon inter-feature dependence. To address these issues, we propose a framework that improves the robustness of saliency methods by following a two-step procedure. First, we introduce a perturbation mechanism that subtly varies the input sample without changing its intermediate representations. Using this approach, we can gather a corpus of perturbed data samples while ensuring that the perturbed and original input samples follow the same distribution. Second, we compute saliency maps for the perturbed samples and propose a new method to aggregate saliency maps. With this design, we offset the gradient saturation influence upon interpretation. From a theoretical perspective, we show the aggregated saliency map could not only capture inter-feature dependence but, more importantly, robustify interpretation against previously described adversarial perturbation methods. Following our theoretical analysis, we present experimental results suggesting that, both qualitatively and quantitatively, our saliency method outperforms existing methods.
    Robust Gaussian Process Regression Based on Iterative Trimming. (arXiv:2011.11057v2 [cs.LG] UPDATED)
    (2 min) The Gaussian process (GP) regression can be severely biased when the data are contaminated by outliers. This paper presents a new robust GP regression algorithm that iteratively trims the most extreme data points. While the new algorithm retains the attractive properties of the standard GP as a nonparametric and flexible regression method, it can greatly improve the model accuracy for contaminated data even in the presence of extreme or abundant outliers. It is also easier to implement compared with previous robust GP variants that rely on approximate inference. Applied to a wide range of experiments with different contamination levels, the proposed method significantly outperforms the standard GP and the popular robust GP variant with the Student-t likelihood in most test cases. In addition, as a practical example in the astrophysical study, we show that this method can precisely determine the main-sequence ridge line in the color-magnitude diagram of star clusters.
    AutoLoss: Automated Loss Function Search in Recommendations. (arXiv:2106.06713v1 [cs.IR])
    (2 min) Designing an effective loss function plays a crucial role in training deep recommender systems. Most existing works often leverage a predefined and fixed loss function that could lead to suboptimal recommendation quality and training efficiency. Some recent efforts rely on exhaustively or manually searched weights to fuse a group of candidate loss functions, which is exceptionally costly in computation and time. They also neglect the various convergence behaviors of different data examples. In this work, we propose an AutoLoss framework that can automatically and adaptively search for the appropriate loss function from a set of candidates. To be specific, we develop a novel controller network, which can dynamically adjust the loss probabilities in a differentiable manner. Unlike existing algorithms, the proposed controller can adaptively generate the loss probabilities for different data examples according to their varied convergence behaviors. Such design improves the model's generalizability and transferability between deep recommender systems and datasets. We evaluate the proposed framework on two benchmark datasets. The results show that AutoLoss outperforms representative baselines. Further experiments have been conducted to deepen our understandings of AutoLoss, including its transferability, components and training efficiency.
    Using Convolutional Neural Networks for the Helicity Classification of Magnetic Fields. (arXiv:2106.06718v1 [astro-ph.HE])
    (2 min) The presence of non-zero helicity in intergalactic magnetic fields is a smoking gun for their primordial origin since they have to be generated by processes that break CP invariance. As an experimental signature for the presence of helical magnetic fields, an estimator $Q$ based on the triple scalar product of the wave-vectors of photons generated in electromagnetic cascades from, e.g., TeV blazars, has been suggested previously. We propose to apply deep learning to helicity classification employing Convolutional Neural Networks and show that this method outperforms the $Q$ estimator.
    Understanding Deflation Process in Over-parametrized Tensor Decomposition. (arXiv:2106.06573v1 [stat.ML])
    (2 min) In this paper we study the training dynamics for gradient flow on over-parametrized tensor decomposition problems. Empirically, such training process often first fits larger components and then discovers smaller components, which is similar to a tensor deflation process that is commonly used in tensor decomposition algorithms. We prove that for orthogonally decomposable tensor, a slightly modified version of gradient flow would follow a tensor deflation process and recover all the tensor components. Our proof suggests that for orthogonal tensors, gradient flow dynamics works similarly as greedy low-rank learning in the matrix setting, which is a first step towards understanding the implicit regularization effect of over-parametrized models for low-rank tensors.
    Scaling transition from momentum stochastic gradient descent to plain stochastic gradient descent. (arXiv:2106.06753v1 [cs.LG])
    (2 min) The plain stochastic gradient descent and momentum stochastic gradient descent have extremely wide applications in deep learning due to their simple settings and low computational complexity. The momentum stochastic gradient descent uses the accumulated gradient as the updated direction of the current parameters, which has a faster training speed. Because the direction of the plain stochastic gradient descent has not been corrected by the accumulated gradient. For the parameters that currently need to be updated, it is the optimal direction, and its update is more accurate. We combine the advantages of the momentum stochastic gradient descent with fast training speed and the plain stochastic gradient descent with high accuracy, and propose a scaling transition from momentum stochastic gradient descent to plain stochastic gradient descent(TSGD) method. At the same time, a learning rate that decreases linearly with the iterations is used instead of a constant learning rate. The TSGD algorithm has a larger step size in the early stage to speed up the training, and training with a smaller step size in the later stage can steadily converge. Our experimental results show that the TSGD algorithm has faster training speed, higher accuracy and better stability. Our implementation is available at: https://github.com/kunzeng/TSGD.
    A Minimalist Approach to Offline Reinforcement Learning. (arXiv:2106.06860v1 [cs.LG])
    (2 min) Offline reinforcement learning (RL) defines the task of learning from a fixed batch of data. Due to errors in value estimation from out-of-distribution actions, most offline RL algorithms take the approach of constraining or regularizing the policy with the actions contained in the dataset. Built on pre-existing RL algorithms, modifications to make an RL algorithm work offline comes at the cost of additional complexity. Offline RL algorithms introduce new hyperparameters and often leverage secondary components such as generative models, while adjusting the underlying RL algorithm. In this paper we aim to make a deep RL algorithm work while making minimal changes. We find that we can match the performance of state-of-the-art offline RL algorithms by simply adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data. The resulting algorithm is a simple to implement and tune baseline, while more than halving the overall run time by removing the additional computational overheads of previous methods.
    Hippocampus segmentation in magnetic resonance images of Alzheimer's patients using Deep machine learning. (arXiv:2106.06743v1 [eess.IV])
    (2 min) Background: Alzheimers disease is a progressive neurodegenerative disorder and the main cause of dementia in aging. Hippocampus is prone to changes in the early stages of Alzheimers disease. Detection and observation of the hippocampus changes using magnetic resonance imaging (MRI) before the onset of Alzheimers disease leads to the faster preventive and therapeutic measures. Objective: The aim of this study was the segmentation of the hippocampus in magnetic resonance (MR) images of Alzheimers patients using deep machine learning method. Methods: U-Net architecture of convolutional neural network was proposed to segment the hippocampus in the real MRI data. The MR images of the 100 and 35 patients available in Alzheimers disease Neuroimaging Initiative (ADNI) dataset, was used for the train and test of the model, respectively. The performance of the proposed method was compared with manual segmentation by measuring the similarity metrics. Results: The desired segmentation achieved after 10 iterations. A Dice similarity coefficient (DSC) = 92.3%, sensitivity = 96.5%, positive predicted value (PPV) = 90.4%, and Intersection over Union (IoU) value for the train 92.94 and test 92.93 sets were obtained which are acceptable. Conclusion: The proposed approach is promising and can be extended in the prognosis of Alzheimers disease by the prediction of the hippocampus volume changes in the early stage of the disease.
    Piecewise-constant Neural ODEs. (arXiv:2106.06621v1 [cs.LG])
    (2 min) Neural networks are a popular tool for modeling sequential data but they generally do not treat time as a continuous variable. Neural ODEs represent an important exception: they parameterize the time derivative of a hidden state with a neural network and then integrate over arbitrary amounts of time. But these parameterizations, which have arbitrary curvature, can be hard to integrate and thus train and evaluate. In this paper, we propose making a piecewise-constant approximation to Neural ODEs to mitigate these issues. Our model can be integrated exactly via Euler integration and can generate autoregressive samples in 3-20 times fewer steps than comparable RNN and ODE-RNN models. We evaluate our model on several synthetic physics tasks and a planning task inspired by the game of billiards. We find that it matches the performance of baseline approaches while requiring less time to train and evaluate.
    CARTL: Cooperative Adversarially-Robust Transfer Learning. (arXiv:2106.06667v1 [cs.LG])
    (2 min) Transfer learning eases the burden of training a well-performed model from scratch, especially when training data is scarce and computation power is limited. In deep learning, a typical strategy for transfer learning is to freeze the early layers of a pre-trained model and fine-tune the rest of its layers on the target domain. Previous work focuses on the accuracy of the transferred model but neglects the transfer of adversarial robustness. In this work, we first show that transfer learning improves the accuracy on the target domain but degrades the inherited robustness of the target model. To address such a problem, we propose a novel cooperative adversarially-robust transfer learning (CARTL) by pre-training the model via feature distance minimization and fine-tuning the pre-trained model with non-expansive fine-tuning for target domain tasks. Empirical results show that CARTL improves the inherited robustness by about 28% at most compared with the baseline with the same degree of accuracy. Furthermore, we study the relationship between the batch normalization (BN) layers and the robustness in the context of transfer learning, and we reveal that freezing BN layers can further boost the robustness transfer.
    Provable Adaptation across Multiway Domains via Representation Learning. (arXiv:2106.06657v1 [cs.LG])
    (2 min) This paper studies zero-shot domain adaptation where each domain is indexed on a multi-dimensional array, and we only have data from a small subset of domains. Our goal is to produce predictors that perform well on \emph{unseen} domains. We propose a model which consists of a domain-invariant latent representation layer and a domain-specific linear prediction layer with a low-rank tensor structure. Theoretically, we present explicit sample complexity bounds to characterize the prediction error on unseen domains in terms of the number of domains with training data and the number of data per domain. To our knowledge, this is the first finite-sample guarantee for zero-shot domain adaptation. In addition, we provide experiments on two-way MNIST and four-way fiber sensing datasets to demonstrate the effectiveness of our proposed model.
    Learnable Hypergraph Laplacian for Hypergraph Learning. (arXiv:2106.06666v1 [cs.LG])
    (2 min) HyperGraph Convolutional Neural Networks (HGCNNs) have demonstrated their potential in modeling high-order relations preserved in graph structured data. However, most existing convolution filters are localized and determined by the pre-defined initial hypergraph topology, neglecting to explore implicit and long-ange relations in real-world data. In this paper, we propose the first learning-based method tailored for constructing adaptive hypergraph structure, termed HypERgrAph Laplacian aDaptor (HERALD), which serves as a generic plug-in-play module for improving the representational power of HGCNNs. Specifically, HERALD adaptively optimizes the adjacency relationship between hypernodes and hyperedges in an end-to-end manner and thus the task-aware hypergraph is learned. Furthermore, HERALD employs the self-attention mechanism to capture the non-local paired-nodes relation. Extensive experiments on various popular hypergraph datasets for node classification and graph classification tasks demonstrate that our approach obtains consistent and considerable performance enhancement, proving its effectiveness and generalization ability.
    TDGIA:Effective Injection Attacks on Graph Neural Networks. (arXiv:2106.06663v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) have achieved promising performance in various real-world applications. However, recent studies have shown that GNNs are vulnerable to adversarial attacks. In this paper, we study a recently-introduced realistic attack scenario on graphs -- graph injection attack (GIA). In the GIA scenario, the adversary is not able to modify the existing link structure and node attributes of the input graph, instead the attack is performed by injecting adversarial nodes into it. We present an analysis on the topological vulnerability of GNNs under GIA setting, based on which we propose the Topological Defective Graph Injection Attack (TDGIA) for effective injection attacks. TDGIA first introduces the topological defective edge selection strategy to choose the original nodes for connecting with the injected ones. It then designs the smooth feature optimization objective to generate the features for the injected nodes. Extensive experiments on large-scale datasets show that TDGIA can consistently and significantly outperform various attack baselines in attacking dozens of defense GNN models. Notably, the performance drop on target GNNs resultant from TDGIA is more than double the damage brought by the best attack solution among hundreds of submissions on KDD-CUP 2020.
    Toward Accurate and Realistic Outfits Visualization with Attention to Details. (arXiv:2106.06593v1 [cs.CV])
    (2 min) Virtual try-on methods aim to generate images of fashion models wearing arbitrary combinations of garments. This is a challenging task because the generated image must appear realistic and accurately display the interaction between garments. Prior works produce images that are filled with artifacts and fail to capture important visual details necessary for commercial applications. We propose Outfit Visualization Net (OVNet) to capture these important details (e.g. buttons, shading, textures, realistic hemlines, and interactions between garments) and produce high quality multiple-garment virtual try-on images. OVNet consists of 1) a semantic layout generator and 2) an image generation pipeline using multiple coordinated warps. We train the warper to output multiple warps using a cascade loss, which refines each successive warp to focus on poorly generated regions of a previous warp and yields consistent improvements in detail. In addition, we introduce a method for matching outfits with the most suitable model and produce significant improvements for both our and other previous try-on methods. Through quantitative and qualitative analysis, we demonstrate our method generates substantially higher-quality studio images compared to prior works for multi-garment outfits. An interactive interface powered by this method has been deployed on fashion e-commerce websites and received overwhelmingly positive feedback.
    Scalars are universal: Gauge-equivariant machine learning, structured like classical physics. (arXiv:2106.06610v1 [cs.LG])
    (2 min) There has been enormous progress in the last few years in designing conceivable (though not always practical) neural networks that respect the gauge symmetries -- or coordinate freedom -- of physical law. Some of these frameworks make use of irreducible representations, some make use of higher order tensor objects, and some apply symmetry-enforcing constraints. Different physical laws obey different combinations of fundamental symmetries, but a large fraction (possibly all) of classical physics is equivariant to translation, rotation, reflection (parity), boost (relativity), and permutations. Here we show that it is simple to parameterize universally approximating polynomial functions that are equivariant under these symmetries, or under the Euclidean, Lorentz, and Poincar\'e groups, at any dimensionality $d$. The key observation is that nonlinear O($d$)-equivariant (and related-group-equivariant) functions can be expressed in terms of a lightweight collection of scalars -- scalar products and scalar contractions of the scalar, vector, and tensor inputs. These results demonstrate theoretically that gauge-invariant deep learning models for classical physics with good scaling for large problems are feasible right now.
    Optimal Counterfactual Explanations in Tree Ensembles. (arXiv:2106.06631v1 [cs.LG])
    (2 min) Counterfactual explanations are usually generated through heuristics that are sensitive to the search's initial conditions. The absence of guarantees of performance and robustness hinders trustworthiness. In this paper, we take a disciplined approach towards counterfactual explanations for tree ensembles. We advocate for a model-based search aiming at "optimal" explanations and propose efficient mixed-integer programming approaches. We show that isolation forests can be modeled within our framework to focus the search on plausible explanations with a low outlier score. We provide comprehensive coverage of additional constraints that model important objectives, heterogeneous data types, structural constraints on the feature space, along with resource and actionability restrictions. Our experimental analyses demonstrate that the proposed search approach requires a computational effort that is orders of magnitude smaller than previous mathematical programming algorithms. It scales up to large data sets and tree ensembles, where it provides, within seconds, systematic explanations grounded on well-defined models solved to optimality.
    Auto-NBA: Efficient and Effective Search Over the Joint Space of Networks, Bitwidths, and Accelerators. (arXiv:2106.06575v1 [cs.LG])
    (2 min) While maximizing deep neural networks' (DNNs') acceleration efficiency requires a joint search/design of three different yet highly coupled aspects, including the networks, bitwidths, and accelerators, the challenges associated with such a joint search have not yet been fully understood and addressed. The key challenges include (1) the dilemma of whether to explode the memory consumption due to the huge joint space or achieve sub-optimal designs, (2) the discrete nature of the accelerator design space that is coupled yet different from that of the networks and bitwidths, and (3) the chicken and egg problem associated with network-accelerator co-search, i.e., co-search requires operation-wise hardware cost, which is lacking during search as the optimal accelerator depending on the whole network is still unknown during search. To tackle these daunting challenges towards optimal and fast development of DNN accelerators, we propose a framework dubbed Auto-NBA to enable jointly searching for the Networks, Bitwidths, and Accelerators, by efficiently localizing the optimal design within the huge joint design space for each target dataset and acceleration specification. Our Auto-NBA integrates a heterogeneous sampling strategy to achieve unbiased search with constant memory consumption, and a novel joint-search pipeline equipped with a generic differentiable accelerator search engine. Extensive experiments and ablation studies validate that both Auto-NBA generated networks and accelerators consistently outperform state-of-the-art designs (including co-search/exploration techniques, hardware-aware NAS methods, and DNN accelerators), in terms of search time, task accuracy, and accelerator efficiency. Our codes are available at: https://github.com/RICE-EIC/Auto-NBA.
    Simple Combinatorial Algorithms for Combinatorial Bandits: Corruptions and Approximations. (arXiv:2106.06712v1 [cs.LG])
    (2 min) We consider the stochastic combinatorial semi-bandit problem with adversarial corruptions. We provide a simple combinatorial algorithm that can achieve a regret of $\tilde{O}\left(C+d^2K/\Delta_{min}\right)$ where $C$ is the total amount of corruptions, $d$ is the maximal number of arms one can play in each round, $K$ is the number of arms. If one selects only one arm in each round, we achieves a regret of $\tilde{O}\left(C+\sum_{\Delta_i>0}(1/\Delta_i)\right)$. Our algorithm is combinatorial and improves on the previous combinatorial algorithm by [Gupta et al., COLT2019] (their bound is $\tilde{O}\left(KC+\sum_{\Delta_i>0}(1/\Delta_i)\right)$), and almost matches the best known bounds obtained by [Zimmert et al., ICML2019] and [Zimmert and Seldin, AISTATS2019] (up to logarithmic factor). Note that the algorithms in [Zimmert et al., ICML2019] and [Zimmert and Seldin, AISTATS2019] require one to solve complex convex programs while our algorithm is combinatorial, very easy to implement, requires weaker assumptions and has very low oracle complexity and running time. We also study the setting where we only get access to an approximation oracle for the stochastic combinatorial semi-bandit problem. Our algorithm achieves an (approximation) regret bound of $\tilde{O}\left(d\sqrt{KT}\right)$. Our algorithm is very simple, only worse than the best known regret bound by $\sqrt{d}$, and has much lower oracle complexity than previous work.
    Federated Learning with Buffered Asynchronous Aggregation. (arXiv:2106.06639v1 [cs.LG])
    (2 min) Federated Learning (FL) trains a shared model across distributed devices while keeping the training data on the devices. Most FL schemes are synchronous: they perform a synchronized aggregation of model updates from individual devices. Synchronous training can be slow because of late-arriving devices (stragglers). On the other hand, completely asynchronous training makes FL less private because of incompatibility with secure aggregation. In this work, we propose a model aggregation scheme, FedBuff, that combines the best properties of synchronous and asynchronous FL. Similar to synchronous FL, FedBuff is compatible with secure aggregation. Similar to asynchronous FL, FedBuff is robust to stragglers. In FedBuff, clients trains asynchronously and send updates to the server. The server aggregates client updates in a private buffer until updates have been received, at which point a server model update is immediately performed. We provide theoretical convergence guarantees for FedBuff in a non-convex setting. Empirically, FedBuff converges up to 3.8x faster than previous proposals for synchronous FL (e.g., FedAvgM), and up to 2.5x faster than previous proposals for asynchronous FL (e.g., FedAsync). We show that FedBuff is robust to different staleness distributions and is more scalable than synchronous FL techniques.
    Disentangling the Roles of Curation, Data-Augmentation and the Prior in the Cold Posterior Effect. (arXiv:2106.06596v1 [cs.LG])
    (2 min) The "cold posterior effect" (CPE) in Bayesian deep learning describes the uncomforting observation that the predictive performance of Bayesian neural networks can be significantly improved if the Bayes posterior is artificially sharpened using a temperature parameter T<1. The CPE is problematic in theory and practice and since the effect was identified many researchers have proposed hypotheses to explain the phenomenon. However, despite this intensive research effort the effect remains poorly understood. In this work we provide novel and nuanced evidence relevant to existing explanations for the cold posterior effect, disentangling three hypotheses: 1. The dataset curation hypothesis of Aitchison (2020): we show empirically that the CPE does not arise in a real curated data set but can be produced in a controlled experiment with varying curation strength. 2. The data augmentation hypothesis of Izmailov et al. (2021) and Fortuin et al. (2021): we show empirically that data augmentation is sufficient but not necessary for the CPE to be present. 3. The bad prior hypothesis of Wenzel et al. (2020): we use a simple experiment evaluating the relative importance of the prior and the likelihood, strongly linking the CPE to the prior. Our results demonstrate how the CPE can arise in isolation from synthetic curation, data augmentation, and bad priors. Cold posteriors observed "in the wild" are therefore unlikely to arise from a single simple cause; as a result, we do not expect a simple "fix" for cold posteriors.
    A3C-S: Automated Agent Accelerator Co-Search towards Efficient Deep Reinforcement Learning. (arXiv:2106.06577v1 [cs.LG])
    (2 min) Driven by the explosive interest in applying deep reinforcement learning (DRL) agents to numerous real-time control and decision-making applications, there has been a growing demand to deploy DRL agents to empower daily-life intelligent devices, while the prohibitive complexity of DRL stands at odds with limited on-device resources. In this work, we propose an Automated Agent Accelerator Co-Search (A3C-S) framework, which to our best knowledge is the first to automatically co-search the optimally matched DRL agents and accelerators that maximize both test scores and hardware efficiency. Extensive experiments consistently validate the superiority of our A3C-S over state-of-the-art techniques.
    Online Learning of Competitive Equilibria in Exchange Economies. (arXiv:2106.06616v1 [cs.LG])
    (2 min) The sharing of scarce resources among multiple rational agents is one of the classical problems in economics. In exchange economies, which are used to model such situations, agents begin with an initial endowment of resources and exchange them in a way that is mutually beneficial until they reach a competitive equilibrium (CE). CE allocations are Pareto efficient and fair. Consequently, they are used widely in designing mechanisms for fair division. However, computing CEs requires the knowledge of agent preferences which are unknown in several applications of interest. In this work, we explore a new online learning mechanism, which, on each round, allocates resources to the agents and collects stochastic feedback on their experience in using that allocation. Its goal is to learn the agent utilities via this feedback and imitate the allocations at a CE in the long run. We quantify CE behavior via two losses and propose a randomized algorithm which achieves $\bigOtilde(\sqrt{T})$ loss after $T$ rounds under both criteria. Empirically, we demonstrate the effectiveness of this mechanism through numerical simulations.
    GANs N' Roses: Stable, Controllable, Diverse Image to Image Translation (works for videos too!). (arXiv:2106.06561v1 [cs.CV])
    (2 min) We show how to learn a map that takes a content code, derived from a face image, and a randomly chosen style code to an anime image. We derive an adversarial loss from our simple and effective definitions of style and content. This adversarial loss guarantees the map is diverse -- a very wide range of anime can be produced from a single content code. Under plausible assumptions, the map is not just diverse, but also correctly represents the probability of an anime, conditioned on an input face. In contrast, current multimodal generation procedures cannot capture the complex styles that appear in anime. Extensive quantitative experiments support the idea the map is correct. Extensive qualitative results show that the method can generate a much more diverse range of styles than SOTA comparisons. Finally, we show that our formalization of content and style allows us to perform video to video translation without ever training on videos.

2021-06-14

  • cs.CL updates on arXiv.org

    Nested and Balanced Entity Recognition using Multi-Task Learning. (arXiv:2106.06216v1 [cs.CL])
    (2 min) Entity Recognition (ER) within a text is a fundamental exercise in Natural Language Processing, enabling further depending tasks such as Knowledge Extraction, Text Summarisation, or Keyphrase Extraction. An entity consists of single words or of a consecutive sequence of terms, constituting the basic building blocks for communication. Mainstream ER approaches are mainly limited to flat structures, concentrating on the outermost entities while ignoring the inner ones. This paper introduces a partly-layered network architecture that deals with the complexity of overlapping and nested cases. The proposed architecture consists of two parts: (1) a shared Sequence Layer and (2) a stacked component with multiple Tagging Layers. The adoption of such an architecture has the advantage of preventing overfit to a specific word-length, thus maintaining performance for longer entities despite their lower frequency. To verify the proposed architecture's effectiveness, we train and evaluate this architecture to recognise two kinds of entities - Concepts (CR) and Named Entities (NER). Our approach achieves state-of-the-art NER performances, while it outperforms previous CR approaches. Considering these promising results, we see the possibility to evolve the architecture for other cases such as the extraction of events or the detection of argumentative components.
    WAX-ML: A Python library for machine learning and feedback loops on streaming data. (arXiv:2106.06524v1 [cs.LG])
    (2 min) Wax is what you put on a surfboard to avoid slipping. It is an essential tool to go surfing... We introduce WAX-ML a research-oriented Python library providing tools to design powerful machine learning algorithms and feedback loops working on streaming data. It strives to complement JAX with tools dedicated to time series. WAX-ML makes JAX-based programs easy to use for end-users working with pandas and xarray for data manipulation. It provides a simple mechanism for implementing feedback loops, allows the implementation of online learning and reinforcement learning algorithms with functions, and makes them easy to integrate by end-users working with the object-oriented reinforcement learning framework from the Gym library. It is released with an Apache open-source license on GitHub at https://github.com/eserie/wax-ml.
    CodemixedNLP: An Extensible and Open NLP Toolkit for Code-Mixing. (arXiv:2106.06004v1 [cs.CL])
    (2 min) The NLP community has witnessed steep progress in a variety of tasks across the realms of monolingual and multilingual language processing recently. These successes, in conjunction with the proliferating mixed language interactions on social media have boosted interest in modeling code-mixed texts. In this work, we present CodemixedNLP, an open-source library with the goals of bringing together the advances in code-mixed NLP and opening it up to a wider machine learning community. The library consists of tools to develop and benchmark versatile model architectures that are tailored for mixed texts, methods to expand training sets, techniques to quantify mixing styles, and fine-tuned state-of-the-art models for 7 tasks in Hinglish. We believe this work has a potential to foster a distributed yet collaborative and sustainable ecosystem in an otherwise dispersed space of code-mixing research. The toolkit is designed to be simple, easily extensible, and resourceful to both researchers as well as practitioners.
    CONDA: a CONtextual Dual-Annotated dataset for in-game toxicity understanding and detection. (arXiv:2106.06213v1 [cs.CL])
    (2 min) Traditional toxicity detection models have focused on the single utterance level without deeper understanding of context. We introduce CONDA, a new dataset for in-game toxic language detection enabling joint intent classification and slot filling analysis, which is the core task of Natural Language Understanding (NLU). The dataset consists of 45K utterances from 12K conversations from the chat logs of 1.9K completed Dota 2 matches. We propose a robust dual semantic-level toxicity framework, which handles utterance and token-level patterns, and rich contextual chatting history. Accompanying the dataset is a thorough in-game toxicity analysis, which provides comprehensive understanding of context at utterance, token, and dual levels. Inspired by NLU, we also apply its metrics to the toxicity detection tasks for assessing toxicity and game-specific aspects. We evaluate strong NLU models on CONDA, providing fine-grained results for different intent classes and slot classes. Furthermore, we examine the coverage of toxicity nature in our dataset by comparing it with other toxicity datasets.
    Sentence Extraction-Based Machine Reading Comprehension for Vietnamese. (arXiv:2105.09043v2 [cs.CL] UPDATED)
    (2 min) The development of natural language processing (NLP) in general and machine reading comprehension in particular has attracted the great attention of the research community. In recent years, there are a few datasets for machine reading comprehension tasks in Vietnamese with large sizes, such as UIT-ViQuAD and UIT-ViNewsQA. However, the datasets are not diverse in answers to serve the research. In this paper, we introduce UIT-ViWikiQA, the first dataset for evaluating sentence extraction-based machine reading comprehension in the Vietnamese language. The UIT-ViWikiQA dataset is converted from the UIT-ViQuAD dataset, consisting of comprises 23.074 question-answers based on 5.109 passages of 174 Wikipedia Vietnamese articles. We propose a conversion algorithm to create the dataset for sentence extraction-based machine reading comprehension and three types of approaches for sentence extraction-based machine reading comprehension in Vietnamese. Our experiments show that the best machine model is XLM-R_Large, which achieves an exact match (EM) of 85.97% and an F1-score of 88.77% on our dataset. Besides, we analyze experimental results in terms of the question type in Vietnamese and the effect of context on the performance of the MRC models, thereby showing the challenges from the UIT-ViWikiQA dataset that we propose to the language processing community.
    Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation. (arXiv:2106.06471v1 [cs.CL])
    (2 min) Medical report generation is one of the most challenging tasks in medical image analysis. Although existing approaches have achieved promising results, they either require a predefined template database in order to retrieve sentences or ignore the hierarchical nature of medical report generation. To address these issues, we propose MedWriter that incorporates a novel hierarchical retrieval mechanism to automatically extract both report and sentence-level templates for clinically accurate report generation. MedWriter first employs the Visual-Language Retrieval~(VLR) module to retrieve the most relevant reports for the given images. To guarantee the logical coherence between sentences, the Language-Language Retrieval~(LLR) module is introduced to retrieve relevant sentences based on the previous generated description. At last, a language decoder fuses image features and features from retrieved reports and sentences to generate meaningful medical reports. We verified the effectiveness of our model by automatic evaluation and human evaluation on two datasets, i.e., Open-I and MIMIC-CXR.
    BoB: BERT Over BERT for Training Persona-based Dialogue Models from Limited Personalized Data. (arXiv:2106.06169v1 [cs.CL])
    (2 min) Maintaining consistent personas is essential for dialogue agents. Although tremendous advancements have been brought, the limited-scale of annotated persona-dense data are still barriers towards training robust and consistent persona-based dialogue models. In this work, we show how the challenges can be addressed by disentangling persona-based dialogue generation into two sub-tasks with a novel BERT-over-BERT (BoB) model. Specifically, the model consists of a BERT-based encoder and two BERT-based decoders, where one decoder is for response generation, and another is for consistency understanding. In particular, to learn the ability of consistency understanding from large-scale non-dialogue inference data, we train the second decoder in an unlikelihood manner. Under different limited data settings, both automatic and human evaluations demonstrate that the proposed model outperforms strong baselines in response quality and persona consistency.
    Improving RNN-T ASR Performance with Date-Time and Location Awareness. (arXiv:2106.06183v1 [eess.AS])
    (2 min) In this paper, we explore the benefits of incorporating context into a Recurrent Neural Network (RNN-T) based Automatic Speech Recognition (ASR) model to improve the speech recognition for virtual assistants. Specifically, we use meta information extracted from the time at which the utterance is spoken and the approximate location information to make ASR context aware. We show that these contextual information, when used individually, improves overall performance by as much as 3.48% relative to the baseline and when the contexts are combined, the model learns complementary features and the recognition improves by 4.62%. On specific domains, these contextual signals show improvements as high as 11.5%, without any significant degradation on others. We ran experiments with models trained on data of sizes 30K hours and 10K hours. We show that the scale of improvement with the 10K hours dataset is much higher than the one obtained with 30K hours dataset. Our results indicate that with limited data to train the ASR model, contextual signals can improve the performance significantly.
    Sprachsynthese -- State-of-the-Art in englischer und deutscher Sprache. (arXiv:2106.06230v1 [cs.CL])
    (2 min) Reading text aloud is an important feature for modern computer applications. It not only facilitates access to information for visually impaired people, but is also a pleasant convenience for non-impaired users. In this article, the state of the art of speech synthesis is presented separately for mel-spectrogram generation and vocoders. It concludes with an overview of available data sets for English and German with a discussion of the transferability of the good speech synthesis results from English to German language.
    TellMeWhy: A Dataset for Answering Why-Questions in Narratives. (arXiv:2106.06132v1 [cs.CL])
    (2 min) Answering questions about why characters perform certain actions is central to understanding and reasoning about narratives. Despite recent progress in QA, it is not clear if existing models have the ability to answer "why" questions that may require commonsense knowledge external to the input narrative. In this work, we introduce TellMeWhy, a new crowd-sourced dataset that consists of more than 30k questions and free-form answers concerning why characters in short narratives perform the actions described. For a third of this dataset, the answers are not present within the narrative. Given the limitations of automated evaluation for this task, we also present a systematized human evaluation interface for this dataset. Our evaluation of state-of-the-art models show that they are far below human performance on answering such questions. They are especially worse on questions whose answers are external to the narrative, thus providing a challenge for future QA and narrative understanding research.
    Graph Neural Networks for Natural Language Processing: A Survey. (arXiv:2106.06090v1 [cs.CL])
    (2 min) Deep learning has become the dominant approach in coping with various tasks in Natural LanguageProcessing (NLP). Although text inputs are typically represented as a sequence of tokens, there isa rich variety of NLP problems that can be best expressed with a graph structure. As a result, thereis a surge of interests in developing new deep learning techniques on graphs for a large numberof NLP tasks. In this survey, we present a comprehensive overview onGraph Neural Networks(GNNs) for Natural Language Processing. We propose a new taxonomy of GNNs for NLP, whichsystematically organizes existing research of GNNs for NLP along three axes: graph construction,graph representation learning, and graph based encoder-decoder models. We further introducea large number of NLP applications that are exploiting the power of GNNs and summarize thecorresponding benchmark datasets, evaluation metrics, and open-source codes. Finally, we discussvarious outstanding challenges for making the full use of GNNs for NLP as well as future researchdirections. To the best of our knowledge, this is the first comprehensive overview of Graph NeuralNetworks for Natural Language Processing.
    How Should Agents Ask Questions For Situated Learning? An Annotated Dialogue Corpus. (arXiv:2106.06504v1 [cs.CL])
    (2 min) Intelligent agents that are confronted with novel concepts in situated environments will need to ask their human teammates questions to learn about the physical world. To better understand this problem, we need data about asking questions in situated task-based interactions. To this end, we present the Human-Robot Dialogue Learning (HuRDL) Corpus - a novel dialogue corpus collected in an online interactive virtual environment in which human participants play the role of a robot performing a collaborative tool-organization task. We describe the corpus data and a corresponding annotation scheme to offer insight into the form and content of questions that humans ask to facilitate learning in a situated environment. We provide the corpus as an empirically-grounded resource for improving question generation in situated intelligent agents.
    Unsupervised Knowledge Graph Alignment by Probabilistic Reasoning and Semantic Embedding. (arXiv:2105.05596v3 [cs.CL] UPDATED)
    (2 min) Knowledge Graph (KG) alignment is to discover the mappings (i.e., equivalent entities, relations, and others) between two KGs. The existing methods can be divided into the embedding-based models, and the conventional reasoning and lexical matching based systems. The former compute the similarity of entities via their cross-KG embeddings, but they usually rely on an ideal supervised learning setting for good performance and lack appropriate reasoning to avoid logically wrong mappings; while the latter address the reasoning issue but are poor at utilizing the KG graph structures and the entity contexts. In this study, we aim at combining the above two solutions and thus propose an iterative framework named PRASE which is based on probabilistic reasoning and semantic embedding. It learns the KG embeddings via entity mappings from a probabilistic reasoning system named PARIS, and feeds the resultant entity mappings and embeddings back into PARIS for augmentation. The PRASE framework is compatible with different embedding-based models, and our experiments on multiple datasets have demonstrated its state-of-the-art performance.
    Assessing Political Prudence of Open-domain Chatbots. (arXiv:2106.06157v1 [cs.CL])
    (2 min) Politically sensitive topics are still a challenge for open-domain chatbots. However, dealing with politically sensitive content in a responsible, non-partisan, and safe behavior way is integral for these chatbots. Currently, the main approach to handling political sensitivity is by simply changing such a topic when it is detected. This is safe but evasive and results in a chatbot that is less engaging. In this work, as a first step towards a politically safe chatbot, we propose a group of metrics for assessing their political prudence. We then conduct political prudence analysis of various chatbots and discuss their behavior from multiple angles through our automatic metric and human evaluation metrics. The testsets and codebase are released to promote research in this area.
    Spoken Style Learning with Multi-modal Hierarchical Context Encoding for Conversational Text-to-Speech Synthesis. (arXiv:2106.06233v1 [cs.SD])
    (2 min) For conversational text-to-speech (TTS) systems, it is vital that the systems can adjust the spoken styles of synthesized speech according to different content and spoken styles in historical conversations. However, the study about learning spoken styles from historical conversations is still in its infancy. Only the transcripts of the historical conversations are considered, which neglects the spoken styles in historical speeches. Moreover, only the interactions of the global aspect between speakers are modeled, missing the party aspect self interactions inside each speaker. In this paper, to achieve better spoken style learning for conversational TTS, we propose a spoken style learning approach with multi-modal hierarchical context encoding. The textual information and spoken styles in the historical conversations are processed through multiple hierarchical recurrent neural networks to learn the spoken style related features in global and party aspects. The attention mechanism is further employed to summarize these features into a conversational context encoding. Experimental results demonstrate the effectiveness of our proposed approach, which outperform a baseline method using context encoding learnt only from the transcripts in global aspects, with MOS score on the naturalness of synthesized speech increasing from 3.138 to 3.408 and ABX preference rate exceeding the baseline method by 36.45%.
    Towards unsupervised phone and word segmentation using self-supervised vector-quantized neural networks. (arXiv:2012.07551v2 [cs.CL] UPDATED)
    (2 min) We investigate segmenting and clustering speech into low-bitrate phone-like sequences without supervision. We specifically constrain pretrained self-supervised vector-quantized (VQ) neural networks so that blocks of contiguous feature vectors are assigned to the same code, thereby giving a variable-rate segmentation of the speech into discrete units. Two segmentation methods are considered. In the first, features are greedily merged until a prespecified number of segments are reached. The second uses dynamic programming to optimize a squared error with a penalty term to encourage fewer but longer segments. We show that these VQ segmentation methods can be used without alteration across a wide range of tasks: unsupervised phone segmentation, ABX phone discrimination, same-different word discrimination, and as inputs to a symbolic word segmentation algorithm. The penalized dynamic programming method generally performs best. While performance on individual tasks is only comparable to the state-of-the-art in some cases, in all tasks a reasonable competing approach is outperformed at a substantially lower bitrate.
    Calibrate Before Use: Improving Few-Shot Performance of Language Models. (arXiv:2102.09690v2 [cs.CL] UPDATED)
    (2 min) GPT-3 can perform numerous tasks when provided a natural language prompt that contains a few training examples. We show that this type of few-shot learning can be unstable: the choice of prompt format, training examples, and even the order of the training examples can cause accuracy to vary from near chance to near state-of-the-art. We demonstrate that this instability arises from the bias of language models towards predicting certain answers, e.g., those that are placed near the end of the prompt or are common in the pre-training data. To mitigate this, we first estimate the model's bias towards each answer by asking for its prediction when given the training prompt and a content-free test input such as "N/A". We then fit calibration parameters that cause the prediction for this input to be uniform across answers. On a diverse set of tasks, this contextual calibration procedure substantially improves GPT-3 and GPT-2's average accuracy (up to 30.0% absolute) and reduces variance across different choices of the prompt.
    Modeling Hierarchical Structures with Continuous Recursive Neural Networks. (arXiv:2106.06038v1 [cs.CL])
    (2 min) Recursive Neural Networks (RvNNs), which compose sequences according to their underlying hierarchical syntactic structure, have performed well in several natural language processing tasks compared to similar models without structural biases. However, traditional RvNNs are incapable of inducing the latent structure in a plain text sequence on their own. Several extensions have been proposed to overcome this limitation. Nevertheless, these extensions tend to rely on surrogate gradients or reinforcement learning at the cost of higher bias or variance. In this work, we propose Continuous Recursive Neural Network (CRvNN) as a backpropagation-friendly alternative to address the aforementioned limitations. This is done by incorporating a continuous relaxation to the induced structure. We demonstrate that CRvNN achieves strong performance in challenging synthetic tasks such as logical inference and ListOps. We also show that CRvNN performs comparably or better than prior latent structure models on real-world tasks such as sentiment analysis and natural language inference.
    From Paraphrasing to Semantic Parsing: Unsupervised Semantic Parsing via Synchronous Semantic Decoding. (arXiv:2106.06228v1 [cs.CL])
    (2 min) Semantic parsing is challenging due to the structure gap and the semantic gap between utterances and logical forms. In this paper, we propose an unsupervised semantic parsing method - Synchronous Semantic Decoding (SSD), which can simultaneously resolve the semantic gap and the structure gap by jointly leveraging paraphrasing and grammar constrained decoding. Specifically, we reformulate semantic parsing as a constrained paraphrasing problem: given an utterance, our model synchronously generates its canonical utterance and meaning representation. During synchronous decoding: the utterance paraphrasing is constrained by the structure of the logical form, therefore the canonical utterance can be paraphrased controlledly; the semantic decoding is guided by the semantics of the canonical utterance, therefore its logical form can be generated unsupervisedly. Experimental results show that SSD is a promising approach and can achieve competitive unsupervised semantic parsing performance on multiple datasets.
    A comprehensive solution to retrieval-based chatbot construction. (arXiv:2106.06139v1 [cs.CL])
    (2 min) In this paper we present the results of our experiments in training and deploying a self-supervised retrieval-based chatbot trained with contrastive learning for assisting customer support agents. In contrast to most existing research papers in this area where the focus is on solving just one component of a deployable chatbot, we present an end-to-end set of solutions to take the reader from an unlabelled chatlogs to a deployed chatbot. This set of solutions includes creating a self-supervised dataset and a weakly labelled dataset from chatlogs, as well as a systematic approach to selecting a fixed list of canned responses. We present a hierarchical-based RNN architecture for the response selection model, chosen for its ability to cache intermediate utterance embeddings, which helped to meet deployment inference speed requirements. We compare the performance of this architecture across 3 different learning objectives: self-supervised contrastive learning, binary classification, and multi-class classification. We find that using a self-supervised contrastive learning model outperforms training the binary and multi-class classification models on a weakly labelled dataset. Our results validate that the self-supervised contrastive learning approach can be effectively used for a real-world chatbot scenario.
    Causal Analysis of Syntactic Agreement Mechanisms in Neural Language Models. (arXiv:2106.06087v1 [cs.CL])
    (2 min) Targeted syntactic evaluations have demonstrated the ability of language models to perform subject-verb agreement given difficult contexts. To elucidate the mechanisms by which the models accomplish this behavior, this study applies causal mediation analysis to pre-trained neural language models. We investigate the magnitude of models' preferences for grammatical inflections, as well as whether neurons process subject-verb agreement similarly across sentences with different syntactic structures. We uncover similarities and differences across architectures and model sizes -- notably, that larger models do not necessarily learn stronger preferences. We also observe two distinct mechanisms for producing subject-verb agreement depending on the syntactic structure of the input sentence. Finally, we find that language models rely on similar sets of neurons when given sentences with similar syntactic structure.
    Semi-Supervised and Unsupervised Sense Annotation via Translations. (arXiv:2106.06462v1 [cs.CL])
    (2 min) Acquisition of multilingual training data continues to be a challenge in word sense disambiguation (WSD). To address this problem, unsupervised approaches have been developed in recent years that automatically generate sense annotations suitable for training supervised WSD systems. We present three new methods to creating sense-annotated corpora, which leverage translations, parallel corpora, lexical resources, and contextual and synset embeddings. Our semi-supervised method applies machine translation to transfer existing sense annotations to other languages. Our two unsupervised methods use a knowledge-based WSD system to annotate a parallel corpus, and refine the resulting sense annotations by identifying lexical translations. We obtain state-of-the-art results on standard WSD benchmarks.
    N-Best ASR Transformer: Enhancing SLU Performance using Multiple ASR Hypotheses. (arXiv:2106.06519v1 [cs.CL])
    (2 min) Spoken Language Understanding (SLU) systems parse speech into semantic structures like dialog acts and slots. This involves the use of an Automatic Speech Recognizer (ASR) to transcribe speech into multiple text alternatives (hypotheses). Transcription errors, common in ASRs, impact downstream SLU performance negatively. Approaches to mitigate such errors involve using richer information from the ASR, either in form of N-best hypotheses or word-lattices. We hypothesize that transformer models learn better with a simpler utterance representation using the concatenation of the N-best ASR alternatives, where each alternative is separated by a special delimiter [SEP]. In our work, we test our hypothesis by using concatenated N-best ASR alternatives as the input to transformer encoder models, namely BERT and XLM-RoBERTa, and achieve performance equivalent to the prior state-of-the-art model on DSTC2 dataset. We also show that our approach significantly outperforms the prior state-of-the-art when subjected to the low data regime. Additionally, this methodology is accessible to users of third-party ASR APIs which do not provide word-lattice information.
    Generalizing Cross-Document Event Coreference Resolution Across Multiple Corpora. (arXiv:2011.12249v2 [cs.CL] UPDATED)
    (2 min) Cross-document event coreference resolution (CDCR) is an NLP task in which mentions of events need to be identified and clustered throughout a collection of documents. CDCR aims to benefit downstream multi-document applications, but despite recent progress on corpora and system development, downstream improvements from applying CDCR have not been shown yet. We make the observation that every CDCR system to date was developed, trained, and tested only on a single respective corpus. This raises strong concerns on their generalizability -- a must-have for downstream applications where the magnitude of domains or event mentions is likely to exceed those found in a curated corpus. To investigate this assumption, we define a uniform evaluation setup involving three CDCR corpora: ECB+, the Gun Violence Corpus and the Football Coreference Corpus (which we reannotate on token level to make our analysis possible). We compare a corpus-independent, feature-based system against a recent neural system developed for ECB+. Whilst being inferior in absolute numbers, the feature-based system shows more consistent performance across all corpora whereas the neural system is hit-and-miss. Via model introspection, we find that the importance of event actions, event time, etc. for resolving coreference in practice varies greatly between the corpora. Additional analysis shows that several systems overfit on the structure of the ECB+ corpus. We conclude with recommendations on how to achieve generally applicable CDCR systems in the future -- the most important being that evaluation on multiple CDCR corpora is strongly necessary. To facilitate future research, we release our dataset, annotation guidelines, and system implementation to the public.
    Towards User-Driven Neural Machine Translation. (arXiv:2106.06200v1 [cs.CL])
    (2 min) A good translation should not only translate the original content semantically, but also incarnate personal traits of the original text. For a real-world neural machine translation (NMT) system, these user traits (e.g., topic preference, stylistic characteristics and expression habits) can be preserved in user behavior (e.g., historical inputs). However, current NMT systems marginally consider the user behavior due to: 1) the difficulty of modeling user portraits in zero-shot scenarios, and 2) the lack of user-behavior annotated parallel dataset. To fill this gap, we introduce a novel framework called user-driven NMT. Specifically, a cache-based module and a user-driven contrastive learning method are proposed to offer NMT the ability to capture potential user traits from their historical inputs under a zero-shot learning fashion. Furthermore, we contribute the first Chinese-English parallel corpus annotated with user behavior called UDT-Corpus. Experimental results confirm that the proposed user-driven NMT can generate user-specific translations.
    Zero-Shot Controlled Generation with Encoder-Decoder Transformers. (arXiv:2106.06411v1 [cs.CL])
    (2 min) Controlling neural network-based models for natural language generation (NLG) has broad applications in numerous areas such as machine translation, document summarization, and dialog systems. Approaches that enable such control in a zero-shot manner would be of great importance as, among other reasons, they remove the need for additional annotated data and training. In this work, we propose novel approaches for controlling encoder-decoder transformer-based NLG models in a zero-shot manner. This is done by introducing three control knobs; namely, attention biasing, decoder mixing, and context augmentation, that are applied to these models at generation time. These knobs control the generation process by directly manipulating trained NLG models (e.g., biasing cross-attention layers) to realize the desired attributes in the generated outputs. We show that not only are these NLG models robust to such manipulations, but also their behavior could be controlled without an impact on their generation performance. These results, to the best of our knowledge, are the first of their kind. Through these control knobs, we also investigate the role of transformer decoder's self-attention module and show strong evidence that its primary role is maintaining fluency of sentences generated by these models. Based on this hypothesis, we show that alternative architectures for transformer decoders could be viable options. We also study how this hypothesis could lead to more efficient ways for training encoder-decoder transformer models.
    Improving Pretrained Cross-Lingual Language Models via Self-Labeled Word Alignment. (arXiv:2106.06381v1 [cs.CL])
    (2 min) The cross-lingual language models are typically pretrained with masked language modeling on multilingual text or parallel sentences. In this paper, we introduce denoising word alignment as a new cross-lingual pre-training task. Specifically, the model first self-labels word alignments for parallel sentences. Then we randomly mask tokens in a bitext pair. Given a masked token, the model uses a pointer network to predict the aligned token in the other language. We alternately perform the above two steps in an expectation-maximization manner. Experimental results show that our method improves cross-lingual transferability on various datasets, especially on the token-level tasks, such as question answering, and structured prediction. Moreover, the model can serve as a pretrained word aligner, which achieves reasonably low error rates on the alignment benchmarks. The code and pretrained parameters are available at https://github.com/CZWin32768/XLM-Align.
    What is Multimodality?. (arXiv:2103.06304v3 [cs.AI] UPDATED)
    (2 min) The last years have shown rapid developments in the field of multimodal machine learning, combining e.g., vision, text or speech. In this position paper we explain how the field uses outdated definitions of multimodality that prove unfit for the machine learning era. We propose a new task-relative definition of (multi)modality in the context of multimodal machine learning that focuses on representations and information that are relevant for a given machine learning task. With our new definition of multimodality we aim to provide a missing foundation for multimodal research, an important component of language grounding and a crucial milestone towards NLU.
    NAAQA: A Neural Architecture for Acoustic Question Answering. (arXiv:2106.06147v1 [cs.CL])
    (2 min) The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA that emphasizes the specific challenges of acoustic inputs, e.g. variable duration scenes. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. The usage of time and frequency 1D convolutions to process 2D spectro-temporal representations of acoustic content shows promising results and enables reductions in model complexity. NAAQA achieves 91.6% of accuracy on the AQA task with about 7 times fewer parameters than the previously explored VQA model. We provide a detailed analysis of the results for the different question types. The effectiveness of coordinate maps in this acoustic context was also studied and we show that time coordinate maps augment temporal localization capabilities which enhance performance of the network by about 17 percentage points.
    A Discussion on Building Practical NLP Leaderboards: The Case of Machine Translation. (arXiv:2106.06292v1 [cs.CL])
    (2 min) Recent advances in AI and ML applications have benefited from rapid progress in NLP research. Leaderboards have emerged as a popular mechanism to track and accelerate progress in NLP through competitive model development. While this has increased interest and participation, the over-reliance on single, and accuracy-based metrics have shifted focus from other important metrics that might be equally pertinent to consider in real-world contexts. In this paper, we offer a preliminary discussion of the risks associated with focusing exclusively on accuracy metrics and draw on recent discussions to highlight prescriptive suggestions on how to develop more practical and effective leaderboards that can better reflect the real-world utility of models.
    HMM-Free Encoder Pre-Training for Streaming RNN Transducer. (arXiv:2104.10764v2 [eess.AS] UPDATED)
    (2 min) This work describes an encoder pre-training procedure using frame-wise label to improve the training of streaming recurrent neural network transducer (RNN-T) model. Streaming RNN-T trained from scratch usually performs worse than non-streaming RNN-T. Although it is common to address this issue through pre-training components of RNN-T with other criteria or frame-wise alignment guidance, the alignment is not easily available in end-to-end manner. In this work, frame-wise alignment, used to pre-train streaming RNN-T's encoder, is generated without using a HMM-based system. Therefore an all-neural framework equipping HMM-free encoder pre-training is constructed. This is achieved by expanding the spikes of CTC model to their left/right blank frames, and two expanding strategies are proposed. To our best knowledge, this is the first work to simulate HMM-based frame-wise label using CTC model for pre-training. Experiments conducted on LibriSpeech and MLS English tasks show the proposed pre-training procedure, compared with random initialization, reduces the WER by relatively 5%~11% and the emission latency by 60 ms. Besides, the method is lexicon-free, so it is friendly to new languages without manually designed lexicon.
    Cross-lingual Emotion Detection. (arXiv:2106.06017v1 [cs.CL])
    (2 min) Emotion detection is of great importance for understanding humans. Constructing annotated datasets to train automated models can be expensive. We explore the efficacy of cross-lingual approaches that would use data from a source language to build models for emotion detection in a target language. We compare three approaches, namely: i) using inherently multilingual models; ii) translating training data into the target language; and iii) using an automatically tagged parallel corpus. In our study, we consider English as the source language with Arabic and Spanish as target languages. We study the effectiveness of different classification models such as BERT and SVMs trained with different features. Our BERT-based monolingual models that are trained on target language data surpass state-of-the-art (SOTA) by 4% and 5% absolute Jaccard score for Arabic and Spanish respectively. Next, we show that using cross-lingual approaches with English data alone, we can achieve more than 90% and 80% relative effectiveness of the Arabic and Spanish BERT models respectively. Lastly, we use LIME to interpret the differences between models.
    Turn the Combination Lock: Learnable Textual Backdoor Attacks via Word Substitution. (arXiv:2106.06361v1 [cs.CL])
    (2 min) Recent studies show that neural natural language processing (NLP) models are vulnerable to backdoor attacks. Injected with backdoors, models perform normally on benign examples but produce attacker-specified predictions when the backdoor is activated, presenting serious security threats to real-world applications. Since existing textual backdoor attacks pay little attention to the invisibility of backdoors, they can be easily detected and blocked. In this work, we present invisible backdoors that are activated by a learnable combination of word substitution. We show that NLP models can be injected with backdoors that lead to a nearly 100% attack success rate, whereas being highly invisible to existing defense strategies and even human inspections. The results raise a serious alarm to the security of NLP models, which requires further research to be resolved. All the data and code of this paper are released at https://github.com/thunlp/BkdAtk-LWS.
    UnNatural Language Inference. (arXiv:2101.00010v2 [cs.CL] UPDATED)
    (2 min) Recent investigations into the inner-workings of state-of-the-art large-scale pre-trained Transformer-based Natural Language Understanding (NLU) models indicate that they appear to know humanlike syntax, at least to some extent. We provide novel evidence that complicates this claim: we find that state-of-the-art Natural Language Inference (NLI) models assign the same labels to permuted examples as they do to the original, i.e. they are largely invariant to random word-order permutations. This behavior notably differs from that of humans; we struggle with ungrammatical sentences. To measure the severity of this issue, we propose a suite of metrics and investigate which properties of particular permutations lead models to be word-order invariant. In the MNLI dataset, for example, we find almost all (98.7%) examples contain at least one permutation which elicits the gold label. Models are sometimes even able to assign gold labels to permutations that they originally failed to predict correctly. We provide a comprehensive empirical evaluation of this phenomenon, and further show that this issue exists for both Transformers and pre-Transformer RNN / ConvNet based encoders, as well as across multiple languages (English and Mandarin Chinese). Our code and data are available at https://github.com/facebookresearch/unlu.
    Bridging Subword Gaps in Pretrain-Finetune Paradigm for Natural Language Generation. (arXiv:2106.06125v1 [cs.CL])
    (2 min) A well-known limitation in pretrain-finetune paradigm lies in its inflexibility caused by the one-size-fits-all vocabulary. This potentially weakens the effect when applying pretrained models into natural language generation (NLG) tasks, especially for the subword distributions between upstream and downstream tasks with significant discrepancy. Towards approaching this problem, we extend the vanilla pretrain-finetune pipeline with an extra embedding transfer step. Specifically, a plug-and-play embedding generator is introduced to produce the representation of any input token, according to pre-trained embeddings of its morphologically similar ones. Thus, embeddings of mismatch tokens in downstream tasks can also be efficiently initialized. We conduct experiments on a variety of NLG tasks under the pretrain-finetune fashion. Experimental results and extensive analyses show that the proposed strategy offers us opportunities to feel free to transfer the vocabulary, leading to more efficient and better performed downstream NLG models.
    FedNLP: An interpretable NLP System to Decode Federal Reserve Communications. (arXiv:2106.06247v1 [cs.CL])
    (2 min) The Federal Reserve System (the Fed) plays a significant role in affecting monetary policy and financial conditions worldwide. Although it is important to analyse the Fed's communications to extract useful information, it is generally long-form and complex due to the ambiguous and esoteric nature of content. In this paper, we present FedNLP, an interpretable multi-component Natural Language Processing system to decode Federal Reserve communications. This system is designed for end-users to explore how NLP techniques can assist their holistic understanding of the Fed's communications with NO coding. Behind the scenes, FedNLP uses multiple NLP models from traditional machine learning algorithms to deep neural network architectures in each downstream task. The demonstration shows multiple results at once including sentiment analysis, summary of the document, prediction of the Federal Funds Rate movement and visualization for interpreting the prediction model's result.
    One Sense Per Translation. (arXiv:2106.06082v1 [cs.CL])
    (2 min) The idea of using lexical translations to define sense inventories has a long history in lexical semantics. We propose a theoretical framework which allows us to answer the question of why this apparently reasonable idea failed to produce useful results. We formally prove several propositions on how the translations of a word relate to its senses, as well as on the relationship between synonymy and polysemy. We empirically validate our theoretical findings on BabelNet, and demonstrate how they could be used to perform unsupervised word sense disambiguation of a substantial fraction of the lexicon.
    Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. (arXiv:2102.05918v2 [cs.CV] UPDATED)
    (2 min) Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enables zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
    Spoken Term Detection Methods for Sparse Transcription in Very Low-resource Settings. (arXiv:2106.06160v1 [cs.CL])
    (2 min) We investigate the efficiency of two very different spoken term detection approaches for transcription when the available data is insufficient to train a robust ASR system. This work is grounded in very low-resource language documentation scenario where only few minutes of recording have been transcribed for a given language so far.Experiments on two oral languages show that a pretrained universal phone recognizer, fine-tuned with only a few minutes of target language speech, can be used for spoken term detection with a better overall performance than a dynamic time warping approach. In addition, we show that representing phoneme recognition ambiguity in a graph structure can further boost the recall while maintaining high precision in the low resource spoken term detection task.
    HUI-Audio-Corpus-German: A high quality TTS dataset. (arXiv:2106.06309v1 [cs.SD])
    (2 min) The increasing availability of audio data on the internet lead to a multitude of datasets for development and training of text to speech applications, based on neural networks. Highly differing quality of voice, low sampling rates, lack of text normalization and disadvantageous alignment of audio samples to corresponding transcript sentences still limit the performance of deep neural networks trained on this task. Additionally, data resources in languages like German are still very limited. We introduce the "HUI-Audio-Corpus-German", a large, open-source dataset for TTS engines, created with a processing pipeline, which produces high quality audio to transcription alignments and decreases manual effort needed for creation.
    To Beam Or Not To Beam: That is a Question of Cooperation for Language GANs. (arXiv:2106.06363v1 [cs.CL])
    (2 min) Due to the discrete nature of words, language GANs require to be optimized from rewards provided by discriminator networks, via reinforcement learning methods. This is a much harder setting than for continuous tasks, which enjoy gradient flows from discriminators to generators, usually leading to dramatic learning instabilities. However, we claim that this can be solved by making discriminator and generator networks cooperate to produce output sequences during training. These cooperative outputs, inherently built to obtain higher discrimination scores, not only provide denser rewards for training, but also form a more compact artificial set for discriminator training, hence improving its accuracy and stability. In this paper, we show that our SelfGAN framework, built on this cooperative principle, outperforms Teacher Forcing and obtains state-of-the-art results on two challenging tasks, Summarization and Question Generation.
    Dynamic Language Models for Continuously Evolving Content. (arXiv:2106.06297v1 [cs.CL])
    (2 min) The content on the web is in a constant state of flux. New entities, issues, and ideas continuously emerge, while the semantics of the existing conversation topics gradually shift. In recent years, pre-trained language models like BERT greatly improved the state-of-the-art for a large spectrum of content understanding tasks. Therefore, in this paper, we aim to study how these language models can be adapted to better handle continuously evolving web content. In our study, we first analyze the evolution of 2013 - 2019 Twitter data, and unequivocally confirm that a BERT model trained on past tweets would heavily deteriorate when directly applied to data from later years. Then, we investigate two possible sources of the deterioration: the semantic shift of existing tokens and the sub-optimal or failed understanding of new tokens. To this end, we both explore two different vocabulary composition methods, as well as propose three sampling methods which help in efficient incremental training for BERT-like models. Compared to a new model trained from scratch offline, our incremental training (a) reduces the training costs, (b) achieves better performance on evolving content, and (c) is suitable for online deployment. The superiority of our methods is validated using two downstream tasks. We demonstrate significant improvements when incrementally evolving the model from a particular base year, on the task of Country Hashtag Prediction, as well as on the OffensEval 2019 task.
    Continual Learning for Text Classification with Information Disentanglement Based Regularization. (arXiv:2104.05489v2 [cs.CL] UPDATED)
    (2 min) Continual learning has become increasingly important as it enables NLP models to constantly learn and gain knowledge over time. Previous continual learning methods are mainly designed to preserve knowledge from previous tasks, without much emphasis on how to well generalize models to new tasks. In this work, we propose an information disentanglement based regularization method for continual learning on text classification. Our proposed method first disentangles text hidden spaces into representations that are generic to all tasks and representations specific to each individual task, and further regularizes these representations differently to better constrain the knowledge required to generalize. We also introduce two simple auxiliary tasks: next sentence prediction and task-id prediction, for learning better generic and specific representation spaces. Experiments conducted on large-scale benchmarks demonstrate the effectiveness of our method in continual text classification tasks with various sequences and lengths over state-of-the-art baselines. We have publicly released our code at https://github.com/GT-SALT/IDBR.
    Local Explanation of Dialogue Response Generation. (arXiv:2106.06528v1 [cs.CL])
    (2 min) In comparison to the interpretation of classification models, the explanation of sequence generation models is also an important problem, however it has seen little attention. In this work, we study model-agnostic explanations of a representative text generation task -- dialogue response generation. Dialog response generation is challenging with its open-ended sentences and multiple acceptable responses. To gain insights into the reasoning process of a generation model, we propose anew method, local explanation of response generation (LERG) that regards the explanations as the mutual interaction of segments in input and output sentences. LERG views the sequence prediction as uncertainty estimation of a human response and then creates explanations by perturbing the input and calculating the certainty change over the human response. We show that LERG adheres to desired properties of explanations for text generation including unbiased approximation, consistency and cause identification. Empirically, our results show that our method consistently improves other widely used methods on proposed automatic- and human- evaluation metrics for this new task by 4.4-12.8%. Our analysis demonstrates that LERG can extract both explicit and implicit relations between input and output segments.
    Dynaboard: An Evaluation-As-A-Service Platform for Holistic Next-Generation Benchmarking. (arXiv:2106.06052v1 [cs.CL])
    (2 min) We introduce Dynaboard, an evaluation-as-a-service framework for hosting benchmarks and conducting holistic model comparison, integrated with the Dynabench platform. Our platform evaluates NLP models directly instead of relying on self-reported metrics or predictions on a single dataset. Under this paradigm, models are submitted to be evaluated in the cloud, circumventing the issues of reproducibility, accessibility, and backwards compatibility that often hinder benchmarking in NLP. This allows users to interact with uploaded models in real time to assess their quality, and permits the collection of additional metrics such as memory use, throughput, and robustness, which -- despite their importance to practitioners -- have traditionally been absent from leaderboards. On each task, models are ranked according to the Dynascore, a novel utility-based aggregation of these statistics, which users can customize to better reflect their preferences, placing more/less weight on a particular axis of evaluation or dataset. As state-of-the-art NLP models push the limits of traditional benchmarks, Dynaboard offers a standardized solution for a more diverse and comprehensive evaluation of model quality.
    Probabilistic, Structure-Aware Algorithms for Improved Variety, Accuracy, and Coverage of AMR Alignments. (arXiv:2106.06002v1 [cs.CL])
    (2 min) We present algorithms for aligning components of Abstract Meaning Representation (AMR) graphs to spans in English sentences. We leverage unsupervised learning in combination with heuristics, taking the best of both worlds from previous AMR aligners. Our unsupervised models, however, are more sensitive to graph substructures, without requiring a separate syntactic parse. Our approach covers a wider variety of AMR substructures than previously considered, achieves higher coverage of nodes and edges, and does so with higher accuracy. We will release our LEAMR datasets and aligner for use in research on AMR parsing, generation, and evaluation.
  • cs.CV updates on arXiv.org

    Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation. (arXiv:2106.06471v1 [cs.CL])
    (2 min) Medical report generation is one of the most challenging tasks in medical image analysis. Although existing approaches have achieved promising results, they either require a predefined template database in order to retrieve sentences or ignore the hierarchical nature of medical report generation. To address these issues, we propose MedWriter that incorporates a novel hierarchical retrieval mechanism to automatically extract both report and sentence-level templates for clinically accurate report generation. MedWriter first employs the Visual-Language Retrieval~(VLR) module to retrieve the most relevant reports for the given images. To guarantee the logical coherence between sentences, the Language-Language Retrieval~(LLR) module is introduced to retrieve relevant sentences based on the previous generated description. At last, a language decoder fuses image features and features from retrieved reports and sentences to generate meaningful medical reports. We verified the effectiveness of our model by automatic evaluation and human evaluation on two datasets, i.e., Open-I and MIMIC-CXR.
    Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales. (arXiv:2106.06418v1 [cs.CV])
    (2 min) The ability to handle large scale variations is crucial for many real world visual tasks. A straightforward approach for handling scale in a deep network is to process an image at several scales simultaneously in a set of scale channels. Scale invariance can then, in principle, be achieved by using weight sharing between the scale channels together with max or average pooling over the outputs from the scale channels. The ability of such scale channel networks to generalise to scales not present in the training set over significant scale ranges has, however, not previously been explored. In this paper, we present a systematic study of this methodology by implementing different types of scale channel networks and evaluating their ability to generalise to previously unseen scales. We develop a formalism for analysing the covariance and invariance properties of scale channel networks, and explore how different design choices, unique to scaling transformations, affect the overall performance of scale channel networks. We first show that two previously proposed scale channel network designs do not generalise well to scales not present in the training set. We explain theoretically and demonstrate experimentally why generalisation fails in these cases. We then propose a new type of foveated scale channel architecture}, where the scale channels process increasingly larger parts of the image with decreasing resolution. This new type of scale channel network is shown to generalise extremely well, provided sufficient image resolution and the absence of boundary effects. Our proposed FovMax and FovAvg networks perform almost identically over a scale range of 8, also when training on single scale training data, and do also give improved performance when learning from datasets with large scale variations in the small sample regime.
    Learning Compositional Shape Priors for Few-Shot 3D Reconstruction. (arXiv:2106.06440v1 [cs.CV])
    (2 min) The impressive performance of deep convolutional neural networks in single-view 3D reconstruction suggests that these models perform non-trivial reasoning about the 3D structure of the output space. Recent work has challenged this belief, showing that, on standard benchmarks, complex encoder-decoder architectures perform similarly to nearest-neighbor baselines or simple linear decoder models that exploit large amounts of per-category data. However, building large collections of 3D shapes for supervised training is a laborious process; a more realistic and less constraining task is inferring 3D shapes for categories with few available training examples, calling for a model that can successfully generalize to novel object classes. In this work we experimentally demonstrate that naive baselines fail in this few-shot learning setting, in which the network must learn informative shape priors for inference of new categories. We propose three ways to learn a class-specific global shape prior, directly from data. Using these techniques, we are able to capture multi-scale information about the 3D shape, and account for intra-class variability by virtue of an implicit compositional structure. Experiments on the popular ShapeNet dataset show that our method outperforms a zero-shot baseline by over 40%, and the current state-of-the-art by over 10%, in terms of relative performance, in the few-shot setting.12
    A self-adapting super-resolution structures framework for automatic design of GAN. (arXiv:2106.06011v1 [cs.CV])
    (2 min) With the development of deep learning, the single super-resolution image reconstruction network models are becoming more and more complex. Small changes in hyperparameters of the models have a greater impact on model performance. In the existing works, experts have gradually explored a set of optimal model parameters based on empirical values or performing brute-force search. In this paper, we introduce a new super-resolution image reconstruction generative adversarial network framework, and a Bayesian optimization method used to optimizing the hyperparameters of the generator and discriminator. The generator is made by self-calibrated convolution, and discriminator is made by convolution lays. We have defined the hyperparameters such as the number of network layers and the number of neurons. Our method adopts Bayesian optimization as a optimization policy of GAN in our model. Not only can find the optimal hyperparameter solution automatically, but also can construct a super-resolution image reconstruction network, reducing the manual workload. Experiments show that Bayesian optimization can search the optimal solution earlier than the other two optimization algorithms.
    The Multi-Agent Behavior Dataset: Mouse Dyadic Social Interactions. (arXiv:2104.02710v3 [cs.LG] UPDATED)
    (2 min) Multi-agent behavior modeling aims to understand the interactions that occur between agents. We present a multi-agent dataset from behavioral neuroscience, the Caltech Mouse Social Interactions (CalMS21) Dataset. Our dataset consists of trajectory data of social interactions, recorded from videos of freely behaving mice in a standard resident-intruder assay. To help accelerate behavioral studies, the CalMS21 dataset provides benchmarks to evaluate the performance of automated behavior classification methods in three settings: (1) for training on large behavioral datasets all annotated by a single annotator, (2) for style transfer to learn inter-annotator differences in behavior definitions, and (3) for learning of new behaviors of interest given limited training data. The dataset consists of 6 million frames of unlabeled tracked poses of interacting mice, as well as over 1 million frames with tracked poses and corresponding frame-level behavior annotations. The challenge of our dataset is to be able to classify behaviors accurately using both labeled and unlabeled tracking data, as well as being able to generalize to new settings.
    Within-layer Diversity Reduces Generalization Gap. (arXiv:2106.06012v1 [cs.LG])
    (2 min) Neural networks are composed of multiple layers arranged in a hierarchical structure jointly trained with a gradient-based optimization, where the errors are back-propagated from the last layer back to the first one. At each optimization step, neurons at a given layer receive feedback from neurons belonging to higher layers of the hierarchy. In this paper, we propose to complement this traditional 'between-layer' feedback with additional 'within-layer' feedback to encourage diversity of the activations within the same layer. To this end, we measure the pairwise similarity between the outputs of the neurons and use it to model the layer's overall diversity. By penalizing similarities and promoting diversity, we encourage each neuron to learn a distinctive representation and, thus, to enrich the data representation learned within the layer and to increase the total capacity of the model. We theoretically study how the within-layer activation diversity affects the generalization performance of a neural network and prove that increasing the diversity of hidden activations reduces the estimation error. In addition to the theoretical guarantees, we present an empirical study on three datasets confirming that the proposed approach enhances the performance of state-of-the-art neural network models and decreases the generalization gap.
    Neural Architecture Search without Training. (arXiv:2006.04647v3 [cs.LG] UPDATED)
    (2 min) The time and effort involved in hand-designing deep neural networks is immense. This has prompted the development of Neural Architecture Search (NAS) techniques to automate this design. However, NAS algorithms tend to be slow and expensive; they need to train vast numbers of candidate networks to inform the search process. This could be alleviated if we could partially predict a network's trained accuracy from its initial state. In this work, we examine the overlap of activations between datapoints in untrained networks and motivate how this can give a measure which is usefully indicative of a network's trained performance. We incorporate this measure into a simple algorithm that allows us to search for powerful networks without any training in a matter of seconds on a single GPU, and verify its effectiveness on NAS-Bench-101, NAS-Bench-201, NATS-Bench, and Network Design Spaces. Our approach can be readily combined with more expensive search methods; we examine a simple adaptation of regularised evolutionary search. Code for reproducing our experiments is available at https://github.com/BayesWatch/nas-without-training.
    Team RUC_AIM3 Technical Report at ActivityNet 2021: Entities Object Localization. (arXiv:2106.06138v1 [cs.CV])
    (2 min) Entities Object Localization (EOL) aims to evaluate how grounded or faithful a description is, which consists of caption generation and object grounding. Previous works tackle this problem by jointly training the two modules in a framework, which limits the complexity of each module. Therefore, in this work, we propose to divide these two modules into two stages and improve them respectively to boost the whole system performance. For the caption generation, we propose a Unified Multi-modal Pre-training Model (UMPM) to generate event descriptions with rich objects for better localization. For the object grounding, we fine-tune the state-of-the-art detection model MDETR and design a post processing method to make the grounding results more faithful. Our overall system achieves the state-of-the-art performances on both sub-tasks in Entities Object Localization challenge at Activitynet 2021, with 72.57 localization accuracy on the testing set of sub-task I and 0.2477 F1_all_per_sent on the hidden testing set of sub-task II.
    Survey of Image Based Graph Neural Networks. (arXiv:2106.06307v1 [cs.LG])
    (2 min) In this survey paper, we analyze image based graph neural networks and propose a three-step classification approach. We first convert the image into superpixels using the Quickshift algorithm so as to reduce 30% of the input data. The superpixels are subsequently used to generate a region adjacency graph. Finally, the graph is passed through a state-of-art graph convolutional neural network to get classification scores. We also analyze the spatial and spectral convolution filtering techniques in graph neural networks. Spectral-based models perform better than spatial-based models and classical CNN with lesser compute cost.
    Representation Disentanglement for Multi-modal brain MR Analysis. (arXiv:2102.11456v2 [cs.CV] UPDATED)
    (2 min) Multi-modal MRIs are widely used in neuroimaging applications since different MR sequences provide complementary information about brain structures. Recent works have suggested that multi-modal deep learning analysis can benefit from explicitly disentangling anatomical (shape) and modality (appearance) information into separate image presentations. In this work, we challenge mainstream strategies by showing that they do not naturally lead to representation disentanglement both in theory and in practice. To address this issue, we propose a margin loss that regularizes the similarity in relationships of the representations across subjects and modalities. To enable robust training, we further use a conditional convolution to design a single model for encoding images of all modalities. Lastly, we propose a fusion function to combine the disentangled anatomical representations as a set of modality-invariant features for downstream tasks. We evaluate the proposed method on three multi-modal neuroimaging datasets. Experiments show that our proposed method can achieve superior disentangled representations compared to existing disentanglement strategies. Results also indicate that the fused anatomical representation has potential in the downstream task of zero-dose PET reconstruction and brain tumor segmentation. The code is available at \url{https://github.com/ouyangjiahong/representation-disentanglement}.
    Learning the Precise Feature for Cluster Assignment. (arXiv:2106.06159v1 [cs.CV])
    (2 min) Clustering is one of the fundamental tasks in computer vision and pattern recognition. Recently, deep clustering methods (algorithms based on deep learning) have attracted wide attention with their impressive performance. Most of these algorithms combine deep unsupervised representation learning and standard clustering together. However, the separation of representation learning and clustering will lead to suboptimal solutions because the two-stage strategy prevents representation learning from adapting to subsequent tasks (e.g., clustering according to specific cues). To overcome this issue, efforts have been made in the dynamic adaption of representation and cluster assignment, whereas current state-of-the-art methods suffer from heuristically constructed objectives with representation and cluster assignment alternatively optimized. To further standardize the clustering problem, we audaciously formulate the objective of clustering as finding a precise feature as the cue for cluster assignment. Based on this, we propose a general-purpose deep clustering framework which radically integrates representation learning and clustering into a single pipeline for the first time. The proposed framework exploits the powerful ability of recently developed generative models for learning intrinsic features, and imposes an entropy minimization on the distribution of the cluster assignment by a dedicated variational algorithm. Experimental results show that the performance of the proposed method is superior, or at least comparable to, the state-of-the-art methods on the handwritten digit recognition, fashion recognition, face recognition and object recognition benchmark datasets.
    A modular framework for object-based saccadic decisions in dynamic scenes. (arXiv:2106.06073v1 [cs.CV])
    (2 min) Visually exploring the world around us is not a passive process. Instead, we actively explore the world and acquire visual information over time. Here, we present a new model for simulating human eye-movement behavior in dynamic real-world scenes. We model this active scene exploration as a sequential decision making process. We adapt the popular drift-diffusion model (DDM) for perceptual decision making and extend it towards multiple options, defined by objects present in the scene. For each possible choice, the model integrates evidence over time and a decision (saccadic eye movement) is triggered as soon as evidence crosses a decision threshold. Drawing this explicit connection between decision making and object-based scene perception is highly relevant in the context of active viewing, where decisions are made continuously while interacting with an external environment. We validate our model with a carefully designed ablation study and explore influences of our model parameters. A comparison on the VidCom dataset supports the plausibility of the proposed approach.
    KRADA: Known-region-aware Domain Alignment for Open World Semantic Segmentation. (arXiv:2106.06237v1 [eess.IV])
    (2 min) In semantic segmentation, we aim to train a pixel-level classifier to assign category labels to all pixels in an image, where labeled training images and unlabeled test images are from the same distribution and share the same label set. However, in an open world, the unlabeled test images probably contain unknown categories and have different distributions from the labeled images. Hence, in this paper, we consider a new, more realistic, and more challenging problem setting where the pixel-level classifier has to be trained with labeled images and unlabeled open-world images -- we name it open world semantic segmentation (OSS). In OSS, the trained classifier is expected to identify unknown-class pixels and classify known-class pixels well. To solve OSS, we first investigate which distribution that unknown-class pixels obey. Then, motivated by the goodness-of-fit test, we use statistical measurements to show how a pixel fits the distribution of an unknown class and select highly-fitted pixels to form the unknown region in each image. Eventually, we propose an end-to-end learning framework, known-region-aware domain alignment (KRADA), to distinguish unknown classes while aligning distributions of known classes in labeled and unlabeled open-world images. The effectiveness of KRADA has been verified on two synthetic tasks and one COVID-19 segmentation task.
    Genetic U-Net: Automatically Designed Deep Networks for Retinal Vessel Segmentation Using a Genetic Algorithm. (arXiv:2010.15560v4 [eess.IV] UPDATED)
    (2 min) Recently, many methods based on hand-designed convolutional neural networks (CNNs) have achieved promising results in automatic retinal vessel segmentation. However, these CNNs remain constrained in capturing retinal vessels in complex fundus images. To improve their segmentation performance, these CNNs tend to have many parameters, which may lead to overfitting and high computational complexity. Moreover, the manual design of competitive CNNs is time-consuming and requires extensive empirical knowledge. Herein, a novel automated design method, called Genetic U-Net, is proposed to generate a U-shaped CNN that can achieve better retinal vessel segmentation but with fewer architecture-based parameters, thereby addressing the above issues. First, we devised a condensed but flexible search space based on a U-shaped encoder-decoder. Then, we used an improved genetic algorithm to identify better-performing architectures in the search space and investigated the possibility of finding a superior network architecture with fewer parameters. The experimental results show that the architecture obtained using the proposed method offered a superior performance with less than 1% of the number of the original U-Net parameters in particular and with significantly fewer parameters than other state-of-the-art models. Furthermore, through in-depth investigation of the experimental results, several effective operations and patterns of networks to generate superior retinal vessel segmentations were identified.
    Twins: Revisiting the Design of Spatial Attention in Vision Transformers. (arXiv:2104.13840v3 [cs.CV] UPDATED)
    (2 min) Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks including imagelevel classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code will be released soon at https://github.com/Meituan-AutoML/Twins .
    Step-Wise Hierarchical Alignment Network for Image-Text Matching. (arXiv:2106.06509v1 [cs.CV])
    (2 min) Image-text matching plays a central role in bridging the semantic gap between vision and language. The key point to achieve precise visual-semantic alignment lies in capturing the fine-grained cross-modal correspondence between image and text. Most previous methods rely on single-step reasoning to discover the visual-semantic interactions, which lacks the ability of exploiting the multi-level information to locate the hierarchical fine-grained relevance. Different from them, in this work, we propose a step-wise hierarchical alignment network (SHAN) that decomposes image-text matching into multi-step cross-modal reasoning process. Specifically, we first achieve local-to-local alignment at fragment level, following by performing global-to-local and global-to-global alignment at context level sequentially. This progressive alignment strategy supplies our model with more complementary and sufficient semantic clues to understand the hierarchical correlations between image and text. The experimental results on two benchmark datasets demonstrate the superiority of our proposed method.
    Online Multi-Object Tracking and Segmentation with GMPHD Filter and Mask-based Affinity Fusion. (arXiv:2009.00100v2 [cs.CV] UPDATED)
    (2 min) In this paper, we propose a highly practical fully online multi-object tracking and segmentation (MOTS) method that uses instance segmentation results as an input. The proposed method is based on the Gaussian mixture probability hypothesis density (GMPHD) filter, a hierarchical data association (HDA), and a mask-based affinity fusion (MAF) model to achieve high-performance online tracking. The HDA consists of two associations: segment-to-track and track-to-track associations. One affinity, for position and motion, is computed by using the GMPHD filter, and the other affinity, for appearance is computed by using the responses from a single object tracker such as a kernalized correlation filter. These two affinities are simply fused by using a score-level fusion method such as min-max normalization referred to as MAF. In addition, to reduce the number of false positive segments, we adopt mask IoU-based merging (mask merging). The proposed MOTS framework with the key modules: HDA, MAF, and mask merging, is easily extensible to simultaneously track multiple types of objects with CPU only execution in parallel processing. In addition, the developed framework only requires simple parameter tuning unlike many existing MOTS methods that need intensive hyperparameter optimization. In the experiments on the two popular MOTS datasets, the key modules show some improvements. For instance, ID-switch decreases by more than half compared to a baseline method in the training sets. In conclusion, our tracker achieves state-of-the-art MOTS performance in the test sets.
    Scaling Vision with Sparse Mixture of Experts. (arXiv:2106.05974v1 [cs.CV])
    (2 min) Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade-off performance and compute smoothly at test-time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.
    Towards Generalising Neural Implicit Representations. (arXiv:2101.12690v2 [cs.CV] UPDATED)
    (2 min) Neural implicit representations have shown substantial improvements in efficiently storing 3D data, when compared to conventional formats. However, the focus of existing work has mainly been on storage and subsequent reconstruction. In this work, we show that training neural representations for reconstruction tasks alongside conventional tasks can produce more general encodings that admit equal quality reconstructions to single task training, whilst improving results on conventional tasks when compared to single task encodings. We reformulate the semantic segmentation task, creating a more representative task for implicit representation contexts, and through multi-task experiments on reconstruction, classification, and segmentation, show our approach learns feature rich encodings that admit equal performance for each task.
    View Generalization for Single Image Textured 3D Models. (arXiv:2106.06533v1 [cs.CV])
    (2 min) Humans can easily infer the underlying 3D geometry and texture of an object only from a single 2D image. Current computer vision methods can do this, too, but suffer from view generalization problems - the models inferred tend to make poor predictions of appearance in novel views. As for generalization problems in machine learning, the difficulty is balancing single-view accuracy (cf. training error; bias) with novel view accuracy (cf. test error; variance). We describe a class of models whose geometric rigidity is easily controlled to manage this tradeoff. We describe a cycle consistency loss that improves view generalization (roughly, a model from a generated view should predict the original view well). View generalization of textures requires that models share texture information, so a car seen from the back still has headlights because other cars have headlights. We describe a cycle consistency loss that encourages model textures to be aligned, so as to encourage sharing. We compare our method against the state-of-the-art method and show both qualitative and quantitative improvements.
    Bridge the Gap Between Model-based and Model-free Human Reconstruction. (arXiv:2106.06313v1 [cs.CV])
    (2 min) It is challenging to directly estimate the geometry of human from a single image due to the high diversity and complexity of body shapes with the various clothing styles. Most of model-based approaches are limited to predict the shape and pose of a minimally clothed body with over-smoothing surface. Although capturing the fine detailed geometries, the model-free methods are lack of the fixed mesh topology. To address these issues, we propose a novel topology-preserved human reconstruction approach by bridging the gap between model-based and model-free human reconstruction. We present an end-to-end neural network that simultaneously predicts the pixel-aligned implicit surface and the explicit mesh model built by graph convolutional neural network. Moreover, an extra graph convolutional neural network is employed to estimate the vertex offsets between the implicit surface and parametric mesh model. Finally, we suggest an efficient implicit registration method to refine the neural network output in implicit space. Experiments on DeepHuman dataset showed that our approach is effective.
    Calibration and Auto-Refinement for Light Field Cameras. (arXiv:2106.06181v1 [cs.CV])
    (2 min) The ability to create an accurate three-dimensional reconstruction of a captured scene draws attention to the principles of light fields. This paper presents an approach for light field camera calibration and rectification, based on pairwise pattern-based parameters extraction. It is followed by a correspondence-based algorithm for camera parameters refinement from arbitrary scenes using the triangulation filter and nonlinear optimization. The effectiveness of our approach is validated on both real and synthetic data.
    Efficient Deep Learning Architectures for Fast Identification of Bacterial Strains in Resource-Constrained Devices. (arXiv:2106.06505v1 [cs.CV])
    (2 min) This work presents twelve fine-tuned deep learning architectures to solve the bacterial classification problem over the Digital Image of Bacterial Species Dataset. The base architectures were mainly published as mobile or efficient solutions to the ImageNet challenge, and all experiments presented in this work consisted of making several modifications to the original designs, in order to make them able to solve the bacterial classification problem by using fine-tuning and transfer learning techniques. This work also proposes a novel data augmentation technique for this dataset, which is based on the idea of artificial zooming, strongly increasing the performance of every tested architecture, even doubling it in some cases. In order to get robust and complete evaluations, all experiments were performed with 10-fold cross-validation and evaluated with five different metrics: top-1 and top-5 accuracy, precision, recall, and F1 score. This paper presents a complete comparison of the twelve different architectures, cross-validated with the original and the augmented version of the dataset, the results are also compared with several literature methods. Overall, eight of the eleven architectures surpassed the 0.95 scores in top-1 accuracy with our data augmentation method, being 0.9738 the highest top-1 accuracy. The impact of the data augmentation technique is reported with relative improvement scores.
    Predicting Next Local Appearance for Video Anomaly Detection. (arXiv:2106.06059v1 [cs.CV])
    (2 min) We present a local anomaly detection method in videos. As opposed to most existing methods that are computationally expensive and are not very generalizable across different video scenes, we propose an adversarial framework that learns the temporal local appearance variations by predicting the appearance of a normally behaving object in the next frame of a scene by only relying on its current and past appearances. In the presence of an abnormally behaving object, the reconstruction error between the real and the predicted next appearance of that object indicates the likelihood of an anomaly. Our method is competitive with the existing state-of-the-art while being significantly faster for both training and inference and being better at generalizing to unseen video scenes.
    Towards Online Monitoring and Data-driven Control: A Study of Segmentation Algorithms for Laser Powder Bed Fusion Processes. (arXiv:2011.09065v2 [eess.IV] UPDATED)
    (2 min) An increasing number of laser powder bed fusion machines use off-axis infrared cameras to improve online monitoring and data-driven control capabilities. However, there is still a severe lack of algorithmic solutions to properly process the infrared images from these cameras that has led to several key limitations: a lack of online monitoring capabilities for the laser tracks, insufficient pre-processing of the infrared images for data-driven methods, and large memory requirements for storing the infrared images. To address these limitations, we study over 30 segmentation algorithms that segment each infrared image into a foreground and background. By evaluating each algorithm based on its segmentation accuracy, computational speed, and spatter detection characteristics, we identify promising algorithmic solutions. The identified algorithms can be readily applied to the laser powder bed fusion machines to address each of the above limitations and thus, significantly improve process control.
    Sparse and Imperceptible Adversarial Attack via a Homotopy Algorithm. (arXiv:2106.06027v1 [cs.LG])
    (2 min) Sparse adversarial attacks can fool deep neural networks (DNNs) by only perturbing a few pixels (regularized by l_0 norm). Recent efforts combine it with another l_infty imperceptible on the perturbation magnitudes. The resultant sparse and imperceptible attacks are practically relevant, and indicate an even higher vulnerability of DNNs that we usually imagined. However, such attacks are more challenging to generate due to the optimization difficulty by coupling the l_0 regularizer and box constraints with a non-convex objective. In this paper, we address this challenge by proposing a homotopy algorithm, to jointly tackle the sparsity and the perturbation bound in one unified framework. Each iteration, the main step of our algorithm is to optimize an l_0-regularized adversarial loss, by leveraging the nonmonotone Accelerated Proximal Gradient Method (nmAPG) for nonconvex programming; it is followed by an l_0 change control step, and an optional post-attack step designed to escape bad local minima. We also extend the algorithm to handling the structural sparsity regularizer. We extensively examine the effectiveness of our proposed homotopy attack for both targeted and non-targeted attack scenarios, on CIFAR-10 and ImageNet datasets. Compared to state-of-the-art methods, our homotopy attack leads to significantly fewer perturbations, e.g., reducing 42.91% on CIFAR-10 and 75.03% on ImageNet (average case, targeted attack), at similar maximal perturbation magnitudes, when still achieving 100% attack success rates. Our codes are available at: https://github.com/VITA-Group/SparseADV_Homotopy.
    Small Object Detection for Near Real-Time Egocentric Perception in a Manual Assembly Scenario. (arXiv:2106.06403v1 [cs.CV])
    (2 min) Detecting small objects in video streams of head-worn augmented reality devices in near real-time is a huge challenge: training data is typically scarce, the input video stream can be of limited quality, and small objects are notoriously hard to detect. In industrial scenarios, however, it is often possible to leverage contextual knowledge for the detection of small objects. Furthermore, CAD data of objects are typically available and can be used to generate synthetic training data. We describe a near real-time small object detection pipeline for egocentric perception in a manual assembly scenario: We generate a training data set based on CAD data and realistic backgrounds in Unity. We then train a YOLOv4 model for a two-stage detection process: First, the context is recognized, then the small object of interest is detected. We evaluate our pipeline on the augmented reality device Microsoft Hololens 2.
    Spectral Unsupervised Domain Adaptation for Visual Recognition. (arXiv:2106.06112v1 [cs.CV])
    (2 min) Unsupervised domain adaptation (UDA) aims to learn a well-performed model in an unlabeled target domain by leveraging labeled data from one or multiple related source domains. It remains a great challenge due to 1) the lack of annotations in the target domain and 2) the rich discrepancy between the distributions of source and target data. We propose Spectral UDA (SUDA), an efficient yet effective UDA technique that works in the spectral space and is generic across different visual recognition tasks in detection, classification and segmentation. SUDA addresses UDA challenges from two perspectives. First, it mitigates inter-domain discrepancies by a spectrum transformer (ST) that maps source and target images into spectral space and learns to enhance domain-invariant spectra while suppressing domain-variant spectra simultaneously. To this end, we design novel adversarial multi-head spectrum attention that leverages contextual information to identify domain-variant and domain-invariant spectra effectively. Second, it mitigates the lack of annotations in target domain by introducing multi-view spectral learning which aims to learn comprehensive yet confident target representations by maximizing the mutual information among multiple ST augmentations capturing different spectral views of each target sample. Extensive experiments over different visual tasks (e.g., detection, classification and segmentation) show that SUDA achieves superior accuracy and it is also complementary with state-of-the-art UDA methods with consistent performance boosts but little extra computation.
    Object Segmentation Without Labels with Large-Scale Generative Models. (arXiv:2006.04988v2 [cs.LG] UPDATED)
    (2 min) The recent rise of unsupervised and self-supervised learning has dramatically reduced the dependency on labeled data, providing effective image representations for transfer to downstream vision tasks. Furthermore, recent works employed these representations in a fully unsupervised setup for image classification, reducing the need for human labels on the fine-tuning stage as well. This work demonstrates that large-scale unsupervised models can also perform a more challenging object segmentation task, requiring neither pixel-level nor image-level labeling. Namely, we show that recent unsupervised GANs allow to differentiate between foreground/background pixels, providing high-quality saliency masks. By extensive comparison on standard benchmarks, we outperform existing unsupervised alternatives for object segmentation, achieving new state-of-the-art.
    Fast Weakly Supervised Action Segmentation Using Mutual Consistency. (arXiv:1904.03116v4 [cs.CV] UPDATED)
    (2 min) Action segmentation is the task of predicting the actions for each frame of a video. As obtaining the full annotation of videos for action segmentation is expensive, weakly supervised approaches that can learn only from transcripts are appealing. In this paper, we propose a novel end-to-end approach for weakly supervised action segmentation based on a two-branch neural network. The two branches of our network predict two redundant but different representations for action segmentation and we propose a novel mutual consistency (MuCon) loss that enforces the consistency of the two redundant representations. Using the MuCon loss together with a loss for transcript prediction, our proposed approach achieves the accuracy of state-of-the-art approaches while being $14$ times faster to train and $20$ times faster during inference. The MuCon loss proves beneficial even in the fully supervised setting.
    PyGAD: An Intuitive Genetic Algorithm Python Library. (arXiv:2106.06158v1 [cs.NE])
    (2 min) This paper introduces PyGAD, an open-source easy-to-use Python library for building the genetic algorithm. PyGAD supports a wide range of parameters to give the user control over everything in its life cycle. This includes, but is not limited to, population, gene value range, gene data type, parent selection, crossover, and mutation. PyGAD is designed as a general-purpose optimization library that allows the user to customize the fitness function. Its usage consists of 3 main steps: build the fitness function, create an instance of the pygad.GA class, and calling the pygad.GA.run() method. The library supports training deep learning models created either with PyGAD itself or with frameworks like Keras and PyTorch. Given its stable state, PyGAD is also in active development to respond to the user's requested features and enhancement received on GitHub https://github.com/ahmedfgad/GeneticAlgorithmPython. PyGAD comes with documentation https://pygad.readthedocs.io for further details and examples.
    Refining Pseudo Labels with Clustering Consensus over Generations for Unsupervised Object Re-identification. (arXiv:2106.06133v1 [cs.CV])
    (2 min) Unsupervised object re-identification targets at learning discriminative representations for object retrieval without any annotations. Clustering-based methods conduct training with the generated pseudo labels and currently dominate this research direction. However, they still suffer from the issue of pseudo label noise. To tackle the challenge, we propose to properly estimate pseudo label similarities between consecutive training generations with clustering consensus and refine pseudo labels with temporally propagated and ensembled pseudo labels. To the best of our knowledge, this is the first attempt to leverage the spirit of temporal ensembling to improve classification with dynamically changing classes over generations. The proposed pseudo label refinery strategy is simple yet effective and can be seamlessly integrated into existing clustering-based unsupervised re-identification methods. With our proposed approach, state-of-the-art method can be further boosted with up to 8.8% mAP improvements on the challenging MSMT17 dataset.
    Fast, Accurate Barcode Detection in Ultra High-Resolution Images. (arXiv:2102.06868v2 [cs.CV] UPDATED)
    (2 min) Object detection in Ultra High-Resolution (UHR) images has long been a challenging problem in computer vision due to the varying scales of the targeted objects. When it comes to barcode detection, resizing UHR input images to smaller sizes often leads to the loss of pertinent information, while processing them directly is highly inefficient and computationally expensive. In this paper, we propose using semantic segmentation to achieve a fast and accurate detection of barcodes of various scales in UHR images. Our pipeline involves a modified Region Proposal Network (RPN) on images of size greater than 10k$\times$10k and a newly proposed Y-Net segmentation network, followed by a post-processing workflow for fitting a bounding box around each segmented barcode mask. The end-to-end system has a latency of 16 milliseconds, which is $2.5\times$ faster than YOLOv4 and $5.9\times$ faster than Mask R-CNN. In terms of accuracy, our method outperforms YOLOv4 and Mask R-CNN by a $mAP$ of 5.5% and 47.1% respectively, on a synthetic dataset. We have made available the generated synthetic barcode dataset and its code at this http URL
    Improving Anytime Prediction with Parallel Cascaded Networks and a Temporal-Difference Loss. (arXiv:2102.09808v3 [cs.LG] UPDATED)
    (2 min) Although deep feedforward neural networks share some characteristics with the primate visual system, a key distinction is their dynamics. Deep nets typically operate in serial stages wherein each layer completes its computation before processing begins in subsequent layers. In contrast, biological systems have cascaded dynamics: information propagates from neurons at all layers in parallel but transmission occurs gradually over time, leading to speed-accuracy trade offs even in feedforward architectures. We explore the consequences of biologically inspired parallel hardware by constructing cascaded ResNets in which each residual block has propagation delays but all blocks update in parallel in a stateful manner. Because information transmitted through skip connections avoids delays, the functional depth of the architecture increases over time, yielding anytime predictions that improve with internal-processing time. We introduce a temporal-difference training loss that achieves a strictly superior speed-accuracy profile over standard losses and enables the cascaded architecture to outperform state-of-the-art anytime-prediction methods. The cascaded architecture has intriguing properties, including: it classifies typical instances more rapidly than atypical instances; it is more robust to both persistent and transient noise than is a conventional ResNet; and its time-varying output trace provides a signal that can be exploited to improve information processing and inference.
    Finding Physical Adversarial Examples for Autonomous Driving with Fast and Differentiable Image Compositing. (arXiv:2010.08844v2 [cs.CV] UPDATED)
    (2 min) There is considerable evidence that deep neural networks are vulnerable to adversarial perturbations applied directly to their digital inputs. However, it remains an open question whether this translates to vulnerabilities in real systems. For example, an attack on self-driving cars would in practice entail modifying the driving environment, which then impacts the video inputs to the car's controller, thereby indirectly leading to incorrect driving decisions. Such attacks require accounting for system dynamics and tracking viewpoint changes. We propose a scalable approach for finding adversarial modifications of a simulated autonomous driving environment using a differentiable approximation for the mapping from environmental modifications (rectangles on the road) to the corresponding video inputs to the controller neural network. Given the parameters of the rectangles, our proposed differentiable mapping composites them onto pre-recorded video streams of the original environment, accounting for geometric and color variations. Moreover, we propose a multiple trajectory sampling approach that enables our attacks to be robust to a car's self-correcting behavior. When combined with a neural network-based controller, our approach allows the design of adversarial modifications through end-to-end gradient-based optimization. Using the Carla autonomous driving simulator, we show that our approach is significantly more scalable and far more effective at identifying autonomous vehicle vulnerabilities in simulation experiments than a state-of-the-art approach based on Bayesian Optimization.
    Real-Time Global Illumination Decomposition of Videos. (arXiv:1908.01961v2 [cs.CV] UPDATED)
    (2 min) We propose the first approach for the decomposition of a monocular color video into direct and indirect illumination components in real time. We retrieve, in separate layers, the contribution made to the scene appearance by the scene reflectance, the light sources and the reflections from various coherent scene regions to one another. Existing techniques that invert global light transport require image capture under multiplexed controlled lighting, or only enable the decomposition of a single image at slow off-line frame rates. In contrast, our approach works for regular videos and produces temporally coherent decomposition layers at real-time frame rates. At the core of our approach are several sparsity priors that enable the estimation of the per-pixel direct and indirect illumination layers based on a small set of jointly estimated base reflectance colors. The resulting variational decomposition problem uses a new formulation based on sparse and dense sets of non-linear equations that we solve efficiently using a novel alternating data-parallel optimization strategy. We evaluate our approach qualitatively and quantitatively, and show improvements over the state of the art in this field, in both quality and runtime. In addition, we demonstrate various real-time appearance editing applications for videos with consistent illumination.
    Coordinate Independent Convolutional Networks -- Isometry and Gauge Equivariant Convolutions on Riemannian Manifolds. (arXiv:2106.06020v1 [cs.LG])
    (2 min) Motivated by the vast success of deep convolutional networks, there is a great interest in generalizing convolutions to non-Euclidean manifolds. A major complication in comparison to flat spaces is that it is unclear in which alignment a convolution kernel should be applied on a manifold. The underlying reason for this ambiguity is that general manifolds do not come with a canonical choice of reference frames (gauge). Kernels and features therefore have to be expressed relative to arbitrary coordinates. We argue that the particular choice of coordinatization should not affect a network's inference -- it should be coordinate independent. A simultaneous demand for coordinate independence and weight sharing is shown to result in a requirement on the network to be equivariant under local gauge transformations (changes of local reference frames). The ambiguity of reference frames depends thereby on the G-structure of the manifold, such that the necessary level of gauge equivariance is prescribed by the corresponding structure group G. Coordinate independent convolutions are proven to be equivariant w.r.t. those isometries that are symmetries of the G-structure. The resulting theory is formulated in a coordinate free fashion in terms of fiber bundles. To exemplify the design of coordinate independent convolutions, we implement a convolutional network on the M\"obius strip. The generality of our differential geometric formulation of convolutional networks is demonstrated by an extensive literature review which explains a large number of Euclidean CNNs, spherical CNNs and CNNs on general surfaces as specific instances of coordinate independent convolutions.
    Towards Real-World Blind Face Restoration with Generative Facial Prior. (arXiv:2101.04061v2 [cs.CV] UPDATED)
    (2 min) Blind face restoration usually relies on facial priors, such as facial geometry prior or reference prior, to restore realistic and faithful details. However, very low-quality inputs cannot offer accurate geometric prior while high-quality references are inaccessible, limiting the applicability in real-world scenarios. In this work, we propose GFP-GAN that leverages rich and diverse priors encapsulated in a pretrained face GAN for blind face restoration. This Generative Facial Prior (GFP) is incorporated into the face restoration process via novel channel-split spatial feature transform layers, which allow our method to achieve a good balance of realness and fidelity. Thanks to the powerful generative facial prior and delicate designs, our GFP-GAN could jointly restore facial details and enhance colors with just a single forward pass, while GAN inversion methods require expensive image-specific optimization at inference. Extensive experiments show that our method achieves superior performance to prior art on both synthetic and real-world datasets.
    A deep learning approach to clustering visual arts. (arXiv:2106.06234v1 [cs.CV])
    (2 min) Clustering artworks is difficult for several reasons. On the one hand, recognizing meaningful patterns based on domain knowledge and visual perception is extremely hard. On the other hand, applying traditional clustering and feature reduction techniques to the highly dimensional pixel space can be ineffective. To address these issues, in this paper we propose DELIUS: a DEep learning approach to cLustering vIsUal artS. The method uses a pre-trained convolutional network to extract features and then feeds these features into a deep embedded clustering model, where the task of mapping the raw input data to a latent space is jointly optimized with the task of finding a set of cluster centroids in this latent space. Quantitative and qualitative experimental results show the effectiveness of the proposed method. DELIUS can be useful for several tasks related to art analysis, in particular visual link retrieval and historical knowledge discovery in painting datasets.
    Overcoming Difficulty in Obtaining Dark-skinned Subjects for Remote-PPG by Synthetic Augmentation. (arXiv:2106.06007v1 [cs.CV])
    (2 min) Camera-based remote photoplethysmography (rPPG) provides a non-contact way to measure physiological signals (e.g., heart rate) using facial videos. Recent deep learning architectures have improved the accuracy of such physiological measurement significantly, yet they are restricted by the diversity of the annotated videos. The existing datasets MMSE-HR, AFRL, and UBFC-RPPG contain roughly 10%, 0%, and 5% of dark-skinned subjects respectively. The unbalanced training sets result in a poor generalization capability to unseen subjects and lead to unwanted bias toward different demographic groups. In Western academia, it is regrettably difficult in a university setting to collect data on these dark-skinned subjects. Here we show a first attempt to overcome the lack of dark-skinned subjects by synthetic augmentation. A joint optimization framework is utilized to translate real videos from light-skinned subjects to dark skin tones while retaining their pulsatile signals. In the experiment, our method exhibits around 31% reduction in mean absolute error for the dark-skinned group and 46% improvement on bias mitigation for all the groups, as compared with the previous work trained with just real samples.
    An Image Forensic Technique Based on JPEG Ghosts. (arXiv:2106.06439v1 [cs.CV])
    (2 min) The unprecedented growth in the easy availability of photo-editing tools has endangered the power of digital images.An image was supposed to be worth more than a thousand words,but now this can be said only if it can be authenticated orthe integrity of the image can be proved to be intact. In thispaper, we propose a digital image forensic technique for JPEG images. It can detect any forgery in the image if the forged portion called a ghost image is having a compression quality different from that of the cover image. It is based on resaving the JPEG image at different JPEG qualities, and the detection of the forged portion is maximum when it is saved at the same JPEG quality as the cover image. Also, we can precisely predictthe JPEG quality of the cover image by analyzing the similarity using Structural Similarity Index Measure (SSIM) or the energyof the images. The first maxima in SSIM or the first minima inenergy correspond to the cover image JPEG quality. We created adataset for varying JPEG compression qualities of the ghost and the cover images and validated the scalability of the experimental results.We also, experimented with varied attack scenarios, e.g. high-quality ghost image embedded in low quality of cover image,low-quality ghost image embedded in high-quality of cover image,and ghost image and cover image both at the same quality.The proposed method is able to localize the tampered portions accurately even for forgeries as small as 10x10 sized pixel blocks.Our technique is also robust against other attack scenarios like copy-move forgery, inserting text into image, rescaling (zoom-out/zoom-in) ghost image and then pasting on cover image.
    Progressive-Scale Boundary Blackbox Attack via Projective Gradient Estimation. (arXiv:2106.06056v1 [cs.LG])
    (2 min) Boundary based blackbox attack has been recognized as practical and effective, given that an attacker only needs to access the final model prediction. However, the query efficiency of it is in general high especially for high dimensional image data. In this paper, we show that such efficiency highly depends on the scale at which the attack is applied, and attacking at the optimal scale significantly improves the efficiency. In particular, we propose a theoretical framework to analyze and show three key characteristics to improve the query efficiency. We prove that there exists an optimal scale for projective gradient estimation. Our framework also explains the satisfactory performance achieved by existing boundary black-box attacks. Based on our theoretical framework, we propose Progressive-Scale enabled projective Boundary Attack (PSBA) to improve the query efficiency via progressive scaling techniques. In particular, we employ Progressive-GAN to optimize the scale of projections, which we call PSBA-PGAN. We evaluate our approach on both spatial and frequency scales. Extensive experiments on MNIST, CIFAR-10, CelebA, and ImageNet against different models including a real-world face recognition API show that PSBA-PGAN significantly outperforms existing baseline attacks in terms of query efficiency and attack success rate. We also observe relatively stable optimal scales for different models and datasets. The code is publicly available at https://github.com/AI-secure/PSBA.
    Conterfactual Generative Zero-Shot Semantic Segmentation. (arXiv:2106.06360v1 [cs.CV])
    (2 min) zero-shot learning is an essential part of computer vision. As a classical downstream task, zero-shot semantic segmentation has been studied because of its applicant value. One of the popular zero-shot semantic segmentation methods is based on the generative model Most new proposed works added structures on the same architecture to enhance this model. However, we found that, from the view of causal inference, the result of the original model has been influenced by spurious statistical relationships. Thus the performance of the prediction shows severe bias. In this work, we consider counterfactual methods to avoid the confounder in the original model. Based on this method, we proposed a new framework for zero-shot semantic segmentation. Our model is compared with baseline models on two real-world datasets, Pascal-VOC and Pascal-Context. The experiment results show proposed models can surpass previous confounded models and can still make use of additional structures to improve the performance. We also design a simple structure based on Graph Convolutional Networks (GCN) in this work.
    SimSwap: An Efficient Framework For High Fidelity Face Swapping. (arXiv:2106.06340v1 [cs.CV])
    (2 min) We propose an efficient framework, called Simple Swap (SimSwap), aiming for generalized and high fidelity face swapping. In contrast to previous approaches that either lack the ability to generalize to arbitrary identity or fail to preserve attributes like facial expression and gaze direction, our framework is capable of transferring the identity of an arbitrary source face into an arbitrary target face while preserving the attributes of the target face. We overcome the above defects in the following two ways. First, we present the ID Injection Module (IIM) which transfers the identity information of the source face into the target face at feature level. By using this module, we extend the architecture of an identity-specific face swapping algorithm to a framework for arbitrary face swapping. Second, we propose the Weak Feature Matching Loss which efficiently helps our framework to preserve the facial attributes in an implicit way. Extensive experiments on wild faces demonstrate that our SimSwap is able to achieve competitive identity performance while preserving attributes better than previous state-of-the-art methods. The code is already available on github: https://github.com/neuralchen/SimSwap.
    Part-aware Panoptic Segmentation. (arXiv:2106.06351v1 [cs.CV])
    (2 min) In this work, we introduce the new scene understanding task of Part-aware Panoptic Segmentation (PPS), which aims to understand a scene at multiple levels of abstraction, and unifies the tasks of scene parsing and part parsing. For this novel task, we provide consistent annotations on two commonly used datasets: Cityscapes and Pascal VOC. Moreover, we present a single metric to evaluate PPS, called Part-aware Panoptic Quality (PartPQ). For this new task, using the metric and annotations, we set multiple baselines by merging results of existing state-of-the-art methods for panoptic segmentation and part segmentation. Finally, we conduct several experiments that evaluate the importance of the different levels of abstraction in this single task.
    ViT-Inception-GAN for Image Colourising. (arXiv:2106.06321v1 [cs.CV])
    (2 min) Studies involving colourising images has been garnering researchers' keen attention over time, assisted by significant advances in various Machine Learning techniques and compute power availability. Traditionally, colourising images have been an intricate task that gave a substantial degree of freedom during the assignment of chromatic information. In our proposed method, we attempt to colourise images using Vision Transformer - Inception - Generative Adversarial Network (ViT-I-GAN), which has an Inception-v3 fusion embedding in the generator. For a stable and robust network, we have used Vision Transformer (ViT) as the discriminator. We trained the model on the Unsplash and the COCO dataset for demonstrating the improvement made by the Inception-v3 embedding. We have compared the results between ViT-GANs with and without Inception-v3 embedding.
    Pedestrian Attribute Recognition in Video Surveillance Scenarios Based on View-attribute Attention Localization. (arXiv:2106.06485v1 [cs.CV])
    (2 min) Pedestrian attribute recognition in surveillance scenarios is still a challenging task due to inaccurate localization of specific attributes. In this paper, we propose a novel view-attribute localization method based on attention (VALA), which relies on the strong relevance between attributes and views to capture specific view-attributes and to localize attribute-corresponding areas by attention mechanism. A specific view-attribute is composed by the extracted attribute feature and four view scores which are predicted by view predictor as the confidences for attribute from different views. View-attribute is then delivered back to shallow network layers for supervising deep feature extraction. To explore the location of a view-attribute, regional attention is introduced to aggregate spatial information of the input attribute feature in height and width direction for constraining the image into a narrow range. Moreover, the inter-channel dependency of view-feature is embedded in the above two spatial directions. An attention attribute-specific region is gained after fining the narrow range by balancing the ratio of channel dependencies between height and width branches. The final view-attribute recognition outcome is obtained by combining the output of regional attention with the view scores from view predictor. Experiments on three wide datasets (RAP, RAPv2, PETA, and PA-100K) demonstrate the effectiveness of our approach compared with state-of-the-art methods.
    Can we have it all? On the Trade-off between Spatial and Adversarial Robustness of Neural Networks. (arXiv:2002.11318v4 [cs.LG] UPDATED)
    (2 min) (Non-)robustness of neural networks to small, adversarial pixel-wise perturbations, and as more recently shown, to even random spatial transformations (e.g., translations, rotations) entreats both theoretical and empirical understanding. Spatial robustness to random translations and rotations is commonly attained via equivariant models (e.g., StdCNNs, GCNNs) and training augmentation, whereas adversarial robustness is typically achieved by adversarial training. In this paper, we prove a quantitative trade-off between spatial and adversarial robustness in a simple statistical setting. We complement this empirically by showing that: (a) as the spatial robustness of equivariant models improves by training augmentation with progressively larger transformations, their adversarial robustness worsens progressively, and (b) as the state-of-the-art robust models are adversarially trained with progressively larger pixel-wise perturbations, their spatial robustness drops progressively. Towards achieving pareto-optimality in this trade-off, we propose a method based on curriculum learning that trains gradually on more difficult perturbations (both spatial and adversarial) to improve spatial and adversarial robustness simultaneously.
    Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. (arXiv:2102.05918v2 [cs.CV] UPDATED)
    (2 min) Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enables zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
    Shallow Optical Flow Three-Stream CNN for Macro- and Micro-Expression Spotting from Long Videos. (arXiv:2106.06489v1 [cs.CV])
    (2 min) Facial expressions vary from the visible to the subtle. In recent years, the analysis of micro-expressions $-$ a natural occurrence resulting from the suppression of one's true emotions, has drawn the attention of researchers with a broad range of potential applications. However, spotting microexpressions in long videos becomes increasingly challenging when intertwined with normal or macro-expressions. In this paper, we propose a shallow optical flow three-stream CNN (SOFTNet) model to predict a score that captures the likelihood of a frame being in an expression interval. By fashioning the spotting task as a regression problem, we introduce pseudo-labeling to facilitate the learning process. We demonstrate the efficacy and efficiency of the proposed approach on the recent MEGC 2020 benchmark, where state-of-the-art performance is achieved on CAS(ME)$^{2}$ with equally promising results on SAMM Long Videos.
    MlTr: Multi-label Classification with Transformer. (arXiv:2106.06195v1 [cs.CV])
    (2 min) The task of multi-label image classification is to recognize all the object labels presented in an image. Though advancing for years, small objects, similar objects and objects with high conditional probability are still the main bottlenecks of previous convolutional neural network(CNN) based models, limited by convolutional kernels' representational capacity. Recent vision transformer networks utilize the self-attention mechanism to extract the feature of pixel granularity, which expresses richer local semantic information, while is insufficient for mining global spatial dependence. In this paper, we point out the three crucial problems that CNN-based methods encounter and explore the possibility of conducting specific transformer modules to settle them. We put forward a Multi-label Transformer architecture(MlTr) constructed with windows partitioning, in-window pixel attention, cross-window attention, particularly improving the performance of multi-label image classification tasks. The proposed MlTr shows state-of-the-art results on various prevalent multi-label datasets such as MS-COCO, Pascal-VOC, and NUS-WIDE with 88.5%, 95.8%, and 65.5% respectively. The code will be available soon at https://github.com/starmemda/MlTr/
    Antipodal Robotic Grasping using Generative Residual Convolutional Neural Network. (arXiv:1909.04810v4 [cs.RO] UPDATED)
    (2 min) In this paper, we present a modular robotic system to tackle the problem of generating and performing antipodal robotic grasps for unknown objects from n-channel image of the scene. We propose a novel Generative Residual Convolutional Neural Network (GR-ConvNet) model that can generate robust antipodal grasps from n-channel input at real-time speeds (~20ms). We evaluate the proposed model architecture on standard datasets and a diverse set of household objects. We achieved state-of-the-art accuracy of 97.7% and 94.6% on Cornell and Jacquard grasping datasets respectively. We also demonstrate a grasp success rate of 95.4% and 93% on household and adversarial objects respectively using a 7 DoF robotic arm.
    Dynamic Neural Networks: A Survey. (arXiv:2102.04906v3 [cs.CV] UPDATED)
    (2 min) Dynamic neural network is an emerging research topic in deep learning. Compared to static models which have fixed computational graphs and parameters at the inference stage, dynamic networks can adapt their structures or parameters to different inputs, leading to notable advantages in terms of accuracy, computational efficiency, adaptiveness, etc. In this survey, we comprehensively review this rapidly developing area by dividing dynamic networks into three main categories: 1) instance-wise dynamic models that process each instance with data-dependent architectures or parameters; 2) spatial-wise dynamic networks that conduct adaptive computation with respect to different spatial locations of image data and 3) temporal-wise dynamic models that perform adaptive inference along the temporal dimension for sequential data such as videos and texts. The important research problems of dynamic networks, e.g., architecture design, decision making scheme, optimization technique and applications, are reviewed systematically. Finally, we discuss the open problems in this field together with interesting future research directions.
    Gaussian Bounding Boxes and Probabilistic Intersection-over-Union for Object Detection. (arXiv:2106.06072v1 [cs.CV])
    (2 min) Most object detection methods use bounding boxes to encode and represent the object shape and location. In this work, we explore a fuzzy representation of object regions using Gaussian distributions, which provides an implicit binary representation as (potentially rotated) ellipses. We also present a similarity measure for the Gaussian distributions based on the Hellinger Distance, which can be viewed as a Probabilistic Intersection-over-Union (ProbIoU). Our experimental results show that the proposed Gaussian representations are closer to annotated segmentation masks in publicly available datasets, and that loss functions based on ProbIoU can be successfully used to regress the parameters of the Gaussian representation. Furthermore, we present a simple mapping scheme from traditional (or rotated) bounding boxes to Gaussian representations, allowing the proposed ProbIoU-based losses to be seamlessly integrated into any object detector.
    3D Semantic Scene Completion: a Survey. (arXiv:2103.07466v2 [cs.CV] UPDATED)
    (2 min) Semantic Scene Completion (SSC) aims to jointly estimate the complete geometry and semantics of a scene, assuming partial sparse input. In the last years following the multiplication of large-scale 3D datasets, SSC has gained significant momentum in the research community because it holds unresolved challenges. Specifically, SSC lies in the ambiguous completion of large unobserved areas and the weak supervision signal of the ground truth. This led to a substantially increasing number of papers on the matter. This survey aims to identify, compare and analyze the techniques providing a critical analysis of the SSC literature on both methods and datasets. Throughout the paper, we provide an in-depth analysis of the existing works covering all choices made by the authors while highlighting the remaining avenues of research. SSC performance of the SoA on the most popular datasets is also evaluated and analyzed.
    Attention-based Partial Face Recognition. (arXiv:2106.06415v1 [cs.CV])
    (2 min) Photos of faces captured in unconstrained environments, such as large crowds, still constitute challenges for current face recognition approaches as often faces are occluded by objects or people in the foreground. However, few studies have addressed the task of recognizing partial faces. In this paper, we propose a novel approach to partial face recognition capable of recognizing faces with different occluded areas. We achieve this by combining attentional pooling of a ResNet's intermediate feature maps with a separate aggregation module. We further adapt common losses to partial faces in order to ensure that the attention maps are diverse and handle occluded parts. Our thorough analysis demonstrates that we outperform all baselines under multiple benchmark protocols, including naturally and synthetically occluded partial faces. This suggests that our method successfully focuses on the relevant parts of the occluded face.
    Compositional Video Synthesis with Action Graphs. (arXiv:2006.15327v4 [cs.CV] UPDATED)
    (2 min) Videos of actions are complex signals containing rich compositional structure in space and time. Current video generation methods lack the ability to condition the generation on multiple coordinated and potentially simultaneous timed actions. To address this challenge, we propose to represent the actions in a graph structure called Action Graph and present the new ``Action Graph To Video'' synthesis task. Our generative model for this task (AG2Vid) disentangles motion and appearance features, and by incorporating a scheduling mechanism for actions facilitates a timely and coordinated video generation. We train and evaluate AG2Vid on the CATER and Something-Something V2 datasets, and show that the resulting videos have better visual quality and semantic consistency compared to baselines. Finally, our model demonstrates zero-shot abilities by synthesizing novel compositions of the learned actions. For code and pretrained models, see the project page https://roeiherz.github.io/AG2Video
    What is Multimodality?. (arXiv:2103.06304v3 [cs.AI] UPDATED)
    (2 min) The last years have shown rapid developments in the field of multimodal machine learning, combining e.g., vision, text or speech. In this position paper we explain how the field uses outdated definitions of multimodality that prove unfit for the machine learning era. We propose a new task-relative definition of (multi)modality in the context of multimodal machine learning that focuses on representations and information that are relevant for a given machine learning task. With our new definition of multimodality we aim to provide a missing foundation for multimodal research, an important component of language grounding and a crucial milestone towards NLU.
    Neural Network Modeling of Probabilities for Coding the Octree Representation of Point Clouds. (arXiv:2106.06482v1 [cs.CV])
    (2 min) This paper describes a novel lossless point cloud compression algorithm that uses a neural network for estimating the coding probabilities for the occupancy status of voxels, depending on wide three dimensional contexts around the voxel to be encoded. The point cloud is represented as an octree, with each resolution layer being sequentially encoded and decoded using arithmetic coding, starting from the lowest resolution, until the final resolution is reached. The occupancy probability of each voxel of the splitting pattern at each node of the octree is modeled by a neural network, having at its input the already encoded occupancy status of several octree nodes (belonging to the past and current resolutions), corresponding to a 3D context surrounding the node to be encoded. The algorithm has a fast and a slow version, the fast version selecting differently several voxels of the context, which allows an increased parallelization by sending larger batches of templates to be estimated by the neural network, at both encoder and decoder. The proposed algorithms yield state-of-the-art results on benchmark datasets. The implementation will be made available at https://github.com/marmus12/nnctx
    BiPointNet: Binary Neural Network for Point Clouds. (arXiv:2010.05501v4 [cs.CV] UPDATED)
    (2 min) To alleviate the resource constraint for real-time point cloud applications that run on edge devices, in this paper we present BiPointNet, the first model binarization approach for efficient deep learning on point clouds. We discover that the immense performance drop of binarized models for point clouds mainly stems from two challenges: aggregation-induced feature homogenization that leads to a degradation of information entropy, and scale distortion that hinders optimization and invalidates scale-sensitive structures. With theoretical justifications and in-depth analysis, our BiPointNet introduces Entropy-Maximizing Aggregation (EMA) to modulate the distribution before aggregation for the maximum information entropy, and Layer-wise Scale Recovery (LSR) to efficiently restore feature representation capacity. Extensive experiments show that BiPointNet outperforms existing binarization methods by convincing margins, at the level even comparable with the full precision counterpart. We highlight that our techniques are generic, guaranteeing significant improvements on various fundamental tasks and mainstream backbones. Moreover, BiPointNet gives an impressive 14.7x speedup and 18.9x storage saving on real-world resource-constrained devices.
    Instance-Level Task Parameters: A Robust Multi-task Weighting Framework. (arXiv:2106.06129v1 [cs.CV])
    (2 min) Recent works have shown that deep neural networks benefit from multi-task learning by learning a shared representation across several related tasks. However, performance of such systems depend on relative weighting between various losses involved during training. Prior works on loss weighting schemes assume that instances are equally easy or hard for all tasks. In order to break this assumption, we let the training process dictate the optimal weighting of tasks for every instance in the dataset. More specifically, we equip every instance in the dataset with a set of learnable parameters (instance-level task parameters) where the cardinality is equal to the number of tasks learned by the model. These parameters model the weighting of each task for an instance. They are updated by gradient descent and do not require hand-crafted rules. We conduct extensive experiments on SURREAL and CityScapes datasets, for human shape and pose estimation, depth estimation and semantic segmentation tasks. In these tasks, our approach outperforms recent dynamic loss weighting approaches, e.g. reducing surface estimation errors by 8.97% on SURREAL. When applied to datasets where one or more tasks can have noisy annotations, the proposed method learns to prioritize learning from clean labels for a given task, e.g. reducing surface estimation errors by up to 60%. We also show that we can reliably detect corrupt labels for a given task as a by-product from learned instance-level task parameters.
    Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning. (arXiv:2106.06047v1 [cs.LG])
    (2 min) Federated learning is an emerging research paradigm enabling collaborative training of machine learning models among different organizations while keeping data private at each institution. Despite recent progress, there remain fundamental challenges such as lack of convergence and potential for catastrophic forgetting in federated learning across real-world heterogeneous devices. In this paper, we demonstrate that attention-based architectures (e.g., Transformers) are fairly robust to distribution shifts and hence improve federated learning over heterogeneous data. Concretely, we conduct the first rigorous empirical investigation of different neural architectures across a range of federated algorithms, real-world benchmarks, and heterogeneous data splits. Our experiments show that simply replacing convolutional networks with Transformers can greatly reduce catastrophic forgetting of previous devices, accelerate convergence, and reach a better global model, especially when dealing with heterogeneous data. We will release our code and pretrained models at https://github.com/Liangqiong/ViT-FL-main to encourage future exploration in robust architectures as an alternative to current research efforts on the optimization front.
    Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training. (arXiv:2106.03640v2 [cs.LG] UPDATED)
    (2 min) Much recent research has been dedicated to improving the efficiency of training and inference for image classification. This effort has commonly focused on explicitly improving theoretical efficiency, often measured as ImageNet validation accuracy per FLOP. These theoretical savings have, however, proven challenging to achieve in practice, particularly on high-performance training accelerators. In this work, we focus on improving the practical efficiency of the state-of-the-art EfficientNet models on a new class of accelerator, the Graphcore IPU. We do this by extending this family of models in the following ways: (i) generalising depthwise convolutions to group convolutions; (ii) adding proxy-normalized activations to match batch normalization performance with batch-independent statistics; (iii) reducing compute by lowering the training resolution and inexpensively fine-tuning at higher resolution. We find that these three methods improve the practical efficiency for both training and inference. Our code will be made available online.
    Black-box Explanation of Object Detectors via Saliency Maps. (arXiv:2006.03204v2 [cs.CV] UPDATED)
    (2 min) We propose D-RISE, a method for generating visual explanations for the predictions of object detectors. Utilizing the proposed similarity metric that accounts for both localization and categorization aspects of object detection allows our method to produce saliency maps that show image areas that most affect the prediction. D-RISE can be considered "black-box" in the software testing sense, as it only needs access to the inputs and outputs of an object detector. Compared to gradient-based methods, D-RISE is more general and agnostic to the particular type of object detector being tested, and does not need knowledge of the inner workings of the model. We show that D-RISE can be easily applied to different object detectors including one-stage detectors such as YOLOv3 and two-stage detectors such as Faster-RCNN. We present a detailed analysis of the generated visual explanations to highlight the utilization of context and possible biases learned by object detectors.
    COVID-19 Classification Using Staked Ensembles: A Comprehensive Analysis. (arXiv:2010.05690v2 [cs.CV] UPDATED)
    (2 min) The issue of COVID-19, increasing with a massive mortality rate. This led to the WHO declaring it as a pandemic. In this situation, it is crucial to perform efficient and fast diagnosis. The reverse transcript polymerase chain reaction (RTPCR) test is conducted to detect the presence of SARS-CoV-2. This test is time-consuming and instead chest CT (or Chest X-ray) can be used for a fast and accurate diagnosis. Automated diagnosis is considered to be important as it reduces human effort and provides accurate and low-cost tests. The contributions of our research are three-fold. First, it is aimed to analyse the behaviour and performance of variant vision models ranging from Inception to NAS networks with the appropriate fine-tuning procedure. Second, the behaviour of these models is visually analysed by plotting CAMs for individual networks and determining classification performance with AUCROC curves. Thirdly, stacked ensembles techniques are imparted to provide higher generalisation on combining the fine-tuned models, in which six ensemble neural networks are designed by combining the existing fine-tuned networks. Implying these stacked ensembles provides a great generalization to the models. The ensemble model designed by combining all the fine-tuned networks obtained a state-of-the-art accuracy score of 99.17%. The precision and recall for the COVID-19 class are 99.99% and 89.79% respectively, which resembles the robustness of the stacked ensembles.
    K-shot NAS: Learnable Weight-Sharing for NAS with K-shot Supernets. (arXiv:2106.06442v1 [cs.CV])
    (2 min) In one-shot weight sharing for NAS, the weights of each operation (at each layer) are supposed to be identical for all architectures (paths) in the supernet. However, this rules out the possibility of adjusting operation weights to cater for different paths, which limits the reliability of the evaluation results. In this paper, instead of counting on a single supernet, we introduce $K$-shot supernets and take their weights for each operation as a dictionary. The operation weight for each path is represented as a convex combination of items in a dictionary with a simplex code. This enables a matrix approximation of the stand-alone weight matrix with a higher rank ($K>1$). A \textit{simplex-net} is introduced to produce architecture-customized code for each path. As a result, all paths can adaptively learn how to share weights in the $K$-shot supernets and acquire corresponding weights for better evaluation. $K$-shot supernets and simplex-net can be iteratively trained, and we further extend the search to the channel dimension. Extensive experiments on benchmark datasets validate that K-shot NAS significantly improves the evaluation accuracy of paths and thus brings in impressive performance improvements.
    Learning Intra-Batch Connections for Deep Metric Learning. (arXiv:2102.07753v3 [cs.CV] UPDATED)
    (2 min) The goal of metric learning is to learn a function that maps samples to a lower-dimensional space where similar samples lie closer than dissimilar ones. Particularly, deep metric learning utilizes neural networks to learn such a mapping. Most approaches rely on losses that only take the relations between pairs or triplets of samples into account, which either belong to the same class or two different classes. However, these methods do not explore the embedding space in its entirety. To this end, we propose an approach based on message passing networks that takes all the relations in a mini-batch into account. We refine embedding vectors by exchanging messages among all samples in a given batch allowing the training process to be aware of its overall structure. Since not all samples are equally important to predict a decision boundary, we use an attention mechanism during message passing to allow samples to weigh the importance of each neighbor accordingly. We achieve state-of-the-art results on clustering and image retrieval on the CUB-200-2011, Cars196, Stanford Online Products, and In-Shop Clothes datasets. To facilitate further research, we make available the code and the models at https://github.com/dvl-tum/intra_batch_connections.
    Recovery of Meteorites Using an Autonomous Drone and Machine Learning. (arXiv:2106.06523v1 [astro-ph.EP])
    (2 min) The recovery of freshly fallen meteorites from tracked and triangulated meteors is critical to determining their source asteroid families. However, locating meteorite fragments in strewn fields remains a challenge with very few meteorites being recovered from the meteors triangulated in past and ongoing meteor camera networks. We examined if locating meteorites can be automated using machine learning and an autonomous drone. Drones can be programmed to fly a grid search pattern and take systematic pictures of the ground over a large survey area. Those images can be analyzed using a machine learning classifier to identify meteorites in the field among many other features. Here, we describe a proof-of-concept meteorite classifier that deploys off-line a combination of different convolution neural networks to recognize meteorites from images taken by drones in the field. The system was implemented in a conceptual drone setup and tested in the suspected strewn field of a recent meteorite fall near Walker Lake, Nevada.
    A Framework to Enhance Generalization of Deep Metric Learning methods using General Discriminative Feature Learning and Class Adversarial Neural Networks. (arXiv:2106.06420v1 [cs.CV])
    (2 min) Metric learning algorithms aim to learn a distance function that brings the semantically similar data items together and keeps dissimilar ones at a distance. The traditional Mahalanobis distance learning is equivalent to find a linear projection. In contrast, Deep Metric Learning (DML) methods are proposed that automatically extract features from data and learn a non-linear transformation from input space to a semantically embedding space. Recently, many DML methods are proposed focused to enhance the discrimination power of the learned metric by providing novel sampling strategies or loss functions. This approach is very helpful when both the training and test examples are coming from the same set of categories. However, it is less effective in many applications of DML such as image retrieval and person-reidentification. Here, the DML should learn general semantic concepts from observed classes and employ them to rank or identify objects from unseen categories. Neglecting the generalization ability of the learned representation and just emphasizing to learn a more discriminative embedding on the observed classes may lead to the overfitting problem. To address this limitation, we propose a framework to enhance the generalization power of existing DML methods in a Zero-Shot Learning (ZSL) setting by general yet discriminative representation learning and employing a class adversarial neural network. To learn a more general representation, we propose to employ feature maps of intermediate layers in a deep neural network and enhance their discrimination power through an attention mechanism. Besides, a class adversarial network is utilized to enforce the deep model to seek class invariant features for the DML task. We evaluate our work on widely used machine vision datasets in a ZSL setting.
    AugNet: End-to-End Unsupervised Visual Representation Learning with Image Augmentation. (arXiv:2106.06250v1 [cs.CV])
    (2 min) Most of the achievements in artificial intelligence so far were accomplished by supervised learning which requires numerous annotated training data and thus costs innumerable manpower for labeling. Unsupervised learning is one of the effective solutions to overcome such difficulties. In our work, we propose AugNet, a new deep learning training paradigm to learn image features from a collection of unlabeled pictures. We develop a method to construct the similarities between pictures as distance metrics in the embedding space by leveraging the inter-correlation between augmented versions of samples. Our experiments demonstrate that the method is able to represent the image in low dimensional space and performs competitively in downstream tasks such as image classification and image similarity comparison. Specifically, we achieved over 60% and 27% accuracy on the STL10 and CIFAR100 datasets with unsupervised clustering, respectively. Moreover, unlike many deep-learning-based image retrieval algorithms, our approach does not require access to external annotated datasets to train the feature extractor, but still shows comparable or even better feature representation ability and easy-to-use characteristics. In our evaluations, the method outperforms all the state-of-the-art image retrieval algorithms on some out-of-domain image datasets. The code for the model implementation is available at https://github.com/chenmingxiang110/AugNet.
  • cs.IR updates on arXiv.org

    Anytime Ranking on Document-Ordered Indexes. (arXiv:2104.08976v2 [cs.IR] UPDATED)
    (2 min) Inverted indexes continue to be a mainstay of text search engines, allowing efficient querying of large document collections. While there are a number of possible organizations, document-ordered indexes are the most common, since they are amenable to various query types, support index updates, and allow for efficient dynamic pruning operations. One disadvantage with document-ordered indexes is that high-scoring documents can be distributed across the document identifier space, meaning that index traversal algorithms that terminate early might put search effectiveness at risk. The alternative is impact-ordered indexes, which primarily support top-k disjunctions, but also allow for anytime query processing, where the search can be terminated at any time, with search quality improving as processing latency increases. Anytime query processing can be used to effectively reduce high-percentile tail latency which is essential for operational scenarios in which a service level agreement (SLA) imposes response time requirements. In this work, we show how document-ordered indexes can be organized such that they can be queried in an anytime fashion, enabling strict latency control with effective early termination. Our experiments show that processing document-ordered topical segments selected by a simple score estimator outperforms existing anytime algorithms, and allows query runtimes to be accurately limited in order to comply with SLA requirements.
    Predicting Knowledge Gain during Web Search based on Multimedia Resource Consumption. (arXiv:2106.06244v1 [cs.IR])
    (2 min) In informal learning scenarios the popularity of multimedia content, such as video tutorials or lectures, has significantly increased. Yet, the users' interactions, navigation behavior, and consequently learning outcome, have not been researched extensively. Related work in this field, also called search as learning, has focused on behavioral or text resource features to predict learning outcome and knowledge gain. In this paper, we investigate whether we can exploit features representing multimedia resource consumption to predict of knowledge gain (KG) during Web search from in-session data, that is without prior knowledge about the learner. For this purpose, we suggest a set of multimedia features related to image and video consumption. Our feature extraction is evaluated in a lab study with 113 participants where we collected data for a given search as learning task on the formation of thunderstorms and lightning. We automatically analyze the monitored log data and utilize state-of-the-art computer vision methods to extract features about the seen multimedia resources. Experimental results demonstrate that multimedia features can improve KG prediction. Finally, we provide an analysis on feature importance (text and multimedia) for KG prediction.
    Writing by Memorizing: Hierarchical Retrieval-based Medical Report Generation. (arXiv:2106.06471v1 [cs.CL])
    (2 min) Medical report generation is one of the most challenging tasks in medical image analysis. Although existing approaches have achieved promising results, they either require a predefined template database in order to retrieve sentences or ignore the hierarchical nature of medical report generation. To address these issues, we propose MedWriter that incorporates a novel hierarchical retrieval mechanism to automatically extract both report and sentence-level templates for clinically accurate report generation. MedWriter first employs the Visual-Language Retrieval~(VLR) module to retrieve the most relevant reports for the given images. To guarantee the logical coherence between sentences, the Language-Language Retrieval~(LLR) module is introduced to retrieve relevant sentences based on the previous generated description. At last, a language decoder fuses image features and features from retrieved reports and sentences to generate meaningful medical reports. We verified the effectiveness of our model by automatic evaluation and human evaluation on two datasets, i.e., Open-I and MIMIC-CXR.
    Nested and Balanced Entity Recognition using Multi-Task Learning. (arXiv:2106.06216v1 [cs.CL])
    (2 min) Entity Recognition (ER) within a text is a fundamental exercise in Natural Language Processing, enabling further depending tasks such as Knowledge Extraction, Text Summarisation, or Keyphrase Extraction. An entity consists of single words or of a consecutive sequence of terms, constituting the basic building blocks for communication. Mainstream ER approaches are mainly limited to flat structures, concentrating on the outermost entities while ignoring the inner ones. This paper introduces a partly-layered network architecture that deals with the complexity of overlapping and nested cases. The proposed architecture consists of two parts: (1) a shared Sequence Layer and (2) a stacked component with multiple Tagging Layers. The adoption of such an architecture has the advantage of preventing overfit to a specific word-length, thus maintaining performance for longer entities despite their lower frequency. To verify the proposed architecture's effectiveness, we train and evaluate this architecture to recognise two kinds of entities - Concepts (CR) and Named Entities (NER). Our approach achieves state-of-the-art NER performances, while it outperforms previous CR approaches. Considering these promising results, we see the possibility to evolve the architecture for other cases such as the extraction of events or the detection of argumentative components.
    A Framework to Enhance Generalization of Deep Metric Learning methods using General Discriminative Feature Learning and Class Adversarial Neural Networks. (arXiv:2106.06420v1 [cs.CV])
    (2 min) Metric learning algorithms aim to learn a distance function that brings the semantically similar data items together and keeps dissimilar ones at a distance. The traditional Mahalanobis distance learning is equivalent to find a linear projection. In contrast, Deep Metric Learning (DML) methods are proposed that automatically extract features from data and learn a non-linear transformation from input space to a semantically embedding space. Recently, many DML methods are proposed focused to enhance the discrimination power of the learned metric by providing novel sampling strategies or loss functions. This approach is very helpful when both the training and test examples are coming from the same set of categories. However, it is less effective in many applications of DML such as image retrieval and person-reidentification. Here, the DML should learn general semantic concepts from observed classes and employ them to rank or identify objects from unseen categories. Neglecting the generalization ability of the learned representation and just emphasizing to learn a more discriminative embedding on the observed classes may lead to the overfitting problem. To address this limitation, we propose a framework to enhance the generalization power of existing DML methods in a Zero-Shot Learning (ZSL) setting by general yet discriminative representation learning and employing a class adversarial neural network. To learn a more general representation, we propose to employ feature maps of intermediate layers in a deep neural network and enhance their discrimination power through an attention mechanism. Besides, a class adversarial network is utilized to enforce the deep model to seek class invariant features for the DML task. We evaluate our work on widely used machine vision datasets in a ZSL setting.
    A Large-Scale Rich Context Query and Recommendation Dataset in Online Knowledge-Sharing. (arXiv:2106.06467v1 [cs.IR])
    (2 min) Data plays a vital role in machine learning studies. In the research of recommendation, both user behaviors and side information are helpful to model users. So, large-scale real scenario datasets with abundant user behaviors will contribute a lot. However, it is not easy to get such datasets as most of them are only hold and protected by companies. In this paper, a new large-scale dataset collected from a knowledge-sharing platform is presented, which is composed of around 100M interactions collected within 10 days, 798K users, 165K questions, 554K answers, 240K authors, 70K topics, and more than 501K user query keywords. There are also descriptions of users, answers, questions, authors, and topics, which are anonymous. Note that each user's latest query keywords have not been included in previous open datasets, which reveal users' explicit information needs. We characterize the dataset and demonstrate its potential applications for recommendation study. Multiple experiments show the dataset can be used to evaluate algorithms in general top-N recommendation, sequential recommendation, and context-aware recommendation. This dataset can also be used to integrate search and recommendation and recommendation with negative feedback. Besides, tasks beyond recommendation, such as user gender prediction, most valuable answerer identification, and high-quality answer recognition, can also use this dataset. To the best of our knowledge, this is the largest real-world interaction dataset for personalized recommendation.
    DebiasGAN: Eliminating Position Bias in News Recommendation with Adversarial Learning. (arXiv:2106.06258v1 [cs.IR])
    (2 min) News recommendation is important for improving news reading experience of users. Users' news click behaviors are widely used for inferring user interests and predicting future clicks. However, click behaviors are heavily affected by the biases brought by the positions of news displayed on the webpage. It is important to eliminate the effect of position biases on the recommendation model to accurately target user interests. In this paper, we propose a news recommendation method named DebiasGAN that can effectively eliminate the effect of position biases via adversarial learning. We use a bias-aware click model to capture the influence of position bias on click behaviors, and we use a bias-invariant click model with random candidate news positions to estimate the ideally unbiased click scores. We apply adversarial learning techniques to the hidden representations learned by the two models to help the bias-invariant click model capture the bias-independent interest of users on news. Experimental results on two real-world datasets show that DebiasGAN can effectively improve the accuracy of news recommendation by eliminating position biases.
    Modeling Sequences as Distributions with Uncertainty for Sequential Recommendation. (arXiv:2106.06165v1 [cs.IR])
    (2 min) The sequential patterns within the user interactions are pivotal for representing the user's preference and capturing latent relationships among items. The recent advancements of sequence modeling by Transformers advocate the community to devise more effective encoders for the sequential recommendation. Most existing sequential methods assume users are deterministic. However, item-item transitions might fluctuate significantly in several item aspects and exhibit randomness of user interests. This \textit{stochastic characteristics} brings up a solid demand to include uncertainties in representing sequences and items. Additionally, modeling sequences and items with uncertainties expands users' and items' interaction spaces, thus further alleviating cold-start problems. In this work, we propose a Distribution-based Transformer for Sequential Recommendation (DT4SR), which injects uncertainties into sequential modeling. We use Elliptical Gaussian distributions to describe items and sequences with uncertainty. We describe the uncertainty in items and sequences as Elliptical Gaussian distribution. And we adopt Wasserstein distance to measure the similarity between distributions. We devise two novel Trans-formers for modeling mean and covariance, which guarantees the positive-definite property of distributions. The proposed method significantly outperforms the state-of-the-art methods. The experiments on three benchmark datasets also demonstrate its effectiveness in alleviating cold-start issues. The code is available inhttps://github.com/DyGRec/DT4SR.
    IoT Virtualization with ML-based Information Extraction. (arXiv:2106.06022v1 [cs.DC])
    (2 min) For IoT to reach its full potential, the sharing and reuse of information in different applications and across verticals is of paramount importance. However, there are a plethora of IoT platforms using different representations, protocols and interaction patterns. To address this issue, the Fed4IoT project has developed an IoT virtualization platform that, on the one hand, integrates information from many different source platforms and, on the other hand, makes the information required by the respective users available in the target platform of choice. To enable this, information is translated into a common, neutral exchange format. The format of choice is NGSI-LD, which is being standardized by the ETSI Industry Specification Group on Context Information Management (ETSI ISG CIM). Thing Visors are the components that translate the source information to NGSI-LD, which is then delivered to the target platform and translated into the target format. ThingVisors can be implemented by hand, but this requires significant human effort, especially considering the heterogeneity of low level information produced by a multitude of sensors. Thus, supporting the human developer and, ideally, fully automating the process of extracting and enriching data and translating it to NGSI-LD is a crucial step. Machine learning is a promising approach for this, but it typically requires large amounts of hand-labelled data for training, an effort that makes it unrealistic in many IoT scenarios. A programmatic labelling approach called knowledge infusion that encodes expert knowledge is used for matching a schema or ontology extracted from the data with a target schema or ontology, providing the basis for annotating the data and facilitating the translation to NGSI-LD.
  • cs.LG updates on arXiv.org

    MagNet: A Neural Network for Directed Graphs. (arXiv:2102.11391v2 [cs.LG] UPDATED)
    (2 min) The prevalence of graph-based data has spurred the rapid development of graph neural networks (GNNs) and related machine learning algorithms. Yet, despite the many datasets naturally modeled as directed graphs, including citation, website, and traffic networks, the vast majority of this research focuses on undirected graphs. In this paper, we propose MagNet, a spectral GNN for directed graphs based on a complex Hermitian matrix known as the magnetic Laplacian. This matrix encodes undirected geometric structure in the magnitude of its entries and directional information in their phase. A "charge" parameter attunes spectral information to variation among directed cycles. We apply our network to a variety of directed graph node classification and link prediction tasks showing that MagNet performs well on all tasks and that its performance exceeds all other methods on a majority of such tasks. The underlying principles of MagNet are such that it can be adapted to other spectral GNN architectures.
    Interpreting Expert Annotation Differences in Animal Behavior. (arXiv:2106.06114v1 [cs.LG])
    (2 min) Hand-annotated data can vary due to factors such as subjective differences, intra-rater variability, and differing annotator expertise. We study annotations from different experts who labelled the same behavior classes on a set of animal behavior videos, and observe a variation in annotation styles. We propose a new method using program synthesis to help interpret annotation differences for behavior analysis. Our model selects relevant trajectory features and learns a temporal filter as part of a program, which corresponds to estimated importance an annotator places on that feature at each timestamp. Our experiments on a dataset from behavioral neuroscience demonstrate that compared to baseline approaches, our method is more accurate at capturing annotator labels and learns interpretable temporal filters. We believe that our method can lead to greater reproducibility of behavior annotations used in scientific studies. We plan to release our code.
    Regularized Softmax Deep Multi-Agent $Q$-Learning. (arXiv:2103.11883v2 [cs.LG] UPDATED)
    (2 min) Tackling overestimation in $Q$-learning is an important problem that has been extensively studied in single-agent reinforcement learning, but has received comparatively little attention in the multi-agent setting. In this work, we empirically demonstrate that QMIX, a popular $Q$-learning algorithm for cooperative multi-agent reinforcement learning (MARL), suffers from a more severe overestimation in practice than previously acknowledged, and is not mitigated by existing approaches. We rectify this with a novel regularization-based update scheme that penalizes large joint action-values that deviate from a baseline and demonstrate its effectiveness in stabilizing learning. Furthermore, we propose to employ a softmax operator, which we efficiently approximate in a novel way in the multi-agent setting, to further reduce the potential overestimation bias. Our approach, Regularized Softmax (RES) Deep Multi-Agent $Q$-Learning, is general and can be applied to any $Q$-learning based MARL algorithm. We demonstrate that, when applied to QMIX, RES avoids severe overestimation and significantly improves performance, yielding state-of-the-art results in a variety of cooperative multi-agent tasks, including the challenging StarCraft II micromanagement benchmarks.
    Sparse Bayesian Learning via Stepwise Regression. (arXiv:2106.06095v1 [cs.LG])
    (2 min) Sparse Bayesian Learning (SBL) is a powerful framework for attaining sparsity in probabilistic models. Herein, we propose a coordinate ascent algorithm for SBL termed Relevance Matching Pursuit (RMP) and show that, as its noise variance parameter goes to zero, RMP exhibits a surprising connection to Stepwise Regression. Further, we derive novel guarantees for Stepwise Regression algorithms, which also shed light on RMP. Our guarantees for Forward Regression improve on deterministic and probabilistic results for Orthogonal Matching Pursuit with noise. Our analysis of Backward Regression on determined systems culminates in a bound on the residual of the optimal solution to the subset selection problem that, if satisfied, guarantees the optimality of the result. To our knowledge, this bound is the first that can be computed in polynomial time and depends chiefly on the smallest singular value of the matrix. We report numerical experiments using a variety of feature selection algorithms. Notably, RMP and its limiting variant are both efficient and maintain strong performance with correlated features.
    Surface Warping Incorporating Machine Learning Assisted Domain Likelihood Estimation: A New Paradigm in Mine Geology Modelling and Automation. (arXiv:2103.03923v2 [physics.geo-ph] UPDATED)
    (3 min) This paper illustrates an application of machine learning (ML) within a complex system that performs grade estimation. In surface mining, assay measurements taken from production drilling often provide useful information that allows initially inaccurate surfaces created using sparse exploration data to be revised and subsequently improved. Recently, a Bayesian warping technique has been proposed to reshape modeled surfaces using geochemical and spatial constraints imposed by newly acquired blasthole data. This paper focuses on incorporating machine learning into this warping framework to make the likelihood computation generalizable. The technique works by adjusting the position of vertices on the surface to maximize the integrity of modeled geological boundaries with respect to sparse geochemical observations. Its foundation is laid by a Bayesian derivation in which the geological domain likelihood given the chemistry, p(g|c), plays a similar role to p(y(c)|g). This observation allows a manually calibrated process centered around the latter to be automated since ML techniques may be used to estimate the former in a data-driven way. Machine learning performance is evaluated for gradient boosting, neural network, random forest and other classifiers in a binary and multi-class context using precision and recall rates. Once ML likelihood estimators are integrated in the surface warping framework, surface shaping performance is evaluated using unseen data by examining the categorical distribution of test samples located above and below the warped surface. Large-scale validation experiments are performed to assess the overall efficacy of ML assisted surface warping as a fully integrated component within an ore grade estimation system where the posterior mean is obtained via Gaussian Process inference with a Matern 3/2 kernel.
    Binary Classification from Multiple Unlabeled Datasets via Surrogate Set Classification. (arXiv:2102.00678v2 [cs.LG] UPDATED)
    (2 min) To cope with high annotation costs, training a classifier only from weakly supervised data has attracted a great deal of attention these days. Among various approaches, strengthening supervision from completely unsupervised classification is a promising direction, which typically employs class priors as the only supervision and trains a binary classifier from unlabeled (U) datasets. While existing risk-consistent methods are theoretically grounded with high flexibility, they can learn only from two U sets. In this paper, we propose a new approach for binary classification from $m$ U-sets for $m\ge2$. Our key idea is to consider an auxiliary classification task called surrogate set classification (SSC), which is aimed at predicting from which U set each observed data is drawn. SSC can be solved by a standard (multi-class) classification method, and we use the SSC solution to obtain the final binary classifier through a certain linear-fractional transformation. We built our method in a flexible and efficient end-to-end deep learning framework and prove it to be classifier-consistent. Through experiments, we demonstrate the superiority of our proposed method over state-of-the-art methods.
    Optimal Complexity in Decentralized Training. (arXiv:2006.08085v3 [cs.LG] UPDATED)
    (2 min) Decentralization is a promising method of scaling up parallel machine learning systems. In this paper, we provide a tight lower bound on the iteration complexity for such methods in a stochastic non-convex setting. Our lower bound reveals a theoretical gap in known convergence rates of many existing decentralized training algorithms, such as D-PSGD. We prove by construction this lower bound is tight and achievable. Motivated by our insights, we further propose DeTAG, a practical gossip-style decentralized algorithm that achieves the lower bound with only a logarithm gap. Empirically, we compare DeTAG with other decentralized algorithms on image classification tasks, and we show DeTAG enjoys faster convergence compared to baselines, especially on unshuffled data and in sparse networks.
    Compositional Video Synthesis with Action Graphs. (arXiv:2006.15327v4 [cs.CV] UPDATED)
    (2 min) Videos of actions are complex signals containing rich compositional structure in space and time. Current video generation methods lack the ability to condition the generation on multiple coordinated and potentially simultaneous timed actions. To address this challenge, we propose to represent the actions in a graph structure called Action Graph and present the new ``Action Graph To Video'' synthesis task. Our generative model for this task (AG2Vid) disentangles motion and appearance features, and by incorporating a scheduling mechanism for actions facilitates a timely and coordinated video generation. We train and evaluate AG2Vid on the CATER and Something-Something V2 datasets, and show that the resulting videos have better visual quality and semantic consistency compared to baselines. Finally, our model demonstrates zero-shot abilities by synthesizing novel compositions of the learned actions. For code and pretrained models, see the project page https://roeiherz.github.io/AG2Video
    An Integer Linear Programming Framework for Mining Constraints from Data. (arXiv:2006.10836v2 [cs.LG] UPDATED)
    (2 min) Structured output prediction problems (e.g., sequential tagging, hierarchical multi-class classification) often involve constraints over the output label space. These constraints interact with the learned models to filter infeasible solutions and facilitate in building an accountable system. However, although constraints are useful, they are often based on hand-crafted rules. This raises a question -- \emph{can we mine constraints and rules from data based on a learning algorithm?} In this paper, we present a general framework for mining constraints from data. In particular, we consider the inference in structured output prediction as an integer linear programming (ILP) problem. Then, given the coefficients of the objective function and the corresponding solution, we mine the underlying constraints by estimating the outer and inner polytopes of the feasible set. We verify the proposed constraint mining algorithm in various synthetic and real-world applications and demonstrate that the proposed approach successfully identifies the feasible set at scale. In particular, we show that our approach can learn to solve 9x9 Sudoku puzzles and minimal spanning tree problems from examples without providing the underlying rules. Our algorithm can also integrate with a neural network model to learn the hierarchical label structure of a multi-label classification task. Besides, we provide a theoretical analysis about the tightness of the polytopes and the reliability of the mined constraints.
    Finite-Sample Analysis of Off-Policy Natural Actor-Critic Algorithm. (arXiv:2102.09318v2 [cs.LG] UPDATED)
    (2 min) In this paper, we provide finite-sample convergence guarantees for an off-policy variant of the natural actor-critic (NAC) algorithm based on Importance Sampling. In particular, we show that the algorithm converges to a global optimal policy with a sample complexity of $\mathcal{O}(\epsilon^{-3}\log^2(1/\epsilon))$ under an appropriate choice of stepsizes. In order to overcome the issue of large variance due to Importance Sampling, we propose the $Q$-trace algorithm for the critic, which is inspired by the V-trace algorithm \cite{espeholt2018impala}. This enables us to explicitly control the bias and variance, and characterize the trade-off between them. As an advantage of off-policy sampling, a major feature of our result is that we do not need any additional assumptions, beyond the ergodicity of the Markov chain induced by the behavior policy.
    Distributed Learning and its Application for Time-Series Prediction. (arXiv:2106.03211v2 [cs.LG] UPDATED)
    (3 min) Extreme events are occurrences whose magnitude and potential cause extensive damage on people, infrastructure, and the environment. Motivated by the extreme nature of the current global health landscape, which is plagued by the coronavirus pandemic, we seek to better understand and model extreme events. Modeling extreme events is common in practice and plays an important role in time-series prediction applications. Our goal is to (i) compare and investigate the effect of some common extreme events modeling methods to explore which method can be practical in reality and (ii) accelerate the deep learning training process, which commonly uses deep recurrent neural network (RNN), by implementing the asynchronous local Stochastic Gradient Descent (SGD) framework among multiple compute nodes. In order to verify our distributed extreme events modeling, we evaluate our proposed framework on a stock data set S\&P500, with a standard recurrent neural network. Our intuition is to explore the (best) extreme events modeling method which could work well under the distributed deep learning setting. Moreover, by using asynchronous distributed learning, we aim to significantly reduce the communication cost among the compute nodes and central server, which is the main bottleneck of almost all distributed learning frameworks. We implement our proposed work and evaluate its performance on representative data sets, such as S&P500 stock in $5$-year period. The experimental results validate the correctness of the design principle and show a significant training duration reduction upto $8$x, compared to the baseline single compute node. Our results also show that our proposed work can achieve the same level of test accuracy, compared to the baseline setting.
    Divergence Regulated Encoder Network for Joint Dimensionality Reduction and Classification. (arXiv:2012.15764v4 [cs.LG] UPDATED)
    (2 min) In this paper, we investigate performing joint dimensionality reduction and classification using a novel histogram neural network. Motivated by a popular dimensionality reduction approach, t-Distributed Stochastic Neighbor Embedding (t-SNE), our proposed method incorporates a classification loss computed on samples in a low-dimensional embedding space. We compare the learned sample embeddings against coordinates found by t-SNE in terms of classification accuracy and qualitative assessment. We also explore use of various divergence measures in the t-SNE objective. The proposed method has several advantages such as readily embedding out-of-sample points and reducing feature dimensionality while retaining class discriminability. Our results show that the proposed approach maintains and/or improves classification performance and reveals characteristics of features produced by neural networks that may be helpful for other applications.
    A Theory of Label Propagation for Subpopulation Shift. (arXiv:2102.11203v2 [cs.LG] UPDATED)
    (2 min) One of the central problems in machine learning is domain adaptation. Unlike past theoretical work, we consider a new model for subpopulation shift in the input or representation space. In this work, we propose a provably effective framework for domain adaptation based on label propagation. In our analysis, we use a simple but realistic expansion assumption, proposed in \citet{wei2021theoretical}. Using a teacher classifier trained on the source domain, our algorithm not only propagates to the target domain but also improves upon the teacher. By leveraging existing generalization bounds, we also obtain end-to-end finite-sample guarantees on the entire algorithm. In addition, we extend our theoretical framework to a more general setting of source-to-target transfer based on a third unlabeled dataset, which can be easily applied in various learning scenarios. Inspired by our theory, we adapt consistency-based semi-supervised learning methods to domain adaptation settings and gain significant improvements.
    Towards the Unification and Robustness of Perturbation and Gradient Based Explanations. (arXiv:2102.10618v3 [cs.LG] UPDATED)
    (2 min) As machine learning black boxes are increasingly being deployed in critical domains such as healthcare and criminal justice, there has been a growing emphasis on developing techniques for explaining these black boxes in a post hoc manner. In this work, we analyze two popular post hoc interpretation techniques: SmoothGrad which is a gradient based method, and a variant of LIME which is a perturbation based method. More specifically, we derive explicit closed form expressions for the explanations output by these two methods and show that they both converge to the same explanation in expectation, i.e., when the number of perturbed samples used by these methods is large. We then leverage this connection to establish other desirable properties, such as robustness, for these techniques. We also derive finite sample complexity bounds for the number of perturbations required for these methods to converge to their expected explanation. Finally, we empirically validate our theory using extensive experimentation on both synthetic and real world datasets.
    SPPL: Probabilistic Programming with Fast Exact Symbolic Inference. (arXiv:2010.03485v3 [cs.PL] UPDATED)
    (2 min) We present the Sum-Product Probabilistic Language (SPPL), a new probabilistic programming language that automatically delivers exact solutions to a broad range of probabilistic inference queries. SPPL translates probabilistic programs into sum-product expressions, a new symbolic representation and associated semantic domain that extends standard sum-product networks to support mixed-type distributions, numeric transformations, logical formulas, and pointwise and set-valued constraints. We formalize SPPL via a novel translation strategy from probabilistic programs to sum-product expressions and give sound exact algorithms for conditioning on and computing probabilities of events. SPPL imposes a collection of restrictions on probabilistic programs to ensure they can be translated into sum-product expressions, which allow the system to leverage new techniques for improving the scalability of translation and inference by automatically exploiting probabilistic structure. We implement a prototype of SPPL with a modular architecture and evaluate it on benchmarks the system targets, showing that it obtains up to 3500x speedups over state-of-the-art symbolic systems on tasks such as verifying the fairness of decision tree classifiers, smoothing hidden Markov models, conditioning transformed random variables, and computing rare event probabilities.
    Online Multi-Object Tracking and Segmentation with GMPHD Filter and Mask-based Affinity Fusion. (arXiv:2009.00100v2 [cs.CV] UPDATED)
    (2 min) In this paper, we propose a highly practical fully online multi-object tracking and segmentation (MOTS) method that uses instance segmentation results as an input. The proposed method is based on the Gaussian mixture probability hypothesis density (GMPHD) filter, a hierarchical data association (HDA), and a mask-based affinity fusion (MAF) model to achieve high-performance online tracking. The HDA consists of two associations: segment-to-track and track-to-track associations. One affinity, for position and motion, is computed by using the GMPHD filter, and the other affinity, for appearance is computed by using the responses from a single object tracker such as a kernalized correlation filter. These two affinities are simply fused by using a score-level fusion method such as min-max normalization referred to as MAF. In addition, to reduce the number of false positive segments, we adopt mask IoU-based merging (mask merging). The proposed MOTS framework with the key modules: HDA, MAF, and mask merging, is easily extensible to simultaneously track multiple types of objects with CPU only execution in parallel processing. In addition, the developed framework only requires simple parameter tuning unlike many existing MOTS methods that need intensive hyperparameter optimization. In the experiments on the two popular MOTS datasets, the key modules show some improvements. For instance, ID-switch decreases by more than half compared to a baseline method in the training sets. In conclusion, our tracker achieves state-of-the-art MOTS performance in the test sets.
    Synthesising Reinforcement Learning Policies through Set-Valued Inductive Rule Learning. (arXiv:2106.06009v1 [cs.AI])
    (2 min) Today's advanced Reinforcement Learning algorithms produce black-box policies, that are often difficult to interpret and trust for a person. We introduce a policy distilling algorithm, building on the CN2 rule mining algorithm, that distills the policy into a rule-based decision system. At the core of our approach is the fact that an RL process does not just learn a policy, a mapping from states to actions, but also produces extra meta-information, such as action values indicating the quality of alternative actions. This meta-information can indicate whether more than one action is near-optimal for a certain state. We extend CN2 to make it able to leverage knowledge about equally-good actions to distill the policy into fewer rules, increasing its interpretability by a person. Then, to ensure that the rules explain a valid, non-degenerate policy, we introduce a refinement algorithm that fine-tunes the rules to obtain good performance when executed in the environment. We demonstrate the applicability of our algorithm on the Mario AI benchmark, a complex task that requires modern reinforcement learning algorithms including neural networks. The explanations we produce capture the learned policy in only a few rules, that allow a person to understand what the black-box agent learned. Source code: https://gitlab.ai.vub.ac.be/yocoppen/svcn2
    Continuous-Time Model-Based Reinforcement Learning. (arXiv:2102.04764v3 [cs.LG] UPDATED)
    (2 min) Model-based reinforcement learning (MBRL) approaches rely on discrete-time state transition models whereas physical systems and the vast majority of control tasks operate in continuous-time. To avoid time-discretization approximation of the underlying process, we propose a continuous-time MBRL framework based on a novel actor-critic method. Our approach also infers the unknown state evolution differentials with Bayesian neural ordinary differential equations (ODE) to account for epistemic uncertainty. We implement and test our method on a new ODE-RL suite that explicitly solves continuous-time control systems. Our experiments illustrate that the model is robust against irregular and noisy data, is sample-efficient, and can solve control problems which pose challenges to discrete-time MBRL methods.
    Particle Dual Averaging: Optimization of Mean Field Neural Networks with Global Convergence Rate Analysis. (arXiv:2012.15477v2 [stat.ML] UPDATED)
    (2 min) We propose the particle dual averaging (PDA) method, which generalizes the dual averaging method in convex optimization to the optimization over probability distributions with quantitative runtime guarantee. The algorithm consists of an inner loop and outer loop: the inner loop utilizes the Langevin algorithm to approximately solve for a stationary distribution, which is then optimized in the outer loop. The method can thus be interpreted as an extension of the Langevin algorithm to naturally handle nonlinear functional on the probability space. An important application of the proposed method is the optimization of neural network in the mean field regime, which is theoretically attractive due to the presence of nonlinear feature learning, but quantitative convergence rate can be challenging to obtain. By adapting finite-dimensional convex optimization theory into the space of distributions, we analyze PDA in regularized empirical / expected risk minimization, and establish quantitative global convergence in learning two-layer mean field neural networks under more general settings. Our theoretical results are supported by numerical simulations on neural networks with reasonable size.
    UnNatural Language Inference. (arXiv:2101.00010v2 [cs.CL] UPDATED)
    (2 min) Recent investigations into the inner-workings of state-of-the-art large-scale pre-trained Transformer-based Natural Language Understanding (NLU) models indicate that they appear to know humanlike syntax, at least to some extent. We provide novel evidence that complicates this claim: we find that state-of-the-art Natural Language Inference (NLI) models assign the same labels to permuted examples as they do to the original, i.e. they are largely invariant to random word-order permutations. This behavior notably differs from that of humans; we struggle with ungrammatical sentences. To measure the severity of this issue, we propose a suite of metrics and investigate which properties of particular permutations lead models to be word-order invariant. In the MNLI dataset, for example, we find almost all (98.7%) examples contain at least one permutation which elicits the gold label. Models are sometimes even able to assign gold labels to permutations that they originally failed to predict correctly. We provide a comprehensive empirical evaluation of this phenomenon, and further show that this issue exists for both Transformers and pre-Transformer RNN / ConvNet based encoders, as well as across multiple languages (English and Mandarin Chinese). Our code and data are available at https://github.com/facebookresearch/unlu.
    A Distribution-Dependent Analysis of Meta-Learning. (arXiv:2011.00344v2 [stat.ML] UPDATED)
    (2 min) A key problem in the theory of meta-learning is to understand how the task distributions influence transfer risk, the expected error of a meta-learner on a new task drawn from the unknown task distribution. In this paper, focusing on fixed design linear regression with Gaussian noise and a Gaussian task (or parameter) distribution, we give distribution-dependent lower bounds on the transfer risk of any algorithm, while we also show that a novel, weighted version of the so-called biased regularized regression method is able to match these lower bounds up to a fixed constant factor. Notably, the weighting is derived from the covariance of the Gaussian task distribution. Altogether, our results provide a precise characterization of the difficulty of meta-learning in this Gaussian setting. While this problem setting may appear simple, we show that it is rich enough to unify the "parameter sharing" and "representation learning" streams of meta-learning; in particular, representation learning is obtained as the special case when the covariance matrix of the task distribution is unknown. For this case we propose to adopt the EM method, which is shown to enjoy efficient updates in our case. The paper is completed by an empirical study of EM. In particular, our experimental results show that the EM algorithm can attain the lower bound as the number of tasks grows, while the algorithm is also successful in competing with its alternatives when used in a representation learning context.
    Learning to Extend Molecular Scaffolds with Structural Motifs. (arXiv:2103.03864v2 [cs.LG] UPDATED)
    (2 min) Recent advancements in deep learning-based modeling of molecules promise to accelerate in silico drug discovery. A plethora of generative models is available, building molecules either atom-by-atom and bond-by-bond or fragment-by-fragment. However, many drug discovery projects require a fixed scaffold to be present in the generated molecule, and incorporating that constraint has only recently been explored. In this work, we propose a new graph-based model that naturally supports scaffolds as initial seed of the generative procedure, which is possible because our model is not conditioned on the generation history. At the same time, our generation procedure can flexibly choose between adding individual atoms and entire fragments. We show that training using a randomized generation order is necessary for good performance when extending scaffolds, and that the results are further improved by increasing the fragment vocabulary size. Our model pushes the state-of-the-art of graph-based molecule generation, while being an order of magnitude faster to train and sample from than existing approaches.
    Proxy-Normalizing Activations to Match Batch Normalization while Removing Batch Dependence. (arXiv:2106.03743v2 [cs.LG] UPDATED)
    (2 min) We investigate the reasons for the performance degradation incurred with batch-independent normalization. We find that the prototypical techniques of layer normalization and instance normalization both induce the appearance of failure modes in the neural network's pre-activations: (i) layer normalization induces a collapse towards channel-wise constant functions; (ii) instance normalization induces a lack of variability in instance statistics, symptomatic of an alteration of the expressivity. To alleviate failure mode (i) without aggravating failure mode (ii), we introduce the technique "Proxy Normalization" that normalizes post-activations using a proxy distribution. When combined with layer normalization or group normalization, this batch-independent normalization emulates batch normalization's behavior and consistently matches or exceeds its performance.
    Unsupervised Knowledge Graph Alignment by Probabilistic Reasoning and Semantic Embedding. (arXiv:2105.05596v3 [cs.CL] UPDATED)
    (2 min) Knowledge Graph (KG) alignment is to discover the mappings (i.e., equivalent entities, relations, and others) between two KGs. The existing methods can be divided into the embedding-based models, and the conventional reasoning and lexical matching based systems. The former compute the similarity of entities via their cross-KG embeddings, but they usually rely on an ideal supervised learning setting for good performance and lack appropriate reasoning to avoid logically wrong mappings; while the latter address the reasoning issue but are poor at utilizing the KG graph structures and the entity contexts. In this study, we aim at combining the above two solutions and thus propose an iterative framework named PRASE which is based on probabilistic reasoning and semantic embedding. It learns the KG embeddings via entity mappings from a probabilistic reasoning system named PARIS, and feeds the resultant entity mappings and embeddings back into PARIS for augmentation. The PRASE framework is compatible with different embedding-based models, and our experiments on multiple datasets have demonstrated its state-of-the-art performance.
    A Knowledge Distillation Ensemble Framework for Predicting Short and Long-term Hospitalisation Outcomes from Electronic Health Records Data. (arXiv:2011.09361v2 [cs.LG] UPDATED)
    (2 min) The ability to perform accurate prognosis of patients is crucial for proactive clinical decision making, informed resource management and personalised care. Existing outcome prediction models suffer from a low recall of infrequent positive outcomes. We present a highly-scalable and robust machine learning framework to automatically predict adversity represented by mortality and ICU admission from time-series vital signs and laboratory results obtained within the first 24 hours of hospital admission. The stacked platform comprises two components: a) an unsupervised LSTM Autoencoder that learns an optimal representation of the time-series, using it to differentiate the less frequent patterns which conclude with an adverse event from the majority patterns that do not, and b) a gradient boosting model, which relies on the constructed representation to refine prediction, incorporating static features of demographics, admission details and clinical summaries. The model is used to assess a patient's risk of adversity over time and provides visual justifications of its prediction based on the patient's static features and dynamic signals. Results of three case studies for predicting mortality and ICU admission show that the model outperforms all existing outcome prediction models, achieving PR-AUC of 0.891 (95$%$ CI: 0.878 - 0.969) in predicting mortality in ICU and general ward settings and 0.908 (95$%$ CI: 0.870-0.935) in predicting ICU admission.
    Twin Neural Network Regression is a Semi-Supervised Regression Algorithm. (arXiv:2106.06124v1 [cs.LG])
    (2 min) Twin neural network regression (TNNR) is a semi-supervised regression algorithm, it can be trained on unlabelled data points as long as other, labelled anchor data points, are present. TNNR is trained to predict differences between the target values of two different data points rather than the targets themselves. By ensembling predicted differences between the targets of an unseen data point and all training data points, it is possible to obtain a very accurate prediction for the original regression problem. Since any loop of predicted differences should sum to zero, loops can be supplied to the training data, even if the data points themselves within loops are unlabelled. Semi-supervised training improves TNNR performance, which is already state of the art, significantly.
    Calibrate Before Use: Improving Few-Shot Performance of Language Models. (arXiv:2102.09690v2 [cs.CL] UPDATED)
    (2 min) GPT-3 can perform numerous tasks when provided a natural language prompt that contains a few training examples. We show that this type of few-shot learning can be unstable: the choice of prompt format, training examples, and even the order of the training examples can cause accuracy to vary from near chance to near state-of-the-art. We demonstrate that this instability arises from the bias of language models towards predicting certain answers, e.g., those that are placed near the end of the prompt or are common in the pre-training data. To mitigate this, we first estimate the model's bias towards each answer by asking for its prediction when given the training prompt and a content-free test input such as "N/A". We then fit calibration parameters that cause the prediction for this input to be uniform across answers. On a diverse set of tasks, this contextual calibration procedure substantially improves GPT-3 and GPT-2's average accuracy (up to 30.0% absolute) and reduces variance across different choices of the prompt.
    Infinite-dimensional Folded-in-time Deep Neural Networks. (arXiv:2101.02966v2 [cs.LG] UPDATED)
    (2 min) The method recently introduced in arXiv:2011.10115 realizes a deep neural network with just a single nonlinear element and delayed feedback. It is applicable for the description of physically implemented neural networks. In this work, we present an infinite-dimensional generalization, which allows for a more rigorous mathematical analysis and a higher flexibility in choosing the weight functions. Precisely speaking, the weights are described by Lebesgue integrable functions instead of step functions. We also provide a functional back-propagation algorithm, which enables gradient descent training of the weights. In addition, with a slight modification, our concept realizes recurrent neural networks.
    Value Alignment Verification. (arXiv:2012.01557v2 [cs.LG] UPDATED)
    (2 min) As humans interact with autonomous agents to perform increasingly complicated, potentially risky tasks, it is important to be able to efficiently evaluate an agent's performance and correctness. In this paper we formalize and theoretically analyze the problem of efficient value alignment verification: how to efficiently test whether the behavior of another agent is aligned with a human's values. The goal is to construct a kind of "driver's test" that a human can give to any agent which will verify value alignment via a minimal number of queries. We study alignment verification problems with both idealized humans that have an explicit reward function as well as problems where they have implicit values. We analyze verification of exact value alignment for rational agents and propose and analyze heuristic and approximate value alignment verification tests in a wide range of gridworlds and a continuous autonomous driving domain. Finally, we prove that there exist sufficient conditions such that we can verify exact and approximate alignment across an infinite set of test environments via a constant-query-complexity alignment test.
    TOHAN: A One-step Approach towards Few-shot Hypothesis Adaptation. (arXiv:2106.06326v1 [cs.LG])
    (2 min) In few-shot domain adaptation (FDA), classifiers for the target domain are trained with accessible labeled data in the source domain (SD) and few labeled data in the target domain (TD). However, data usually contain private information in the current era, e.g., data distributed on personal phones. Thus, the private information will be leaked if we directly access data in SD to train a target-domain classifier (required by FDA methods). In this paper, to thoroughly prevent the privacy leakage in SD, we consider a very challenging problem setting, where the classifier for the TD has to be trained using few labeled target data and a well-trained SD classifier, named few-shot hypothesis adaptation (FHA). In FHA, we cannot access data in SD, as a result, the private information in SD will be protected well. To this end, we propose a target orientated hypothesis adaptation network (TOHAN) to solve the FHA problem, where we generate highly-compatible unlabeled data (i.e., an intermediate domain) to help train a target-domain classifier. TOHAN maintains two deep networks simultaneously, where one focuses on learning an intermediate domain and the other takes care of the intermediate-to-target distributional adaptation and the target-risk minimization. Experimental results show that TOHAN outperforms competitive baselines significantly.
    Improving Anytime Prediction with Parallel Cascaded Networks and a Temporal-Difference Loss. (arXiv:2102.09808v3 [cs.LG] UPDATED)
    (2 min) Although deep feedforward neural networks share some characteristics with the primate visual system, a key distinction is their dynamics. Deep nets typically operate in serial stages wherein each layer completes its computation before processing begins in subsequent layers. In contrast, biological systems have cascaded dynamics: information propagates from neurons at all layers in parallel but transmission occurs gradually over time, leading to speed-accuracy trade offs even in feedforward architectures. We explore the consequences of biologically inspired parallel hardware by constructing cascaded ResNets in which each residual block has propagation delays but all blocks update in parallel in a stateful manner. Because information transmitted through skip connections avoids delays, the functional depth of the architecture increases over time, yielding anytime predictions that improve with internal-processing time. We introduce a temporal-difference training loss that achieves a strictly superior speed-accuracy profile over standard losses and enables the cascaded architecture to outperform state-of-the-art anytime-prediction methods. The cascaded architecture has intriguing properties, including: it classifies typical instances more rapidly than atypical instances; it is more robust to both persistent and transient noise than is a conventional ResNet; and its time-varying output trace provides a signal that can be exploited to improve information processing and inference.
    Deep Adaptive Design: Amortizing Sequential Bayesian Experimental Design. (arXiv:2103.02438v2 [stat.ML] UPDATED)
    (2 min) We introduce Deep Adaptive Design (DAD), a method for amortizing the cost of adaptive Bayesian experimental design that allows experiments to be run in real-time. Traditional sequential Bayesian optimal experimental design approaches require substantial computation at each stage of the experiment. This makes them unsuitable for most real-world applications, where decisions must typically be made quickly. DAD addresses this restriction by learning an amortized design network upfront and then using this to rapidly run (multiple) adaptive experiments at deployment time. This network represents a design policy which takes as input the data from previous steps, and outputs the next design using a single forward pass; these design decisions can be made in milliseconds during the live experiment. To train the network, we introduce contrastive information bounds that are suitable objectives for the sequential setting, and propose a customized network architecture that exploits key symmetries. We demonstrate that DAD successfully amortizes the process of experimental design, outperforming alternative strategies on a number of problems.
    Data Profiling for Adversarial Training: On the Ruin of Problematic Data. (arXiv:2102.07437v2 [cs.LG] UPDATED)
    (2 min) There are multiple intriguing problems hovering in adversarial training, including robustness-accuracy trade-off, robust overfitting, and robustness overestimation. These problems pose great challenges to both reliable evaluation and practical deployment. Here, we show that these problems share one common cause -- low quality samples in the dataset. We first identify an intrinsic property of the data called \emph{problematic score} and then design controlled experiments to investigate its connections with these problems. Specifically, we find that when problematic data is removed, robust overfitting and robustness overestimation can be largely alleviated; and robustness-accuracy trade-off becomes less significant. These observations not only verify our intuition about data quality but also open new opportunities to advance adversarial training. Interestingly, simply removing problematic data from adversarial training, while making the training set smaller, yields better robustness for leading adversarial training strategies.
    Nonparametric Learning of Two-Layer ReLU Residual Units. (arXiv:2008.07648v2 [cs.LG] UPDATED)
    (2 min) We describe an algorithm that learns two-layer residual units with rectified linear unit (ReLU) activation: suppose the input $\mathbf{x}$ is from a distribution with support space $\mathbb{R}^d$ and the ground-truth generative model is such a residual unit, given by \[\mathbf{y}= \boldsymbol{B}^\ast\left[\left(\boldsymbol{A}^\ast\mathbf{x}\right)^+ + \mathbf{x}\right]\text{,}\] where ground-truth network parameters $\boldsymbol{A}^\ast \in \mathbb{R}^{d\times d}$ is a nonnegative full-rank matrix and $\boldsymbol{B}^\ast \in \mathbb{R}^{m\times d}$ is full-rank with $m \geq d$ and for $\mathbf{c} \in \mathbb{R}^d$, $[\mathbf{c}^{+}]_i = \max\{0, c_i\}$. We design layer-wise objectives as functionals whose analytic minimizers express the exact ground-truth network in terms of its parameters and nonlinearities. Following this objective landscape, learning residual units from finite samples can be formulated using convex optimization of a nonparametric function: for each layer, we first formulate the corresponding empirical risk minimization (ERM) as a positive semi-definite quadratic program (QP), then we show the solution space of the QP can be equivalently determined by a set of linear inequalities, which can then be efficiently solved by linear programming (LP). We further prove the statistical strong consistency of our algorithm, and demonstrate the robustness and sample efficiency of our algorithm by experiments.
    NAAQA: A Neural Architecture for Acoustic Question Answering. (arXiv:2106.06147v1 [cs.CL])
    (2 min) The goal of the Acoustic Question Answering (AQA) task is to answer a free-form text question about the content of an acoustic scene. It was inspired by the Visual Question Answering (VQA) task. In this paper, based on the previously introduced CLEAR dataset, we propose a new benchmark for AQA that emphasizes the specific challenges of acoustic inputs, e.g. variable duration scenes. We also introduce NAAQA, a neural architecture that leverages specific properties of acoustic inputs. The usage of time and frequency 1D convolutions to process 2D spectro-temporal representations of acoustic content shows promising results and enables reductions in model complexity. NAAQA achieves 91.6% of accuracy on the AQA task with about 7 times fewer parameters than the previously explored VQA model. We provide a detailed analysis of the results for the different question types. The effectiveness of coordinate maps in this acoustic context was also studied and we show that time coordinate maps augment temporal localization capabilities which enhance performance of the network by about 17 percentage points.
    Quantile Bandits for Best Arms Identification. (arXiv:2010.11568v2 [cs.LG] UPDATED)
    (2 min) We consider a variant of the best arm identification task in stochastic multi-armed bandits. Motivated by risk-averse decision-making problems, our goal is to identify a set of $m$ arms with the highest $\tau$-quantile values within a fixed budget. We prove asymmetric two-sided concentration inequalities for order statistics and quantiles of random variables that have non-decreasing hazard rate, which may be of independent interest. With these inequalities, we analyse a quantile version of Successive Accepts and Rejects (Q-SAR). We derive an upper bound for the probability of arm misidentification, the first justification of a quantile based algorithm for fixed budget multiple best arms identification. We show illustrative experiments for best arm identification.
    Of Moments and Matching: A Game-Theoretic Framework for Closing the Imitation Gap. (arXiv:2103.03236v2 [cs.LG] UPDATED)
    (2 min) We provide a unifying view of a large family of previous imitation learning algorithms through the lens of moment matching. At its core, our classification scheme is based on whether the learner attempts to match (1) reward or (2) action-value moments of the expert's behavior, with each option leading to differing algorithmic approaches. By considering adversarially chosen divergences between learner and expert behavior, we are able to derive bounds on policy performance that apply for all algorithms in each of these classes, the first to our knowledge. We also introduce the notion of moment recoverability, implicit in many previous analyses of imitation learning, which allows us to cleanly delineate how well each algorithmic family is able to mitigate compounding errors. We derive three novel algorithm templates (AdVIL, AdRIL, and DAeQuIL) with strong guarantees, simple implementation, and competitive empirical performance.
    Neural Architecture Search without Training. (arXiv:2006.04647v3 [cs.LG] UPDATED)
    (2 min) The time and effort involved in hand-designing deep neural networks is immense. This has prompted the development of Neural Architecture Search (NAS) techniques to automate this design. However, NAS algorithms tend to be slow and expensive; they need to train vast numbers of candidate networks to inform the search process. This could be alleviated if we could partially predict a network's trained accuracy from its initial state. In this work, we examine the overlap of activations between datapoints in untrained networks and motivate how this can give a measure which is usefully indicative of a network's trained performance. We incorporate this measure into a simple algorithm that allows us to search for powerful networks without any training in a matter of seconds on a single GPU, and verify its effectiveness on NAS-Bench-101, NAS-Bench-201, NATS-Bench, and Network Design Spaces. Our approach can be readily combined with more expensive search methods; we examine a simple adaptation of regularised evolutionary search. Code for reproducing our experiments is available at https://github.com/BayesWatch/nas-without-training.
    Towards an efficient approach for the nonconvex $\ell_p$ ball projection: algorithm and analysis. (arXiv:2101.01350v3 [math.OC] UPDATED)
    (2 min) This paper primarily focuses on computing the Euclidean projection of a vector onto the $\ell_{p}$ ball in which $p\in(0,1)$. Such a problem emerges as the core building block in statistical machine learning and signal processing tasks because of its ability to promote sparsity. However, efficient numerical algorithms for finding the projections are still not available, particularly in large-scale optimization. To meet this challenge, we first derive the first-order necessary optimality conditions of this problem using Fr\'echet normal cone. Based on this characterization, we develop a novel numerical approach for computing the stationary point through solving a sequence of projections onto the reweighted $\ell_{1}$-balls. This method is practically simple to implement and computationally efficient. Moreover, the proposed algorithm is shown to converge uniquely under mild conditions and has a worst-case $O(1/\sqrt{k})$ convergence rate. Numerical experiments demonstrate the efficiency of our proposed algorithm.
    Weakly Supervised Recovery of Semantic Attributes. (arXiv:2103.11888v2 [cs.LG] UPDATED)
    (2 min) We consider the problem of the extraction of semantic attributes, supervised only with classification labels. For example, when learning to classify images of birds into species, we would like to observe the emergence of features that zoologists use to classify birds. To tackle this problem, we propose training a neural network with discrete features in the last layer, which is followed by two heads: a multi-layered perceptron (MLP) and a decision tree. Since decision trees utilize simple binary decision stumps we expect those discrete features to obtain semantic meaning. We present a theoretical analysis as well as a practical method for learning in the intersection of two hypothesis classes. Our results on multiple benchmarks show an improved ability to extract a set of features that are highly correlated with the set of unseen attributes.
    Diffusion Asymptotics for Sequential Experiments. (arXiv:2101.09855v3 [math.ST] UPDATED)
    (2 min) We propose a new diffusion-asymptotic analysis for sequentially randomized experiments, including those that arise in solving multi-armed bandit problems. In an experiment with $ n $ time steps, we let the mean reward gaps between actions scale to the order $1/\sqrt{n}$ so as to preserve the difficulty of the learning task as $n$ grows. In this regime, we show that the behavior of a class of sequentially randomized Markov experiments converges to a diffusion limit, given as the solution of a stochastic differential equation. The diffusion limit thus enables us to derive refined, instance-specific characterization of the stochastic dynamics of adaptive experiments. As an application of this framework, we use the diffusion limit to obtain several new insights on the regret and belief evolution of Thompson sampling. We show that a version of Thompson sampling with an asymptotically uninformative prior variance achieves nearly-optimal instance-specific regret scaling when the reward gaps are relatively large. We also demonstrate that, in this regime, the posterior beliefs underlying Thompson sampling are highly unstable over time.
    Online and Distribution-Free Robustness: Regression and Contextual Bandits with Huber Contamination. (arXiv:2010.04157v3 [cs.LG] UPDATED)
    (2 min) In this work we revisit two classic high-dimensional online learning problems, namely linear regression and contextual bandits, from the perspective of adversarial robustness. Existing works in algorithmic robust statistics make strong distributional assumptions that ensure that the input data is evenly spread out or comes from a nice generative model. Is it possible to achieve strong robustness guarantees even without distributional assumptions altogether, where the sequence of tasks we are asked to solve is adaptively and adversarially chosen? We answer this question in the affirmative for both linear regression and contextual bandits. In fact our algorithms succeed where conventional methods fail. In particular we show strong lower bounds against Huber regression and more generally any convex M-estimator. Our approach is based on a novel alternating minimization scheme that interleaves ordinary least-squares with a simple convex program that finds the optimal reweighting of the distribution under a spectral constraint. Our results obtain essentially optimal dependence on the contamination level $\eta$, reach the optimal breakdown point, and naturally apply to infinite dimensional settings where the feature vectors are represented implicitly via a kernel map.
    Scene-Agnostic Multi-Microphone Speech Dereverberation. (arXiv:2010.11875v2 [eess.AS] UPDATED)
    (2 min) Neural networks (NNs) have been widely applied in speech processing tasks, and, in particular, those employing microphone arrays. Nevertheless, most existing NN architectures can only deal with fixed and position-specific microphone arrays. In this paper, we present an NN architecture that can cope with microphone arrays whose number and positions of the microphones are unknown, and demonstrate its applicability in the speech dereverberation task. To this end, our approach harnesses recent advances in deep learning on set-structured data to design an architecture that enhances the reverberant log-spectrum. We use noisy and noiseless versions of a simulated reverberant dataset to test the proposed architecture. Our experiments on the noisy data show that the proposed scene-agnostic setup outperforms a powerful scene-aware framework, sometimes even with fewer microphones. With the noiseless dataset we show that, in most cases, our method outperforms the position-aware network as well as the state-of-the-art weighted linear prediction error (WPE) algorithm.
    Environment Inference for Invariant Learning. (arXiv:2010.07249v4 [cs.LG] UPDATED)
    (2 min) Learning models that gracefully handle distribution shifts is central to research on domain generalization, robust optimization, and fairness. A promising formulation is domain-invariant learning, which identifies the key issue of learning which features are domain-specific versus domain-invariant. An important assumption in this area is that the training examples are partitioned into "domains" or "environments". Our focus is on the more common setting where such partitions are not provided. We propose EIIL, a general framework for domain-invariant learning that incorporates Environment Inference to directly infer partitions that are maximally informative for downstream Invariant Learning. We show that EIIL outperforms invariant learning methods on the CMNIST benchmark without using environment labels, and significantly outperforms ERM on worst-group performance in the Waterbirds and CivilComments datasets. Finally, we establish connections between EIIL and algorithmic fairness, which enables EIIL to improve accuracy and calibration in a fair prediction problem.
    Asynchronous \epsilon-Greedy Bayesian Optimisation. (arXiv:2010.07615v4 [cs.LG] UPDATED)
    (2 min) Batch Bayesian optimisation (BO) is a successful technique for the optimisation of expensive black-box functions. Asynchronous BO can reduce wallclock time by starting a new evaluation as soon as another finishes, thus maximising resource utilisation. To maximise resource allocation, we develop a novel asynchronous BO method, AEGiS (Asynchronous $\epsilon$-Greedy Global Search) that combines greedy search, exploiting the surrogate's mean prediction, with Thompson sampling and random selection from the approximate Pareto set describing the trade-off between exploitation (surrogate mean prediction) and exploration (surrogate posterior variance). We demonstrate empirically the efficacy of AEGiS on synthetic benchmark problems, meta-surrogate hyperparameter tuning problems and real-world problems, showing that AEGiS generally outperforms existing methods for asynchronous BO. When a single worker is available performance is no worse than BO using expected improvement.
    GraphNorm: A Principled Approach to Accelerating Graph Neural Network Training. (arXiv:2009.03294v3 [cs.LG] UPDATED)
    (2 min) Normalization is known to help the optimization of deep neural networks. Curiously, different architectures require specialized normalization methods. In this paper, we study what normalization is effective for Graph Neural Networks (GNNs). First, we adapt and evaluate the existing methods from other domains to GNNs. Faster convergence is achieved with InstanceNorm compared to BatchNorm and LayerNorm. We provide an explanation by showing that InstanceNorm serves as a preconditioner for GNNs, but such preconditioning effect is weaker with BatchNorm due to the heavy batch noise in graph datasets. Second, we show that the shift operation in InstanceNorm results in an expressiveness degradation of GNNs for highly regular graphs. We address this issue by proposing GraphNorm with a learnable shift. Empirically, GNNs with GraphNorm converge faster compared to GNNs using other normalization. GraphNorm also improves the generalization of GNNs, achieving better performance on graph classification benchmarks.
    Actionable Models: Unsupervised Offline Reinforcement Learning of Robotic Skills. (arXiv:2104.07749v3 [cs.RO] UPDATED)
    (2 min) We consider the problem of learning useful robotic skills from previously collected offline data without access to manually specified rewards or additional online exploration, a setting that is becoming increasingly important for scaling robot learning by reusing past robotic data. In particular, we propose the objective of learning a functional understanding of the environment by learning to reach any goal state in a given dataset. We employ goal-conditioned Q-learning with hindsight relabeling and develop several techniques that enable training in a particularly challenging offline setting. We find that our method can operate on high-dimensional camera images and learn a variety of skills on real robots that generalize to previously unseen scenes and objects. We also show that our method can learn to reach long-horizon goals across multiple episodes through goal chaining, and learn rich representations that can help with downstream tasks through pre-training or auxiliary objectives. The videos of our experiments can be found at https://actionable-models.github.io
    Distributional Soft Actor-Critic: Off-Policy Reinforcement Learning for Addressing Value Estimation Errors. (arXiv:2001.02811v3 [cs.LG] UPDATED)
    (2 min) In reinforcement learning (RL), function approximation errors are known to easily lead to the Q-value overestimations, thus greatly reducing policy performance. This paper presents a distributional soft actor-critic (DSAC) algorithm, which is an off-policy RL method for continuous control setting, to improve the policy performance by mitigating Q-value overestimations. We first discover in theory that learning a distribution function of state-action returns can effectively mitigate Q-value overestimations because it is capable of adaptively adjusting the update stepsize of the Q-value function. Then, a distributional soft policy iteration (DSPI) framework is developed by embedding the return distribution function into maximum entropy RL. Finally, we present a deep off-policy actor-critic variant of DSPI, called DSAC, which directly learns a continuous return distribution by keeping the variance of the state-action returns within a reasonable range to address exploding and vanishing gradient problems. We evaluate DSAC on the suite of MuJoCo continuous control tasks, achieving the state-of-the-art performance.
    Generalizable Episodic Memory for Deep Reinforcement Learning. (arXiv:2103.06469v3 [cs.LG] UPDATED)
    (2 min) Episodic memory-based methods can rapidly latch onto past successful strategies by a non-parametric memory and improve sample efficiency of traditional reinforcement learning. However, little effort is put into the continuous domain, where a state is never visited twice, and previous episodic methods fail to efficiently aggregate experience across trajectories. To address this problem, we propose Generalizable Episodic Memory (GEM), which effectively organizes the state-action values of episodic memory in a generalizable manner and supports implicit planning on memorized trajectories. GEM utilizes a double estimator to reduce the overestimation bias induced by value propagation in the planning process. Empirical evaluation shows that our method significantly outperforms existing trajectory-based methods on various MuJoCo continuous control tasks. To further show the general applicability, we evaluate our method on Atari games with discrete action space, which also shows a significant improvement over baseline algorithms.
    Sparse and Imperceptible Adversarial Attack via a Homotopy Algorithm. (arXiv:2106.06027v1 [cs.LG])
    (2 min) Sparse adversarial attacks can fool deep neural networks (DNNs) by only perturbing a few pixels (regularized by l_0 norm). Recent efforts combine it with another l_infty imperceptible on the perturbation magnitudes. The resultant sparse and imperceptible attacks are practically relevant, and indicate an even higher vulnerability of DNNs that we usually imagined. However, such attacks are more challenging to generate due to the optimization difficulty by coupling the l_0 regularizer and box constraints with a non-convex objective. In this paper, we address this challenge by proposing a homotopy algorithm, to jointly tackle the sparsity and the perturbation bound in one unified framework. Each iteration, the main step of our algorithm is to optimize an l_0-regularized adversarial loss, by leveraging the nonmonotone Accelerated Proximal Gradient Method (nmAPG) for nonconvex programming; it is followed by an l_0 change control step, and an optional post-attack step designed to escape bad local minima. We also extend the algorithm to handling the structural sparsity regularizer. We extensively examine the effectiveness of our proposed homotopy attack for both targeted and non-targeted attack scenarios, on CIFAR-10 and ImageNet datasets. Compared to state-of-the-art methods, our homotopy attack leads to significantly fewer perturbations, e.g., reducing 42.91% on CIFAR-10 and 75.03% on ImageNet (average case, targeted attack), at similar maximal perturbation magnitudes, when still achieving 100% attack success rates. Our codes are available at: https://github.com/VITA-Group/SparseADV_Homotopy.
    Lower-Bounded Proper Losses for Weakly Supervised Classification. (arXiv:2103.02893v2 [stat.ML] UPDATED)
    (2 min) This paper discusses the problem of weakly supervised classification, in which instances are given weak labels that are produced by some label-corruption process. The goal is to derive conditions under which loss functions for weak-label learning are proper and lower-bounded -- two essential requirements for the losses used in class-probability estimation. To this end, we derive a representation theorem for proper losses in supervised learning, which dualizes the Savage representation. We use this theorem to characterize proper weak-label losses and find a condition for them to be lower-bounded. From these theoretical findings, we derive a novel regularization scheme called generalized logit squeezing, which makes any proper weak-label loss bounded from below, without losing properness. Furthermore, we experimentally demonstrate the effectiveness of our proposed approach, as compared to improper or unbounded losses. The results highlight the importance of properness and lower-boundedness.
    Improved, Deterministic Smoothing for L_1 Certified Robustness. (arXiv:2103.10834v2 [cs.LG] UPDATED)
    (2 min) Randomized smoothing is a general technique for computing sample-dependent robustness guarantees against adversarial attacks for deep classifiers. Prior works on randomized smoothing against L_1 adversarial attacks use additive smoothing noise and provide probabilistic robustness guarantees. In this work, we propose a non-additive and deterministic smoothing method, Deterministic Smoothing with Splitting Noise (DSSN). To develop DSSN, we first develop SSN, a randomized method which involves generating each noisy smoothing sample by first randomly splitting the input space and then returning a representation of the center of the subdivision occupied by the input sample. In contrast to uniform additive smoothing, the SSN certification does not require the random noise components used to be independent. Thus, smoothing can be done effectively in just one dimension and can therefore be efficiently derandomized for quantized data (e.g., images). To the best of our knowledge, this is the first work to provide deterministic "randomized smoothing" for a norm-based adversarial threat model while allowing for an arbitrary classifier (i.e., a deep model) to be used as a base classifier and without requiring an exponential number of smoothing samples. On CIFAR-10 and ImageNet datasets, we provide substantially larger L_1 robustness certificates compared to prior works, establishing a new state-of-the-art. The determinism of our method also leads to significantly faster certificate computation. Code is available at: https://github.com/alevine0/smoothingSplittingNoise
    Moreau-Yosida $f$-divergences. (arXiv:2102.13416v2 [cs.LG] UPDATED)
    (2 min) Variational representations of $f$-divergences are central to many machine learning algorithms, with Lipschitz constrained variants recently gaining attention. Inspired by this, we define the Moreau-Yosida approximation of $f$-divergences with respect to the Wasserstein-$1$ metric. The corresponding variational formulas provide a generalization of a number of recent results, novel special cases of interest and a relaxation of the hard Lipschitz constraint. Additionally, we prove that the so-called tight variational representation of $f$-divergences can be to be taken over the quotient space of Lipschitz functions, and give a characterization of functions achieving the supremum in the variational representation. On the practical side, we propose an algorithm to calculate the tight convex conjugate of $f$-divergences compatible with automatic differentiation frameworks. As an application of our results, we propose the Moreau-Yosida $f$-GAN, providing an implementation of the variational formulas for the Kullback-Leibler, reverse Kullback-Leibler, $\chi^2$, reverse $\chi^2$, squared Hellinger, Jensen-Shannon, Jeffreys, triangular discrimination and total variation divergences as GANs trained on CIFAR-10, leading to competitive results and a simple solution to the problem of uniqueness of the optimal critic.
    Generative Archimedean Copulas. (arXiv:2102.11351v3 [cs.LG] UPDATED)
    (2 min) We propose a new generative modeling technique for learning multidimensional cumulative distribution functions (CDFs) in the form of copulas. Specifically, we consider certain classes of copulas known as Archimedean and hierarchical Archimedean copulas, popular for their parsimonious representation and ability to model different tail dependencies. We consider their representation as mixture models with Laplace transforms of latent random variables from generative neural networks. This alternative representation allows for computational efficiencies and easy sampling, especially in high dimensions. We describe multiple methods for optimizing the network parameters. Finally, we present empirical results that demonstrate the efficacy of our proposed method in learning multidimensional CDFs and its computational efficiency compared to existing methods.
    Multi-Task Reinforcement Learning with Context-based Representations. (arXiv:2102.06177v2 [cs.LG] UPDATED)
    (2 min) The benefit of multi-task learning over single-task learning relies on the ability to use relations across tasks to improve performance on any single task. While sharing representations is an important mechanism to share information across tasks, its success depends on how well the structure underlying the tasks is captured. In some real-world situations, we have access to metadata, or additional information about a task, that may not provide any new insight in the context of a single task setup alone but inform relations across multiple tasks. While this metadata can be useful for improving multi-task learning performance, effectively incorporating it can be an additional challenge. We posit that an efficient approach to knowledge transfer is through the use of multiple context-dependent, composable representations shared across a family of tasks. In this framework, metadata can help to learn interpretable representations and provide the context to inform which representations to compose and how to compose them. We use the proposed approach to obtain state-of-the-art results in Meta-World, a challenging multi-task benchmark consisting of 50 distinct robotic manipulation tasks.
    COVID-19 Classification Using Staked Ensembles: A Comprehensive Analysis. (arXiv:2010.05690v2 [cs.CV] UPDATED)
    (2 min) The issue of COVID-19, increasing with a massive mortality rate. This led to the WHO declaring it as a pandemic. In this situation, it is crucial to perform efficient and fast diagnosis. The reverse transcript polymerase chain reaction (RTPCR) test is conducted to detect the presence of SARS-CoV-2. This test is time-consuming and instead chest CT (or Chest X-ray) can be used for a fast and accurate diagnosis. Automated diagnosis is considered to be important as it reduces human effort and provides accurate and low-cost tests. The contributions of our research are three-fold. First, it is aimed to analyse the behaviour and performance of variant vision models ranging from Inception to NAS networks with the appropriate fine-tuning procedure. Second, the behaviour of these models is visually analysed by plotting CAMs for individual networks and determining classification performance with AUCROC curves. Thirdly, stacked ensembles techniques are imparted to provide higher generalisation on combining the fine-tuned models, in which six ensemble neural networks are designed by combining the existing fine-tuned networks. Implying these stacked ensembles provides a great generalization to the models. The ensemble model designed by combining all the fine-tuned networks obtained a state-of-the-art accuracy score of 99.17%. The precision and recall for the COVID-19 class are 99.99% and 89.79% respectively, which resembles the robustness of the stacked ensembles.
    SparseBERT: Rethinking the Importance Analysis in Self-attention. (arXiv:2102.12871v2 [cs.LG] UPDATED)
    (2 min) Transformer-based models are popularly used in natural language processing (NLP). Its core component, self-attention, has aroused widespread interest. To understand the self-attention mechanism, a direct method is to visualize the attention map of a pre-trained model. Based on the patterns observed, a series of efficient Transformers with different sparse attention masks have been proposed. From a theoretical perspective, universal approximability of Transformer-based models is also recently proved. However, the above understanding and analysis of self-attention is based on a pre-trained model. To rethink the importance analysis in self-attention, we study the significance of different positions in attention matrix during pre-training. A surprising result is that diagonal elements in the attention map are the least important compared with other attention positions. We provide a proof showing that these diagonal elements can indeed be removed without deteriorating model performance. Furthermore, we propose a Differentiable Attention Mask (DAM) algorithm, which further guides the design of the SparseBERT. Extensive experiments verify our interesting findings and illustrate the effect of the proposed algorithm.
    Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections. (arXiv:2102.07006v2 [stat.ML] UPDATED)
    (2 min) Gaussian noise injections (GNIs) are a family of simple and widely-used regularisation methods for training neural networks, where one injects additive or multiplicative Gaussian noise to the network activations at every iteration of the optimisation algorithm, which is typically chosen as stochastic gradient descent (SGD). In this paper we focus on the so-called `implicit effect' of GNIs, which is the effect of the injected noise on the dynamics of SGD. We show that this effect induces an asymmetric heavy-tailed noise on SGD gradient updates. In order to model this modified dynamics, we first develop a Langevin-like stochastic differential equation that is driven by a general family of asymmetric heavy-tailed noise. Using this model we then formally prove that GNIs induce an `implicit bias', which varies depending on the heaviness of the tails and the level of asymmetry. Our empirical results confirm that different types of neural networks trained with GNIs are well-modelled by the proposed dynamics and that the implicit effect of these injections induces a bias that degrades the performance of networks.
    A Framework to Enhance Generalization of Deep Metric Learning methods using General Discriminative Feature Learning and Class Adversarial Neural Networks. (arXiv:2106.06420v1 [cs.CV])
    (2 min) Metric learning algorithms aim to learn a distance function that brings the semantically similar data items together and keeps dissimilar ones at a distance. The traditional Mahalanobis distance learning is equivalent to find a linear projection. In contrast, Deep Metric Learning (DML) methods are proposed that automatically extract features from data and learn a non-linear transformation from input space to a semantically embedding space. Recently, many DML methods are proposed focused to enhance the discrimination power of the learned metric by providing novel sampling strategies or loss functions. This approach is very helpful when both the training and test examples are coming from the same set of categories. However, it is less effective in many applications of DML such as image retrieval and person-reidentification. Here, the DML should learn general semantic concepts from observed classes and employ them to rank or identify objects from unseen categories. Neglecting the generalization ability of the learned representation and just emphasizing to learn a more discriminative embedding on the observed classes may lead to the overfitting problem. To address this limitation, we propose a framework to enhance the generalization power of existing DML methods in a Zero-Shot Learning (ZSL) setting by general yet discriminative representation learning and employing a class adversarial neural network. To learn a more general representation, we propose to employ feature maps of intermediate layers in a deep neural network and enhance their discrimination power through an attention mechanism. Besides, a class adversarial network is utilized to enforce the deep model to seek class invariant features for the DML task. We evaluate our work on widely used machine vision datasets in a ZSL setting.
    Nonmyopic Multifidelity Active Search. (arXiv:2106.06356v1 [cs.LG])
    (2 min) Active search is a learning paradigm where we seek to identify as many members of a rare, valuable class as possible given a labeling budget. Previous work on active search has assumed access to a faithful (and expensive) oracle reporting experimental results. However, some settings offer access to cheaper surrogates such as computational simulation that may aid in the search. We propose a model of multifidelity active search, as well as a novel, computationally efficient policy for this setting that is motivated by state-of-the-art classical policies. Our policy is nonmyopic and budget aware, allowing for a dynamic tradeoff between exploration and exploitation. We evaluate the performance of our solution on real-world datasets and demonstrate significantly better performance than natural benchmarks.
    Revealing the Structure of Deep Neural Networks via Convex Duality. (arXiv:2002.09773v4 [cs.LG] UPDATED)
    (2 min) We study regularized deep neural networks (DNNs) and introduce a convex analytic framework to characterize the structure of the hidden layers. We show that a set of optimal hidden layer weights for a norm regularized DNN training problem can be explicitly found as the extreme points of a convex set. For the special case of deep linear networks, we prove that each optimal weight matrix aligns with the previous layers via duality. More importantly, we apply the same characterization to deep ReLU networks with whitened data and prove the same weight alignment holds. As a corollary, we also prove that norm regularized deep ReLU networks yield spline interpolation for one-dimensional datasets which was previously known only for two-layer networks. Furthermore, we provide closed-form solutions for the optimal layer weights when data is rank-one or whitened. The same analysis also applies to architectures with batch normalization even for arbitrary data. Therefore, we obtain a complete explanation for a recent empirical observation termed Neural Collapse where class means collapse to the vertices of a simplex equiangular tight frame.
    Verifying Quantized Neural Networks using SMT-Based Model Checking. (arXiv:2106.05997v1 [cs.LG])
    (2 min) Artificial Neural Networks (ANNs) are being deployed on an increasing number of safety-critical applications, including autonomous cars and medical diagnosis. However, concerns about their reliability have been raised due to their black-box nature and apparent fragility to adversarial attacks. Here, we develop and evaluate a symbolic verification framework using incremental model checking (IMC) and satisfiability modulo theories (SMT) to check for vulnerabilities in ANNs. More specifically, we propose several ANN-related optimizations for IMC, including invariant inference via interval analysis and the discretization of non-linear activation functions. With this, we can provide guarantees on the safe behavior of ANNs implemented both in floating-point and fixed-point (quantized) arithmetic. In this regard, our verification approach was able to verify and produce adversarial examples for 52 test cases spanning image classification and general machine learning applications. For small- to medium-sized ANN, our approach completes most of its verification runs in minutes. Moreover, in contrast to most state-of-the-art methods, our approach is not restricted to specific choices of activation functions or non-quantized representations.
    Inference of Causal Effects when Control Variables are Unknown. (arXiv:2012.08154v3 [stat.ME] UPDATED)
    (2 min) Conventional methods in causal effect inferencetypically rely on specifying a valid set of control variables. When this set is unknown or misspecified, inferences will be erroneous. We propose a method for inferring average causal effects when all potential confounders are observed, but thecontrol variables are unknown. When the data-generating process belongs to the class of acyclical linear structural causal models, we prove that themethod yields asymptotically valid confidence intervals. Our results build upon a smooth characterization of linear directed acyclic graphs. We verify the capability of the method to produce valid confidence intervals for average causal effects using synthetic data, even when the appropriate specification of control variables is unknown.
    The Multi-Agent Behavior Dataset: Mouse Dyadic Social Interactions. (arXiv:2104.02710v3 [cs.LG] UPDATED)
    (2 min) Multi-agent behavior modeling aims to understand the interactions that occur between agents. We present a multi-agent dataset from behavioral neuroscience, the Caltech Mouse Social Interactions (CalMS21) Dataset. Our dataset consists of trajectory data of social interactions, recorded from videos of freely behaving mice in a standard resident-intruder assay. To help accelerate behavioral studies, the CalMS21 dataset provides benchmarks to evaluate the performance of automated behavior classification methods in three settings: (1) for training on large behavioral datasets all annotated by a single annotator, (2) for style transfer to learn inter-annotator differences in behavior definitions, and (3) for learning of new behaviors of interest given limited training data. The dataset consists of 6 million frames of unlabeled tracked poses of interacting mice, as well as over 1 million frames with tracked poses and corresponding frame-level behavior annotations. The challenge of our dataset is to be able to classify behaviors accurately using both labeled and unlabeled tracking data, as well as being able to generalize to new settings.
    Can we have it all? On the Trade-off between Spatial and Adversarial Robustness of Neural Networks. (arXiv:2002.11318v4 [cs.LG] UPDATED)
    (2 min) (Non-)robustness of neural networks to small, adversarial pixel-wise perturbations, and as more recently shown, to even random spatial transformations (e.g., translations, rotations) entreats both theoretical and empirical understanding. Spatial robustness to random translations and rotations is commonly attained via equivariant models (e.g., StdCNNs, GCNNs) and training augmentation, whereas adversarial robustness is typically achieved by adversarial training. In this paper, we prove a quantitative trade-off between spatial and adversarial robustness in a simple statistical setting. We complement this empirically by showing that: (a) as the spatial robustness of equivariant models improves by training augmentation with progressively larger transformations, their adversarial robustness worsens progressively, and (b) as the state-of-the-art robust models are adversarially trained with progressively larger pixel-wise perturbations, their spatial robustness drops progressively. Towards achieving pareto-optimality in this trade-off, we propose a method based on curriculum learning that trains gradually on more difficult perturbations (both spatial and adversarial) to improve spatial and adversarial robustness simultaneously.
    Deep Network Approximation for Smooth Functions. (arXiv:2001.03040v4 [cs.LG] UPDATED)
    (2 min) This paper establishes optimal approximation error characterization of deep ReLU networks for smooth functions in terms of both width and depth simultaneously. To that end, we first prove that multivariate polynomials can be approximated by deep ReLU networks of width $\mathcal{O}(N)$ and depth $\mathcal{O}(L)$ with an approximation error $\mathcal{O}(N^{-L})$. Through local Taylor expansions and their deep ReLU network approximations, we show that deep ReLU networks of width $\mathcal{O}(N\ln N)$ and depth $\mathcal{O}(L\ln L)$ can approximate $f\in C^s([0,1]^d)$ with a nearly optimal approximation rate $\mathcal{O}(\|f\|_{C^s([0,1]^d)}N^{-2s/d}L^{-2s/d})$. Our estimate is non-asymptotic in the sense that it is valid for arbitrary width and depth specified by $N\in\mathbb{N}^+$ and $L\in\mathbb{N}^+$, respectively.
    Monotonic Neural Network: combining Deep Learning with Domain Knowledge for Chiller Plants Energy Optimization. (arXiv:2106.06143v1 [eess.SP])
    (2 min) In this paper, we are interested in building a domain knowledge based deep learning framework to solve the chiller plants energy optimization problems. Compared to the hotspot applications of deep learning (e.g. image classification and NLP), it is difficult to collect enormous data for deep network training in real-world physical systems. Most existing methods reduce the complex systems into linear model to facilitate the training on small samples. To tackle the small sample size problem, this paper considers domain knowledge in the structure and loss design of deep network to build a nonlinear model with lower redundancy function space. Specifically, the energy consumption estimation of most chillers can be physically viewed as an input-output monotonic problem. Thus, we can design a Neural Network with monotonic constraints to mimic the physical behavior of the system. We verify the proposed method in a cooling system of a data center, experimental results show the superiority of our framework in energy optimization compared to the existing ones.
    Exploiting Record Similarity for Practical Vertical Federated Learning. (arXiv:2106.06312v1 [cs.LG])
    (2 min) As the privacy of machine learning has drawn increasing attention, federated learning is introduced to enable collaborative learning without revealing raw data. Notably, \textit{vertical federated learning} (VFL), where parties share the same set of samples but only hold partial features, has a wide range of real-world applications. However, existing studies in VFL rarely study the ``record linkage'' process. They either design algorithms assuming the data from different parties have been linked or use simple linkage methods like exact-linkage or top1-linkage. These approaches are unsuitable for many applications, such as the GPS location and noisy titles requiring fuzzy matching. In this paper, we design a novel similarity-based VFL framework, FedSim, which is suitable for more real-world applications and achieves higher performance on traditional VFL tasks. Moreover, we theoretically analyze the privacy risk caused by sharing similarities. Our experiments on three synthetic datasets and five real-world datasets with various similarity metrics show that FedSim consistently outperforms other state-of-the-art baselines.
    PriorGrad: Improving Conditional Denoising Diffusion Models with Data-Driven Adaptive Prior. (arXiv:2106.06406v1 [stat.ML])
    (2 min) Denoising diffusion probabilistic models have been recently proposed to generate high-quality samples by estimating the gradient of the data density. The framework assumes the prior noise as a standard Gaussian distribution, whereas the corresponding data distribution may be more complicated than the standard Gaussian distribution, which potentially introduces inefficiency in denoising the prior noise into the data sample because of the discrepancy between the data and the prior. In this paper, we propose PriorGrad to improve the efficiency of the conditional diffusion model (for example, a vocoder using a mel-spectrogram as the condition) by applying an adaptive prior derived from the data statistics based on the conditional information. We formulate the training and sampling procedures of PriorGrad and demonstrate the advantages of an adaptive prior through a theoretical analysis. Focusing on the audio domain, we consider the recently proposed diffusion-based audio generative models based on both the spectral and time domains and show that PriorGrad achieves a faster convergence leading to data and parameter efficiency and improved quality, and thereby demonstrating the efficiency of a data-driven adaptive prior.
    KRADA: Known-region-aware Domain Alignment for Open World Semantic Segmentation. (arXiv:2106.06237v1 [eess.IV])
    (2 min) In semantic segmentation, we aim to train a pixel-level classifier to assign category labels to all pixels in an image, where labeled training images and unlabeled test images are from the same distribution and share the same label set. However, in an open world, the unlabeled test images probably contain unknown categories and have different distributions from the labeled images. Hence, in this paper, we consider a new, more realistic, and more challenging problem setting where the pixel-level classifier has to be trained with labeled images and unlabeled open-world images -- we name it open world semantic segmentation (OSS). In OSS, the trained classifier is expected to identify unknown-class pixels and classify known-class pixels well. To solve OSS, we first investigate which distribution that unknown-class pixels obey. Then, motivated by the goodness-of-fit test, we use statistical measurements to show how a pixel fits the distribution of an unknown class and select highly-fitted pixels to form the unknown region in each image. Eventually, we propose an end-to-end learning framework, known-region-aware domain alignment (KRADA), to distinguish unknown classes while aligning distributions of known classes in labeled and unlabeled open-world images. The effectiveness of KRADA has been verified on two synthetic tasks and one COVID-19 segmentation task.
    Preferential Temporal Difference Learning. (arXiv:2106.06508v1 [cs.LG])
    (2 min) Temporal-Difference (TD) learning is a general and very useful tool for estimating the value function of a given policy, which in turn is required to find good policies. Generally speaking, TD learning updates states whenever they are visited. When the agent lands in a state, its value can be used to compute the TD-error, which is then propagated to other states. However, it may be interesting, when computing updates, to take into account other information than whether a state is visited or not. For example, some states might be more important than others (such as states which are frequently seen in a successful trajectory). Or, some states might have unreliable value estimates (for example, due to partial observability or lack of data), making their values less desirable as targets. We propose an approach to re-weighting states used in TD updates, both when they are the input and when they provide the target for the update. We prove that our approach converges with linear function approximation and illustrate its desirable empirical behaviour compared to other TD-style methods.
    HIFI: Anomaly Detection for Multivariate Time Series with High-order Feature Interactions. (arXiv:2106.06167v1 [cs.LG])
    (2 min) Monitoring complex systems results in massive multivariate time series data, and anomaly detection of these data is very important to maintain the normal operation of the systems. Despite the recent emergence of a large number of anomaly detection algorithms for multivariate time series, most of them ignore the correlation modeling among multivariate, which can often lead to poor anomaly detection results. In this work, we propose a novel anomaly detection model for multivariate time series with \underline{HI}gh-order \underline{F}eature \underline{I}nteractions (HIFI). More specifically, HIFI builds multivariate feature interaction graph automatically and uses the graph convolutional neural network to achieve high-order feature interactions, in which the long-term temporal dependencies are modeled by attention mechanisms and a variational encoding technique is utilized to improve the model performance and robustness. Extensive experiments on three publicly available datasets demonstrate the superiority of our framework compared with state-of-the-art approaches.
    Data-driven battery operation for energy arbitrage using rainbow deep reinforcement learning. (arXiv:2106.06061v1 [cs.LG])
    (2 min) As the world seeks to become more sustainable, intelligent solutions are needed to increase the penetration of renewable energy. In this paper, the model-free deep reinforcement learning algorithm Rainbow Deep Q-Networks is used to control a battery in a small microgrid to perform energy arbitrage and more efficiently utilise solar and wind energy sources. The grid operates with its own demand and renewable generation based on a dataset collected at Keele University, as well as using dynamic energy pricing from a real wholesale energy market. Four scenarios are tested including using demand and price forecasting produced with local weather data. The algorithm and its subcomponents are evaluated against two continuous control benchmarks with Rainbow able to outperform all other method. This research shows the importance of using the distributional approach for reinforcement learning when working with complex environments and reward functions, as well as how it can be used to visualise and contextualise the agent's behaviour for real-world applications.
    Hierarchical Probabilistic Model for Blind Source Separation via Legendre Transformation. (arXiv:1909.11294v3 [stat.ML] UPDATED)
    (2 min) We present a novel blind source separation (BSS) method, called information geometric blind source separation (IGBSS). Our formulation is based on the log-linear model equipped with a hierarchically structured sample space, which has theoretical guarantees to uniquely recover a set of source signals by minimizing the KL divergence from a set of mixed signals. Source signals, received signals, and mixing matrices are realized as different layers in our hierarchical sample space. Our empirical results have demonstrated on images and time series data that our approach is superior to well established techniques and is able to separate signals with complex interactions.
    On Robust Mean Estimation under Coordinate-level Corruption. (arXiv:2002.04137v5 [cs.LG] UPDATED)
    (2 min) We study the problem of robust mean estimation and introduce a novel Hamming distance-based measure of distribution shift for coordinate-level corruptions. We show that this measure yields adversary models that capture more realistic corruptions than those used in prior works, and present an information-theoretic analysis of robust mean estimation in these settings. We show that for structured distributions, methods that leverage the structure yield information theoretically more accurate mean estimation. We also focus on practical algorithms for robust mean estimation and study when data cleaning-inspired approaches that first fix corruptions in the input data and then perform robust mean estimation can match the information theoretic bounds of our analysis. We finally demonstrate experimentally that this two-step approach outperforms structure-agnostic robust estimation and provides accurate mean estimation even for high-magnitude corruption.
    Generate, Annotate, and Learn: Generative Models Advance Self-Training and Knowledge Distillation. (arXiv:2106.06168v1 [cs.LG])
    (2 min) Semi-Supervised Learning (SSL) has seen success in many application domains, but this success often hinges on the availability of task-specific unlabeled data. Knowledge distillation (KD) has enabled compressing deep networks and ensembles, achieving the best results when distilling knowledge on fresh task-specific unlabeled examples. However, task-specific unlabeled data can be challenging to find. We present a general framework called "generate, annotate, and learn (GAL)" that uses unconditional generative models to synthesize in-domain unlabeled data, helping advance SSL and KD on different tasks. To obtain strong task-specific generative models, we adopt generic generative models, pretrained on open-domain data, and fine-tune them on inputs from specific tasks. Then, we use existing classifiers to annotate generated unlabeled examples with soft pseudo labels, which are used for additional training. When self-training is combined with samples generated from GPT2-large, fine-tuned on the inputs of each GLUE task, we outperform a strong RoBERTa-large baseline on the GLUE benchmark. Moreover, KD on GPT-2 samples yields a new state-of-the-art for 6-layer transformers on the GLUE leaderboard. Finally, self-training with GAL offers significant gains on image classification on CIFAR-10 and four tabular tasks from the UCI repository
    DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning. (arXiv:2106.06135v1 [cs.AI])
    (2 min) Games are abstractions of the real world, where artificial agents learn to compete and cooperate with other agents. While significant achievements have been made in various perfect- and imperfect-information games, DouDizhu (a.k.a. Fighting the Landlord), a three-player card game, is still unsolved. DouDizhu is a very challenging domain with competition, collaboration, imperfect information, large state space, and particularly a massive set of possible actions where the legal actions vary significantly from turn to turn. Unfortunately, modern reinforcement learning algorithms mainly focus on simple and small action spaces, and not surprisingly, are shown not to make satisfactory progress in DouDizhu. In this work, we propose a conceptually simple yet effective DouDizhu AI system, namely DouZero, which enhances traditional Monte-Carlo methods with deep neural networks, action encoding, and parallel actors. Starting from scratch in a single server with four GPUs, DouZero outperformed all the existing DouDizhu AI programs in days of training and was ranked the first in the Botzone leaderboard among 344 AI agents. Through building DouZero, we show that classic Monte-Carlo methods can be made to deliver strong results in a hard domain with a complex action space. The code and an online demo are released at https://github.com/kwai/DouZero with the hope that this insight could motivate future work.
    Adversarial purification with Score-based generative models. (arXiv:2106.06041v1 [cs.LG])
    (2 min) While adversarial training is considered as a standard defense method against adversarial attacks for image classifiers, adversarial purification, which purifies attacked images into clean images with a standalone purification model, has shown promises as an alternative defense method. Recently, an Energy-Based Model (EBM) trained with Markov-Chain Monte-Carlo (MCMC) has been highlighted as a purification model, where an attacked image is purified by running a long Markov-chain using the gradients of the EBM. Yet, the practicality of the adversarial purification using an EBM remains questionable because the number of MCMC steps required for such purification is too large. In this paper, we propose a novel adversarial purification method based on an EBM trained with Denoising Score-Matching (DSM). We show that an EBM trained with DSM can quickly purify attacked images within a few steps. We further introduce a simple yet effective randomized purification scheme that injects random noises into images before purification. This process screens the adversarial perturbations imposed on images by the random noises and brings the images to the regime where the EBM can denoise well. We show that our purification method is robust against various attacks and demonstrate its state-of-the-art performances.
    LocoProp: Enhancing BackProp via Local Loss Optimization. (arXiv:2106.06199v1 [cs.LG])
    (2 min) We study a local loss construction approach for optimizing neural networks. We start by motivating the problem as minimizing a squared loss between the pre-activations of each layer and a local target, plus a regularizer term on the weights. The targets are chosen so that the first gradient descent step on the local objectives recovers vanilla BackProp, while the exact solution to each problem results in a preconditioned gradient update. We improve the local loss construction by forming a Bregman divergence in each layer tailored to the transfer function which keeps the local problem convex w.r.t. the weights. The generalized local problem is again solved iteratively by taking small gradient descent steps on the weights, for which the first step recovers BackProp. We run several ablations and show that our construction consistently improves convergence, reducing the gap between first-order and second-order methods.
    Possibility results for graph clustering: A novel consistency axiom. (arXiv:1806.06142v5 [cs.LG] UPDATED)
    (2 min) Kleinberg introduced three natural clustering properties, or axioms, and showed they cannot be simultaneously satisfied by any clustering algorithm. We present a new clustering property, Monotonic Consistency, which avoids the well-known problematic behaviour of Kleinberg's Consistency axiom, and the impossibility result. Namely, we describe a clustering algorithm, Morse Clustering, inspired by Morse Theory in Differential Topology, which satisfies Kleinberg's original axioms with Consistency replaced by Monotonic Consistency. Morse clustering uncovers the underlying flow structure on a set or graph and returns a partition into trees representing basins of attraction of critical vertices. We also generalise Kleinberg's axiomatic approach to sparse graphs, showing an impossibility result for Consistency, and a possibility result for Monotonic Consistency and Morse clustering.
    Best Arm Identification in Graphical Bilinear Bandits. (arXiv:2012.07641v3 [cs.LG] UPDATED)
    (2 min) We introduce a new graphical bilinear bandit problem where a learner (or a \emph{central entity}) allocates arms to the nodes of a graph and observes for each edge a noisy bilinear reward representing the interaction between the two end nodes. We study the best arm identification problem in which the learner wants to find the graph allocation maximizing the sum of the bilinear rewards. By efficiently exploiting the geometry of this bandit problem, we propose a \emph{decentralized} allocation strategy based on random sampling with theoretical guarantees. In particular, we characterize the influence of the graph structure (e.g. star, complete or circle) on the convergence rate and propose empirical experiments that confirm this dependency.
    Unsupervised Neural Hidden Markov Models with a Continuous latent state space. (arXiv:2106.06536v1 [cs.LG])
    (2 min) We introduce a new procedure to neuralize unsupervised Hidden Markov Models in the continuous case. This provides higher flexibility to solve problems with underlying latent variables. This approach is evaluated on both synthetic and real data. On top of generating likely model parameters with comparable performances to off-the-shelf neural architecture (LSTMs, GRUs,..), the obtained results are easily interpretable.
    Probabilistic bounds on neuron death in deep rectifier networks. (arXiv:2007.06192v2 [cs.LG] UPDATED)
    (2 min) Neuron death is a complex phenomenon with implications for model trainability: the deeper the network, the lower the probability of finding a valid initialization. In this work, we derive both upper and lower bounds on the probability that a ReLU network is initialized to a trainable point, as a function of model hyperparameters. We show that it is possible to increase the depth of a network indefinitely, so long as the width increases as well. Furthermore, our bounds are asymptotically tight under reasonable assumptions: first, the upper bound coincides with the true probability for a single-layer network with the largest possible input set. Second, the true probability converges to our lower bound as the input set shrinks to a single point, or as the network complexity grows under an assumption about the output variance. We confirm these results by numerical simulation, showing rapid convergence to the lower bound with increasing network depth. Then, motivated by the theory, we propose a practical sign flipping scheme which guarantees that the ratio of living data points in a $k$-layer network is at least $2^{-k}$. Finally, we show how these issues are mitigated by network design features currently seen in practice, such as batch normalization, residual connections, dense networks and skip connections. This suggests that neuron death may provide insight into the efficacy of various model architectures.
    A self-adapting super-resolution structures framework for automatic design of GAN. (arXiv:2106.06011v1 [cs.CV])
    (2 min) With the development of deep learning, the single super-resolution image reconstruction network models are becoming more and more complex. Small changes in hyperparameters of the models have a greater impact on model performance. In the existing works, experts have gradually explored a set of optimal model parameters based on empirical values or performing brute-force search. In this paper, we introduce a new super-resolution image reconstruction generative adversarial network framework, and a Bayesian optimization method used to optimizing the hyperparameters of the generator and discriminator. The generator is made by self-calibrated convolution, and discriminator is made by convolution lays. We have defined the hyperparameters such as the number of network layers and the number of neurons. Our method adopts Bayesian optimization as a optimization policy of GAN in our model. Not only can find the optimal hyperparameter solution automatically, but also can construct a super-resolution image reconstruction network, reducing the manual workload. Experiments show that Bayesian optimization can search the optimal solution earlier than the other two optimization algorithms.
    A Nonmyopic Approach to Cost-Constrained Bayesian Optimization. (arXiv:2106.06079v1 [cs.LG])
    (2 min) Bayesian optimization (BO) is a popular method for optimizing expensive-to-evaluate black-box functions. BO budgets are typically given in iterations, which implicitly assumes each evaluation has the same cost. In fact, in many BO applications, evaluation costs vary significantly in different regions of the search space. In hyperparameter optimization, the time spent on neural network training increases with layer size; in clinical trials, the monetary cost of drug compounds vary; and in optimal control, control actions have differing complexities. Cost-constrained BO measures convergence with alternative cost metrics such as time, money, or energy, for which the sample efficiency of standard BO methods is ill-suited. For cost-constrained BO, cost efficiency is far more important than sample efficiency. In this paper, we formulate cost-constrained BO as a constrained Markov decision process (CMDP), and develop an efficient rollout approximation to the optimal CMDP policy that takes both the cost and future iterations into account. We validate our method on a collection of hyperparameter optimization problems as well as a sensor set selection application.
    Offline Reinforcement Learning as Anti-Exploration. (arXiv:2106.06431v1 [cs.LG])
    (2 min) Offline Reinforcement Learning (RL) aims at learning an optimal control from a fixed dataset, without interactions with the system. An agent in this setting should avoid selecting actions whose consequences cannot be predicted from the data. This is the converse of exploration in RL, which favors such actions. We thus take inspiration from the literature on bonus-based exploration to design a new offline RL agent. The core idea is to subtract a prediction-based exploration bonus from the reward, instead of adding it for exploration. This allows the policy to stay close to the support of the dataset. We connect this approach to a more common regularization of the learned policy towards the data. Instantiated with a bonus based on the prediction error of a variational autoencoder, we show that our agent is competitive with the state of the art on a set of continuous control locomotion and manipulation tasks.
    Quantifying and Reducing Bias in Maximum Likelihood Estimation of Structured Anomalies. (arXiv:2007.07878v2 [cs.LG] UPDATED)
    (2 min) Anomaly estimation, or the problem of finding a subset of a dataset that differs from the rest of the dataset, is a classic problem in machine learning and data mining. In both theoretical work and in applications, the anomaly is assumed to have a specific structure defined by membership in an $\textit{anomaly family}$. For example, in temporal data the anomaly family may be time intervals, while in network data the anomaly family may be connected subgraphs. The most prominent approach for anomaly estimation is to compute the Maximum Likelihood Estimator (MLE) of the anomaly; however, it was recently observed that for normally distributed data, the MLE is a $\textit{biased}$ estimator for some anomaly families. In this work, we demonstrate that in the normal means setting, the bias of the MLE depends on the size of the anomaly family. We prove that if the number of sets in the anomaly family that contain the anomaly is sub-exponential, then the MLE is asymptotically unbiased. We also provide empirical evidence that the converse is true: if the number of such sets is exponential, then the MLE is asymptotically biased. Our analysis unifies a number of earlier results on the bias of the MLE for specific anomaly families. Next, we derive a new anomaly estimator using a mixture model, and we prove that our anomaly estimator is asymptotically unbiased regardless of the size of the anomaly family. We illustrate the advantages of our estimator versus the MLE on disease outbreak and highway traffic data.
    Courteous Behavior of Automated Vehicles at Unsignalized Intersections via Reinforcement Learning. (arXiv:2106.06369v1 [cs.LG])
    (2 min) The transition from today's mostly human-driven traffic to a purely automated one will be a gradual evolution, with the effect that we will likely experience mixed traffic in the near future. Connected and automated vehicles can benefit human-driven ones and the whole traffic system in different ways, for example by improving collision avoidance and reducing traffic waves. Many studies have been carried out to improve intersection management, a significant bottleneck in traffic, with intelligent traffic signals or exclusively automated vehicles. However, the problem of how to improve mixed traffic at unsignalized intersections has received less attention. In this paper, we propose a novel approach to optimizing traffic flow at intersections in mixed traffic situations using deep reinforcement learning. Our reinforcement learning agent learns a policy for a centralized controller to let connected autonomous vehicles at unsignalized intersections give up their right of way and yield to other vehicles to optimize traffic flow. We implemented our approach and tested it in the traffic simulator SUMO based on simulated and real traffic data. The experimental evaluation demonstrates that our method significantly improves traffic flow through unsignalized intersections in mixed traffic settings and also provides better performance on a wide range of traffic situations compared to the state-of-the-art traffic signal controller for the corresponding signalized intersection.
    Analyzing the Travel and Charging Behavior of Electric Vehicles -- A Data-driven Approach. (arXiv:2106.06475v1 [cs.LG])
    (2 min) The increasing market penetration of electric vehicles (EVs) may pose significant electricity demand on power systems. This electricity demand is affected by the inherent uncertainties of EVs' travel behavior that makes forecasting the daily charging demand (CD) very challenging. In this project, we use the National House Hold Survey (NHTS) data to form sequences of trips, and develop machine learning models to predict the parameters of the next trip of the drivers, including trip start time, end time, and distance. These parameters are later used to model the temporal charging behavior of EVs. The simulation results show that the proposed modeling can effectively estimate the daily CD pattern based on travel behavior of EVs, and simple machine learning techniques can forecast the travel parameters with acceptable accuracy.
    Locally Sparse Networks for Interpretable Predictions. (arXiv:2106.06468v1 [cs.LG])
    (2 min) Despite the enormous success of neural networks, they are still hard to interpret and often overfit when applied to low-sample-size (LSS) datasets. To tackle these obstacles, we propose a framework for training locally sparse neural networks where the local sparsity is learned via a sample-specific gating mechanism that identifies the subset of most relevant features for each measurement. The sample-specific sparsity is predicted via a \textit{gating} network, which is trained in tandem with the \textit{prediction} network. By learning these subsets and weights of a prediction model, we obtain an interpretable neural network that can handle LSS data and can remove nuisance variables, which are irrelevant for the supervised learning task. Using both synthetic and real-world datasets, we demonstrate that our method outperforms state-of-the-art models when predicting the target function with far fewer features per instance.
    Continuous Herded Gibbs Sampling. (arXiv:2106.06430v1 [stat.ML])
    (2 min) Herding is a technique to sequentially generate deterministic samples from a probability distribution. In this work, we propose a continuous herded Gibbs sampler, that combines kernel herding on continuous densities with Gibbs sampling. Our algorithm allows for deterministically sampling from high-dimensional multivariate probability densities, without directly sampling from the joint density. Experiments with Gaussian mixture densities indicate that the L2 error decreases similarly to kernel herding, while the computation time is significantly lower, i.e., linear in the number of dimensions.
    Topological Detection of Trojaned Neural Networks. (arXiv:2106.06469v1 [cs.LG])
    (2 min) Deep neural networks are known to have security issues. One particular threat is the Trojan attack. It occurs when the attackers stealthily manipulate the model's behavior through Trojaned training samples, which can later be exploited. Guided by basic neuroscientific principles we discover subtle -- yet critical -- structural deviation characterizing Trojaned models. In our analysis we use topological tools. They allow us to model high-order dependencies in the networks, robustly compare different networks, and localize structural abnormalities. One interesting observation is that Trojaned models develop short-cuts from input to output layers. Inspired by these observations, we devise a strategy for robust detection of Trojaned models. Compared to standard baselines it displays better performance on multiple benchmarks.
    Model-Free Learning for Two-Player Zero-Sum Partially Observable Markov Games with Perfect Recall. (arXiv:2106.06279v1 [stat.ML])
    (2 min) We study the problem of learning a Nash equilibrium (NE) in an imperfect information game (IIG) through self-play. Precisely, we focus on two-player, zero-sum, episodic, tabular IIG under the perfect-recall assumption where the only feedback is realizations of the game (bandit feedback). In particular, the dynamic of the IIG is not known -- we can only access it by sampling or interacting with a game simulator. For this learning setting, we provide the Implicit Exploration Online Mirror Descent (IXOMD) algorithm. It is a model-free algorithm with a high-probability bound on the convergence rate to the NE of order $1/\sqrt{T}$ where $T$ is the number of played games. Moreover, IXOMD is computationally efficient as it needs to perform the updates only along the sampled trajectory.
    Online Continual Adaptation with Active Self-Training. (arXiv:2106.06526v1 [cs.LG])
    (2 min) Models trained with offline data often suffer from continual distribution shifts and expensive labeling in changing environments. This calls for a new online learning paradigm where the learner can continually adapt to changing environments with limited labels. In this paper, we propose a new online setting -- Online Active Continual Adaptation, where the learner aims to continually adapt to changing distributions using both unlabeled samples and active queries of limited labels. To this end, we propose Online Self-Adaptive Mirror Descent (OSAMD), which adopts an online teacher-student structure to enable online self-training from unlabeled data, and a margin-based criterion that decides whether to query the labels to track changing distributions. Theoretically, we show that, in the separable case, OSAMD has an $O({T}^{1/2})$ dynamic regret bound under mild assumptions, which is even tighter than the lower bound $\Omega(T^{2/3})$ of traditional online learning with full labels. In the general case, we show a regret bound of $O({\alpha^*}^{1/3} {T}^{2/3} + \alpha^* T)$, where $\alpha^*$ denotes the separability of domains and is usually small. Our theoretical results show that OSAMD can fast adapt to changing environments with active queries. Empirically, we demonstrate that OSAMD achieves favorable regrets under changing environments with limited labels on both simulated and real-world data, which corroborates our theoretical findings.
    Learning the optimal regularizer for inverse problems. (arXiv:2106.06513v1 [stat.ML])
    (2 min) In this work, we consider the linear inverse problem $y=Ax+\epsilon$, where $A\colon X\to Y$ is a known linear operator between the separable Hilbert spaces $X$ and $Y$, $x$ is a random variable in $X$ and $\epsilon$ is a zero-mean random process in $Y$. This setting covers several inverse problems in imaging including denoising, deblurring, and X-ray tomography. Within the classical framework of regularization, we focus on the case where the regularization functional is not given a priori but learned from data. Our first result is a characterization of the optimal generalized Tikhonov regularizer, with respect to the mean squared error. We find that it is completely independent of the forward operator $A$ and depends only on the mean and covariance of $x$. Then, we consider the problem of learning the regularizer from a finite training set in two different frameworks: one supervised, based on samples of both $x$ and $y$, and one unsupervised, based only on samples of $x$. In both cases, we prove generalization bounds, under some weak assumptions on the distribution of $x$ and $\epsilon$, including the case of sub-Gaussian variables. Our bounds hold in infinite-dimensional spaces, thereby showing that finer and finer discretizations do not make this learning problem harder. The results are validated through numerical simulations.
    Global Neighbor Sampling for Mixed CPU-GPU Training on Giant Graphs. (arXiv:2106.06150v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) are powerful tools for learning from graph data and are widely used in various applications such as social network recommendation, fraud detection, and graph search. The graphs in these applications are typically large, usually containing hundreds of millions of nodes. Training GNN models on such large graphs efficiently remains a big challenge. Despite a number of sampling-based methods have been proposed to enable mini-batch training on large graphs, these methods have not been proved to work on truly industry-scale graphs, which require GPUs or mixed-CPU-GPU training. The state-of-the-art sampling-based methods are usually not optimized for these real-world hardware setups, in which data movement between CPUs and GPUs is a bottleneck. To address this issue, we propose Global Neighborhood Sampling that aims at training GNNs on giant graphs specifically for mixed-CPU-GPU training. The algorithm samples a global cache of nodes periodically for all mini-batches and stores them in GPUs. This global cache allows in-GPU importance sampling of mini-batches, which drastically reduces the number of nodes in a mini-batch, especially in the input layer, to reduce data copy between CPU and GPU and mini-batch computation without compromising the training convergence rate or model accuracy. We provide a highly efficient implementation of this method and show that our implementation outperforms an efficient node-wise neighbor sampling baseline by a factor of 2X-4X on giant graphs. It outperforms an efficient implementation of LADIES with small layers by a factor of 2X-14X while achieving much higher accuracy than LADIES.We also theoretically analyze the proposed algorithm and show that with cached node data of a proper size, it enjoys a comparable convergence rate as the underlying node-wise sampling method.
    Unsupervised Anomaly Detection Ensembles using Item Response Theory. (arXiv:2106.06243v1 [stat.ML])
    (2 min) Constructing an ensemble from a heterogeneous set of unsupervised anomaly detection methods is challenging because the class labels or the ground truth is unknown. Thus, traditional ensemble techniques that use the response variable or the class labels cannot be used to construct an ensemble for unsupervised anomaly detection. We use Item Response Theory (IRT) -- a class of models used in educational psychometrics to assess student and test question characteristics -- to construct an unsupervised anomaly detection ensemble. IRT's latent trait computation lends itself to anomaly detection because the latent trait can be used to uncover the hidden ground truth. Using a novel IRT mapping to the anomaly detection problem, we construct an ensemble that can downplay noisy, non-discriminatory methods and accentuate sharper methods. We demonstrate the effectiveness of the IRT ensemble on an extensive data repository, by comparing its performance to other ensemble techniques.
    Going Beyond Linear Transformers with Recurrent Fast Weight Programmers. (arXiv:2106.06295v1 [cs.LG])
    (2 min) Transformers with linearised attention ("linear Transformers") have demonstrated the practical scalability and effectiveness of outer product-based Fast Weight Programmers (FWPs) from the '90s. However, the original FWP formulation is more general than the one of linear Transformers: a slow neural network (NN) continually reprograms the weights of a fast NN with arbitrary NN architectures. In existing linear Transformers, both NNs are feedforward and consist of a single layer. Here we explore new variations by adding recurrence to the slow and fast nets. We evaluate our novel recurrent FWPs (RFWPs) on two synthetic algorithmic tasks (code execution and sequential ListOps), Wikitext-103 language models, and on the Atari 2600 2D game environment. Our models exhibit properties of Transformers and RNNs. In the reinforcement learning setting, we report large improvements over LSTM in several Atari games. Our code is public.
    Graph Neural Networks for Natural Language Processing: A Survey. (arXiv:2106.06090v1 [cs.CL])
    (2 min) Deep learning has become the dominant approach in coping with various tasks in Natural LanguageProcessing (NLP). Although text inputs are typically represented as a sequence of tokens, there isa rich variety of NLP problems that can be best expressed with a graph structure. As a result, thereis a surge of interests in developing new deep learning techniques on graphs for a large numberof NLP tasks. In this survey, we present a comprehensive overview onGraph Neural Networks(GNNs) for Natural Language Processing. We propose a new taxonomy of GNNs for NLP, whichsystematically organizes existing research of GNNs for NLP along three axes: graph construction,graph representation learning, and graph based encoder-decoder models. We further introducea large number of NLP applications that are exploiting the power of GNNs and summarize thecorresponding benchmark datasets, evaluation metrics, and open-source codes. Finally, we discussvarious outstanding challenges for making the full use of GNNs for NLP as well as future researchdirections. To the best of our knowledge, this is the first comprehensive overview of Graph NeuralNetworks for Natural Language Processing.
    What Can Knowledge Bring to Machine Learning? -- A Survey of Low-shot Learning for Structured Data. (arXiv:2106.06410v1 [cs.LG])
    (2 min) Supervised machine learning has several drawbacks that make it difficult to use in many situations. Drawbacks include: heavy reliance on massive training data, limited generalizability and poor expressiveness of high-level semantics. Low-shot Learning attempts to address these drawbacks. Low-shot learning allows the model to obtain good predictive power with very little or no training data, where structured knowledge plays a key role as a high-level semantic representation of human. This article will review the fundamental factors of low-shot learning technologies, with a focus on the operation of structured knowledge under different low-shot conditions. We also introduce other techniques relevant to low-shot learning. Finally, we point out the limitations of low-shot learning, the prospects and gaps of industrial applications, and future research directions.
    A comprehensive solution to retrieval-based chatbot construction. (arXiv:2106.06139v1 [cs.CL])
    (2 min) In this paper we present the results of our experiments in training and deploying a self-supervised retrieval-based chatbot trained with contrastive learning for assisting customer support agents. In contrast to most existing research papers in this area where the focus is on solving just one component of a deployable chatbot, we present an end-to-end set of solutions to take the reader from an unlabelled chatlogs to a deployed chatbot. This set of solutions includes creating a self-supervised dataset and a weakly labelled dataset from chatlogs, as well as a systematic approach to selecting a fixed list of canned responses. We present a hierarchical-based RNN architecture for the response selection model, chosen for its ability to cache intermediate utterance embeddings, which helped to meet deployment inference speed requirements. We compare the performance of this architecture across 3 different learning objectives: self-supervised contrastive learning, binary classification, and multi-class classification. We find that using a self-supervised contrastive learning model outperforms training the binary and multi-class classification models on a weakly labelled dataset. Our results validate that the self-supervised contrastive learning approach can be effectively used for a real-world chatbot scenario.
    Survey of Image Based Graph Neural Networks. (arXiv:2106.06307v1 [cs.LG])
    (2 min) In this survey paper, we analyze image based graph neural networks and propose a three-step classification approach. We first convert the image into superpixels using the Quickshift algorithm so as to reduce 30% of the input data. The superpixels are subsequently used to generate a region adjacency graph. Finally, the graph is passed through a state-of-art graph convolutional neural network to get classification scores. We also analyze the spatial and spectral convolution filtering techniques in graph neural networks. Spectral-based models perform better than spatial-based models and classical CNN with lesser compute cost.
    States of confusion: Eye and Head tracking reveal surgeons' confusion during arthroscopic surgery. (arXiv:2106.06261v1 [cs.HC])
    (2 min) During arthroscopic surgeries, surgeons are faced with challenges like cognitive re-projection of the 2D screen output into the 3D operating site or navigation through highly similar tissue. Training of these cognitive processes takes much time and effort for young surgeons, but is necessary and crucial for their education. In this study we want to show how to recognize states of confusion of young surgeons during an arthroscopic surgery, by looking at their eye and head movements and feeding them to a machine learning model. With an accuracy of over 94\% and detection speed of 0.039 seconds, our model is a step towards online diagnostic and training systems for the perceptual-cognitive processes of surgeons during arthroscopic surgeries.
    An adaptive cognitive sensor node for ECG monitoring in the Internet of Medical Things. (arXiv:2106.06498v1 [cs.LG])
    (2 min) The Internet of Medical Things (IoMT) paradigm is becoming mainstream in multiple clinical trials and healthcare procedures. It relies on novel very accurate and compact sensing devices and communication infrastructures, opening previously unmatched possibilities of implementing data collection and continuous patient monitoring. Nevertheless, to fully exploit the potential of this technology, some steps forwards are needed. First, the edge-computing paradigm must be added to the picture. A certain level of near-sensor processing has to be enabled, to improve the scalability, portability, reliability, responsiveness of the IoMT nodes. Second, novel, increasingly accurate, data analysis algorithms, such as those based on artificial intelligence and Deep Learning, must be exploited. To reach these objectives, designers, programmers of IoMT nodes, have to face challenging optimization tasks, in order to execute fairly complex computing tasks on low-power wearable and portable processing systems, with tight power and battery lifetime budgets. In this work, we explore the implementation of cognitive data analysis algorithm on resource-constrained computing platforms. To minimize power consumption, we add an adaptivity layer that dynamically manages the hardware and software configuration of the device to adapt it at runtime to the required operating mode. We have assessed our approach on a use-case using a convolutional neural network to classify electrocardiogram (ECG) traces on a low-power microcontroller. Our experimental results show that adapting the node setup to the workload at runtime can save up to 50% power consumption and a quantized neural network reaches an accuracy value higher than 98% for arrhythmia disorders detection on MIT-BIH Arrhythmia dataset.
    Is Homophily a Necessity for Graph Neural Networks?. (arXiv:2106.06134v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) have shown great prowess in learning representations suitable for numerous graph-based machine learning tasks. When applied to semi-supervised node classification, GNNs are widely believed to work well due to the homophily assumption (``like attracts like''), and fail to generalize to heterophilous graphs where dissimilar nodes connect. Recent works design new architectures to overcome such heterophily-related limitations, citing poor baseline performance and new architecture improvements on a few heterophilous graph benchmark datasets as evidence for this notion. In our experiments, we empirically find that standard graph convolutional networks (GCNs) can actually achieve better performance than such carefully designed methods on some commonly used heterophilous graphs. This motivates us to reconsider whether homophily is truly necessary for good GNN performance. We find that this claim is not quite true, and in fact, GCNs can achieve strong performance on heterophilous graphs under certain conditions. Our work carefully characterizes these conditions, and provides supporting theoretical understanding and empirical observations. Finally, we examine existing heterophilous graphs benchmarks and reconcile how the GCN (under)performs on them based on this understanding.
    Nested and Balanced Entity Recognition using Multi-Task Learning. (arXiv:2106.06216v1 [cs.CL])
    (2 min) Entity Recognition (ER) within a text is a fundamental exercise in Natural Language Processing, enabling further depending tasks such as Knowledge Extraction, Text Summarisation, or Keyphrase Extraction. An entity consists of single words or of a consecutive sequence of terms, constituting the basic building blocks for communication. Mainstream ER approaches are mainly limited to flat structures, concentrating on the outermost entities while ignoring the inner ones. This paper introduces a partly-layered network architecture that deals with the complexity of overlapping and nested cases. The proposed architecture consists of two parts: (1) a shared Sequence Layer and (2) a stacked component with multiple Tagging Layers. The adoption of such an architecture has the advantage of preventing overfit to a specific word-length, thus maintaining performance for longer entities despite their lower frequency. To verify the proposed architecture's effectiveness, we train and evaluate this architecture to recognise two kinds of entities - Concepts (CR) and Named Entities (NER). Our approach achieves state-of-the-art NER performances, while it outperforms previous CR approaches. Considering these promising results, we see the possibility to evolve the architecture for other cases such as the extraction of events or the detection of argumentative components.
    An Ensemble Approach Towards Adversarial Robustness. (arXiv:2106.05996v1 [cs.LG])
    (2 min) It is a known phenomenon that adversarial robustness comes at a cost to natural accuracy. To improve this trade-off, this paper proposes an ensemble approach that divides a complex robust-classification task into simpler subtasks. Specifically, fractal divide derives multiple training sets from the training data, and fractal aggregation combines inference outputs from multiple classifiers that are trained on those sets. The resulting ensemble classifiers have a unique property that ensures robustness for an input if certain don't-care conditions are met. The new techniques are evaluated on MNIST and Fashion-MNIST, with no adversarial training. The MNIST classifier has 99% natural accuracy, 70% measured robustness and 36.9% provable robustness, within L2 distance of 2. The Fashion-MNIST classifier has 90% natural accuracy, 54.5% measured robustness and 28.2% provable robustness, within L2 distance of 1.5. Both results are new state of the art, and we also present new state-of-the-art binary results on challenging label-pairs.
    Neural Symbolic Regression that Scales. (arXiv:2106.06427v1 [cs.LG])
    (2 min) Symbolic equations are at the core of scientific discovery. The task of discovering the underlying equation from a set of input-output pairs is called symbolic regression. Traditionally, symbolic regression methods use hand-designed strategies that do not improve with experience. In this paper, we introduce the first symbolic regression method that leverages large scale pre-training. We procedurally generate an unbounded set of equations, and simultaneously pre-train a Transformer to predict the symbolic equation from a corresponding set of input-output-pairs. At test time, we query the model on a new set of points and use its output to guide the search for the equation. We show empirically that this approach can re-discover a set of well-known physical equations, and that it improves over time with more data and compute.
    Finding Physical Adversarial Examples for Autonomous Driving with Fast and Differentiable Image Compositing. (arXiv:2010.08844v2 [cs.CV] UPDATED)
    (2 min) There is considerable evidence that deep neural networks are vulnerable to adversarial perturbations applied directly to their digital inputs. However, it remains an open question whether this translates to vulnerabilities in real systems. For example, an attack on self-driving cars would in practice entail modifying the driving environment, which then impacts the video inputs to the car's controller, thereby indirectly leading to incorrect driving decisions. Such attacks require accounting for system dynamics and tracking viewpoint changes. We propose a scalable approach for finding adversarial modifications of a simulated autonomous driving environment using a differentiable approximation for the mapping from environmental modifications (rectangles on the road) to the corresponding video inputs to the controller neural network. Given the parameters of the rectangles, our proposed differentiable mapping composites them onto pre-recorded video streams of the original environment, accounting for geometric and color variations. Moreover, we propose a multiple trajectory sampling approach that enables our attacks to be robust to a car's self-correcting behavior. When combined with a neural network-based controller, our approach allows the design of adversarial modifications through end-to-end gradient-based optimization. Using the Carla autonomous driving simulator, we show that our approach is significantly more scalable and far more effective at identifying autonomous vehicle vulnerabilities in simulation experiments than a state-of-the-art approach based on Bayesian Optimization.
    Leveraging Public Data for Practical Private Query Release. (arXiv:2102.08598v2 [cs.LG] UPDATED)
    (2 min) In many statistical problems, incorporating priors can significantly improve performance. However, the use of prior knowledge in differentially private query release has remained underexplored, despite such priors commonly being available in the form of public datasets, such as previous US Census releases. With the goal of releasing statistics about a private dataset, we present PMW^Pub, which -- unlike existing baselines -- leverages public data drawn from a related distribution as prior information. We provide a theoretical analysis and an empirical evaluation on the American Community Survey (ACS) and ADULT datasets, which shows that our method outperforms state-of-the-art methods. Furthermore, PMW^Pub scales well to high-dimensional data domains, where running many existing methods would be computationally infeasible.
    Black-box Explanation of Object Detectors via Saliency Maps. (arXiv:2006.03204v2 [cs.CV] UPDATED)
    (2 min) We propose D-RISE, a method for generating visual explanations for the predictions of object detectors. Utilizing the proposed similarity metric that accounts for both localization and categorization aspects of object detection allows our method to produce saliency maps that show image areas that most affect the prediction. D-RISE can be considered "black-box" in the software testing sense, as it only needs access to the inputs and outputs of an object detector. Compared to gradient-based methods, D-RISE is more general and agnostic to the particular type of object detector being tested, and does not need knowledge of the inner workings of the model. We show that D-RISE can be easily applied to different object detectors including one-stage detectors such as YOLOv3 and two-stage detectors such as Faster-RCNN. We present a detailed analysis of the generated visual explanations to highlight the utilization of context and possible biases learned by object detectors.
    Making EfficientNet More Efficient: Exploring Batch-Independent Normalization, Group Convolutions and Reduced Resolution Training. (arXiv:2106.03640v2 [cs.LG] UPDATED)
    (2 min) Much recent research has been dedicated to improving the efficiency of training and inference for image classification. This effort has commonly focused on explicitly improving theoretical efficiency, often measured as ImageNet validation accuracy per FLOP. These theoretical savings have, however, proven challenging to achieve in practice, particularly on high-performance training accelerators. In this work, we focus on improving the practical efficiency of the state-of-the-art EfficientNet models on a new class of accelerator, the Graphcore IPU. We do this by extending this family of models in the following ways: (i) generalising depthwise convolutions to group convolutions; (ii) adding proxy-normalized activations to match batch normalization performance with batch-independent statistics; (iii) reducing compute by lowering the training resolution and inexpensively fine-tuning at higher resolution. We find that these three methods improve the practical efficiency for both training and inference. Our code will be made available online.
    View Generalization for Single Image Textured 3D Models. (arXiv:2106.06533v1 [cs.CV])
    (2 min) Humans can easily infer the underlying 3D geometry and texture of an object only from a single 2D image. Current computer vision methods can do this, too, but suffer from view generalization problems - the models inferred tend to make poor predictions of appearance in novel views. As for generalization problems in machine learning, the difficulty is balancing single-view accuracy (cf. training error; bias) with novel view accuracy (cf. test error; variance). We describe a class of models whose geometric rigidity is easily controlled to manage this tradeoff. We describe a cycle consistency loss that improves view generalization (roughly, a model from a generated view should predict the original view well). View generalization of textures requires that models share texture information, so a car seen from the back still has headlights because other cars have headlights. We describe a cycle consistency loss that encourages model textures to be aligned, so as to encourage sharing. We compare our method against the state-of-the-art method and show both qualitative and quantitative improvements.
    Dictionary and prior learning with unrolled algorithms for unsupervised inverse problems. (arXiv:2106.06338v1 [cs.LG])
    (2 min) Inverse problems consist in recovering a signal given noisy observations. One classical resolution approach is to leverage sparsity and integrate prior knowledge of the signal to the reconstruction algorithm to get a plausible solution. Still, this prior might not be sufficiently adapted to the data. In this work, we study Dictionary and Prior learning from degraded measurements as a bi-level problem, and we take advantage of unrolled algorithms to solve approximate formulations of Synthesis and Analysis. We provide an empirical and theoretical analysis of automatic differentiation for Dictionary Learning to understand better the pros and cons of unrolling in this context. We find that unrolled algorithms speed up the recovery process for a small number of iterations by improving the gradient estimation. Then we compare Analysis and Synthesis by evaluating the performance of unrolled algorithms for inverse problems, without access to any ground truth data for several classes of dictionaries and priors. While Analysis can achieve good results,Synthesis is more robust and performs better. Finally, we illustrate our method on pattern and structure learning tasks from degraded measurements.
    Optimal Rates for Averaged Stochastic Gradient Descent under Neural Tangent Kernel Regime. (arXiv:2006.12297v2 [stat.ML] UPDATED)
    (2 min) We analyze the convergence of the averaged stochastic gradient descent for overparameterized two-layer neural networks for regression problems. It was recently found that a neural tangent kernel (NTK) plays an important role in showing the global convergence of gradient-based methods under the NTK regime, where the learning dynamics for overparameterized neural networks can be almost characterized by that for the associated reproducing kernel Hilbert space (RKHS). However, there is still room for a convergence rate analysis in the NTK regime. In this study, we show that the averaged stochastic gradient descent can achieve the minimax optimal convergence rate, with the global convergence guarantee, by exploiting the complexities of the target function and the RKHS associated with the NTK. Moreover, we show that the target function specified by the NTK of a ReLU network can be learned at the optimal convergence rate through a smooth approximation of a ReLU network under certain conditions.
    Improved Contrastive Divergence Training of Energy Based Models. (arXiv:2012.01316v4 [cs.LG] UPDATED)
    (2 min) Contrastive divergence is a popular method of training energy-based models, but is known to have difficulties with training stability. We propose an adaptation to improve contrastive divergence training by scrutinizing a gradient term that is difficult to calculate and is often left out for convenience. We show that this gradient term is numerically significant and in practice is important to avoid training instabilities, while being tractable to estimate. We further highlight how data augmentation and multi-scale processing can be used to improve model robustness and generation quality. Finally, we empirically evaluate stability of model architectures and show improved performance on a host of benchmarks and use cases,such as image generation, OOD detection, and compositional generation.
    The Complexity of Sparse Tensor PCA. (arXiv:2106.06308v1 [cs.LG])
    (2 min) We study the problem of sparse tensor principal component analysis: given a tensor $\pmb Y = \pmb W + \lambda x^{\otimes p}$ with $\pmb W \in \otimes^p\mathbb{R}^n$ having i.i.d. Gaussian entries, the goal is to recover the $k$-sparse unit vector $x \in \mathbb{R}^n$. The model captures both sparse PCA (in its Wigner form) and tensor PCA. For the highly sparse regime of $k \leq \sqrt{n}$, we present a family of algorithms that smoothly interpolates between a simple polynomial-time algorithm and the exponential-time exhaustive search algorithm. For any $1 \leq t \leq k$, our algorithms recovers the sparse vector for signal-to-noise ratio $\lambda \geq \tilde{\mathcal{O}} (\sqrt{t} \cdot (k/t)^{p/2})$ in time $\tilde{\mathcal{O}}(n^{p+t})$, capturing the state-of-the-art guarantees for the matrix settings (in both the polynomial-time and sub-exponential time regimes). Our results naturally extend to the case of $r$ distinct $k$-sparse signals with disjoint supports, with guarantees that are independent of the number of spikes. Even in the restricted case of sparse PCA, known algorithms only recover the sparse vectors for $\lambda \geq \tilde{\mathcal{O}}(k \cdot r)$ while our algorithms require $\lambda \geq \tilde{\mathcal{O}}(k)$. Finally, by analyzing the low-degree likelihood ratio, we complement these algorithmic results with rigorous evidence illustrating the trade-offs between signal-to-noise ratio and running time. This lower bound captures the known lower bounds for both sparse PCA and tensor PCA. In this general model, we observe a more intricate three-way trade-off between the number of samples $n$, the sparsity $k$, and the tensor power $p$.
    Nystr\"om landmark sampling and regularized Christoffel functions. (arXiv:1905.12346v3 [cs.LG] UPDATED)
    (2 min) Selecting diverse and important items, called landmarks, from a large set is a problem of interest in machine learning. As a specific example, in order to deal with large training sets, kernel methods often rely on low rank matrix Nystr\"om approximations based on the selection or sampling of landmarks. In this context, we propose a deterministic and a randomized adaptive algorithm for selecting landmark points within a training data set, which are related to the minima of a sequence of kernelized Christoffel functions. Beyond the known connection between Christoffel functions and leverage scores, a connection of our method with determinantal point processes (DPPs) is also explained. Namely, our construction promotes diversity among important landmark points in a way similar to DPPs. Also, we explain how our randomized adaptive algorithm can influence the accuracy of Kernel Ridge Regression.
    Assessing the Effectiveness of Syntactic Structure to Learn Code Edit Representations. (arXiv:2106.06110v1 [cs.LG])
    (2 min) In recent times, it has been shown that one can use code as data to aid various applications such as automatic commit message generation, automatic generation of pull request descriptions and automatic program repair. Take for instance the problem of commit message generation. Treating source code as a sequence of tokens, state of the art techniques generate commit messages using neural machine translation models. However, they tend to ignore the syntactic structure of programming languages. Previous work, i.e., code2seq has used structural information from Abstract Syntax Tree (AST) to represent source code and they use it to automatically generate method names. In this paper, we elaborate upon this state of the art approach and modify it to represent source code edits. We determine the effect of using such syntactic structure for the problem of classifying code edits. Inspired by the code2seq approach, we evaluate how using structural information from AST, i.e., paths between AST leaf nodes can help with the task of code edit classification on two datasets of fine-grained syntactic edits. Our experiments shows that attempts of adding syntactic structure does not result in any improvements over less sophisticated methods. The results suggest that techniques such as code2seq, while promising, have a long way to go before they can be generically applied to learning code edit representations. We hope that these results will benefit other researchers and inspire them to work further on this problem.
    Machine Collaboration. (arXiv:2105.02569v2 [stat.ML] UPDATED)
    (2 min) We propose a new ensemble framework for supervised learning, called machine collaboration (MaC), using a collection of base machines for prediction tasks. Unlike bagging/stacking (a parallel & independent framework) and boosting (a sequential & top-down framework), MaC is a type of circular & interactive learning framework. The circular & interactive feature helps the base machines to transfer information circularly and update their structures and parameters accordingly. The theoretical result on the risk bound of the estimator from MaC reveals that the circular & interactive feature can help MaC reduce risk via a parsimonious ensemble. We conduct extensive experiments on MaC using both simulated data and 119 benchmark real datasets. The results demonstrate that in most cases, MaC performs significantly better than several other state-of-the-art methods, including classification and regression trees, neural networks, stacking, and boosting.
    Scalable Polyhedral Verification of Recurrent Neural Networks. (arXiv:2005.13300v3 [cs.LG] UPDATED)
    (2 min) We present a scalable and precise verifier for recurrent neural networks, called Prover based on two novel ideas: (i) a method to compute a set of polyhedral abstractions for the non-convex and nonlinear recurrent update functions by combining sampling, optimization, and Fermat's theorem, and (ii) a gradient descent based algorithm for abstraction refinement guided by the certification problem that combines multiple abstractions for each neuron. Using Prover, we present the first study of certifying a non-trivial use case of recurrent neural networks, namely speech classification. To achieve this, we additionally develop custom abstractions for the non-linear speech preprocessing pipeline. Our evaluation shows that Prover successfully verifies several challenging recurrent models in computer vision, speech, and motion sensor data classification beyond the reach of prior work.
    PAC-Learning for Strategic Classification. (arXiv:2012.03310v4 [cs.LG] UPDATED)
    (2 min) The study of strategic or adversarial manipulation of testing data to fool a classifier has attracted much recent attention. Most previous works have focused on two extreme situations where any testing data point either is completely adversarial or always equally prefers the positive label. In this paper, we generalize both of these through a unified framework for strategic classification, and introduce the notion of strategic VC-dimension (SVC) to capture the PAC-learnability in our general strategic setup. SVC provably generalizes the recent concept of adversarial VC-dimension (AVC) introduced by Cullina et al. arXiv:1806.01471. We instantiate our framework for the fundamental strategic linear classification problem. We fully characterize: (1) the statistical learnability of linear classifiers by pinning down its SVC; (2) its computational tractability by pinning down the complexity of the empirical risk minimization problem. Interestingly, the SVC of linear classifiers is always upper bounded by its standard VC-dimension. This characterization also strictly generalizes the AVC bound for linear classifiers in arXiv:1806.01471.
    Fast Weakly Supervised Action Segmentation Using Mutual Consistency. (arXiv:1904.03116v4 [cs.CV] UPDATED)
    (2 min) Action segmentation is the task of predicting the actions for each frame of a video. As obtaining the full annotation of videos for action segmentation is expensive, weakly supervised approaches that can learn only from transcripts are appealing. In this paper, we propose a novel end-to-end approach for weakly supervised action segmentation based on a two-branch neural network. The two branches of our network predict two redundant but different representations for action segmentation and we propose a novel mutual consistency (MuCon) loss that enforces the consistency of the two redundant representations. Using the MuCon loss together with a loss for transcript prediction, our proposed approach achieves the accuracy of state-of-the-art approaches while being $14$ times faster to train and $20$ times faster during inference. The MuCon loss proves beneficial even in the fully supervised setting.
    Hurricane Forecasting: A Novel Multimodal Machine Learning Framework. (arXiv:2011.06125v2 [cs.LG] UPDATED)
    (2 min) This paper describes a machine learning (ML) framework for tropical cyclone intensity and track forecasting, combining multiple distinct ML techniques and utilizing diverse data sources. Our framework, which we refer to as Hurricast (HURR), is built upon the combination of distinct data processing techniques using gradient-boosted trees and novel encoder-decoder architectures, including CNN, GRU and Transformers components. We propose a deep-feature extractor methodology to mix spatial-temporal data with statistical data efficiently. Our multimodal framework unleashes the potential of making forecasts based on a wide range of data sources, including historical storm data, and visual data such as reanalysis atmospheric images. We evaluate our models with current operational forecasts in North Atlantic and Eastern Pacific basins on 2016-2019 for 24-hour lead time, and show our models consistently outperform statistical-dynamical models and compete with the best dynamical models, while computing forecasts in seconds. Furthermore, the inclusion of Hurricast into an operational forecast consensus model leads to a significant improvement of 5% - 15% over NHC's official forecast, thus highlighting the complementary properties with existing approaches. In summary, our work demonstrates that combining different data sources and distinct machine learning methodologies can lead to superior tropical cyclone forecasting. We hope that this work opens the door for further use of machine learning in meteorological forecasting.
    N-Best ASR Transformer: Enhancing SLU Performance using Multiple ASR Hypotheses. (arXiv:2106.06519v1 [cs.CL])
    (2 min) Spoken Language Understanding (SLU) systems parse speech into semantic structures like dialog acts and slots. This involves the use of an Automatic Speech Recognizer (ASR) to transcribe speech into multiple text alternatives (hypotheses). Transcription errors, common in ASRs, impact downstream SLU performance negatively. Approaches to mitigate such errors involve using richer information from the ASR, either in form of N-best hypotheses or word-lattices. We hypothesize that transformer models learn better with a simpler utterance representation using the concatenation of the N-best ASR alternatives, where each alternative is separated by a special delimiter [SEP]. In our work, we test our hypothesis by using concatenated N-best ASR alternatives as the input to transformer encoder models, namely BERT and XLM-RoBERTa, and achieve performance equivalent to the prior state-of-the-art model on DSTC2 dataset. We also show that our approach significantly outperforms the prior state-of-the-art when subjected to the low data regime. Additionally, this methodology is accessible to users of third-party ASR APIs which do not provide word-lattice information.
    Visualizing Classifier Adjacency Relations: A Case Study in Speaker Verification and Voice Anti-Spoofing. (arXiv:2106.06362v1 [cs.SD])
    (2 min) Whether it be for results summarization, or the analysis of classifier fusion, some means to compare different classifiers can often provide illuminating insight into their behaviour, (dis)similarity or complementarity. We propose a simple method to derive 2D representation from detection scores produced by an arbitrary set of binary classifiers in response to a common dataset. Based upon rank correlations, our method facilitates a visual comparison of classifiers with arbitrary scores and with close relation to receiver operating characteristic (ROC) and detection error trade-off (DET) analyses. While the approach is fully versatile and can be applied to any detection task, we demonstrate the method using scores produced by automatic speaker verification and voice anti-spoofing systems. The former are produced by a Gaussian mixture model system trained with VoxCeleb data whereas the latter stem from submissions to the ASVspoof 2019 challenge.
    A Decentralized Adaptive Momentum Method for Solving a Class of Min-Max Optimization Problems. (arXiv:2106.06075v1 [math.OC])
    (2 min) Min-max saddle point games have recently been intensely studied, due to their wide range of applications, including training Generative Adversarial Networks~(GANs). However, most of the recent efforts for solving them are limited to special regimes such as convex-concave games. Further, it is customarily assumed that the underlying optimization problem is solved either by a single machine or in the case of multiple machines connected in centralized fashion, wherein each one communicates with a central node. The latter approach becomes challenging, when the underlying communications network has low bandwidth. In addition, privacy considerations may dictate that certain nodes can communicate with a subset of other nodes. Hence, it is of interest to develop methods that solve min-max games in a decentralized manner. To that end, we develop a decentralized adaptive momentum (ADAM)-type algorithm for solving min-max optimization problem under the condition that the objective function satisfies a Minty Variational Inequality condition, which is a generalization to convex-concave case. The proposed method overcomes shortcomings of recent non-adaptive gradient-based decentralized algorithms for min-max optimization problems that do not perform well in practice and require careful tuning. In this paper, we obtain non-asymptotic rates of convergence of the proposed algorithm (coined DADAM$^3$) for finding a (stochastic) first-order Nash equilibrium point and subsequently evaluate its performance on training GANs. The extensive empirical evaluation shows that DADAM$^3$ outperforms recently developed methods, including decentralized optimistic stochastic gradient for solving such min-max problems.
    Inter-domain Multi-relational Link Prediction. (arXiv:2106.06171v1 [cs.LG])
    (2 min) Multi-relational graph is a ubiquitous and important data structure, allowing flexible representation of multiple types of interactions and relations between entities. Similar to other graph-structured data, link prediction is one of the most important tasks on multi-relational graphs and is often used for knowledge completion. When related graphs coexist, it is of great benefit to build a larger graph via integrating the smaller ones. The integration requires predicting hidden relational connections between entities belonged to different graphs (inter-domain link prediction). However, this poses a real challenge to existing methods that are exclusively designed for link prediction between entities of the same graph only (intra-domain link prediction). In this study, we propose a new approach to tackle the inter-domain link prediction problem by softly aligning the entity distributions between different domains with optimal transport and maximum mean discrepancy regularizers. Experiments on real-world datasets show that optimal transport regularizer is beneficial and considerably improves the performance of baseline methods.
    Meta-Adaptive Nonlinear Control: Theory and Algorithms. (arXiv:2106.06098v1 [cs.LG])
    (2 min) We present an online multi-task learning approach for adaptive nonlinear control, which we call Online Meta-Adaptive Control (OMAC). The goal is to control a nonlinear system subject to adversarial disturbance and unknown $\textit{environment-dependent}$ nonlinear dynamics, under the assumption that the environment-dependent dynamics can be well captured with some shared representation. Our approach is motivated by robot control, where a robotic system encounters a sequence of new environmental conditions that it must quickly adapt to. A key emphasis is to integrate online representation learning with established methods from control theory, in order to arrive at a unified framework that yields both control-theoretic and learning-theoretic guarantees. We provide instantiations of our approach under varying conditions, leading to the first non-asymptotic end-to-end convergence guarantee for multi-task adaptive nonlinear control. OMAC can also be integrated with deep representation learning. Experiments show that OMAC significantly outperforms conventional adaptive control approaches which do not learn the shared representation.
    Learning to Pool in Graph Neural Networks for Extrapolation. (arXiv:2106.06210v1 [cs.LG])
    (2 min) Graph neural networks (GNNs) are one of the most popular approaches to using deep learning on graph-structured data, and they have shown state-of-the-art performances on a variety of tasks. However, according to a recent study, a careful choice of pooling functions, which are used for the aggregation or readout operation in GNNs, is crucial for enabling GNNs to extrapolate. Without the ideal combination of pooling functions, which varies across tasks, GNNs completely fail to generalize to out-of-distribution data, while the number of possible combinations grows exponentially with the number of layers. In this paper, we present GNP, a $L^p$ norm-like pooling function that is trainable end-to-end for any given task. Notably, GNP generalizes most of the widely-used pooling functions. We verify experimentally that simply replacing all pooling functions with GNP enables GNNs to extrapolate well on many node-level, graph-level, and set-related tasks; and GNP sometimes performs even better than optimal combinations of existing pooling functions.
    Knowledge Enhanced Machine Learning Pipeline against Diverse Adversarial Attacks. (arXiv:2106.06235v1 [cs.LG])
    (2 min) Despite the great successes achieved by deep neural networks (DNNs), recent studies show that they are vulnerable against adversarial examples, which aim to mislead DNNs by adding small adversarial perturbations. Several defenses have been proposed against such attacks, while many of them have been adaptively attacked. In this work, we aim to enhance the ML robustness from a different perspective by leveraging domain knowledge: We propose a Knowledge Enhanced Machine Learning Pipeline (KEMLP) to integrate domain knowledge (i.e., logic relationships among different predictions) into a probabilistic graphical model via first-order logic rules. In particular, we develop KEMLP by integrating a diverse set of weak auxiliary models based on their logical relationships to the main DNN model that performs the target task. Theoretically, we provide convergence results and prove that, under mild conditions, the prediction of KEMLP is more robust than that of the main DNN model. Empirically, we take road sign recognition as an example and leverage the relationships between road signs and their shapes and contents as domain knowledge. We show that compared with adversarial training and other baselines, KEMLP achieves higher robustness against physical attacks, $\mathcal{L}_p$ bounded attacks, unforeseen attacks, and natural corruptions under both whitebox and blackbox settings, while still maintaining high clean accuracy.
    WAX-ML: A Python library for machine learning and feedback loops on streaming data. (arXiv:2106.06524v1 [cs.LG])
    (2 min) Wax is what you put on a surfboard to avoid slipping. It is an essential tool to go surfing... We introduce WAX-ML a research-oriented Python library providing tools to design powerful machine learning algorithms and feedback loops working on streaming data. It strives to complement JAX with tools dedicated to time series. WAX-ML makes JAX-based programs easy to use for end-users working with pandas and xarray for data manipulation. It provides a simple mechanism for implementing feedback loops, allows the implementation of online learning and reinforcement learning algorithms with functions, and makes them easy to integrate by end-users working with the object-oriented reinforcement learning framework from the Gym library. It is released with an Apache open-source license on GitHub at https://github.com/eserie/wax-ml.
    Learning Abstract Representations through Lossy Compression of Multi-Modal Signals. (arXiv:2101.11376v2 [cs.LG] UPDATED)
    (2 min) A key competence for open-ended learning is the formation of increasingly abstract representations useful for driving complex behavior. Abstract representations ignore specific details and facilitate generalization. Here we consider the learning of abstract representations in a multi-modal setting with two or more input modalities. We treat the problem as a lossy compression problem and show that generic lossy compression of multimodal sensory input naturally extracts abstract representations that tend to strip away modalitiy specific details and preferentially retain information that is shared across the different modalities. Furthermore, we propose an architecture to learn abstract representations by identifying and retaining only the information that is shared across multiple modalities while discarding any modality specific information.
    RNN with Particle Flow for Probabilistic Spatio-temporal Forecasting. (arXiv:2106.06064v1 [stat.ML])
    (2 min) Spatio-temporal forecasting has numerous applications in analyzing wireless, traffic, and financial networks. Many classical statistical models often fall short in handling the complexity and high non-linearity present in time-series data. Recent advances in deep learning allow for better modelling of spatial and temporal dependencies. While most of these models focus on obtaining accurate point forecasts, they do not characterize the prediction uncertainty. In this work, we consider the time-series data as a random realization from a nonlinear state-space model and target Bayesian inference of the hidden states for probabilistic forecasting. We use particle flow as the tool for approximating the posterior distribution of the states, as it is shown to be highly effective in complex, high-dimensional settings. Thorough experimentation on several real world time-series datasets demonstrates that our approach provides better characterization of uncertainty while maintaining comparable accuracy to the state-of-the art point forecasting methods.
    Order Matters: Probabilistic Modeling of Node Sequence for Graph Generation. (arXiv:2106.06189v1 [stat.ML])
    (2 min) A graph generative model defines a distribution over graphs. One type of generative model is constructed by autoregressive neural networks, which sequentially add nodes and edges to generate a graph. However, the likelihood of a graph under the autoregressive model is intractable, as there are numerous sequences leading to the given graph; this makes maximum likelihood estimation challenging. Instead, in this work we derive the exact joint probability over the graph and the node ordering of the sequential process. From the joint, we approximately marginalize out the node orderings and compute a lower bound on the log-likelihood using variational inference. We train graph generative models by maximizing this bound, without using the ad-hoc node orderings of previous methods. Our experiments show that the log-likelihood bound is significantly tighter than the bound of previous schemes. Moreover, the models fitted with the proposed algorithm can generate high-quality graphs that match the structures of target graphs not seen during training. We have made our code publicly available at \hyperref[https://github.com/tufts-ml/graph-generation-vi]{https://github.com/tufts-ml/graph-generation-vi}.
    Scale-invariant scale-channel networks: Deep networks that generalise to previously unseen scales. (arXiv:2106.06418v1 [cs.CV])
    (2 min) The ability to handle large scale variations is crucial for many real world visual tasks. A straightforward approach for handling scale in a deep network is to process an image at several scales simultaneously in a set of scale channels. Scale invariance can then, in principle, be achieved by using weight sharing between the scale channels together with max or average pooling over the outputs from the scale channels. The ability of such scale channel networks to generalise to scales not present in the training set over significant scale ranges has, however, not previously been explored. In this paper, we present a systematic study of this methodology by implementing different types of scale channel networks and evaluating their ability to generalise to previously unseen scales. We develop a formalism for analysing the covariance and invariance properties of scale channel networks, and explore how different design choices, unique to scaling transformations, affect the overall performance of scale channel networks. We first show that two previously proposed scale channel network designs do not generalise well to scales not present in the training set. We explain theoretically and demonstrate experimentally why generalisation fails in these cases. We then propose a new type of foveated scale channel architecture}, where the scale channels process increasingly larger parts of the image with decreasing resolution. This new type of scale channel network is shown to generalise extremely well, provided sufficient image resolution and the absence of boundary effects. Our proposed FovMax and FovAvg networks perform almost identically over a scale range of 8, also when training on single scale training data, and do also give improved performance when learning from datasets with large scale variations in the small sample regime.
    Simple and Efficient Hard Label Black-box Adversarial Attacks in Low Query Budget Regimes. (arXiv:2007.07210v2 [cs.LG] UPDATED)
    (2 min) We focus on the problem of black-box adversarial attacks, where the aim is to generate adversarial examples for deep learning models solely based on information limited to output label~(hard label) to a queried data input. We propose a simple and efficient Bayesian Optimization~(BO) based approach for developing black-box adversarial attacks. Issues with BO's performance in high dimensions are avoided by searching for adversarial examples in a structured low-dimensional subspace. We demonstrate the efficacy of our proposed attack method by evaluating both $\ell_\infty$ and $\ell_2$ norm constrained untargeted and targeted hard label black-box attacks on three standard datasets - MNIST, CIFAR-10 and ImageNet. Our proposed approach consistently achieves 2x to 10x higher attack success rate while requiring 10x to 20x fewer queries compared to the current state-of-the-art black-box adversarial attacks.
    Probability Paths and the Structure of Predictions over Time. (arXiv:2106.06515v1 [cs.LG])
    (2 min) In settings ranging from weather forecasts to political prognostications to financial projections, probability estimates of future binary outcomes often evolve over time. For example, the estimated likelihood of rain on a specific day changes by the hour as new information becomes available. Given a collection of such probability paths, we introduce a Bayesian framework -- which we call the Gaussian latent information martingale, or GLIM -- for modeling the structure of dynamic predictions over time. Suppose, for example, that the likelihood of rain in a week is 50%, and consider two hypothetical scenarios. In the first, one expects the forecast is equally likely to become either 25% or 75% tomorrow; in the second, one expects the forecast to stay constant for the next several days. A time-sensitive decision-maker might select a course of action immediately in the latter scenario, but may postpone their decision in the former, knowing that new information is imminent. We model these trajectories by assuming predictions update according to a latent process of information flow, which is inferred from historical data. In contrast to general methods for time series analysis, this approach preserves the martingale structure of probability paths and better quantifies future uncertainties around probability paths. We show that GLIM outperforms three popular baseline methods, producing better estimated posterior probability path distributions measured by three different metrics. By elucidating the dynamic structure of predictions over time, we hope to help individuals make more informed choices.
    MSPM: A Modularized and Scalable Multi-Agent Reinforcement Learning-based System for Financial Portfolio Management. (arXiv:2102.03502v3 [q-fin.PM] UPDATED)
    (2 min) Financial portfolio management is one of the most applicable problems in reinforcement learning (RL) owing to its sequential decision-making nature. Existing RL-based approaches, while inspiring, often lack scalability, reusability, or profundity of intake information to accommodate the ever-changing capital markets. In this paper, we propose MSPM, a modularized and scalable, multi-agent RL-based system for financial portfolio management. MSPM involves two asynchronously updated units: an Evolving Agent Module (EAM) and Strategic Agent Module (SAM). A self-sustained EAM produces signal-comprised information for a specific asset using heterogeneous data inputs, and each EAM employs its reusability to have connections to multiple SAMs. An SAM is responsible for asset reallocation in a portfolio using profound information from the connected EAMs. With the elaborate architecture and the multi-step condensation of volatile market information, MSPM aims to provide a customizable, stable, and dedicated solution to portfolio management, unlike existing approaches. We also tackle the data-shortage issue of newly-listed stocks by transfer learning, and validate the indispensability of EAM with four different portfolios. Experiments on 8-year U.S. stock market data prove the effectiveness of MSPM in profit accumulation, by its outperformance over existing benchmarks.
    Learning the Precise Feature for Cluster Assignment. (arXiv:2106.06159v1 [cs.CV])
    (2 min) Clustering is one of the fundamental tasks in computer vision and pattern recognition. Recently, deep clustering methods (algorithms based on deep learning) have attracted wide attention with their impressive performance. Most of these algorithms combine deep unsupervised representation learning and standard clustering together. However, the separation of representation learning and clustering will lead to suboptimal solutions because the two-stage strategy prevents representation learning from adapting to subsequent tasks (e.g., clustering according to specific cues). To overcome this issue, efforts have been made in the dynamic adaption of representation and cluster assignment, whereas current state-of-the-art methods suffer from heuristically constructed objectives with representation and cluster assignment alternatively optimized. To further standardize the clustering problem, we audaciously formulate the objective of clustering as finding a precise feature as the cue for cluster assignment. Based on this, we propose a general-purpose deep clustering framework which radically integrates representation learning and clustering into a single pipeline for the first time. The proposed framework exploits the powerful ability of recently developed generative models for learning intrinsic features, and imposes an entropy minimization on the distribution of the cluster assignment by a dedicated variational algorithm. Experimental results show that the performance of the proposed method is superior, or at least comparable to, the state-of-the-art methods on the handwritten digit recognition, fashion recognition, face recognition and object recognition benchmark datasets.
    Hierarchical Reinforcement Learning for Air-to-Air Combat. (arXiv:2105.00990v2 [cs.LG] UPDATED)
    (2 min) Artificial Intelligence (AI) is becoming a critical component in the defense industry, as recently demonstrated by DARPA`s AlphaDogfight Trials (ADT). ADT sought to vet the feasibility of AI algorithms capable of piloting an F-16 in simulated air-to-air combat. As a participant in ADT, Lockheed Martin`s (LM) approach combines a hierarchical architecture with maximum-entropy reinforcement learning (RL), integrates expert knowledge through reward shaping, and supports modularity of policies. This approach achieved a $2^{nd}$ place finish in the final ADT event (among eight total competitors) and defeated a graduate of the US Air Force's (USAF) F-16 Weapons Instructor Course in match play.
    Object Segmentation Without Labels with Large-Scale Generative Models. (arXiv:2006.04988v2 [cs.LG] UPDATED)
    (2 min) The recent rise of unsupervised and self-supervised learning has dramatically reduced the dependency on labeled data, providing effective image representations for transfer to downstream vision tasks. Furthermore, recent works employed these representations in a fully unsupervised setup for image classification, reducing the need for human labels on the fine-tuning stage as well. This work demonstrates that large-scale unsupervised models can also perform a more challenging object segmentation task, requiring neither pixel-level nor image-level labeling. Namely, we show that recent unsupervised GANs allow to differentiate between foreground/background pixels, providing high-quality saliency masks. By extensive comparison on standard benchmarks, we outperform existing unsupervised alternatives for object segmentation, achieving new state-of-the-art.
    HPO-B: A Large-Scale Reproducible Benchmark for Black-Box HPO based on OpenML. (arXiv:2106.06257v1 [cs.LG])
    (2 min) Hyperparameter optimization (HPO) is a core problem for the machine learning community and remains largely unsolved due to the significant computational resources required to evaluate hyperparameter configurations. As a result, a series of recent related works have focused on the direction of transfer learning for quickly fine-tuning hyperparameters on a dataset. Unfortunately, the community does not have a common large-scale benchmark for comparing HPO algorithms. Instead, the de facto practice consists of empirical protocols on arbitrary small-scale meta-datasets that vary inconsistently across publications, making reproducibility a challenge. To resolve this major bottleneck and enable a fair and fast comparison of black-box HPO methods on a level playing field, we propose HPO-B, a new large-scale benchmark in the form of a collection of meta-datasets. Our benchmark is assembled and preprocessed from the OpenML repository and consists of 176 search spaces (algorithms) evaluated sparsely on 196 datasets with a total of 6.4 million hyperparameter evaluations. For ensuring reproducibility on our benchmark, we detail explicit experimental protocols, splits, and evaluation measures for comparing methods for both non-transfer, as well as, transfer learning HPO.
    Statistical Mechanical Analysis of Neural Network Pruning. (arXiv:2006.16617v3 [cs.LG] UPDATED)
    (2 min) Deep learning architectures with a huge number of parameters are often compressed using pruning techniques to ensure computational efficiency of inference during deployment. Despite multitude of empirical advances, there is a lack of theoretical understanding of the effectiveness of different pruning methods. We inspect different pruning techniques under the statistical mechanics formulation of a teacher-student framework and derive their generalization error (GE) bounds. It has been shown that Determinantal Point Process (DPP) based node pruning method is notably superior to competing approaches when tested on real datasets. Using GE bounds in the aforementioned setup we provide theoretical guarantees for their empirical observations. Another consistent finding in literature is that sparse neural networks (edge pruned) generalize better than dense neural networks (node pruned) for a fixed number of parameters. We use our theoretical setup to prove this finding and show that even the baseline random edge pruning method performs better than the DPP node pruning method. We also validate this empirically on real datasets.
    Efficient Competitions and Online Learning with Strategic Forecasters. (arXiv:2102.08358v2 [cs.LG] UPDATED)
    (2 min) Winner-take-all competitions in forecasting and machine-learning suffer from distorted incentives. Witkowski et al. 2018 identified this problem and proposed ELF, a truthful mechanism to select a winner. We show that, from a pool of $n$ forecasters, ELF requires $\Theta(n\log n)$ events or test data points to select a near-optimal forecaster with high probability. We then show that standard online learning algorithms select an $\epsilon$-optimal forecaster using only $O(\log(n) / \epsilon^2)$ events, by way of a strong approximate-truthfulness guarantee. This bound matches the best possible even in the nonstrategic setting. We then apply these mechanisms to obtain the first no-regret guarantee for non-myopic strategic experts.
    Measuring the sensitivity of Gaussian processes to kernel choice. (arXiv:2106.06510v1 [stat.ML])
    (2 min) Gaussian processes (GPs) are used to make medical and scientific decisions, including in cardiac care and monitoring of carbon dioxide emissions. But the choice of GP kernel is often somewhat arbitrary. In particular, uncountably many kernels typically align with qualitative prior knowledge (e.g. function smoothness or stationarity). But in practice, data analysts choose among a handful of convenient standard kernels (e.g. squared exponential). In the present work, we ask: Would decisions made with a GP differ under other, qualitatively interchangeable kernels? We show how to formulate this sensitivity analysis as a constrained optimization problem over a finite-dimensional space. We can then use standard optimizers to identify substantive changes in relevant decisions made with a GP. We demonstrate in both synthetic and real-world examples that decisions made with a GP can exhibit substantial sensitivity to kernel choice, even when prior draws are qualitatively interchangeable to a user.
    On Learnability via Gradient Method for Two-Layer ReLU Neural Networks in Teacher-Student Setting. (arXiv:2106.06251v1 [stat.ML])
    (2 min) Deep learning empirically achieves high performance in many applications, but its training dynamics has not been fully understood theoretically. In this paper, we explore theoretical analysis on training two-layer ReLU neural networks in a teacher-student regression model, in which a student network learns an unknown teacher network through its outputs. We show that with a specific regularization and sufficient over-parameterization, the student network can identify the parameters of the teacher network with high probability via gradient descent with a norm dependent stepsize even though the objective function is highly non-convex. The key theoretical tool is the measure representation of the neural networks and a novel application of a dual certificate argument for sparse estimation on a measure space. We analyze the global minima and global convergence property in the measure space.
    Twins: Revisiting the Design of Spatial Attention in Vision Transformers. (arXiv:2104.13840v3 [cs.CV] UPDATED)
    (2 min) Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully-devised yet simple spatial attention mechanism performs favourably against the state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely, Twins-PCPVT and Twins-SVT. Our proposed architectures are highly-efficient and easy to implement, only involving matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks including imagelevel classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code will be released soon at https://github.com/Meituan-AutoML/Twins .
    Active Learning of Continuous-time Bayesian Networks through Interventions. (arXiv:2105.14742v2 [stat.ML] UPDATED)
    (2 min) We consider the problem of learning structures and parameters of Continuous-time Bayesian Networks (CTBNs) from time-course data under minimal experimental resources. In practice, the cost of generating experimental data poses a bottleneck, especially in the natural and social sciences. A popular approach to overcome this is Bayesian optimal experimental design (BOED). However, BOED becomes infeasible in high-dimensional settings, as it involves integration over all possible experimental outcomes. We propose a novel criterion for experimental design based on a variational approximation of the expected information gain. We show that for CTBNs, a semi-analytical expression for this criterion can be calculated for structure and parameter learning. By doing so, we can replace sampling over experimental outcomes by solving the CTBNs master-equation, for which scalable approximations exist. This alleviates the computational burden of sampling possible experimental outcomes in high-dimensions. We employ this framework in order to recommend interventional sequences. In this context, we extend the CTBN model to conditional CTBNs in order to incorporate interventions. We demonstrate the performance of our criterion on synthetic and real-world data.
    Recovery of Meteorites Using an Autonomous Drone and Machine Learning. (arXiv:2106.06523v1 [astro-ph.EP])
    (2 min) The recovery of freshly fallen meteorites from tracked and triangulated meteors is critical to determining their source asteroid families. However, locating meteorite fragments in strewn fields remains a challenge with very few meteorites being recovered from the meteors triangulated in past and ongoing meteor camera networks. We examined if locating meteorites can be automated using machine learning and an autonomous drone. Drones can be programmed to fly a grid search pattern and take systematic pictures of the ground over a large survey area. Those images can be analyzed using a machine learning classifier to identify meteorites in the field among many other features. Here, we describe a proof-of-concept meteorite classifier that deploys off-line a combination of different convolution neural networks to recognize meteorites from images taken by drones in the field. The system was implemented in a conceptual drone setup and tested in the suspected strewn field of a recent meteorite fall near Walker Lake, Nevada.
    Deep Two-Way Matrix Reordering for Relational Data Analysis. (arXiv:2103.14203v4 [stat.ML] UPDATED)
    (2 min) Matrix reordering is a task to permute the rows and columns of a given observed matrix such that the resulting reordered matrix shows meaningful or interpretable structural patterns. Most existing matrix reordering techniques share the common processes of extracting some feature representations from an observed matrix in a predefined manner, and applying matrix reordering based on it. However, in some practical cases, we do not always have prior knowledge about the structural pattern of an observed matrix. To address this problem, we propose a new matrix reordering method, called deep two-way matrix reordering (DeepTMR), using a neural network model. The trained network can automatically extract nonlinear row/column features from an observed matrix, which can then be used for matrix reordering. Moreover, the proposed DeepTMR provides the denoised mean matrix of a given observed matrix as an output of the trained network. This denoised mean matrix can be used to visualize the global structure of the reordered observed matrix. We demonstrate the effectiveness of the proposed DeepTMR by applying it to both synthetic and practical datasets.
    Guarantees for Tuning the Step Size using a Learning-to-Learn Approach. (arXiv:2006.16495v2 [stat.ML] UPDATED)
    (2 min) Choosing the right parameters for optimization algorithms is often the key to their success in practice. Solving this problem using a learning-to-learn approach -- using meta-gradient descent on a meta-objective based on the trajectory that the optimizer generates -- was recently shown to be effective. However, the meta-optimization problem is difficult. In particular, the meta-gradient can often explode/vanish, and the learned optimizer may not have good generalization performance if the meta-objective is not chosen carefully. In this paper we give meta-optimization guarantees for the learning-to-learn approach on a simple problem of tuning the step size for quadratic loss. Our results show that the na\"ive objective suffers from meta-gradient explosion/vanishing problem. Although there is a way to design the meta-objective so that the meta-gradient remains polynomially bounded, computing the meta-gradient directly using backpropagation leads to numerical issues. We also characterize when it is necessary to compute the meta-objective on a separate validation set to ensure the generalization performance of the learned optimizer. Finally, we verify our results empirically and show that a similar phenomenon appears even for more complicated learned optimizers parametrized by neural networks.
    Catch-A-Waveform: Learning to Generate Audio from a Single Short Example. (arXiv:2106.06426v1 [cs.SD])
    (2 min) Models for audio generation are typically trained on hours of recordings. Here, we illustrate that capturing the essence of an audio source is typically possible from as little as a few tens of seconds from a single training signal. Specifically, we present a GAN-based generative model that can be trained on one short audio signal from any domain (e.g. speech, music, etc.) and does not require pre-training or any other form of external supervision. Once trained, our model can generate random samples of arbitrary duration that maintain semantic similarity to the training waveform, yet exhibit new compositions of its audio primitives. This enables a long line of interesting applications, including generating new jazz improvisations or new a-cappella rap variants based on a single short example, producing coherent modifications to famous songs (e.g. adding a new verse to a Beatles song based solely on the original recording), filling-in of missing parts (inpainting), extending the bandwidth of a speech signal (super-resolution), and enhancing old recordings without access to any clean training example. We show that in all cases, no more than 20 seconds of training audio commonly suffice for our model to achieve state-of-the-art results. This is despite its complete lack of prior knowledge about the nature of audio signals in general.
    Demystifying Assumptions in Learning to Discover Novel Classes. (arXiv:2102.04002v3 [cs.LG] UPDATED)
    (2 min) In learning to discover novel classes (L2DNC), we are given labeled data from seen classes and unlabeled data from unseen classes, and we train clustering models for the unseen classes. However, the rigorous definition of L2DNC is unexplored, which results in that its implicit assumptions are still unclear. In this paper, we demystify assumptions behind L2DNC and find that high-level semantic features should be shared among the seen and unseen classes. This naturally motivates us to link L2DNC to meta-learning that has exactly the same assumption as L2DNC. Based on this finding, L2DNC is not only theoretically solvable, but can also be empirically solved by meta-learning algorithms after slight modifications. This L2DNC methodology significantly reduces the amount of unlabeled data needed for training and makes it more practical, as demonstrated in experiments. The use of very limited data is also justified by the application scenario of L2DNC: since it is unnatural to label only seen-class data, L2DNC is sampling instead of labeling in causality. Therefore, unseen-class data should be collected on the way of collecting seen-class data, which is why they are novel and first need to be clustered.
    The Limitations of Large Width in Neural Networks: A Deep Gaussian Process Perspective. (arXiv:2106.06529v1 [cs.LG])
    (2 min) Large width limits have been a recent focus of deep learning research: modulo computational practicalities, do wider networks outperform narrower ones? Answering this question has been challenging, as conventional networks gain representational power with width, potentially masking any negative effects. Our analysis in this paper decouples capacity and width via the generalization of neural networks to Deep Gaussian Processes (Deep GP), a class of hierarchical models that subsume neural nets. In doing so, we aim to understand how width affects standard neural networks once they have sufficient capacity for a given modeling task. Our theoretical and empirical results on Deep GP suggest that large width is generally detrimental to hierarchical models. Surprisingly, we prove that even nonparametric Deep GP converge to Gaussian processes, effectively becoming shallower without any increase in representational power. The posterior, which corresponds to a mixture of data-adaptable basis functions, becomes less data-dependent with width. Our tail analysis demonstrates that width and depth have opposite effects: depth accentuates a model's non-Gaussianity, while width makes models increasingly Gaussian. We find there is a "sweet spot" that maximizes test set performance before the limiting GP behavior prevents adaptability, occurring at width = 1 or width = 2 for nonparametric Deep GP. These results make strong predictions about the same phenomenon in conventional neural networks: we show empirically that many neural network architectures need 10 - 500 hidden units for sufficient capacity - depending on the dataset - but further width degrades test performance.
    Signed Graph Metric Learning via Gershgorin Disc Perfect Alignment. (arXiv:2006.08816v6 [cs.LG] UPDATED)
    (2 min) Given a convex and differentiable objective $Q(\M)$ for a real symmetric matrix $\M$ in the positive definite (PD) cone -- used to compute Mahalanobis distances -- we propose a fast general metric learning framework that is entirely projection-free. We first assume that $\M$ resides in a space $\cS$ of generalized graph Laplacian matrices corresponding to balanced signed graphs. $\M \in \cS$ that is also PD is called a graph metric matrix. Unlike low-rank metric matrices common in the literature, $\cS$ includes the important diagonal-only matrices as a special case. The key theorem to circumvent full eigen-decomposition and enable fast metric matrix optimization is Gershgorin disc perfect alignment (GDPA): given $\M \in \cS$ and diagonal matrix $\S$, where $S_{ii} = 1/v_i$ and $\v$ is $\M$'s first eigenvector, we prove that Gershgorin disc left-ends of similarity transform $\B = \S \M \S^{-1}$ are perfectly aligned at the smallest eigenvalue $\lambda_{\min}$. Using this theorem, we replace the PD cone constraint in the metric learning problem with tightest possible linear constraints per iteration, so that the alternating optimization of the diagonal / off-diagonal terms in $\M$ can be solved efficiently as linear programs via the Frank-Wolfe method. We update $\v$ using Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) with warm start as entries in $\M$ are optimized successively. Experiments show that our graph metric optimization is significantly faster than cone-projection schemes, and produces competitive binary classification performance.
    Variance Reduced Training with Stratified Sampling for Forecasting Models. (arXiv:2103.02062v2 [cs.LG] UPDATED)
    (2 min) In large-scale time series forecasting, one often encounters the situation where the temporal patterns of time series, while drifting over time, differ from one another in the same dataset. In this paper, we provably show under such heterogeneity, training a forecasting model with commonly used stochastic optimizers (e.g. SGD) potentially suffers large variance on gradient estimation, and thus incurs long-time training. We show that this issue can be efficiently alleviated via stratification, which allows the optimizer to sample from pre-grouped time series strata. For better trading-off gradient variance and computation complexity, we further propose SCott (Stochastic Stratified Control Variate Gradient Descent), a variance reduced SGD-style optimizer that utilizes stratified sampling via control variate. In theory, we provide the convergence guarantee of SCott on smooth non-convex objectives. Empirically, we evaluate SCott and other baseline optimizers on both synthetic and real-world time series forecasting problems, and demonstrate SCott converges faster with respect to both iterations and wall clock time.
    Exploiting Large-scale Teacher-Student Training for On-device Acoustic Models. (arXiv:2106.06126v1 [cs.SD])
    (2 min) We present results from Alexa speech teams on semi-supervised learning (SSL) of acoustic models (AM) with experiments spanning over 3000 hours of GPU time, making our study one of the largest of its kind. We discuss SSL for AMs in a small footprint setting, showing that a smaller capacity model trained with 1 million hours of unsupervised data can outperform a baseline supervised system by 14.3% word error rate reduction (WERR). When increasing the supervised data to seven-fold, our gains diminish to 7.1% WERR; to improve SSL efficiency at larger supervised data regimes, we employ a step-wise distillation into a smaller model, obtaining a WERR of 14.4%. We then switch to SSL using larger student models in low data regimes; while learning efficiency with unsupervised data is higher, student models may outperform teacher models in such a setting. We develop a theoretical sketch to explain this behavior.
    A Zeroth-Order Block Coordinate Descent Algorithm for Huge-Scale Black-Box Optimization. (arXiv:2102.10707v2 [math.OC] UPDATED)
    (2 min) We consider the zeroth-order optimization problem in the huge-scale setting, where the dimension of the problem is so large that performing even basic vector operations on the decision variables is infeasible. In this paper, we propose a novel algorithm, coined ZO-BCD, that exhibits favorable overall query complexity and has a much smaller per-iteration computational complexity. In addition, we discuss how the memory footprint of ZO-BCD can be reduced even further by the clever use of circulant measurement matrices. As an application of our new method, we propose the idea of crafting adversarial attacks on neural network based classifiers in a wavelet domain, which can result in problem dimensions of over 1.7 million. In particular, we show that crafting adversarial examples to audio classifiers in a wavelet domain can achieve the state-of-the-art attack success rate of 97.9%.
    Self-Trained One-class Classification for Unsupervised Anomaly Detection. (arXiv:2106.06115v1 [cs.LG])
    (2 min) Anomaly detection (AD), separating anomalies from normal data, has various applications across domains, from manufacturing to healthcare. While most previous works have shown to be effective for cases with fully or partially labeled data, they are less practical for AD applications due to tedious data labeling processes. In this work, we focus on unsupervised AD problems whose entire training data are unlabeled and may contain both normal and anomalous samples. To tackle this problem, we build a robust one-class classification framework via data refinement. To refine the data accurately, we propose an ensemble of one-class classifiers, each of which is trained on a disjoint subset of training data. Moreover, we propose a self-training of deep representation one-class classifiers (STOC) that iteratively refines the data and deep representations. In experiments, we show the efficacy of our method for unsupervised anomaly detection on benchmarks from image and tabular data domains. For example, with a 10% anomaly ratio on CIFAR-10 data, the proposed method outperforms state-of-the-art one-class classification method by 6.3 AUC and 12.5 average precision.
    Adapting to Misspecification in Contextual Bandits with Offline Regression Oracles. (arXiv:2102.13240v2 [cs.LG] UPDATED)
    (2 min) Computationally efficient contextual bandits are often based on estimating a predictive model of rewards given contexts and arms using past data. However, when the reward model is not well-specified, the bandit algorithm may incur unexpected regret, so recent work has focused on algorithms that are robust to misspecification. We propose a simple family of contextual bandit algorithms that adapt to misspecification error by reverting to a good safe policy when there is evidence that misspecification is causing a regret increase. Our algorithm requires only an offline regression oracle to ensure regret guarantees that gracefully degrade in terms of a measure of the average misspecification level. Compared to prior work, we attain similar regret guarantees, but we do no rely on a master algorithm, and do not require more robust oracles like online or constrained regression oracles (e.g., Foster et al. (2020a); Krishnamurthy et al. (2020)). This allows us to design algorithms for more general function approximation classes.
    Machine Learning Framework for Sensing and Modeling Interference in IoT Frequency Bands. (arXiv:2106.06010v1 [cs.LG])
    (2 min) Spectrum scarcity has surfaced as a prominent concern in wireless radio communications with the emergence of new technologies over the past few years. As a result, there is growing need for better understanding of the spectrum occupancy with newly emerging access technologies supporting the Internet of Things. In this paper, we present a framework to capture and model the traffic behavior of short-time spectrum occupancy for IoT applications in the shared bands to determine the existing interference. The proposed capturing method utilizes a software defined radio to monitor the short bursts of IoT transmissions by capturing the time series data which is converted to power spectral density to extract the observed occupancy. Furthermore, we propose the use of an unsupervised machine learning technique to enhance conventionally implemented energy detection methods. Our experimental results show that the temporal and frequency behavior of the spectrum can be well-captured using the combination of two models, namely, semi-Markov chains and a Poisson-distribution arrival rate. We conduct an extensive measurement campaign in different urban environments and incorporate the spatial effect on the IoT shared spectrum.
    Instance-Level Task Parameters: A Robust Multi-task Weighting Framework. (arXiv:2106.06129v1 [cs.CV])
    (2 min) Recent works have shown that deep neural networks benefit from multi-task learning by learning a shared representation across several related tasks. However, performance of such systems depend on relative weighting between various losses involved during training. Prior works on loss weighting schemes assume that instances are equally easy or hard for all tasks. In order to break this assumption, we let the training process dictate the optimal weighting of tasks for every instance in the dataset. More specifically, we equip every instance in the dataset with a set of learnable parameters (instance-level task parameters) where the cardinality is equal to the number of tasks learned by the model. These parameters model the weighting of each task for an instance. They are updated by gradient descent and do not require hand-crafted rules. We conduct extensive experiments on SURREAL and CityScapes datasets, for human shape and pose estimation, depth estimation and semantic segmentation tasks. In these tasks, our approach outperforms recent dynamic loss weighting approaches, e.g. reducing surface estimation errors by 8.97% on SURREAL. When applied to datasets where one or more tasks can have noisy annotations, the proposed method learns to prioritize learning from clean labels for a given task, e.g. reducing surface estimation errors by up to 60%. We also show that we can reliably detect corrupt labels for a given task as a by-product from learned instance-level task parameters.
    PyTorch Geometric Temporal: Spatiotemporal Signal Processing with Neural Machine Learning Models. (arXiv:2104.07788v3 [cs.LG] UPDATED)
    (2 min) We present PyTorch Geometric Temporal a deep learning framework combining state-of-the-art machine learning algorithms for neural spatiotemporal signal processing. The main goal of the library is to make temporal geometric deep learning available for researchers and machine learning practitioners in a unified easy-to-use framework. PyTorch Geometric Temporal was created with foundations on existing libraries in the PyTorch eco-system, streamlined neural network layer definitions, temporal snapshot generators for batching, and integrated benchmark datasets. These features are illustrated with a tutorial-like case study. Experiments demonstrate the predictive performance of the models implemented in the library on real world problems such as epidemiological forecasting, ridehail demand prediction and web-traffic management. Our sensitivity analysis of runtime shows that the framework can potentially operate on web-scale datasets with rich temporal features and spatial structure.
    High-Performance FPGA-based Accelerator for Bayesian Neural Networks. (arXiv:2105.09163v2 [cs.AR] UPDATED)
    (2 min) Neural networks (NNs) have demonstrated their potential in a wide range of applications such as image recognition, decision making or recommendation systems. However, standard NNs are unable to capture their model uncertainty which is crucial for many safety-critical applications including healthcare and autonomous vehicles. In comparison, Bayesian neural networks (BNNs) are able to express uncertainty in their prediction via a mathematical grounding. Nevertheless, BNNs have not been as widely used in industrial practice, mainly because of their expensive computational cost and limited hardware performance. This work proposes a novel FPGA-based hardware architecture to accelerate BNNs inferred through Monte Carlo Dropout. Compared with other state-of-the-art BNN accelerators, the proposed accelerator can achieve up to 4 times higher energy efficiency and 9 times better compute efficiency. Considering partial Bayesian inference, an automatic framework is proposed, which explores the trade-off between hardware and algorithmic performance. Extensive experiments are conducted to demonstrate that our proposed framework can effectively find the optimal points in the design space.
    Dynamic Game Theoretic Neural Optimizer. (arXiv:2105.03788v2 [cs.LG] UPDATED)
    (2 min) The connection between training deep neural networks (DNNs) and optimal control theory (OCT) has attracted considerable attention as a principled tool of algorithmic design. Despite few attempts being made, they have been limited to architectures where the layer propagation resembles a Markovian dynamical system. This casts doubts on their flexibility to modern networks that heavily rely on non-Markovian dependencies between layers (e.g. skip connections in residual networks). In this work, we propose a novel dynamic game perspective by viewing each layer as a player in a dynamic game characterized by the DNN itself. Through this lens, different classes of optimizers can be seen as matching different types of Nash equilibria, depending on the implicit information structure of each (p)layer. The resulting method, called Dynamic Game Theoretic Neural Optimizer (DGNOpt), not only generalizes OCT-inspired optimizers to richer network class; it also motivates a new training principle by solving a multi-player cooperative game. DGNOpt shows convergence improvements over existing methods on image classification datasets with residual and inception networks. Our work marries strengths from both OCT and game theory, paving ways to new algorithmic opportunities from robust optimal control and bandit-based optimization.
    Overfitting in Bayesian Optimization: an empirical study and early-stopping solution. (arXiv:2104.08166v2 [cs.LG] UPDATED)
    (2 min) Tuning machine learning models with Bayesian optimization (BO) is a successful strategy to find good hyperparameters. BO defines an iterative procedure where a cross-validated metric is evaluated on promising hyperparameters. In practice, however, an improvement of the validation metric may not translate in better predictive performance on a test set, especially when tuning models trained on small datasets. In other words, unlike conventional wisdom dictates, BO can overfit. In this paper, we carry out the first systematic investigation of overfitting in BO and demonstrate that this issue is serious, yet often overlooked in practice. We propose a novel criterion to early stop BO, which aims to maintain the solution quality while saving the unnecessary iterations that can lead to overfitting. Experiments on real-world hyperparameter optimization problems show that our approach effectively meets these goals and is more adaptive comparing to baselines.
    Hutch++: Optimal Stochastic Trace Estimation. (arXiv:2010.09649v5 [cs.DS] UPDATED)
    (2 min) We study the problem of estimating the trace of a matrix $A$ that can only be accessed through matrix-vector multiplication. We introduce a new randomized algorithm, Hutch++, which computes a $(1 \pm \epsilon)$ approximation to $tr(A)$ for any positive semidefinite (PSD) $A$ using just $O(1/\epsilon)$ matrix-vector products. This improves on the ubiquitous Hutchinson's estimator, which requires $O(1/\epsilon^2)$ matrix-vector products. Our approach is based on a simple technique for reducing the variance of Hutchinson's estimator using a low-rank approximation step, and is easy to implement and analyze. Moreover, we prove that, up to a logarithmic factor, the complexity of Hutch++ is optimal amongst all matrix-vector query algorithms, even when queries can be chosen adaptively. We show that it significantly outperforms Hutchinson's method in experiments. While our theory mainly requires $A$ to be positive semidefinite, we provide generalized guarantees for general square matrices, and show empirical gains in such applications.
    Benchmarking Multi-Agent Deep Reinforcement Learning Algorithms in Cooperative Tasks. (arXiv:2006.07869v3 [cs.LG] UPDATED)
    (2 min) Multi-agent deep reinforcement learning (MARL) suffers from a lack of commonly-used evaluation tasks and criteria, making comparisons between approaches difficult. In this work, we consistently evaluate and compare three different classes of MARL algorithms (independent learning, centralised multi-agent policy gradient, value decomposition) in a diverse range of cooperative multi-agent learning tasks. Our experiments serve as a reference for the expected performance of algorithms across different learning tasks, and we provide insights regarding the effectiveness of different learning approaches. We open-source EPyMARL, which extends the PyMARL codebase~\citep{samvelyan19smac} to include additional algorithms and allow for flexible configuration of algorithm implementation details such as parameter sharing. Finally, we open-source two environments for multi-agent research which focus on coordination under sparse rewards.
    Domain Transformer: Predicting Samples of Unseen, Future Domains. (arXiv:2106.06057v1 [cs.LG])
    (2 min) The data distribution commonly evolves over time leading to problems such as concept drift that often decrease classifier performance. We seek to predict unseen data (and their labels) allowing us to tackle challenges due to a non-constant data distribution in a \emph{proactive} manner rather than detecting and reacting to already existing changes that might already have led to errors. To this end, we learn a domain transformer in an unsupervised manner that allows generating data of unseen domains. Our approach first matches independently learned latent representations of two given domains obtained from an auto-encoder using a Cycle-GAN. In turn, a transformation of the original samples can be learned that can be applied iteratively to extrapolate to unseen domains. Our evaluation on CNNs on image data confirms the usefulness of the approach. It also achieves very good results on the well-known problem of unsupervised domain adaption, where labels but not samples have to be predicted.
    Represent Your Own Policies: Reinforcement Learning with Policy-extended Value Function Approximator. (arXiv:2010.09536v3 [cs.LG] UPDATED)
    (2 min) We study Policy-extended Value Function Approximator (PeVFA) in Reinforcement Learning (RL), which extends conventional value function approximator (VFA) to take as input not only the state (and action) but also an explicit policy representation. Such an extension enables PeVFA to preserve values of multiple policies at the same time and brings an appealing characteristic, i.e., \emph{value generalization among policies}. We formally analyze the value generalization under Generalized Policy Iteration (GPI). From theoretical and empirical lens, we show that generalized value estimates offered by PeVFA may have lower initial approximation error to true values of successive policies, which is expected to improve consecutive value approximation during GPI. Based on above clues, we introduce a new form of GPI with PeVFA which leverages the value generalization along policy improvement path. Moreover, we propose a representation learning framework for RL policy, providing several approaches to learn effective policy embeddings from policy network parameters or state-action pairs. In our experiments, we evaluate the efficacy of value generalization offered by PeVFA and policy representation learning in several OpenAI Gym continuous control tasks. For a representative instance of algorithm implementation, Proximal Policy Optimization (PPO) re-implemented under the paradigm of GPI with PeVFA achieves about 40\% performance improvement on its vanilla counterpart in most environments.
    Graph Transformer Networks: Learning Meta-path Graphs to Improve GNNs. (arXiv:2106.06218v1 [cs.LG])
    (2 min) Graph Neural Networks (GNNs) have been widely applied to various fields due to their powerful representations of graph-structured data. Despite the success of GNNs, most existing GNNs are designed to learn node representations on the fixed and homogeneous graphs. The limitations especially become problematic when learning representations on a misspecified graph or a heterogeneous graph that consists of various types of nodes and edges. To address this limitations, we propose Graph Transformer Networks (GTNs) that are capable of generating new graph structures, which preclude noisy connections and include useful connections (e.g., meta-paths) for tasks, while learning effective node representations on the new graphs in an end-to-end fashion. We further propose enhanced version of GTNs, Fast Graph Transformer Networks (FastGTNs), that improve scalability of graph transformations. Compared to GTNs, FastGTNs are 230x faster and use 100x less memory while allowing the identical graph transformations as GTNs. In addition, we extend graph transformations to the semantic proximity of nodes allowing non-local operations beyond meta-paths. Extensive experiments on both homogeneous graphs and heterogeneous graphs show that GTNs and FastGTNs with non-local operations achieve the state-of-the-art performance for node classification tasks. The code is available: https://github.com/seongjunyun/Graph_Transformer_Networks
    Towards a Unified Quadrature Framework for Large-Scale Kernel Machines. (arXiv:2011.01668v2 [cs.LG] UPDATED)
    (2 min) In this paper, we develop a quadrature framework for large-scale kernel machines via a numerical integration representation. Considering that the integration domain and measure of typical kernels, e.g., Gaussian kernels, arc-cosine kernels, are fully symmetric, we leverage deterministic fully symmetric interpolatory rules to efficiently compute quadrature nodes and associated weights for kernel approximation. The developed interpolatory rules are able to reduce the number of needed nodes while retaining a high approximation accuracy. Further, we randomize the above deterministic rules by the classical Monte-Carlo sampling and control variates techniques with two merits: 1) The proposed stochastic rules make the dimension of the feature mapping flexibly varying, such that we can control the discrepancy between the original and approximate kernels by tuning the dimnension. 2) Our stochastic rules have nice statistical properties of unbiasedness and variance reduction with fast convergence rate. In addition, we elucidate the relationship between our deterministic/stochastic interpolatory rules and current quadrature rules for kernel approximation, including the sparse grids quadrature and stochastic spherical-radial rules, thereby unifying these methods under our framework. Experimental results on several benchmark datasets show that our methods compare favorably with other representative kernel approximation based methods.
    Exploration-Exploitation Motivated Variational Auto-Encoder for Recommender Systems. (arXiv:2006.03573v4 [stat.ML] UPDATED)
    (2 min) Recent years have witnessed rapid developments on collaborative filtering techniques for improving the performance of recommender systems due to the growing need of companies to help users discover new and relevant items. However, the majority of existing literature focuses on delivering items which match the user model learned from users' past preferences. A good recommendation model is expected to recommend items that are known to enjoy and items that are novel to try. In this work, we introduce an exploitation-exploration motivated variational auto-encoder (XploVAE) to collaborative filtering. To facilitate personalized recommendations, we construct user-specific subgraphs, which contain the first-order proximity capturing observed user-item interactions for exploitation and the high-order proximity for exploration. A hierarchical latent space model is utilized to learn the personalized item embedding for a given user, along with the population distribution of all user subgraphs. Finally, experimental results on various real-world datasets clearly demonstrate the effectiveness of our proposed model on leveraging the exploitation and exploration recommendation tasks.
    DG-LMC: A Turn-key and Scalable Synchronous Distributed MCMC Algorithm. (arXiv:2106.06300v1 [stat.ME])
    (2 min) Performing reliable Bayesian inference on a big data scale is becoming a keystone in the modern era of machine learning. A workhorse class of methods to achieve this task are Markov chain Monte Carlo (MCMC) algorithms and their design to handle distributed datasets has been the subject of many works. However, existing methods are not completely either reliable or computationally efficient. In this paper, we propose to fill this gap in the case where the dataset is partitioned and stored on computing nodes within a cluster under a master/slaves architecture. We derive a user-friendly centralised distributed MCMC algorithm with provable scaling in high-dimensional settings. We illustrate the relevance of the proposed methodology on both synthetic and real data experiments.
    Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision. (arXiv:2102.05918v2 [cs.CV] UPDATED)
    (2 min) Pre-trained representations are becoming crucial for many NLP and perception tasks. While representation learning in NLP has transitioned to training on raw text without human annotations, visual and vision-language representations still rely heavily on curated training datasets that are expensive or require expert knowledge. For vision applications, representations are mostly learned using datasets with explicit class labels such as ImageNet or OpenImages. For vision-language, popular datasets like Conceptual Captions, MSCOCO, or CLIP all involve a non-trivial data collection (and cleaning) process. This costly curation process limits the size of datasets and hence hinders the scaling of trained models. In this paper, we leverage a noisy dataset of over one billion image alt-text pairs, obtained without expensive filtering or post-processing steps in the Conceptual Captions dataset. A simple dual-encoder architecture learns to align visual and language representations of the image and text pairs using a contrastive loss. We show that the scale of our corpus can make up for its noise and leads to state-of-the-art representations even with such a simple learning scheme. Our visual representation achieves strong performance when transferred to classification tasks such as ImageNet and VTAB. The aligned visual and language representations enables zero-shot image classification and also set new state-of-the-art results on Flickr30K and MSCOCO image-text retrieval benchmarks, even when compared with more sophisticated cross-attention models. The representations also enable cross-modality search with complex text and text + image queries.
    Signal Processing on Higher-Order Networks: Livin' on the Edge ... and Beyond. (arXiv:2101.05510v3 [cs.SI] UPDATED)
    (2 min) In this tutorial, we provide a didactic treatment of the emerging topic of signal processing on higher-order networks. Drawing analogies from discrete and graph signal processing, we introduce the building blocks for processing data on simplicial complexes and hypergraphs, two common higher-order network abstractions that can incorporate polyadic relationships. We provide brief introductions to simplicial complexes and hypergraphs, with a special emphasis on the concepts needed for the processing of signals supported on these structures. Specifically, we discuss Fourier analysis, signal denoising, signal interpolation, node embeddings, and nonlinear processing through neural networks, using these two higher-order network models. In the context of simplicial complexes, we specifically focus on signal processing using the Hodge Laplacian matrix, a multi-relational operator that leverages the special structure of simplicial complexes and generalizes desirable properties of the Laplacian matrix in graph signal processing. For hypergraphs, we present both matrix and tensor representations, and discuss the trade-offs in adopting one or the other. We also highlight limitations and potential research avenues, both to inform practitioners and to motivate the contribution of new researchers to the area.
    DRLD-SP: A Deep Reinforcement Learning-based Dynamic Service Placement in Edge-Enabled Internet of Vehicles. (arXiv:2106.06291v1 [cs.NI])
    (2 min) The growth of 5G and edge computing has enabled the emergence of Internet of Vehicles. It supports different types of services with different resource and service requirements. However, limited resources at the edge, high mobility of vehicles, increasing demand, and dynamicity in service request-types have made service placement a challenging task. A typical static placement solution is not effective as it does not consider the traffic mobility and service dynamics. Handling dynamics in IoV for service placement is an important and challenging problem which is the primary focus of our work in this paper. We propose a Deep Reinforcement Learning-based Dynamic Service Placement (DRLD-SP) framework with the objective of minimizing the maximum edge resource usage and service delay while considering the vehicle's mobility, varying demand, and dynamics in the requests for different types of services. We use SUMO and MATLAB to carry out simulation experiments. The experimental results show that the proposed DRLD-SP approach is effective and outperforms other static and dynamic placement approaches.
    Deep Conditional Gaussian Mixture Model for Constrained Clustering. (arXiv:2106.06385v1 [cs.LG])
    (2 min) Constrained clustering has gained significant attention in the field of machine learning as it can leverage prior information on a growing amount of only partially labeled data. Following recent advances in deep generative models, we propose a novel framework for constrained clustering that is intuitive, interpretable, and can be trained efficiently in the framework of stochastic gradient variational inference. By explicitly integrating domain knowledge in the form of probabilistic relations, our proposed model (DC-GMM) uncovers the underlying distribution of data conditioned on prior clustering preferences, expressed as pairwise constraints. These constraints guide the clustering process towards a desirable partition of the data by indicating which samples should or should not belong to the same cluster. We provide extensive experiments to demonstrate that DC-GMM shows superior clustering performances and robustness compared to state-of-the-art deep constrained clustering methods on a wide range of data sets. We further demonstrate the usefulness of our approach on two challenging real-world applications.
    Towards Understanding Generalization via Decomposing Excess Risk Dynamics. (arXiv:2106.06153v1 [cs.LG])
    (2 min) Generalization is one of the critical issues in machine learning. However, traditional methods like uniform convergence are not powerful enough to fully explain generalization because they may yield vacuous bounds even in overparameterized linear regression regimes. An alternative solution is to analyze the generalization dynamics to derive algorithm-dependent bounds, e.g., stability. Unfortunately, the stability-based bound is still far from explaining the remarkable generalization ability of neural networks due to the coarse-grained analysis of the signal and noise. Inspired by the observation that neural networks show a slow convergence rate when fitting noise, we propose decomposing the excess risk dynamics and applying stability-based bound only on the variance part (which measures how the model performs on pure noise). We provide two applications for the framework, including a linear case (overparameterized linear regression with gradient descent) and a non-linear case (matrix recovery with gradient flow). Under the decomposition framework, the new bound accords better with the theoretical and empirical evidence compared to the stability-based bound and uniform convergence bound.
    Keyframe-Focused Visual Imitation Learning. (arXiv:2106.06452v1 [cs.LG])
    (2 min) Imitation learning trains control policies by mimicking pre-recorded expert demonstrations. In partially observable settings, imitation policies must rely on observation histories, but many seemingly paradoxical results show better performance for policies that only access the most recent observation. Recent solutions ranging from causal graph learning to deep information bottlenecks have shown promising results, but failed to scale to realistic settings such as visual imitation. We propose a solution that outperforms these prior approaches by upweighting demonstration keyframes corresponding to expert action changepoints. This simple approach easily scales to complex visual imitation settings. Our experimental results demonstrate consistent performance improvements over all baselines on image-based Gym MuJoCo continuous control tasks. Finally, on the CARLA photorealistic vision-based urban driving simulator, we resolve a long-standing issue in behavioral cloning for driving by demonstrating effective imitation from observation histories. Supplementary materials and code at: \url{https://tinyurl.com/imitation-keyframes}.
    TrafficStream: A Streaming Traffic Flow Forecasting Framework Based on Graph Neural Networks and Continual Learning. (arXiv:2106.06273v1 [cs.LG])
    (2 min) With the rapid growth of traffic sensors deployed, a massive amount of traffic flow data are collected, revealing the long-term evolution of traffic flows and the gradual expansion of traffic networks. How to accurately forecasting these traffic flow attracts the attention of researchers as it is of great significance for improving the efficiency of transportation systems. However, existing methods mainly focus on the spatial-temporal correlation of static networks, leaving the problem of efficiently learning models on networks with expansion and evolving patterns less studied. To tackle this problem, we propose a Streaming Traffic Flow Forecasting Framework, TrafficStream, based on Graph Neural Networks (GNNs) and Continual Learning (CL), achieving accurate predictions and high efficiency. Firstly, we design a traffic pattern fusion method, cleverly integrating the new patterns that emerged during the long-term period into the model. A JS-divergence-based algorithm is proposed to mine new traffic patterns. Secondly, we introduce CL to consolidate the knowledge learned previously and transfer them to the current model. Specifically, we adopt two strategies: historical data replay and parameter smoothing. We construct a streaming traffic dataset to verify the efficiency and effectiveness of our model. Extensive experiments demonstrate its excellent potential to extract traffic patterns with high efficiency on long-term streaming network scene. The source code is available at https://github.com/AprLie/TrafficStream.
    Policy Gradient Bayesian Robust Optimization for Imitation Learning. (arXiv:2106.06499v1 [cs.LG])
    (2 min) The difficulty in specifying rewards for many real-world problems has led to an increased focus on learning rewards from human feedback, such as demonstrations. However, there are often many different reward functions that explain the human feedback, leaving agents with uncertainty over what the true reward function is. While most policy optimization approaches handle this uncertainty by optimizing for expected performance, many applications demand risk-averse behavior. We derive a novel policy gradient-style robust optimization approach, PG-BROIL, that optimizes a soft-robust objective that balances expected performance and risk. To the best of our knowledge, PG-BROIL is the first policy optimization algorithm robust to a distribution of reward hypotheses which can scale to continuous MDPs. Results suggest that PG-BROIL can produce a family of behaviors ranging from risk-neutral to risk-averse and outperforms state-of-the-art imitation learning algorithms when learning from ambiguous demonstrations by hedging against uncertainty, rather than seeking to uniquely identify the demonstrator's reward function.
    The Shapley Value of Classifiers in Ensemble Games. (arXiv:2101.02153v2 [cs.LG] UPDATED)
    (2 min) What is the value of an individual model in an ensemble of binary classifiers? We answer this question by introducing a class of transferable utility cooperative games called \textit{ensemble games}. In machine learning ensembles, pre-trained models cooperate to make classification decisions. To quantify the importance of models in these ensemble games, we define \textit{Troupe} -- an efficient algorithm which allocates payoffs based on approximate Shapley values of the classifiers. We argue that the Shapley value of models in these games is an effective decision metric for choosing a high performing subset of models from the ensemble. Our analytical findings prove that our Shapley value estimation scheme is precise and scalable; its performance increases with size of the dataset and ensemble. Empirical results on real world graph classification tasks demonstrate that our algorithm produces high quality estimates of the Shapley value. We find that Shapley values can be utilized for ensemble pruning, and that adversarial models receive a low valuation. Complex classifiers are frequently found to be responsible for both correct and incorrect classification decisions.
    Learning Compositional Shape Priors for Few-Shot 3D Reconstruction. (arXiv:2106.06440v1 [cs.CV])
    (2 min) The impressive performance of deep convolutional neural networks in single-view 3D reconstruction suggests that these models perform non-trivial reasoning about the 3D structure of the output space. Recent work has challenged this belief, showing that, on standard benchmarks, complex encoder-decoder architectures perform similarly to nearest-neighbor baselines or simple linear decoder models that exploit large amounts of per-category data. However, building large collections of 3D shapes for supervised training is a laborious process; a more realistic and less constraining task is inferring 3D shapes for categories with few available training examples, calling for a model that can successfully generalize to novel object classes. In this work we experimentally demonstrate that naive baselines fail in this few-shot learning setting, in which the network must learn informative shape priors for inference of new categories. We propose three ways to learn a class-specific global shape prior, directly from data. Using these techniques, we are able to capture multi-scale information about the 3D shape, and account for intra-class variability by virtue of an implicit compositional structure. Experiments on the popular ShapeNet dataset show that our method outperforms a zero-shot baseline by over 40%, and the current state-of-the-art by over 10%, in terms of relative performance, in the few-shot setting.12
    Noise and Fluctuation of Finite Learning Rate Stochastic Gradient Descent. (arXiv:2012.03636v4 [stat.ML] UPDATED)
    (2 min) In the vanishing learning rate regime, stochastic gradient descent (SGD) is now relatively well understood. In this work, we propose to study the basic properties of SGD and its variants in the non-vanishing learning rate regime. The focus is on deriving exactly solvable results and discussing their implications. The main contributions of this work are to derive the stationary distribution for discrete-time SGD in a quadratic loss function with and without momentum; in particular, one implication of our result is that the fluctuation caused by discrete-time dynamics takes a distorted shape and is dramatically larger than a continuous-time theory could predict. Examples of applications of the proposed theory considered in this work include the approximation error of variants of SGD, the effect of minibatch noise, the optimal Bayesian inference, the escape rate from a sharp minimum, and the stationary covariance of a few second-order methods including damped Newton's method, natural gradient descent, and Adam.
    On the Robustness of Average Losses for Partial-Label Learning. (arXiv:2106.06152v1 [cs.LG])
    (2 min) Partial-label (PL) learning is a typical weakly supervised classification problem, where a PL of an instance is a set of candidate labels such that a fixed but unknown candidate is the true label. For PL learning, there are two lines of research: (a) the identification-based strategy (IBS) purifies each label set and extracts the true label; (b) the average-based strategy (ABS) treats all candidates equally for training. In the past two decades, IBS was a much hotter topic than ABS, since it was believed that IBS is more promising. In this paper, we theoretically analyze ABS and find it also promising in the sense of the robustness of its loss functions. Specifically, we consider five problem settings for the generation of clean or noisy PLs, and we prove that average PL losses with bounded multi-class losses are always robust under mild assumptions on the domination of true labels, while average PL losses with unbounded multi-class losses (e.g., the cross-entropy loss) may not be robust. We also conduct experiments to validate our theoretical findings. Note that IBS is heuristic, and we cannot prove its robustness by a similar proof technique; hence, ABS is more advantageous from a theoretical point of view, and it is worth paying attention to the design of more advanced PL learning methods following ABS.
    Within-layer Diversity Reduces Generalization Gap. (arXiv:2106.06012v1 [cs.LG])
    (2 min) Neural networks are composed of multiple layers arranged in a hierarchical structure jointly trained with a gradient-based optimization, where the errors are back-propagated from the last layer back to the first one. At each optimization step, neurons at a given layer receive feedback from neurons belonging to higher layers of the hierarchy. In this paper, we propose to complement this traditional 'between-layer' feedback with additional 'within-layer' feedback to encourage diversity of the activations within the same layer. To this end, we measure the pairwise similarity between the outputs of the neurons and use it to model the layer's overall diversity. By penalizing similarities and promoting diversity, we encourage each neuron to learn a distinctive representation and, thus, to enrich the data representation learned within the layer and to increase the total capacity of the model. We theoretically study how the within-layer activation diversity affects the generalization performance of a neural network and prove that increasing the diversity of hidden activations reduces the estimation error. In addition to the theoretical guarantees, we present an empirical study on three datasets confirming that the proposed approach enhances the performance of state-of-the-art neural network models and decreases the generalization gap.
    Neural Optimization Kernel: Towards Robust Deep Learning. (arXiv:2106.06097v1 [stat.ML])
    (2 min) Recent studies show a close connection between neural networks (NN) and kernel methods. However, most of these analyses (e.g., NTK) focus on the influence of (infinite) width instead of the depth of NN models. There remains a gap between theory and practical network designs that benefit from the depth. This paper first proposes a novel kernel family named Neural Optimization Kernel (NOK). Our kernel is defined as the inner product between two $T$-step updated functionals in RKHS w.r.t. a regularized optimization problem. Theoretically, we proved the monotonic descent property of our update rule for both convex and non-convex problems, and a $O(1/T)$ convergence rate of our updates for convex problems. Moreover, we propose a data-dependent structured approximation of our NOK, which builds the connection between training deep NNs and kernel methods associated with NOK. The resultant computational graph is a ResNet-type finite width NN. Our structured approximation preserved the monotonic descent property and $O(1/T)$ convergence rate. Namely, a $T$-layer NN performs $T$-step monotonic descent updates. Notably, we show our $T$-layered structured NN with ReLU maintains a $O(1/T)$ convergence rate w.r.t. a convex regularized problem, which explains the success of ReLU on training deep NN from a NN architecture optimization perspective. For the unsupervised learning and the shared parameter case, we show the equivalence of training structured NN with GD and performing functional gradient descent in RKHS associated with a fixed (data-dependent) NOK at an infinity-width regime. For finite NOKs, we prove generalization bounds. Remarkably, we show that overparameterized deep NN (NOK) can increase the expressive power to reduce empirical risk and reduce the generalization bound at the same time. Extensive experiments verify the robustness of our structured NOK blocks.
    A Novel Approach to Lifelong Learning: The Plastic Support Structure. (arXiv:2106.06298v1 [cs.LG])
    (2 min) We propose a novel approach to lifelong learning, introducing a compact encapsulated support structure which endows a network with the capability to expand its capacity as needed to learn new tasks while preventing the loss of learned tasks. This is achieved by splitting neurons with high semantic drift and constructing an adjacent network to encode the new tasks at hand. We call this the Plastic Support Structure (PSS), it is a compact structure to learn new tasks that cannot be efficiently encoded in the existing structure of the network. We validate the PSS on public datasets against existing lifelong learning architectures, showing it performs similarly to them but without prior knowledge of the task and in some cases with fewer parameters and in a more understandable fashion where the PSS is an encapsulated container for specific features related to specific tasks, thus making it an ideal "add-on" solution for endowing a network to learn more tasks.
    Automatic Risk Adaptation in Distributional Reinforcement Learning. (arXiv:2106.06317v1 [cs.LG])
    (2 min) The use of Reinforcement Learning (RL) agents in practical applications requires the consideration of suboptimal outcomes, depending on the familiarity of the agent with its environment. This is especially important in safety-critical environments, where errors can lead to high costs or damage. In distributional RL, the risk-sensitivity can be controlled via different distortion measures of the estimated return distribution. However, these distortion functions require an estimate of the risk level, which is difficult to obtain and depends on the current state. In this work, we demonstrate the suboptimality of a static risk level estimation and propose a method to dynamically select risk levels at each environment step. Our method ARA (Automatic Risk Adaptation) estimates the appropriate risk level in both known and unknown environments using a Random Network Distillation error. We show reduced failure rates by up to a factor of 7 and improved generalization performance by up to 14% compared to both risk-aware and risk-agnostic agents in several locomotion environments.
    JKOnet: Proximal Optimal Transport Modeling of Population Dynamics. (arXiv:2106.06345v1 [cs.LG])
    (2 min) Consider a heterogeneous population of points evolving with time. While the population evolves, both in size and nature, we can observe it periodically, through snapshots taken at different timestamps. Each of these snapshots is formed by sampling points from the population at that time, and then creating features to recover point clouds. While these snapshots describe the population's evolution on aggregate, they do not provide directly insights on individual trajectories. This scenario is encountered in several applications, notably single-cell genomics experiments, tracking of particles, or when studying crowd motion. In this paper, we propose to model that dynamic as resulting from the celebrated Jordan-Kinderlehrer-Otto (JKO) proximal scheme. The JKO scheme posits that the configuration taken by a population at time $t$ is one that trades off a decrease w.r.t. an energy (the model we seek to learn) penalized by an optimal transport distance w.r.t. the previous configuration. To that end, we propose JKOnet, a neural architecture that combines an energy model on measures, with (small) optimal displacements solved with input convex neural networks (ICNN). We demonstrate the applicability of our model to explain and predict population dynamics.
    Safe Reinforcement Learning with Linear Function Approximation. (arXiv:2106.06239v1 [cs.LG])
    (2 min) Safety in reinforcement learning has become increasingly important in recent years. Yet, existing solutions either fail to strictly avoid choosing unsafe actions, which may lead to catastrophic results in safety-critical systems, or fail to provide regret guarantees for settings where safety constraints need to be learned. In this paper, we address both problems by first modeling safety as an unknown linear cost function of states and actions, which must always fall below a certain threshold. We then present algorithms, termed SLUCB-QVI and RSLUCB-QVI, for episodic Markov decision processes (MDPs) with linear function approximation. We show that SLUCB-QVI and RSLUCB-QVI, while with \emph{no safety violation}, achieve a $\tilde{\mathcal{O}}\left(\kappa\sqrt{d^3H^3T}\right)$ regret, nearly matching that of state-of-the-art unsafe algorithms, where $H$ is the duration of each episode, $d$ is the dimension of the feature mapping, $\kappa$ is a constant characterizing the safety constraints, and $T$ is the total number of action plays. We further present numerical simulations that corroborate our theoretical findings.
    Data-Driven Multiscale Design of Cellular Composites with Multiclass Microstructures for Natural Frequency Maximization. (arXiv:2106.06478v1 [cs.CE])
    (2 min) For natural frequency optimization of engineering structures, cellular composites have been shown to possess an edge over solid. However, existing multiscale design methods for cellular composites are either computationally exhaustive or confined to a single class of microstructures. In this paper, we propose a data-driven topology optimization (TO) approach to enable the multiscale design of cellular structures with various choices of microstructure classes. The key component is a newly proposed latent-variable Gaussian process (LVGP) model through which different classes of microstructures are mapped into a low-dimensional continuous latent space. It provides an interpretable distance metric between classes and captures their effects on the homogenized stiffness tensors. By introducing latent vectors as design variables, a differentiable transition of stiffness matrix between classes can be easily achieved with an analytical gradient. After integrating LVGP with the density-based TO, an efficient data-driven cellular composite optimization process is developed to enable concurrent exploration of microstructure concepts and the associated volume fractions for natural frequency optimization. Examples reveal that the proposed cellular designs with multiclass microstructures achieve higher natural frequencies than both single-scale and single-class designs. This framework can be easily extended to other multi-scale TO problems, such as thermal compliance and dynamic response optimization.
    Label Noise SGD Provably Prefers Flat Global Minimizers. (arXiv:2106.06530v1 [cs.LG])
    (2 min) In overparametrized models, the noise in stochastic gradient descent (SGD) implicitly regularizes the optimization trajectory and determines which local minimum SGD converges to. Motivated by empirical studies that demonstrate that training with noisy labels improves generalization, we study the implicit regularization effect of SGD with label noise. We show that SGD with label noise converges to a stationary point of a regularized loss $L(\theta) +\lambda R(\theta)$, where $L(\theta)$ is the training loss, $\lambda$ is an effective regularization parameter depending on the step size, strength of the label noise, and the batch size, and $R(\theta)$ is an explicit regularizer that penalizes sharp minimizers. Our analysis uncovers an additional regularization effect of large learning rates beyond the linear scaling rule that penalizes large eigenvalues of the Hessian more than small ones. We also prove extensions to classification with general loss functions, SGD with momentum, and SGD with general noise covariance, significantly strengthening the prior work of Blanc et al. to global convergence and large learning rates and of HaoChen et al. to general models.
    DECORE: Deep Compression with Reinforcement Learning. (arXiv:2106.06091v1 [cs.AI])
    (2 min) Deep learning has become an increasingly popular and powerful option for modern pattern recognition systems. However, many deep neural networks have millions to billions of parameters, making them untenable for real-world applications with constraints on memory or latency. As a result, powerful network compression techniques are a must for the widespread adoption of deep learning. We present DECORE, a reinforcement learning approach to automate the network compression process. Using a simple policy gradient method to learn which neurons or channels to keep or remove, we are able to achieve compression rates 3x to 5x greater than contemporary approaches. In contrast with other architecture search methods, DECORE is simple and quick to train, requiring only a few hours of training on 1 GPU. When applied to standard network architectures on different datasets, our approach achieves 11x to 103x compression on different architectures while maintaining accuracies similar to those of the original, large networks.
    A Unified Framework for Constructing Nonconvex Regularizations. (arXiv:2106.06123v1 [stat.ML])
    (2 min) Over the past decades, many individual nonconvex methods have been proposed to achieve better sparse recovery performance in various scenarios. However, how to construct a valid nonconvex regularization function remains open in practice. In this paper, we fill in this gap by presenting a unified framework for constructing the nonconvex regularization based on the probability density function. Meanwhile, a new nonconvex sparse recovery method constructed via the Weibull distribution is studied.
    Feature Selection Tutorial with Python Examples. (arXiv:2106.06437v1 [cs.LG])
    (2 min) In Machine Learning, feature selection entails selecting a subset of the available features in a dataset to use for model development. There are many motivations for feature selection, it may result in better models, it may provide insight into the data and it may deliver economies in data gathering or data processing. For these reasons feature selection has received a lot of attention in data analytics research. In this paper we provide an overview of the main methods and present practical examples with Python implementations. While the main focus is on supervised feature selection techniques, we also cover some feature transformation methods.
    Decoupled Greedy Learning of CNNs for Synchronous and Asynchronous Distributed Learning. (arXiv:2106.06401v1 [cs.LG])
    (2 min) A commonly cited inefficiency of neural network training using back-propagation is the update locking problem: each layer must wait for the signal to propagate through the full network before updating. Several alternatives that can alleviate this issue have been proposed. In this context, we consider a simple alternative based on minimal feedback, which we call Decoupled Greedy Learning (DGL). It is based on a classic greedy relaxation of the joint training objective, recently shown to be effective in the context of Convolutional Neural Networks (CNNs) on large-scale image classification. We consider an optimization of this objective that permits us to decouple the layer training, allowing for layers or modules in networks to be trained with a potentially linear parallelization. With the use of a replay buffer we show that this approach can be extended to asynchronous settings, where modules can operate and continue to update with possibly large communication delays. To address bandwidth and memory issues we propose an approach based on online vector quantization. This allows to drastically reduce the communication bandwidth between modules and required memory for replay buffers. We show theoretically and empirically that this approach converges and compare it to the sequential solvers. We demonstrate the effectiveness of DGL against alternative approaches on the CIFAR-10 dataset and on the large-scale ImageNet dataset.
    K-shot NAS: Learnable Weight-Sharing for NAS with K-shot Supernets. (arXiv:2106.06442v1 [cs.CV])
    (2 min) In one-shot weight sharing for NAS, the weights of each operation (at each layer) are supposed to be identical for all architectures (paths) in the supernet. However, this rules out the possibility of adjusting operation weights to cater for different paths, which limits the reliability of the evaluation results. In this paper, instead of counting on a single supernet, we introduce $K$-shot supernets and take their weights for each operation as a dictionary. The operation weight for each path is represented as a convex combination of items in a dictionary with a simplex code. This enables a matrix approximation of the stand-alone weight matrix with a higher rank ($K>1$). A \textit{simplex-net} is introduced to produce architecture-customized code for each path. As a result, all paths can adaptively learn how to share weights in the $K$-shot supernets and acquire corresponding weights for better evaluation. $K$-shot supernets and simplex-net can be iteratively trained, and we further extend the search to the channel dimension. Extensive experiments on benchmark datasets validate that K-shot NAS significantly improves the evaluation accuracy of paths and thus brings in impressive performance improvements.
    Taylor Expansion of Discount Factors. (arXiv:2106.06170v1 [cs.LG])
    (2 min) In practical reinforcement learning (RL), the discount factor used for estimating value functions often differs from that used for defining the evaluation objective. In this work, we study the effect that this discrepancy of discount factors has during learning, and discover a family of objectives that interpolate value functions of two distinct discount factors. Our analysis suggests new ways for estimating value functions and performing policy optimization updates, which demonstrate empirical performance gains. This framework also leads to new insights on commonly-used deep RL heuristic modifications to policy optimization algorithms.
    Optimal Model Selection in Contextual Bandits with Many Classes via Offline Oracles. (arXiv:2106.06483v1 [cs.LG])
    (2 min) We study the problem of model selection for contextual bandits, in which the algorithm must balance the bias-variance trade-off for model estimation while also balancing the exploration-exploitation trade-off. In this paper, we propose the first reduction of model selection in contextual bandits to offline model selection oracles, allowing for flexible general purpose algorithms with computational requirements no worse than those for model selection for regression. Our main result is a new model selection guarantee for stochastic contextual bandits. When one of the classes in our set is realizable, up to a logarithmic dependency on the number of classes, our algorithm attains optimal realizability-based regret bounds for that class under one of two conditions: if the time-horizon is large enough, or if an assumption that helps with detecting misspecification holds. Hence our algorithm adapts to the complexity of this unknown class. Even when this realizable class is known, we prove improved regret guarantees in early rounds by relying on simpler model classes for those rounds and hence further establish the importance of model selection in contextual bandits.
    Evaluating Robustness of Predictive Uncertainty Estimation: Are Dirichlet-based Models Reliable?. (arXiv:2010.14986v2 [cs.LG] UPDATED)
    (2 min) Dirichlet-based uncertainty (DBU) models are a recent and promising class of uncertainty-aware models. DBU models predict the parameters of a Dirichlet distribution to provide fast, high-quality uncertainty estimates alongside with class predictions. In this work, we present the first large-scale, in-depth study of the robustness of DBU models under adversarial attacks. Our results suggest that uncertainty estimates of DBU models are not robust w.r.t. three important tasks: (1) indicating correctly and wrongly classified samples; (2) detecting adversarial examples; and (3) distinguishing between in-distribution (ID) and out-of-distribution (OOD) data. Additionally, we explore the first approaches to make DBU models more robust. While adversarial training has a minor effect, our median smoothing based approach significantly increases robustness of DBU models.
    FiSH: Fair Spatial Hotspots. (arXiv:2106.06049v1 [cs.LG])
    (2 min) Pervasiveness of tracking devices and enhanced availability of spatially located data has deepened interest in using them for various policy interventions, through computational data analysis tasks such as spatial hot spot detection. In this paper, we consider, for the first time to our best knowledge, fairness in detecting spatial hot spots. We motivate the need for ensuring fairness through statistical parity over the collective population covered across chosen hot spots. We then characterize the task of identifying a diverse set of solutions in the noteworthiness-fairness trade-off spectrum, to empower the user to choose a trade-off justified by the policy domain. Being a novel task formulation, we also develop a suite of evaluation metrics for fair hot spots, motivated by the need to evaluate pertinent aspects of the task. We illustrate the computational infeasibility of identifying fair hot spots using naive and/or direct approaches and devise a method, codenamed {\it FiSH}, for efficiently identifying high-quality, fair and diverse sets of spatial hot spots. FiSH traverses the tree-structured search space using heuristics that guide it towards identifying effective and fair sets of spatial hot spots. Through an extensive empirical analysis over a real-world dataset from the domain of human development, we illustrate that FiSH generates high-quality solutions at fast response times.
    Coded-InvNet for Resilient Prediction Serving Systems. (arXiv:2106.06445v1 [cs.LG])
    (2 min) Inspired by a new coded computation algorithm for invertible functions, we propose Coded-InvNet a new approach to design resilient prediction serving systems that can gracefully handle stragglers or node failures. Coded-InvNet leverages recent findings in the deep learning literature such as invertible neural networks, Manifold Mixup, and domain translation algorithms, identifying interesting research directions that span across machine learning and systems. Our experimental results show that Coded-InvNet can outperform existing approaches, especially when the compute resource overhead is as low as 10%. For instance, without knowing which of the ten workers is going to fail, our algorithm can design a backup task so that it can correctly recover the missing prediction result with an accuracy of 85.9%, significantly outperforming the previous SOTA by 32.5%.
    Rethinking Architecture Design for Tackling Data Heterogeneity in Federated Learning. (arXiv:2106.06047v1 [cs.LG])
    (2 min) Federated learning is an emerging research paradigm enabling collaborative training of machine learning models among different organizations while keeping data private at each institution. Despite recent progress, there remain fundamental challenges such as lack of convergence and potential for catastrophic forgetting in federated learning across real-world heterogeneous devices. In this paper, we demonstrate that attention-based architectures (e.g., Transformers) are fairly robust to distribution shifts and hence improve federated learning over heterogeneous data. Concretely, we conduct the first rigorous empirical investigation of different neural architectures across a range of federated algorithms, real-world benchmarks, and heterogeneous data splits. Our experiments show that simply replacing convolutional networks with Transformers can greatly reduce catastrophic forgetting of previous devices, accelerate convergence, and reach a better global model, especially when dealing with heterogeneous data. We will release our code and pretrained models at https://github.com/Liangqiong/ViT-FL-main to encourage future exploration in robust architectures as an alternative to current research efforts on the optimization front.
    Invariant Information Bottleneck for Domain Generalization. (arXiv:2106.06333v1 [cs.LG])
    (2 min) The main challenge for domain generalization (DG) is to overcome the potential distributional shift between multiple training domains and unseen test domains. One popular class of DG algorithms aims to learn representations that have an invariant causal relation across the training domains. However, certain features, called \emph{pseudo-invariant features}, may be invariant in the training domain but not the test domain and can substantially decreases the performance of existing algorithms. To address this issue, we propose a novel algorithm, called Invariant Information Bottleneck (IIB), that learns a minimally sufficient representation that is invariant across training and testing domains. By minimizing the mutual information between the representation and inputs, IIB alleviates its reliance on pseudo-invariant features, which is desirable for DG. To verify the effectiveness of the IIB principle, we conduct extensive experiments on large-scale DG benchmarks. The results show that IIB outperforms invariant learning baseline (e.g. IRM) by an average of 2.8\% and 3.8\% accuracy over two evaluation metrics.
    Model Selection for Bayesian Autoencoders. (arXiv:2106.06245v1 [stat.ML])
    (2 min) We develop a novel method for carrying out model selection for Bayesian autoencoders (BAEs) by means of prior hyper-parameter optimization. Inspired by the common practice of type-II maximum likelihood optimization and its equivalence to Kullback-Leibler divergence minimization, we propose to optimize the distributional sliced-Wasserstein distance (DSWD) between the output of the autoencoder and the empirical data distribution. The advantages of this formulation are that we can estimate the DSWD based on samples and handle high-dimensional problems. We carry out posterior estimation of the BAE parameters via stochastic gradient Hamiltonian Monte Carlo and turn our BAE into a generative model by fitting a flexible Dirichlet mixture model in the latent space. Consequently, we obtain a powerful alternative to variational autoencoders, which are the preferred choice in modern applications of autoencoders for representation learning with uncertainty. We evaluate our approach qualitatively and quantitatively using a vast experimental campaign on a number of unsupervised learning tasks and show that, in small-data regimes where priors matter, our approach provides state-of-the-art results, outperforming multiple competitive baselines.
    PyGAD: An Intuitive Genetic Algorithm Python Library. (arXiv:2106.06158v1 [cs.NE])
    (2 min) This paper introduces PyGAD, an open-source easy-to-use Python library for building the genetic algorithm. PyGAD supports a wide range of parameters to give the user control over everything in its life cycle. This includes, but is not limited to, population, gene value range, gene data type, parent selection, crossover, and mutation. PyGAD is designed as a general-purpose optimization library that allows the user to customize the fitness function. Its usage consists of 3 main steps: build the fitness function, create an instance of the pygad.GA class, and calling the pygad.GA.run() method. The library supports training deep learning models created either with PyGAD itself or with frameworks like Keras and PyTorch. Given its stable state, PyGAD is also in active development to respond to the user's requested features and enhancement received on GitHub https://github.com/ahmedfgad/GeneticAlgorithmPython. PyGAD comes with documentation https://pygad.readthedocs.io for further details and examples.
    ViT-Inception-GAN for Image Colourising. (arXiv:2106.06321v1 [cs.CV])
    (2 min) Studies involving colourising images has been garnering researchers' keen attention over time, assisted by significant advances in various Machine Learning techniques and compute power availability. Traditionally, colourising images have been an intricate task that gave a substantial degree of freedom during the assignment of chromatic information. In our proposed method, we attempt to colourise images using Vision Transformer - Inception - Generative Adversarial Network (ViT-I-GAN), which has an Inception-v3 fusion embedding in the generator. For a stable and robust network, we have used Vision Transformer (ViT) as the discriminator. We trained the model on the Unsplash and the COCO dataset for demonstrating the improvement made by the Inception-v3 embedding. We have compared the results between ViT-GANs with and without Inception-v3 embedding.
    GDI: Rethinking What Makes Reinforcement Learning Different From Supervised Learning. (arXiv:2106.06232v1 [cs.LG])
    (2 min) Deep Q Network (DQN) firstly kicked the door of deep reinforcement learning (DRL) via combining deep learning (DL) with reinforcement learning (RL), which has noticed that the distribution of the acquired data would change during the training process. DQN found this property might cause instability for training, so it proposed effective methods to handle the downside of the property. Instead of focusing on the unfavourable aspects, we find it critical for RL to ease the gap between the estimated data distribution and the ground truth data distribution while supervised learning (SL) fails to do so. From this new perspective, we extend the basic paradigm of RL called the Generalized Policy Iteration (GPI) into a more generalized version, which is called the Generalized Data Distribution Iteration (GDI). We see massive RL algorithms and techniques can be unified into the GDI paradigm, which can be considered as one of the special cases of GDI. We provide theoretical proof of why GDI is better than GPI and how it works. Several practical algorithms based on GDI have been proposed to verify the effectiveness and extensiveness of it. Empirical experiments prove our state-of-the-art (SOTA) performance on Arcade Learning Environment (ALE), wherein our algorithm has achieved 9620.98% mean human normalized score (HNS), 1146.39% median HNS and 22 human world record breakthroughs (HWRB) using only 200 training frames. Our work aims to lead the RL research to step into the journey of conquering the human world records and seek real superhuman agents on both performance and efficiency.
    Scaling Vision with Sparse Mixture of Experts. (arXiv:2106.05974v1 [cs.CV])
    (2 min) Sparsely-gated Mixture of Experts networks (MoEs) have demonstrated excellent scalability in Natural Language Processing. In Computer Vision, however, almost all performant networks are "dense", that is, every input is processed by every parameter. We present a Vision MoE (V-MoE), a sparse version of the Vision Transformer, that is scalable and competitive with the largest dense networks. When applied to image recognition, V-MoE matches the performance of state-of-the-art networks, while requiring as little as half of the compute at inference time. Further, we propose an extension to the routing algorithm that can prioritize subsets of each input across the entire batch, leading to adaptive per-image compute. This allows V-MoE to trade-off performance and compute smoothly at test-time. Finally, we demonstrate the potential of V-MoE to scale vision models, and train a 15B parameter model that attains 90.35% on ImageNet.
    Adversarial Robustness through the Lens of Causality. (arXiv:2106.06196v1 [cs.LG])
    (2 min) The adversarial vulnerability of deep neural networks has attracted significant attention in machine learning. From a causal viewpoint, adversarial attacks can be considered as a specific type of distribution change on natural data. As causal reasoning has an instinct for modeling distribution change, we propose to incorporate causality into mitigating adversarial vulnerability. However, causal formulations of the intuition of adversarial attack and the development of robust DNNs are still lacking in the literature. To bridge this gap, we construct a causal graph to model the generation process of adversarial examples and define the adversarial distribution to formalize the intuition of adversarial attacks. From a causal perspective, we find that the label is spuriously correlated with the style (content-independent) information when an instance is given. The spurious correlation implies that the adversarial distribution is constructed via making the statistical conditional association between style information and labels drastically different from that in natural distribution. Thus, DNNs that fit the spurious correlation are vulnerable to the adversarial distribution. Inspired by the observation, we propose the adversarial distribution alignment method to eliminate the difference between the natural distribution and the adversarial distribution. Extensive experiments demonstrate the efficacy of the proposed method. Our method can be seen as the first attempt to leverage causality for mitigating adversarial vulnerability.
    Modeling Sequences as Distributions with Uncertainty for Sequential Recommendation. (arXiv:2106.06165v1 [cs.IR])
    (2 min) The sequential patterns within the user interactions are pivotal for representing the user's preference and capturing latent relationships among items. The recent advancements of sequence modeling by Transformers advocate the community to devise more effective encoders for the sequential recommendation. Most existing sequential methods assume users are deterministic. However, item-item transitions might fluctuate significantly in several item aspects and exhibit randomness of user interests. This \textit{stochastic characteristics} brings up a solid demand to include uncertainties in representing sequences and items. Additionally, modeling sequences and items with uncertainties expands users' and items' interaction spaces, thus further alleviating cold-start problems. In this work, we propose a Distribution-based Transformer for Sequential Recommendation (DT4SR), which injects uncertainties into sequential modeling. We use Elliptical Gaussian distributions to describe items and sequences with uncertainty. We describe the uncertainty in items and sequences as Elliptical Gaussian distribution. And we adopt Wasserstein distance to measure the similarity between distributions. We devise two novel Trans-formers for modeling mean and covariance, which guarantees the positive-definite property of distributions. The proposed method significantly outperforms the state-of-the-art methods. The experiments on three benchmark datasets also demonstrate its effectiveness in alleviating cold-start issues. The code is available inhttps://github.com/DyGRec/DT4SR.
    Anomalous Sound Detection Using a Binary Classification Model and Class Centroids. (arXiv:2106.06151v1 [cs.SD])
    (2 min) An anomalous sound detection system to detect unknown anomalous sounds usually needs to be built using only normal sound data. Moreover, it is desirable to improve the system by effectively using a small amount of anomalous sound data, which will be accumulated through the system's operation. As one of the methods to meet these requirements, we focus on a binary classification model that is developed by using not only normal data but also outlier data in the other domains as pseudo-anomalous sound data, which can be easily updated by using anomalous data. In this paper, we implement a new loss function based on metric learning to learn the distance relationship from each class centroid in feature space for the binary classification model. The proposed multi-task learning of the binary classification and the metric learning makes it possible to build the feature space where the within-class variance is minimized and the between-class variance is maximized while keeping normal and anomalous classes linearly separable. We also investigate the effectiveness of additionally using anomalous sound data for further improving the binary classification model. Our results showed that multi-task learning using binary classification and metric learning to consider the distance from each class centroid in the feature space is effective, and performance can be significantly improved by using even a small amount of anomalous data during training.
    DORO: Distributional and Outlier Robust Optimization. (arXiv:2106.06142v1 [cs.LG])
    (2 min) Many machine learning tasks involve subpopulation shift where the testing data distribution is a subpopulation of the training distribution. For such settings, a line of recent work has proposed the use of a variant of empirical risk minimization(ERM) known as distributionally robust optimization (DRO). In this work, we apply DRO to real, large-scale tasks with subpopulation shift, and observe that DRO performs relatively poorly, and moreover has severe instability. We identify one direct cause of this phenomenon: sensitivity of DRO to outliers in the datasets. To resolve this issue, we propose the framework of DORO, for Distributional and Outlier Robust Optimization. At the core of this approach is a refined risk function which prevents DRO from overfitting to potential outliers. We instantiate DORO for the Cressie-Read family of R\'enyi divergence, and delve into two specific instances of this family: CVaR and $\chi^2$-DRO. We theoretically prove the effectiveness of the proposed method, and empirically show that DORO improves the performance and stability of DRO with experiments on large modern datasets, thereby positively addressing the open question raised by Hashimoto et al., 2018.
    Dynamic Language Models for Continuously Evolving Content. (arXiv:2106.06297v1 [cs.CL])
    (2 min) The content on the web is in a constant state of flux. New entities, issues, and ideas continuously emerge, while the semantics of the existing conversation topics gradually shift. In recent years, pre-trained language models like BERT greatly improved the state-of-the-art for a large spectrum of content understanding tasks. Therefore, in this paper, we aim to study how these language models can be adapted to better handle continuously evolving web content. In our study, we first analyze the evolution of 2013 - 2019 Twitter data, and unequivocally confirm that a BERT model trained on past tweets would heavily deteriorate when directly applied to data from later years. Then, we investigate two possible sources of the deterioration: the semantic shift of existing tokens and the sub-optimal or failed understanding of new tokens. To this end, we both explore two different vocabulary composition methods, as well as propose three sampling methods which help in efficient incremental training for BERT-like models. Compared to a new model trained from scratch offline, our incremental training (a) reduces the training costs, (b) achieves better performance on evolving content, and (c) is suitable for online deployment. The superiority of our methods is validated using two downstream tasks. We demonstrate significant improvements when incrementally evolving the model from a particular base year, on the task of Country Hashtag Prediction, as well as on the OffensEval 2019 task.
    Modeling Hierarchical Structures with Continuous Recursive Neural Networks. (arXiv:2106.06038v1 [cs.CL])
    (2 min) Recursive Neural Networks (RvNNs), which compose sequences according to their underlying hierarchical syntactic structure, have performed well in several natural language processing tasks compared to similar models without structural biases. However, traditional RvNNs are incapable of inducing the latent structure in a plain text sequence on their own. Several extensions have been proposed to overcome this limitation. Nevertheless, these extensions tend to rely on surrogate gradients or reinforcement learning at the cost of higher bias or variance. In this work, we propose Continuous Recursive Neural Network (CRvNN) as a backpropagation-friendly alternative to address the aforementioned limitations. This is done by incorporating a continuous relaxation to the induced structure. We demonstrate that CRvNN achieves strong performance in challenging synthetic tasks such as logical inference and ListOps. We also show that CRvNN performs comparably or better than prior latent structure models on real-world tasks such as sentiment analysis and natural language inference.
    Differentially Private Federated Learning via Inexact ADMM. (arXiv:2106.06127v1 [cs.LG])
    (2 min) Differential privacy (DP) techniques can be applied to the federated learning model to protect data privacy against inference attacks to communication among the learning agents. The DP techniques, however, hinder achieving a greater learning performance while ensuring strong data privacy. In this paper we develop a DP inexact alternating direction method of multipliers algorithm that solves a sequence of trust-region subproblems with the objective perturbation by random noises generated from a Laplace distribution. We show that our algorithm provides $\bar{\epsilon}$-DP for every iteration and $\mathcal{O}(1/T)$ rate of convergence in expectation, where $T$ is the number of iterations. Using MNIST and FEMNIST datasets for the image classification, we demonstrate that our algorithm reduces the testing error by at most $22\%$ compared with the existing DP algorithm, while achieving the same level of data privacy. The numerical experiment also shows that our algorithm converges faster than the existing algorithm.
    Collaborative Multidisciplinary Design Optimization with Neural Networks. (arXiv:2106.06092v1 [cs.LG])
    (2 min) The design of complex engineering systems leads to solving very large optimization problems involving different disciplines. Strategies allowing disciplines to optimize in parallel by providing sub-objectives and splitting the problem into smaller parts, such as Collaborative Optimization, are promising solutions.However, most of them have slow convergence which reduces their practical use. Earlier efforts to fasten convergence by learning surrogate models have not yet succeeded at sufficiently improving the competitiveness of these strategies.This paper shows that, in the case of Collaborative Optimization, faster and more reliable convergence can be obtained by solving an interesting instance of binary classification: on top of the target label, the training data of one of the two classes contains the distance to the decision boundary and its derivative. Leveraging this information, we propose to train a neural network with an asymmetric loss function, a structure that guarantees Lipshitz continuity, and a regularization towards respecting basic distance function properties. The approach is demonstrated on a toy learning example, and then applied to a multidisciplinary aircraft design problem.
    FedBABU: Towards Enhanced Representation for Federated Image Classification. (arXiv:2106.06042v1 [cs.LG])
    (2 min) Federated learning has evolved to improve a single global model under data heterogeneity (as a curse) or to develop multiple personalized models using data heterogeneity (as a blessing). However, there has been little research considering both directions simultaneously. In this paper, we first investigate the relationship between them by analyzing Federated Averaging at the client level and determine that a better federated global model performance does not constantly improve personalization. To elucidate the cause of this personalization performance degradation problem, we decompose the entire network into the body (i.e., extractor), related to universality, and the head (i.e., classifier), related to personalization. We then point out that this problem stems from training the head. Based on this observation, we propose a novel federated learning algorithm, coined as FedBABU, which updates only the body of the model during federated training (i.e., the head is randomly initialized and never updated), and the head is fine-tuned for personalization during the evaluation process. Extensive experiments show consistent performance improvements and an efficient personalization of FedBABU.
    High-Performance FPGA-based Accelerator for Bayesian Recurrent Neural Networks. (arXiv:2106.06048v1 [cs.LG])
    (2 min) Neural networks have demonstrated their great performance in a wide range of tasks. Especially in time-series analysis, recurrent architectures based on long-short term memory (LSTM) cells have manifested excellent capability to model time dependencies in real-world data. However, standard recurrent architectures cannot estimate their uncertainty which is essential for safety-critical applications such as in medicine. In contrast, Bayesian recurrent neural networks (RNNs) are able to provide uncertainty estimation with improved accuracy. Nonetheless, Bayesian RNNs are computationally and memory demanding, which limits their practicality despite their advantages. To address this issue, we propose an FPGA-based hardware design to accelerate Bayesian LSTM-based RNNs. To further improve the overall algorithmic-hardware performance, a co-design framework is proposed to explore the most optimal algorithmic-hardware configurations for Bayesian RNNs. We conduct extensive experiments on health-related tasks to demonstrate the improvement of our design and the effectiveness of our framework. Compared with GPU implementation, our FPGA-based design can achieve up to 10 times speedup with nearly 106 times higher energy efficiency. To the best of our knowledge, this is the first work targeting the acceleration of Bayesian RNNs on FPGAs.
    Convergence and Alignment of Gradient Descentwith Random Back propagation Weights. (arXiv:2106.06044v1 [stat.ML])
    (2 min) Stochastic gradient descent with backpropagation is the workhorse of artificial neural networks. It has long been recognized that backpropagation fails to be a biologically plausible algorithm. Fundamentally, it is a non-local procedure -- updating one neuron's synaptic weights requires knowledge of synaptic weights or receptive fields of downstream neurons. This limits the use of artificial neural networks as a tool for understanding the biological principles of information processing in the brain. Lillicrap et al. (2016) propose a more biologically plausible "feedback alignment" algorithm that uses random and fixed backpropagation weights, and show promising simulations. In this paper we study the mathematical properties of the feedback alignment procedure by analyzing convergence and alignment for two-layer networks under squared error loss. In the overparameterized setting, we prove that the error converges to zero exponentially fast, and also that regularization is necessary in order for the parameters to become aligned with the random backpropagation weights. Simulations are given that are consistent with this analysis and suggest further generalizations. These results contribute to our understanding of how biologically plausible algorithms might carry out weight learning in a manner different from Hebbian learning, with performance that is comparable with the full non-local backpropagation algorithm.
    Deep Probabilistic Koopman: Long-term time-series forecasting under periodic uncertainties. (arXiv:2106.06033v1 [cs.LG])
    (2 min) Probabilistic forecasting of complex phenomena is paramount to various scientific disciplines and applications. Despite the generality and importance of the problem, general mathematical techniques that allow for stable long-term forecasts with calibrated uncertainty measures are lacking. For most time series models, the difficulty of obtaining accurate probabilistic future time step predictions increases with the prediction horizon. In this paper, we introduce a surprisingly simple approach that characterizes time-varying distributions and enables reasonably accurate predictions thousands of timesteps into the future. This technique, which we call Deep Probabilistic Koopman (DPK), is based on recent advances in linear Koopman operator theory, and does not require time stepping for future time predictions. Koopman models also tend to have a small parameter footprint (often less than 10,000 parameters). We demonstrate the long-term forecasting performance of these models on a diversity of domains, including electricity demand forecasting, atmospheric chemistry, and neuroscience. For electricity demand modeling, our domain-agnostic technique outperforms all of 177 domain-specific competitors in the most recent Global Energy Forecasting Competition.
    Fair Preprocessing: Towards Understanding Compositional Fairness of Data Transformers in Machine Learning Pipeline. (arXiv:2106.06054v1 [cs.LG])
    (2 min) In recent years, many incidents have been reported where machine learning models exhibited discrimination among people based on race, sex, age, etc. Research has been conducted to measure and mitigate unfairness in machine learning models. For a machine learning task, it is a common practice to build a pipeline that includes an ordered set of data preprocessing stages followed by a classifier. However, most of the research on fairness has considered a single classifier based prediction task. What are the fairness impacts of the preprocessing stages in machine learning pipeline? Furthermore, studies showed that often the root cause of unfairness is ingrained in the data itself, rather than the model. But no research has been conducted to measure the unfairness caused by a specific transformation made in the data preprocessing stage. In this paper, we introduced the causal method of fairness to reason about the fairness impact of data preprocessing stages in ML pipeline. We leveraged existing metrics to define the fairness measures of the stages. Then we conducted a detailed fairness evaluation of the preprocessing stages in 37 pipelines collected from three different sources. Our results show that certain data transformers are causing the model to exhibit unfairness. We identified a number of fairness patterns in several categories of data transformers. Finally, we showed how the local fairness of a preprocessing stage composes in the global fairness of the pipeline. We used the fairness composition to choose appropriate downstream transformer that mitigates unfairness in the machine learning pipeline.
    ChemRL-GEM: Geometry Enhanced Molecular Representation Learning for Property Prediction. (arXiv:2106.06130v1 [cs.LG])
    (2 min) Effective molecular representation learning is of great importance to facilitate molecular property prediction, which is a fundamental task for the drug and material industry. Recent advances in graph neural networks (GNNs) have shown great promise in applying GNNs for molecular representation learning. Moreover, a few recent studies have also demonstrated successful applications of self-supervised learning methods to pre-train the GNNs to overcome the problem of insufficient labeled molecules. However, existing GNNs and pre-training strategies usually treat molecules as topological graph data without fully utilizing the molecular geometry information. Whereas, the three-dimensional (3D) spatial structure of a molecule, a.k.a molecular geometry, is one of the most critical factors for determining molecular physical, chemical, and biological properties. To this end, we propose a novel Geometry Enhanced Molecular representation learning method (GEM) for Chemical Representation Learning (ChemRL). At first, we design a geometry-based GNN architecture that simultaneously models atoms, bonds, and bond angles in a molecule. To be specific, we devised double graphs for a molecule: The first one encodes the atom-bond relations; The second one encodes bond-angle relations. Moreover, on top of the devised GNN architecture, we propose several novel geometry-level self-supervised learning strategies to learn spatial knowledge by utilizing the local and global molecular 3D structures. We compare ChemRL-GEM with various state-of-the-art (SOTA) baselines on different molecular benchmarks and exhibit that ChemRL-GEM can significantly outperform all baselines in both regression and classification tasks. For example, the experimental results show an overall improvement of $8.8\%$ on average compared to SOTA baselines on the regression tasks, demonstrating the superiority of the proposed method.
    Hybrid Generative-Contrastive Representation Learning. (arXiv:2106.06162v1 [cs.LG])
    (2 min) Unsupervised representation learning has recently received lots of interest due to its powerful generalizability through effectively leveraging large-scale unlabeled data. There are two prevalent approaches for this, contrastive learning and generative pre-training, where the former learns representations from instance-wise discrimination tasks and the latter learns them from estimating the likelihood. These seemingly orthogonal approaches have their own strengths and weaknesses. Contrastive learning tends to extract semantic information and discards details irrelevant for classifying objects, making the representations effective for discriminative tasks while degrading robustness to out-of-distribution data. On the other hand, the generative pre-training directly estimates the data distribution, so the representations tend to be robust but not optimal for discriminative tasks. In this paper, we show that we could achieve the best of both worlds by a hybrid training scheme. Specifically, we demonstrated that a transformer-based encoder-decoder architecture trained with both contrastive and generative losses can learn highly discriminative and robust representations without hurting the generative performance. We extensively validate our approach on various tasks.
    Coordinate Independent Convolutional Networks -- Isometry and Gauge Equivariant Convolutions on Riemannian Manifolds. (arXiv:2106.06020v1 [cs.LG])
    (2 min) Motivated by the vast success of deep convolutional networks, there is a great interest in generalizing convolutions to non-Euclidean manifolds. A major complication in comparison to flat spaces is that it is unclear in which alignment a convolution kernel should be applied on a manifold. The underlying reason for this ambiguity is that general manifolds do not come with a canonical choice of reference frames (gauge). Kernels and features therefore have to be expressed relative to arbitrary coordinates. We argue that the particular choice of coordinatization should not affect a network's inference -- it should be coordinate independent. A simultaneous demand for coordinate independence and weight sharing is shown to result in a requirement on the network to be equivariant under local gauge transformations (changes of local reference frames). The ambiguity of reference frames depends thereby on the G-structure of the manifold, such that the necessary level of gauge equivariance is prescribed by the corresponding structure group G. Coordinate independent convolutions are proven to be equivariant w.r.t. those isometries that are symmetries of the G-structure. The resulting theory is formulated in a coordinate free fashion in terms of fiber bundles. To exemplify the design of coordinate independent convolutions, we implement a convolutional network on the M\"obius strip. The generality of our differential geometric formulation of convolutional networks is demonstrated by an extensive literature review which explains a large number of Euclidean CNNs, spherical CNNs and CNNs on general surfaces as specific instances of coordinate independent convolutions.
    Gradual Domain Adaptation in the Wild:When Intermediate Distributions are Absent. (arXiv:2106.06080v1 [cs.LG])
    (2 min) We focus on the problem of domain adaptation when the goal is shifting the model towards the target distribution, rather than learning domain invariant representations. It has been shown that under the following two assumptions: (a) access to samples from intermediate distributions, and (b) samples being annotated with the amount of change from the source distribution, self-training can be successfully applied on gradually shifted samples to adapt the model toward the target distribution. We hypothesize having (a) is enough to enable iterative self-training to slowly adapt the model to the target distribution, by making use of an implicit curriculum. In the case where (a) does not hold, we observe that iterative self-training falls short. We propose GIFT, a method that creates virtual samples from intermediate distributions by interpolating representations of examples from source and target domains. We evaluate an iterative-self-training method on datasets with natural distribution shifts, and show that when applied on top of other domain adaptation methods, it improves the performance of the model on the target dataset. We run an analysis on a synthetic dataset to show that in the presence of (a) iterative-self-training naturally forms a curriculum of samples. Furthermore, we show that when (a) does not hold, GIFT performs better than iterative self-training.
    Progressive-Scale Boundary Blackbox Attack via Projective Gradient Estimation. (arXiv:2106.06056v1 [cs.LG])
    (2 min) Boundary based blackbox attack has been recognized as practical and effective, given that an attacker only needs to access the final model prediction. However, the query efficiency of it is in general high especially for high dimensional image data. In this paper, we show that such efficiency highly depends on the scale at which the attack is applied, and attacking at the optimal scale significantly improves the efficiency. In particular, we propose a theoretical framework to analyze and show three key characteristics to improve the query efficiency. We prove that there exists an optimal scale for projective gradient estimation. Our framework also explains the satisfactory performance achieved by existing boundary black-box attacks. Based on our theoretical framework, we propose Progressive-Scale enabled projective Boundary Attack (PSBA) to improve the query efficiency via progressive scaling techniques. In particular, we employ Progressive-GAN to optimize the scale of projections, which we call PSBA-PGAN. We evaluate our approach on both spatial and frequency scales. Extensive experiments on MNIST, CIFAR-10, CelebA, and ImageNet against different models including a real-world face recognition API show that PSBA-PGAN significantly outperforms existing baseline attacks in terms of query efficiency and attack success rate. We also observe relatively stable optimal scales for different models and datasets. The code is publicly available at https://github.com/AI-secure/PSBA.
    Bayesian Optimisation with Formal Guarantees. (arXiv:2106.06067v1 [cs.LG])
    (2 min) Application domains of Bayesian optimization include optimizing black-box functions or very complex functions. The functions we are interested in describe complex real-world systems applied in industrial settings. Even though they do have explicit representations, standard optimization techniques fail to provide validated solutions and correctness guarantees for them. In this paper we present a combination of Bayesian optimisation and SMT-based constraint solving to achieve safe and stable solutions with optimality guarantees.
    Information Theoretic Evaluation of Privacy-Leakage, Interpretability, and Transferability for a Novel Trustworthy AI Framework. (arXiv:2106.06046v1 [cs.LG])
    (2 min) Guidelines and principles of trustworthy AI should be adhered to in practice during the development of AI systems. This work suggests a novel information theoretic trustworthy AI framework based on the hypothesis that information theory enables taking into account the ethical AI principles during the development of machine learning and deep learning models via providing a way to study and optimize the inherent tradeoffs between trustworthy AI principles. A unified approach to "privacy-preserving interpretable and transferable learning" is presented via introducing the information theoretic measures for privacy-leakage, interpretability, and transferability. A technique based on variational optimization, employing conditionally deep autoencoders, is developed for practically calculating the defined information theoretic measures for privacy-leakage, interpretability, and transferability.
    Scalable Variational Gaussian Processes via Harmonic Kernel Decomposition. (arXiv:2106.05992v1 [cs.LG])
    (2 min) We introduce a new scalable variational Gaussian process approximation which provides a high fidelity approximation while retaining general applicability. We propose the harmonic kernel decomposition (HKD), which uses Fourier series to decompose a kernel as a sum of orthogonal kernels. Our variational approximation exploits this orthogonality to enable a large number of inducing points at a low computational cost. We demonstrate that, on a range of regression and classification problems, our approach can exploit input space symmetries such as translations and reflections, and it significantly outperforms standard variational methods in scalability and accuracy. Notably, our approach achieves state-of-the-art results on CIFAR-10 among pure GP models.
2021-07-11T20:11:37.289Z osmosfeed 1.11.0